Interspeech
This year's Interspeech will be held in Graz, Austria, whose famed clock tower was built in the mid-1500s.
Photo courtesy of Getty Images

The 16 Alexa-related papers at this year’s Interspeech

At next week’s Interspeech, the largest conference on the science and technology of spoken-language processing, Alexa researchers have 16 papers, which span the five core areas of Alexa functionality: device activation, or recognizing speech intended for Alexa and other audio events that require processing; automatic speech recognition (ASR), or converting the speech signal into text; natural-language understanding, or determining the meaning of customer utterances; dialogue management, or handling multiturn conversational exchanges; and text-to-speech, or generating natural-sounding synthetic speech to convey Alexa’s responses. Two of the papers are also more-general explorations of topics in machine learning.

Device Activation

Model Compression on Acoustic Event Detection with Quantized Distillation
Bowen Shi, Ming Sun, Chieh-Chi Kao, Viktor Rozgic, Spyros Matsoukas, Chao Wang

The researchers combine two techniques to shrink neural networks trained to detect sounds by 88%, with no loss in accuracy. One technique, distillation, involves using a large, powerful model to train a leaner, more-efficient one. The other technique, quantization, involves using a fixed number of values to approximate a larger range of values.
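To make the quantization idea concrete, here is a minimal sketch in Python. It assumes uniform quantization, in which each weight is snapped to the nearest of a fixed number of evenly spaced values. The paper's actual scheme may differ, and the function name `quantize`, the sample weights, and the choice of four levels are illustrative only.

```python
def quantize(weights, num_levels=16):
    """Snap each weight to the nearest of num_levels evenly spaced values."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        # All weights are identical, so there is nothing to approximate.
        return list(weights)
    step = (hi - lo) / (num_levels - 1)
    # round(...) picks the index of the nearest level; lo + index * step is its value.
    return [lo + round((w - lo) / step) * step for w in weights]


# Hypothetical weights, chosen only for illustration.
weights = [-0.91, -0.40, -0.05, 0.12, 0.33, 0.87]
quantized = quantize(weights, num_levels=4)
print(quantized)
```

With four levels, each weight could in principle be stored as a 2-bit index plus the shared range endpoints, rather than as a full floating-point number, which is where the size savings come from.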

Sub-band Convolutional Neural Networks for Small-footprint Spoken Term Classification
Chieh-Chi Kao, Ming Sun, Yixin Gao, Shiv Vitaladevuni, Chao Wang

Convolutional neural nets (CNNs) were originally designed to look for the same patterns in every block of pixels in a digital image. But they can also be applied to acoustic signals, which can be represented as two-dimensional mappings of time against frequency-based “features”. By restricting an audio-processing CNN’s search only to the feature ranges where a particular pattern is likely to occur, the researchers make it much more computationally efficient. This could make audio processing more practical for power-constrained devices.

A Study for Improving Device-Directed Speech Detection toward Frictionless Human-Machine Interaction
Che-Wei Huang, Roland Maas, Sri Harish Mallidi, Björn Hoffmeister

This paper is an update of prior work on detecting device-directed speech, or identifying utterances intended for Alexa. The researchers find that labeling dialogue turns (distinguishing initial utterances from subsequent utterances) and using signal representations based on Fourier transforms rather than mel-frequencies improve accuracy. They also find that, among the features extracted from speech recognizers that the system considers, confusion networks, which represent word probabilities at successive sentence positions, have the most predictive power.
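A confusion network can be pictured as a list of slots, one per sentence position, each holding competing words and their probabilities. The sketch below is illustrative only: the words and probabilities are invented, and `best_path` is a hypothetical helper, not part of the researchers' system.

```python
# One dict per sentence position: candidate word -> probability (made-up values).
confusion_network = [
    {"play": 0.7, "pay": 0.3},
    {"the": 0.6, "a": 0.4},
    {"song": 0.9, "son": 0.1},
]


def best_path(network):
    """Pick the most probable word in each slot and multiply the probabilities."""
    words = [max(slot, key=slot.get) for slot in network]
    prob = 1.0
    for slot, word in zip(network, words):
        prob *= slot[word]
    return words, prob


words, prob = best_path(confusion_network)
print(words, prob)
```

A slot in which the probabilities are close together signals that the recognizer is unsure at that position, which is the kind of information a downstream classifier can exploit.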

Automatic Speech Recognition (ASR)

Acoustic Model Bootstrapping Using Semi-Supervised Learning
Langzhou Chen, Volker Leutnant

The researchers propose a method for selecting machine-labeled utterances for semi-supervised training of an acoustic model, the component of an ASR system that takes an acoustic signal as input. First, for each training sample, the system uses the existing acoustic model to identify the two most probable word-level interpretations of the signal at each position in the sentence. Then it finds examples in the training data that either support or contradict those probability estimates, which it uses to adjust the uncertainty of the ASR output. Samples that yield significant reductions in uncertainty are preferentially selected for training.

Improving ASR Confidence Scores for Alexa Using Acoustic and Hypothesis Embeddings
Prakhar Swarup, Roland Maas, Sri Garimella, Sri Harish Mallidi, Björn Hoffmeister

Speech recognizers assign probabilities to different interpretations of acoustic signals, and these probabilities can serve as inputs to a machine learning model that assesses the recognizer’s confidence in its classifications. The resulting confidence scores can be useful to other applications, such as systems that select machine-labeled training data for semi-supervised learning. The researchers append embeddings (fixed-length vector representations) of both the raw acoustic input and the speech recognizer’s best estimate of the word sequence to the inputs to a confidence-scoring network. The result: a 6.5% reduction in equal-error rate (the error rate at the decision threshold where the false-negative and false-positive rates are equal).
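As a rough illustration of the metric, the following Python sketch estimates an equal-error rate from a set of confidence scores. The scores and labels are invented (a label of 1 means the recognizer's hypothesis was correct), and `equal_error_rate` is a hypothetical helper that simply sweeps every observed score as a candidate threshold. It is not the evaluation code used in the paper.

```python
def equal_error_rate(scores, labels):
    """Sweep thresholds and return the error rate where FPR and FNR are closest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    best_gap, eer = None, None
    for threshold in sorted(set(scores)):
        # An item is "accepted" when its score is at or above the threshold.
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        false_neg = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        fpr = false_pos / negatives
        fnr = false_neg / positives
        gap = abs(fpr - fnr)
        if best_gap is None or gap < best_gap:
            # Report the midpoint in case the two rates never match exactly.
            best_gap, eer = gap, (fpr + fnr) / 2
    return eer


# Made-up confidence scores and correctness labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
eer = equal_error_rate(scores, labels)
print(eer)  # 0.25 for this toy data
```

A lower equal-error rate means the confidence scores separate correct from incorrect hypotheses more cleanly.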

Multi-Dialect Acoustic Modeling Using Phone Mapping and Online I-Vectors
Harish Arsikere, Ashtosh Sapru, Sri Garimella

Multi-dialect acoustic models, which help convert multi-dialect speech signals to words, are typically neural networks trained on pooled multi-dialect data, with separate output layers for each dialect. The researchers show that mapping the phones — the smallest phonetic units of speech — of each dialect to those of the others offers comparable results with shorter training times and better parameter sharing. They also show that recognition accuracy can be improved by adapting multi-dialect acoustic models, on the fly, to a target speaker.

Neural Machine Translation for Multilingual Grapheme-to-Phoneme Conversion
Alex Sokolov, Tracy Rohlin, Ariya Rastrow

Grapheme-to-phoneme models, which translate written words into their phonetic equivalents (“echo” to “E k oU”), enable speech recognizers to handle words they haven’t seen before. The researchers train a single neural model to handle grapheme-to-phoneme conversion in 18 languages. The results are comparable to those of state-of-the-art single-language models for languages with abundant training data and better for languages with sparse data. Multilingual models are more flexible and easier to maintain in production environments.

Scalable Multi Corpora Neural Language Models for ASR
Anirudh Raju, Denis Filimonov, Gautam Tiwari, Guitang Lan, Ariya Rastrow

Language models, which compute the probability of a given sequence of words, help distinguish between different interpretations of speech signals. Neural language models promise greater accuracy than existing models, but they’re difficult to incorporate into real-time speech recognition systems. The researchers describe several techniques to make neural language models practical, from a technique for weighting training samples from out-of-domain data sets to noise contrastive estimation, which turns the calculation of massive probability distributions into simple binary decisions.
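To show what "computing the probability of a sequence of words" means in practice, here is a minimal sketch that uses a count-based bigram model with add-one smoothing. This is a deliberately simple stand-in, not a neural language model and not the researchers' method. The toy corpus and the helper names `train_bigram` and `log_prob` are assumptions made for illustration.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count how often each word follows each context word."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    # Every word that can be predicted, including the end-of-sentence marker.
    vocab = {t for s in sentences for t in s.split()} | {"</s>"}
    return contexts, bigrams, len(vocab)


def log_prob(sentence, model):
    """Chain-rule log-probability of a sentence under the bigram model."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        # Add-one smoothing keeps unseen word pairs from getting zero probability.
        total += math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size))
    return total


# Toy corpus, invented for illustration.
model = train_bigram(["play the next song", "play the song again", "stop the music"])
print(log_prob("play the song", model), log_prob("song the play", model))
```

The word order seen in training scores higher than the scrambled one, which is how a language model helps a recognizer choose between acoustically similar hypotheses.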

Natural-Language Understanding

Neural Named Entity Recognition from Subword Units
Abdalghani Abujabal, Judith Gaspers

Named-entity recognition is crucial to voice-controlled systems — as when you tell Alexa “Play ‘Spirit’ by Beyoncé”. A neural network that recognizes named entities typically has dedicated input channels for every word in its vocabulary. This has two drawbacks: (1) the network grows extremely large, which makes it slower and more memory intensive, and (2) it has trouble handling unfamiliar words. The researchers trained a named-entity recognizer that instead takes subword units — characters, phonemes, and bytes — as inputs. It offers comparable performance with a vocabulary of only 332 subwords, versus 74,000-odd words.
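The appeal of subword inputs can be seen in a small Python sketch. It shows only the character and byte views of a word (phonemes would require a pronunciation lexicon, which is omitted here), and the helper names are hypothetical. How the paper combines the three kinds of unit is not covered by this example.

```python
def char_units(word):
    """Split a word into its characters."""
    return list(word)


def byte_units(word):
    """Split a word into its UTF-8 byte values, each between 0 and 255."""
    return list(word.encode("utf-8"))


print(char_units("Beyoncé"))
print(byte_units("Beyoncé"))  # [66, 101, 121, 111, 110, 99, 195, 169]
```

Because every string decomposes into bytes drawn from at most 256 values, a word the model has never seen can still be fed to the network, and the input vocabulary stays small no matter how many distinct words appear in the data.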

Dialogue Management

HyST: A Hybrid Approach for Flexible and Accurate Dialogue State Tracking
Rahul Goel, Shachi Paul, Dilek Hakkani-Tür

Dialogue-based computer systems need to track “slots” (types of entities mentioned in conversation, such as movie names) and their values (such as Avengers: Endgame). Training a machine learning system to decide whether to pull candidate slot values from prior conversation or compute a distribution over all possible slot values improves slot-tracking accuracy by 24% over the best-performing previous system.

Towards Universal Dialogue Act Tagging for Task-Oriented Dialogues
Shachi Paul, Rahul Goel, Dilek Hakkani-Tür

Dialogue-based computer systems typically classify utterances by “dialogue act” — such as requesting, informing, and denying — as a way of gauging progress toward a conversational goal. As a first step in developing a system that will automatically label dialogue acts in human-human conversations (to, in turn, train a dialogue-act classifier), the researchers create a “universal tagging scheme” for dialogue acts. They use this scheme to reconcile the disparate tags used in different data sets.

Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations
Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, Dilek Hakkani-Tür

The researchers report a new data set, which grew out of the Alexa Prize competition and is intended to advance research on AI agents that engage in social conversations. Pairs of workers recruited through Mechanical Turk were given information on topics that arose frequently during Alexa Prize interactions and asked to converse about them, documenting the sources of their factual assertions. The researchers used the resulting data set to train a knowledge-grounded response generation network, and they report automated and human evaluation results as state-of-the-art baselines.

Text-to-Speech

Towards Achieving Robust Universal Neural Vocoding
Jaime Lorenzo Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote, Alexis Moinet, Vatsal Aggarwal

A vocoder is the component of a speech synthesizer that takes the frequency-spectrum snapshots generated by other components and fills in the information necessary to convert them to audio. The researchers trained a neural-network-based vocoder using data from 74 speakers of both genders in 17 languages. The resulting “universal vocoder” outperformed speaker-specific vocoders, even on speakers and languages it had never encountered before and unusual tasks such as synthesized singing.

Fine-Grained Robust Prosody Transfer for Single-Speaker Neural Text-to-Speech
Viacheslav Klimkov, Srikanth Ronanki, Jonas Rohnke, Thomas Drugman

The researchers present a new technique for transferring prosody (intonation, stress, and rhythm) from a recording to a synthesized voice, enabling the user to choose whose voice will read recorded content, with inflections preserved. Where earlier prosody transfer systems used spectrograms — frequency spectrum snapshots — as inputs, the researchers’ system uses easily normalized prosodic features extracted from the raw audio.

Machine Learning

Two Tiered Distributed Training Algorithm for Acoustic Modeling
Pranav Ladkat, Oleg Rybakov, Radhika Arava, Sree Hari Krishnan Parthasarathi, I-Fan Chen, Nikko Strom

When neural networks are trained on large data sets, the training needs to be distributed, or broken up across multiple processors. A novel combination of two state-of-the-art distributed-learning algorithms, GTC and BMUF, achieves both higher accuracy and more-efficient training than either algorithm alone when learning is distributed to 128 parallel processors.

The researchers' new method splits distributed processors into groups, and within each group, the processors use the highly accurate GTC method to synchronize their models. At regular intervals, designated representatives from all the groups use a different method — BMUF — to share their models and update them accordingly. Finally, each representative broadcasts its updated model to the rest of its group.
Animation by Nick Little
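The two-tier structure described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the researchers' implementation: plain model averaging stands in for GTC within each group (the real method exchanges compressed gradient updates), the cross-group step uses a block-momentum update in the style commonly described for BMUF, and the function name `two_tier_sync`, the hyperparameter values, and the toy models are all hypothetical.

```python
def average(models):
    """Element-wise mean of a list of equal-length parameter vectors."""
    return [sum(vals) / len(models) for vals in zip(*models)]


def two_tier_sync(groups, global_model, delta, block_momentum=0.9, block_lr=1.0):
    # Tier 1: synchronize within each group.
    # ASSUMPTION: plain averaging is a stand-in for GTC.
    group_models = [average(workers) for workers in groups]

    # Tier 2: group representatives combine their models with a
    # BMUF-style block-momentum update.
    mean_model = average(group_models)
    block_grad = [m - g for m, g in zip(mean_model, global_model)]
    delta = [block_momentum * d + block_lr * b for d, b in zip(delta, block_grad)]
    global_model = [g + d for g, d in zip(global_model, delta)]

    # Broadcast: every worker in every group receives the updated model.
    groups = [[list(global_model) for _ in workers] for workers in groups]
    return groups, global_model, delta


# Two groups of two workers, each holding a two-parameter toy model.
groups = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
global_model, delta = [0.0, 0.0], [0.0, 0.0]
groups, global_model, delta = two_tier_sync(groups, global_model, delta)
print(global_model)
```

Keeping the frequent, fine-grained synchronization inside small groups and the less frequent exchange between groups is what lets the scheme scale to many processors without every worker talking to every other worker at each step.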

One-vs-All Models for Asynchronous Training: An Empirical Analysis
Rahul Gupta, Aman Alok, Shankar Ananthakrishnan

A neural network can be trained to perform multiple classifications at once: it might recognize multiple objects in an image, or assign multiple topic categories to a single news article. An alternative is to train a separate “one-versus-all” (OVA) classifier for each category, which classifies data as either in the category or out of it. The advantage of this approach is that each OVA classifier can be re-trained separately as new data becomes available. The researchers present a new metric that enables comparison of multiclass and OVA strategies, to help data scientists determine which is more useful for a given application.
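The retraining advantage can be illustrated with a minimal Python sketch. It uses a nearest-centroid rule as a stand-in for each binary classifier (the paper concerns neural models), and the features, category names, and helper functions are invented for the example.

```python
def mean(vectors):
    """Element-wise mean of a list of feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_ova(samples, category):
    """Train one in-category vs. out-of-category classifier (stand-in: two centroids)."""
    inside = [x for x, labels in samples if category in labels]
    outside = [x for x, labels in samples if category not in labels]
    return mean(inside), mean(outside)


def predict_labels(models, x):
    """Ask every OVA classifier independently whether x belongs to its category."""
    return {c for c, (in_mean, out_mean) in models.items()
            if sq_dist(x, in_mean) < sq_dist(x, out_mean)}


# Made-up two-dimensional features with one or more topic labels each.
samples = [
    ([1.0, 0.0], {"sports"}),
    ([0.9, 0.2], {"sports"}),
    ([0.0, 1.0], {"politics"}),
    ([0.1, 0.9], {"politics"}),
    ([0.8, 0.8], {"sports", "politics"}),
]
models = {c: train_ova(samples, c) for c in ("sports", "politics")}
labels = predict_labels(models, [1.0, 0.1])

# New sports data arrives: retrain only that classifier, leaving the other untouched.
politics_before = models["politics"]
samples.append(([0.7, 0.1], {"sports"}))
models["sports"] = train_ova(samples, "sports")
```

Because each category has its own model, updating one category does not require touching, or revalidating, the others.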
