-
2024State-of-the-art speech models may exhibit suboptimal performance in specific population subgroups. Detecting these challenging subgroups is crucial to enhance model robustness and fairness. Traditional methods for subgroup identification typically rely on demographic information such as age, gender, and origin. However, collecting such sensitive data at deployment time can be impractical or unfeasible
-
EACL 2024 Workshop on Linguistic Annotation2024Recent developments in active learning algorithms for NLP tasks show promising results in terms of reducing labelling complexity. In this paper we extend this effort to imbalanced datasets; we bridge between the active learning approach of obtaining diverse and informative examples, and the heuristic of class balancing used in imbalanced datasets. We develop a novel tune-free weighting technique that can
-
arXiv2024We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to-date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion- parameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes") followed
-
WSDM 20242024Products on e-commerce websites are usually organized based on seller-provided product attributes. Customers looking for a product typically have certain needs or product use-cases in mind, for e.g., a headphone for gym classes, or a printer for a small business. However, they often struggle to map these use-cases to product attributes and subsequently fail to find the product they need. In this talk, we
-
AAAI 20242024Large language models (LLMs) have shown remarkable generalization capability with exceptional performance in various language modeling tasks. However, they still exhibit inherent limitations in precisely capturing and returning grounded knowledge. While existing work has explored utilizing knowledge graphs (KGs) to enhance language modeling via joint training and customized model architectures, applying
Related content
-
April 18, 2019Last year, Amazon announced the beta release of Alexa Guard, a new service that lets customers who are leaving the house instruct their Echo devices to listen for glass breaking or smoke and carbon dioxide alarms going off. At this year’s International Conference on Acoustics, Speech, and Signal Processing, our team is presenting several papers on sound detection. I wrote about one of them a few weeks ago, a new method for doing machine learning with unbalanced data sets.
-
April 11, 2019Multiband dynamics processing, which separately modifies volume in different frequency bands of an audio signal, is known to improve listeners’ audio experiences. But in the context of voice-controlled systems like the Amazon Echo family of products, it can also improve automatic speech recognition by making echo cancellation easier.
-
April 8, 2019Transfer learning is the technique of adapting a machine learning model trained on abundant data to a new context in which training data is sparse. On the Alexa team, we’ve explored transfer learning as a way to bootstrap new functions and to add new classification categories to existing machine learning systems.
-
April 4, 2019Customer interactions with Alexa are constantly growing more complex, and on the Alexa science team, we strive to stay ahead of the curve by continuously improving Alexa’s speech recognition system. Increasingly, keeping pace with Alexa’s expanding capabilities will require automating the learning process, through techniques such as semi-supervised learning, which leverages a small amount of annotated data to extract information from a much larger store of unannotated data.
-
April 1, 2019The idea of using arrays of microphones to improve automatic speech recognition (ASR) is decades old. The acoustic signal generated by a sound source reaches multiple microphones with different time delays. This information can be used to create virtual directivity, emphasizing a sound arriving from a direction of interest and diminishing signals coming from other directions. In voice recognition, one of the more popular methods for doing this is known as “beamforming”.
-
Animation by Nick LittleMarch 28, 2019Audio watermarking is the process of adding a distinctive sound pattern — undetectable to the human ear — to an audio signal to make it identifiable to a computer. It’s one of the ways that video sites recognize copyrighted recordings that have been posted illegally. To identify a watermark, a computer usually converts a digital file into an audio signal, which it processes internally.