Tie your embeddings down: Cross-modal latent spaces for end-to-end spoken language understanding

2022
End-to-end (E2E) spoken language understanding (SLU) systems can infer the semantics of a spoken utterance directly from an audio signal. However, training an E2E system remains a challenge, largely due to the scarcity of paired audio-semantics data. In this paper, we treat an E2E system as a multi-modal model, with audio and text as its two modalities, and use a cross-modal latent space (CMLS) architecture, in which a shared latent space is learned between the ‘acoustic’ and ‘text’ embeddings. We propose using different multi-modal losses to explicitly align the acoustic embeddings to the text embeddings (obtained via a semantically powerful pre-trained BERT model) in the latent space. We train the CMLS model on two publicly available E2E datasets and one internal dataset, across different cross-modal losses. Our proposed triplet loss function achieves the best performance: on our internal dataset, it yields a relative improvement of 22.1% over an E2E model without a cross-modal space and of 2.8% over a previously published CMLS model that uses an L2 loss.
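The cross-modal triplet loss described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the acoustic anchor, matching BERT text embedding (positive), and mismatched text embedding (negative) are stand-in arrays, and the function name and margin value are illustrative assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Cross-modal triplet loss (illustrative sketch).

    Pulls each acoustic 'anchor' embedding toward the text embedding of
    the same utterance ('positive') and pushes it away from the text
    embedding of a different utterance ('negative') by at least 'margin'.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=-1)  # acoustic-to-matching-text distance
    d_neg = np.linalg.norm(anchor - negative, axis=-1)  # acoustic-to-mismatched-text distance
    return np.maximum(0.0, d_pos - d_neg + margin).mean()

# Toy example: the anchor already coincides with its positive, so the
# hinge is inactive and the loss is zero.
acoustic = np.array([[1.0, 0.0]])
text_match = np.array([[1.0, 0.0]])
text_other = np.array([[0.0, 1.0]])
print(triplet_loss(acoustic, text_match, text_other))
```

In a full system the loss would be minimized jointly with the SLU classification objective, so that the acoustic encoder inherits structure from the semantically richer BERT text space.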