Recently at our Seattle headquarters, Amazon had the pleasure of hosting Iceland’s President, H. E. Guðni Th. Jóhannesson, along with a delegation spanning Icelandic government officials, business leaders, and academics. It was truly an honor to meet with them.
The president’s visit to the region was part of a broader mission to preserve the Icelandic language in the digital age by integrating it into all forms of technology. In this post, we’d like to highlight some of the exciting and innovative work Iceland has spearheaded to accelerate the digital integration of Icelandic. We have found the resulting resources to be strong, collaborative tools, and we hope others do, too.
Since 2019, Iceland’s government has been funding a five-year language technology program for Icelandic, which has led to an impressive set of artifacts relevant to text-to-speech, speech recognition, and natural-language processing. These include parallel datasets, pronunciation lexicons, text normalization mappings, speech data, treebanks, tokenizers, named-entity recognizers, and modeling recipes. These tools have important applications in all languages, particularly those with relatively small amounts of data for training machine learning models.
The program’s strategy is multifaceted, targeting everything from fundamental research to customer-facing products. Its five core research areas are language resources, speech recognition, speech synthesis, machine translation, and spelling and grammar checking.
A list of selected resources appears below. We hope that you will use these resources in your own work, and we encourage you to keep an eye on the program’s progress.
Additionally, we’d like to highlight some work that Amazon has been doing for language expansion and low-data natural-language processing.
We recently launched the MASSIVE dataset, competition, and workshop, which will help advance the state of the art for multilingual natural-language understanding, for Icelandic and 50 other languages.
Amazon Translate has expanded into 75 languages, and Amazon Polly supports 33 languages; both services include Icelandic. Language expansion and support are ongoing efforts across many Amazon services and products.
We’ve also been busy with core scientific research, including work on cross-lingual transfer learning, zero-shot transfer learning, multilingual training data generation, adversarial advertisement detection, text normalization for new languages in text-to-speech systems, and continuous improvement of machine translation. These are just a few examples. If you’d like to join us in tackling similar challenges, please visit our careers page.
The prevailing sentiment during our meeting with the Icelandic presidential delegation was one of optimism — optimism that developers everywhere can leverage recent and upcoming advances in artificial intelligence to accelerate the integration of Icelandic and other languages into all types of technology.
Keep building.
Resources
Here are some resources provided to us by the Icelandic delegation that you may find useful:
- An overview of the program and past work.
- Parallel text-speech database for TTS (Talrómur): The first part of the database (Talrómur 1) consists of 220 hours of studio-quality recordings from four female and four male voices. Each voice donor recorded between 10 and 30 hours of data, which should be sufficient to build a voice that sounds like that donor. The data is available under a Creative Commons 4.0 BY license.
- Talrómur 2: 80 hours of studio-quality recordings from 20 female and 20 male voices. Each voice donor recorded approximately two hours of data. While two hours might not be enough to create a voice from scratch based on a specific voice donor, it should be possible to join the voices in this dataset (and, indeed, in Talrómur 1) to create a voice that is a unique mix of the voices in the dataset. The data is available under a Creative Commons 4.0 BY license.
- Icelandic pronunciation dictionary: A manually verified pronunciation lexicon containing almost 50,000 unique word forms transcribed in four pronunciation variants, often including a clear and a less formal transcription (reading pronunciation vs. casual-speech pronunciation). The repository contains the transcription rules and guidelines followed in the project. The dictionary is available under a Creative Commons 4.0 BY license.
- Text normalization corpus: A corpus of 40,000 sentences, manually normalized for TTS. (One example of a normalization task in TTS is converting “$30” to “thirty dollars”.)
- Text preprocessing for TTS: A text-preprocessing pipeline connecting standalone modules for text cleaning, text normalization, phrasing, and grapheme-to-phoneme (g2p) conversion. The front-end pipeline and all submodules are available under an Apache 2.0 license.
- Recipes for Icelandic TTS: Open-source TTS recipes for Icelandic have been made available as part of the Language Technology Programme for Icelandic (LTPI). A traditional unit selection recipe implemented in Festival is available here under an Apache 2.0 license.
- Neural-TTS recipe: Implemented in FastSpeech. Available under an Apache 2.0 license.
- Talrómur 1 baseline models, train/test splits, and alignments
- Parallel text-speech database for ASR (Samrómur): The Samrómur crowdsourcing platform is derived from the Mozilla Common Voice project. It is based on prompts read by volunteers and totals over 2,300 hours of data. The crowdsourcing statistics can be seen here. A concurrent verification effort has led to publications (under Creative Commons 4.0 BY licenses) that can, for example, be found here. A similar dataset of 152 hours of adult voices was collected around 2011 and is available here.
- Parliamentary speech data: 542 hours of clean and verified speeches from the Icelandic parliament.
Other speech databases
- 193 hours of television and radio speech data
- 21 hours of transcribed conversations
- 51 hours of transcribed university lectures
- 20 hours of read queries
- 131 hours of children’s speech
Resources for ASR language modeling
Other tools and recipes for ASR
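
To make the text normalization task mentioned in the resource list more concrete, here is a minimal Python sketch of the “$30” to “thirty dollars” example. It is a toy English normalizer written for illustration only: it is not taken from the Icelandic program's pipeline, the names `number_to_words` and `normalize` are hypothetical, and it assumes whole-dollar amounts below 1,000. Real TTS front ends handle far more cases (and, for Icelandic, grammatical agreement as well).

```python
import re

# Toy English text normalizer for TTS. Illustrative only: the helper names
# below are hypothetical and do not come from the Icelandic program's code.

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only covers 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"
    hundreds, rest = divmod(n, 100)
    head = f"{ONES[hundreds]} hundred"
    return head if rest == 0 else f"{head} {number_to_words(rest)}"


# A dollar sign followed by digits, but not by a decimal part or a
# thousands separator (those cases are deliberately left untouched).
DOLLARS = re.compile(r"\$(\d+)(?![.,]?\d)")


def _spell_dollars(match: re.Match) -> str:
    amount = int(match.group(1))
    if amount > 999:
        return match.group(0)  # out of range for this sketch; keep as written
    unit = "dollar" if amount == 1 else "dollars"
    return f"{number_to_words(amount)} {unit}"


def normalize(text: str) -> str:
    """Replace whole-dollar amounts with their spoken form."""
    return DOLLARS.sub(_spell_dollars, text)


print(normalize("The ticket costs $30."))  # → The ticket costs thirty dollars.
```

A rule-based approach like this is easy to inspect but brittle, which is one reason a manually normalized corpus is valuable: it provides reference data for evaluating rules or training a learned normalizer.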