Neural machine translation (NMT) systems typically return a single translation for each input text segment. This means that when the input segment is ambiguous, the model must choose a translation from among various valid options, without regard to the intended use case or target audience. For example, a model translating from English must often choose a level of formality, or grammatical register, for its output: the tu and vous of French, for instance, or the tú and usted of Spanish.
In the past, training NMT models with formality control has relied on large labeled datasets. But creating high-quality labeled translations for many diverse languages is time consuming and expensive, so those earlier efforts were limited to a handful of languages.
This year, we released a new multidomain dataset, CoCoA-MT, with phrase-level annotations of formality and grammatical gender in six diverse language pairs, to help build NMT systems that do a better job of inferring formality. We also developed a method for using limited data to train machine translation models that can control the formality of their outputs. We presented both the dataset and the methodology at this year’s meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2022).
We also released CoCoA-MT as part of a shared task at this year’s International Workshop on Spoken Language Technologies (IWSLT 2022), where we invited the scientific community to tackle formality control in machine translation systems using limited or no data.
Culture and context
Without formality control, a model left to choose among valid translation options can produce translations that are inconsistent or that come across as rude or jarring in certain scenarios (e.g., business, customer service, gaming chat) or to speakers from certain cultures. For example, when asking "Are you sure?" in German, a customer support agent would use the formal register ("Sind Sie sich sicher?"), while in gaming chat, players would address each other in the informal register ("Bist du dir sicher?").
How formality is expressed grammatically and lexically can vary widely by language. In many Indo-European languages (e.g., German, Hindi, Italian, and Spanish), the formal and informal registers are distinguished by the second-person pronouns and/or corresponding verb agreement. Japanese and Korean have more extensive means of expressing polite, respectful, and humble speech, including morphological markings on the main verb and on some nouns and adjectives; specific lexical choices; and longer sentences.
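To make this concrete, here is a purely illustrative sketch of what contrastive formal and informal renderings of a single English source look like as data. The German pair is the example above; the Spanish pair is an assumed analogue using the usted vs. tú forms.

```python
# Illustrative only: one English source rendered in two registers.
# The German pair comes from the example above; the Spanish pair is
# an assumed analogue, not taken from the CoCoA-MT dataset itself.
CONTRASTIVE_EXAMPLE = {
    "source": "Are you sure?",
    "de": {"formal": "Sind Sie sich sicher?",
           "informal": "Bist du dir sicher?"},
    "es": {"formal": "¿Está usted seguro?",
           "informal": "¿Estás seguro?"},
}
```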
With CoCoA-MT and our new training methodology, we aim to develop NMT models that are responsive to a wide range of formality indicators.
Learning to control formality with limited data
For the initial release of CoCoA-MT, we focused on six language pairs (English → {German, Spanish, French, Hindi, Italian, Japanese}) across three spoken-language domains: customer support chat, topical chat, and telephone conversations. We asked professional translators to produce formal translations of English source segments and then post-edit them into informal translations, making only the minimal necessary changes (e.g., changing verb inflections, swapping pronouns). The translators also annotated the phrases that mark formality, and we used those annotations to develop a segment-level metric of formality accuracy.
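As an illustration, here is a minimal sketch of a segment-level formality-accuracy metric in that spirit: a hypothesis counts as correctly formal when it contains the formality-marking phrases annotated in the formal reference and none of those annotated in the informal one. The bracketed annotation format ([F]...[/F]) and the exact scoring rule here are assumptions for illustration, not the precise CoCoA-MT specification.

```python
# Hypothetical sketch of a segment-level formality-accuracy metric.
# Assumption: formality-marking phrases in each reference are bracketed
# as [F]...[/F]; this format and the scoring rule are illustrative only.
import re

def annotated_phrases(reference: str) -> list[str]:
    """Extract the phrases marked [F]...[/F] in a reference translation."""
    return re.findall(r"\[F\](.*?)\[/F\]", reference)

def formality_accuracy(hypotheses, formal_refs, informal_refs) -> float:
    """Fraction of hypotheses using the formal phrases and avoiding the informal ones."""
    correct = 0
    for hyp, f_ref, i_ref in zip(hypotheses, formal_refs, informal_refs):
        uses_formal = any(p in hyp for p in annotated_phrases(f_ref))
        uses_informal = any(p in hyp for p in annotated_phrases(i_ref))
        correct += uses_formal and not uses_informal
    return correct / len(hypotheses)

# The German example from above, with assumed phrase annotations:
hyps = ["Sind Sie sich sicher?"]
formal = ["[F]Sind Sie[/F] [F]sich[/F] sicher?"]
informal = ["[F]Bist du[/F] [F]dir[/F] sicher?"]
print(formality_accuracy(hyps, formal, informal))  # 1.0
```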
To leverage a small amount of labeled contrastive data, we propose framing formality control as a transfer learning problem. Our method begins with a generic NMT model, which is fine-tuned on contrastive examples from the CoCoA-MT dataset.
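One common way to implement this kind of control, sketched below, is to prepend a formality tag to the source side of each fine-tuning example, so that the requested register can be selected with the same tag at inference time. The pretrained model name, the tag tokens, and the hyperparameters here are assumptions for illustration, not necessarily the recipe from our paper.

```python
# Minimal sketch: fine-tune a generic NMT model on contrastive,
# formality-labeled pairs by prepending a control tag to the source.
# Model choice, tag format, and hyperparameters are assumptions.
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"  # assumed generic en->de model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Register the control tags so each is encoded as a single token.
tokenizer.add_tokens(["<formal>", "<informal>"])
model.resize_token_embeddings(len(tokenizer))

# Contrastive examples: the same source with formal and informal references.
contrastive = [
    ("Are you sure?", "Sind Sie sich sicher?", "<formal>"),
    ("Are you sure?", "Bist du dir sicher?", "<informal>"),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for src, tgt, tag in contrastive:
    batch = tokenizer(f"{tag} {src}", return_tensors="pt")
    labels = tokenizer(text_target=tgt, return_tensors="pt").input_ids
    loss = model(**batch, labels=labels).loss  # standard cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# At inference time, the requested register is selected with the same tag.
model.eval()
inputs = tokenizer("<formal> Are you sure?", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs)[0], skip_special_tokens=True))
```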
In our paper, we show that this fine-tuning strategy can successfully control the formality of a generic NMT system without losing generic quality. We also show that the formality-controlled system remains effective in out-of-domain settings (i.e., settings not matching the training domains).
What’s next?
We’ve seen that contrastive labeled data and transfer learning make it possible to train models effectively with a limited amount of data while preserving generic quality and generalizing to unseen domains. But challenges remain, especially as we look to expand formality customization to all 75 languages supported by Amazon Translate. For now, Amazon Translate customers can take advantage of this latest research to control the formality level when translating into French, German, Hindi, Italian, Japanese, European Spanish, Mexican Spanish, Korean, Dutch, Canadian French, and European Portuguese. Researchers can look forward to IWSLT 2023, where we will organize a shared task on formality control in collaboration with the University of Maryland, College Park.
Acknowledgments: Anna Currey