Google’s massive translation work pinpoints where it’s goofing

Scores for languages when translating from English and back to English, correlated with the number of example sentences each language has. Toward the right side, more example sentences yield better scores. There are outliers, such as English written in Cyrillic script, which has very few examples but translates well.

Bapna et al., 2022

What do you do after collecting writing samples in a thousand languages for translation, only to have humans rate the resulting translations as a failure?

Examine the failures, of course.

And that’s the interesting work that machine learning scientists at Google reported on this month in a massive research paper on multilingual translation, “Building Machine Translation Systems for the Next Thousand Languages.”

“Despite enormous advances in low-resource machine translation, the number of languages for which widely available, general-purpose machine translation systems have been built has been limited to around 100, which is a small fraction of the more than 7,000 languages that are spoken in the world today,” write lead author Ankur Bapna and colleagues.

The paper describes a project to create a dataset covering more than a thousand languages, including so-called low-resource languages: those that have very few documents available to serve as examples for machine learning training.

Also: DeepMind: why is AI so good at language? It’s something in the language itself

While it is easy to collect billions of example sentences for English and more than a hundred million example sentences for, say, Icelandic, the Kalaallisut language, spoken by approximately 56,000 people in Greenland, has fewer than a million, and the Kelantan-Pattani Malay language, spoken by about five million people in Malaysia and Thailand, has fewer than 10,000 readily available example sentences.

To compile a dataset for machine translation of these low-resource languages, Bapna and two dozen colleagues first created a tool to scour the Internet and identify texts in those languages. The authors used a number of machine learning techniques to extend a system called LangID, a collection of techniques for identifying whether web text belongs to a given language. Much of the work is a fairly involved process of weeding out false positives.
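
To make that filtering step concrete, here is a minimal sketch in Python of the general idea, not Google’s actual pipeline: run a language-identification classifier over crawled sentences and keep only the ones it confidently assigns to the target language, which screens out many false positives. The identify_language helper is hypothetical, standing in for whatever classifier a LangID-style system provides.

```python
from typing import Iterable, List, Tuple

# Hypothetical stand-in for a LangID-style classifier: returns a
# predicted language code and a confidence score for one sentence.
def identify_language(sentence: str) -> Tuple[str, float]:
    raise NotImplementedError("plug in any language-identification model here")

def filter_crawled_sentences(
    sentences: Iterable[str],
    target_lang: str,
    min_confidence: float = 0.9,
    min_length: int = 20,
) -> List[str]:
    """Keep sentences the classifier confidently assigns to target_lang.

    The length and confidence thresholds are simple heuristics against
    the false positives that plague web-crawled low-resource corpora.
    """
    kept = []
    for sentence in sentences:
        sentence = sentence.strip()
        if len(sentence) < min_length:
            continue  # very short strings are easy to misclassify
        lang, confidence = identify_language(sentence)
        if lang == target_lang and confidence >= min_confidence:
            kept.append(sentence)
    return kept
```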

After scouring the web with LangID techniques, the authors were able to assemble “a dataset with corpora for 1503 low-resource languages, ranging from one sentence (Mape) to 83 million sentences (Sabah Malay).”

The scientists narrowed this list down to 1,057 languages “where we recovered over 25,000 monolingual sentences (before deduplication),” and combined that group of samples with the much larger data for 83 “high-resource languages” such as English.
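
The 25,000-sentence cutoff amounts to a simple filter over the per-language corpora. A toy version, with invented data and variable names, might look like this:

```python
# Toy illustration of the corpus-selection step: keep only languages
# whose crawled corpus clears the 25,000-sentence threshold (counted
# before deduplication), then deduplicate what remains.
corpora = {
    "lang_a": ["sentence one", "sentence one", "sentence two"],  # invented data
    "lang_b": ["sentence three"],
}

MIN_SENTENCES = 25_000

selected = {
    lang: list(dict.fromkeys(sentences))  # order-preserving deduplication
    for lang, sentences in corpora.items()
    if len(sentences) >= MIN_SENTENCES
}
print(f"{len(selected)} languages kept")
```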

Also: AI: the pattern is not in the data, it is in the machine

They then tested their dataset by running experiments to translate between the languages in the set, using various versions of the ubiquitous Transformer neural network for language modeling. To gauge the quality of the translations, the authors focused on translation to and from English for 38 languages for which they had obtained examples of true translations, including Kalaallisut.
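
The paper’s own models and test sets are not packaged for public use, but the shape of such an evaluation is easy to sketch. Below is a hedged illustration using the open-source Hugging Face transformers library with a publicly available multilingual model; the model name, language codes, and sentence pairs are assumptions for illustration, and most of the paper’s low-resource languages are not covered by public checkpoints.

```python
from transformers import pipeline

# Illustrative stand-in only: a public multilingual model, not the
# models described in the paper. Model name and FLORES-style language
# codes are assumptions for the sake of a runnable example.
translator = pipeline("translation", model="facebook/nllb-200-distilled-600M")

sources = ["The tiger slept by the river.", "The rain did not stop all night."]
references = ["Le tigre dormait près de la rivière.", "La pluie n'a pas cessé de la nuit."]

outputs = translator(sources, src_lang="eng_Latn", tgt_lang="fra_Latn")
hypotheses = [o["translation_text"] for o in outputs]

# The hypotheses would then be scored against the references with an
# automatic metric such as BLEU or chrF (see the metric sketch below).
for hyp, ref in zip(hypotheses, references):
    print(f"MT:  {hyp}\nREF: {ref}\n")
```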

This is where the most interesting part comes in. The authors asked human reviewers who are native speakers of the low-resource languages to rate the quality of translations for 28 of the languages on a scale from zero to six, with 0 meaning “nonsense or wrong language” and 6 meaning “perfect.”

Also: Facebook open-sources Tower of Babel, Klingon not supported

The results are not great. Of the 28 languages translated from English, 13 were rated below 4 on the scale for translation quality, implying that almost half of the translations from English into the target language were poor.

The authors have a fascinating discussion starting on page 23 about what seems to have gone wrong in these translations with low ratings.

“The biggest takeaway is that automatic metrics overestimate performance on related dialects,” they write, meaning that the scores the machine assigns to translations, such as the widely used BLEU score, tend to give credit even where the neural network is merely producing garbled output in a closely related language. For example, “Nigerian Pidgin (pcm), a dialect of English, had very high BLEU and CHRF scores, around 35 and 60 respectively,” yet trusted native speakers confirmed that the translations were unusable.

“What is happening here is that the model translates into (a corrupted version of) the wrong dialect, but it is close enough at the character n-gram level” for the automatic metric to give it a high score, they observe.
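
Those automatic scores are easy to reproduce with off-the-shelf tooling. As a hedged illustration (sacrebleu is a common open-source scorer, not necessarily the paper’s exact implementation, and the sentence pair below is invented), character-level chrF can remain substantial for a hypothesis that is really just lightly modified English rather than the intended target language, because it measures only character n-gram overlap with the reference:

```python
import sacrebleu

# Invented example: a reference in the intended target language and a
# hypothesis that is essentially standard English in disguise.
references = [["Di rain bin fall for night, e no gree stop."]]
hypotheses = ["The rain fell in the night, it did not stop."]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)

# Character n-gram overlap can stay non-trivial even when the output is
# in the wrong dialect, which is the overestimation the authors describe.
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```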

“This is the result of a data pollution problem,” they deduce, “since these languages are so close to other languages that are much more widespread on the web […] training data is much more likely to be mixed with corrupted versions of the higher-resource language, or with other varieties.”

Examples of translations, with correct terms in blue and translation errors in yellow. The left column shows the code of the language being translated into, using standard BCP-47 language tags.

Bapna et al., 2022

Also: Google uses MLPerf competition to showcase performance of gigantic version of BERT language model

And then there are what the authors call “characteristic error modes” in translations, such as translating nouns that occur in similar distributional contexts in the training data, for example replacing a relatively common noun like “tiger” with a different kind of animal word. This shows, they note, that the model learned the distributional context in which the noun appears but was unable to acquire the exact mappings from one language to another with sufficient detail for this category.

Such problems occur with “animal names, colors and times of day,” and the issue “was also a problem with adjectives, but we observed few such errors with verbs. Sometimes, words were translated into phrases that could be considered culturally analogous concepts – for example, translating “cheese and butter” to “curd and yoghurt” when translating from Sanskrit.”

Also: Google’s latest language machine puts the focus back on language

The authors argue for close collaboration with native speakers:

We stress that, whenever possible, it is important to try to build relationships with native speakers and members of these communities, rather than just interacting with them as remote workers. For this work, the authors contacted members of as many communities as possible, having conversations with over 100 members of these communities, many of whom were active in this project.

An appendix thanks a long list of these native speakers.

Despite the failures cited, the authors conclude that the work has notable successes. In particular, using the LangID approach to crawl the web, “we are able to create an unlabeled multilingual text dataset containing over a million sentences for more than 200 languages and over 100,000 sentences in more than 400 languages.”

And working with Transformer models convinces them that “it is possible to build high-quality, practical MT models for long-tail languages using the approach described in this work.”