Meta’s machine translation journey

there is around 7000 languages spoken around the world, but most translation models focus on English and other popular languages. This excludes much of the world from the benefit of having access to content, technologies and other benefits of being online. Tech giants are trying to bridge this gap. Just a few days ago, Meta announced plans to release a Universal Speech Translator to translate speech from one language to another in real time. This announcement comes as no surprise to anyone who follows the company closely. Meta has been dedicated to bringing innovations in machine translations for some time.

Let’s take a quick look at the main highlights of his machine translation journey.


NMT Translation Scaling

Meta used neural machine translation (NMT) to automatically translate text in posts and comments. NMT models are useful for learning from large-scale monolingual data, and Meta was able to train an NMT model in 32 minutes. This was a drastic reduction in training time from 24 hours.

LASER in open source

In 2018, Meta also open source the LASER toolkit (Language-Agnostic Sentence Representations). It works with over 90 languages ​​written in 28 different alphabets. LASER calculates multilingual sentence embeddings for instant cross-linguistic transfer. It also works on low resource languages. Meta stated that “LASER achieves these results by integrating all languages ​​jointly into a single shared space (rather than having a separate model for each).”


Wav2vec: unsupervised pre-training for speech recognition

Today, to access various benefits of technology like GPS, virtual assistants basically need voice recognition technology. But most of them are dependent on English, and a large portion of people who don’t speak the language or who speak it with an unrecognizable accent are left out of using such a simple and important method of accessing information and services. Wav2vec wanted to solve this problem. Here, unsupervised pre-training for speech recognition was the focus of Meta. Wav2vec is trained on untagged audio data.

Meta adds: “The wav2vec model is trained by predicting the voice units for the masked portions of speech audio. It learns basic units of 25 ms duration to enable learning of high-level contextualized representations. »

Thanks to this, Meta has been able to create speech recognition systems that perform much better than the best semi-supervised methods, although it can have 100 times less labeled training data.


M2M-100: Multilingual machine translation

2020 was an important year for Meta, where it released different models that advanced machine translation technology. M2M-100 was one of them. It’s a multilingual machine translation (MMT) which translates between any pair of 100 languages ​​without relying on English as an intermediary. M2M-100 is trained on a total of 2,200 language directions. This model wants to improve the quality of translations worldwide, especially those that speak languages ​​with low Meta claimed resources.

CoVoST: multilingual speech-to-text translation

CoVoST is a multilingual corpus of speech-to-text translation from 11 languages ​​into English. What makes it unique is that CoVoST covers over 11,000 speakers and over 60 accents. Meta claims that it is the “first end-to-end multilingual many-to-one model for spoken language translation”.


FLORES 101: low-resource languages

Following in the footsteps of M2M-100, in the first half of last year, Meta opened FLORES-101. This is a many-to-many assessment dataset that covers 101 languages ​​worldwide, with a focus on low-resource languages ​​that still lack large datasets. Meta added, “FLORES-101 is the missing piece, the tool that allows researchers to quickly test and improve multilingual translation models like M2M-100.”



In 2022, Meta released data2vec, calling it “the first high-performance self-supervised algorithm that works for multiple modalities”. It was applied to speech, text, and images separately and outperformed previous best single-use algorithms for computer vision and speech. Data2vec does not rely on contrastive learning or reconstruction of the input example.