Microsoft is pursuing AI at Scale with the ambition of enabling the next generation of AI experiences. The Microsoft Translator ZCode team is working with Microsoft Project Turing and Microsoft Research Asia to advance language and multilingual support at the core of this initiative. We continue to push the frontier with multilingual models that support a range of language scenarios across Microsoft. Last summer, we announced our large-scale multilingual Mixture of Experts model with DeepSpeed, which can outperform individual large-scale bilingual models. Recently, the latest Turing Universal Language Representation model (T-ULRv5), a model created by Microsoft, again set the state of the art and topped the Google XTREME public leaderboard at the time. More recently, Microsoft announced its largest model, the 530-billion-parameter Megatron-Turing NLG 530B.
The annual Conference on Machine Translation (WMT 2021) wrapped up last week in beautiful Punta Cana, Dominican Republic. WMT brings together researchers from across the machine translation field, from both industry and academia, to participate in a series of shared tasks, each of which sets a benchmark in an important area of machine translation and pushes the field toward new frontiers.
The Microsoft Translator ZCode team, together with the Turing team and Microsoft Research Asia, participated in the “Large-Scale Multilingual Translation” track, which consisted of a full task of translating between all 10,000 directions across 101 languages, and two small tasks: one focusing on 5 Central and Southern European languages, and one on 5 Southeast Asian languages. The Microsoft ZCode-DeltaLM model won all three tasks by huge margins, including a gain of more than 10 points over the M2M100 model in the large task, which was evaluated on 10,000 language pairs. (Findings of the WMT 2021 Shared Task on Large-Scale Multilingual Machine Translation, Wenzek et al., WMT 2021).
Figure 1: Official results (BLEU scores) on the full task and small task 1 at the WMT 2021 Large-Scale Multilingual Translation Shared Task
The ZCode-DeltaLM approach
In this blog post, let’s take a look under the hood at the winning Microsoft ZCode-DeltaLM model. Our starting point was DeltaLM (DeltaLM: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual Encoders), the latest in Microsoft’s increasingly powerful series of massively multilingual pre-trained language models.
DeltaLM is an encoder-decoder model, but instead of being trained from scratch, it is initialized from a pre-trained state-of-the-art encoder-only model, specifically TULRv3. While initializing the encoder is straightforward, the decoder is less so, because it adds cross-attention to the encoder’s self-attention. DeltaLM solves this problem with a new interleaved architecture in which self-attention and cross-attention alternate between layers, with self-attention used in odd layers and cross-attention used in even layers. With this interleaving, the structure of the decoder matches that of the encoder, so the decoder can be initialized from TULRv3 in the same way.
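The alternating layer pattern described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the DeltaLM implementation: the function name `interleaved_decoder_plan`, the dictionary keys, and the one-to-one mapping from decoder layer i to pre-trained encoder layer i are assumptions made for the example.

```python
def interleaved_decoder_plan(num_layers):
    """Describe an interleaved decoder, one entry per layer (1-indexed).

    Odd layers use self-attention and even layers use cross-attention,
    following the description in the text. The "init_from_pretrained_layer"
    field is a hypothetical one-to-one mapping onto the pre-trained
    encoder's layers, assumed here for illustration only.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    plan = []
    for layer in range(1, num_layers + 1):
        attention = "self" if layer % 2 == 1 else "cross"
        plan.append(
            {
                "layer": layer,
                "attention": attention,
                "init_from_pretrained_layer": layer,  # assumed mapping
            }
        )
    return plan


# Print the plan for a small, hypothetical 6-layer decoder.
for entry in interleaved_decoder_plan(6):
    print(entry)
```

Because every decoder layer in this plan has a counterpart in the pre-trained encoder, no attention layer needs to be initialized randomly, which is the point of the interleaving.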
DeltaLM is complemented by ZCode’s powerful multitask learning: Multi-task Learning for Multilingual Neural Machine Translation. Our models show that combining multitask and multilingual learning can significantly improve training for large-scale pre-trained language models. Such a multitask multilingual learning paradigm leverages the inductive bias and regularization from multiple tasks and languages simultaneously to perform better on various downstream tasks. We use the translation task, the denoising auto-encoder task, and the translation span corruption task, as shown in the figure below.
Winning the Massively Multilingual Translation Track
To build our winning massively multilingual translation system (Microsoft’s Multilingual Machine Translation Systems for WMT21 Shared Task), we started with ZCode-DeltaLM and added a few tricks.
We apply progressive learning, first training a model with 24 encoder layers and 12 decoder layers, then continuing training with 12 added encoder layers, resulting in a deep 36-layer encoder. To cover all language pairs, we generate dual-pseudo-parallel data, where both sides of the parallel data are synthetic, translated by the model from English. We also apply iterative back-translation to generate synthetic data. We apply curriculum learning, starting with the entire noisy training data and then reducing it to a clean subset. We reweight the translation objective to favor parallel data over back-translation and dual-pseudo-parallel data. We apply temperature sampling to balance across language pairs. For each language pair, we choose, based on the development set, whether to prefer direct translation or pivot translation through English.
Putting it all together, we knew we had an amazing massively multilingual system, but the official blind test results exceeded our expectations. We scored 2.5 to 9 BLEU points ahead of the next competitor and 10 to 21 BLEU points ahead of the baseline M2M-175 model. On the development test, we compared against the larger M2M-615 model, which we also beat by 10 to 18 points.
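Temperature sampling, mentioned above, is commonly formulated as drawing language pair i with probability proportional to (n_i / N)^(1/T), where n_i is the number of training examples for that pair and N is the total. The sketch below follows that common formulation. The function name, the example data sizes, and the temperature value are hypothetical and are not taken from the system described here.

```python
def temperature_sampling_probs(pair_sizes, temperature=5.0):
    """Return sampling probabilities for each language pair.

    pair_sizes maps a language-pair name to its number of training examples.
    temperature=1.0 reproduces sampling proportional to data size; larger
    values flatten the distribution toward uniform, which upweights
    low-resource pairs. The default of 5.0 is an illustrative choice only.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(pair_sizes.values())
    if total <= 0:
        raise ValueError("pair_sizes must contain at least one example")
    # Raise each pair's share of the data to the power 1/T, then renormalize.
    scaled = {
        pair: (size / total) ** (1.0 / temperature)
        for pair, size in pair_sizes.items()
    }
    norm = sum(scaled.values())
    return {pair: weight / norm for pair, weight in scaled.items()}


# Hypothetical corpus sizes: one high-resource and one low-resource pair.
sizes = {"en-fr": 1_000_000, "en-is": 10_000}
print(temperature_sampling_probs(sizes, temperature=1.0))
print(temperature_sampling_probs(sizes, temperature=5.0))
```

With a temperature of 1 the low-resource pair is sampled roughly 1% of the time in this toy setup; raising the temperature increases its share without changing the underlying data.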
Beyond translation: universal language generation
Although we are excited about the big win at WMT 2021, what is even more exciting is that, unlike the other competitors, our ZCode-DeltaLM model is not just a translation model but a general pre-trained encoder-decoder language model, usable for all sorts of generation tasks beyond translation. This allows our models to perform well on a variety of multilingual natural language generation tasks.
We reached a new state of the art on many popular generation tasks from the GEM Benchmark, including Wikilingua (summarization), text simplification (WikiAuto), and structure-to-text (WebNLG). The ZCode-DeltaLM model substantially outperforms much larger models such as mT5 XL (3.7B), which is also trained on much more data. This demonstrates the efficiency and versatility of the model, which delivers strong performance across many tasks.
Figure 2. Performance (RL scores) of ZCode-DeltaLM on the summarization and text simplification tasks in the GEM benchmark
Multilingual machine translation has reached a point where it performs very well, exceeding bilingual systems on both low- and high-resource languages. Mixture of Experts (MoE) models have proven to be a very good fit for scaling up such models, as shown in GShard. We explore how to scale these models efficiently with Mixture of Experts in Scalable and Efficient MoE Training for Multitask Multilingual Models. MoE models with massive multilingual data and unsupervised multitask training present an unprecedented opportunity to deliver truly universal systems that can further enable the Microsoft Translator team to break down language barriers across the globe, as well as support a variety of natural language generation tasks.
We would like to thank Francisco Guzman and his team, who collected the massively multilingual FLORES test set and organized this WMT track with such a large-scale evaluation.