Here is the latest European attempt at end-to-end voice translation

If there were ever a shortlist of projects with the potential to produce a Babel Fish-like translation device, this one would probably be on it.

Backed by European universities, the private sector and government, the project is called ELITR (pronounced “eh-lee-ter”), short for European Live Translator. The project was born out of the need to provide subtitles for a EUROSAI congress last May.

EUROSAI is the European Organization of Supreme Audit Institutions. The Supreme Audit Office of the Czech Republic launched the project to help translate speech in real time from six source languages into 43 target languages: 24 EU languages plus 19 EUROSAI languages (e.g., Armenian, Russian, Bosnian, Georgian, Hebrew, Kazakh, Norwegian, Luxembourgish).


In one ELITR demo video, Ondřej Bojar, an associate professor at Charles University, said the project is also considering the possibility of “going directly from source speech to target language with an end-to-end spoken language translation system.”

In short, speech-to-speech translation (S2ST). For ELITR, however, Bojar told Slator, “We stop at target text. We don’t include the final text-to-speech, although we certainly can.”

S2ST has become something of a brass ring in research and big tech. It has been pursued by Apple, Google (via the so-called “Translatotron”) and prominent Japanese researchers, who uploaded a toolkit for it to GitHub. Chinese search giant Baidu has even drawn attention for its claims about it; and, of course, there is a whole graveyard of translation gimmicks from companies that tried to market S2ST.

Admittedly, the production pipeline for ELITR currently relies on two independent stages, automatic speech recognition (ASR) and machine translation (MT). According to Bojar, “we are actually quite good at both of these stages” (as evidenced by a paper published on June 17, 2021, and two others published in September and October 2020).
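To make the two-stage cascade concrete, here is a minimal Python sketch. It is not ELITR's code: `toy_asr`, `toy_mt` and `cascade` are hypothetical stand-ins invented for illustration, and a real system would call trained ASR and MT models where the stubs sit. The sketch only shows how the two stages are chained and fanned out to several target languages.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical stand-ins (assumptions, not ELITR components): a real system
# would call a trained ASR model and a trained MT model here.
def toy_asr(audio_chunk: bytes) -> str:
    """Pretend recognizer: treats the bytes as UTF-8 text."""
    return audio_chunk.decode("utf-8")

def toy_mt(source_text: str, target_lang: str) -> str:
    """Pretend translator: tags the text with the target language code."""
    return f"[{target_lang}] {source_text}"

def cascade(
    audio_chunks: Iterable[bytes],
    target_langs: List[str],
    asr: Callable[[bytes], str] = toy_asr,
    mt: Callable[[str, str], str] = toy_mt,
) -> Iterator[Dict[str, str]]:
    """Stage 1 (ASR) produces source text; stage 2 (MT) translates it
    into every requested target language. The stages are independent:
    MT only ever sees the transcript, never the audio."""
    for chunk in audio_chunks:
        transcript = asr(chunk)
        yield {lang: mt(transcript, lang) for lang in target_langs}

if __name__ == "__main__":
    for subtitles in cascade([b"dobry den"], ["en", "de"]):
        print(subtitles)
```

Because the second stage receives only text, anything carried by the audio alone is unavailable to it, which is one reason a direct end-to-end system is attractive.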

End-to-end voice translation is part of the long-term vision, as stated in a recent paper published on the portal of the Association for Computational Linguistics. “The goal of a practically usable simultaneous spoken language translation (SLT) system is getting closer,” wrote the authors from Charles University, Karlsruhe Institute of Technology, the University of Edinburgh and Italian ASR provider PerVoice. SLT also encompasses offline spoken language systems, the authors said.

The authors (Bojar, among them) mentioned two problems with the current system that have not yet been solved.

  • Intonation, which cannot be taken into account because punctuation prediction does not have access to the sound; and
  • Segmentation errors, i.e., machine translation systems tend to “normalize the order of words”, thereby reducing fluency in a stream of spoken sentences.

Therefore, “for the future, we are considering three approaches,” Bojar et al. added: “(1) MT training on sentence chunks, (2) including sound input in punctuation prediction, or (3) end-to-end neural SLT.”

The University of Edinburgh and the Karlsruhe Institute of Technology worked alongside Charles University on ELITR. ASR provider PerVoice and Germany-based video conferencing platform alfaview also participated in the project. Does this mean that commercialization plans are on the drawing board?

Bojar told Slator, “For a research institute in a university, commercialization is always something that takes an unbearable amount of time, but we are definitely very open to many forms of collaboration.”