2017 Machine Translation Quality Evaluation

The language services industry now has an array of machine translation options. The goal of Lilt Labs is to provide reproducible and unbiased evaluations of these options using public datasets and a rigorous methodology.

The 2017 evaluation is intended to assess machine translation quality in a prototypical translation workflow. Therefore, it includes not only an evaluation of baseline translation quality, but also the quality of domain-adapted systems where available. Domain adaptation and neural networks are the two most exciting recent developments in commercially available machine translation. We evaluate the relative impact of both of these technologies and validate that both offer substantial translation quality improvements.

This 2017 evaluation compares several popular and publicly available commercial systems:

  • Google — Translations from Google’s phrase-based API, the standard until 2016.
  • Google Neural (GNMT) — Translations from Google’s neural machine translation API.
  • Microsoft — Translations from the Microsoft Translator API, which may include neural features.
  • Microsoft Adapted — Translations from the Microsoft Translator API using a relevant translation memory for domain adaptation via Translator Hub.
  • Systran Neural — Systran’s “Pure Neural MT” system accessed via the online demo.
  • SDL — Translations from a baseline SDL “AdaptiveMT” system, applied via pre-translation in Trados Studio 2017.
  • SDL Adapted — Translations produced by an AdaptiveMT system using a relevant translation memory for domain adaptation in Trados Studio 2017.

We also include three results from our own systems:

  • Lilt — Translations from Lilt before any translation memory is uploaded or the system is used.
  • Lilt Adapted — Translations from Lilt using a relevant translation memory for domain adaptation.
  • Lilt Interactive — Translations from Lilt using a relevant translation memory for domain adaptation and corrected translations for each confirmed segment.

Due to the amount of manual labor required, it was infeasible for us to evaluate an “SDL Interactive” configuration, in which the system adapts incrementally to corrected translations. We spent several hours attempting to simulate a translator’s workflow with automation software for Windows, but were still unable to translate most of the evaluation data. We welcome readers to contact us if they wish to help with this comparison.

In order to run a fair comparison between the different translation solutions, we chose to use language data that:

  1. Is representative of typical paid translation work
  2. Is not used in the training data for any of the competing translation systems
  3. Has reference translations that were not produced by post-editing the output of one of the competing machine translation solutions
  4. Is large enough to permit model adaptation

It has proven difficult to find public data sets that fulfill these constraints, as we have neither control over nor knowledge of what data other vendors possess. Anything that can be scraped from the web will eventually find its way into some of the systems. Constraints 2 and 3 are necessary for a meaningful evaluation; otherwise the results would be heavily skewed towards one of the competitors and would not be representative of new translation work.

After several months of search and evaluation, we found SwissAdmin [1], which is a corpus of press releases from the Swiss Federal Administration and is publicly available. It does not appear to be used as training data by any of the systems evaluated.

The majority of documents present in this dataset were originally written in German and then translated by the Swiss federal translation service, which guarantees high quality and should not be biased by any particular commercial machine translation system. The test sets consisted of the last 1300 segments of the 2013 articles for English-German and the last 1320 segments for English-French, excluding sentences with more than 200 words. As translation memory, we selected all remaining articles published from 2011 to 2013, totaling between 18,000 and 19,000 segments. We provide scripts for generating the adaptation and test data.
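As a rough illustration of the split described above, the following Python sketch holds out the last usable segments as a test set and keeps everything before them as translation memory. It is not the script referred to in this article: the function name `split_corpus`, the whitespace word count, and the choice to split by segment rather than by article are all assumptions made for the example.

```python
from typing import List, Tuple

Segment = Tuple[str, str]  # (source sentence, reference translation)


def split_corpus(
    segments: List[Segment], test_size: int, max_words: int = 200
) -> Tuple[List[Segment], List[Segment]]:
    """Return (translation_memory, test_set) from chronologically ordered segments.

    Assumption: length is measured in whitespace-separated words, and a
    segment is skipped for testing if either side exceeds `max_words`.
    """
    test: List[Segment] = []
    cut = len(segments)
    # Walk backwards from the newest segment until the test set is full.
    while cut > 0 and len(test) < test_size:
        cut -= 1
        source, reference = segments[cut]
        if len(source.split()) <= max_words and len(reference.split()) <= max_words:
            test.append((source, reference))
    if len(test) < test_size:
        raise ValueError("not enough usable segments for the requested test size")
    test.reverse()  # restore chronological order
    # Everything older than the oldest test segment becomes translation memory.
    return segments[:cut], test
```

Under these assumptions, a call such as `split_corpus(segments, 1300)` would correspond to the English-German setup, with the returned memory used only for adaptation.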

Translation quality is measured using the BLEU metric, the most common evaluation metric in machine translation research, which measures the similarity between proposed translations and reference translations. Higher numbers correspond to better translations. We decided to use BLEU because:

  • It is cheap and fast to compute on large amounts of data, and statistical significance tests require an evaluation on a large number of segments.
  • It correlates well with human judgement when comparing the output of different statistical machine translation systems.
  • There are nearly 15 years of historical results against which to compare the systems.
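To make the metric more concrete, here is a minimal Python sketch of corpus-level BLEU with a single reference per segment. It assumes plain whitespace tokenization, n-grams up to length 4, and no smoothing, so its numbers will not match those of standard scoring tools exactly; the function names are chosen for this example only.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Corpus-level BLEU on a 0-100 scale, one reference per segment."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes output that is shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)
```

The score is the geometric mean of the clipped n-gram precisions multiplied by the brevity penalty, which is why identical output scores 100 and output sharing no 4-gram with the reference scores 0 in this unsmoothed variant.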

Although a human evaluation is expensive and time-consuming for a sufficient amount of data, we are planning to confirm the automatic metrics with one at a later point.

The results are summarized in the following figures.
[Figures: image00, image01]

In both language pairs, Google’s neural machine translation system provided the best results, with Lilt’s interactive system close behind. The difference between these two systems was not statistically significant for either language pair on this dataset. However, the effectiveness of adaptation varies substantially across systems, language pairs and especially domains. On more repetitive and terminology-heavy translation tasks we typically observe improvements of more than 10 BLEU points by adaptation. In these cases the technique used for domain adaptation is a more important consideration for translators than the underlying neural or phrase-based technology.
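The article does not say which significance test was used. One common choice in machine translation research is paired bootstrap resampling, sketched below in Python as an illustration only. The function names, the default of 1,000 resamples, and the toy `exact_match_rate` metric are assumptions for this example; in practice the `metric` argument would be a corpus-level scorer such as BLEU.

```python
import random
from typing import Callable, List

Metric = Callable[[List[str], List[str]], float]


def paired_bootstrap(
    hyps_a: List[str],
    hyps_b: List[str],
    refs: List[str],
    metric: Metric,
    samples: int = 1000,
    seed: int = 0,
) -> float:
    """Return the fraction of resampled test sets on which system A beats system B."""
    if not refs or not (len(hyps_a) == len(hyps_b) == len(refs)):
        raise ValueError("all inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(samples):
        # Draw n segment indices with replacement; both systems use the same draw.
        idx = [rng.randrange(n) for _ in range(n)]
        sample_refs = [refs[i] for i in idx]
        score_a = metric([hyps_a[i] for i in idx], sample_refs)
        score_b = metric([hyps_b[i] for i in idx], sample_refs)
        if score_a > score_b:
            wins_a += 1
    return wins_a / samples


def exact_match_rate(hyps: List[str], refs: List[str]) -> float:
    """Toy metric for demonstration: share of segments reproduced exactly."""
    return sum(h == r for h, r in zip(hyps, refs)) / len(refs)
```

A result close to 0.5 suggests the two systems are hard to tell apart on this test set, while a result of 0.95 or above is usually read as A being significantly better than B at the 5% level.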

BLEU scores measure translation prediction accuracy, but can be challenging to interpret. They correlate with other measures of prediction accuracy. For example, when used interactively, Lilt’s English-German domain-adapted system (27.7% BLEU) guesses the next word correctly 34.5% of the time.

We hope to help human users of machine translation understand the potential impact of technologies available to them, so that they can choose their tools based on accurate and reliable analysis. Machine translation cannot be evaluated effectively by pasting a few sample sentences into a web page and reading the result — variability is just too high. Our evaluation on over 1,000 segments, chosen carefully to be representative of professional translation work, clearly shows that the new technologies of neural and adaptive translation are not just hype, but provide substantial improvements in machine translation quality.



[1] Yves Scherrer, Luka Nerima, Lorenza Russo, Maria Ivanova, Eric Wehrli (2014). SwissAdmin – A multilingual tagged parallel corpus of press releases. Proceedings of LREC 2014. Reykjavik, Iceland.


Appendix A

The following data sets were tested for our evaluation but discarded when it became obvious that at least one of the systems had used them for training:

  • JRC-Acquis
  • PANACEA English-French
  • IULA Spanish-English Technical Corpus
  • MuchMore Springer Bilingual Corpus
  • WMT Biomedical task
  • Autodesk Post-editing Corpus
  • PatTR
  • Travel domain data (from booking.com and elsewhere) crawled by Lilt


Appendix B

Training the adapted systems can be replicated as follows.

  • Lilt Adapted: Upload the adaptation data as the only translation memory and score the initial suggestions
  • Lilt Interactive: As above, but also simulate the translation process. For each segment, take the initial suggestion for scoring, confirm with the reference translation and move on to the next segment. This way, the machine suggestion for the n-th segment has learned from the first n-1 translations.
  • Microsoft Adapted: Use Microsoft Translator Hub to create a new system specifying the adaptation data for training. No additional tuning or test data is uploaded, so Microsoft automatically selects part of the training set for these purposes.
  • SDL Adapted: Add the adaptation data to a new local translation memory. Add a new AdaptiveMT engine specific to your project and pre-translate the test set. We assume that the local translation memory data is propagated to the AdaptiveMT engine for online retraining.
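The simulation described for Lilt Interactive can be sketched in Python as follows. This is an illustration of the loop only, not Lilt’s API: the `AdaptiveSystem` interface with its `translate` and `confirm` methods is hypothetical, and `MemoryOnlySystem` is a toy stand-in that merely recalls exact matches.

```python
from typing import Dict, List, Protocol, Tuple


class AdaptiveSystem(Protocol):
    """Hypothetical interface for a system that learns from confirmed segments."""

    def translate(self, source: str) -> str: ...

    def confirm(self, source: str, translation: str) -> None: ...


class MemoryOnlySystem:
    """Toy stand-in: returns a stored translation for exact source matches."""

    def __init__(self) -> None:
        self.memory: Dict[str, str] = {}

    def translate(self, source: str) -> str:
        return self.memory.get(source, "")

    def confirm(self, source: str, translation: str) -> None:
        self.memory[source] = translation


def simulate_interactive(
    system: AdaptiveSystem, test_set: List[Tuple[str, str]]
) -> List[str]:
    """Collect the initial suggestion for each segment, then confirm the reference."""
    suggestions: List[str] = []
    for source, reference in test_set:
        # The suggestion is recorded before the system sees this segment's reference.
        suggestions.append(system.translate(source))
        # Confirming the reference lets the system adapt before the next segment.
        system.confirm(source, reference)
    return suggestions
```

The returned suggestions are what would be scored against the references, so the suggestion for segment n reflects only the n-1 segments confirmed before it.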

4 thoughts on “2017 Machine Translation Quality Evaluation”

  1. It is difficult to accept this as an objective test as it is unclear how the adaptation was done for non-Lilt vendors especially for the SDL adaptive technology.

    If the results you present are true then there is very little benefit to using anybody other than Google.

    1. Thanks, Kirti, for this valuable feedback. I have added more information on the different adaptation processes in Appendix B.

      Your final conclusion can indeed be drawn for this particular data set. On slightly more repetitive and terminology-heavy domains we can usually observe larger improvements of more than 10% BLEU absolute by adaptation. In those cases we expect that all adapted systems would outperform Google’s NMT. I have clarified the article.

  2. “Finally, the fact that the source data starts as Swiss German, rather than regular German, may also be a minor problem. The differences between these German variants appear to be most pronounced when it is spoken rather than written, but Schriftsprache (written Swiss German) does seem to have some differences with standard high German. Wikipedia does state that: ‘Swiss German is intelligible to speakers of other Alemannic dialects, but poses greater difficulty in total comprehension to speakers of Standard German. Swiss German speakers on TV or in films are thus usually dubbed or subtitled if shown in Germany.’”

    – You seem to be confusing the spoken dialect (which the Wikipedia quote is about), written renderings of this, and “Received Standard” Swiss-German texts.

    The (substantial) differences between this and German Hochdeutsch are mainly a matter of vocabulary. This is the result of centuries of more-or-less separate development, particularly at all levels of government (which is presumably the main field of press releases by the Swiss Federal government). A German reading the City pages of the Neue Zürcher Zeitung is likely to need a specialized dictionary to understand the articles properly….

    So yes, this will distort results from systems that do not distinguish texts by country of origin and regional variant of a language.
