Machine Translation 101 – Part 3

How to Perform Machine Translation Evaluation

Performing machine translation evaluation is a key part of your workflow. After all, without assessing the results, how can you know if your machine translation model is doing a good job? 

Done properly, machine translation evaluation helps you to understand how closely your model relates to your translation domain, and how your end users feel about the translation service that it delivers. 

In this article, we’ll walk you through some of the different approaches to evaluation and discuss a range of the key challenges. 

The Importance of Machine Translation Evaluation

Machine translation evaluation plays an essential role in the development of machine translation models. Performing an evaluation is critical for determining how effective your existing model is, as well as estimating how much post-editing is needed, negotiating prices with your customers, and managing their expectations. 

Human vs Automatic Machine Translation Evaluation

Machine translation evaluation is a tricky endeavor because natural languages are highly ambiguous. Much of their complexity relates to how each person interprets language differently. With so many possibilities, it’s challenging from a computational perspective to arrive at an evaluation score. 

In machine translation evaluation, you ideally compare the machine-translated target sentence with a ‘gold standard’ reference sentence. But a single gold standard sentence is difficult to define. A sentence can be translated in many possible ways that all convey the same meaning. That’s problematic for humans as well as computers. Even when a human translates a text, opinions on translation quality will likely differ from one reader to another.

When evaluating machine translation, you can use both manual and automatic evaluation approaches. Let’s take a closer look at each one. 

The Manual Approach 

Using professional human translators provides the best results in terms of measuring quality and analyzing errors. It allows you to easily evaluate key metrics such as adequacy and fluency scores, post-editing measures, human ranking of translations at sentence-level, and task-based evaluations. 

Human translators evaluate machine translation in a number of ways. The first is by assigning a rating to the overall quality of the target translation. This is usually done on a scale of 1-10 (or a percentage), ranging from ‘very bad quality’ to ‘flawless quality.’ 

Another way to evaluate machine translation is by its adequacy, i.e. how much of the source text meaning has been retained in the target text. This is normally rated on a scale from ‘no meaning retained’ all the way through to ‘all meaning retained’. Evaluating adequacy requires human evaluators to be fluent in both languages.

Fluency is another useful metric for human evaluators to judge the quality of a translation. Scales usually range from ‘incomprehensible’ to ‘fluent’. Evaluating fluency only involves the target text, removing the need for the evaluator to know both the source and target languages.

Adequacy and fluency are the most common ways to perform manual machine translation analysis. Human evaluators may also use error analysis to identify and classify errors in machine translated text. The exact process depends on the language, but generally speaking, evaluators will look for types of errors such as ‘missing words,’ ‘incorrect word order,’ ‘added words,’ or ‘wrong part of speech.’ 

One of the main issues with the manual approach stems from the subjective nature of human judgment. As a result, it can be hard to achieve a good level of intra-rater (consistency of the same human evaluator) and inter-rater (consistency across multiple evaluators) agreement. On top of that, there are no standardized metrics for human evaluation. It’s also costly and time-consuming, especially when bilingual evaluators are required. 
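Inter-rater agreement can be quantified. One widely used statistic for two evaluators is Cohen's kappa, which corrects the raw agreement rate for the agreement you would expect by chance. The sketch below is a minimal illustration, not a prescribed method: the function name, the 1-4 fluency scale, and the ratings themselves are hypothetical.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Both raters must rate the same non-empty set of items")
    n = len(ratings_a)

    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement, estimated from how often each rater used each label.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical fluency ratings (1 = incomprehensible ... 4 = fluent)
rater_1 = [4, 3, 3, 2, 4, 1, 2, 3]
rater_2 = [4, 3, 2, 2, 4, 1, 3, 3]
kappa = cohens_kappa(rater_1, rater_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

A kappa of 1 means perfect agreement, while a value near 0 means the evaluators agree no more often than chance would predict. This version treats every disagreement as equally serious, so it ignores the fact that ratings of 3 and 4 are closer together than ratings of 1 and 4.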

The Automatic Approach

To avoid these common problems with the manual approach to machine translation evaluation, researchers have developed a range of automatic metrics, such as BLEU and METEOR. Each has its own advantages and disadvantages, and researchers in the field are constantly improving existing metrics as well as creating new ones. Because BLEU and METEOR are the most widely used, here are some more details about how they work.

BLEU (Bilingual Evaluation Understudy) is the most common automatic approach to machine translation evaluation at present. It is a precision-based metric: it compares the machine translation to one or more reference translations and counts how many of its word sequences (n-grams) also appear in the references. Scores range from 0 to 1, and the closer the score is to 1, the more similar the machine translation is to the human references. But the ambiguity of natural languages presents a major challenge for BLEU, as most sentences have many valid translations rather than a single ‘gold standard’, so a perfectly acceptable translation can still score poorly if it is worded differently from the references.
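To make the idea concrete, here is a simplified, sentence-level sketch of a BLEU-style score: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty that discourages overly short output. It assumes whitespace tokenization and applies no smoothing, and the function names are illustrative, so treat it as a teaching aid rather than a replacement for an established implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams in a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Share of candidate n-grams found in the references, with clipping."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the highest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Clipping stops a candidate from being rewarded for repeating a word.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def simple_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU-style score between 0 and 1."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, one missing n-gram order zeroes the score
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty, using the reference whose length is closest to the candidate.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * geo_mean


# Hypothetical example sentences
reference = "the cat is on the mat".split()
exact = "the cat is on the mat".split()
repeated = "the the the the the the the".split()

print(simple_bleu(exact, [reference]))
print(modified_precision(repeated, [reference], 1))
```

The second call shows why clipping matters: every word in the repeated candidate appears in the reference, but only two occurrences of "the" are credited. Scores for single sentences are often zero without smoothing, which is one reason BLEU is usually reported over a whole test set.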

METEOR (Metric for Evaluation of Translation with Explicit Ordering) is another popular machine translation evaluation method. It builds on the capabilities of BLEU by evaluating recall as well as precision. The main objective of METEOR is to score machine translation on a sentence level, which better reflects how a human would judge translation quality. What’s more, METEOR aims to tackle the issue of ambiguous reference translations by using flexible word matching, which can account for synonyms and morphological variants. 
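The effect of adding recall can be seen in a small sketch. The function below computes a recall-weighted harmonic mean of unigram precision and recall using exact word matches only. It is not METEOR itself: it leaves out stemming, synonym matching, and the word-order penalty, and the function name and default weight are assumptions chosen for illustration.

```python
from collections import Counter


def unigram_f_mean(candidate, reference, recall_weight=9.0):
    """Recall-weighted harmonic mean of unigram precision and recall."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    # Counter intersection keeps the minimum count of each shared word.
    matches = sum((cand_counts & ref_counts).values())
    if matches == 0:
        return 0.0
    precision = matches / len(candidate)
    recall = matches / len(reference)
    return ((1 + recall_weight) * precision * recall
            / (recall + recall_weight * precision))


# Hypothetical example: the candidate drops half of the reference.
reference = "the cat sat on the mat".split()
truncated = "the cat sat".split()

score = unigram_f_mean(truncated, reference)
print(f"F-mean: {score:.3f}")
```

Every word in the truncated candidate is correct, so its unigram precision is perfect, yet the score falls to roughly half because only half of the reference is covered. A purely precision-based count would not register that omission.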

Combining Humans and Machines 

It’s common practice to add a human layer to automatic machine translation to double-check the results for accuracy before releasing them to end users. In some cases, such as translating a product manual for software, machine translation could handle the bulk of the work, saving time and money. Professional human translation could be added as a final step to improve accuracy. The machine translation performance could then be evaluated in terms of the human translator’s post-editing effort, measured by the number of words or characters changed, or by the amount of time the translator spent.
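One way to put a number on the word-level effort is to count how many word insertions, deletions, and substitutions separate the raw machine translation from the translator's final version, then divide by the length of the final version. The sketch below assumes whitespace tokenization and uses hypothetical function names and example sentences. It is a simplified measure that does not model moving a phrase as a single edit.

```python
def word_edit_distance(source_tokens, target_tokens):
    """Minimum word insertions, deletions, and substitutions (Levenshtein)."""
    # previous[j] is the distance between the source prefix processed so far
    # and the first j target tokens.
    previous = list(range(len(target_tokens) + 1))
    for i, src in enumerate(source_tokens, start=1):
        current = [i]
        for j, tgt in enumerate(target_tokens, start=1):
            substitution_cost = 0 if src == tgt else 1
            current.append(min(
                previous[j] + 1,                      # delete a source word
                current[j - 1] + 1,                   # insert a target word
                previous[j - 1] + substitution_cost,  # keep or substitute
            ))
        previous = current
    return previous[-1]


def post_edit_rate(mt_output, post_edited):
    """Word edits per word of the post-edited text (0 means untouched)."""
    mt_tokens = mt_output.split()
    pe_tokens = post_edited.split()
    if not pe_tokens:
        raise ValueError("Post-edited text must not be empty")
    return word_edit_distance(mt_tokens, pe_tokens) / len(pe_tokens)


# Hypothetical machine output and its human-corrected version
mt_output = "the cat sit on mat"
post_edited = "the cat sits on the mat"

rate = post_edit_rate(mt_output, post_edited)
print(f"Post-edit rate: {rate:.2f}")
```

Here the translator made two edits (one substitution and one insertion) to reach a six-word sentence. Tracking this rate across a batch of documents gives a rough view of how much work the machine translation is actually saving.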

Improving the End User Experience 

In addition to text-based metrics, end user experience is another critical factor in machine translation evaluation. Here, quality should be assessed first on accuracy and second on fluency. This task usually falls to a native speaker of the target language, who will be able to discern subtle nuances that may otherwise be missed. 

For example, a sentence that contains correctly translated words may still lack the phrasing or word order that would sound natural to a native speaker. In this situation, the translation is accurate, but not fluent enough. On the other hand, the target sentence might look correct at first glance, while closer examination reveals that its meaning doesn’t accurately reflect the source.

It’s important for some companies to assess the quality of their translation service from the customer perspective. For some use cases, no margin of error is acceptable, such as when translating legal or medical documents. In other situations, the user may be able to accept a higher margin of error, such as when Facebook and Twitter offer automatically generated on-site translations.


Of course, one way to improve your machine translation evaluation ratings is to use top-quality off-the-shelf datasets to train your models in the first place. DefinedCrowd offers a robust library of pre-collected, high-quality datasets, sourced, annotated and validated by a global crowd of over 300,000 people. What’s more, our crowdsourced Translation Validation Tasks help you easily add human evaluation into your workflow.