Machine Translation 101 – Part 1

How to Create a Perfect Dataset for Machine Translation Models

Data is the lifeblood of any successful machine learning model, and machine translation models are no exception. Without relevant and properly labelled data, even the most sophisticated model will be unable to achieve reliable results. 

Sometimes, getting hold of the right data can be the most challenging part of a project, especially if you’re trying to do something entirely new, such as creating machine translation for rare languages.

In this article we’ll discuss why perfect data matters so much, what can go wrong if you don’t have it, and the best ways to plan out your machine translation project and put together the perfect dataset for it. 

Why Perfect Datasets Matter for Machine Translation 

Good quality data is essential for every machine learning model. Without the right data, the model won’t work as expected. In machine translation, the model learns patterns from the data. In particular, it learns how to map words and phrases in the source language to the target language, and how to arrange them in a sentence so that it conveys the meaning of the source sentence.

What does that look like without a perfect dataset? Well, if you provide a machine translation engine with badly translated examples, those examples will have an impact on the resulting model. They will teach the model that those patterns are good translations, when in reality they are not. 

Even a small portion of inaccurate data can have a big impact on overall model quality. The model becomes less certain about what counts as a good or a bad translation, with potentially disastrous results. That’s why it’s so vital to start with a perfect dataset when training your machine translation models.

Imperfect Data Causes Mistranslations and Confusion 

There are two important angles to consider here. First, the model itself. For example, the model could mistranslate an idiomatic expression, having already learned a mistranslation during training. That would result in obviously machine-generated output, not the human-translator-level quality that’s required.

Second, let’s consider the overall subject matter domain of the translation in question. Imagine a set of words that are used in a specific domain, such as news or parliamentary dialogue. Those domains use different sets of words to convey different types of meaning. To avoid confusion, it’s important not only to have accurate data, but also to have data that’s directly relevant, both to the subject matter and the wider domain that you wish to translate from. 

When trying to translate the FAQ section of a website, for example, it’s important for the machine translation model to understand language about the brand, the products, what the customers’ main problems are, and so on. That’s why it’s necessary to have training data that touches upon all of those areas. In this way, you can ensure that the machine translation engine is as accurate as possible. 
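
One rough way to check whether a candidate dataset touches on those areas is to count how many of your domain terms appear in it at all. The sketch below is only an illustration: the term list, the sample sentences and the function names are hypothetical, and it handles single-word terms only.

```python
import re
from collections import Counter


def simple_tokens(text):
    # Deliberately naive: lowercase, then keep runs of letters and digits.
    return re.findall(r"\w+", text.lower())


def domain_coverage(training_sentences, domain_terms):
    """Return (fraction of domain terms seen in training, sorted missing terms)."""
    if not domain_terms:
        return 0.0, []
    seen = Counter()
    for sentence in training_sentences:
        seen.update(simple_tokens(sentence))
    missing = sorted(t for t in domain_terms if seen[t.lower()] == 0)
    covered = len(domain_terms) - len(missing)
    return covered / len(domain_terms), missing


# Hypothetical sample data for an FAQ translation project.
training_sentences = [
    "The parliament adopted the resolution.",
    "You can request a refund within 30 days.",
]
domain_terms = ["refund", "warranty", "shipping", "resolution"]

coverage, missing = domain_coverage(training_sentences, domain_terms)
print(f"coverage: {coverage:.0%}, missing: {missing}")
```

A low coverage score doesn’t prove the data is unusable, but a long list of missing terms is a hint that the dataset comes from a different domain than the one you want to translate.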

Gathering Suitable Datasets for Machine Translation

There are a number of ways to collect suitable training data. It’s best to start with easily available data resources that also have sufficient volume. Open source machine translation datasets can be an excellent starting point, providing translation pairs: aligned segments of text, usually sentences, with the source in one language and the target in the other.

To translate from Portuguese to English or English to Portuguese, you would need a specific set of translation pairs. To create another translation from French to English, you’d need a different set of pairs to train your machine translation model. 
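
As a minimal sketch of what that means in practice, the snippet below models a translation pair as a small record and selects the pairs for one language direction. The class name, field names and sample sentences are assumptions made for illustration, and reusing reversed pairs for the opposite direction is a common convenience rather than a requirement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslationPair:
    source_lang: str
    target_lang: str
    source_text: str
    target_text: str


def select_direction(pairs, source_lang, target_lang):
    """Keep pairs for one direction, flipping pairs stored the other way round."""
    selected = []
    for p in pairs:
        if (p.source_lang, p.target_lang) == (source_lang, target_lang):
            selected.append(p)
        elif (p.source_lang, p.target_lang) == (target_lang, source_lang):
            # Same language pair, opposite direction: swap source and target.
            selected.append(
                TranslationPair(source_lang, target_lang, p.target_text, p.source_text)
            )
    return selected


# Hypothetical mixed corpus.
corpus = [
    TranslationPair("pt", "en", "Bom dia.", "Good morning."),
    TranslationPair("en", "pt", "Thank you.", "Obrigado."),
    TranslationPair("fr", "en", "Bonjour.", "Hello."),
]

pt_en = select_direction(corpus, "pt", "en")
fr_en = select_direction(corpus, "fr", "en")
print(len(pt_en), len(fr_en))
```

The French pair never shows up in the Portuguese set, which is the point: each language pair needs its own data.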

Open source data, however, is often problematic in terms of quality, and domain fit is an important consideration. It’s common to find open source datasets containing translation pairs for domains such as software manuals, as those tend to be translated into many languages. Parliamentary proceedings, such as those from the European Union, are also commonly found in open source repositories.

But the main problem with those datasets is that they’re domain-specific and usually not suited to a company’s business needs. What’s more, relying solely on open source datasets can be too limiting if your business requires low-resource or unusual language pairs. Let’s say you want to translate Finnish to Chinese. It’s unlikely you’ll find good-quality, domain-specific, high-volume data for that pair in open source repositories.

If open source repositories don’t have what you need, then the next best option is to build your own training dataset. One way to do this is by scraping bilingual text from the web. For the above Finnish to Chinese example, you could find a website that publishes the same content in both languages. You could then scrape this website, create translation pairs and assemble them into your own dataset.

The obvious challenge here is not only finding suitable websites for your purposes, but also handling the inconvenience and messiness of the scraping and assembling process. That’s why scraping is unlikely to be the most efficient way to gather good quality data for your machine translation projects.
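
To give a feel for the assembling step, here is a minimal sketch that pulls paragraph text out of two already-downloaded HTML pages and pairs the paragraphs by position. It assumes both language versions share the same layout, which real sites often don’t; the sample pages and function names are hypothetical, and downloading is left out entirely.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text content of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Collapse runs of whitespace and skip empty paragraphs.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


def pair_by_position(source_html, target_html):
    src = extract_paragraphs(source_html)
    tgt = extract_paragraphs(target_html)
    if len(src) != len(tgt):
        return []  # layouts differ, so positional alignment would be unreliable
    return list(zip(src, tgt))


# Hypothetical Finnish and Chinese versions of the same page.
finnish_page = "<html><body><p>Hyvää huomenta.</p><p>Kiitos paljon.</p></body></html>"
chinese_page = "<html><body><p>早上好。</p><p>非常感谢。</p></body></html>"

pairs = pair_by_position(finnish_page, chinese_page)
for fi, zh in pairs:
    print(fi, "->", zh)
```

Even in this toy version, a single extra paragraph on one side throws away the whole page, which illustrates why scraped data needs so much manual checking.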

Another way to get suitable training data is to purchase a tailored, pre-assembled dataset from a company that specializes in providing high-quality domain-specific data, such as DefinedCrowd, which can supply datasets for any language in the world. This approach saves you a great deal of time and energy in hunting for and processing suitable data from various online sources.

Creating Perfect Datasets for All Languages 

Creating perfect training datasets for languages with non-Latin scripts requires a specific approach, and can benefit from the support of computational linguists. Let’s take a look at Chinese and Arabic. For Chinese, one major challenge is how to tokenize the words. Chinese is written without spaces between words, so the whitespace tokenization that works for most European languages can’t find the word boundaries, and the text has to be segmented in some other way.

Arabic is not only written right-to-left, it also has a much richer morphology. Words are separated by spaces, but conjunctions, prepositions and pronouns are often attached directly to the word they belong with, so a single written word can correspond to several words in English. When creating your training data corpora, you’ll need to address challenges like these to make sure the end results are reliable.
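
The Chinese tokenization challenge is easy to see in a few lines. The sketch below compares plain whitespace splitting with a character-level fallback; the function names and sample sentences are illustrative, and splitting into single characters is only a crude baseline, not a substitute for a dedicated word segmenter or a subword model.

```python
def whitespace_tokenize(text):
    # Works for languages that put spaces between words.
    return text.split()


def char_tokenize(text):
    # Crude fallback: one token per non-space character.
    return [ch for ch in text if not ch.isspace()]


english = "I like machine translation"
chinese = "我喜欢机器翻译"  # roughly the same sentence in Chinese

print(whitespace_tokenize(english))  # one token per word
print(whitespace_tokenize(chinese))  # the whole sentence comes back as one token
print(char_tokenize(chinese))        # one token per character
```

Whitespace splitting returns the entire Chinese sentence as a single token, so a model trained on that output would treat every sentence as one unknown word.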

At DefinedCrowd, we have a team of computational linguists to handle tasks like this. They can figure out how to present translation pairs from these languages in ways that the computer will easily understand. This enables them to create more effective pre-processing tools. 

Those tools include tokenizers, which split sentences into individual words (‘tokens’), sentencizers, which take a text and split it into sentences, and normalizers, which provide consistency across details such as date formats. To achieve good quality translation, it’s critical to understand all of these details for each of the languages that your model will be trained on.
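
The three tools can be sketched for English in a few lines of regular expressions. This is a toy version under stated assumptions: dates are assumed to be written as DD/MM/YYYY, the sentence splitter will break on abbreviations such as “Dr.”, and all function names are made up for the example. Production tools need language-specific rules.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def normalize_dates(text):
    """Rewrite DD/MM/YYYY dates as YYYY-MM-DD; leave invalid dates untouched."""
    def to_iso(match):
        try:
            parsed = datetime.strptime(match.group(0), "%d/%m/%Y")
        except ValueError:
            return match.group(0)
        return parsed.date().isoformat()
    return DATE_RE.sub(to_iso, text)


def sentencize(text):
    # Split after ., ! or ? when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    # Words (allowing internal hyphens and apostrophes) or single punctuation marks.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)


raw = "Your order shipped on 31/12/2021. It should arrive soon!"
normalized = normalize_dates(raw)
sentences = sentencize(normalized)
tokens = [tokenize(s) for s in sentences]

for sentence_tokens in tokens:
    print(sentence_tokens)
```

Normalizing before splitting matters here: the date is rewritten into a single hyphenated token, so the tokenizer doesn’t break it apart at the slashes.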

Defining Key Parameters for a Perfect Dataset 

Before starting the data collection process, many people ask: how much training data do I need? The simple answer is: as much as you can afford. Generally, the more data to train on, the more machine translation models can improve. But the data must be of high quality.

First, it’s essential to define the objectives of your model. What exactly do you want it to achieve? Once that’s decided, you can start by getting a limited amount of training data annotated. If that goes well, expand your capacity and add more data.

Second, you should consider exactly what constitutes ‘high quality’ data in terms of your particular project. Translation accuracy will always be critical, but you should also make sure you’re using training data that’s tailored to the specific nuances of your subject matter domain.


Training data can make or break your machine translation model. To make sure you have perfect data every time, it’s worth investing in reliable, accurate datasets created by fully trained human translators. DefinedCrowd offers DefinedData, a robust library of pre-collected, high-quality datasets, sourced, annotated and validated by a global crowd of over 300,000 people. 

Stay tuned for Machine Translation Part 2: How to do data cleaning for machine translation! Or, if you don’t want to wait, get Part 2 and Part 3 in the full whitepaper here!