Machine Translation 101 – Part 2

How to do Data Cleaning for Machine Translation

There’s no getting around it: cleaning your data is a critical step in any machine learning workflow. The popular computer science saying, “garbage in, garbage out”, holds especially true here. Computers aren’t magic; they can only work with what they receive as input. That’s why your training datasets must be carefully prepared to make them as accurate as possible. 

In this article, we’ll walk you through the basics of data cleaning, including the main issues that cause dirty data, the problems it can create in machine translation workflows, and the most effective ways to clean it.

What Makes Data ‘Dirty’?

What exactly does it mean for data to be ‘dirty’? Well, data gathered from the real world often has certain issues that make it difficult for the computer to train properly. Datasets for machine translation tend to be amalgamated from various sources, which can lead to inconsistencies in structure and quality. Here are a few examples of common kinds of dirty data that can cause problems for your machine translation models.

Duplicate values

There’s a high risk of duplicates when data is amalgamated from multiple sources. Exact duplicates can often be identified with a simple script, but you may also have near-duplicates: different surface forms that express the same meaning.
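As a rough sketch of the kind of simple script mentioned above (the data format is an assumption), exact duplicate sentence pairs can be dropped with a set:

```python
def deduplicate(pairs):
    """Remove exact duplicate (source, target) pairs, keeping the first occurrence."""
    seen = set()
    unique = []
    for src, tgt in pairs:
        key = (src.strip(), tgt.strip())
        if key not in seen:
            seen.add(key)
            unique.append((src, tgt))
    return unique

pairs = [
    ("Hello world", "Olá mundo"),
    ("Hello world", "Olá mundo"),   # exact duplicate, will be dropped
    ("Good morning", "Bom dia"),
]
```

Near-duplicates (the same sentence with different punctuation or casing) need fuzzier matching, but exact matching like this catches the bulk of the problem cheaply.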

Missing or mangled values

This is especially common when your dataset has been scraped from the web. It’s easy for a scraper to produce malformed values that can degrade your machine translation model. 
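A minimal validity check for scraped sentence pairs might look like the following sketch (the specific checks are assumptions; real pipelines add many more):

```python
def is_valid_pair(src, tgt):
    """Flag pairs with a missing side or obvious scraping artifacts."""
    if not src.strip() or not tgt.strip():
        return False          # one side of the pair is empty
    if "\ufffd" in src or "\ufffd" in tgt:
        return False          # U+FFFD replacement char signals a bad encoding
    return True
```

Filtering on checks like these before training is much cheaper than debugging a model that learned from broken input.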

Non-standard values

A common example of non-standard values in machine translation is inconsistent date formats. When using human-generated data, it’s also necessary to check that people spell and capitalize words in the same way. Any inconsistency in the input will end up confusing the model’s output.
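As an illustration of standardizing date formats, a sketch like the one below rewrites one assumed convention (DD/MM/YYYY) into the ISO form; a production cleaner would need to handle many more formats and locale ambiguities:

```python
import re

def normalize_dates(text):
    """Rewrite DD/MM/YYYY dates to the ISO YYYY-MM-DD form."""
    return re.sub(r"\b(\d{2})/(\d{2})/(\d{4})\b", r"\3-\2-\1", text)
```
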

Input errors

These often happen when a human has added the data manually, misspelling or mistyping certain words. Another human could easily spot this, but a computer cannot. For machine translation models, this can have significant negative effects on the finished output.

Unbalanced or biased data

When your model needs to translate for a specific domain, unbalanced training data can become an issue when the dataset contains too much text from other domains. 
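One rough way to spot this imbalance (not from this article; the keyword approach and the example terms are assumptions) is to estimate what share of the corpus actually looks in-domain:

```python
def domain_share(sentences, keywords):
    """Rough estimate of the fraction of sentences matching in-domain keywords."""
    if not sentences:
        return 0.0
    hits = sum(1 for s in sentences if any(k in s.lower() for k in keywords))
    return hits / len(sentences)

medical_terms = {"patient", "dosage", "diagnosis"}
corpus = ["The patient received a low dosage.", "The weather was sunny."]
```

A low share suggests the dataset needs filtering or re-sampling before it can teach the model the target domain.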

Incorrect or imprecise translations 

Datasets found around the web can be littered with poor translations. These may come in the form of incorrect words or sentences loosely or carelessly translated, causing the original meaning to be lost.
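A widely used heuristic for catching loosely translated pairs (an addition here, not a method the article prescribes) is to compare sentence lengths: a good translation rarely has a wildly different word count from its source.

```python
def plausible_length(src, tgt, max_ratio=3.0):
    """Heuristic: reject pairs whose word-count ratio suggests a mistranslation."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return False
    ratio = max(n_src, n_tgt) / min(n_src, n_tgt)
    return ratio <= max_ratio
```

The threshold of 3.0 is an assumption; the right value depends on the language pair, since some languages are naturally more compact than others.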

The Importance of Data Cleaning in Machine Translation

All machine translation systems learn from examining patterns in language. Elements such as emojis and usernames risk confusing the algorithm because they are not translatable. Words in all capital letters can also be problematic, as can certain kinds of punctuation. You should start your cleaning process by removing these elements from your training dataset. 
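A sketch of that first removal pass might look like this (the regex patterns are simplified assumptions; the emoji ranges below cover only the most common blocks):

```python
import re

USERNAME = re.compile(r"@\w+")
# Rough emoji ranges; thorough cleaning would use a fuller Unicode table.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def strip_social_noise(text):
    """Drop @usernames and common emoji before training."""
    text = USERNAME.sub("", text)
    text = EMOJI.sub("", text)
    return " ".join(text.split())   # collapse leftover whitespace
```
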

Another important step in the data cleaning process is data normalization. This involves removing or standardizing certain parts of the data, such as numbers, which are usually the same across all languages and therefore not relevant to the translation process. Normalization reduces noise and inconsistency in the input, making it easier to identify the best data for the model to learn from and to further optimize the input for better end results. 

A Typical Data Cleaning Process for Machine Translation

The exact nature of each data cleaning workflow will depend on the data being processed. Unlike machine learning for text analytics, the machine translation pipeline is not standard. Your data scientists may decide to adapt this pipeline, depending on the translation model architecture and various challenges observed in the data itself.

A typical machine translation data cleaning workflow uses the following steps for pre-processing the text: lowercasing, tokenization, normalization, and removal of unwanted characters (punctuation, URLs, numbers, HTML tags, and emoticons).

Lowercasing

This step involves transforming all words in the dataset into lower-case forms. It is useful in cases where your datasets contain a mixture of different capitalization patterns, which may lead to translation errors, for example, having ‘portugal’ and ‘Portugal’ in the same dataset. However, lowercasing isn’t suitable for all translation projects. Some languages rely on capitalization to convey meaning.

Tokenization

This refers to the process of splitting sentences up into individual words. It’s a central part of the data cleaning workflow, essential to enable the algorithm to make sense of the text.
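A naive word-level tokenizer can be sketched in a few lines (real MT pipelines typically use dedicated tools, and modern systems often tokenize into subwords rather than words, so treat this as illustrative only):

```python
import re

def tokenize(sentence):
    """Naive tokenizer: split into word runs, keeping punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)
```
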

Normalization

This is the process of transforming a text into a standard (canonical) form. For example, the words ‘2mor’, ‘2moro’ and ‘2mrw’ can all be normalized into a single standard word: ‘tomorrow’. This is an essential step in data cleaning, especially when handling user-generated content from social media, blog or forum comments.
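Using the ‘tomorrow’ example above, a simple lookup table gives the flavor of this step (real normalizers are far larger and often rule-based):

```python
# Map of non-standard spellings to their canonical form.
NORMALIZE = {
    "2mor": "tomorrow",
    "2moro": "tomorrow",
    "2mrw": "tomorrow",
}

def normalize_tokens(tokens):
    """Replace known non-standard spellings with their canonical form."""
    return [NORMALIZE.get(t.lower(), t) for t in tokens]
```
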

Removal of unwanted characters 

The data cleaning process should also involve removing other parts of the data that don’t add to the translation meaning, including emojis, URLs, HTML tags, and numbers. Also, most punctuation is unnecessary because it doesn’t provide any additional meaning for a machine translation model. 
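The removal pass for these remaining elements can be sketched with a chain of substitutions (the patterns are simplified assumptions; order matters, since URLs and tags contain punctuation and digits):

```python
import re

def remove_unwanted(text):
    """Strip URLs, HTML tags, numbers, and most punctuation from a sentence."""
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"<[^>]+>", " ", text)        # HTML tags
    text = re.sub(r"\d+", " ", text)            # numbers
    text = re.sub(r"[^\w\s]", " ", text)        # punctuation
    return " ".join(text.split())
```
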

Don’t Sacrifice Accuracy For Volume

If the human translations in the training data have inaccurate meanings, then the overall quality of the machine translation model will suffer. This often happens when using data scraped from bilingual or multilingual websites. Generally, it’s better to have as much training data as possible, but not if it’s inaccurate. 

In that case, it’s better to go with a smaller dataset of high-quality accurate data. Cleaning the data also makes sure that irrelevant words get removed, reducing the size of the dataset that the machine translation model has to deal with. This makes the model perform more efficiently.

Data Cleaning Challenges by Language 

Certain languages present additional challenges for data cleaning, not because of the structure of the target language itself, but because of the difficulty of obtaining sufficient volumes of data. It’s important to have a good lexicon, which contains all essential words in that language. You also need to have suitable normalization tools for that language and to understand how the language works. 

For example, it’s important to understand when to preserve capital letters, such as those in German nouns. Your data science team will also need to understand how to deal with symbols and user-generated content, such as emoticons. Different languages have different challenges, but they often depend on what we’re most used to in terms of language processing. 

Enhancing Your Data Cleaning With DefinedCrowd

It may seem like a good idea simply to scrape bilingual websites to create a set of training data for your machine translation needs. But this is far from an ideal solution. 

Scraping websites produces a large quantity of natural language text, which will probably require extensive cleaning to get it into shape for use as training data. Your data science team will have to spend significant time and effort to make this data suitable for building a reliable machine translation model. 

DefinedCrowd saves you from this process by providing ready made and fully cleaned datasets in a wide range of languages. You can just plug the dataset directly into your machine translation model and have confidence that it will give you reliable results. DefinedCrowd can also provide datasets in more raw natural language formats, for your data science team to process using their own tools. 

Unclean data can cause disastrous mistranslations in machine translation systems. With DefinedCrowd, you can access clean, high-quality, and domain-specific data for training your machine translation models. 

The final part of our machine translation series, How to Perform Machine Translation Evaluation, is coming soon. Don’t miss it – keep checking our blog for more.