What is Data Cleaning?
What is data cleaning? What is dirty or noisy data? Methods for removing noisy data from MT training datasets.

Machine translation systems rely on large amounts of data for training their models. Data quality plays an important role in the training of statistical and, especially, neural network-based models like NMT, which are quick to memorize bad examples. Larger corpora often lead to higher-quality models, so a common practice is to crawl corpora from web resources, digitized books, and other sources that are prone to be noisy and include unclean sentences alongside the high-quality ones. Data cleaning has always been an important step in the MT workflow, and it is arguably more important now than it has ever been.

What is Dirty or Noisy Data?

The term “noise” can refer to a variety of phenomena in natural language. To give you an idea of the challenges posed to MT systems operating on unclean text, here is a list of types of noise and, more generally, input variations that deviate from standard MT training data.

  • Poor translations
  • Incorrect translation of metaphors
  • Imprecise translations which do not contain all of the details in the original sentence
  • Misaligned sentences
  • Repetition, insertion, and duplication errors
  • Missing diacritics – café vs. cafe
  • Mixed encoding, broken HTML
  • Misordered words in the source or target text
  • Issues with case sensitivity, spelling errors, and named entities
  • Unbalanced/biased data - too much text from other domains

How to Tackle Noisy Data?

“More data beats better models. Better data beats more data.” — Riley Newman

The techniques below are mainly intended for parallel corpora but apply to monolingual data as well. They can also be used in other scenarios, such as filtering back-translated data or cleaning data for unsupervised MT training.

Remove Duplicates: Remove duplicate sentence pairs from the parallel corpus, keeping only unique source-target pairs. Segments whose source and target sides are identical should also be removed, as should segments containing a URL, as detected by a regular expression.
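This step can be sketched in a few lines of Python; the URL pattern and the `deduplicate` helper below are illustrative, not part of any specific toolkit:

```python
import re

# Rough pattern for spotting URLs inside a segment.
URL_RE = re.compile(r"https?://\S+|www\.\S+")

def deduplicate(pairs):
    """Keep the first occurrence of each (source, target) pair; drop
    pairs where source == target or where either side contains a URL."""
    seen = set()
    clean = []
    for src, tgt in pairs:
        key = (src.strip(), tgt.strip())
        if key in seen:
            continue                      # exact duplicate pair
        if key[0] == key[1]:
            continue                      # identical source and target
        if URL_RE.search(src) or URL_RE.search(tgt):
            continue                      # segment contains a URL
        seen.add(key)
        clean.append((src, tgt))
    return clean
```

In practice this runs as a streaming pass over the corpus; the set of seen pairs (or hashes of them) is the only state that must be kept in memory.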

Remove Non-alphabetical Segments: Remove sentence pairs in which more than 50% of the characters on either the source or the target side are non-alphabetical characters or symbols.
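A minimal sketch of this filter, assuming the 50% threshold is measured over non-whitespace characters (the helper names are illustrative):

```python
def mostly_non_alphabetic(sentence, threshold=0.5):
    """True if more than `threshold` of the non-space characters
    are not letters (digits, symbols, markup debris, ...)."""
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return True  # empty segments are noise too
    non_alpha = sum(1 for c in chars if not c.isalpha())
    return non_alpha / len(chars) > threshold

def filter_non_alphabetic(pairs):
    """Keep only pairs where both sides are mostly alphabetic."""
    return [(s, t) for s, t in pairs
            if not mostly_non_alphabetic(s)
            and not mostly_non_alphabetic(t)]
```

Note that `str.isalpha()` is Unicode-aware, so accented letters and non-Latin scripts count as alphabetic rather than being misclassified as symbols.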

Check for Correct Language: Remove segments written in a language other than the specified one. Any language identification or labeling tool can be used for this step.
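A toy sketch of this filter; the stopword-based `guess_language` below is only a stand-in for real language identification tools such as fastText's language ID model or langid.py:

```python
# Tiny stopword lists for three languages -- purely illustrative.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "it", "that"},
    "de": {"der", "die", "und", "ist", "das", "nicht", "ein", "zu"},
    "fr": {"le", "la", "et", "est", "les", "des", "un", "une"},
}

def guess_language(sentence):
    """Guess the language by stopword overlap; None if no evidence."""
    words = set(sentence.lower().split())
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def keep_language_pairs(pairs, src_lang, tgt_lang):
    """Drop pairs whose detected languages differ from the declared
    source and target languages of the corpus."""
    return [(s, t) for s, t in pairs
            if guess_language(s) == src_lang
            and guess_language(t) == tgt_lang]
```

A production filter would use a trained classifier and a confidence threshold instead of raw stopword counts, but the shape of the pipeline stage is the same: detect, compare against the declared language pair, drop on mismatch.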

Use Various Tools and Scripts: You can use the Moses scripts for tokenizing, truecasing, and cleaning. Zipporah is a trainable tool for selecting a high-quality subset from a large amount of noisy data. It can improve MT quality, but it requires a known high-quality dataset for training.

In Conclusion

Remember that noisy and unclean data can cause disastrous mistranslations in modern machine translation systems. TAUS Matching Data provides you with high-quality, clean, in-domain data to train your MT engines.


Shikha is a Data Engineer at TAUS working on creating and maintaining data pipelines. Her mission is to find trends in datasets and develop algorithms to help make raw data more useful to enterprise users. She focuses on implementing methods to improve data reliability and quality to enrich the TAUS data services.
