What is Data Cleaning?
15 Apr 2020
What is data cleaning? What is dirty or noisy data? Methods for removing noisy data from MT training datasets.

Machine translation (MT) systems rely on large amounts of data to train their models. Data quality plays an important role in training statistical and, especially, neural models: neural machine translation (NMT) models are quick to memorize bad examples. Larger corpora often lead to higher-quality models, so a common practice is to crawl such corpora from web resources, digitized books, and other sources. These sources are prone to noise and contain unclean sentences alongside high-quality ones. Data cleaning has always been an important step in the MT workflow, and it is arguably more important now than it has ever been.

What is Dirty or Noisy Data?

The term “noise” can refer to a variety of phenomena in natural language. To give you an idea of the challenges posed to MT systems operating on unclean text, here is a list of types of noise and, more generally, input variations that deviate from standard MT training data.

  • Poor translations
  • Incorrect translation of metaphors
  • Imprecise translations which do not contain all of the details in the original sentence
  • Misaligned sentences
  • Repetition, insertion, and duplication errors
  • Missing diacritics – café vs. cafe
  • Mixed encoding, broken HTML
  • Misordered words in the source or target text
  • Issues with case sensitivity, spelling errors, and named entities
  • Unbalanced/Biased data - Too much text from other domains

How to Tackle Noisy Data?

“More data beats better models. Better data beats more data.” — Riley Newman

The techniques below are mainly intended for parallel corpora but also apply to monolingual data. They can be used in other scenarios as well, such as back translation or preparing clean data for unsupervised MT training.

Remove Duplicates: Remove duplicate segments from the parallel corpus so that each source-target pair appears only once. Segments whose source and target sides are identical should also be removed, as should segments containing a URL, as detected by a regular expression.
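
As a rough illustration, the duplicate and URL checks could be sketched in Python as follows. This is a minimal sketch, not a reference implementation: the function name `remove_duplicates`, the URL pattern, and the choice to compare segments after stripping surrounding whitespace are all assumptions made for the example.

```python
import re

# Assumed URL pattern: only catches http(s):// and www. prefixes.
URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def remove_duplicates(pairs):
    """Keep the first occurrence of each (source, target) pair.

    Also drops pairs whose source and target are identical and pairs
    in which either side contains a URL.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if src == tgt:
            continue  # source and target are identical
        if URL_RE.search(src) or URL_RE.search(tgt):
            continue  # a URL was found on one of the sides
        key = (src, tgt)
        if key in seen:
            continue  # exact duplicate of an earlier pair
        seen.add(key)
        kept.append(key)
    return kept
```

Real pipelines often normalize case, punctuation, or whitespace before comparing, which catches near-duplicates that this exact-match version would keep.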

Remove Non-alphabetical Segments: Remove segments in which more than 50% of the characters on either the source or the target side are non-alphabetical characters or symbols.
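
One possible way to compute this ratio is sketched below. The function names are hypothetical, and two choices are assumptions rather than part of the rule itself: whitespace is ignored when counting characters, and a pair is dropped when either side exceeds the threshold.

```python
def non_alpha_ratio(text):
    """Fraction of non-whitespace characters that are not letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0  # assumption: treat empty segments as fully non-alphabetical
    return sum(not c.isalpha() for c in chars) / len(chars)


def is_mostly_alphabetical(src, tgt, max_ratio=0.5):
    """True if neither side has more than max_ratio non-letter characters."""
    return non_alpha_ratio(src) <= max_ratio and non_alpha_ratio(tgt) <= max_ratio
```

Python's `str.isalpha` is Unicode-aware, so accented letters such as the "é" in "café" count as alphabetical. Scripts that rely heavily on combining marks may need a more careful definition of "letter".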

Check for Correct Language: Remove segments written in a language other than the specified one. Any language identification or labeling software can be used for this.
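
Because any language identifier can be used, a filter can simply accept the detector as a parameter. The sketch below assumes a hypothetical `detect` callable that maps a string to a language code. The `toy_detect` function is only a placeholder so the example runs on its own; it is not a real language identifier.

```python
def filter_by_language(pairs, detect, src_lang, tgt_lang):
    """Keep pairs whose detected languages match the expected ones.

    `detect` is any callable mapping a string to a language code; it
    stands in for whatever language identification tool is actually used.
    """
    return [
        (src, tgt)
        for src, tgt in pairs
        if detect(src) == src_lang and detect(tgt) == tgt_lang
    ]


def toy_detect(text):
    """Placeholder detector for demonstration only.

    Guesses from a tiny hard-coded stopword list (an assumption made
    for this example).
    """
    words = set(text.lower().split())
    if words & {"the", "is", "and"}:
        return "en"
    if words & {"le", "la", "est", "et"}:
        return "fr"
    return "unknown"
```

In practice `toy_detect` would be replaced by a call into a proper language identification library, leaving `filter_by_language` unchanged.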

Use Various Tools and Scripts: You can use the Moses scripts for tokenizing, truecasing, and cleaning. Zipporah is a trainable tool for selecting a high-quality subset from a large amount of noisy data. It can improve MT quality, but it requires a known high-quality dataset for training.
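
To give a feel for what a cleaning step of this kind does, here is a small Python sketch of length-based filtering, a common ingredient of corpus cleaning scripts. It is not the Moses implementation: the function name, the whitespace tokenization, and the default thresholds are all assumptions chosen for illustration.

```python
def length_filter(pairs, min_len=1, max_len=80, max_ratio=9.0):
    """Keep pairs with reasonable lengths on both sides.

    Lengths are whitespace-token counts. A pair is kept when both
    lengths fall within [min_len, max_len] and the longer side is at
    most max_ratio times the shorter side.
    """
    kept = []
    for src, tgt in pairs:
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src == 0 or n_tgt == 0:
            continue  # empty side, nothing to align
        if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
            continue  # too short or too long
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # lengths too different, likely misaligned
        kept.append((src, tgt))
    return kept
```

A large length mismatch between source and target is a cheap signal for the misaligned sentences mentioned earlier, which is why a ratio check is usually paired with the absolute length limits.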

In Conclusion

Remember that noisy and unclean data can cause disastrous mistranslations in modern machine translation systems. TAUS Matching Data provides you with high-quality, clean, in-domain data to train your MT engines.


Shikha is a Data Engineer at TAUS working on creating and maintaining data pipelines. Her mission is to find trends in datasets and develop algorithms to help make raw data more useful to enterprise users. She focuses on implementing methods to improve data reliability and quality to enrich the TAUS data services.
