What is Data Cleaning?
What is data cleaning? What is dirty or noisy data? Methods for removing noisy data from MT training datasets.

Machine translation systems rely on large amounts of data for training their models. Data quality plays an important role in the training of statistical and, especially, neural network-based models like NMT, which are quick to memorize bad examples. Larger corpora often lead to higher-quality models, so a common practice is to crawl corpora from web resources, digitized books, and other sources that are prone to be noisy and include unclean sentences alongside the high-quality ones. Data cleaning has always been an important step in the MT workflow, and it is arguably more important now than it has ever been.

What is Dirty or Noisy Data?

The term “noise” can refer to a variety of phenomena in natural language. To give you an idea of the challenges posed to MT systems operating on unclean text, here is a list of types of noise and, more generally, input variations that deviate from standard MT training data.

  • Poor translations
  • Incorrect translation of metaphors
  • Imprecise translations which do not contain all of the details in the original sentence
  • Misaligned sentences
  • Repetition, insertion, and duplication errors
  • Missing diacritics – café vs. cafe
  • Mixed encoding, broken HTML
  • Misordered words in the source or target text
  • Issues with case sensitivity, spelling errors, and named entities
  • Unbalanced/biased data - too much text from other domains

How to Tackle Noisy Data?

“More data beats better models. Better data beats more data.” — Riley Newman

The techniques below are mainly intended for parallel corpora but apply to monolingual data as well. They can also be used in other scenarios, such as filtering back-translated data or cleaning data for unsupervised MT training.

Remove Duplicates: Remove duplicate sentence pairs from the parallel corpus, keeping only unique source-target pairs. Segments whose source and target sides are identical should also be removed, as should segments containing a URL, as detected by a regular expression.
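This step can be sketched in a few lines of Python; the URL pattern and the `deduplicate` helper below are illustrative, not part of any specific toolkit:

```python
import re

# Rough pattern for spotting URLs inside a segment.
URL_RE = re.compile(r"https?://\S+|www\.\S+")

def deduplicate(pairs):
    """Keep the first occurrence of each (source, target) pair; drop
    pairs where source == target or where either side contains a URL."""
    seen = set()
    clean = []
    for src, tgt in pairs:
        key = (src.strip(), tgt.strip())
        if key in seen:
            continue                      # exact duplicate pair
        if key[0] == key[1]:
            continue                      # identical source and target
        if URL_RE.search(src) or URL_RE.search(tgt):
            continue                      # segment contains a URL
        seen.add(key)
        clean.append((src, tgt))
    return clean
```

In practice this runs as a streaming pass over the corpus; the set of seen pairs (or hashes of them) is the only state that must be kept in memory.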

Remove Non-alphabetical Segments: Remove sentence pairs in which more than 50% of the characters on either the source or the target side are non-alphabetical characters or symbols.
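A minimal sketch of this filter, assuming the 50% threshold is measured over non-whitespace characters (the helper names are illustrative):

```python
def mostly_non_alphabetic(sentence, threshold=0.5):
    """True if more than `threshold` of the non-space characters
    are not letters (digits, symbols, markup debris, ...)."""
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return True  # empty segments are noise too
    non_alpha = sum(1 for c in chars if not c.isalpha())
    return non_alpha / len(chars) > threshold

def filter_non_alphabetic(pairs):
    """Keep only pairs where both sides are mostly alphabetic."""
    return [(s, t) for s, t in pairs
            if not mostly_non_alphabetic(s)
            and not mostly_non_alphabetic(t)]
```

Note that `str.isalpha()` is Unicode-aware, so accented letters and non-Latin scripts count as alphabetic rather than being misclassified as symbols.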

Check for Correct Language: Remove segments written in a language other than the specified one. Any language identification or labeling tool can be used for this step.
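A toy sketch of this filter; the stopword-based `guess_language` below is only a stand-in for real language identification tools such as fastText's language ID model or langid.py:

```python
# Tiny stopword lists for three languages -- purely illustrative.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "it", "that"},
    "de": {"der", "die", "und", "ist", "das", "nicht", "ein", "zu"},
    "fr": {"le", "la", "et", "est", "les", "des", "un", "une"},
}

def guess_language(sentence):
    """Guess the language by stopword overlap; None if no evidence."""
    words = set(sentence.lower().split())
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def keep_language_pairs(pairs, src_lang, tgt_lang):
    """Drop pairs whose detected languages differ from the declared
    source and target languages of the corpus."""
    return [(s, t) for s, t in pairs
            if guess_language(s) == src_lang
            and guess_language(t) == tgt_lang]
```

A production filter would use a trained classifier and a confidence threshold instead of raw stopword counts, but the shape of the pipeline stage is the same: detect, compare against the declared language pair, drop on mismatch.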

Use Various Tools and Scripts: You can use the Moses scripts for tokenizing, truecasing, and cleaning. Zipporah is a trainable tool for selecting a high-quality subset from a large amount of noisy data. It can improve MT quality, but it requires a known high-quality dataset for training.

In Conclusion

Remember that noisy and unclean data can cause disastrous mistranslations in modern machine translation systems. TAUS Matching Data provides you with high-quality, clean, in-domain data to train your MT engines.


Shikha is a Data Engineer at TAUS working on creating and maintaining data pipelines. Her mission is to find trends in datasets and develop algorithms to help make raw data more useful to enterprise users. She focuses on implementing methods to improve data reliability and quality to enrich the TAUS data services.
