Why Do Data Cleaning and Anonymization Matter?
4 Oct 2021
2 minute read
Data cleaning and data anonymization are critical steps in training ML models. Here are the reasons why.

Data cleaning is an essential step in machine learning and takes place before model training. It matters because a machine learning model will only produce results as good as the data you feed it. If your dataset contains too much noise, your model will capture that noise. Messy data can also break your model outright or cause accuracy to drop. Common data cleaning techniques include removing syntax errors, normalizing data, removing duplicates, detecting and removing outliers, and fixing encoding issues.
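As a minimal sketch of a few of the cleaning steps mentioned above, the snippet below uses pandas to remove duplicates, normalize text and numeric columns, and drop outliers. The column names ("description", "price") and the 3-standard-deviation outlier rule are assumptions for illustration, not a prescribed pipeline.

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Drop exact duplicate rows.
    df = df.drop_duplicates()

    # Tidy a hypothetical text column: trim and collapse whitespace.
    df["description"] = (
        df["description"].str.strip().str.replace(r"\s+", " ", regex=True)
    )

    # Outlier removal: keep rows within 3 standard deviations of the mean price.
    mean, std = df["price"].mean(), df["price"].std()
    df = df[(df["price"] - mean).abs() <= 3 * std]

    # Min-max normalize the numeric column to the [0, 1] range.
    df["price"] = (df["price"] - df["price"].min()) / (
        df["price"].max() - df["price"].min()
    )

    return df.reset_index(drop=True)
```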

Data anonymization is another imperative step in machine learning: the process of removing sensitive or personally identifiable information from datasets. For many organizations, data privacy laws make this a vital step. Common anonymization techniques include perturbation, generalization, shuffling, scrambling, and synthetic data generation. Synthetic data can be a good alternative when dealing with sensitive data: it can be generated in-house and can mirror the characteristics of naturally occurring data without including personally identifiable information.
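The following sketch illustrates two of the techniques named above, generalization (bucketing exact ages into ranges) and perturbation (adding random noise to a numeric field), plus hashing of a direct identifier. The column names ("user_id", "age", "salary") and noise level are hypothetical assumptions for illustration only.

```python
import hashlib
import numpy as np
import pandas as pd

def anonymize(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Pseudonymize a direct identifier with a one-way hash.
    df["user_id"] = df["user_id"].astype(str).map(
        lambda v: hashlib.sha256(v.encode()).hexdigest()[:12]
    )

    # Generalization: replace exact ages with coarse ranges.
    df["age"] = pd.cut(
        df["age"],
        bins=[0, 18, 35, 50, 65, 120],
        labels=["<18", "18-34", "35-49", "50-64", "65+"],
    )

    # Perturbation: add small Gaussian noise so exact salaries
    # cannot be traced back to individuals.
    noise = np.random.normal(0, df["salary"].std() * 0.05, len(df))
    df["salary"] = df["salary"] + noise

    return df
```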

 

Author
Husna Sayedi

Husna is a data scientist who studied Mathematical Sciences at the University of California, Santa Barbara. She also holds a master's degree in Engineering, Data Science from the University of California, Riverside. She has experience in machine learning, data analytics, statistics, and big data. She enjoys technical writing when she is not working and is currently responsible for the data science-related content at TAUS.
