Why Do Data Cleaning and Anonymization Matter?
4 Oct 2021
2 minute read
Data cleaning and data anonymization are both critical when training ML models. Here is why.

Data cleaning is an essential step in machine learning that takes place before model training. It matters because your machine learning model will produce results only as good as the data you feed it. If your dataset contains too much noise, your model will learn that noise. Messy data can also break your model or reduce its accuracy. Common data cleaning techniques include removing syntax errors, normalizing data, removing duplicates, detecting and removing outliers, and fixing encoding issues.
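A few of these techniques can be sketched in a handful of lines. The example below is a minimal illustration using only the Python standard library, not a production pipeline. The function names (`normalize_text`, `drop_duplicates`, `remove_outliers`), the sample data, and the 1.5 x IQR outlier rule are assumptions chosen for this sketch.

```python
import statistics
import unicodedata


def normalize_text(text):
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def drop_duplicates(rows):
    """Remove exact duplicates while keeping first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def remove_outliers(values, k=1.5):
    """Keep values within k * IQR of the first and third quartiles.

    The IQR rule is one common heuristic; the right rule depends on the data.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]


# Hypothetical sample data: extra spaces, a full-width letter, a non-breaking space.
raw_sentences = ["Hello   world ", "Hello world", "\uff28ello world", "Good\u00a0morning"]
cleaned = drop_duplicates([normalize_text(s) for s in raw_sentences])
print(cleaned)  # ['Hello world', 'Good morning']

sentence_lengths = [10, 12, 11, 13, 12, 11, 95]
print(remove_outliers(sentence_lengths))  # [10, 12, 11, 13, 12, 11]
```

Normalizing before deduplicating matters here: the first three sentences only become identical once whitespace and encoding variants are resolved.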

Data anonymization is another essential step in machine learning. It is the process of removing sensitive or personally identifiable information from datasets. For many organizations, data privacy laws make this a vital step. Common data anonymization techniques include perturbation, generalization, shuffling, scrambling, and synthetic data generation. Synthetic data can be a good alternative when dealing with sensitive data: it can be generated in-house and can mirror the characteristics of naturally occurring data without including personally identifiable information.
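To make three of these techniques concrete, here is a small sketch of generalization, perturbation, and shuffling on tabular records. It uses only the Python standard library. The record fields, the helper names, the 10-year age bands, and the noise range of 500 are all hypothetical choices for illustration, and a toy example like this does not by itself guarantee compliance with any privacy law.

```python
import random


def generalize_age(age, width=10):
    """Generalization: replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def perturb(value, scale, rng):
    """Perturbation: add uniform noise in [-scale, scale] to a numeric value."""
    return value + rng.uniform(-scale, scale)


def shuffle_column(records, field, rng):
    """Shuffling: permute one field across records so values no longer match their rows."""
    values = [r[field] for r in records]
    rng.shuffle(values)
    return [{**r, field: v} for r, v in zip(records, values)]


def anonymize(records, seed=0):
    """Drop the direct identifier ('name'), then generalize, perturb, and shuffle."""
    rng = random.Random(seed)
    out = []
    for r in records:
        out.append({
            "age_range": generalize_age(r["age"]),
            "salary": round(perturb(r["salary"], 500, rng)),
            "city": r["city"],
        })
    return shuffle_column(out, "city", rng)


# Hypothetical records, invented for this example.
records = [
    {"name": "Alice", "age": 34, "salary": 52000, "city": "Lyon"},
    {"name": "Bob", "age": 47, "salary": 61000, "city": "Porto"},
    {"name": "Chen", "age": 29, "salary": 48000, "city": "Ghent"},
]
for row in anonymize(records, seed=42):
    print(row)
```

Each step trades some data utility for privacy: wider age bands and larger noise hide more about individuals but also blur the signal a model can learn from.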



Husna is a data scientist who studied Mathematical Sciences at the University of California, Santa Barbara. She also holds a master’s degree in Engineering, Data Science from the University of California, Riverside. She has experience in machine learning, data analytics, statistics, and big data. She enjoys technical writing when she is not working and is currently responsible for the data science-related content at TAUS.
