Why Do Data Cleaning and Anonymization Matter?
4 Oct 2021
2 minute read
Data cleaning and data anonymization are critical steps in training ML models. Here's why.

Data cleaning is an essential step in machine learning and takes place before the model training step. It matters because your machine learning model will produce results only as good as the data you feed it. If your dataset contains too much noise, your model will learn that noise. Furthermore, messy data can break your model and cause accuracy rates to drop. Examples of data cleaning techniques include syntax error removal, data normalization, duplicate removal, outlier detection and removal, and fixing encoding issues.
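To make these techniques concrete, here is a minimal sketch in Python using pandas. The dataset and thresholds are invented for illustration; a real pipeline would tune them to the data at hand.

```python
import pandas as pd

# Hypothetical raw dataset with the kinds of issues mentioned above:
# inconsistent casing/whitespace, duplicate rows, and an outlier.
df = pd.DataFrame({
    "name": ["Alice", "alice ", "Bob", "Carol", "Carol"],
    "age": [34, 34, 29, 41, 41],
    "salary": [72000, 72000, 68000, 1_000_000, 75000],
})

# Normalize text fields: strip whitespace and unify casing.
df["name"] = df["name"].str.strip().str.title()

# Remove exact duplicate rows (possible only after normalization).
df = df.drop_duplicates()

# Remove outliers with a simple IQR rule (1.5 * IQR is a common default).
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(df)
```

Note that the order of operations matters: normalizing text before deduplicating lets `drop_duplicates` catch rows that differ only in casing or whitespace.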

Data anonymization is another essential step in machine learning: the process of removing sensitive or personally identifiable information from datasets. For many organizations, data privacy laws make this a vital step. Common data anonymization techniques include perturbation, generalization, shuffling, scrambling, and synthetic data generation. Synthetic data in particular can be a good alternative when dealing with sensitive data: it can be generated in-house and mimic the characteristics of naturally occurring data without including any personally identifiable information.
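The sketch below illustrates a few of these techniques (pseudonymization via hashing, generalization, perturbation, and shuffling) on an invented dataset. The column names, salt, and noise scale are assumptions for demonstration, not a prescribed recipe.

```python
import hashlib
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Hypothetical dataset containing direct and quasi-identifiers.
df = pd.DataFrame({
    "email": ["alice@example.com", "bob@example.com", "carol@example.com"],
    "age": [34, 29, 41],
    "zip_code": ["90210", "10001", "60614"],
    "salary": [72000, 68000, 75000],
})

# Pseudonymize direct identifiers with a salted hash
# (in practice the salt must be kept secret).
SALT = "replace-with-a-secret-salt"
df["email"] = df["email"].map(
    lambda v: hashlib.sha256((SALT + v).encode()).hexdigest()[:12]
)

# Generalization: bucket ages into bands, truncate zip codes.
df["age"] = pd.cut(df["age"], bins=[0, 30, 40, 50],
                   labels=["<=30", "31-40", "41-50"])
df["zip_code"] = df["zip_code"].str[:3] + "**"

# Perturbation: add small random noise to numeric values.
df["salary"] = df["salary"] + rng.normal(0, 1000, len(df)).round()

# Shuffling: permute a column to break row-level linkage.
df["salary"] = rng.permutation(df["salary"].values)

print(df)
```

Each step trades some analytical fidelity for privacy; which combination is appropriate depends on the dataset and the applicable privacy regulations.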


Author
Husna Sayedi

Husna is a data scientist who studied Mathematical Sciences at the University of California, Santa Barbara. She also holds a master's degree in Engineering, Data Science, from the University of California, Riverside. She has experience in machine learning, data analytics, statistics, and big data. She enjoys technical writing when she is not working and is currently responsible for the data science-related content at TAUS.
