Types of Training Data
4 Oct 2021
3 minute read
A brief introduction to types of training data including structured, unstructured, and semi-structured data.

Training data is used in three primary types of machine learning: supervised, unsupervised, and semi-supervised learning. In supervised learning, the training data must be labeled. This allows the model to learn a mapping from the features of each example to its associated label. In unsupervised learning, labels are not required in the training set. Unsupervised machine learning models look for underlying structure in the features of the training set to form groupings or make generalized predictions. A semi-supervised training dataset contains a mix of labeled and unlabeled examples and is used in semi-supervised learning problems.
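As a rough illustration of the distinction, the sketch below classifies a dataset by how many of its examples carry a label. The `Example` type and the `dataset_kind` helper are hypothetical names introduced here for illustration only, and an unlabeled example is assumed to be represented by a label of `None`.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Example(NamedTuple):
    """One training example (hypothetical type, for illustration)."""
    features: Tuple[float, ...]
    label: Optional[str]  # assumption: None marks an unlabeled example


def dataset_kind(examples: Sequence[Example]) -> str:
    """Name the learning setting a dataset suits, based on its labels."""
    if not examples:
        raise ValueError("dataset is empty")
    labeled = sum(1 for ex in examples if ex.label is not None)
    if labeled == len(examples):
        return "supervised"       # every example has a label
    if labeled == 0:
        return "unsupervised"     # no labels at all
    return "semi-supervised"      # a mix of labeled and unlabeled


fully_labeled = [Example((1.0, 2.0), "cat"), Example((3.0, 0.5), "dog")]
no_labels = [Example((1.0, 2.0), None), Example((3.0, 0.5), None)]
mixed = [Example((1.0, 2.0), "cat"), Example((3.0, 0.5), None)]

print(dataset_kind(fully_labeled))
print(dataset_kind(no_labels))
print(dataset_kind(mixed))
```

In practice the choice of learning method depends on more than label coverage, so this check is only meant to make the three definitions concrete.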

Reinforcement learning models learn by trial and error: each action is associated with a reward or a penalty, and the model adjusts its behavior to maximize reward. This family of models can either learn purely from experience, with no training data, or start from training data and then continue to learn from experience.
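To make "learning from experience" concrete, here is a minimal sketch of an epsilon-greedy agent choosing between two actions with no training data at all. It is only one simple reinforcement learning setup, not the only one. The function name `run_bandit`, the fixed reward values, and the parameter defaults are all assumptions chosen for illustration.

```python
import random


def run_bandit(rewards, steps=500, epsilon=0.1, seed=0):
    """Estimate the value of each action purely from observed rewards.

    `rewards` holds a fixed (hypothetical) reward or penalty per action.
    """
    rng = random.Random(seed)
    estimates = [0.0] * len(rewards)  # current value estimate per action
    counts = [0] * len(rewards)       # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: occasionally try a random action.
            action = rng.randrange(len(rewards))
        else:
            # Exploit: pick the action with the best estimate so far.
            action = max(range(len(rewards)), key=lambda a: estimates[a])
        reward = rewards[action]
        counts[action] += 1
        # Incremental average of the rewards seen for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates


# Action 0 is penalized (-1.0) and action 1 is rewarded (+1.0).
estimates = run_bandit([-1.0, 1.0])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print(estimates, best_action)
```

The agent starts with no knowledge of either action and ends up preferring the rewarded one, which is the trial-and-error behavior described above.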

Types of training data illustration

Across these areas of machine learning, many different types of data can be used for training, including structured, unstructured, and semi-structured data. As the names suggest, structured data has clearly defined patterns and data types, whereas unstructured data does not. Semi-structured data falls in between: it carries some organizing markers, such as the tags or keys in XML and JSON, without conforming to a fixed tabular schema. Structured data is highly organized and easily searchable, and usually resides in relational databases.

Examples of structured data include sales transactions, inventory records, addresses, dates, and stock information. Unstructured data, which often lives in non-relational databases, is harder to search and analyze, and is most often categorized as qualitative data. Examples of unstructured data include audio recordings, video, tweets, social media posts, satellite imagery, and text files. Depending on the machine learning application, both structured and unstructured data can be used as training data.
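The difference between these data types can be sketched with a few lines of Python. The table name, column names, and sample values below are invented for illustration. The structured case uses an in-memory SQLite database as a stand-in for a relational database, the semi-structured case uses a JSON record, and the unstructured case is a plain text string.

```python
import json
import sqlite3

# Structured: fixed columns and types, directly searchable with SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, item TEXT, amount REAL, sold_on TEXT)"
)
conn.executemany(
    "INSERT INTO sales (item, amount, sold_on) VALUES (?, ?, ?)",
    [("widget", 9.99, "2021-10-04"), ("gadget", 24.50, "2021-10-05")],
)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
conn.close()

# Semi-structured: keys give some organization, but there is no fixed schema.
record = json.loads('{"user": "alice", "tags": ["ml", "data"], "profile": {"lang": "en"}}')
language = record["profile"]["lang"]

# Unstructured: free text with no predefined fields; any structure
# (here, a simple word count) has to be derived by processing it.
post = "Just trained my first model, and it finally converged!"
word_count = len(post.split())

print(total, language, word_count)
```

The structured table can be queried as-is, while the text post needs some processing step before it yields anything a model can consume.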



Husna is a data scientist who studied Mathematical Sciences at the University of California, Santa Barbara. She also holds a master’s degree in Engineering, Data Science from the University of California, Riverside. She has experience in machine learning, data analytics, statistics, and big data. She enjoys technical writing when she is not working and is currently responsible for the data science-related content at TAUS.
