How Much Training Data Do I Need?
4 Oct 2021
Here are some pointers on how much training data you need to train your ML models.

The amount of training data you need depends on many variables: the model you use and its complexity, the task you perform, the performance you wish to achieve, the number of features available, the noise in the data, and more.

While there is no set answer to how much training data a given machine learning application needs, some key guidelines help. The first rule of thumb is that, generally speaking, the more training data a model has, the better the outcome. The larger the training set, the less likely a model is to overfit, that is, to fit noise rather than the data’s true signal. Note that more data mainly reduces variance. High bias (when a model makes oversimplified assumptions and underfits) is usually addressed with a more expressive model or better features rather than with more data.
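One way to see this rule of thumb in practice is a learning curve: train the same model on growing amounts of data and compare training error with error on held-out data. The sketch below is only an illustration under stated assumptions. The synthetic data (y = 2x + 1 plus Gaussian noise), the one-feature linear model, and the function names are all hypothetical choices made for this example and do not come from a specific library.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (closed form, one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def mse(a, b, xs, ys):
    """Mean squared error of the line y = a*x + b on the given points."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def sample(n, rng, noise_sd=1.0):
    """Assumed toy data: y = 2x + 1 plus Gaussian noise."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def learning_curve(sizes, trials=200, test_size=500, seed=0):
    """Average train and test MSE over repeated trials for each training size."""
    rng = random.Random(seed)
    curve = {}
    for n in sizes:
        train_total = 0.0
        test_total = 0.0
        for _ in range(trials):
            x_train, y_train = sample(n, rng)
            x_test, y_test = sample(test_size, rng)
            a, b = fit_line(x_train, y_train)
            train_total += mse(a, b, x_train, y_train)
            test_total += mse(a, b, x_test, y_test)
        curve[n] = (train_total / trials, test_total / trials)
    return curve


curve = learning_curve([5, 50, 500])
for n, (train_err, test_err) in curve.items():
    print(f"n={n:4d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

With very few points, the training error tends to look better than the error on unseen data, which is the signature of overfitting. As the training set grows, the two numbers move toward each other. Once the curve flattens for your own model and data, extra examples are likely to bring diminishing returns.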

Next, domain expertise can help you narrow down a suitably sized training set. Training data should ideally be independent and identically distributed, and representative of the data the model will see once deployed. Class balance is a related but separate concern: if some classes are rare, make sure they are still covered by enough examples to avoid the problem of an imbalanced dataset. Accordingly, the training set should contain enough data to capture the relationships that exist in the problem, so that the model can effectively map inputs to predicted outputs.
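A quick check for the imbalance concern mentioned above is to count how often each label occurs before training. This is a minimal sketch using only the Python standard library. The helper names and the spam/ham labels are hypothetical and serve only as an illustration.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of the dataset that belongs to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels):
    """Ratio of the most frequent class count to the least frequent one."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical label column: 90 "spam" rows and 10 "ham" rows.
labels = ["spam"] * 90 + ["ham"] * 10
print(class_shares(labels))
print(imbalance_ratio(labels))
```

A ratio far above 1 suggests that the rare classes may need more examples, or that techniques such as resampling or class weighting are worth considering. What counts as "too imbalanced" depends on the task, so treat any threshold as a judgment call.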

Lastly, intuition about your chosen machine learning model can help you estimate how much training data it needs. While there is no golden rule, some machine learning models are known to need more training data than others. For regression problems, a common suggestion is to have at least ten times as many data points as features. For image classification problems, tens of thousands of images are often needed to build a robust classifier. For natural language processing problems, tens of thousands of samples are often needed for the model to see enough variation in text data.
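The regression guideline above is simple arithmetic, and it can be turned into a sanity check on a dataset's shape. The sketch below assumes the ten-to-one ratio from the text. The function names and the `ratio` parameter are hypothetical, and the result is a rough lower bound for planning, not a guarantee of model quality.

```python
def min_samples_rule_of_ten(n_features, ratio=10):
    """Rough lower bound on rows for a regression with n_features inputs."""
    if n_features < 1:
        raise ValueError("n_features must be at least 1")
    return ratio * n_features


def has_enough_rows(n_rows, n_features, ratio=10):
    """True if the dataset meets the rule-of-thumb minimum."""
    return n_rows >= min_samples_rule_of_ten(n_features, ratio)


# Example: a regression problem with 12 features.
print(min_samples_rule_of_ten(12))   # 120
print(has_enough_rows(100, 12))      # False
print(has_enough_rows(500, 12))      # True
```

Noisy data or a more complex model can push the real requirement well above this figure, so the check is best used to flag datasets that are clearly too small.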



Husna is a data scientist who studied Mathematical Sciences at the University of California, Santa Barbara. She also holds a master’s degree in Engineering, Data Science from the University of California, Riverside. She has experience in machine learning, data analytics, statistics, and big data. She enjoys technical writing when she is not working and is currently responsible for the data science-related content at TAUS.
