The amount of training data you need depends on many variables: the model you use, the task you perform, the performance you want to achieve, the number of features available, the noise in the data, the complexity of the model, and more.
While there is no set answer to how much training data your machine learning application will need, there are some key guidelines. The first rule of thumb is that more training data generally leads to a better outcome, although the gains tend to diminish as the dataset grows. The higher the volume of training data, the less likely a model is to overfit, that is, to fit the noise in the training set rather than the data’s true signal. In other words, more data mainly reduces variance. It does not by itself fix high bias (when a model makes oversimplified assumptions and underfits); that usually calls for a more expressive model or better features.
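To make the overfitting point concrete, here is a minimal sketch that fits a straight line to noisy data at several training-set sizes and reports the average gap between test error and training error. The ground-truth function, the noise level, and all helper names are assumptions chosen for illustration, not part of any particular library or workflow.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def mse(a, b, xs, ys):
    """Mean squared error of the line y = a*x + b on the given points."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def sample(n, rng, noise=1.0):
    # Hypothetical ground truth: y = 2x + 1 plus Gaussian noise.
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [2 * x + 1 + rng.gauss(0, noise) for x in xs]
    return xs, ys

def average_gap(n_train, trials=300, seed=0):
    """Mean (test MSE - train MSE) over repeated random training sets."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        train_x, train_y = sample(n_train, rng)
        test_x, test_y = sample(200, rng)
        a, b = fit_line(train_x, train_y)
        total += mse(a, b, test_x, test_y) - mse(a, b, train_x, train_y)
    return total / trials

# A large gap means the model looks better on its training data than on unseen data.
gaps = {n: average_gap(n) for n in (5, 20, 100, 500)}
for n, gap in gaps.items():
    print(f"n_train={n:4d}  test-train MSE gap={gap:.3f}")
```

With very few training points the fitted line chases the noise, so the gap is wide. As the training set grows, the gap narrows toward zero, which is the effect the rule of thumb describes.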
Next, domain expertise can help you narrow down a suitably sized training set. Training data should ideally be independent and identically distributed (i.i.d.) and representative of the data the model will see in practice. Class balance is a separate concern: if some classes are rare, the training set needs enough examples of them to avoid the problems of an imbalanced dataset. Accordingly, the training set should contain enough data to capture the relationships that exist between inputs and outputs, so that the model can effectively map inputs to predicted outputs.
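One quick sanity check along these lines is to look at how the labels are distributed before training. The sketch below is only an illustration: the toy labels, the helper names, and the 10% threshold are all assumptions, and a sensible threshold depends on the task.

```python
from collections import Counter

def class_shares(labels):
    """Return each class's share of the dataset as a fraction of all labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: count / total for cls, count in counts.items()}

def underrepresented(labels, min_share=0.1):
    """List classes whose share falls below min_share (an arbitrary threshold)."""
    shares = class_shares(labels)
    return sorted(cls for cls, share in shares.items() if share < min_share)

# Hypothetical toy labels: 5 "spam" examples and 95 "ham" examples.
labels = ["spam"] * 5 + ["ham"] * 95
print(class_shares(labels))
print(underrepresented(labels))
```

A class flagged here is a candidate for collecting more examples, or for techniques such as resampling or class weighting.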
Lastly, intuition about your chosen machine learning model can help you estimate how much training data it needs. While there is no golden rule, some machine learning models are known to need more training data than others. For regression problems, a common suggestion is to have at least ten times as many data points as features. For image classification problems, tens of thousands of images are typically needed to build a robust classifier. For natural language processing problems, tens of thousands of samples are typically needed for the model to see enough variation in text data.
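The regression heuristic above is simple arithmetic, sketched here as a small helper. The function names and the configurable `factor` parameter are assumptions for illustration; the result is a rough lower bound, not a guarantee of good performance.

```python
def min_samples_rule_of_ten(n_features, factor=10):
    """Rough lower bound on training-set size: factor * number of features."""
    if n_features < 1:
        raise ValueError("n_features must be at least 1")
    return factor * n_features

def meets_rule_of_ten(n_samples, n_features, factor=10):
    """True if the dataset has at least factor * n_features data points."""
    return n_samples >= min_samples_rule_of_ten(n_features, factor)

# Example: a regression problem with 25 features.
print(min_samples_rule_of_ten(25))   # 250
print(meets_rule_of_ten(180, 25))    # False
```

Noisy data or a complex model can push the real requirement well above this figure.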
Husna is a data scientist who studied Mathematical Sciences at the University of California, Santa Barbara. She also holds a master’s degree in Engineering, Data Science from the University of California, Riverside. She has experience in machine learning, data analytics, statistics, and big data. She enjoys technical writing when she is not working and is currently responsible for the data science-related content at TAUS.