Building a machine learning model is not just a matter of feeding in data; many pitfalls can undermine a model's accuracy. Overfitting is one of the most common of these shortcomings, hampering both accuracy and performance. In this article we explain what overfitting is, what its causes and consequences are, and how to address it.
What is overfitting in machine learning?
In the age of big data and artificial intelligence, you have probably experienced a situation like the following: you start training a machine learning model and get very promising results, so you quickly release the model to production. However, a few days later, your clients start calling to complain about the poor quality of its predictions. What happened?
Most likely, you were too optimistic and did not validate your model against the right data. Or rather, you did not use your training data in the right way.
When we develop a machine learning model, we try to teach it to achieve a goal: detecting an object in an image, classifying a text by its content, recognizing speech, and so on. To do this, we start from a dataset used to train the model, that is, to teach it how to achieve the desired objective. However, if we don't do things right, the model may only handle the data it was trained on, failing to recognize any other data that differs even slightly from the initial dataset. This phenomenon is called overfitting in machine learning.
A statistical model is said to be overfitted when it fits its training data too closely. Such a model begins to learn from the noise and inaccurate entries in the dataset, so it does not categorize new data correctly, because it has captured too much detail and noise.
Overfitting tends to arise with non-parametric and non-linear methods, because these types of machine learning algorithms have more freedom in building the model from the dataset and can therefore end up building unrealistic models. Two ways to avoid overfitting are to use a linear algorithm if the data is linear, and to constrain the model's freedom with parameters such as the maximum depth of a decision tree.
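As a minimal sketch of the second remedy, assuming scikit-learn is available (the data and the depth limit are illustrative only), compare an unconstrained decision tree with one capped at a maximum depth:

```python
# A minimal sketch of capping model freedom, assuming scikit-learn is
# available; the data and the depth limit are illustrative only.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.3, 80)   # signal plus noise

deep = DecisionTreeRegressor(max_depth=None, random_state=0).fit(X, y)
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# The unconstrained tree can memorize the noise (training R^2 of 1.0);
# the depth-limited tree is forced to learn only the broad trend.
print(deep.score(X, y), shallow.score(X, y))
```

The deep tree scores perfectly on its own training data precisely because it has memorized the noise; the shallow tree's lower training score is the price of a model that generalizes better.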
To understand overfitting, you need to understand a number of key concepts.
The sweet spot is the midpoint we must find when training our model, at which we are neither underfitting nor overfitting; finding it can sometimes be a complicated task.
The behavior of machine learning models with increasing amounts of data is interesting. If you are building a company based on machine learning, first of all, you need to make sure that more data gives you better algorithms.
But that is a necessary condition, not a sufficient one. You also need to find a sweet spot where:
- It is not too easy to collect enough data, because then the value of your data is small.
- It is not too difficult to collect enough data, because then you will spend too much money to solve the problem.
- The value of data keeps growing as you get more data.
Bias is the difference between the values predicted by the model and the actual (true) values. It is not always easy for a model to learn fairly complex signals. Imagine fitting a linear regression to nonlinear data: no matter how well the model learns the observations, it will never model the curves efficiently. This mismatch is known as bias.
In a very simple way, a high bias indicates that the model suffers from underfitting.
Variance refers to the model's sensitivity to the specific examples in the training data. A high-variance algorithm will produce a model that changes drastically from one training set to another. Imagine an algorithm that fits the model in an unconstrained, super-flexible way: it will also learn from the noise in the training set, and that causes overfitting.
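The variance idea can be made concrete with a small NumPy sketch (the polynomial degrees, noise level, and trial count are arbitrary choices, not from the text): refitting a flexible model on fresh noisy samples of the same signal produces predictions that swing far more than refitting a rigid linear model.

```python
# Hypothetical illustration of variance: how much do a model's predictions
# change when it is refit on fresh noisy samples of the same signal?
import numpy as np

rng = np.random.RandomState(1)
x = np.linspace(-1, 1, 30)

def noisy_sample():
    # The same underlying sine signal, with fresh noise each call.
    return np.sin(np.pi * x) + rng.normal(0, 0.2, x.size)

def refit_variance(degree, trials=20):
    # Refit a degree-`degree` polynomial `trials` times and measure the
    # average pointwise variance of its predictions across refits.
    fits = np.array([np.polyval(np.polyfit(x, noisy_sample(), degree), x)
                     for _ in range(trials)])
    return fits.var(axis=0).mean()

flexible = refit_variance(10)   # high-variance, super-flexible model
rigid = refit_variance(1)       # low-variance, rigid linear model
print(flexible, rigid)          # the flexible fit varies far more
```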
A machine learning algorithm should not be treated as a one-shot method of training the model; rather, it is an iterative process.
Low variance, high bias algorithms are less complex, with a simple and rigid structure.
- They will train models that are consistent, but on average inaccurate.
- These include linear or parametric algorithms, such as regression, Naive Bayes, etc.
High variance, low bias algorithms tend to be more complex, with a flexible structure.
- They will train models that are inconsistent but accurate on average.
- These include nonlinear or nonparametric algorithms like decision trees, nearest neighbor, etc.
Evaluating your machine learning algorithm is an essential part of any project. Your model may give you satisfactory results when evaluated against one metric, such as precision_score, but poor results when evaluated against other metrics, such as logarithmic_loss or any other similar metric.
Most of the time we use classification accuracy to measure the performance of our model; however, accuracy alone is not enough to truly judge it.
Accuracy is the ratio of the number of correct predictions to the total number of input samples.
For example, consider that there are 98% samples of class A and 2% samples of class B in our training set. So our model can easily obtain a training accuracy of 98% simply by predicting each training sample belonging to class A.
When the same model is tested on a test set with 60% class A samples and 40% class B samples, the test accuracy drops to 60%. Classification accuracy looked excellent on the training set, but it gave us a false sense of having a high-performing model.
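The arithmetic behind this example can be checked in a few lines of plain Python (the class labels here are just placeholders):

```python
# Toy check of the class-imbalance effect described above, using the
# exact proportions from the text.
train_labels = ["A"] * 98 + ["B"] * 2      # 98% class A in training
test_labels = ["A"] * 60 + ["B"] * 40      # 60% class A in testing

def accuracy_of_always_A(labels):
    # A degenerate "model" that predicts class A for every sample.
    predictions = ["A"] * len(labels)
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)

print(accuracy_of_always_A(train_labels))  # 0.98
print(accuracy_of_always_A(test_labels))   # 0.6
```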
What are its causes?
The causes of overfitting can be complicated. Generally, they fall into three categories:
- Noise in the training set: when the training set is too small, contains too little representative data, or has too much noise. In this situation, noise has a high chance of being learned and then used as a basis for predictions. A well-functioning algorithm should be able to distinguish representative data from noise.
- Hypothesis complexity: the complexity trade-off, a key concept in statistics and machine learning, is a compromise between variance and bias, that is, a balance between accuracy and consistency. When an algorithm entertains too many hypotheses (too many inputs), the model becomes more accurate on average but less consistent, meaning that models trained on different datasets can differ drastically.
- Multiple comparison procedures, which are ubiquitous in induction algorithms as well as other AI algorithms. In these procedures we repeatedly compare multiple candidates based on scores from an evaluation function and select the one with the highest score. However, this process is likely to pick some candidates that do not improve, or even reduce, classification accuracy.
Overfitting can have many causes and is usually a combination of the following:
- Model too powerful: for example, it allows polynomials up to degree 100. With polynomials only up to degree 5, you would have a much less powerful model that is far less prone to overfitting.
- Not enough data: getting more data can sometimes fix overfitting issues.
What negative consequences can it have?
Overfitting, simply put, means taking too much information from your data and/or prior knowledge into account and using it in a model. To make this concrete, consider the following example: some scientists hire you to provide a model that predicts the growth of a certain type of plant. They have given you data gathered from a whole year of work with these plants, and they will continually send you data on the future development of their plantation.
So you review the data you received and build a model from it. Now suppose that, in your model, you included as many features as possible so as to pin down the exact behavior of the plants in the initial dataset. As production continues, you keep relying on those features and produce highly accurate results. However, if the plantation eventually undergoes some seasonal change, your predictions may start to fail, because the model was fitted so tightly to the original conditions.
Besides failing to detect such small variations and classifying inputs incorrectly, the model's level of detail, that is, its large number of variables, can make processing too expensive. Now imagine that your data is already complex: overfitting the model to it will not only make classification and evaluation very complex, it will also likely make your predictions fail at the slightest variation in the input.
Overfitting is empirically bad. Suppose you have a data set that you split into two, test and training. An overfit model is one that performs much worse on the test dataset than on the training dataset. It is often observed that models like that also generally perform worse on additional test data sets than models that are not overfitted.
How to detect overfitting in a predictive model?
Detecting overfitting is almost impossible before you test the model on held-out data. The defining characteristic of overfitting, the inability to generalize, points to the remedy: the data can be separated into different subsets for training and testing. The data is divided into two main parts, a test set and a training set.
The training set represents the majority of the available data (about 80%) and is used to train the model. The test set represents a small portion of the dataset (approximately 20%) and is used to measure accuracy on data the model has never interacted with before. By segmenting the dataset, we can examine the model's performance on each subset, detecting overfitting when it occurs and seeing how the training process is working.
Performance can be measured by using the percent accuracy observed in both data sets to conclude on the presence of overfitting. If the model performs better on the training set than on the test set, then the model is likely to be overfitting.
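A minimal version of this check, assuming scikit-learn is available and using a synthetic dataset purely for illustration, might look like:

```python
# A minimal sketch of the split-and-compare check described above, assuming
# scikit-learn; the dataset is synthetic and purely illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)   # the 80/20 split described above

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

# A training accuracy far above the test accuracy is the overfitting signal.
print(train_acc, test_acc)
```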
How to avoid overfitting?
Here are some of the ways to avoid overfitting:
Training with more data
This technique may not work every time; fundamentally, it helps the model identify the signal better. In some cases, however, more data can also mean feeding more noise into the model. When we train the model with more data, we need to make sure the data is clean and free of randomness and inconsistencies.
Early stopping
When the model is being trained, you can measure its performance at each iteration. We can keep training up to the point where additional iterations stop improving the model. Beyond that point, the model overfits the training data, as its generalization weakens with each further iteration.
So, early stopping means halting the training process before the model passes the point where it starts to overfit the training data. This technique is mainly used in deep learning.
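The stopping rule itself can be sketched in a few lines of plain Python; the validation-error numbers below are invented for illustration:

```python
# Hypothetical sketch of the early-stopping rule in plain Python: halt once
# the validation error has not improved for `patience` consecutive rounds.
def early_stop_index(val_errors, patience=2):
    """Return the index of the best iteration found before stopping."""
    best_err, best_i, waited = float("inf"), 0, 0
    for i, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_i, waited = err, i, 0
        else:
            waited += 1
            if waited >= patience:   # no improvement for `patience` rounds
                break
    return best_i

# Invented validation errors: they fall, then rise as overfitting sets in.
print(early_stop_index([0.9, 0.6, 0.4, 0.35, 0.4, 0.5, 0.7]))  # 3
```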
Feature removal
Some algorithms have built-in feature selection, but for the significant number that do not, we can manually remove irrelevant input features to improve generalizability.
One way to do this is to reason about how each feature fits into the model, much like debugging code line by line. If a feature cannot justify its relevance in the model, it is a candidate for removal. Feature-selection heuristics can also provide a good starting point.
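One such heuristic is univariate scoring, sketched here with scikit-learn's SelectKBest (the synthetic dataset and the choice of k = 5 are illustrative assumptions):

```python
# Hypothetical sketch of a feature-selection heuristic, assuming
# scikit-learn: univariate scoring with SelectKBest.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 20 input features, of which only 5 actually carry signal.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
X_reduced = selector.transform(X)
print(X_reduced.shape)          # only the five highest-scoring features remain
print(selector.get_support())   # boolean mask over the original features
```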
Ensembling
This technique combines the predictions of several different machine learning models. Two of the most common ensembling methods are listed below:
- Bagging attempts to reduce the chance of overfitting complex models
- Boosting attempts to improve the predictive flexibility of simple models
Although both are ensemble methods, they approach the problem from opposite directions: bagging uses complex base models and tries to smooth out their predictions, while boosting uses simple base models and tries to increase their aggregate complexity.
Cross-validation is a powerful preventive measure against overfitting.
The idea is clever: use your initial training data to generate multiple mini train-test splits, then use these splits to tune your model.
In standard k-fold cross-validation, we divide the data into k subsets, called folds. We then iteratively train the algorithm on k-1 folds while using the remaining fold as the test set (called the "holdout fold").
Cross-validation lets you tune hyperparameters using only your original training set, allowing you to keep your test set as a truly unseen dataset for selecting your final model.
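A minimal k-fold run, assuming scikit-learn is available (k = 5 is a conventional choice; the dataset is synthetic):

```python
# A minimal k-fold cross-validation run, assuming scikit-learn is available.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

# Each of the 5 folds serves once as the holdout while the other 4 train.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(len(scores), scores.mean())   # 5 per-fold accuracies and their average
```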
The above content published at Collaborative Research Group is for informational purposes only and has been developed by referring to reliable sources and recommendations from experts. We do not have any contact with official entities nor do we intend to replace the information that they emit.