**By Ahmed Gad, Menoufia University**

Polynomial Regression & Overfitting

Have you ever created a machine learning model that is perfect for the training samples but gives very bad predictions for unseen samples? Did you ever wonder why this happens? This article explains overfitting, which is one of the reasons for poor predictions on unseen samples. A regularization technique based on regression is also presented in simple steps to make it clear how to avoid overfitting.

The focus of machine learning (ML) is to train an algorithm with training data in order to create a model that is able to make correct predictions for unseen data (test data). To create a classifier, for example, a human expert starts by collecting the data required to train the ML algorithm. The human is responsible for finding the best types of features to represent each class, that is, features capable of discriminating between the different classes. Such features will be used to train the ML algorithm. Suppose we are to build an ML model that classifies images as containing cats or not using the following training data.

The first question we have to answer is “what are the best features to use?”. This is a critical question in ML: the better the features used, the better the predictions the trained ML model makes, and vice versa. Let us try to visualize such images and extract some features that are representative of cats. Some of the representative features may be the existence of two dark eye pupils and two ears with a diagonal direction. Assume that we extracted such features, somehow, from the above training images and created a trained ML model. Such a model can work with a wide range of cat images because the features used exist in most cats. We can test the model using some unseen data such as the following. Assume that the classification accuracy on the test data is **x%**.

One may want to increase the classification accuracy. The first thing to think of is using more features than the two used previously, because the more discriminative features we use, the better the accuracy. By inspecting the training data again, we can find more features, such as the overall image color, as all training cat samples are white, and the eye iris color, as all training samples have yellow irises. The feature vector will have the 4 features shown below. They will be used to retrain the ML model.

After creating the trained model, the next step is to test it. The expected result after using the new feature vector is that the classification accuracy will decrease to less than **x%**. But why? The cause of the accuracy drop is using some features that exist in the training data but do not exist generally in all cat images. All training images used have a white image color and yellow eye irises, and the model wrongly generalizes these features to all cats. In the testing data, some cats have a black or yellow color, which is not the white used in training. Some cats do not have yellow irises.


Our case, in which the features used are powerful for the training samples but very poor for the testing samples, is known as overfitting. The model is trained with some features that are exclusive to the training data and do not exist in the testing data.

The goal of the previous discussion is to make the idea of overfitting simple through a high-level example. To get into the details, it is preferable to work with a simpler example. That is why the rest of the discussion is based on a regression example.

**Understand Regularization based on a Regression Example**

Assume we want to create a regression model that fits the data shown below. We can use polynomial regression.

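The simplest model that we can start with is the linear model with a first-degree polynomial equation:

$$y = \theta_0 + \theta_1 x$$

where $\theta_0$ and $\theta_1$ are the model parameters and $x$ is the only feature used.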

The plot of the previous model is shown below:
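A first-degree fit like the one described above can be sketched with NumPy. This is only an illustration: the article's dataset is not available, so the data below are synthetic, and the curve, noise level, and variable names are assumptions made for the example.

```python
import numpy as np

# Hypothetical data: noisy samples from a curve chosen only for illustration.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.2, size=x.size)

# Fit a first-degree polynomial by least squares.
# np.polyfit returns the highest power first, so we get (slope, intercept).
slope, intercept = np.polyfit(x, y, deg=1)

# Predictions of the linear model for the training samples.
y_pred = slope * x + intercept
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

Because the synthetic data follow a curved trend, a straight line can only capture the overall slope, which is the situation the next paragraphs discuss.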

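Based on a loss function that compares the model's outputs with the desired ones, we can conclude that the model is not fitting the data well. For each sample, such a loss measures the difference between the expected (predicted) output and the desired output for the same sample.

The model is too simple and many of its predictions are not accurate. For this reason, we should create a more complex model that can fit the data better. We can increase the degree of the equation from one to two. It will be as follows:

$$y = \theta_0 + \theta_1 x + \theta_2 x^2$$

By using the same feature $x$ raised to the power 2, we create a new feature, $x^2$, with its own parameter $\theta_2$, which lets the model capture the second-degree properties of the data.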

The graph shows that the second-degree polynomial fits the data better than the first-degree one. But the quadratic equation still does not fit some of the data samples well. This is why we can create a more complex model of the third degree with the following equation:

$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$$

The graph will be as follows:

It is noted that the model fits the data better after adding a new feature that captures the third-degree properties of the data. To fit the data even better, we can increase the degree of the equation to four, as in the following equation:

$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$$

The graph will be as follows:

It seems that the higher the degree of the polynomial equation, the better it fits the data. But there are some important questions to be answered. If increasing the degree of the polynomial equation by adding new features enhances the results, why not use a very high degree, such as the 100th? What is the best degree to use for a problem?
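The observation that a higher degree fits the training data better can be checked with a short sketch. The data are synthetic and the loss is assumed to be the sum of squared errors, which may differ from the loss used for the original figures; `training_loss` is a helper name introduced here for illustration.

```python
import numpy as np

# Hypothetical data, chosen only for illustration.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.2, size=x.size)

def training_loss(degree):
    """Sum of squared errors of a least-squares polynomial fit of the given degree."""
    coeffs = np.polyfit(x, y, deg=degree)
    residuals = np.polyval(coeffs, x) - y
    return float(np.sum(residuals**2))

# Training loss for the first- to fourth-degree models.
losses = {degree: training_loss(degree) for degree in range(1, 5)}
for degree, loss in losses.items():
    print(f"degree {degree}: training loss = {loss:.4f}")
```

With a least-squares fit, the training loss cannot go up when the degree increases, because every lower-degree polynomial is a special case of a higher-degree one. This is why the training loss alone cannot tell us which degree is best.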

**Model Capacity/Complexity**

There is a term called model capacity or complexity. Model capacity/complexity refers to the level of variation that the model can work with. The higher the capacity, the more variation the model can cope with. The first model (the first-degree polynomial) is said to be of small capacity compared to the higher-degree models.

For sure, the higher the degree of the polynomial equation, the better it fits the training data. But remember that increasing the polynomial degree increases the complexity of the model. Using a model with a capacity higher than required may lead to overfitting: the model becomes very complex and fits the training data very well but, unfortunately, is very weak for unseen data. The goal of ML is to create a model that is robust not only with the training data but also with unseen data samples.
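To see the difference between fitting the training data and coping with unseen data, one can compare the error on training samples with the error on held-out samples. The sketch below uses synthetic data and a mean squared error. The ground-truth curve, the sample sizes, and the degrees compared are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_curve(x):
    # Hypothetical ground truth, chosen only for illustration.
    return 1.0 + 2.0 * x - 3.0 * x**2

# A small training set and a larger set of unseen samples.
x_train = np.linspace(-1.0, 1.0, 12)
y_train = true_curve(x_train) + rng.normal(scale=0.3, size=x_train.size)
x_test = rng.uniform(-1.0, 1.0, size=200)
y_test = true_curve(x_test) + rng.normal(scale=0.3, size=x_test.size)

def mean_squared_error(coeffs, x, y):
    """Mean squared difference between the polynomial's predictions and y."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = {
        "train": mean_squared_error(coeffs, x_train, y_train),
        "test": mean_squared_error(coeffs, x_test, y_test),
    }
    print(degree, results[degree])
```

The training error can only shrink as the degree grows. The error on unseen samples does not have to follow it, and a gap between the two is the usual sign of overfitting. The exact numbers depend on the random seed.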

The model of the fourth degree is more complex than this data requires: the newly added fourth-degree feature captures more detail than needed.

In this example, we actually know which feature to remove. So, we can remove it and return to the previous model of the third degree. But in actual work, we do not know which features to remove. Moreover, assume that the new feature is not too bad and we do not want to remove it completely; we just want to penalize it. What should we do?

Looking back at the loss function, its only goal is to minimize/penalize the prediction error. We can set a new objective: to also minimize/penalize the effect of the new feature.

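Our objective now is to minimize the modified loss function, and in particular the term that the fourth-degree feature contributes to it. Because the feature values come from the data, the only way to shrink that term is to shrink the parameter that multiplies the feature. Setting that parameter to zero removes the feature completely: by removing it, we go back to the third-degree polynomial equation.

But in case the feature is relatively good and we only want to penalize it, we can keep its parameter small but nonzero, which limits the effect of the feature without removing it.

The same reasoning can be applied to the third-degree feature if the third-degree model is still more complex than needed.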
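Penalizing a single feature can be sketched in code. The snippet below is an illustration under assumptions that are not taken from the original example: the data are synthetic, the loss is the sum of squared errors, and the extra cost is the squared parameter of the fourth-degree feature multiplied by a weight. The function name `fit_with_penalty_on_x4` is hypothetical.

```python
import numpy as np

# Hypothetical data, chosen only for illustration.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.2, size=x.size)

# Design matrix with columns x^0, x^1, ..., x^4 (fourth-degree model).
X = np.vander(x, N=5, increasing=True)

def fit_with_penalty_on_x4(penalty):
    """Minimize squared error plus penalty * theta_4**2 (only x^4 is penalized)."""
    D = np.zeros((5, 5))
    D[4, 4] = 1.0  # the extra cost applies to the x^4 parameter only
    return np.linalg.solve(X.T @ X + penalty * D, X.T @ y)

penalties = (0.0, 1.0, 100.0, 1e6)
x4_parameters = {p: fit_with_penalty_on_x4(p)[4] for p in penalties}
for p, value in x4_parameters.items():
    print(f"penalty={p:g}: parameter of x^4 = {value:.6f}")
```

A penalty of zero leaves the fourth-degree model unchanged. A moderate penalty shrinks the parameter of the fourth-degree feature without removing it, and a very large penalty drives it close to zero, which is almost the same as going back to the third-degree model.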

**Regularization**

Note that we actually knew which features to penalize only because we could inspect the data and the fitted models directly. With more samples and more complex data, we cannot reach such conclusions by hand, and we need an automatic way to do it.

Regularization helps us select the model complexity that fits the data. It automatically penalizes features that make the model too complex. Remember that regularization is useful when the features are not bad and still help us get good predictions, so that we just need to penalize them rather than remove them completely. Regularization penalizes all used features, not a selected subset. Previously, we penalized just two features, not all of them.

Using regularization, a new term is added to the loss function to penalize the features so the loss function will be as follows:

It can also be written as follows after moving λ outside the summation:

The newly added term is used to penalize the features in order to control the level of model complexity. Our goal before adding the regularization term was to minimize the prediction error as much as possible. Now our goal is to minimize the error while being careful not to make the model too complex, which avoids overfitting.

There is a regularization parameter called lambda (λ) which controls how strongly the features are penalized. It is a hyperparameter with no fixed value; its value depends on the task at hand. As its value increases, the features are penalized more heavily and, as a result, the model becomes simpler. As its value decreases, the features are penalized less and thus the model complexity increases. A value of zero means no penalization of the features at all.
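The effect of λ can be sketched in code. The snippet below is only an illustration under several assumptions that are not part of the original example: synthetic data, a squared-error loss, and a squared penalty on the parameters (commonly called ridge or L2 regularization), so that the loss is

$$L = \sum_{i} \left(f(x_i) - d_i\right)^2 + \lambda \sum_{j=1}^{4} \theta_j^2$$

where $f(x_i)$ is the predicted output and $d_i$ the desired output for sample $i$. The penalty index starts at 1, so the intercept $\theta_0$ is not penalized. The name `ridge_fit` is hypothetical.

```python
import numpy as np

# Hypothetical data, chosen only for illustration.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.2, size=x.size)

# Design matrix with columns x^0, x^1, ..., x^4 (fourth-degree model).
X = np.vander(x, N=5, increasing=True)

def ridge_fit(lam):
    """Minimize squared error plus lam * sum of squared parameters (intercept excluded)."""
    penalty = np.eye(X.shape[1])
    penalty[0, 0] = 0.0  # the intercept (index 0) is not penalized
    return np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)

lambdas = (0.0, 0.1, 10.0, 1000.0)
# Size of the penalized parameters (indices 1 to 4) for each value of lambda.
norms = [float(np.linalg.norm(ridge_fit(lam)[1:])) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(f"lambda={lam:g}: norm of penalized parameters = {norm:.4f}")
```

As λ grows, the penalized parameters shrink together and the fitted curve becomes flatter and simpler. With λ equal to zero the result is the ordinary fourth-degree least-squares fit.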

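When λ is zero, the regularization term vanishes and the loss reduces to the prediction error alone, so nothing limits the model complexity.

But when the value of the penalization parameter λ is very high, the parameters of the features are pushed toward zero and the model may become too simple to fit the data.

Please note that the regularization term starts its index from 1, not zero. Actually, we use the regularization term to penalize features. Because the parameter at index zero (the intercept) is not associated with any feature, there is no reason to penalize it.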

**Bio: Ahmed Gad** received his B.Sc. degree with excellent with honors in information technology from the Faculty of Computers and Information (FCI), Menoufia University, Egypt, in July 2015. For being ranked first in his faculty, he was recommended to work as a teaching assistant in one of the Egyptian institutes in 2015 and then in 2016 to work as a teaching assistant and a researcher in his faculty. His current research interests include deep learning, machine learning, artificial intelligence, digital signal processing, and computer vision.

Original. Reposted with permission.
