Supervised machine learning is the task of inferring a function that maps input variables to an output variable. Let’s unpack this definition with an example. Say that we are a bank that wants to determine to whom we should give a loan. The objective, therefore, is to infer a function that examines a set of characteristics, our input variables, to predict whether an applicant is a good or bad credit risk, our output variable. Although there are other types of supervised learning, since this is a simple binary choice, good or bad, we will use a classification model. We will train this model to distinguish good from bad credit applicants with a training data set. The training data set is a set of data where the results are already known. In our example, the training data set contains characteristics of applicants whose creditworthiness is known. Perhaps they are past applicants who have either defaulted on a loan or paid off their loans. The learning algorithm uses this specific set of examples to infer a general set of rules, the function, which predicts the creditworthiness of future applicants not in our training data set. The bank then applies this function to new applicants to predict whether they are going to default on the loan.
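To make the idea of inferring a function from labeled examples concrete, here is a minimal sketch in Python. It assumes a single hypothetical input variable, the applicant's debt-to-income ratio, and invented training data. The helper names `train_threshold_classifier` and `predict` are illustrative only, and a one-threshold rule is just one very simple kind of classification model.

```python
# Hypothetical training data: (debt_to_income_ratio, label) for past applicants.
# "good" means the loan was repaid, "bad" means it defaulted. All values are invented.
training_data = [
    (0.10, "good"), (0.15, "good"), (0.22, "good"), (0.30, "good"),
    (0.35, "bad"), (0.41, "bad"), (0.48, "bad"), (0.55, "bad"),
]


def train_threshold_classifier(examples):
    """Infer a rule of the form 'ratio <= threshold means good' from labeled examples."""
    ratios = sorted(ratio for ratio, _ in examples)
    # Candidate thresholds are the midpoints between neighbouring observed ratios.
    candidates = [(a + b) / 2 for a, b in zip(ratios, ratios[1:])]

    def training_errors(threshold):
        # Count the examples the rule would label incorrectly.
        return sum(
            ("good" if ratio <= threshold else "bad") != label
            for ratio, label in examples
        )

    # Keep the candidate that makes the fewest mistakes on the training data.
    return min(candidates, key=training_errors)


def predict(threshold, ratio):
    """Apply the inferred rule to an applicant who was not in the training data."""
    return "good" if ratio <= threshold else "bad"


threshold = train_threshold_classifier(training_data)
print(predict(threshold, 0.20))  # a new applicant with a low ratio
print(predict(threshold, 0.50))  # a new applicant with a high ratio
```

The training step looks only at applicants whose outcome is known, while `predict` applies the resulting rule to applicants it has never seen.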
In machine learning, we refer to the application of a general set of rules to specific examples as generalization. When we talk about how well a model generalizes, we are referring to how well the general rule predicts a result for a specific instance not in the training data set. In statistics, the equivalent question is how well the model fits the data. In our example, if the model accurately predicts which new applicants will pay off their loans, we would say the model generalizes well, which means that we have a good fit.
To understand how well a model generalizes, i.e., how good a fit we have, we test the model with a test data set. The test data set is similar to the training data set in that we already know the result; however, the model has not seen this data prior to the test. After running the test, we can compare how well the model predicted the results for the training data and for the test data. The graph below shows this comparison. When the model does a poor job of predicting results for both the training and test data, we say that the model underfits the data. In our example, this would be a case where the model is rejecting people who have high creditworthiness and/or accepting people who have low creditworthiness. In such situations, we can:
- Increase the number of training examples
- Add new features or change the set of features
- Decrease the amount of regularization
- Use a different type of model
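As a rough illustration of underfitting and of the last remedy in the list, the sketch below compares error rates on training and test data for two models. The data, the `approve_everyone` model, and the 0.35 cut-off in `threshold_rule` are all assumptions made up for this example.

```python
# Hypothetical labeled data: (debt_to_income_ratio, label). All values are invented.
train = [(0.10, "good"), (0.20, "good"), (0.30, "good"),
         (0.40, "bad"), (0.50, "bad"), (0.60, "bad")]
test = [(0.15, "good"), (0.25, "good"), (0.45, "bad"), (0.55, "bad")]


def error_rate(model, examples):
    """Fraction of examples where the prediction differs from the known label."""
    wrong = sum(model(ratio) != label for ratio, label in examples)
    return wrong / len(examples)


def approve_everyone(ratio):
    # Too simple: it ignores the input, so it cannot capture the pattern in the data.
    return "good"


def threshold_rule(ratio):
    # A different type of model that actually uses the feature.
    # The 0.35 cut-off is assumed for illustration, not learned.
    return "good" if ratio <= 0.35 else "bad"


for name, model in [("approve_everyone", approve_everyone),
                    ("threshold_rule", threshold_rule)]:
    print(name, error_rate(model, train), error_rate(model, test))
```

On this toy data the first model is wrong half the time on both sets, which is the signature of underfitting, while the second makes no mistakes on either set.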
A model that overfits the data is also a concern. An overfitted model has learned the training data too well. It keys in on specifics of the training data, so it can predict the training data very well but is unable to generalize: it cannot predict results for the test data. We see this when there is a low error rate with the training data, but a high error rate with the test data. In our example, as you may guess, the model does an excellent job of telling us who is a good or bad credit risk in the training data, but not in the test data. In such situations, we can:
- Simplify the model by reducing the number of features
- Use randomization testing
- Increase the amount of regularization
- Adjust the false discovery rate
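The pattern of low training error and high test error can be reproduced with a small sketch. Here a model that memorizes the training data (a 1-nearest-neighbor lookup, chosen only for illustration) is compared with a simpler threshold rule. The data is invented, includes two deliberately noisy labels, and the 0.35 cut-off is again an assumption.

```python
# Hypothetical training data with two noisy labels (0.25 and 0.45). All values are invented.
train = [(0.10, "good"), (0.20, "good"), (0.25, "bad"), (0.30, "good"),
         (0.40, "bad"), (0.45, "good"), (0.50, "bad"), (0.60, "bad")]
test = [(0.12, "good"), (0.24, "good"), (0.28, "good"),
        (0.44, "bad"), (0.52, "bad"), (0.58, "bad")]


def error_rate(model, examples):
    """Fraction of examples where the prediction differs from the known label."""
    wrong = sum(model(ratio) != label for ratio, label in examples)
    return wrong / len(examples)


def memorizer(ratio):
    # 1-nearest-neighbor: copy the label of the closest training example.
    # It reproduces every training label exactly, including the noisy ones.
    _, label = min(train, key=lambda example: abs(example[0] - ratio))
    return label


def simple_rule(ratio):
    # A simpler model: a single assumed cut-off that ignores individual quirks.
    return "good" if ratio <= 0.35 else "bad"


for name, model in [("memorizer", memorizer), ("simple_rule", simple_rule)]:
    print(name, error_rate(model, train), error_rate(model, test))
```

The memorizer makes no errors on the training data but mislabels a third of the test data, while the simpler rule accepts a few training errors and predicts this particular test set correctly.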
Of course, the optimum solution is one where the model predicts both the test and training data well. We need to remember that, to achieve this solution, training our model is an iterative process. We may find in the first few passes that our model is underfitting. As we adjust the model, we may realize that we have gone too far, resulting in a model that overfits the training data. I will discuss the process of adjusting models for underfit and overfit in more detail in future posts.
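One way to picture this iterative loop is a small check that runs after each training pass. The sketch below is only an assumption about how such a check might look: the function name `diagnose`, the `max_error` and `max_gap` thresholds, and the error rates are all hypothetical, and sensible thresholds depend on the problem.

```python
def diagnose(train_error, test_error, max_error=0.10, max_gap=0.05):
    """Label a fit from its error rates. Both thresholds are hypothetical."""
    if train_error > max_error:
        # Poor even on data the model has already seen.
        return "underfit"
    if test_error - train_error > max_gap:
        # Good on the training data, noticeably worse on unseen data.
        return "overfit"
    return "good fit"


# Invented (training error, test error) pairs from three successive passes.
passes = [(0.40, 0.42), (0.01, 0.30), (0.06, 0.08)]
for train_error, test_error in passes:
    print(diagnose(train_error, test_error))
```

With these made-up numbers, the first pass is flagged as underfitting, the second as overfitting after the model was adjusted too far, and the third as a good fit.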
Additional Sources of Information