William of Ockham was a fourteenth-century Franciscan friar who studied logic. He is most often associated with lex parsimoniae, the law of parsimony. Simply stated, when two explanations for something work equally well, the simpler one is probably the better one. The principle behind this law, commonly referred to as Occam’s razor, is the shaving away of unnecessary complexity, which is precisely how we optimize our machine learning models. Let me explain…
In my last post, Overfitting/Underfitting – How Well Does Your Model Fit, I discussed how to evaluate the fit of your model and the steps you can take if it does not generalize well. Comparing the results on the training data with the results on the test data tells us whether a model is underfitting or overfitting. I showed this comparison in the graph below. The objective, of course, is to land in the middle of this graph, where the model performs well on both the training and test data. As we move to the left on this graph, i.e. into the underfitting range, we have a model that is too simplistic: it does not have enough information to make an accurate prediction. At the other extreme, in the overfitting range, we have too much information. The model makes predictions from data containing unnecessary features, picking up what is essentially noise that does not generalize. We therefore want to shave away this excess complexity.
To demonstrate this, let’s continue with the example from the previous post: we are a bank trying to determine who is a good credit risk. We would like to create a function that can predict the credit worthiness of an applicant from a set of features. Stated formally, given a training set (Ax, Bx) for x = 1…n, where Ax is an applicant and Bx is that applicant’s known credit worthiness, we want to create a classification function f that can predict B for a new A. We begin by representing each loan applicant as a vector of attributes, as shown in the table below. Each row is one applicant and each column is one feature of that applicant. Because this is our training data set, we already know the credit worthiness of each applicant. In the table below, an applicant deemed to have good credit worthiness is indicated with a 1.
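As a rough illustration of this setup, here is a minimal Python sketch of a training set of (A, B) pairs and a way to score a candidate function f against it. The data values are invented for illustration. Only the savings and previous defaults columns are named in this post; the other three column names are hypothetical placeholders, and approve_everyone is a trivial stand-in for f, not a real model.

```python
from typing import Callable, Sequence

# Column names. The last two appear in this post; the first three are
# hypothetical placeholders for the remaining dimensions.
FEATURES = ["income_thousands", "years_employed", "owns_home",
            "savings_thousands", "previous_defaults"]

# Hypothetical training set: each entry is (A_x, B_x), where A_x is the
# applicant's feature vector and B_x is 1 for good credit worthiness, else 0.
TRAINING_SET = [
    ((85, 10, 1, 150, 0), 1),
    ((40, 2, 0, 20, 1), 0),
    ((60, 5, 1, 90, 0), 1),
    ((30, 1, 0, 5, 1), 0),
]

# A classification function f maps a feature vector A to a predicted label B.
Classifier = Callable[[Sequence[float]], int]


def accuracy(f: Classifier, data) -> float:
    """Fraction of applicants whose predicted label matches the known label."""
    correct = sum(1 for a, b in data if f(a) == b)
    return correct / len(data)


def approve_everyone(applicant: Sequence[float]) -> int:
    """Trivial stand-in for f: predicts good credit for every applicant."""
    return 1


print(accuracy(approve_everyone, TRAINING_SET))  # 0.5 on this made-up data
```

Any real model would take the place of approve_everyone; the point is only that f is judged by how often its predicted B matches the known B.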
Now, this is a very simple model with just five dimensions. Let’s say that we run the model with this training data set only to find that it does not fit well, i.e. the model is not accurately predicting the credit worthiness of the applicants in the training data set. Since the model is not performing well even on the training data, we need not bother with the test data set yet. We try to improve our results by introducing more features into the data set, as shown in the example below. As we can see, once we add the additional features the model fits the training data set much better, matching it exactly.
When we apply this same model to the test data set, however, we don’t do as well, as shown in the table below. We now have a model that is overfitted, that is, tuned specifically to the training data set. For the purposes of our discussion today, we are going to assume that the reason for the overfitting is the dimensionality of the data set. We will address other fitting issues in future posts.
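To make the symptom concrete, here is a small Python sketch that compares training accuracy with test accuracy. It uses a deliberately extreme stand-in for an overfitted model: a lookup table that memorizes every training row. This is not the model from this post, and all of the data values and the two extra columns are invented. The connection to dimensionality is that once there are enough columns, every training row becomes unique, so a model can match the training set exactly without learning anything that carries over.

```python
def fit_memorizer(training_set, default=1):
    """Return a classifier that memorizes every training vector verbatim.

    Unseen applicants get the `default` label, so nothing generalizes.
    """
    table = {tuple(a): b for a, b in training_set}

    def f(applicant):
        return table.get(tuple(applicant), default)

    return f


def accuracy(f, data):
    """Fraction of applicants whose predicted label matches the known label."""
    return sum(1 for a, b in data if f(a) == b) / len(data)


# Hypothetical data. Columns: savings in thousands, previous defaults,
# and two made-up extra columns that carry no real signal.
train = [
    ((150, 0, 3, 7), 1),
    ((20, 1, 5, 2), 0),
    ((90, 0, 1, 9), 1),
    ((5, 1, 8, 4), 0),
]
test = [
    ((120, 0, 2, 6), 1),
    ((10, 1, 4, 1), 0),
    ((40, 1, 9, 3), 0),
    ((200, 0, 6, 8), 1),
]

f = fit_memorizer(train)
train_acc = accuracy(f, train)
test_acc = accuracy(f, test)
print(train_acc, test_acc)  # 1.0 0.5 on this made-up data
```

A perfect score on the training data next to a much lower score on the test data is the gap that signals overfitting.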
It is now time to get out our razor and start shaving away some of the complexity. After a few iterations, we settle on two dimensions: savings in thousands and previous defaults. The model detects that when an applicant has a previous default and savings of less than $100,000, they have poor credit worthiness. In the data provided below, we see that by eliminating the five excess dimensions we have been able to develop a model that generalizes well. Although the model is not one hundred percent accurate in this instance, it generalizes well enough for our purposes.
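The rule the simplified model arrives at is small enough to write out directly. The Python sketch below encodes it as described above, with savings expressed in thousands. The function name and the test rows are hypothetical, and the last row is deliberately one the rule gets wrong, to mirror a model that is good but not perfect.

```python
def predict_credit(savings_thousands, previous_defaults):
    """Return 0 (poor credit) for an applicant with a previous default and
    savings under $100,000; return 1 (good credit) otherwise."""
    if previous_defaults > 0 and savings_thousands < 100:
        return 0
    return 1


# Hypothetical test set: ((savings_thousands, previous_defaults), known label).
test_set = [
    ((120, 0), 1),
    ((10, 1), 0),
    ((40, 1), 0),
    ((200, 0), 1),
    ((150, 1), 0),  # has a default but high savings: the rule misses this one
]

correct = sum(1 for (savings, defaults), label in test_set
              if predict_credit(savings, defaults) == label)
test_accuracy = correct / len(test_set)
print(test_accuracy)  # 0.8 on this made-up data
```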
The case for simplicity extends beyond the ability of the model to generalize well. We should consider the complexity of the model even when we have a good fit. Unnecessarily complex models, even when they predict well, are inefficient. Large volumes of data, in addition to putting greater demand on system resources, increase the data cleansing requirements, data load times, and overall processing time.
While this is a very simple example, the concept is that we want to establish a balance in our models between complexity and accuracy. As William of Ockham reminds us, the simplest explanations are often the best. As we apply this principle to machine learning, the best models are simple ones that fit the data well.