There are a number of machine learning models to choose from. We can use Linear Regression to predict a continuous value, Logistic Regression to classify discrete outcomes, and Neural Networks to model non-linear behaviors.

When we build these models, we always use a set of historical data to help our machine learning algorithms learn the relationship between a set of input features and a predicted output. But even if a model can accurately predict a value from historical data, how do we know it will work as well on new data?

Or more plainly, how do we evaluate whether a machine learning model is actually “good”?

In this post we’ll walk through some common scenarios where a seemingly good machine learning model may still be wrong. We’ll show how you can diagnose these issues by assessing bias vs. variance and precision vs. recall, and present some solutions that can help when you encounter such scenarios.

High Bias or High Variance

[Figure: example model fits illustrating High Bias (underfitting), a balanced fit, and High Variance (overfitting)]

When evaluating a machine learning model, one of the first things you want to assess is whether you have “High Bias” or “High Variance”.

High Bias refers to a scenario where your model is “underfitting” your example dataset (see figure above). This is bad because your model is not capturing an accurate or representative picture of the relationship between your inputs and predicted output, and it typically produces high error (i.e. a large difference between the model’s predicted value and the actual value).

High Variance represents the opposite scenario. In cases of High Variance or “overfitting”, your machine learning model fits your example dataset so closely that it captures the noise in the data along with the underlying relationship. While this may seem like a good outcome, it is a cause for concern, as such models often fail to generalize to future datasets. So while your model works well for your existing data, you don’t know how well it’ll perform on other examples.

But how can you know whether your model has High Bias or High Variance?

One straightforward method is to do a Train-Test Split of your data. For instance, train your model on 70% of your data, and then measure its error rate on the remaining 30%. If your model has high error in both the train and test sets, you know your model is underfitting both and has High Bias. If your model has low error on the training set but high error on the test set, this is indicative of High Variance, as your model has failed to generalize to the second set of data.
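
As a concrete sketch of this diagnostic (a minimal example assuming scikit-learn; the synthetic dataset and Linear Regression model are stand-ins for illustration, not part of the original post):

```python
# Minimal sketch of a 70/30 train-test split diagnostic, assuming scikit-learn.
# The synthetic dataset and model are illustrative placeholders.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

# Train on 70% of the data; hold out the remaining 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

train_error = mean_squared_error(y_train, model.predict(X_train))
test_error = mean_squared_error(y_test, model.predict(X_test))

# High error on both sets          -> High Bias (underfitting)
# Low train error, high test error -> High Variance (overfitting)
print(f"train error: {train_error:.1f}, test error: {test_error:.1f}")
```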

If you can generate a model with low error in both your train (past) and test (future) datasets, you’ll have found a model that is “Just Right”, striking the right balance between bias and variance.

Low Precision or Low Recall

[Figure: diagram of True Positives, False Positives, and False Negatives for a spam classifier]

Even when a model has high accuracy, it may still be susceptible to other types of error.

Take the case of classifying email as spam (the positive class) or not spam (the negative class). 99% of the time, the email you receive is not spam, but perhaps 1% of the time it is spam. If we were to train a machine learning model and it learned to always predict an email as not spam (negative class), then it would be accurate 99% of the time despite never catching the positive class.
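
To make this accuracy trap concrete, here is a minimal sketch (the class counts are illustrative, not from the post):

```python
# Illustrative sketch: an "always not spam" classifier on a 99/1 class split.
# The counts below are made up for demonstration.
from sklearn.metrics import accuracy_score, recall_score

y_true = [1] * 10 + [0] * 990  # 1 = spam, 0 = not spam
y_pred = [0] * 1000            # the model always predicts "not spam"

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks impressive
print(recall_score(y_true, y_pred))    # 0.0  -- never catches any spam
```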

In scenarios like this, it’s helpful to look at what percentage of the positive class we’re actually capturing, given by two metrics: Precision and Recall.

Precision = True Positives / (True Positives + False Positives)

Recall = True Positives / (True Positives + False Negatives)

Precision is a measure of how often your predictions for the positive class are actually true. It’s calculated as the number of True Positives (e.g. predicting an email is spam and it is actually spam) over the sum of the True Positives and False Positives (e.g. predicting an email is spam when it’s not).

Recall is the measure of how often the actual positive class is predicted as such. It’s calculated as the number of True Positives over the sum of the True Positives and False Negatives (e.g. predicting an email is not spam when it is).

Another way to interpret the difference between Precision and Recall is that Precision measures what fraction of your positive predictions are valid, while Recall tells you how often your predictions actually capture the positive class. Hence, Low Precision emerges when very few of your positive predictions are true, and Low Recall occurs when most of the actual positive cases are never predicted.
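
As a sketch, both metrics can be computed directly from the confusion counts (the numbers here are illustrative placeholders):

```python
# Minimal sketch: Precision and Recall from raw confusion counts.
# The counts are illustrative placeholders.
true_positives = 8   # predicted spam, actually spam
false_positives = 4  # predicted spam, actually not spam
false_negatives = 2  # predicted not spam, actually spam

precision = true_positives / (true_positives + false_positives)  # 8/12 = 0.67
recall = true_positives / (true_positives + false_negatives)     # 8/10 = 0.80

print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```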

The goal of a good machine learning model is to strike the right balance of Precision and Recall by maximizing the number of True Positives while minimizing the number of False Negatives and False Positives (as represented in the diagram above).

5 Ways to Improve Your Model

[Figure: model error as a function of the number of input features, and the effect of shifting the classification probability threshold]

If you face issues of High Bias vs. High Variance in your models, or have trouble balancing Precision vs. Recall, there are a number of strategies you can employ.

For instances of High Bias in your machine learning model, you can try increasing the number of input features. As discussed, High Bias emerges when your model is underfit to the underlying data, producing high error on both your train and test sets. Plotting model error as a function of the number of input features (see figure above), we find that adding features leads to a better fit in the model.

It follows that in the opposite scenario of High Variance, you can reduce the number of input features. If your model is overfit to the training data, it’s possible you’ve used too many features, and reducing the number of inputs can help the model generalize better to test or future datasets. Similarly, increasing the number of training examples can help in cases of High Variance, giving the machine learning algorithm more evidence from which to build a generalizable model.
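
One way to see this trade-off in code is to sweep model complexity and compare train vs. test error. Here is a sketch where polynomial degree stands in for the number of input features (the synthetic data and degrees are illustrative):

```python
# Sketch: sweeping model complexity (polynomial degree stands in for the
# number of input features) while watching train vs. test error.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # degree 1:  high error on both sets (High Bias)
    # degree 15: low train error but higher test error (High Variance)
    print(f"degree {degree:2d}: train {train_err:.3f}, test {test_err:.3f}")
```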

For balancing cases of Low Precision and Low Recall, you can alter the probability threshold at which you classify the positive vs. negative class (see figure above). For cases of Low Precision, you can increase the probability threshold, thereby making your model more conservative in its designation of the positive class. On the flip side, if you are seeing Low Recall, you can reduce the probability threshold, thereby predicting the positive class more often.
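
Here is a sketch of threshold tuning (assuming a scikit-learn classifier that exposes predict_proba; the imbalanced dataset and the thresholds are illustrative):

```python
# Sketch: moving the classification threshold to trade Precision for Recall.
# The synthetic imbalanced dataset and thresholds are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = LogisticRegression().fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

for threshold in (0.3, 0.5, 0.7):
    y_pred = (proba >= threshold).astype(int)
    # Raising the threshold tends to raise Precision and lower Recall;
    # lowering it does the reverse.
    print(f"threshold {threshold}: "
          f"precision {precision_score(y_test, y_pred):.2f}, "
          f"recall {recall_score(y_test, y_pred):.2f}")
```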

With enough iteration, it’s often possible to find a machine learning model with the right balance of bias vs. variance and precision vs. recall.

This blog post is based on concepts taught in Stanford’s Machine Learning course notes by Andrew Ng on Coursera.