Often we want to predict discrete outcomes in our data. Can an email be designated as spam or not spam? Was a transaction fraudulent or valid?
Predicting such outcomes lends itself to a type of Supervised Machine Learning known as Binary Classification, where you try to distinguish between two classes of outcomes.
One of the most common methods to solve for Binary Classification is called Logistic Regression. The goal of Logistic Regression is to evaluate the probability of a discrete outcome occurring, based on a set of past inputs and outcomes. As part of our continuing ML 101 series, we’ll review the basic steps of Logistic Regression, and show how you can use such an approach to predict the probability of any binary outcome.
Using Logistic Regression to Predict Probabilities
Developing a Logistic Regression model for Binary Classification involves a few steps.
With Linear Regression, our goal was to develop a model that could predict any real value. But in Binary Classification we’re trying to distinguish between just two discrete classes. In such a scenario, it’s more helpful to predict the probability of the outcome than the discrete outcome itself.
The goal of Binary Classification is thus to find a model that can best predict the probability of a discrete outcome (notated as 1 or 0, for the “positive” or “negative” classes), based on a set of explanatory input features related to that outcome.
Logistic Regression allows us to compute this probability based on a function:
The model represented computes probability using a sigmoid function of the form 1 / (1 + e^(-z)). For the “z” input into the function, we use a linear combination of the parameters θ and features x, where z = θ0 + θ1*x1 + θ2*x2 (for simplicity throughout this post, we’ll focus on datasets with just two features x1 and x2).
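As a minimal sketch of the function described above, here is the sigmoid and the resulting probability in plain Python. The names `sigmoid` and `predict_probability`, and the parameter values, are our own illustrative choices, not part of any library, and the code is not hardened against overflow for extremely negative z.

```python
import math

def sigmoid(z):
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(theta, x):
    """Return P(1) for one example.

    theta = (theta0, theta1, theta2) and x = (x1, x2), matching
    z = theta0 + theta1*x1 + theta2*x2 from the text.
    """
    z = theta[0] + theta[1] * x[0] + theta[2] * x[1]
    return sigmoid(z)

# Hypothetical parameters and a hypothetical example, for illustration only.
theta = (0.0, 1.0, -1.0)
print(predict_probability(theta, (2.0, 2.0)))  # z = 0 here, so P(1) = 0.5
```

Large positive z pushes the output toward 1, and large negative z pushes it toward 0, which is what lets us read the output as a probability.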
To put this in context, let’s look at an example of credit card transactions.
Say we want to determine whether a transaction is fraudulent or valid, based on the transaction amount ($) and time of day (hour) the transaction occurred. If we use a Logistic Regression model, we would denote the probability of the transaction being fraudulent as P(1), the amount as x1 and time as x2, such that the model takes the form of P(fraud) = 1 / (1 + e^(-(θ0 + θ1*amount + θ2*time))).
If you graph out the values for this model (as seen in Figure 1 above), you’ll see that all outputs take a value between 0 and 1, where 0 indicates no likelihood and 1 indicates 100% likelihood of the transaction being fraudulent.
But how do we leverage these probabilities to actually classify our binary outcomes as fraudulent or valid transactions?
Classifying Binary Outcomes With a Decision Boundary
A Logistic Regression model outputs a probability between 0 and 1 that your discrete positive outcome will occur. To actually convert these probabilities into classifications, we have to use Decision Boundaries.
A Decision Boundary is the line we use to separate our input examples, designating which examples are classified in the positive class (y = 1) and which in the negative class (y = 0).
For two features x1 and x2, we can visualize the Decision Boundary by graphing a scatter plot of all our input examples (see Figure 2 above), with x1 on the x-axis and x2 on the y-axis. The line for the Decision Boundary is in turn based on the linear function θ0 + θ1*x1 + θ2*x2 and some defined probability threshold.
Revisiting our example of fraudulent transactions, say we classify any output as fraudulent if the predicted probability is >= 0.5. Because the sigmoid function outputs exactly 0.5 when z = 0, this threshold corresponds to the line θ0 + θ1*x1 + θ2*x2 = 0. Combining inputs from Figure 1 and Figure 2, we can then compute the line that separates which examples are classified as y = 1 or fraudulent, based on which examples are above the red line (see Figure 2 above).
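To make the thresholding step concrete, here is a small sketch that turns a probability into a class and solves for the boundary line. The function names and the parameter values are hypothetical, and `boundary_x2` assumes θ2 is not zero.

```python
import math

def classify(theta, x, threshold=0.5):
    """Return 1 (positive class) if P(1) >= threshold, else 0."""
    z = theta[0] + theta[1] * x[0] + theta[2] * x[1]
    p = 1.0 / (1.0 + math.exp(-z))
    return 1 if p >= threshold else 0

def boundary_x2(theta, x1, threshold=0.5):
    """Return the x2 value that sits on the Decision Boundary for a given x1.

    Solves theta0 + theta1*x1 + theta2*x2 = log(threshold / (1 - threshold)).
    For threshold = 0.5 the right-hand side is 0. Assumes theta2 != 0.
    """
    logit = math.log(threshold / (1.0 - threshold))
    return (logit - theta[0] - theta[1] * x1) / theta[2]

# Hypothetical parameters: the boundary is the line x1 + x2 = 3.
theta = (-3.0, 1.0, 1.0)
print(classify(theta, (2.0, 2.0)))  # z = 1, so the example is classified as 1
print(classify(theta, (1.0, 1.0)))  # z = -1, so the example is classified as 0
print(boundary_x2(theta, 1.0))      # the boundary passes through (1, 2)
```

Whether the positive class lies above or below the line depends on the sign of θ2, so "above the line" holds for this choice of parameters but not in general.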
But as this Decision Boundary Line is based on a linear function of θ, how do we decide on the value of the parameters θ0, θ1, and θ2 to describe the model?
Measuring the Accuracy of Different Decision Boundaries
As with Linear Regression, we can see that the accuracy of our Logistic Regression model is dependent on our choice of the parameters for θ. Different choices of θ will generate different Decision Boundaries (as noted in Figure 3 above) that have varying levels of efficacy in distinguishing between our different outcomes.
To choose the best model, you want to choose the parameters for θ0, θ1, and θ2 that can produce a Decision Boundary with the highest efficacy at correctly classifying your outcomes.
To measure the accuracy of the Logistic Regression model for a given set of parameters θ, we can use a Cost Function.
The math behind this Cost Function is a little complicated, but for simplicity, you can interpret it as computing an error based on a comparison of the predicted probability of an outcome P(1) and the actual outcome y. For an example in the “1” or “positive class”, the closer the predicted probability is to 1 or 100%, the lower the error. For an example in the “0” or “negative class”, the closer the predicted probability is to 0, the lower the error.
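For reference, the standard Cost Function for Logistic Regression (log loss), which we assume is the one intended here, can be written as follows. The symbols h_θ(x) for the model's predicted P(1), m for the number of training examples, and the superscript (i) for the i-th example are notation we introduce for this sketch.

```latex
% h_\theta(x) is the model's predicted P(1); m is the number of training
% examples. This notation is assumed here, not taken from the post.
h_\theta(x) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_1 + \theta_2 x_2)}}

\mathrm{Cost}\big(h_\theta(x), y\big) =
\begin{cases}
-\log\big(h_\theta(x)\big)     & \text{if } y = 1 \\
-\log\big(1 - h_\theta(x)\big) & \text{if } y = 0
\end{cases}

J(\theta) = \frac{1}{m} \sum_{i=1}^{m}
    \mathrm{Cost}\big(h_\theta(x^{(i)}), y^{(i)}\big)
```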
Going back to our fraud example, say we have a fraudulent transaction with [amount, time] as inputs [$100, 20]. If we look at two different models where P(fraud) = 1 / (1 + e^(-(1 + 0.002*amount + 0.01*time))) and P(fraud) = 1 / (1 + e^(-(1 + 0.001*amount + 0.002*time))), we’ll get P(fraud) of roughly 80% and 76% respectively. Computing the -log (base 10) of these probabilities we get about 0.10 and 0.12 respectively, indicating that the first model has lower cost (which makes sense, considering 80% > 76% as a probability for a transaction we know is fraudulent). With the natural log, which is the more common convention, the costs are about 0.22 and 0.28, and the ranking of the two models is the same.
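The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the helper name `fraud_probability` is ours, and the two parameter sets are the ones quoted in the example above.

```python
import math

def fraud_probability(theta, amount, time):
    """P(fraud) = 1 / (1 + e^(-(theta0 + theta1*amount + theta2*time)))."""
    z = theta[0] + theta[1] * amount + theta[2] * time
    return 1.0 / (1.0 + math.exp(-z))

amount, time = 100, 20  # the fraudulent transaction from the example
models = [(1, 0.002, 0.01), (1, 0.001, 0.002)]

results = []
for theta in models:
    p = fraud_probability(theta, amount, time)
    # Cost for a positive (y = 1) example, in base 10 and in natural log.
    results.append((p, -math.log10(p), -math.log(p)))
    print(f"P(fraud)={p:.3f}  -log10={results[-1][1]:.3f}  -ln={results[-1][2]:.3f}")
```

Whichever log base you use, the model that assigns the higher probability to the true outcome gets the lower cost.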
With the Cost Function defined to evaluate the effectiveness of our Logistic Regression Model, we can use the Gradient Descent Algorithm we defined in our previous post to algorithmically find the parameters θ0, θ1, and θ2 that produce the lowest error rates.
Defined explicitly for Logistic Regression, this looks like:
As a refresher, this algorithm starts from initial values of θ0, θ1, and θ2 and repeatedly adjusts each one by a small step proportional to the partial derivative (slope) of the Cost Function with respect to that parameter, scaled by a learning rate. You repeat this until you reach convergence, where any further change in θ0, θ1, and θ2 no longer meaningfully reduces the cost of the model. The values for θ at that stage will be your optimal values for the Logistic Regression model.
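Here is a minimal sketch of batch Gradient Descent for Logistic Regression with two features. It assumes the standard update θj := θj - α * (1/m) * Σ (P(1) - y) * xj, with x0 = 1 for θ0. The learning rate `alpha`, the iteration count, and the toy dataset are made-up values for illustration, and a fixed number of iterations stands in for a real convergence check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y):
    """Average log loss (natural log) over the training examples."""
    total = 0.0
    for (x1, x2), label in zip(X, y):
        p = sigmoid(theta[0] + theta[1] * x1 + theta[2] * x2)
        total += -math.log(p) if label == 1 else -math.log(1.0 - p)
    return total / len(y)

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Return [theta0, theta1, theta2] after a fixed number of updates."""
    theta = [0.0, 0.0, 0.0]
    m = len(y)
    for _ in range(iterations):
        grad = [0.0, 0.0, 0.0]
        for (x1, x2), label in zip(X, y):
            error = sigmoid(theta[0] + theta[1] * x1 + theta[2] * x2) - label
            grad[0] += error        # x0 = 1 for the intercept term
            grad[1] += error * x1
            grad[2] += error * x2
        # Update all parameters simultaneously.
        theta = [t - alpha * g / m for t, g in zip(theta, grad)]
    return theta

# Made-up toy data: small (x1, x2) values are class 0, larger ones class 1.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0), (3.0, 2.0), (2.0, 3.0)]
y = [0, 0, 0, 1, 1, 1]

theta = gradient_descent(X, y)
print("cost before:", cost([0.0, 0.0, 0.0], X, y))
print("cost after: ", cost(theta, X, y))
```

In practice you would also scale the features (amounts in dollars and hours of the day sit on very different ranges) and stop when the cost stops improving rather than after a fixed number of steps.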
Hopefully this helps better guide how you can use Logistic Regression to predict the probability of a discrete outcome occurring.
Starting with some training data of input variables x1 and x2, and respective binary outputs for y = 0 or 1, you use a learning algorithm like Gradient Descent to find the parameters θ0, θ1, and θ2 that give the lowest Cost, and then use those parameters to compute the probability of the positive class as P(1) = 1 / (1 + e^(-(θ0 + θ1*x1 + θ2*x2))).
Note that in this post we focused on using Logistic Regression for Binary Classification, wherein we only looked at two discrete outcomes (0 or 1). If you have more than two discrete outcomes, or a Multiclass Classification problem, you can actually still reuse most of the logic and models covered in this post.
Say you are trying to distinguish between 3 outcomes, e.g. fraudulent vs. valid vs. refunded transactions. To solve this Multiclass problem, you’d basically create 3 separate logistic regression models (an approach often called one-vs-rest or one-vs-all): the 1st by separating fraud vs. valid + refunded, the 2nd for valid vs. fraud + refunded, and the 3rd for refunded vs. valid + fraud. Then for each example you want to predict, you run all 3 models and simply choose the class whose model outputs the highest predicted probability.
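The prediction step of this approach can be sketched as follows. The three parameter sets are hand-picked, hypothetical values rather than anything fitted to data. In practice each one would come from training a separate binary model as described earlier.

```python
import math

def probability(theta, x):
    """P(1) from one binary Logistic Regression model."""
    z = theta[0] + theta[1] * x[0] + theta[2] * x[1]
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(models, x):
    """Return the label whose one-vs-rest model gives the highest probability.

    models maps each class label to that model's (theta0, theta1, theta2).
    """
    return max(models, key=lambda label: probability(models[label], x))

# Hypothetical parameters for [amount, time] inputs; not fitted to any data.
models = {
    "fraud":    (-4.0, 0.02, 0.05),
    "valid":    (2.0, -0.01, -0.05),
    "refunded": (-1.0, 0.001, 0.0),
}

print(predict_class(models, (100, 20)))
print(predict_class(models, (400, 22)))
```

Note that the three probabilities come from independent models, so they do not need to sum to 1. Only their ranking matters for the prediction.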
So whether you’re trying to solve a Binary or a Multiclass problem, Logistic Regression provides a powerful tool to predict the probability of any discrete outcome, and separate out which of your examples are in the positive or negative class of outcomes.
This blog post is based on concepts taught in Stanford’s Machine Learning course notes by Andrew Ng on Coursera.