One of the most common questions we ask of our data is how to estimate the value of something. How many items will we sell next month? How much does it cost to produce them? How much revenue will we make over the year?

You can often answer such questions with Machine Learning. As covered in our previous post on Supervised Machine Learning, if you have enough historical data on past outcomes, you can use it to predict future outcomes.

One of the most common Supervised Learning approaches to predicting a value is Linear Regression. In Linear Regression, the goal is to estimate a linear relationship between some set of inputs and the output value you are trying to predict. As part of our continuing ML 101 series, we’ll review the basic steps of Linear Regression, and show how you can use such an approach to predict any value in your own dataset.

The Model: Linear Regression


The fundamentals of Linear Regression are actually pretty straightforward.

The goal is to find a function that draws a linear relationship between a set of input features and the value we’d like to predict. This function can also be called a model.

For Linear Regression, we represent such a model with a function of the form:

ŷ = θ0 + θ1 * x
For simplicity throughout this post, we’ll focus on datasets with just one input feature. So in the function represented above, x is an input feature, ŷ is the predicted value for y, and θ0 and θ1 are parameters that we use to define the relationship between x and ŷ.

To further clarify, let’s look at an example of housing prices. Say we want to predict the price of a house, based on its size (in square feet). If we wanted to use a Linear Regression model to represent this relationship, we would denote the predicted house price as ŷ, and the house size as x, such that Price (predicted) = θ0 + θ1 * Size.

If we graph this out, the model will take the form of a line as noted in the figure above (and hence why this is called “Linear” Regression).
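As a quick sketch, here is how this model could be expressed in Python (the function name is my own, and the parameter values are the illustrative θ0 = 1000, θ1 = 200 used in the housing example later in this post):

```python
def predict(x, theta0, theta1):
    """Univariate linear model: y-hat = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Predicted price for a house of 1000 square feet,
# with illustrative parameters theta0=1000, theta1=200:
print(predict(1000, theta0=1000, theta1=200))  # 201000
```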

But how do we choose the values of the parameters θ0 and θ1 for our model, and fit an actual line to capture the relationship between x and ŷ?

Cost Functions: How to Evaluate the Accuracy of a Model


With Linear Regression, there are multiple models you can generate for the same set of data. You can designate many different values for θ0 and θ1 (as noted in the diagrams above), and generate different lines to model the relationship between the input x and prediction ŷ.

To choose the best model, you want to choose the values of θ0 and θ1 that fit a line most closely following the relationship between actual examples of x and y. But how do you measure the accuracy of such a line?

To measure the accuracy of a model, we use a Cost Function. A Cost Function for Linear Regression measures the error of your model: for each of your data examples, you take the difference between the predicted value ŷ and the actual value y.
In standard form (following the convention from Ng’s course notes), this Cost Function squares and averages those differences over the m examples:

J(θ0, θ1) = (1/2m) * Σ (ŷᵢ − yᵢ)²
Going back to our housing price example, say we have a set of houses with [size, price] pairs such as [1000, $200,000], [3000, $700,000], etc., and a model where we predict price = 1000 + 200 * (size). In this case, for the first house we predict price = 1000 + 200 * 1000 = $201,000, meaning our prediction is off by $1,000 (the actual recorded price for a house of size 1000 was $200,000).

If you square these errors (so that positive and negative errors don’t cancel out) and sum them across all your pricing examples, you’ll get the total Cost of your model for housing prices.
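As a concrete sketch, the Cost described above can be computed like this in Python (the function and variable names are my own; the 1/(2m) scaling follows the convention from Ng’s course notes):

```python
def cost(data, theta0, theta1):
    """Mean squared error of the model over (size, price) pairs."""
    m = len(data)
    total = 0.0
    for x, y in data:
        y_hat = theta0 + theta1 * x  # model prediction for this example
        total += (y_hat - y) ** 2    # squared error for this example
    return total / (2 * m)           # average, with the conventional 1/2 factor

houses = [(1000, 200_000), (3000, 700_000)]
print(cost(houses, 1000, 200))
```

A lower value of `cost` means the line fits the data more closely; comparing this value across different choices of θ0 and θ1 is exactly how we will rank candidate models.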

Gradient Descent: How to Algorithmically Reduce Your Cost Function

Now that we know you can use a Cost Function to evaluate the effectiveness of a model, we still need to find the model with the lowest Cost, or error rate.

One way we could do this is to simply generate many different models, that is, many different combinations of the parameters θ0 and θ1, and see which yields the smallest value of the Cost Function. But done manually, this could take a really long time.

There is a quicker way to do this, which is to use an algorithm called Gradient Descent.

Gradient Descent works by stepping through different values of θ0 and θ1, adjusting each by a factor of the Cost Function, until you reach convergence: the point where any further change to θ0 and θ1 no longer reduces the error of the model. The values of θ0 and θ1 at convergence are the optimal values for the Linear Regression model.

The algorithm you’d use to find these step-wise increments involves some calculus and partial derivatives, but summarizing the end output, you repeat the following updates until convergence:

θ0 := θ0 − α * (1/m) * Σ (ŷᵢ − yᵢ)
θ1 := θ1 − α * (1/m) * Σ (ŷᵢ − yᵢ) * xᵢ

Here α is the learning rate (the size of each step), m is the number of training examples, and the sums run over all examples.
More simply, what this algorithm says is that if you start with some random values for the parameters θ0 and θ1, you can repeatedly pick a new set of values based on the relative error, or Cost, produced by the current version of the model. The algorithm runs through many iterations, stepping the parameters in proportion to the error, until the Cost stops decreasing.
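Putting these pieces together, a minimal gradient-descent loop for the univariate case might look like the following sketch (the learning rate, iteration count, and toy data are illustrative choices of mine, not values from this post):

```python
def gradient_descent(data, alpha=0.01, iterations=1000):
    """Fit theta0, theta1 by batch gradient descent on mean squared error."""
    theta0, theta1 = 0.0, 0.0
    m = len(data)
    for _ in range(iterations):
        # Average gradient of the cost with respect to each parameter.
        grad0 = sum((theta0 + theta1 * x) - y for x, y in data) / m
        grad1 = sum(((theta0 + theta1 * x) - y) * x for x, y in data) / m
        # Step each parameter against its gradient, scaled by alpha.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Tiny example where y = 2x exactly; the fit should approach theta1 ≈ 2.
points = [(1, 2), (2, 4), (3, 6)]
t0, t1 = gradient_descent(points, alpha=0.05, iterations=5000)
print(round(t0, 4), round(t1, 4))
```

Note that the learning rate α matters in practice: too large and the parameters overshoot and diverge, too small and convergence takes many more iterations.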


Hopefully this gives you a better sense of how you can use Linear Regression to predict a value.

Starting with an input variable x and respective output y, you can use a learning algorithm like Gradient Descent to find the parameters θ0 and θ1 that yield the lowest error when modeling the linear relationship ŷ = θ0 + θ1 * x.

It should be noted in this post we focused on Univariate Linear Regression, wherein there was only one input feature x1 into the model. More often, you will have many input features, and will need what is referred to as Multivariate Linear Regression. Other times the input features won’t exactly follow a linear relationship, and you’ll need to use polynomial adaptations to fit the Linear Regression appropriately.

These additional cases require a little more calculus and linear algebra to solve. But the concepts underlying their solutions are more or less the same as the basic Linear Regression models covered above – fitting a predictive model to explain the generally linear relationship between a set of input features and output value.

This blog post is based on concepts taught in Stanford’s Machine Learning course notes by Andrew Ng on Coursera.