
Simple Linear Regression
Regression is one of the most fundamental concepts in machine learning and one of the simplest prediction algorithms. It involves investigating the relationship between quantitative variables and using that relationship to predict their values.
In simple linear regression there are two variables of interest: the independent variable and the dependent variable. We typically use a dataset of these two variables to outline the relationship between them, and eventually use values of the independent variable to predict values of the dependent variable.
For example, we may be interested in examining the relationship between humidity and the amount of rainfall. We may also be interested in predicting the amount of rainfall given the value of humidity in an area. In this example, the humidity level is the independent variable, while the amount of rainfall is the dependent variable.
Definition
Suppose we are given a dataset of $n$ pairs of values. Each pair includes the value of our independent variable $x_i$ and of our dependent variable $y_i$:

$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$$

We can hypothesize the following relationship between $x$ and $y$:

$$y_i = w_0 + w_1 x_i + \epsilon_i$$

where:
$w_0$ is the intercept, or the bias
$w_1$ is the regression coefficient, or the weight of $x$ with respect to $y$
$\epsilon_i$ is the error in our prediction
The goal of regression is finding the correct values for $w_0$ and $w_1$.
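To make the hypothesized model concrete, here is a minimal sketch of it as a Python function (the name predict_one is illustrative, not part of the exercise):

```python
def predict_one(x, w0, w1):
    """Evaluate the hypothesized linear model y = w0 + w1 * x at a single point.

    The error term eps_i is whatever remains between this prediction and the
    observed value; the model itself captures only the linear part.
    """
    return w0 + w1 * x

# For example, with intercept w0 = 3 and weight w1 = 2:
print(predict_one(1, w0=3, w1=2))   # 5
print(predict_one(-2, w0=3, w1=2))  # -1
```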
Loss Function
But what values are correct? In fact, a better question to ask would be: what values are incorrect? We need a way of evaluating how good or bad any set of values for $w_0$ and $w_1$ is. To this end, it is necessary to define a loss function $L$ (also known as a cost function).
So what should our loss function be? There is no right answer to this question, and our choice can (and should) depend on the context of our application. One of the most commonly used functions is the squared-loss function, and that's the one we'll use in this exercise:

$$L(w_0, w_1) = \sum_{i=1}^{n} \big(y_i - (w_0 + w_1 x_i)\big)^2$$

Our objective thus becomes choosing the $w_0$ and $w_1$ values that minimize the value of our loss function $L$.
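As a quick illustration, the squared loss can be computed directly from its definition. A minimal sketch, assuming plain Python lists (the name squared_loss is illustrative):

```python
def squared_loss(w0, w1, x, y):
    """Sum of squared residuals of the line y = w0 + w1 * x over the dataset."""
    return sum((yi - (w0 + w1 * xi)) ** 2 for xi, yi in zip(x, y))

# The loss is zero only when the line passes through every point exactly:
print(squared_loss(3, 2, [1, -2], [5, -1]))  # 0  -> perfect fit
print(squared_loss(0, 1, [1, -2], [5, -1]))  # (5-1)^2 + (-1-(-2))^2 = 17
```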
Minimizing the Loss Function
So we now have our loss function, but how do we minimize it with respect to $w_0$ and $w_1$? We can take the partial derivatives, of course!
Setting the partial derivative with respect to $w_0$ to $0$, we get:

$$\frac{\partial L}{\partial w_0} = -2 \sum_{i=1}^{n} \big(y_i - (w_0 + w_1 x_i)\big) = 0$$

$$w_0 = \bar{y} - w_1 \bar{x}$$

where:
$\bar{x}$ is the mean of $x$
$\bar{y}$ is the mean of $y$
Similarly, we set the partial derivative with respect to $w_1$ to $0$:

$$\frac{\partial L}{\partial w_1} = -2 \sum_{i=1}^{n} x_i \big(y_i - (w_0 + w_1 x_i)\big) = 0$$

Substituting $w_0 = \bar{y} - w_1 \bar{x}$ and noting that $\sum_{i} \bar{x}(y_i - \bar{y}) = 0$ and $\sum_{i} \bar{x}(x_i - \bar{x}) = 0$ (so the sums can be centered), we can solve for $w_1$.
Thus, our optimal values of $w_0$ and $w_1$ are given by:

$$w_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}$$
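These closed-form estimates translate directly into code. A minimal sketch, again assuming plain Python lists (the name fit_line is illustrative):

```python
def fit_line(x, y):
    """Return (w0, w1) minimizing the squared loss, via the closed-form solution."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # w1 = sum((xi - x_mean) * (yi - y_mean)) / sum((xi - x_mean)^2)
    w1 = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
          / sum((xi - x_mean) ** 2 for xi in x))
    # w0 = y_mean - w1 * x_mean
    w0 = y_mean - w1 * x_mean
    return w0, w1

print(fit_line([1, -2], [5, -1]))  # (3.0, 2.0), matching Test Case 1 below
```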
Performance Metric
Once we have a regression model, how do we evaluate its performance? One way might be to use the loss function itself, but because the squared loss is unbounded, its value is difficult to interpret intuitively.
One of the most common goodness-of-fit measures is the $R^2$ value, also known as the coefficient of determination:

$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$$

where:
$SS_{\text{res}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the residual sum of squares, or the amount of unexplained variation (with $\hat{y}_i$ the predicted value)
$SS_{\text{tot}} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares, or the total amount of variation
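The $R^2$ formula is just as direct to compute. A short sketch (the name r_squared is illustrative):

```python
def r_squared(y, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - ypi) ** 2 for yi, ypi in zip(y, y_pred))  # unexplained variation
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)                 # total variation
    return 1 - ss_res / ss_tot

print(r_squared([5, -1], [5, -1]))  # 1.0 -> the predictions explain all variation
```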
Exercise
Write a program that performs simple linear regression. Complete the fit(), predict(), and score() methods:
fit() takes in arrays x and y, and computes and stores the regression coefficient and intercept
predict() takes in array x and returns a predicted array y_pred using the stored coefficient and intercept
score() takes in arrays x and y, predicts the y_pred values using x, and then uses the y_pred and y arrays to return the $R^2$ performance metric
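One possible solution sketch, assuming a plain sklearn-style class (the class name SimpleLinearRegression and the attribute names w0 and w1 are assumptions, not given by the exercise):

```python
class SimpleLinearRegression:
    """Simple linear regression fit with the closed-form least-squares solution."""

    def fit(self, x, y):
        # Closed-form estimates derived in the sections above.
        n = len(x)
        x_mean = sum(x) / n
        y_mean = sum(y) / n
        self.w1 = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
                   / sum((xi - x_mean) ** 2 for xi in x))
        self.w0 = y_mean - self.w1 * x_mean
        return self

    def predict(self, x):
        # Apply the stored intercept and coefficient to each point.
        return [self.w0 + self.w1 * xi for xi in x]

    def score(self, x, y):
        # R^2 = 1 - SS_res / SS_tot, using predictions on x against y.
        y_pred = self.predict(x)
        y_mean = sum(y) / len(y)
        ss_res = sum((yi - ypi) ** 2 for yi, ypi in zip(y, y_pred))
        ss_tot = sum((yi - y_mean) ** 2 for yi in y)
        return 1 - ss_res / ss_tot
```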
Sample Test Cases
Test Case 1
Input:
[
[ 1,-2],
[ 5,-1],
[ 1,-2],
[ 5,-1]
]
Output:
w0 = 3
w1 = 2
y_pred = [5, -1]
r_squared = 1.0
Test Case 2
Input:
[
[1, 4, 3],
[2, 9, 6],
[2, 6,-2],
[4,16,-5]
]
Output:
w0 = -0.429
w1 = 2.286
y_pred = [4.143, 13.286, -5]
r_squared = 0.9667
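Judging from the outputs, the four input rows appear to be the x and y arrays passed to fit() followed by the x and y arrays passed to predict() and score(); under that assumption, a driver for Test Case 2 might look like this:

```python
model = SimpleLinearRegression()
x_fit, y_fit = [1, 4, 3], [2, 9, 6]
x_test, y_test = [2, 6, -2], [4, 16, -5]

model.fit(x_fit, y_fit)
print(f"w0 = {model.w0:.3f}")                        # w0 = -0.429
print(f"w1 = {model.w1:.3f}")                        # w1 = 2.286
print([round(p, 3) for p in model.predict(x_test)])  # [4.143, 13.286, -5.0]
print(round(model.score(x_test, y_test), 4))         # 0.9667
```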