Ridge Regression
Now that we've mastered multiple linear regression, we can move on to what is known as ridge regression.
Definition
The definition of the variables in ridge regression is the same as what we saw in multiple linear regression.
Suppose we are given a matrix $X$ that holds the values of our independent variables. $X$ is of size $n \times d$, where $n$ is the number of data points we have, and $d$ is the number of independent variables we have:

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix}$$

We are also given a vector of targets $y$:

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$

Each row in $X$ corresponds to a single element in $y$. Thus, we can hypothesize the following relationship between a single row $x_i$ in $X$ and a single element $y_i$ in $y$:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_d x_{id} + \epsilon_i$$

where:

- $\beta_0$ is the intercept, or the bias
- $\beta_1, \ldots, \beta_d$ is the set of $d$ regression coefficients, or the weights of $y$ with respect to $X$
- $\epsilon_i$ is the error in our prediction
For convenience, we can define a matrix $X'$ that is constructed by padding a column of ones to the left-hand side of $X$:

$$X' = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1d} \\ 1 & x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix}$$

Note that $X'$ is of size $n \times (d + 1)$.

We also define a vector $w$ as the series of weights:

$$w = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \end{bmatrix}$$

Note that $w$ is a vector of length $d + 1$.

Now, we can simplify our hypothesis in matrix form:

$$y = X'w + \epsilon$$

where $\epsilon$ is a vector of errors of length $n$:

$$\epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$$

The goal of regression is finding the correct values for $w$.
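To make the matrix form concrete, here is a minimal NumPy sketch that pads the column of ones and evaluates $X'w$ for an arbitrary weight vector. The names `X_prime` and `w` are just local to this example, not a required interface:

```python
import numpy as np

# Toy data: n = 3 data points, d = 2 independent variables.
X = np.array([[2.0, 4.0],
              [3.0, 7.0],
              [5.0, 2.0]])

# Pad a column of ones on the left to get X' of size n x (d + 1).
X_prime = np.hstack([np.ones((X.shape[0], 1)), X])

# An arbitrary weight vector w = [beta_0, beta_1, beta_2] of length d + 1.
w = np.array([-1.0, 0.5, 0.25])

# The matrix-form hypothesis: predictions are simply X' @ w.
y_hat = X_prime @ w
print(y_hat)  # -> [1.0, 2.25, 2.0]
```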
Regularization
Before we define our loss function, let's explore a core concept in statistics, known as regularization. Regularization is the general set of measures for reducing overfitting in prediction models. Often, this is achieved by trading accuracy for an improvement in generalizability.
In the case of linear regression, overfitting can occur when our independent variables, $X$, correlate with each other (a condition also known as multicollinearity), resulting in our weights, $w$, becoming too large. This can make the regression estimates highly sensitive to small changes in the independent variables.
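As a rough illustration of this sensitivity, here is a small synthetic sketch using NumPy; the data and names are purely illustrative and not part of the exercise below. It fits ordinary least squares on two nearly identical columns, then refits after a tiny perturbation of the second column:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Two almost perfectly correlated independent variables (multicollinearity).
x1 = rng.normal(size=n)
x2 = x1 + 1e-4 * rng.normal(size=n)
y = 3 * x1 + rng.normal(scale=0.1, size=n)   # the "true" weight is 3

def ols_weights(x_second):
    """Ordinary least squares fit of y on [1, x1, x_second]."""
    X_prime = np.column_stack([np.ones(n), x1, x_second])
    w, *_ = np.linalg.lstsq(X_prime, y, rcond=None)
    return w

# The individual weights are typically far larger than the true weight of 3,
# and they change drastically when the second column is perturbed slightly.
print(ols_weights(x2))
print(ols_weights(x1 + 1e-4 * rng.normal(size=n)))
```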
So how do we deal with this? Let's take a look at how we can modify the loss function to penalize large weights.
Loss Function
In the case of multiple linear regression, our loss function was:

$$L(w) = \lVert y - X'w \rVert_2^2$$

where $\lVert \cdot \rVert_2$ is the L2-norm:

$$\lVert v \rVert_2 = \sqrt{\sum_{i=1}^{n} v_i^2}$$

For ridge regression, we include an additional term that imposes a penalty when weights are too high:

$$L(w) = \lVert y - X'w \rVert_2^2 + \lambda \lVert w \rVert_2^2$$

where $\lambda$ is known as the regularization constant.
Note: The regularization constant is what's known as a hyperparameter. The optimal value for the regularization constant would depend on the dataset we are working with. We often use validation methods, such as K-fold cross-validation, to determine optimal values of hyperparameters.
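As a quick sanity check, here is a small sketch that evaluates this loss directly from the definition; `ridge_loss` and the sample numbers are purely illustrative:

```python
import numpy as np

def ridge_loss(X_prime, y, w, lam):
    """Ridge loss: squared L2 residual plus lambda times the squared L2 norm of w."""
    residual = y - X_prime @ w
    return residual @ residual + lam * (w @ w)

X_prime = np.array([[1.0, 1.0],
                    [1.0, 4.0],
                    [1.0, 3.0]])
y = np.array([2.0, 9.0, 6.0])
w = np.array([0.0, 2.0])

print(ridge_loss(X_prime, y, w, lam=0.0))  # 1.0 -> pure least-squares loss
print(ridge_loss(X_prime, y, w, lam=0.5))  # 3.0 -> penalty adds 0.5 * (0^2 + 2^2)
```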
Minimizing the Loss Function
Now that we have our loss function, we can work on optimizing $w$ to minimize the loss:

$$L(w) = (y - X'w)^T (y - X'w) + \lambda w^T w$$

$$L(w) = y^T y - y^T X' w - w^T X'^T y + w^T X'^T X' w + \lambda w^T w$$

Note that both $y^T X' w$ and $w^T X'^T y$ are scalar and equal. Thus we can combine them:

$$L(w) = y^T y - 2 w^T X'^T y + w^T X'^T X' w + \lambda w^T w$$

Now we can take the partial derivative of the loss function with respect to $w$:

$$\frac{\partial L}{\partial w} = -2 X'^T y + 2 X'^T X' w + 2 \lambda w$$

Setting the partial derivative to zero, we get:

$$\left( X'^T X' + \lambda I \right) w = X'^T y$$

$$w = \left( X'^T X' + \lambda I \right)^{-1} X'^T y$$

where $I$ is the $(d + 1) \times (d + 1)$ identity matrix.
And that's it! The optimal value for $w$ is the one that satisfies this equation.
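Here is a minimal sketch of this closed-form solution, assuming NumPy; the helper name `solve_ridge` is illustrative, and `np.linalg.solve` is used rather than forming the inverse explicitly, which is generally better numerically:

```python
import numpy as np

def solve_ridge(X, y, lam):
    """Closed-form ridge solution: w = (X'^T X' + lam * I)^(-1) X'^T y."""
    X_prime = np.hstack([np.ones((X.shape[0], 1)), X])   # pad column of ones
    d_plus_1 = X_prime.shape[1]
    A = X_prime.T @ X_prime + lam * np.eye(d_plus_1)
    b = X_prime.T @ y
    return np.linalg.solve(A, b)                          # solve A w = b

# Data from Sample Test Case 1 below (lambda = 0 reduces to ordinary least squares).
w = solve_ridge(np.array([[1.0], [4.0], [3.0]]), np.array([2.0, 9.0, 6.0]), lam=0.0)
print(np.round(w, 3))  # [-0.429  2.286]
```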
Performance Metric
The performance metric for ridge regression remains the same as that of multiple linear regression. Once again, we can compute the $R^2$ value, also known as the coefficient of determination:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

where:

- $\bar{y}$ is the mean of $y$
- $SS_{res}$ is the residual sum of squares, or the amount of unexplained variation
- $SS_{tot}$ is the total sum of squares, or the total amount of variation
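A short sketch of this metric, assuming NumPy arrays for the true and predicted values; `r_squared` is just an illustrative name:

```python
import numpy as np

def r_squared(y, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_pred) ** 2)        # unexplained variation
    ss_tot = np.sum((y - np.mean(y)) ** 2)    # total variation
    return 1.0 - ss_res / ss_tot

# Values from Sample Test Case 1 below.
y = np.array([4.0, 16.0, -5.0])
y_pred = np.array([4.143, 13.286, -5.0])
print(round(r_squared(y, y_pred), 4))  # 0.9667
```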
Exercise
Write a program that performs ridge regression. Complete the fit(), predict(), and score() methods:
- fit() takes in arrays X and y, and regularization constant Lambda, and computes and stores the weights w
- predict() takes in array X and returns a predicted array y_pred using the stored weights w
- score() takes in arrays X and y, predicts the y_pred values using X and the stored weights w, and then uses the y_pred and y arrays to compute and return the R^2 performance metric

A reference sketch is provided after the sample test cases below.
Sample Test Cases
Test Case 1
Input:
[
[[ 1],
[ 4],
[ 3]],
[ 2, 9, 6],
[[ 2],
[ 6],
[-2]],
[ 4,16,-5],
0
]
Output:
w = [-0.429, 2.286]
y_pred = [4.143, 13.286, -5]
r_squared = 0.9667
Test Case 2
Input:
[
[[2,4],
[3,7],
[5,2]],
[1,5,3],
[[ 0,-1],
[ 4, 3]],
[-4, 2],
0.3
]
Output:
w = [-1.277, 0.618, 0.541]
y_pred = [-1.818, 2.818]
r_squared = 0.698
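For reference, here is one possible sketch of such a class, assuming NumPy and the closed-form solution derived above; it reproduces the sample outputs, but it is only one way to complete the exercise:

```python
import numpy as np

class RidgeRegression:
    def __init__(self):
        self.w = None

    def _pad_ones(self, X):
        """Pad a column of ones to the left of X to form X'."""
        X = np.asarray(X, dtype=float)
        return np.hstack([np.ones((X.shape[0], 1)), X])

    def fit(self, X, y, Lambda):
        """Compute and store w = (X'^T X' + Lambda * I)^(-1) X'^T y."""
        X_prime = self._pad_ones(X)
        y = np.asarray(y, dtype=float)
        A = X_prime.T @ X_prime + Lambda * np.eye(X_prime.shape[1])
        self.w = np.linalg.solve(A, X_prime.T @ y)

    def predict(self, X):
        """Return the predictions X' w using the stored weights."""
        return self._pad_ones(X) @ self.w

    def score(self, X, y):
        """Return the R^2 coefficient of determination on (X, y)."""
        y = np.asarray(y, dtype=float)
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1.0 - ss_res / ss_tot

# Sample Test Case 1: fit on the first (X, y) pair, evaluate on the second.
model = RidgeRegression()
model.fit([[1], [4], [3]], [2, 9, 6], Lambda=0)
print(np.round(model.w, 3))                           # [-0.429  2.286]
print(np.round(model.predict([[2], [6], [-2]]), 3))   # [ 4.143 13.286 -5.   ]
print(round(model.score([[2], [6], [-2]], [4, 16, -5]), 4))  # 0.9667
```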
