Ridge Regression
Now that we've mastered multiple linear regression, we can move on to what is known as ridge regression.
Definition
The definition of the variables in ridge regression is the same as what we saw in multiple linear regression.
Suppose we are given a matrix $X$ that holds the values of our independent variables. $X$ is of size $n \times d$, where $n$ is the number of data points we have, and $d$ is the number of independent variables we have:

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix}$$

We are also given a vector of targets $y$:

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$

Each row in $X$ corresponds to a single element in $y$. Thus, we can hypothesize the following relationship between a single row $x_i$ in $X$ and a single element $y_i$ in $y$:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_d x_{id} + \epsilon_i$$

where:

- $\beta_0$ is the intercept, or the bias
- $\beta_1, \ldots, \beta_d$ is the set of $d$ regression coefficients, or the weights of $y$ with respect to $X$
- $\epsilon_i$ is the error in our prediction
For convenience, we can define a matrix $X'$ that is constructed by padding a column of ones to the left-hand side of $X$:

$$X' = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1d} \\ 1 & x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix}$$

Note that $X'$ is of size $n \times (d + 1)$.

We also define a vector $w$ as the series of weights:

$$w = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \end{bmatrix}$$

Note that $w$ is a vector of length $d + 1$.

Now, we can simplify our hypothesis in matrix form:

$$y = X'w + \epsilon$$

where $\epsilon$ is a vector of errors of length $n$:

$$\epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$$

The goal of regression is finding the correct values for $w$.
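To make the matrix form concrete, here is a minimal NumPy sketch that pads the column of ones and evaluates $X'w$ for an arbitrary weight vector. The names `X_prime` and `w` are just local to this example, not a required interface:

```python
import numpy as np

# Toy data: n = 3 data points, d = 2 independent variables.
X = np.array([[2.0, 4.0],
              [3.0, 7.0],
              [5.0, 2.0]])

# Pad a column of ones on the left to get X' of size n x (d + 1).
X_prime = np.hstack([np.ones((X.shape[0], 1)), X])

# An arbitrary weight vector w = [beta_0, beta_1, beta_2] of length d + 1.
w = np.array([-1.0, 0.5, 0.25])

# The matrix-form hypothesis: predictions are simply X' @ w.
y_hat = X_prime @ w
print(y_hat)  # -> [1.0, 2.25, 2.0]
```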
Regularization
Before we define our loss function, let's explore a core concept in statistics, known as regularization. Regularization is the general set of measures for reducing overfitting in prediction models. Often, this is achieved by trading accuracy for an improvement in generalizability.
In the case of linear regression, overfitting can occur when our independent variables, $X$, correlate with each other (a condition also known as multicollinearity), resulting in our weights, $w$, becoming too large. This can make the regression estimates highly sensitive to small changes in the independent variables.
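As a rough illustration of this sensitivity, here is a small synthetic sketch using NumPy; the data and names are purely illustrative and not part of the exercise below. It fits ordinary least squares on two nearly identical columns, then refits after a tiny perturbation of the second column:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Two almost perfectly correlated independent variables (multicollinearity).
x1 = rng.normal(size=n)
x2 = x1 + 1e-4 * rng.normal(size=n)
y = 3 * x1 + rng.normal(scale=0.1, size=n)   # the "true" weight is 3

def ols_weights(x_second):
    """Ordinary least squares fit of y on [1, x1, x_second]."""
    X_prime = np.column_stack([np.ones(n), x1, x_second])
    w, *_ = np.linalg.lstsq(X_prime, y, rcond=None)
    return w

# The individual weights are typically far larger than the true weight of 3,
# and they change drastically when the second column is perturbed slightly.
print(ols_weights(x2))
print(ols_weights(x1 + 1e-4 * rng.normal(size=n)))
```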
So how do we deal with this? Let's take a look at how we can modify the loss function to penalize large weights.
Loss Function
In the case of multiple linear regression, our loss function was:

$$L(w) = \lVert y - X'w \rVert_2^2$$

where $\lVert \cdot \rVert_2$ is the L2-norm:

$$\lVert v \rVert_2 = \sqrt{\sum_{i=1}^{n} v_i^2}$$

For ridge regression, we include an additional term that imposes a penalty when weights are too high:

$$L(w) = \lVert y - X'w \rVert_2^2 + \lambda \lVert w \rVert_2^2$$

where $\lambda$ is known as the regularization constant.
Note: The regularization constant is what's known as a hyperparameter. The optimal value for the regularization constant would depend on the dataset we are working with. We often use validation methods, such as K-fold cross-validation, to determine optimal values of hyperparameters.
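As a quick sanity check, here is a small sketch that evaluates this loss directly from the definition; `ridge_loss` and the sample numbers are purely illustrative:

```python
import numpy as np

def ridge_loss(X_prime, y, w, lam):
    """Ridge loss: squared L2 residual plus lambda times the squared L2 norm of w."""
    residual = y - X_prime @ w
    return residual @ residual + lam * (w @ w)

X_prime = np.array([[1.0, 1.0],
                    [1.0, 4.0],
                    [1.0, 3.0]])
y = np.array([2.0, 9.0, 6.0])
w = np.array([0.0, 2.0])

print(ridge_loss(X_prime, y, w, lam=0.0))  # 1.0 -> pure least-squares loss
print(ridge_loss(X_prime, y, w, lam=0.5))  # 3.0 -> penalty adds 0.5 * (0^2 + 2^2)
```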
Minimizing the Loss Function
Now that we have our loss function, we can work on optimizing $w$ to minimize the loss:

$$L(w) = (y - X'w)^T (y - X'w) + \lambda w^T w$$

$$L(w) = y^T y - y^T X' w - w^T X'^T y + w^T X'^T X' w + \lambda w^T w$$

Note that both $y^T X' w$ and $w^T X'^T y$ are scalar and equal. Thus we can combine them:

$$L(w) = y^T y - 2 w^T X'^T y + w^T X'^T X' w + \lambda w^T w$$

Now we can take the partial derivative of the loss function with respect to $w$:

$$\frac{\partial L}{\partial w} = -2 X'^T y + 2 X'^T X' w + 2 \lambda w$$

Setting the partial derivative to zero, we get:

$$\left( X'^T X' + \lambda I \right) w = X'^T y$$

$$w = \left( X'^T X' + \lambda I \right)^{-1} X'^T y$$

where $I$ is the $(d + 1) \times (d + 1)$ identity matrix.
And that's it! The optimal value for $w$ is the one that satisfies this equation.
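Here is a minimal sketch of this closed-form solution, assuming NumPy; the helper name `solve_ridge` is illustrative, and `np.linalg.solve` is used rather than forming the inverse explicitly, which is generally better numerically:

```python
import numpy as np

def solve_ridge(X, y, lam):
    """Closed-form ridge solution: w = (X'^T X' + lam * I)^(-1) X'^T y."""
    X_prime = np.hstack([np.ones((X.shape[0], 1)), X])   # pad column of ones
    d_plus_1 = X_prime.shape[1]
    A = X_prime.T @ X_prime + lam * np.eye(d_plus_1)
    b = X_prime.T @ y
    return np.linalg.solve(A, b)                          # solve A w = b

# Data from Sample Test Case 1 below (lambda = 0 reduces to ordinary least squares).
w = solve_ridge(np.array([[1.0], [4.0], [3.0]]), np.array([2.0, 9.0, 6.0]), lam=0.0)
print(np.round(w, 3))  # [-0.429  2.286]
```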
Performance Metric
The performance metric for ridge regression remains the same as that of multiple linear regression. Once again, we can compute the $R^2$ value, also known as the coefficient of determination:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

where:

- $\bar{y}$ is the mean of $y$
- $SS_{res}$ is the residual sum of squares, or the amount of unexplained variation
- $SS_{tot}$ is the total sum of squares, or the total amount of variation
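A short sketch of this metric, assuming NumPy arrays for the true and predicted values; `r_squared` is just an illustrative name:

```python
import numpy as np

def r_squared(y, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_pred) ** 2)        # unexplained variation
    ss_tot = np.sum((y - np.mean(y)) ** 2)    # total variation
    return 1.0 - ss_res / ss_tot

# Values from Sample Test Case 1 below.
y = np.array([4.0, 16.0, -5.0])
y_pred = np.array([4.143, 13.286, -5.0])
print(round(r_squared(y, y_pred), 4))  # 0.9667
```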
Exercise
Write a program that performs ridge regression. Complete the fit(), predict(), and score() methods:
- fit() takes in arrays X and y, and regularization constant Lambda, and computes and stores the weights w
- predict() takes in array X and returns a predicted array y_pred using the stored weights w
- score() takes in arrays X and y, predicts the y_pred values using X and the stored weights w, and then uses the y_pred and y arrays to compute and return the R^2 performance metric

A reference sketch is provided after the sample test cases below.
Sample Test Cases
Test Case 1
Input:
[
[[ 1],
[ 4],
[ 3]],
[ 2, 9, 6],
[[ 2],
[ 6],
[-2]],
[ 4,16,-5],
0
]
Output:
w = [-0.429, 2.286]
y_pred = [4.143, 13.286, -5]
r_squared = 0.9667
Test Case 2
Input:
[
[[2,4],
[3,7],
[5,2]],
[1,5,3],
[[ 0,-1],
[ 4, 3]],
[-4, 2],
0.3
]
Output:
w = [-1.277, 0.618, 0.541]
y_pred = [-1.818, 2.818]
r_squared = 0.698
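For reference, here is one possible sketch of such a class, assuming NumPy and the closed-form solution derived above; it reproduces the sample outputs, but it is only one way to complete the exercise:

```python
import numpy as np

class RidgeRegression:
    def __init__(self):
        self.w = None

    def _pad_ones(self, X):
        """Pad a column of ones to the left of X to form X'."""
        X = np.asarray(X, dtype=float)
        return np.hstack([np.ones((X.shape[0], 1)), X])

    def fit(self, X, y, Lambda):
        """Compute and store w = (X'^T X' + Lambda * I)^(-1) X'^T y."""
        X_prime = self._pad_ones(X)
        y = np.asarray(y, dtype=float)
        A = X_prime.T @ X_prime + Lambda * np.eye(X_prime.shape[1])
        self.w = np.linalg.solve(A, X_prime.T @ y)

    def predict(self, X):
        """Return the predictions X' w using the stored weights."""
        return self._pad_ones(X) @ self.w

    def score(self, X, y):
        """Return the R^2 coefficient of determination on (X, y)."""
        y = np.asarray(y, dtype=float)
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1.0 - ss_res / ss_tot

# Sample Test Case 1: fit on the first (X, y) pair, evaluate on the second.
model = RidgeRegression()
model.fit([[1], [4], [3]], [2, 9, 6], Lambda=0)
print(np.round(model.w, 3))                           # [-0.429  2.286]
print(np.round(model.predict([[2], [6], [-2]]), 3))   # [ 4.143 13.286 -5.   ]
print(round(model.score([[2], [6], [-2]], [4, 16, -5]), 4))  # 0.9667
```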
