
Multiple Linear Regression
Now that we've completed simple linear regression, we can move on to multiple linear regression, or in other words, linear regression with multiple independent variables.
Definition
Suppose we are given a matrix $X$ that holds the values of our independent variables. $X$ is of size $n \times d$, where $n$ is the number of data points we have, and $d$ is the number of independent variables we have:

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix}$$

We are also given a vector of targets $y$:

$$y = \begin{bmatrix} y_1 & y_2 & \cdots & y_n \end{bmatrix}^T$$

Each row in $X$ corresponds to a single element in $y$. Thus, we can hypothesize the following relationship between a single row $x_i$ in $X$ and a single element $y_i$ in $y$:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_d x_{id} + \epsilon_i$$

where:

- $\beta_0$ is the intercept, or the bias
- $\beta_1, \beta_2, \ldots, \beta_d$ is the set of $d$ regression coefficients, or the weights of $y$ with respect to $X$
- $\epsilon_i$ is the error in our prediction

For convenience, we can define a matrix $X'$ that is constructed by padding a column of ones to the left-hand side of $X$:

$$X' = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1d} \\ 1 & x_{21} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nd} \end{bmatrix}$$

Note that $X'$ is of size $n \times (d + 1)$.
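As a quick illustration of the padding step, here is a minimal NumPy sketch (the array values are arbitrary examples):

```python
import numpy as np

X = np.array([[2, 4],
              [3, 7],
              [5, 2]])  # n = 3 data points, d = 2 independent variables

# Pad a column of ones on the left so the first weight can act as the intercept
X_padded = np.hstack([np.ones((X.shape[0], 1)), X])
print(X_padded.shape)  # (3, 3), i.e. n x (d + 1)
```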
We also define vector $w$ as the series of weights:

$$w = \begin{bmatrix} \beta_0 & \beta_1 & \cdots & \beta_d \end{bmatrix}^T$$

Note that $w$ is a vector of length $d + 1$.

Now, we can simplify our hypothesis in matrix form:

$$y = X'w + \epsilon$$

where $\epsilon$ is a vector of errors of length $n$:

$$\epsilon = \begin{bmatrix} \epsilon_1 & \epsilon_2 & \cdots & \epsilon_n \end{bmatrix}^T$$

The goal of regression is finding the correct values for $w$.
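To see the matrix form in action, a single matrix-vector product computes all $n$ predictions at once (a small sketch with illustrative numbers):

```python
import numpy as np

# Illustrative values: 3 data points, 1 independent variable
X_padded = np.array([[1., 1.],
                     [1., 4.],
                     [1., 3.]])
w = np.array([-0.429, 2.286])  # [beta_0, beta_1]

# X'w produces every prediction in one matrix-vector product
print(X_padded @ w)  # [1.857, 8.715, 6.429]
```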
Loss Function
Similar to the case of simple linear regression, we need a loss function to evaluate how good/bad our solution for $w$ is. And once again, we will use the squared loss:

$$L(w) = \|y - X'w\|^2$$

where $\|\cdot\|$ is the L2-norm:

$$\|v\| = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}$$
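To make this concrete, the loss translates directly into code; a minimal sketch assuming NumPy, with `X_padded` as the ones-padded matrix from above and `w` an arbitrary candidate weight vector:

```python
import numpy as np

def squared_loss(X_padded, y, w):
    """L(w) = ||y - X'w||^2, the sum of squared residuals."""
    residuals = y - X_padded @ w
    return residuals @ residuals  # same as np.sum(residuals ** 2)
```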
Minimizing the Loss Function
Now that we have our loss function, we can work on optimizing $w$ to minimize the loss:

$$L(w) = (y - X'w)^T (y - X'w) = y^T y - y^T X' w - w^T X'^T y + w^T X'^T X' w$$

Note that both $y^T X' w$ and $w^T X'^T y$ are scalar and equal. Thus we can combine them:

$$L(w) = y^T y - 2 w^T X'^T y + w^T X'^T X' w$$

Now we can take the partial derivative of the loss function with respect to $w$:

$$\frac{\partial L}{\partial w} = -2 X'^T y + 2 X'^T X' w$$

Setting the partial derivative to zero, we get:

$$X'^T X' w = X'^T y \quad \Longrightarrow \quad w = (X'^T X')^{-1} X'^T y$$

And that's it! The optimal value for $w$ is the one that satisfies this equation.
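In code, this closed-form solution is a single linear solve; a minimal sketch assuming NumPy (solving the normal equations with `np.linalg.solve` avoids forming the explicit inverse, which is the numerically preferred route):

```python
import numpy as np

def solve_weights(X_padded, y):
    """Solve the normal equations X'^T X' w = X'^T y for w."""
    return np.linalg.solve(X_padded.T @ X_padded, X_padded.T @ y)
```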
Performance Metric
Now that we have a closed-form solution for linear regression, how do we evaluate it? Once again, similar to what we did for simple linear regression, we can compute the $R^2$ value, also known as the coefficient of determination:

$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$$

where:

- $\bar{y}$ is the mean of $y$
- $SS_{\text{res}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the residual sum of squares, or the amount of unexplained variation, where $\hat{y}_i$ denotes our prediction for $y_i$
- $SS_{\text{tot}} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares, or the total amount of variation
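A direct translation into NumPy might look like this (a sketch; `y_pred` denotes the model's predictions $\hat{y}$):

```python
import numpy as np

def r_squared(y, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_pred) ** 2)      # unexplained variation
    ss_tot = np.sum((y - np.mean(y)) ** 2)  # total variation
    return 1 - ss_res / ss_tot
```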
Exercise
Write a program that performs multiple linear regression. Complete the `fit()`, `predict()`, and `score()` methods (a sketch of one possible solution follows the list below):

- `fit()` takes in arrays `X` and `y`, and computes and stores the weights `w`
- `predict()` takes in array `X` and returns a predicted array `y_pred` using the stored weights `w`
- `score()` takes in arrays `X` and `y`, predicts the `y_pred` values using `X` and the stored weights `w`, and then uses the `y_pred` and `y` arrays to compute and return the $R^2$ performance metric
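For reference, here is one possible implementation sketch using NumPy; the class name and input handling are assumptions, since the exercise's actual scaffolding isn't shown:

```python
import numpy as np

class MultipleLinearRegression:
    def __init__(self):
        self.w = None

    def _pad(self, X):
        # Prepend a column of ones so the first weight acts as the intercept
        X = np.asarray(X, dtype=float)
        return np.hstack([np.ones((X.shape[0], 1)), X])

    def fit(self, X, y):
        # Closed-form solution of the normal equations: X'^T X' w = X'^T y
        Xp = self._pad(X)
        self.w = np.linalg.solve(Xp.T @ Xp, Xp.T @ np.asarray(y, dtype=float))

    def predict(self, X):
        # y_pred = X'w
        return self._pad(X) @ self.w

    def score(self, X, y):
        # R^2 = 1 - SS_res / SS_tot
        y = np.asarray(y, dtype=float)
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - ss_res / ss_tot
```

Each sample test case below bundles four arrays in the order `[X_fit, y_fit, X_score, y_score]`: fitting on the first pair and predicting/scoring on the second reproduces the expected `w`, `y_pred`, and `r_squared` values.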
Sample Test Cases
Test Case 1
Input:
[
[[ 1],
[ 4],
[ 3]],
[ 2, 9, 6],
[[ 2],
[ 6],
[-2]],
[ 4,16,-5]
]
Output:
w = [-0.429, 2.286]
y_pred = [4.143, 13.286, -5]
r_squared = 0.9667
Test Case 2
Input:
[
[[ 2, 4],
[ 3, 7],
[ 5, 2]],
[ 1, 5, 3],
[[ 0,-1],
[ 4, 3]],
[-4, 2]
]
Output:
w = [-5.182, 1.273, 0.909]
y_pred = [-6.091, 2.636]
r_squared = 0.735