
Simple Linear Regression
Regression is one of the most fundamental concepts in machine learning and one of the simplest prediction algorithms. It involves investigating the relationship between quantitative variables and using that relationship to predict their values.
In simple linear regression there are two variables of interest: the independent variable and the dependent variable. We typically use a dataset of these two variables to outline the relationship between them, and eventually use values of the independent variable to predict values of the dependent variable.
For example, we may be interested in examining the relationship between humidity and the amount of rainfall. We may also be interested in predicting the amount of rainfall given the value of humidity in an area. In this example, the humidity level is the independent variable, while the amount of rainfall is the dependent variable.
Definition
Suppose we are given a dataset of $n$ pairs of values. Each pair includes the value of our independent variable $x_i$ and of our dependent variable $y_i$:

$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$$

We can hypothesize the following relationship between $x$ and $y$:

$$y_i = w_0 + w_1 x_i + \epsilon_i$$

where:
$w_0$ is the intercept, or the bias
$w_1$ is the regression coefficient, or the weight of $x$ with respect to $y$
$\epsilon_i$ is the error in our prediction
The goal of regression is finding the correct values for $w_0$ and $w_1$.
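To make the hypothesized model concrete, here is a minimal sketch of it as a Python function (the name predict_one is illustrative, not part of the exercise):

```python
def predict_one(x, w0, w1):
    """Evaluate the hypothesized linear model y = w0 + w1 * x at a single point.

    The error term eps_i is whatever remains between this prediction and the
    observed value; the model itself captures only the linear part.
    """
    return w0 + w1 * x

# For example, with intercept w0 = 3 and weight w1 = 2:
print(predict_one(1, w0=3, w1=2))   # 5
print(predict_one(-2, w0=3, w1=2))  # -1
```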
Loss Function
But what values are correct? In fact, a better question to ask would be: what values are incorrect? We need a way of evaluating how good or bad any set of values for $w_0$ and $w_1$ is. To this end, it is necessary to define a loss function $L$ (also known as a cost function).
So what should our loss function be? There is no right answer to this question, and our choice can (and should) depend on the context of our application. One of the most commonly used functions is the squared-loss function, and that's the one we'll use in this exercise:

$$L(w_0, w_1) = \sum_{i=1}^{n} \big(y_i - (w_0 + w_1 x_i)\big)^2$$

Our objective thus becomes choosing the $w_0$ and $w_1$ values that minimize the value of our loss function $L$.
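As a quick illustration, the squared loss can be computed directly from its definition. A minimal sketch, assuming plain Python lists (the name squared_loss is illustrative):

```python
def squared_loss(w0, w1, x, y):
    """Sum of squared residuals of the line y = w0 + w1 * x over the dataset."""
    return sum((yi - (w0 + w1 * xi)) ** 2 for xi, yi in zip(x, y))

# The loss is zero only when the line passes through every point exactly:
print(squared_loss(3, 2, [1, -2], [5, -1]))  # 0  -> perfect fit
print(squared_loss(0, 1, [1, -2], [5, -1]))  # (5-1)^2 + (-1-(-2))^2 = 17
```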
Minimizing the Loss Function
So we now have our loss function, but how do we minimize it with respect to $w_0$ and $w_1$? We can take the partial derivatives, of course!
Setting the partial derivative with respect to $w_0$ to $0$, we get:

$$\frac{\partial L}{\partial w_0} = -2 \sum_{i=1}^{n} \big(y_i - (w_0 + w_1 x_i)\big) = 0$$

$$w_0 = \bar{y} - w_1 \bar{x}$$

where:
$\bar{x}$ is the mean of $x$
$\bar{y}$ is the mean of $y$
Similarly, we set the partial derivative with respect to $w_1$ to $0$:

$$\frac{\partial L}{\partial w_1} = -2 \sum_{i=1}^{n} x_i \big(y_i - (w_0 + w_1 x_i)\big) = 0$$

Substituting $w_0 = \bar{y} - w_1 \bar{x}$ and noting that $\sum_{i} \bar{x}(y_i - \bar{y}) = 0$ and $\sum_{i} \bar{x}(x_i - \bar{x}) = 0$ (so the sums can be centered), we can solve for $w_1$.
Thus, our optimal values of $w_0$ and $w_1$ are given by:

$$w_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}$$
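These closed-form estimates translate directly into code. A minimal sketch, again assuming plain Python lists (the name fit_line is illustrative):

```python
def fit_line(x, y):
    """Return (w0, w1) minimizing the squared loss, via the closed-form solution."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # w1 = sum((xi - x_mean) * (yi - y_mean)) / sum((xi - x_mean)^2)
    w1 = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
          / sum((xi - x_mean) ** 2 for xi in x))
    # w0 = y_mean - w1 * x_mean
    w0 = y_mean - w1 * x_mean
    return w0, w1

print(fit_line([1, -2], [5, -1]))  # (3.0, 2.0), matching Test Case 1 below
```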
Performance Metric
Once we have a regression model, how do we evaluate its performance? One way might be to use the loss function itself, but because the squared loss is unbounded, its value is difficult to interpret intuitively.
One of the most common goodness-of-fit measures is the $R^2$ value, also known as the coefficient of determination:

$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$$

where:
$SS_{\text{res}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the residual sum of squares, or the amount of unexplained variation (with $\hat{y}_i$ the predicted value)
$SS_{\text{tot}} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares, or the total amount of variation
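The $R^2$ formula is just as direct to compute. A short sketch (the name r_squared is illustrative):

```python
def r_squared(y, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - ypi) ** 2 for yi, ypi in zip(y, y_pred))  # unexplained variation
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)                 # total variation
    return 1 - ss_res / ss_tot

print(r_squared([5, -1], [5, -1]))  # 1.0 -> the predictions explain all variation
```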
Exercise
Write a program that performs simple linear regression. Complete the fit(), predict(), and score() methods:
fit() takes in arrays x and y, and computes and stores the regression coefficient and intercept
predict() takes in array x and returns a predicted array y_pred using the stored coefficient and intercept
score() takes in arrays x and y, predicts the y_pred values using x, and then uses the y_pred and y arrays to return the $R^2$ performance metric
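One possible solution sketch, assuming a plain sklearn-style class (the class name SimpleLinearRegression and the attribute names w0 and w1 are assumptions, not given by the exercise):

```python
class SimpleLinearRegression:
    """Simple linear regression fit with the closed-form least-squares solution."""

    def fit(self, x, y):
        # Closed-form estimates derived in the sections above.
        n = len(x)
        x_mean = sum(x) / n
        y_mean = sum(y) / n
        self.w1 = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
                   / sum((xi - x_mean) ** 2 for xi in x))
        self.w0 = y_mean - self.w1 * x_mean
        return self

    def predict(self, x):
        # Apply the stored intercept and coefficient to each point.
        return [self.w0 + self.w1 * xi for xi in x]

    def score(self, x, y):
        # R^2 = 1 - SS_res / SS_tot, using predictions on x against y.
        y_pred = self.predict(x)
        y_mean = sum(y) / len(y)
        ss_res = sum((yi - ypi) ** 2 for yi, ypi in zip(y, y_pred))
        ss_tot = sum((yi - y_mean) ** 2 for yi in y)
        return 1 - ss_res / ss_tot
```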
Sample Test Cases
Test Case 1
Input:
[
[ 1,-2],
[ 5,-1],
[ 1,-2],
[ 5,-1]
]
Output:
w0 = 3
w1 = 2
y_pred = [5, -1]
r_squared = 1.0
Test Case 2
Input:
[
[1, 4, 3],
[2, 9, 6],
[2, 6,-2],
[4,16,-5]
]
Output:
w0 = -0.429
w1 = 2.286
y_pred = [4.143, 13.286, -5]
r_squared = 0.9667
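Judging from the outputs, the four input rows appear to be the x and y arrays passed to fit() followed by the x and y arrays passed to predict() and score(); under that assumption, a driver for Test Case 2 might look like this:

```python
model = SimpleLinearRegression()
x_fit, y_fit = [1, 4, 3], [2, 9, 6]
x_test, y_test = [2, 6, -2], [4, 16, -5]

model.fit(x_fit, y_fit)
print(f"w0 = {model.w0:.3f}")                        # w0 = -0.429
print(f"w1 = {model.w1:.3f}")                        # w1 = 2.286
print([round(p, 3) for p in model.predict(x_test)])  # [4.143, 13.286, -5.0]
print(round(model.score(x_test, y_test), 4))         # 0.9667
```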