Machine Learning
Intermediate

K-Means Clustering

In this exercise, we start diving into one of the most commonly used unsupervised learning algorithms: K-means clustering.

Supervised & Unsupervised Learning

So what is unsupervised learning? And what is supervised learning?

Supervised learning is the set of algorithms and learning mechanisms that deal with labelled datasets. In other words, when the desired output or target of the data points is provided, we generally use supervised learning. Popular supervised learning algorithms include:

  • Linear & Logistic Regression
  • Decision Tree
  • Random Forest
  • Support Vector Machines
  • Naive Bayes
  • K-Nearest Neighbours

Unsupervised learning is the set of algorithms and learning mechanisms that deal with unlabelled datasets. In other words, when the desired output or target of the data points is not provided, we generally use unsupervised learning. Popular unsupervised learning algorithms include:

  • K-Means Clustering
  • Hierarchical Clustering
  • Principal Component Analysis
  • Self-Organizing Maps

Definition

Suppose we are given a matrix $X$ that holds the values of the independent variables. $X$ is of size $n \times d$, where $n$ is the number of data points we have, and $d$ is the number of independent variables we have:

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix}$$

Additionally, suppose we are given a value $k$ that represents the number of clusters.

Now consider a $k \times d$ matrix $C$ that holds the cluster centroids:

$$C = \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1d} \\ \vdots & \vdots & \ddots & \vdots \\ c_{k1} & c_{k2} & \cdots & c_{kd} \end{bmatrix}$$

where $c_{ij}$ is the coordinate of the $i^{\text{th}}$ centroid in the $j^{\text{th}}$ dimension. The centroids represent the coordinates of the centers of the clusters.

Finally, we can define the output labels $L$, where $L$ is an array of $n$ integers that can contain numbers from $0$ to $k - 1$. $L_i$ represents the index of the centroid that is closest (using Euclidean distance) to the $i^{\text{th}}$ data point.

For example, suppose we have $k = 2$ and the following illustrative values:

$$X = \begin{bmatrix} 1 & 1 \\ 3 & 1 \\ 8 & 8 \end{bmatrix}, \qquad C = \begin{bmatrix} 2 & 1 \\ 8 & 8 \end{bmatrix}$$

This means we have $n = 3$ data points $x_1 = (1, 1)$, $x_2 = (3, 1)$, $x_3 = (8, 8)$ and $k = 2$ centroids $c_0 = (2, 1)$ and $c_1 = (8, 8)$.

For point $x_1 = (1, 1)$:

  • distance to $c_0$ is $\sqrt{(1-2)^2 + (1-1)^2} = 1$
  • distance to $c_1$ is $\sqrt{(1-8)^2 + (1-8)^2} \approx 9.9$

Thus, the first label is $0$, since $c_0$ is closer to $x_1$ than $c_1$ is. Similarly we find the labels of $x_2$ and $x_3$, giving $L = [0, 0, 1]$.

Then, given the matrix $X$, K-means clustering involves computing $L$ and $C$ such that $L$ contains the indices of the centroids that are closest to each data point in $X$, and each centroid $c_j$ in $C$ represents the coordinates of the center of the data points that are labelled $j$:

$$c_j = \frac{1}{n_j} \sum_{i \in S_j} x_i$$

where $n_j$ is the number of times label $j$ appears in $L$ and $S_j$ is the set of indices where label $j$ appears in $L$.

In the previous example, centroid $c_0$ is the closest centroid to points $x_1$ and $x_2$. The center of these points is $\left(\tfrac{1+3}{2}, \tfrac{1+1}{2}\right) = (2, 1)$, which is consistent with the location of centroid $c_0$. Similarly, centroid $c_1$ is the closest centroid to point $x_3$. The center of this single point is $(8, 8)$, which is, once again, consistent with the location of centroid $c_1$. Because both centroids are consistent, this is a valid solution to K-means clustering.
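
As a quick check, here is a minimal NumPy sketch of this assignment step using the illustrative values from the example above (the array names X, C, and L simply mirror the notation in this section):

import numpy as np

# Illustrative values from the example above
X = np.array([[1.0, 1.0], [3.0, 1.0], [8.0, 8.0]])
C = np.array([[2.0, 1.0], [8.0, 8.0]])

# Pairwise Euclidean distances: entry (i, j) is the distance from point i to centroid j
distances = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)

# Label each point with the index of its closest centroid
L = np.argmin(distances, axis=1)
print(L)  # [0 0 1]

# Consistency check: each centroid equals the mean of the points labelled with its index
for j in range(len(C)):
    print(j, X[L == j].mean(axis=0))  # 0 [2. 1.]   1 [8. 8.]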

The algorithm for computing K-means clusters is as follows (a minimal sketch of the loop is shown after the list):

  1. Randomly select $k$ centroids
  2. Compute $L$ by finding the closest centroid to each data point in $X$
  3. Given $L$, compute the new centroids $C$ as the means of the points assigned to each label
  4. If the centroids changed, go back to step 2; otherwise, terminate the process
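
For reference, here is a minimal NumPy sketch of this loop. It assumes Euclidean distance, initializes the centroids by sampling $k$ of the data points, and caps the number of iterations; the function name kmeans and its parameters are illustrative and are not the interface required by the exercise below.

import numpy as np

def kmeans(X, k, max_iters=1000, seed=0):
    # Step 1: randomly select k of the data points as the initial centroids
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iters):
        # Step 2: label each point with the index of its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)

        # Step 3: move each centroid to the mean of the points assigned to it
        # (a centroid with no assigned points is left where it is)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Step 4: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids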

Intuition

K-means clustering allows us to separate the given data points into $k$ groups, in the hope that data points in the same group will share characteristics. Consider, for example, a subset of the Iris dataset from the UCI Machine Learning Repository that includes data on different species of the Iris plant. We can visualize the distribution of the sepal lengths and widths of the flowers using a scatter plot:

Iris Raw Data

The flowers can be classified into 3 classes:

  • Iris-setosa
  • Iris-versicolor
  • Iris-virginica

Without using the class labels of the dataset, we can apply K-means clustering to get an idea of how these classes are likely distributed. After performing K-means clustering, we get the following clusters and centroids:

Iris KMeans Predicted Labels

Note that labels 0, 1, and 2 are completely arbitrary.

Now let's take a look at the actual labels to see how much K-means clustering has been able to capture:

Iris Actual Labels

As we can see, K-means clustering is able to almost completely isolate instances of the Iris-setosa class, and can somewhat differentiate between the Iris-versicolor and Iris-virginica classes! All this can be obtained without looking at the actual class labels in our dataset! This is what makes clustering algorithms so powerful.

Performance Metrics

When it comes to K-means clustering, there are typically two characteristics of the resulting clusters that we are interested in:

  • how close each point is to its respective centroid
  • how far apart the centroids are

To obtain a well-performing clustering solution, we want to minimize the distance between each point and its respective centroid and maximize the distance between the centroids. Thus, we consider two metrics for K-means clustering: within-cluster sum of squares and between-clusters sum of squares.

Within-Cluster Sum of Squares

This is computed by summing up the squared distance between each point and its closest centroid:

$$\text{WCSS} = \sum_{i=1}^{n} \left\lVert x_i - c_{L_i} \right\rVert^2$$

The lower this value is, the better our clustering algorithm is performing.
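
As a sketch, assuming X holds the data points, centroids holds the centroid coordinates, and labels holds each point's centroid index (names chosen here for illustration), this can be computed with NumPy as:

import numpy as np

def wcss(X, centroids, labels):
    # Sum of squared Euclidean distances from each point to its assigned centroid
    return float(np.sum((X - centroids[labels]) ** 2))

# e.g. wcss(np.array([[-5.4, -4.3]]), np.array([[7.3, 8.1]]), np.array([0]))
# gives 315.05, matching Test Case 1 below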

Between-Cluster Sum of Squares

This is computed by summing up the squared distance between each centroid and the center of all centroids $\bar{c}$:

$$\text{BCSS} = \sum_{j=0}^{k-1} \left\lVert c_j - \bar{c} \right\rVert^2$$

$\bar{c}$ is computed by taking the mean of all centroids:

$$\bar{c} = \frac{1}{k} \sum_{j=0}^{k-1} c_j$$

The higher this value is, the better our clustering algorithm is performing, since well-separated clusters have centroids that are far apart.
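
A corresponding sketch for the between-cluster sum of squares, assuming centroids is the k x d matrix of centroid coordinates (again, the name is illustrative):

import numpy as np

def bcss(centroids):
    # Mean of all centroids
    center = centroids.mean(axis=0)
    # Sum of squared Euclidean distances from each centroid to that mean
    return float(np.sum((centroids - center) ** 2))

# e.g. for the centroids in Test Case 2 below, this gives approximately 32.587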

Exercise

Write a program that performs K-means clustering. Complete the fit(), predict(), and score() methods:

  • fit() takes in array X and number of clusters k, and returns the labels array and the centroids matrix
  • predict() takes in array X and returns the predicted labels array based on the centroids computed in the fit() method
  • score() takes in array X, computes the predicted labels_pred array based on the centroids computed in the fit() method, and computes and returns the wcss_score and bcss_score values

Sample Test Cases

Test Case 1

Input:

[
  [[ 7.3,  8.1]],
  [[-5.4, -4.3]],
  1
]

Output:

labels = [0]
centroids = [[7.3, 8.1]]
labels_pred = [0]
wcss_score = 315.05
bcss_score = 0.0

Test Case 2

Input:

[
  [[-8.1,  5.5],
   [-9.8, -0.1],
   [-7.1, -0.9],
   [-0.3,  2.3]],
  [[-4.3,  5.6]],
  2
]

Output:

labels = [1, 1, 1, 0]
centroids = [[-0.3, 2.3], [-8.333, 1.5]]
labels_pred = [0]
wcss_score = 26.89
bcss_score = 32.587

Note: The values listed for centroids represent one of many possible solutions.

import numpy as np
from numpy.typing import NDArray


class KMeansClustering:
    def fit(
        self,
        X: NDArray[np.float64],
        k: int,
        max_iters: int = 1000,
    ) -> tuple[NDArray[np.int64], NDArray[np.float64]]:
        # rewrite this function
        # dummy code to get you started
        self.centroids = np.array([])
        labels = np.array([])

        return labels, self.centroids

    def predict(self, X: NDArray[np.float64]) -> NDArray[np.int64]:
        # rewrite this function
        # dummy code to get you started
        labels_pred = np.array([])
        return labels_pred

    def score(self, X: NDArray[np.float64]) -> tuple[float, float]:
        # rewrite this function
        # dummy code to get you started
        wcss = 0.0
        bcss = 0.0

        return wcss, bcss

    def main(
        self,
        X_train: NDArray[np.float64],
        X_test: NDArray[np.float64],
        k: int,
    ) -> tuple[
        NDArray[np.int64],
        NDArray[np.float64],
        NDArray[np.int64],
        float,
        float,
    ]:
        # this is how your implementation will be tested
        # do not edit this function

        # fit training data
        labels, centroids = self.fit(X_train, k)
        # predict using test data
        labels_pred = self.predict(X_test)
        # compute performance
        wcss_score, bcss_score = self.score(X_test)

        return labels, centroids, labels_pred, wcss_score, bcss_score