Normalization

Normalization, in the context of data preprocessing, is the name given to a set of techniques that bring numerical data from multiple sources onto a common scale. It is one of the most common data preprocessing steps and can often be the difference between a poorly trained model and a high-accuracy model.

Why normalize?

Before we get into normalization, it is worthwhile to question why we need it in the first place. Consider these two data points from the Student Performance Data Set from Kaggle:

	Student Age	Weekly Hours of Study	Final Grade
Student 1	15	3	14
Student 2	18	1	0

From the data points we can see that Student 1 is 3 years younger than Student 2 and studies for 2 hours longer than Student 2. As a result, Student 1 scored a 14 on their finals, while Student 2 scored a 0. Given this information, if we are asked to determine which factor is a better predictor of a student's final grade, what would we conclude? Does the age cause a higher grade, or the number of hours of study?

Let's take the naive approach: Student 2 is 3 whole years older than Student 1, but has studied only 2 fewer hours than Student 1. Since 3 is greater than 2, we could conclude that the age of the student is what better determines their academic success, and not the amount of time spent studying. The lower the age, the better the student would do.

Intuitively, we know this is not likely the truth. While there is a difference in age between the students, this difference is not as significant as the difference in time spent studying. But it's important to remember that a computer completely lacks this bit of context! To a computer these are merely numbers, and 3 is simply greater than 2. So how do we teach a computer to do better?

Let us apply min-max scaling normalization (we will explore what this is shortly) on the age and study time columns with respect to the entire dataset:

	Student Age (Normalized)	Weekly Hours of Study (Normalized)	Final Grade
Student 1	0	0.667	14
Student 2	0.429	0	0

Now the table shows us that the normalized age is 0.429 lower for Student 1 than for Student 2, and that the normalized weekly hours of study is 0.667 higher for Student 1 than for Student 2. This shows that, given the context of this dataset, the leap from 1 hour of study to 3 hours of study is a lot more significant than the drop from the age of 18 to the age of 15.
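The two gaps in the normalized table can be checked by hand using (value - minimum) / (maximum - minimum). The sketch below assumes the column bounds that the table implies (ages from 15 to 22, weekly study time from 1 to 4); these bounds are inferred from the normalized values shown above and are not stated in the text. The helper name min_max is hypothetical.

```python
def min_max(value: float, lo: float, hi: float) -> float:
    """Scale value into [0, 1] given the column minimum and maximum."""
    return (value - lo) / (hi - lo)

# Assumed column bounds, inferred from the normalized table above.
AGE_MIN, AGE_MAX = 15, 22
STUDY_MIN, STUDY_MAX = 1, 4

# Gap between the two students on each normalized axis.
age_gap = min_max(18, AGE_MIN, AGE_MAX) - min_max(15, AGE_MIN, AGE_MAX)
study_gap = min_max(3, STUDY_MIN, STUDY_MAX) - min_max(1, STUDY_MIN, STUDY_MAX)

print(round(age_gap, 3))    # 0.429
print(round(study_gap, 3))  # 0.667
```

Once both columns are on the same 0 to 1 scale, the study-time gap is the larger of the two, which matches the intuition described earlier.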

Min-Max Scaling

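In min-max normalization, we compute the ratio of the difference between our value and the minimum value to the range of values:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

where $\min(x)$ and $\max(x)$ are the smallest and largest values in the column.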
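As a rough illustration of min-max scaling on a single column, the sketch below applies it to the ages from the sample test case. The helper name min_max_scale_series is hypothetical, and the handling of a constant column is a choice made for this sketch only.

```python
import pandas as pd

def min_max_scale_series(s: pd.Series) -> pd.Series:
    """Hypothetical helper: rescale one numeric column into [0, 1]."""
    value_range = s.max() - s.min()
    # A constant column has zero range, so every value would become 0/0
    # (NaN in pandas). This sketch does not pick a fallback for that case.
    return (s - s.min()) / value_range

ages = pd.Series([15, 16, 18], name="age")
scaled = min_max_scale_series(ages)
print(scaled.round(6).tolist())  # [0.0, 0.333333, 1.0]
```

The smallest value always maps to 0 and the largest to 1, so a single outlier can compress the rest of the column into a narrow band.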

Z-Score Normalization

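In z-score normalization, also known as standardization, we compute the ratio of the difference between our value and the mean value to the standard deviation:

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ is the mean of $x$ and $\sigma$ is the standard deviation of $x$, given by:

$$\sigma = \sqrt{\frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \mu)^2}$$

with $N$ the number of values in $x$.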

Note: The formula for standard deviation given above is the sample standard deviation, as opposed to the population standard deviation. The correct equation to use depends on whether we are dealing with population or sample data. In most cases, however, we are dealing with the latter. For the purpose of this exercise, use the sample standard deviation formula, i.e. the one formulated above.
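The distinction matters in practice because libraries differ in their defaults. In pandas, Series.std() uses ddof=1 (the sample formula) unless told otherwise, while passing ddof=0 gives the population formula. The sketch below, which assumes the ages from the sample test case as input, shows both and computes z-scores with the sample version.

```python
import pandas as pd

ages = pd.Series([15, 16, 18], name="age")

# Series.std() defaults to ddof=1: sample standard deviation (N - 1 denominator).
sample_std = ages.std()
# ddof=0 gives the population standard deviation (N denominator).
population_std = ages.std(ddof=0)

# z-score using the sample standard deviation, as this exercise requires.
z_scores = (ages - ages.mean()) / sample_std
print(z_scores.round(6).tolist())  # [-0.872872, -0.218218, 1.091089]
```

Note that numpy.std defaults to ddof=0, so mixing NumPy and pandas calls without setting ddof explicitly can produce slightly different results.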

Exercise

Write a program using Pandas that uses both approaches to normalize every column in the given dataframe, adding for each column a new column with the _norm suffix that contains the normalized data. For example, if the dataframe has a column named age, create a normalized column called age_norm that holds the normalized age. Complete the min_max_scale() and z_score_normalize() methods:

min_max_scale() takes in an input dataframe df and returns a new dataframe where the columns have been normalized using min-max scaling
z_score_normalize() takes in an input dataframe df and returns a new dataframe where the columns have been normalized using z-score normalization

These methods must return new dataframes; do not modify the input dataframe df.
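One common way to meet this requirement is to copy the input first and add columns only to the copy. The sketch below shows that pattern with a deliberately unrelated transformation; the helper with_doubled_columns and the _doubled suffix are hypothetical and are not part of the exercise.

```python
import pandas as pd

def with_doubled_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper showing the copy-then-extend pattern."""
    result = df.copy()  # work on a copy so the caller's dataframe is untouched
    # Iterate over the input's columns so newly added columns are not revisited.
    for column in df.columns:
        result[f"{column}_doubled"] = df[column] * 2
    return result

original = pd.DataFrame({"age": [15, 16, 18]})
extended = with_doubled_columns(original)

print(list(original.columns))  # ['age']
print(list(extended.columns))  # ['age', 'age_doubled']
```

Assigning new columns directly on df inside the function would instead change the caller's dataframe, which is what the requirement rules out.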

Sample Test Cases

Test Case 1

Input:

[
  {
      "age": [
          15,
          16,
          18
      ],
      "weekly_study_time": [
          3,
          3,
          1
      ]
  }
]

Output:

min_max_scaled_df =

 age  weekly_study_time  age_norm  weekly_study_time_norm
  15                  3  0.000000                     1.0
  16                  3  0.333333                     1.0
  18                  1  1.000000                     0.0

z_score_normalized_df =

 age  weekly_study_time  age_norm  weekly_study_time_norm
  15                  3 -0.872872                0.577350
  16                  3 -0.218218                0.577350
  18                  1  1.091089               -1.154701

import pandas as pd


class Normalization:
    def min_max_scale(self, df: pd.DataFrame) -> pd.DataFrame:
        # rewrite this function
        # dummy code to get you started
        normalized_df = pd.DataFrame()
        return normalized_df

    def z_score_normalize(self, df: pd.DataFrame) -> pd.DataFrame:
        # rewrite this function
        # dummy code to get you started
        normalized_df = pd.DataFrame()
        return normalized_df

    def main(
        self,
        df: pd.DataFrame,
    ) -> tuple[
        pd.DataFrame,
        pd.DataFrame,
    ]:
        # this is how your implementation will be tested
        # do not edit this function

        # apply min-max scaling to dataframe
        min_max_scaled_df = self.min_max_scale(df)

        # apply z-score normalization to dataframe
        z_score_normalized_df = self.z_score_normalize(df)

        return min_max_scaled_df, z_score_normalized_df