
Normalization
Normalization, in the context of data preprocessing, refers to a set of techniques used to transform numerical data from multiple sources onto a common scale. Normalization is one of the most common data preprocessing steps and can often be the difference between a poorly trained model and a high-accuracy model.
Why normalize?
Before we get into normalization, it is worthwhile to question why we need it in the first place. Consider these two data points from the Student Performance Data Set from Kaggle:
| | Student Age | Weekly Hours of Study | Final Grade |
|---|---|---|---|
| Student 1 | 15 | 3 | 14 |
| Student 2 | 18 | 1 | 0 |
From the data points we can see that Student 1 is 3 years younger than Student 2 and studies for 2 hours longer than Student 2. As a result, Student 1 scored a 14 on their finals, while Student 2 scored a 0. Given this information, if we are asked to determine which factor is a better predictor of a student's final grade, what would we conclude? Does the age cause a higher grade, or the number of hours of study?
Let's take the naive approach: Student 2 is 3 whole years older than Student 1, but has studied only 2 fewer hours than Student 1. Since 3 is greater than 2, we could conclude that the age of the student is what better determines their academic success, and not the amount of time spent studying. The lower the age, the better the student would do.
Intuitively, we know this is not likely the truth. While there is a difference in age between the students, this difference is not as significant as the difference in time spent studying. But it's important to remember that a computer completely lacks this bit of context! To a computer these are merely numbers, and 3 is simply greater than 2. So how do we teach a computer to do better?
Let us apply min-max scaling (we will explore what this is shortly) to the age and study time columns with respect to the entire dataset:
| | Student Age (Normalized) | Weekly Hours of Study (Normalized) | Final Grade |
|---|---|---|---|
| Student 1 | 0 | 0.667 | 14 |
| Student 2 | 0.429 | 0 | 0 |
Now the table shows us that the normalized age is 0.429 lower for Student 1 than for Student 2, and that the normalized weekly hours of study is 0.667 higher for Student 1 than for Student 2. This shows that, given the context of this dataset, the leap from 1 hour of study to 3 hours of study is a lot more significant than the drop from the age of 18 to the age of 15.
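As a worked check of the normalized values above: min-max scaling subtracts a column's minimum and divides by the column's range. The values in the table are consistent with ages spanning 15 to 22 and weekly study time spanning 1 to 4 across the dataset. These bounds are inferred from the table itself and are not stated in the text, so treat them as an assumption.

$$
\text{age}_{\text{norm}}(\text{Student 2}) = \frac{18 - 15}{22 - 15} = \frac{3}{7} \approx 0.429,
\qquad
\text{study}_{\text{norm}}(\text{Student 1}) = \frac{3 - 1}{4 - 1} = \frac{2}{3} \approx 0.667
$$

Student 1's age and Student 2's study time sit at the assumed minimum of their columns, which is why both normalize to 0.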
Min-Max Scaling
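In min-max scaling, we compute the ratio of the difference between our value and the minimum value to the range of values:

$$
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
$$

where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the column. The result always lies between 0 and 1.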
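A minimal sketch of min-max scaling on a single pandas Series, using the `age` values from the sample test case further below. The helper name `min_max` is hypothetical and chosen only for illustration, and the sketch assumes the column is not constant (otherwise the range is zero).

```python
import pandas as pd


def min_max(series: pd.Series) -> pd.Series:
    """Rescale a numeric Series to the [0, 1] interval (illustrative helper)."""
    lo, hi = series.min(), series.max()
    # Assumption: hi > lo. A constant column would divide by zero here.
    return (series - lo) / (hi - lo)


age = pd.Series([15, 16, 18], name="age")
age_scaled = min_max(age)
print(age_scaled.round(6).tolist())  # [0.0, 0.333333, 1.0]
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.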
Z-Score Normalization
In z-score normalization, also known as standardization, we compute the ratio of the difference between our value and the mean value to the standard deviation:

$$
z = \frac{x - \mu}{\sigma}
$$

where $\mu$ is the mean of $x$ and $\sigma$ is the standard deviation of $x$, given by:

$$
\sigma = \sqrt{\frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \mu)^2}
$$
Note: The formula for standard deviation given above is the sample standard deviation, as opposed to the population standard deviation. The correct equation to use depends on whether we are dealing with population or sample data. In most cases, however, we are dealing with the latter. For the purpose of this exercise, use the sample standard deviation formula, i.e. the one formulated above.
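The z-score computation described above can be sketched on a single pandas Series as follows, using the `weekly_study_time` values from the sample test case further below. The helper name `z_score` is hypothetical. `Series.std()` already defaults to the sample standard deviation (`ddof=1`); it is passed explicitly here only to make the choice visible.

```python
import pandas as pd


def z_score(series: pd.Series) -> pd.Series:
    """Standardize a numeric Series using the sample standard deviation."""
    mean = series.mean()
    std = series.std(ddof=1)  # ddof=1 divides by N - 1 (sample formula)
    # Assumption: std > 0. A constant column would divide by zero here.
    return (series - mean) / std


study = pd.Series([3, 3, 1], name="weekly_study_time")
study_z = z_score(study)
print(study_z.round(6).tolist())  # [0.57735, 0.57735, -1.154701]
```

Switching to `ddof=0` would give the population standard deviation instead and produce slightly larger magnitudes on small samples like this one.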
Exercise
Write a program using Pandas that uses both approaches to normalize every column in the given dataframe and adds a new column with the `_norm` suffix that contains the normalized data. For example, if the dataframe has a column named `age`, create a normalized column called `age_norm` that holds the normalized age. Complete the `min_max_scale()` and `z_score_normalize()` methods:
- `min_max_scale()` takes in an input dataframe `df` and returns a new dataframe where the columns have been normalized using min-max scaling
- `z_score_normalize()` takes in an input dataframe `df` and returns a new dataframe where the columns have been normalized using z-score normalization
These methods must return new dataframes; do not modify the input dataframe `df`.
Sample Test Cases
Test Case 1
Input:
    [
      {
        "age": [15, 16, 18],
        "weekly_study_time": [3, 3, 1]
      }
    ]
Output:
min_max_scaled_df =
age weekly_study_time age_norm weekly_study_time_norm
15 3 0.000000 1.0
16 3 0.333333 1.0
18 1 1.000000 0.0
z_score_normalized_df =
age weekly_study_time age_norm weekly_study_time_norm
15 3 -0.872872 0.577350
16 3 -0.218218 0.577350
18 1 1.091089 -1.154701