
Cleaning Missing Data
Most datasets encountered in data science are not fully clean and can contain a lot of missing values due to a variety of reasons, including human error during data entry, technical issues with data collection, or the data not being applicable in certain cases. This makes data cleaning an essential step in the data analysis process. In this exercise, we will explore the approaches in cleaning missing data values.
Dropping Rows with Missing Values
Often it is convenient to simply drop rows that contain missing values. This is more appropriate when the dataset is large and the proportion of rows with missing values is low.
Consider the following table of numbers and their corresponding square roots:
number | square_root |
---|---|
2 | 1.414 |
3 | null |
5 | null |
null | 3.0 |
12 | 3.464 |
If we were to drop row with missing values, we would drop the third and fourth rows:
number | square_root |
---|---|
2 | 1.414 |
12 | 3.464 |
Note that while this is straightforward, it can sometimes lead to the loss of valuable data.
Filling with Mean
Instead of dropping the row entirely, it is sometimes useful to fill the missing values with the mean value of the column.
Consider again the following table of numbers and their corresponding square roots:
number | square_root |
---|---|
2 | 1.414 |
3 | null |
5 | null |
null | 3.0 |
12 | 3.464 |
Filling in rows with mean values, we would have:
number | square_root |
---|---|
2 | 1.414 |
3 | 2.626 |
5 | 2.626 |
5.5 | 3.0 |
12 | 3.464 |
Forward-Filling
Sometimes it is also effective to fill the missing values using the last valid value.
Consider again the following table of numbers and their corresponding square roots:
number | square_root |
---|---|
2 | 1.414 |
3 | null |
5 | null |
null | 3.0 |
12 | 3.464 |
Forward-filling rows, we would have:
number | square_root |
---|---|
2 | 1.414 |
3 | 1.414 |
5 | 1.414 |
5 | 3.0 |
12 | 3.464 |
Backward-Filling
Similarly, we can also fill the missing values using the next valid value.
Consider again the following table of numbers and their corresponding square roots:
number | square_root |
---|---|
2 | 1.414 |
3 | null |
5 | null |
null | 3.0 |
12 | 3.464 |
Backward-filling rows, we would have:
number | square_root |
---|---|
2 | 1.414 |
3 | 3.0 |
5 | 3.0 |
12 | 3.0 |
12 | 3.464 |
Linear Interpolation
Finally, we can also fill the missing values using linear interpolation. This involves filling rows with values that are evenly spaced between the last and next valid values.
Consider again the following table of numbers and their corresponding square roots:
number | square_root |
---|---|
2 | 1.414 |
3 | null |
5 | null |
null | 3.0 |
12 | 3.464 |
Interpolating missing values, we would have:
number | square_root |
---|---|
2 | 1.414 |
3 | 1.943 |
5 | 2.471 |
8.5 | 3.0 |
12 | 3.464 |
Exercise
Write a program using Pandas that uses various approaches to clean given dataframes with missing values. Complete the drop_missing()
, fill_mean()
, fill_forward()
, fill_backward()
, and fill_linear_interpolation()
methods:
drop_missing()
takes in an input dataframedf
and returns a new dataframe where rows with missing values are droppedfill_mean()
takes in an input dataframedf
and returns a new dataframe where rows with missing values filled using column meansfill_forward()
takes in an input dataframedf
and returns a new dataframe where rows with missing values are forward-filledfill_backward()
takes in an input dataframedf
and returns a new dataframe where rows with missing values are backward-filledfill_linear_interpolation()
takes in an input dataframedf
and returns a new dataframe where rows with missing values are filled using linear interpolation
These methods must return new dataframes, do not modify the input dataframe df
.
Sample Test Cases
Test Case 1
Input:
[
{
"number": [2, 3, 5, null, 12],
"square_root": [1.414, null, null, 3.0, 3.464]
}
]
Output:
drop_missing_df =
number square_root
2.0 1.414
12.0 3.464
fill_mean_df =
number square_root
2.0 1.414
3.0 2.626
5.0 2.626
5.5 3.000
12.0 3.464
fill_forward_df =
number square_root
2.0 1.414
3.0 1.414
5.0 1.414
5.0 3.000
12.0 3.464
fill_backward_df =
number square_root
2.0 1.414
3.0 3.000
5.0 3.000
12.0 3.000
12.0 3.464
fill_linear_interpolation_df =
number square_root
2.0 1.414000
3.0 1.942667
5.0 2.471333
8.5 3.000000
12.0 3.464000