Beginner

Cleaning Missing Data

Most datasets encountered in data science contain missing values, for reasons such as human error during data entry, technical issues with data collection, or fields that do not apply in certain cases. This makes data cleaning an essential step in the data analysis process. In this exercise, we will explore several approaches to cleaning missing data values.

Dropping Rows with Missing Values

Often it is convenient to simply drop rows that contain missing values. This is more appropriate when the dataset is large and the proportion of rows with missing values is low.

Consider the following table of numbers and their corresponding square roots:

number  square_root
     2        1.414
     3         null
     5         null
  null          3.0
    12        3.464

If we were to drop the rows with missing values, we would drop the second, third, and fourth rows:

number  square_root
     2        1.414
    12        3.464

Note that while this is straightforward, it can sometimes lead to the loss of valuable data.
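As a minimal sketch of this approach, assuming the table above is loaded into a pandas dataframe named df (the variable names here are illustrative), DataFrame.dropna() removes every row that has at least one missing value. The sketch also computes the fraction of affected rows, which can help judge whether dropping is reasonable:

```python
import numpy as np
import pandas as pd

# The example table, with null represented as NaN.
df = pd.DataFrame(
    {
        "number": [2, 3, 5, np.nan, 12],
        "square_root": [1.414, np.nan, np.nan, 3.0, 3.464],
    }
)

# Fraction of rows that contain at least one missing value.
missing_fraction = df.isna().any(axis=1).mean()

# dropna() returns a new dataframe; df itself is left unchanged.
dropped = df.dropna()

print(missing_fraction)
print(dropped.to_string(index=False))
```

Here three of the five rows are affected, so dropping discards most of the data, which illustrates the loss mentioned above.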

Filling with Mean

Instead of dropping the row entirely, it is sometimes useful to fill the missing values with the mean value of the column.

Consider again the following table of numbers and their corresponding square roots:

number  square_root
     2        1.414
     3         null
     5         null
  null          3.0
    12        3.464

Filling in rows with mean values, we would have:

number  square_root
     2        1.414
     3        2.626
     5        2.626
   5.5          3.0
    12        3.464
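The filled values come from the valid entries of each column: (2 + 3 + 5 + 12) / 4 = 5.5 for number, and (1.414 + 3.0 + 3.464) / 3 = 2.626 for square_root. A minimal sketch of one way to do this in pandas, assuming the same example table in a dataframe named df, passes the column means to DataFrame.fillna():

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "number": [2, 3, 5, np.nan, 12],
        "square_root": [1.414, np.nan, np.nan, 3.0, 3.464],
    }
)

# df.mean() skips NaN by default and returns one mean per column.
column_means = df.mean()

# fillna() with a Series fills each column with its matching mean
# and returns a new dataframe.
filled = df.fillna(column_means)

print(filled.to_string(index=False))
```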

Forward-Filling

Sometimes it is also effective to fill each missing value with the most recent valid value above it in the same column.

Consider again the following table of numbers and their corresponding square roots:

number  square_root
     2        1.414
     3         null
     5         null
  null          3.0
    12        3.464

Forward-filling rows, we would have:

number  square_root
     2        1.414
     3        1.414
     5        1.414
     5          3.0
    12        3.464
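A minimal sketch of forward-filling, assuming the same example table in a dataframe named df, uses DataFrame.ffill(). The short Series named leading is an illustrative addition showing that a missing value with no valid value before it stays missing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "number": [2, 3, 5, np.nan, 12],
        "square_root": [1.414, np.nan, np.nan, 3.0, 3.464],
    }
)

# ffill() copies the last valid value downward within each column
# and returns a new dataframe.
filled = df.ffill()

# A leading NaN has nothing above it to copy, so it remains NaN.
leading = pd.Series([np.nan, 1.0, np.nan]).ffill()

print(filled.to_string(index=False))
print(leading.tolist())
```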

Backward-Filling

Similarly, we can also fill each missing value with the next valid value below it in the same column.

Consider again the following table of numbers and their corresponding square roots:

number  square_root
     2        1.414
     3         null
     5         null
  null          3.0
    12        3.464

Backward-filling rows, we would have:

number  square_root
     2        1.414
     3          3.0
     5          3.0
    12          3.0
    12        3.464
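Backward-filling can be sketched the same way with DataFrame.bfill(), again assuming the example table is in a dataframe named df. A missing value in the last row would stay missing, since there is no later value to copy:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "number": [2, 3, 5, np.nan, 12],
        "square_root": [1.414, np.nan, np.nan, 3.0, 3.464],
    }
)

# bfill() copies the next valid value upward within each column
# and returns a new dataframe.
filled = df.bfill()

print(filled.to_string(index=False))
```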

Linear Interpolation

Finally, we can also fill the missing values using linear interpolation. This involves filling rows with values that are evenly spaced between the last and next valid values.

Consider again the following table of numbers and their corresponding square roots:

number  square_root
     2        1.414
     3         null
     5         null
  null          3.0
    12        3.464

Interpolating missing values, we would have:

number  square_root
     2        1.414
     3        1.943
     5        2.471
   8.5          3.0
    12        3.464
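In the square_root column, the gap between 1.414 and 3.0 spans three steps, so each step adds (3.0 - 1.414) / 3 ≈ 0.529, giving 1.943 and 2.471. In the number column, the single missing value sits halfway between 5 and 12, giving 8.5. A minimal sketch with DataFrame.interpolate(), assuming the example table is in a dataframe named df:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "number": [2, 3, 5, np.nan, 12],
        "square_root": [1.414, np.nan, np.nan, 3.0, 3.464],
    }
)

# method="linear" treats rows as equally spaced and fills each gap
# with evenly spaced values; it returns a new dataframe.
filled = df.interpolate(method="linear")

print(filled.to_string(index=False))
```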

Exercise

Write a program using Pandas that uses various approaches to clean given dataframes with missing values. Complete the drop_missing(), fill_mean(), fill_forward(), fill_backward(), and fill_linear_interpolation() methods:

  • drop_missing() takes in an input dataframe df and returns a new dataframe where rows with missing values are dropped
  • fill_mean() takes in an input dataframe df and returns a new dataframe where rows with missing values are filled using column means
  • fill_forward() takes in an input dataframe df and returns a new dataframe where rows with missing values are forward-filled
  • fill_backward() takes in an input dataframe df and returns a new dataframe where rows with missing values are backward-filled
  • fill_linear_interpolation() takes in an input dataframe df and returns a new dataframe where rows with missing values are filled using linear interpolation

These methods must return new dataframes; they must not modify the input dataframe df.
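One way to check this requirement while developing is sketched below. The helper leaves_input_untouched() and the function fill_in_place() are hypothetical names introduced only for illustration; they are not part of the exercise or of pandas:

```python
import numpy as np
import pandas as pd


def leaves_input_untouched(func, df: pd.DataFrame) -> bool:
    """Hypothetical helper: True if func(df) returns a new object and df is unchanged."""
    snapshot = df.copy(deep=True)
    result = func(df)
    # DataFrame.equals treats NaNs in the same positions as equal.
    return result is not df and df.equals(snapshot)


def fill_in_place(d: pd.DataFrame) -> pd.DataFrame:
    # Counter-example: writes into the caller's dataframe and returns it.
    d["number"] = d["number"].ffill()
    return d


df = pd.DataFrame({"number": [2, 3, 5, np.nan, 12]})

ok = leaves_input_untouched(lambda d: d.ffill(), df.copy())
bad = leaves_input_untouched(fill_in_place, df.copy())

print(ok, bad)
```

The first call passes because ffill() returns a new dataframe, while the second fails because the function assigns into its argument.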

Sample Test Cases

Test Case 1

Input:

[
  {
    "number": [2, 3, 5, null, 12],
    "square_root": [1.414, null, null, 3.0, 3.464]
  }
]

Output:

drop_missing_df =

 number  square_root
    2.0        1.414
   12.0        3.464

fill_mean_df =

 number  square_root
    2.0        1.414
    3.0        2.626
    5.0        2.626
    5.5        3.000
   12.0        3.464

fill_forward_df =

 number  square_root
    2.0        1.414
    3.0        1.414
    5.0        1.414
    5.0        3.000
   12.0        3.464

fill_backward_df =

 number  square_root
    2.0        1.414
    3.0        3.000
    5.0        3.000
   12.0        3.000
   12.0        3.464

fill_linear_interpolation_df =

 number  square_root
    2.0     1.414000
    3.0     1.942667
    5.0     2.471333
    8.5     3.000000
   12.0     3.464000
import pandas as pd


class CleaningMissingData:
    def drop_missing(self, df: pd.DataFrame) -> pd.DataFrame:
        # rewrite this function
        # dummy code to get you started
        processed_df = pd.DataFrame()

        return processed_df

    def fill_mean(self, df: pd.DataFrame) -> pd.DataFrame:
        # rewrite this function
        # dummy code to get you started
        processed_df = pd.DataFrame()

        return processed_df

    def fill_forward(self, df: pd.DataFrame) -> pd.DataFrame:
        # rewrite this function
        # dummy code to get you started
        processed_df = pd.DataFrame()

        return processed_df

    def fill_backward(self, df: pd.DataFrame) -> pd.DataFrame:
        # rewrite this function
        # dummy code to get you started
        processed_df = pd.DataFrame()

        return processed_df

    def fill_linear_interpolation(self, df: pd.DataFrame) -> pd.DataFrame:
        # rewrite this function
        # dummy code to get you started
        processed_df = pd.DataFrame()

        return processed_df

    def main(
        self,
        df: pd.DataFrame,
    ) -> tuple[
        pd.DataFrame,
        pd.DataFrame,
        pd.DataFrame,
        pd.DataFrame,
        pd.DataFrame,
    ]:
        # this is how your implementation will be tested
        # do not edit this function

        # drop rows from df with missing data
        drop_missing_df = self.drop_missing(df)
        # fill rows with missing data using mean
        fill_mean_df = self.fill_mean(df)
        # fill rows with missing data using forward values
        fill_forward_df = self.fill_forward(df)
        # fill rows with missing data using backward values
        fill_backward_df = self.fill_backward(df)
        # fill rows with missing data using linear interpolation
        fill_linear_interpolation_df = self.fill_linear_interpolation(df)

        return (
            drop_missing_df,
            fill_mean_df,
            fill_forward_df,
            fill_backward_df,
            fill_linear_interpolation_df,
        )