Cleaning Missing Data

Most datasets encountered in data science are not fully clean and can contain a lot of missing values due to a variety of reasons, including human error during data entry, technical issues with data collection, or the data not being applicable in certain cases. This makes data cleaning an essential step in the data analysis process. In this exercise, we will explore the approaches in cleaning missing data values.

Dropping Rows with Missing Values

Often it is convenient to simply drop rows that contain missing values. This is more appropriate when the dataset is large and the proportion of rows with missing values is low.

Consider the following table of numbers and their corresponding square roots:

`number`	`square_root`
2	1.414
3	`null`
5	`null`
`null`	3.0
12	3.464

If we were to drop row with missing values, we would drop the third and fourth rows:

`number`	`square_root`
2	1.414
12	3.464

Note that while this is straightforward, it can sometimes lead to the loss of valuable data.

Filling with Mean

Instead of dropping the row entirely, it is sometimes useful to fill the missing values with the mean value of the column.

Consider again the following table of numbers and their corresponding square roots:

`number`	`square_root`
2	1.414
3	`null`
5	`null`
`null`	3.0
12	3.464

Filling in rows with mean values, we would have:

`number`	`square_root`
2	1.414
3	2.626
5	2.626
5.5	3.0
12	3.464

Forward-Filling

Sometimes it is also effective to fill the missing values using the last valid value.

Consider again the following table of numbers and their corresponding square roots:

`number`	`square_root`
2	1.414
3	`null`
5	`null`
`null`	3.0
12	3.464

Forward-filling rows, we would have:

`number`	`square_root`
2	1.414
3	1.414
5	1.414
5	3.0
12	3.464

Backward-Filling

Similarly, we can also fill the missing values using the next valid value.

Consider again the following table of numbers and their corresponding square roots:

`number`	`square_root`
2	1.414
3	`null`
5	`null`
`null`	3.0
12	3.464

Backward-filling rows, we would have:

`number`	`square_root`
2	1.414
3	3.0
5	3.0
12	3.0
12	3.464

Linear Interpolation

Finally, we can also fill the missing values using linear interpolation. This involves filling rows with values that are evenly spaced between the last and next valid values.

Consider again the following table of numbers and their corresponding square roots:

`number`	`square_root`
2	1.414
3	`null`
5	`null`
`null`	3.0
12	3.464

Interpolating missing values, we would have:

`number`	`square_root`
2	1.414
3	1.943
5	2.471
8.5	3.0
12	3.464

Exercise

Write a program using Pandas that uses various approaches to clean given dataframes with missing values. Complete the drop_missing(), fill_mean(), fill_forward(), fill_backward(), and fill_linear_interpolation() methods:

drop_missing() takes in an input dataframe df and returns a new dataframe where rows with missing values are dropped
fill_mean() takes in an input dataframe df and returns a new dataframe where rows with missing values filled using column means
fill_forward() takes in an input dataframe df and returns a new dataframe where rows with missing values are forward-filled
fill_backward() takes in an input dataframe df and returns a new dataframe where rows with missing values are backward-filled
fill_linear_interpolation() takes in an input dataframe df and returns a new dataframe where rows with missing values are filled using linear interpolation

These methods must return new dataframes, do not modify the input dataframe df.

Sample Test Cases

Test Case 1

Input:

[
  {
    "number": [2, 3, 5, null, 12],
    "square_root": [1.414, null, null, 3.0, 3.464]
  }
]

Output:

drop_missing_df =

 number  square_root
    2.0        1.414
   12.0        3.464

fill_mean_df =

 number  square_root
    2.0        1.414
    3.0        2.626
    5.0        2.626
    5.5        3.000
   12.0        3.464

fill_forward_df =

 number  square_root
    2.0        1.414
    3.0        1.414
    5.0        1.414
    5.0        3.000
   12.0        3.464

fill_backward_df =

 number  square_root
    2.0        1.414
    3.0        3.000
    5.0        3.000
   12.0        3.000
   12.0        3.464

fill_linear_interpolation_df =

 number  square_root
    2.0     1.414000
    3.0     1.942667
    5.0     2.471333
    8.5     3.000000
   12.0     3.464000

import pandas as pd

class CleaningMissingData:

def drop_missing(self, df: pd.DataFrame) -> pd.DataFrame:

# rewrite this function