Wrangling missing values in student dataset
Question
You're given the following table (along with the Python code to generate it) showing student information:
```python
# Python code to generate example dataframe
import pandas as pd
import numpy as np

raw_data = {'age': [20, 19, 22, 21, 21, 22, 23, 23],
            'major': ['Engineering', 'Business', 'Engineering', 'Engineering',
                      'Business', 'Business', 'Engineering', 'Engineering'],
            'grade': [88, 95, 92, 70, 68, 92, 75, 85],
            'perc_attendance': [88, 95, 95, 92, 78, 87, 91, np.nan],
            'student_id': [1, 2, 3, 4, 5, 6, 7, 8],
            'weekly_hrs_studying': [np.nan, 8, 12, 14, 6, 7, 12, np.nan]}

# Pass `columns` to fix the column order shown in the table below
df = pd.DataFrame(raw_data, columns=['student_id', 'age', 'major', 'grade',
                                     'perc_attendance', 'weekly_hrs_studying'])
df
```
student_id | age | major | grade | perc_attendance | weekly_hrs_studying |
---|---|---|---|---|---|
1 | 20 | Engineering | 88 | 88 | NaN |
2 | 19 | Business | 95 | 95 | 8 |
3 | 22 | Engineering | 92 | 95 | 12 |
4 | 21 | Engineering | 70 | 92 | 14 |
5 | 21 | Business | 68 | 78 | 6 |
6 | 22 | Business | 92 | 87 | 7 |
7 | 23 | Engineering | 75 | 91 | 12 |
8 | 23 | Engineering | 85 | NaN | NaN |
Using this table:
- Drop all rows with empty data in any column
- Drop all rows with empty data in more than two columns (i.e. 3 or fewer of the 6 columns populated with data in a given row)
- Impute each missing weekly_hrs_studying value with the mean weekly_hrs_studying of that student's major
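
One possible way to approach the three tasks above is sketched below with pandas. This is a minimal sketch, not a reference solution: the names `complete_rows`, `mostly_complete`, `imputed`, and `max_missing` are illustrative and do not come from the question. Step 2 reads "more than two columns" literally, so a row survives as long as it has at most two missing values.

```python
import numpy as np
import pandas as pd

# Rebuild the example dataframe from the question.
raw_data = {
    'student_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'age': [20, 19, 22, 21, 21, 22, 23, 23],
    'major': ['Engineering', 'Business', 'Engineering', 'Engineering',
              'Business', 'Business', 'Engineering', 'Engineering'],
    'grade': [88, 95, 92, 70, 68, 92, 75, 85],
    'perc_attendance': [88, 95, 95, 92, 78, 87, 91, np.nan],
    'weekly_hrs_studying': [np.nan, 8, 12, 14, 6, 7, 12, np.nan],
}
df = pd.DataFrame(raw_data)

# 1. Drop every row that has a missing value in any column.
complete_rows = df.dropna(how='any')

# 2. Drop rows with missing values in more than `max_missing` columns.
#    `thresh` is the minimum number of non-missing values a row needs
#    in order to be kept. (`max_missing` is an illustrative name.)
max_missing = 2
mostly_complete = df.dropna(thresh=df.shape[1] - max_missing)

# 3. Fill missing weekly_hrs_studying with the mean of the same major.
#    transform('mean') returns one value per row, aligned to df's index,
#    and skips NaN when averaging.
major_means = df.groupby('major')['weekly_hrs_studying'].transform('mean')
imputed = df.assign(
    weekly_hrs_studying=df['weekly_hrs_studying'].fillna(major_means)
)

print(complete_rows)
print(mostly_complete)
print(imputed)
```

With this table, step 1 removes students 1 and 8, while step 2 removes nothing because no row has more than two missing values. Setting `max_missing = 1` would drop only student 8. None of the three steps modifies `df` in place; each returns a new dataframe.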