Wrangling missing values in a student dataset

Question

You're given the following table (along with the Python code to generate it) showing student information:

# Python code to generate example dataframe
import pandas as pd
import numpy as np
raw_data = {'age': [20, 19, 22, 21, 21, 22, 23, 23], 
        'major': ['Engineering', 'Business', 'Engineering', 'Engineering', 'Business', 'Business', 'Engineering', 'Engineering'], 
        'grade': [88, 95, 92, 70, 68, 92, 75, 85], 
        'perc_attendance': [88,95,95,92,78,87,91,np.nan],        
        'student_id': [1,2,3,4,5,6,7,8],
        'weekly_hrs_studying': [np.nan,8,12,14,6,7,12,np.nan]}
df = pd.DataFrame(raw_data, columns = ['student_id', 'age', 'major', 'grade', 'perc_attendance', 'weekly_hrs_studying'])
df
student_id  age  major        grade  perc_attendance  weekly_hrs_studying
1           20   Engineering  88     88               NaN
2           19   Business     95     95               8
3           22   Engineering  92     95               12
4           21   Engineering  70     92               14
5           21   Business     68     78               6
6           22   Business     92     87               7
7           23   Engineering  75     91               12
8           23   Engineering  85     NaN              NaN

Using this table:

  • Drop all rows with empty data in any column
  • Drop all rows with empty data in more than two columns (i.e. rows with 3 or fewer of the 6 columns populated)
  • Impute the missing weekly_hrs_studying with each major's mean value

Solution
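
One possible approach to the three tasks is sketched below. It is not an official answer, only a minimal example under stated assumptions: the second task is read literally as "more than two missing columns", so a row is kept when at least four of its six columns are populated. The names complete_rows, mostly_complete, imputed and major_means are illustrative and do not come from the question.

```python
import numpy as np
import pandas as pd

# Rebuild the example dataframe from the question.
raw_data = {
    'student_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'age': [20, 19, 22, 21, 21, 22, 23, 23],
    'major': ['Engineering', 'Business', 'Engineering', 'Engineering',
              'Business', 'Business', 'Engineering', 'Engineering'],
    'grade': [88, 95, 92, 70, 68, 92, 75, 85],
    'perc_attendance': [88, 95, 95, 92, 78, 87, 91, np.nan],
    'weekly_hrs_studying': [np.nan, 8, 12, 14, 6, 7, 12, np.nan],
}
df = pd.DataFrame(raw_data)

# Task 1: drop rows with a missing value in any column.
complete_rows = df.dropna(how='any')

# Task 2 (assumed reading): drop rows with more than two missing columns,
# i.e. keep rows that have at least (number of columns - 2) populated values.
mostly_complete = df.dropna(thresh=df.shape[1] - 2)

# Task 3: fill missing weekly_hrs_studying with the mean for the same major.
# transform('mean') returns one value per row, aligned with the original
# index, and ignores NaN when averaging.
imputed = df.copy()
major_means = imputed.groupby('major')['weekly_hrs_studying'].transform('mean')
imputed['weekly_hrs_studying'] = imputed['weekly_hrs_studying'].fillna(major_means)
```

With the sample data, task 1 removes students 1 and 8. Task 2 removes nothing, because no row is missing more than two values. Task 3 fills both gaps with the Engineering mean of the three known values (12, 14, 12), since both students with missing study hours are Engineering majors.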
