Wrangling missing values in student dataset

Question

You're given the following table (along with the Python code to generate it) showing student information:

# Python code to generate example dataframe
import pandas as pd
import numpy as np
raw_data = {
    'age': [20, 19, 22, 21, 21, 22, 23, 23],
    'major': ['Engineering', 'Business', 'Engineering', 'Engineering',
              'Business', 'Business', 'Engineering', 'Engineering'],
    'grade': [88, 95, 92, 70, 68, 92, 75, 85],
    'perc_attendance': [88, 95, 95, 92, 78, 87, 91, np.nan],
    'student_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'weekly_hrs_studying': [np.nan, 8, 12, 14, 6, 7, 12, np.nan],
}
df = pd.DataFrame(raw_data, columns=['student_id', 'age', 'major', 'grade',
                                     'perc_attendance', 'weekly_hrs_studying'])
df

student_id	age	major	grade	perc_attendance	weekly_hrs_studying
1	20	Engineering	88	88	NaN
2	19	Business	95	95	8
3	22	Engineering	92	95	12
4	21	Engineering	70	92	14
5	21	Business	68	78	6
6	22	Business	92	87	7
7	23	Engineering	75	91	12
8	23	Engineering	85	NaN	NaN

Using this table:

Drop all rows with empty data in any column
Drop all rows with empty data in more than two columns (i.e. rows where 3 or fewer of the 6 columns are populated)
Impute the missing weekly_hrs_studying with each major's mean value

Solution
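
The following is a minimal sketch of one possible approach with pandas, not an official answer. It assumes the dataframe from the question, and the names complete_rows, mostly_complete, imputed, and max_missing are illustrative. For the second task it reads "more than two columns" literally: a row is kept if it has at most two missing values, which with 6 columns means at least 4 populated. On this data no row is dropped by that rule, since student 8 has exactly two missing values. If "two or more" was intended, setting max_missing = 1 would drop that row.

```python
import numpy as np
import pandas as pd

# Same data as in the question.
raw_data = {
    'age': [20, 19, 22, 21, 21, 22, 23, 23],
    'major': ['Engineering', 'Business', 'Engineering', 'Engineering',
              'Business', 'Business', 'Engineering', 'Engineering'],
    'grade': [88, 95, 92, 70, 68, 92, 75, 85],
    'perc_attendance': [88, 95, 95, 92, 78, 87, 91, np.nan],
    'student_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'weekly_hrs_studying': [np.nan, 8, 12, 14, 6, 7, 12, np.nan],
}
df = pd.DataFrame(raw_data, columns=['student_id', 'age', 'major', 'grade',
                                     'perc_attendance', 'weekly_hrs_studying'])

# 1. Drop rows with a missing value in any column.
complete_rows = df.dropna(how='any')

# 2. Drop rows with missing values in more than two columns.
#    `thresh` is the minimum number of non-missing values a row needs
#    in order to be kept, so "at most max_missing missing" translates to
#    thresh = number of columns - max_missing.
max_missing = 2  # assumption: literal reading of "more than two"
mostly_complete = df.dropna(thresh=df.shape[1] - max_missing)

# 3. Fill missing weekly_hrs_studying with the mean of the student's major.
#    transform('mean') returns a Series aligned with df's index, holding
#    each row's group mean (NaN values are skipped when averaging).
imputed = df.copy()
major_mean = imputed.groupby('major')['weekly_hrs_studying'].transform('mean')
imputed['weekly_hrs_studying'] = imputed['weekly_hrs_studying'].fillna(major_mean)
```

Each step returns a new dataframe and leaves df unchanged, so the three results can be compared side by side. The imputation only touches weekly_hrs_studying, so the missing perc_attendance for student 8 remains NaN in imputed.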
