Drop duplicate rows in Pandas based on column value


import modules

import pandas as pd
import numpy as np

create dummy dataframe

raw_data = {'name': ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
'age': [20, 19, 22, 21],
'favorite_color': ['blue', 'blue', 'yellow', "green"],
'grade': [88, 92, 95, 70]}
df = pd.DataFrame(raw_data, index = ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'])
df
age favorite_color grade name
Willard Morris 20 blue 88 Willard Morris
Al Jennings 19 blue 92 Al Jennings
Omar Mullins 22 yellow 95 Omar Mullins
Spencer McDaniel 21 green 70 Spencer McDaniel

drop a duplicate row, based on column name

#here we should drop Al Jennings' record from the df, 
#since his favorite color, blue, is a duplicate with Willard Morris
df = df.drop_duplicates(subset='favorite_color', keep="first")
df
age favorite_color grade name
Willard Morris 20 blue 88 Willard Morris
Omar Mullins 22 yellow 95 Omar Mullins
Spencer McDaniel 21 green 70 Spencer McDaniel


Ace your next data science interview

Get better at data science interviews by solving a few questions per week



Find a bug? Submit a suggested change on Github, or message me on Twitter.