Remove duplicate columns by name in Pandas


import pandas as pd
import numpy as np

Create a dataframe

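# Create a dataframe with a repeated 'grade' column.
# A dict cannot hold two 'grade' keys (the second silently replaces the first),
# so the data is passed as rows and the column names are listed explicitly.
raw_data = [[20, 'blue', 88, 88, 'Willard Morris'],
            [19, 'red', 92, 92, 'Al Jennings']]
df = pd.DataFrame(raw_data, index = ['Willard Morris', 'Al Jennings'],
                  columns = ['age', 'favorite_color', 'grade', 'grade', 'name'])
df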
                age  favorite_color  grade  grade            name
Willard Morris   20            blue     88     88  Willard Morris
Al Jennings      19             red     92     92     Al Jennings


Remove duplicate columns (based on column name)

# keep only the first occurrence of each column name
df = df.loc[:,~df.columns.duplicated()]
df
                age  favorite_color  grade            name
Willard Morris   20            blue     88  Willard Morris
Al Jennings      19             red     92     Al Jennings

df.columns.duplicated() returns a boolean array with one value per column: False means the column name has not appeared before (reading left to right), True means it repeats an earlier name. For the dataframe above, it returns [False, False, False, True, False].

Pandas accepts a boolean array as an indexer and selects only the positions marked True.

Since we want to keep the columns that are not duplicates, the array has to be inverted with ~ first (i.e. ~[False, False, False, True, False] -> [True, True, True, False, True]).
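To make the three steps above concrete, here is a minimal sketch that prints each intermediate result. The frame is rebuilt here so the snippet runs on its own, and the variable names (is_duplicate, keep, deduped) are illustrative, not part of pandas.

```python
import pandas as pd

# Illustrative frame with a repeated 'grade' column, mirroring the example above.
df = pd.DataFrame(
    [[20, 'blue', 88, 88, 'Willard Morris'],
     [19, 'red', 92, 92, 'Al Jennings']],
    columns=['age', 'favorite_color', 'grade', 'grade', 'name'],
)

# Step 1: True marks a column whose name already appeared further to the left.
is_duplicate = df.columns.duplicated()
print(is_duplicate.tolist())    # [False, False, False, True, False]

# Step 2: ~ inverts the mask, so True now marks the columns to keep.
keep = ~is_duplicate
print(keep.tolist())            # [True, True, True, False, True]

# Step 3: boolean indexing on the column axis keeps only the True positions.
deduped = df.loc[:, keep]
print(list(deduped.columns))    # ['age', 'favorite_color', 'grade', 'name']
```

By default duplicated() treats the first occurrence as the original. Passing keep='last' should instead keep the right-most column of each name, which may matter if the repeated columns hold different values.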


