Finding duplicate values in a SQL table


The following example uses ipython-sql. To install:
pip install ipython-sql

%sql is a Jupyter magic command and can be ignored when not using Jupyter notebooks.

Configure SQL for Jupyter notebook

#Load sql
%load_ext sql
#connect to sqlite
%sql sqlite://

Create a table

%%sql
CREATE TABLE players2 (first_name, last_name, age);
INSERT INTO players2 (first_name, last_name, age) VALUES
('Lebron', 'James', 33),
('Steph', 'Curry', 30),
('Mike', 'Jordan', 55),
('Mike', 'Bibby', 40);
%%sql
SELECT * FROM players2
first_name last_name age
Lebron James 33
Steph Curry 30
Mike Jordan 55
Mike Bibby 40

Find duplicate values in SQL table based on single column

%%sql 
--Select all duplicate first names, and return the count
SELECT first_name, COUNT(first_name)
FROM players2
GROUP BY first_name
HAVING COUNT(first_name) > 1
first_name COUNT(first_name)
Mike 2

Create another table

%%sql

CREATE TABLE players3 (first_name, last_name, age);
INSERT INTO players3 (first_name, last_name, age) VALUES
('Lebron', 'James', 33),
('Steph', 'Curry', 30),
('Mike', 'Jordan', 55),
('Mike', 'Bibby', 40),
('Mike', 'Smith', 40);

Find duplicate values in SQL table based on multiple columns

%%sql 
--Where both first_name and age are duplicates,
--here should return Mike with a count of two, since there are two 40 year old Mikes
SELECT
first_name, age, COUNT(*)
FROM
players3
GROUP BY
first_name, age
HAVING
COUNT(*) > 1
 * sqlite:// 
Done.
first_name age COUNT(*)
Mike 40 2


Ace your next data science interview

Get better at data science interviews by solving a few questions per week



Find a bug? Submit a suggested change on Github, or message me on Twitter.