Get better at data science interviews by solving a few questions per week.

Join thousands of other data scientists and analysts practicing for interviews!

We will never spam. One-click unsubscribe.

1 We write questions

Get relevant data science interview questions frequently asked at top companies.

2 You solve them

Solve the problem before receiving the solution the next morning.

3 We send you the solution (Premium)

Check your work and get better at interviewing!

Sample question 1: Statistical knowledge

Suppose a box contains crayons in 15 different colors. Each time one obtains a crayon, it is equally likely to be any of the 15 colors. Compute the expected number of different colors obtained in a set of 5 crayons. (Hint: use indicator variables and linearity of expectation.)

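We number the colors from 1 to 15. Let \(X_i\) be an indicator variable equal to 1 if at least one crayon of color \(i\) is among the 5 crayons obtained, and 0 otherwise.

So,

\(E(X_i) = \Pr\{\text{at least one crayon of color } i \text{ is in the set of 5}\}\)

\(E(X_i) = 1 - \Pr\{\text{no crayon of color } i \text{ is in the set of 5}\}\)

Each crayon independently misses color \(i\) with probability \(\frac{14}{15}\), so:

\(E(X_i) = 1 - \left(\frac{14}{15}\right)^5\)

Therefore, by linearity of expectation, the expected number of different colors is:

\(E\left(\sum_{i=1}^{15} X_i\right) = \sum_{i=1}^{15} E(X_i)\)

\( = 15\left[1 - \left(\frac{14}{15}\right)^5\right]\)

\( \approx 4.38\)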
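As a sanity check on the derivation above, here is a minimal Python sketch that evaluates the closed form and compares it with a Monte Carlo estimate. The function names, trial count, and seed are illustrative assumptions, not part of the original solution.

```python
import random


def expected_distinct_exact(n_colors: int = 15, n_draws: int = 5) -> float:
    """Closed form from linearity of expectation: n * (1 - ((n - 1) / n) ** k)."""
    return n_colors * (1 - ((n_colors - 1) / n_colors) ** n_draws)


def expected_distinct_simulated(
    n_colors: int = 15, n_draws: int = 5, n_trials: int = 100_000, seed: int = 0
) -> float:
    """Estimate the same quantity by repeatedly drawing crayons at random."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatable runs
    total = 0
    for _ in range(n_trials):
        # Each draw is equally likely to be any color; a set keeps distinct colors only.
        colors_seen = {rng.randrange(n_colors) for _ in range(n_draws)}
        total += len(colors_seen)
    return total / n_trials


exact = expected_distinct_exact()
simulated = expected_distinct_simulated()
print(f"exact = {exact:.4f}, simulated = {simulated:.4f}")
```

The simulated value should land close to the exact one, which rounds to 4.38.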

Sample question 2: Coding/computation

Suppose you have a dataframe, df, with the following records:

| | age | favorite_color | grade | name |
|---|---|---|---|---|
| 0 | 20 | blue | 88 | Willard Morris |
| 1 | 19 | blue | 92 | Al Jennings |
| 2 | 22 | yellow | 95 | Omar Mullins |
| 3 | 21 | green | 70 | Spencer McDaniel |

The dataframe is showing information about students. Write code using Python Pandas to select the rows where the students' favorite color is blue or yellow and their grade is at least 90.

    # Define the list of target colors
    fav_color_filter = ['blue', 'yellow']

    # Keep the rows whose favorite_color is one of the target colors, using isin
    df = df.loc[df['favorite_color'].isin(fav_color_filter)]

    # Next, keep the rows with a grade of at least 90
    df = df.loc[df['grade'] >= 90]

    # Preview the filtered dataframe
    df.head()

Resultant dataframe:

| | age | favorite_color | grade | name |
|---|---|---|---|---|
| 1 | 19 | blue | 92 | Al Jennings |
| 2 | 22 | yellow | 95 | Omar Mullins |
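For readers who want to run the solution end to end, here is a self-contained sketch that rebuilds the sample dataframe and applies both conditions in a single boolean mask. The variable names `mask` and `selected` are illustrative assumptions. Combining the conditions with `&` is one alternative to the two-step filter and leaves the original `df` unmodified.

```python
import pandas as pd

# Rebuild the sample dataframe from the question.
df = pd.DataFrame(
    {
        "age": [20, 19, 22, 21],
        "favorite_color": ["blue", "blue", "yellow", "green"],
        "grade": [88, 92, 95, 70],
        "name": ["Willard Morris", "Al Jennings", "Omar Mullins", "Spencer McDaniel"],
    }
)

# Both conditions must hold: color is blue or yellow, and grade is at least 90.
# Each comparison needs its own parentheses because & binds tighter than >=.
mask = df["favorite_color"].isin(["blue", "yellow"]) & (df["grade"] >= 90)
selected = df.loc[mask]

print(selected)
```

Because `selected` is a new object, `df` still holds all four students afterwards.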

Sample question 3: SQL/Database Querying

Suppose you are given the following table, containing information about the total tonnage of trash at various landfills across several states. In other words, each row represents the total weight (in tons) of trash at a specific landfill site in a specific state.

Table: landfill_weights

landfillID | weight | state | number_garbage_vehicles |
---|---|---|---|
12300 | 95 | California | 1005 |
12401 | 85 | California | 850 |
00992 | 105 | New York | 1300 |
00882 | 100 | New York | 1000 |
11100 | 55 | Michigan | 580 |
11201 | 75 | Michigan | 700 |
11207 | 60 | Michigan | 500 |

Using the above table, write a SQL query to return the landfill with the second highest amount of garbage (based on weight) for each state shown.

*Note:* You can assume each row represents a unique landfill (i.e. the weights shown are the total weights, and do not need further aggregation) and each weight happens to be unique (i.e. there are no ties).

    SELECT
        -- (3) Pull all records from our subquery below
        *
    FROM
        -- (1) Here, we create a subquery to stack rank the landfills by weight
        -- for each given state
        (SELECT
            t.landfillID,
            t.weight,
            t.state,
            -- (2) Using RANK, we assign an integer ranking to each row,
            -- allowing us to see which landfills have the most garbage
            -- in them (and second most, etc.) in each state
            RANK() OVER (PARTITION BY t.state ORDER BY t.weight DESC) AS rank_weight
        FROM landfill_weights t) stg
    -- (4) When pulling from our subquery, we simply filter where rank_weight = 2
    -- to grab the landfills with the second-highest amount of garbage, per the question prompt
    WHERE stg.rank_weight = 2
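To try the ranking query without a database server, here is a small sketch using Python's built-in `sqlite3` module with an in-memory table. It assumes the bundled SQLite is version 3.25 or newer, where window functions such as `RANK()` are available. Storing `landfillID` as text (to keep leading zeros) and the added `ORDER BY` are also assumptions made for this illustration.

```python
import sqlite3

# Sample rows from the landfill_weights table in the question.
rows = [
    ("12300", 95, "California", 1005),
    ("12401", 85, "California", 850),
    ("00992", 105, "New York", 1300),
    ("00882", 100, "New York", 1000),
    ("11100", 55, "Michigan", 580),
    ("11201", 75, "Michigan", 700),
    ("11207", 60, "Michigan", 500),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE landfill_weights (
        landfillID TEXT,  -- text, so IDs like 00992 keep their leading zeros
        weight INTEGER,
        state TEXT,
        number_garbage_vehicles INTEGER
    )
    """
)
conn.executemany("INSERT INTO landfill_weights VALUES (?, ?, ?, ?)", rows)

# Rank landfills by weight within each state, then keep rank 2.
query = """
SELECT stg.landfillID, stg.weight, stg.state
FROM (
    SELECT
        t.landfillID,
        t.weight,
        t.state,
        RANK() OVER (PARTITION BY t.state ORDER BY t.weight DESC) AS rank_weight
    FROM landfill_weights t
) stg
WHERE stg.rank_weight = 2
ORDER BY stg.state
"""
second_heaviest = conn.execute(query).fetchall()
conn.close()

for landfill_id, weight, state in second_heaviest:
    print(f"{state}: landfill {landfill_id} ({weight} tons)")
```

With the sample data, this returns one landfill per state: 12401 for California, 11207 for Michigan, and 00882 for New York.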

**Dylan**

I've been on the mailing list since the initial beta, and found the questions to be very helpful with my data science interview at Facebook!

**Melissa**

I've been enjoying the mix of questions coming out of Data Interview Qs. The balance between stats, data manipulation, classic programming questions, and SQL came in handy during my Amazon interview.

**Richard**

Data Interview Qs helped me land an analyst role at Google. The ROI here is great, and I would recommend it to anyone seeking a role in the data science space.