Get better at data science interviews by solving a few questions per week.

Join thousands of other data scientists and analysts practicing for interviews!

1 We write questions

Get relevant data science interview questions frequently asked at top companies.

2 You solve them

Solve the problem before receiving the solution the next morning.

3 We send you the solution (Premium)

Check your work and get better at interviewing!

Sample question 1: Statistical knowledge

Suppose there are 15 different color crayons in a box. Each time one obtains a crayon, it is equally likely to be any of the 15 types. Compute the expected # of different colors that are obtained in a set of 5 crayons. (Hint: use indicator variables and linearity of expectation)

We enumerate the crayon colors from 1 to 15. Let \(X_i\) be an indicator variable that equals 1 if color i appears at least once among the 5 crayons obtained, and 0 otherwise.

So,

\(E(X_i) =\) Pr {at least one type i crayon is in set of 5}

\(E(X_i) =\) 1 - Pr {no type i crayons in set of 5}

\(E(X_i) = 1 - \left(\frac{14}{15}\right)^5\)

Therefore, by linearity of expectation, the expected # of different colors is:

\( = \sum_{i=1}^{15} E(X_i)\)

\( = 15\left[1 - \left(\frac{14}{15}\right)^5\right]\)

\( \approx 4.38\)
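As a sanity check on the derivation above, here is a minimal Python sketch that evaluates the closed-form answer exactly and compares it with a Monte Carlo estimate. The function names, the trial count, and the seed are illustrative choices, not part of the original solution.

```python
import random
from fractions import Fraction


def expected_distinct_colors(num_colors: int, num_draws: int) -> Fraction:
    """Closed form: num_colors * (1 - ((num_colors - 1) / num_colors) ** num_draws)."""
    # Probability that one particular color never shows up in the draws.
    p_missing = Fraction(num_colors - 1, num_colors) ** num_draws
    # Linearity of expectation: sum the same indicator expectation over all colors.
    return num_colors * (1 - p_missing)


def simulate_distinct_colors(num_colors: int, num_draws: int,
                             trials: int, seed: int = 0) -> float:
    """Estimate the same quantity by repeatedly drawing crayons at random."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    total = 0
    for _ in range(trials):
        # Each draw is equally likely to be any color; a set keeps the distinct ones.
        drawn = {rng.randrange(num_colors) for _ in range(num_draws)}
        total += len(drawn)
    return total / trials


exact = expected_distinct_colors(15, 5)
estimate = simulate_distinct_colors(15, 5, trials=100_000)

print(round(float(exact), 2))  # 4.38
print(estimate)                # should land close to the exact value
```

Using `Fraction` keeps the closed-form value exact, so any gap between the two printed numbers comes from simulation noise alone.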

Sample question 2: Coding/computation

Suppose you have a dataframe, df, with the following records:

| | age | favorite_color | grade | name |
|---|---|---|---|---|
| 0 | 20 | blue | 88 | Willard Morris |
| 1 | 19 | blue | 92 | Al Jennings |
| 2 | 22 | yellow | 95 | Omar Mullins |
| 3 | 21 | green | 70 | Spencer McDaniel |

The dataframe is showing information about students. Write code using Python Pandas to select the rows where the students' favorite color is blue or yellow and their grade is at least 90.


#Define our list of target colors

fav_color_filter = ['blue', 'yellow']

#To select rows whose favorite_color is in the list we defined as fav_color_filter, we can use isin

df = df.loc[df['favorite_color'].isin(fav_color_filter)]

#Next, we keep only grades of at least 90 (>= 90). Here we can use loc on our dataframe again:

df = df.loc[(df['grade'] >= 90)]

#preview the dataframe

df.head()

Resultant dataframe:

| | age | favorite_color | grade | name |
|---|---|---|---|---|
| 1 | 19 | blue | 92 | Al Jennings |
| 2 | 22 | yellow | 95 | Omar Mullins |
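For readers who want to run the solution end to end, the sketch below rebuilds the sample dataframe and applies both conditions as a single boolean mask. Combining the filters with `&` is a stylistic alternative to the two-step filtering in the solution, not the original author's code; the names `mask` and `result` are illustrative.

```python
import pandas as pd

# Rebuild the sample records from the question.
df = pd.DataFrame({
    "age": [20, 19, 22, 21],
    "favorite_color": ["blue", "blue", "yellow", "green"],
    "grade": [88, 92, 95, 70],
    "name": ["Willard Morris", "Al Jennings", "Omar Mullins", "Spencer McDaniel"],
})

fav_color_filter = ["blue", "yellow"]

# Combine both conditions into one boolean mask. Each comparison needs its own
# parentheses because & binds more tightly than >= in Python.
mask = df["favorite_color"].isin(fav_color_filter) & (df["grade"] >= 90)

# Assign to a new name so the original df stays available for other filters.
result = df.loc[mask]

print(result)
```

Keeping the filtered rows in `result` rather than overwriting `df` makes it easier to try a different filter without rebuilding the data.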


Sample question 3: SQL/Database Querying

Suppose you are given the following table, containing the total tonnage of trash for various landfills across various states. In other words, each row represents the total weight (in tons) of trash at a specific landfill site in a specific state.

Table: landfill_weights

landfillID | weight | state | number_garbage_vehicles |
---|---|---|---|
12300 | 95 | California | 1005 |
12401 | 85 | California | 850 |
00992 | 105 | New York | 1300 |
00882 | 100 | New York | 1000 |
11100 | 55 | Michigan | 580 |
11201 | 75 | Michigan | 700 |
11207 | 60 | Michigan | 500 |

Using the above table, write a SQL query to return the landfill with the second highest amount of garbage (based on weight) for each state shown.

*Note:* You can assume each row represents a unique landfill (i.e. the weights shown are the total weights, and do not need further aggregation) and each weight happens to be unique (i.e. there are no ties).

SELECT

-- (3) Pull all records from our subquery below

*

FROM

-- (1) Here, we create a subquery to rank the landfills by weight

-- within each state

(SELECT

t.landfillID,

t.weight,

t.state,

-- (2) Using RANK, we assign an integer ranking to each row,

-- allowing us to see which landfills have the most garbage

-- in them (and second most, etc.) in each state

RANK() OVER(PARTITION BY t.state ORDER BY t.weight DESC) AS rank_weight

FROM landfill_weights t) stg

-- (4) When pulling from our subquery, we simply filter where rank_weight = 2

-- to grab the landfill with the second highest amount of garbage in each state

WHERE stg.rank_weight = 2
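To try the ranking query without setting up a database server, one option is Python's built-in `sqlite3` module with an in-memory database. This is a sketch under a few assumptions: the bundled SQLite is version 3.25 or newer (needed for window functions), `landfillID` is stored as text to preserve leading zeros, and the column types and the trailing `ORDER BY` are choices made here for a stable output order, not part of the original query.

```python
import sqlite3

# Sample rows from the landfill_weights table in the question.
rows = [
    ("12300", 95, "California", 1005),
    ("12401", 85, "California", 850),
    ("00992", 105, "New York", 1300),
    ("00882", 100, "New York", 1000),
    ("11100", 55, "Michigan", 580),
    ("11201", 75, "Michigan", 700),
    ("11207", 60, "Michigan", 500),
]

QUERY = """
SELECT stg.landfillID, stg.weight, stg.state
FROM (
    SELECT
        t.landfillID,
        t.weight,
        t.state,
        -- Rank landfills from heaviest to lightest within each state.
        RANK() OVER (PARTITION BY t.state ORDER BY t.weight DESC) AS rank_weight
    FROM landfill_weights t
) stg
WHERE stg.rank_weight = 2
ORDER BY stg.state
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE landfill_weights ("
    "landfillID TEXT, weight INTEGER, state TEXT, number_garbage_vehicles INTEGER)"
)
conn.executemany("INSERT INTO landfill_weights VALUES (?, ?, ?, ?)", rows)

second_heaviest = conn.execute(QUERY).fetchall()
conn.close()

for row in second_heaviest:
    print(row)
```

With the sample data, this returns one row per state: landfill 12401 for California, 11207 for Michigan, and 00882 for New York.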

**Deepti**

Data Interview Qs helped me cover all the basics of technical data science interviews, from SQL/Python to probability/statistics. The confidence/knowledge I gained from the service was incredibly useful, helping me land a data scientist position at Red Hat.

**Dylan**

I've been on the mailing list since the initial beta, and found the questions to be very helpful with the technical side of my data science interview at Facebook, ultimately helping me land a role.

**Nayana**

Data Interview Qs has been a great way to stay relevant with my data skills at Facebook, and has given me an edge in my Master’s Program in Data Science at Georgia Tech! It’s awesome to get a wide variety of questions and solutions sent to me on a regular basis.

**Walker**

Data Interview Qs was very helpful in brushing up my SQL skills, providing a good mix of both the basics and more challenging questions leading up to an interview, ultimately helping me land a job at KLA.

**Melissa**

I've been enjoying the mix of questions coming out of Data Interview Qs. The balance between stats, data manipulation, classic programming questions, and SQL came in handy during my Amazon interview.

**Robert**

I’m not actively looking for a job, but have found Data Interview Qs to be helpful at keeping my data skills fresh as well as providing tips/tricks to utilize in my current role at Google. It's a great product for anyone looking for practice problems on a regular basis!