Goldman Sachs PostgreSQL interview question

This short article answers the following PostgreSQL question:


The two tables that we are given have the following structure:


The first thing that we need to do is join both tables so we have all the necessary information in one table. We will do that with a Common Table Expression (CTE):

WITH temp AS (
    SELECT
        ta.train_id,
        ta.arrival_time,
        td.departure_time
    FROM train_arrivals ta
    INNER JOIN train_departures td
        ON ta.train_id = td.train_id
)

In order to find the minimum number of platforms needed, we have to calculate, for each train, how many trains are already in the station when said train arrives. For example, if train 5 arrives and there are 3 other trains already in the station (they arrived before and haven’t left yet), we would need a total of 4 platforms to accommodate all of the traffic.

We can accomplish that calculation by simply doing a self-join of the previous CTE. That way, for each train, we will calculate how many other trains coincide with it (in this context, coincide means that the time of the arriving train is between the arrival and departure time of any other trains). Since a train will always coincide with itself, we need to subtract one from that count:

WITH temp AS (
    SELECT
        ta.train_id,
        ta.arrival_time,
        td.departure_time
    FROM train_arrivals ta
    INNER JOIN train_departures td
        ON ta.train_id = td.train_id
)
SELECT
    t1.train_id,
    t1.arrival_time,
    t1.departure_time,
    -- Subtract 1 because every train always coincides with itself
    (COUNT(t1.train_id) - 1) AS min_num_platforms
FROM temp t1
JOIN temp t2
    -- A train "coincides" with t1 if t1 arrives while it is still in the station
    ON t1.arrival_time BETWEEN t2.arrival_time AND t2.departure_time
GROUP BY 1, 2, 3
ORDER BY 4 DESC

That query will give us the following table:


As we can see, when train 4 arrives at the station, there are already 2 trains there (1 and 2), and train 5 arrives at the same time. Therefore, we would need at least 4 platforms to accommodate all scheduled traffic for the day. For any given schedule, the minimum number of platforms needed is simply the maximum value of the min_num_platforms column.
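To make that last step explicit, here is a minimal sketch that wraps the counting query in a second CTE and takes the maximum (platform_counts is our own name, added just for illustration):

WITH temp AS (
    SELECT
        ta.train_id,
        ta.arrival_time,
        td.departure_time
    FROM train_arrivals ta
    INNER JOIN train_departures td
        ON ta.train_id = td.train_id
),
platform_counts AS (
    SELECT
        t1.train_id,
        (COUNT(t1.train_id) - 1) AS min_num_platforms
    FROM temp t1
    JOIN temp t2
        ON t1.arrival_time BETWEEN t2.arrival_time AND t2.departure_time
    GROUP BY 1
)
-- The busiest moment of the day determines how many platforms we need
SELECT MAX(min_num_platforms) AS platforms_needed
FROM platform_counts;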

SQLPad.io hard question 2

This is the question that we are going to answer in this post (you can find it here):


First of all, we need to take into account that the rental table does not have data for days that have no rentals. However, we still need to consider those days and count them as having 0 rentals. That can be done by starting from the date table (which contains every day for May 2020 without any gaps).

Since the date and rental tables are not directly related in the database schema, we need to create a custom join condition based on the date:

SELECT
    EXTRACT(DAY FROM d.date) AS day_,
    COUNT(r.rental_id) AS n_rentals
FROM dates d
LEFT JOIN rental r
    ON DATE(r.rental_ts) = d.date
WHERE EXTRACT(MONTH FROM d.date) = 5
  AND EXTRACT(YEAR FROM d.date) = 2020
GROUP BY 1

With this query, we start from the date table and then LEFT JOIN it to rental, which guarantees that every day in May 2020 will be included in the analysis. The count of rentals will be 0 for days that have no data in the rental table:


Now we can get to the requested result by creating custom columns with CASE and aggregating them:

WITH temp_ AS (
    SELECT
        EXTRACT(DAY FROM d.date) AS day_,
        COUNT(r.rental_id) AS n_rentals
    FROM dates d
    LEFT JOIN rental r
        ON DATE(r.rental_ts) = d.date
    WHERE EXTRACT(MONTH FROM d.date) = 5
      AND EXTRACT(YEAR FROM d.date) = 2020
    GROUP BY 1
)
SELECT
    SUM(CASE WHEN n_rentals > 100 THEN 1 ELSE 0 END) AS good_days,
    SUM(CASE WHEN n_rentals <= 100 THEN 1 ELSE 0 END) AS bad_days
FROM temp_

For the good_days column, for example, PostgreSQL goes through the temp_ CTE row by row and checks whether the value of the n_rentals column is greater than 100; if it is, the CASE expression yields 1 for that row (and 0 otherwise). A simple SUM over that custom column then gives us the number of good days in May 2020. We do the same for bad days and arrive at the final table:
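As an aside, PostgreSQL's FILTER clause on aggregates expresses the same idea a bit more directly. This sketch is equivalent to the CASE version above, assuming the same temp_ CTE:

SELECT
    -- Count only the rows that satisfy each condition
    COUNT(*) FILTER (WHERE n_rentals > 100) AS good_days,
    COUNT(*) FILTER (WHERE n_rentals <= 100) AS bad_days
FROM temp_;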

SQLPad.io hard question 1

We will answer the following SQLPad.io question in this article:


First, we need a table with all the rentals for each customer, the day of each rental, and the difference in days between consecutive rentals (with the requested filters on the days and the month). We can build that query like this:

SELECT
    customer_id,
    EXTRACT(DAY FROM rental_ts) AS day_,
    -- This is the key part, where we get the number of days that have passed
    -- since the last day that the same customer rented a movie
    EXTRACT(DAY FROM rental_ts) - LAG(EXTRACT(DAY FROM rental_ts), 1)
        -- We need to partition by customer so we only calculate the time
        -- differences for the same customer
        OVER (PARTITION BY customer_id ORDER BY EXTRACT(DAY FROM rental_ts)) AS day_diff
FROM rental
-- Filtering conditions as requested
WHERE EXTRACT(DAY FROM rental_ts) BETWEEN 24 AND 31
  AND EXTRACT(MONTH FROM rental_ts) = 5
ORDER BY 1, 2

The most important part of this query is the day_diff column, so let’s analyze it in detail. The LAG window function gives us access to the value of an expression on the previous row. The OVER() clause tells PostgreSQL what "previous row" means here: partition the data by customer_id and order each partition by the day of the month in ascending order. That way, for each rental we calculate the difference in days from the same customer’s previous rental in chronological order.
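To see LAG and OVER() in isolation, here is a toy example over inline values (purely illustrative, unrelated to the rental table):

SELECT
    customer_id,
    day_,
    LAG(day_) OVER (PARTITION BY customer_id ORDER BY day_) AS prev_day
FROM (VALUES (1, 24), (1, 25), (1, 28), (2, 26), (2, 27)) AS v(customer_id, day_);
-- Customer 1: prev_day is NULL, 24, 25; customer 2: prev_day is NULL, 26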

Let’s see a sample of what the previous query would return:


As we can see, the day_diff column is NULL for each customer’s first rental date, since there is no previous row to compute the difference from.

Now we can easily determine whether a customer is a ‘Happy’ one: they need at least one row in the previous table where day_diff equals 1, which means they rented at least one movie on two consecutive days.

This is the final query:

SELECT
    -- We need DISTINCT so we don't count repeated customers, as it's possible
    -- that some of them have more than one two-day streak within the specified
    -- period (24th to 31st of May)
    COUNT(DISTINCT customer_id)
FROM (
    SELECT
        customer_id,
        EXTRACT(DAY FROM rental_ts) AS day_,
        -- This is the key part, where we get the number of days that have passed
        -- since the last day that the same customer rented a movie
        EXTRACT(DAY FROM rental_ts) - LAG(EXTRACT(DAY FROM rental_ts), 1)
            -- We need to partition by customer so we only calculate the time
            -- differences for the same customer
            OVER (PARTITION BY customer_id ORDER BY EXTRACT(DAY FROM rental_ts)) AS day_diff
    FROM rental
    -- Filtering conditions as requested
    WHERE EXTRACT(DAY FROM rental_ts) BETWEEN 24 AND 31
      AND EXTRACT(MONTH FROM rental_ts) = 5
) AS temp
-- We are only interested in consecutive-day rentals, which is when day_diff equals 1
WHERE day_diff = 1

As discussed earlier, the WHERE clause filters the table to keep only the customers who rented movies on consecutive days. Since one customer can have more than one consecutive rental within the defined period (May 24th to May 31st), we need to count the distinct values of the customer_id column. That gives us the requested result: 159 happy customers!
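One caveat: because day_diff is computed from the day of the month, this approach only works while the window stays inside a single month, as it does here. Here is a sketch of a variant that subtracts full dates instead, which would also survive month boundaries (the explicit May 2020 date literals are our assumption, matching the dataset used in the previous question):

SELECT COUNT(DISTINCT customer_id)
FROM (
    SELECT
        customer_id,
        -- Casting to DATE lets us subtract calendar days directly, so the
        -- difference stays correct even across a month boundary
        rental_ts::DATE - LAG(rental_ts::DATE, 1)
            OVER (PARTITION BY customer_id ORDER BY rental_ts::DATE) AS day_diff
    FROM rental
    WHERE rental_ts::DATE BETWEEN DATE '2020-05-24' AND DATE '2020-05-31'
) AS temp
WHERE day_diff = 1;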

Google PostgreSQL interview question

The question that will be answered in this article is the following (it can be found here):

This is the table that we are given to start with:

We start by writing the query that will become a Common Table Expression (CTE) in the final solution:

SELECT 
    UNNEST(ARRAY['bull', 'bear']) AS word,
    UNNEST(ARRAY[
                 (LENGTH(contents) - LENGTH(REPLACE(contents, ' bull ', ''))) / LENGTH(' bull '),
                 (LENGTH(contents) - LENGTH(REPLACE(contents, ' bear ', ''))) / LENGTH(' bear ')
                 ]
                 ) AS nentry
FROM google_file_store

The approach to counting the number of occurrences of each word in each file is the following: by subtracting the length of the text with the target word removed from the length of the entire text, we get the difference in the number of characters (this difference is 0 if the word does not appear). Dividing that difference by the length of the word itself then gives us the number of times it appears in the text. For example, if contents is 100 characters long and removing every ' bull ' leaves 88 characters, the word occurs (100 - 88) / 6 = 2 times.

We use the ARRAY constructor to store the number of occurrences of both words in each file, row by row. That allows us to use the UNNEST function to get one row per word per file (with its number of occurrences). This is what we would get if we executed the previous code:

And the expected output of the question is the following:

We can get to that final result by simply grouping by the word column of the previous CTE. This is the full SQL query:

-- Creating a CTE with an array of the words and an array of the number of
-- occurrences of each word in each file
WITH occurrences AS(
    SELECT 
    UNNEST(ARRAY['bull', 'bear']) AS word,
    UNNEST(ARRAY[
                 (LENGTH(contents) - LENGTH(REPLACE(contents, ' bull ', ''))) / LENGTH(' bull '),
                 (LENGTH(contents) - LENGTH(REPLACE(contents, ' bear ', ''))) / LENGTH(' bear ')
                 ]
                 ) AS nentry
    FROM google_file_store
)
-- Obtaining the requested table
SELECT
word,
SUM(nentry) AS nentry
FROM occurrences
GROUP BY 1
ORDER BY 2 DESC;
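Note that padding the pattern with spaces (' bull ') misses occurrences at the very start or end of contents, or next to punctuation. As a hedged alternative of our own (not part of the original solution), PostgreSQL's word-boundary regex markers \m and \M can count whole-word matches instead:

SELECT
    w.word,
    -- regexp_matches with the 'g' flag returns one row per match,
    -- so counting those rows counts the occurrences
    SUM((SELECT COUNT(*)
         FROM regexp_matches(f.contents, '\m' || w.word || '\M', 'g'))) AS nentry
FROM google_file_store f
CROSS JOIN (VALUES ('bull'), ('bear')) AS w(word)
GROUP BY w.word
ORDER BY nentry DESC;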

Amazon SQL interview question

This is a very short project in which I answer a PostgreSQL question that appeared in an Amazon interview. The question can be found here and reads like this:

First, we will build a Common Table Expression (CTE) that will make it easy to calculate the final rolling average. This is the SQL code for that CTE:

WITH monthly_amt AS (
    SELECT
        TO_CHAR(created_at, 'YYYY-MM') AS year_month,
        SUM(purchase_amt) AS total_amt
    FROM amazon_purchases
    WHERE purchase_amt > 0
    GROUP BY 1
    ORDER BY 1
)

With this CTE, we extract the year and month from the created_at column in the required format and obtain the total monthly revenue (grouping by the newly created year_month column and summing purchase_amt). The WHERE clause makes sure that we don't include any returns (recorded as negative purchase amounts), as specified in the question.

This is the temporary table that is generated with this CTE:

Great! Now we can take a look at the full query, which will give us the final result:

WITH monthly_amt AS (
    SELECT
        TO_CHAR(created_at, 'YYYY-MM') AS year_month,
        SUM(purchase_amt) AS total_amt
    FROM amazon_purchases
    WHERE purchase_amt > 0
    GROUP BY 1
    ORDER BY 1
)
SELECT
    year_month,
    AVG(total_amt) OVER (ORDER BY year_month
                         ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rolling_avg
FROM monthly_amt

With the use of a window function on the previous CTE, we can obtain the required three-month rolling average. The window function calculates the average over the current row and the previous two rows:


We have to keep in mind that, for the first two rows, it will not be a true 3-month rolling average, since fewer than 3 months are included (only 1 for the first row and 2 for the second).
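If we wanted to suppress those incomplete windows, one option (a sketch of our own, reusing the monthly_amt CTE from above) is to count how many rows fall inside each frame and keep only the full ones:

SELECT year_month, rolling_avg
FROM (
    SELECT
        year_month,
        AVG(total_amt) OVER w AS rolling_avg,
        -- How many months actually fall inside this frame (1, 2 or 3)
        COUNT(total_amt) OVER w AS months_in_frame
    FROM monthly_amt
    WINDOW w AS (ORDER BY year_month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
) AS t
WHERE months_in_frame = 3;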
