SQLPad.io hard question 1

We will answer the following SQLPad.io question in this article:


First, we need to have a table with all the rentals for each customer, the date of those rentals, and the date difference between them (with the corresponding filters on the days and the month). We can build that query like this:

SELECT
  customer_id,
  EXTRACT(DAY from rental_ts) AS day_,
  -- This is the key part, where we get the number of days that have passed since the last day that
  -- the same customer rented a movie
  EXTRACT(DAY from rental_ts)- LAG(EXTRACT(DAY from rental_ts), 1) 
     -- We need to partition by customer so we only calculate the time differences
     -- for the same customer
     OVER (PARTITION BY customer_id ORDER BY EXTRACT(DAY from rental_ts)) AS day_diff
 FROM rental
 -- Filtering conditions as requested
 WHERE EXTRACT(DAY from rental_ts) BETWEEN 24 AND 31
 AND EXTRACT(MONTH from rental_ts) = 5
 ORDER BY 1, 2

The most important part of this query is the day_diff column, so let’s analyze it in detail. The LAG window function allows us to access the value of the rental_ts column for the previous row. The OVER() clause instructs PostgreSQL what the previous row should be; partition the data by customer_id and order it by the day of the month in ascending order. That way, for each customer, we will calculate the date difference (in days) for each rental and the previous rental (this is where the OVER() clause defines what the previous rental is; it is the previous rental in chronological order for the current customer that is being iterated).

Let’s see a sample of what the previous query would return:


As we can see, the day_diff column is NULL for the first available date for each customer, since we don’t have a previous row to calculate the difference.

Now we can easily determine if a customer is a ‘Happy’ one or not; it has to have at least one row in the previous table where the value for day_diff is 1. That would mean that the customer rented at least one movie in two consecutive days.

This is the final query:

SELECT
  -- We need to use distinct so we don't count repeated customers, as it's possible that some
  -- of them have more than one two-day streaks within the specified period (24th to 31st of May)
COUNT(DISTINCT(customer_id))
FROM (
  SELECT
  customer_id,
  EXTRACT(DAY from rental_ts) AS day_,
  -- This is the key part, where we get the number of days that have passed since the last day that
  -- the same customer rented a movie
  EXTRACT(DAY from rental_ts)- LAG(EXTRACT(DAY from rental_ts), 1) 
     -- We need to partition by customer so we only calculate the time differences
     -- for the same customer
     OVER (PARTITION BY customer_id ORDER BY EXTRACT(DAY from rental_ts)) AS day_diff
  FROM rental
  -- Filtering conditions as requested
  WHERE EXTRACT(DAY from rental_ts) BETWEEN 24 AND 31
  AND EXTRACT(MONTH from rental_ts) = 5
  ORDER BY 1, 2
  ) AS temp
-- We are only interested in consecutive days rentals, which is when day_diff equals 1
WHERE day_diff = 1

As discussed earlier, the WHERE clause filters the table to keep only the customers who rented movies on consecutive days. Since one customer can have more than one consecutive rental during the defined period (between May 24th and May 31st), we need to count the distinct values of the customer_id column. That gives us the requested result; 159 happy customers!

Deja un comentario

Blog de WordPress.com.

Subir ↑

Diseña un sitio como este con WordPress.com
Comenzar