SQL (Danny's Diner)

The document describes a dataset containing sales data for a restaurant called Danny's Diner. It includes tables for sales, menu, and members. Questions are provided to analyze the data, including calculating total spending by customer, popular menu items, and earning points for purchases. Solutions in SQL are presented to address the questions.


Data Set: https://8weeksqlchallenge.com/case-study-1/

CREATE SCHEMA dannys_diner;


SET search_path = dannys_diner;

CREATE TABLE sales (
  "customer_id" VARCHAR(1),
  "order_date" DATE,
  "product_id" INTEGER
);

INSERT INTO sales
  ("customer_id", "order_date", "product_id")
VALUES
('A', '2021-01-01', '1'),
('A', '2021-01-01', '2'),
('A', '2021-01-07', '2'),
('A', '2021-01-10', '3'),
('A', '2021-01-11', '3'),
('A', '2021-01-11', '3'),
('B', '2021-01-01', '2'),
('B', '2021-01-02', '2'),
('B', '2021-01-04', '1'),
('B', '2021-01-11', '1'),
('B', '2021-01-16', '3'),
('B', '2021-02-01', '3'),
('C', '2021-01-01', '3'),
('C', '2021-01-01', '3'),
('C', '2021-01-07', '3');

CREATE TABLE menu (
  "product_id" INTEGER,
  "product_name" VARCHAR(5),
  "price" INTEGER
);

INSERT INTO menu
  ("product_id", "product_name", "price")
VALUES
('1', 'sushi', '10'),
('2', 'curry', '15'),
('3', 'ramen', '12');

CREATE TABLE members (
  "customer_id" VARCHAR(1),
  "join_date" DATE
);

INSERT INTO members
  ("customer_id", "join_date")
VALUES
('A', '2021-01-07'),
('B', '2021-01-09');
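The schema above can be loaded into any SQL engine to experiment with. As a quick sanity check, here is a runnable sketch using Python's built-in sqlite3 module (a stand-in for the PostgreSQL setup above) that loads the same rows and answers question 1, the total amount each customer spent:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same tables and rows as the Danny's Diner schema above
conn.executescript("""
CREATE TABLE sales (customer_id TEXT, order_date DATE, product_id INTEGER);
INSERT INTO sales VALUES
  ('A','2021-01-01',1),('A','2021-01-01',2),('A','2021-01-07',2),
  ('A','2021-01-10',3),('A','2021-01-11',3),('A','2021-01-11',3),
  ('B','2021-01-01',2),('B','2021-01-02',2),('B','2021-01-04',1),
  ('B','2021-01-11',1),('B','2021-01-16',3),('B','2021-02-01',3),
  ('C','2021-01-01',3),('C','2021-01-01',3),('C','2021-01-07',3);
CREATE TABLE menu (product_id INTEGER, product_name TEXT, price INTEGER);
INSERT INTO menu VALUES (1,'sushi',10),(2,'curry',15),(3,'ramen',12);
""")

# Q1: total amount each customer spent at the restaurant
rows = conn.execute("""
    SELECT s.customer_id, SUM(m.price) AS total_spent
    FROM sales s
    JOIN menu m ON s.product_id = m.product_id
    GROUP BY s.customer_id
    ORDER BY s.customer_id
""").fetchall()
print(rows)  # [('A', 76), ('B', 74), ('C', 36)]
```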
Questions

1. What is the total amount each customer spent at the restaurant?
2. How many days has each customer visited the restaurant?
3. What was the first item from the menu purchased by each customer?
4. What is the most purchased item on the menu and how many times was it purchased by all customers?
5. Which item was the most popular for each customer?
6. Which item was purchased first by the customer after they became a member?
7. Which item was purchased just before the customer became a member?
8. What are the total items and amount spent for each member before they became a member?
9. If each $1 spent equates to 10 points and sushi has a 2x points multiplier - how many points would each customer have?
10. In the first week after a customer joins the program (including their join date) they earn 2x points on all items, not just sushi - how many points do customer A and B have at the end of January?

Solutions

/* Q8: What are the total items and amount spent
   for each member before they became a member? */
WITH memberdata AS (
  SELECT a.customer_id, a.order_date, b.join_date, c.price, c.product_name
  FROM sales a
  LEFT JOIN members b ON a.customer_id = b.customer_id
  JOIN menu c ON a.product_id = c.product_id
  WHERE a.order_date < b.join_date
)
SELECT customer_id,
       COUNT(*) AS total_items,
       SUM(price) AS total_spent
FROM memberdata
GROUP BY customer_id;

/* Q9: If each $1 spent equates to 10 points and sushi has a
   2x points multiplier - how many points would each customer have? */
WITH points AS (
  SELECT a.customer_id, a.order_date, c.product_name, c.price,
         CASE WHEN c.product_name = 'sushi' THEN 2 * c.price
              ELSE c.price END AS newprice
  FROM sales a
  JOIN menu c ON a.product_id = c.product_id
)
SELECT customer_id, SUM(newprice) * 10 AS total_points
FROM points
GROUP BY customer_id;

/* Q10: In the first week after a customer joins (including the join
   date) they earn 2x points on all items - points for customers A
   and B at the end of January. */
WITH finalpoints AS (
  SELECT a.customer_id, a.order_date, c.product_name, c.price,
         CASE WHEN c.product_name = 'sushi' THEN 2 * c.price
              WHEN a.order_date BETWEEN b.join_date
                   AND b.join_date + INTERVAL '6 days' THEN 2 * c.price
              ELSE c.price END AS newprice
  FROM sales a
  JOIN menu c ON a.product_id = c.product_id
  JOIN members b ON a.customer_id = b.customer_id
  WHERE a.order_date <= '2021-01-31'
)
SELECT customer_id, SUM(newprice) * 10 AS total_points
FROM finalpoints
GROUP BY customer_id;

/* Bonus: flag each purchase as made by a member ('Y') or not ('N') */
SELECT a.customer_id, a.order_date, c.product_name, c.price,
       CASE WHEN b.join_date IS NULL THEN 'N'
            WHEN a.order_date < b.join_date THEN 'N'
            ELSE 'Y' END AS member
FROM sales a
JOIN menu c ON a.product_id = c.product_id
LEFT JOIN members b ON a.customer_id = b.customer_id;

/* Bonus: rank member purchases; non-member purchases get NULL */
WITH ranking AS (
  SELECT a.customer_id, a.order_date, c.product_name, c.price,
         CASE WHEN b.join_date IS NULL THEN 'N'
              WHEN a.order_date < b.join_date THEN 'N'
              ELSE 'Y' END AS member
  FROM sales a
  JOIN menu c ON a.product_id = c.product_id
  LEFT JOIN members b ON a.customer_id = b.customer_id
)
SELECT *,
       CASE WHEN member = 'N' THEN NULL
            ELSE RANK() OVER (PARTITION BY customer_id, member
                              ORDER BY order_date)
       END AS rankcalc
FROM ranking;
This is the same question as problem #32 in the SQL Chapter of Ace the Data Science Interview!

Assume you're given a table containing information about Wayfair user transactions for different
products. Write a query to calculate the year-on-year growth rate for the total spend of each
product, grouping the results by product ID.

The output should include the year in ascending order, product ID, current year's spend, previous
year's spend and year-on-year growth percentage, rounded to 2 decimal places.

user_transactions Table:

Column Name       Type
transaction_id    integer
product_id        integer
spend             decimal
transaction_date  datetime

user_transactions Example Input:

transaction_id  product_id  spend    transaction_date
1341            123424      1500.60  12/31/2019 12:00:00
1423            123424      1000.20  12/31/2020 12:00:00
1623            123424      1246.44  12/31/2021 12:00:00
1322            123424      2145.32  12/31/2022 12:00:00

Example Output:

year product_id curr_year_spend prev_year_spend yoy_rate

2019 123424 1500.60 NULL NULL

2020 123424 1000.20 1500.60 -33.35

2021 123424 1246.44 1000.20 24.62

2022 123424 2145.32 1246.44 72.12

Step 1: Summarize Yearly Spend


First, we need to summarize the user transactions table to obtain the yearly spend information for
each product. We'll use the EXTRACT() function on the transaction date to extract the year and
select the product ID and spend for each year.

Here's the query for this step:

SELECT
EXTRACT(YEAR FROM transaction_date) AS yr,
product_id,
spend AS curr_year_spend
FROM user_transactions;
This query will generate a table that shows the product ID, the year, and the spend for each product
in each year.

Here's the yearly spend for product ID 234412:


year product_id curr_year_spend

2019 234412 1800.00

2020 234412 1234.00

2021 234412 889.50

2022 234412 2900.00

Step 2: Calculate Prior Year's Spend

Next, we'll calculate the previous year's spend for each product using the LAG() window function. The LAG function allows us to access the spend of the previous year based on the product ID.

SELECT
EXTRACT(YEAR FROM transaction_date) AS yr,
product_id,
spend AS curr_year_spend,
LAG(spend) OVER (
PARTITION BY product_id
ORDER BY
product_id,
EXTRACT(YEAR FROM transaction_date)) AS prev_year_spend
FROM user_transactions;
Here's an example output table for product ID 234412:

year product_id curr_year_spend prev_year_spend

2019 234412 1800.00 NULL

2020 234412 1234.00 1800.00

2021 234412 889.50 1234.00

2022 234412 2900.00 889.50


In this table, you can see that the previous year's spend is displayed in
the prev_year_spend column, which is populated based on the product ID and the order of the
years.

For example, in the year 2020, the previous year's spend is 1800.00, which is the spend for the year
2019. Similarly, in the year 2021, the previous year's spend is 1234.00, which is the spend for the year
2020.

Step 3: Calculate Year-on-Year Growth Rate

In the final step, we'll wrap the query from Step 2 in a CTE called yearly_spend_cte .

Within this CTE, we'll apply the year-on-year (y-o-y) growth rate formula to calculate the growth rate
between the current year's spend and the previous year's spend.

Year-on-Year Growth Rate = ((Current Year's Spend - Previous Year’s Spend) / Previous Year’s
Spend) x 100

We'll also round the growth rate to 2 decimal places.

WITH yearly_spend_cte AS (
SELECT
EXTRACT(YEAR FROM transaction_date) AS yr,
product_id,
spend AS curr_year_spend,
LAG(spend) OVER (
PARTITION BY product_id
ORDER BY
product_id,
EXTRACT(YEAR FROM transaction_date)) AS prev_year_spend
FROM user_transactions
)

SELECT
yr,
product_id,
curr_year_spend,
prev_year_spend,
ROUND(100 *
(curr_year_spend - prev_year_spend)
/ prev_year_spend
, 2) AS yoy_rate
FROM yearly_spend_cte;
Here's the final output table for product id 234412:

year product_id curr_year_spend prev_year_spend yoy_rate

2019 234412 1800.00 NULL NULL

2020 234412 1234.00 1800.00 -31.44

2021 234412 889.50 1234.00 -27.92

2022 234412 2900.00 889.50 226.03
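To double-check the arithmetic, here is a runnable sketch of the same pipeline using Python's built-in sqlite3 (SQLite has no EXTRACT, so strftime('%Y', ...) stands in for it; the rows are copied from the example input above, with dates in ISO format):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE user_transactions (
    transaction_id INT, product_id INT, spend REAL, transaction_date TEXT)""")
conn.executemany("INSERT INTO user_transactions VALUES (?,?,?,?)", [
    (1341, 123424, 1500.60, "2019-12-31"),
    (1423, 123424, 1000.20, "2020-12-31"),
    (1623, 123424, 1246.44, "2021-12-31"),
    (1322, 123424, 2145.32, "2022-12-31"),
])

# Same shape as the final query: LAG to fetch the prior year's spend,
# then the year-on-year growth formula rounded to 2 decimal places
rows = conn.execute("""
    WITH yearly_spend_cte AS (
        SELECT CAST(strftime('%Y', transaction_date) AS INTEGER) AS yr,
               product_id,
               spend AS curr_year_spend,
               LAG(spend) OVER (PARTITION BY product_id
                                ORDER BY transaction_date) AS prev_year_spend
        FROM user_transactions
    )
    SELECT yr,
           ROUND(100.0 * (curr_year_spend - prev_year_spend)
                 / prev_year_spend, 2) AS yoy_rate
    FROM yearly_spend_cte
    ORDER BY yr
""").fetchall()
print(rows)  # [(2019, None), (2020, -33.35), (2021, 24.62), (2022, 72.12)]
```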

Sometimes, payment transactions are repeated by accident; it could be due to user error, API failure
or a retry error that causes a credit card to be charged twice.

Using the transactions table, identify any payments made at the same merchant with the same credit
card for the same amount within 10 minutes of each other. Count such repeated payments.

Assumptions:

 The first transaction of such payments should not be counted as a repeated payment. This
means, if there are two transactions performed by a merchant with the same credit card and
for the same amount within 10 minutes, there will only be 1 repeated payment.

transactions Table:

Column Name            Type
transaction_id         integer
merchant_id            integer
credit_card_id         integer
amount                 integer
transaction_timestamp  datetime

transactions Example Input:

transaction_id  merchant_id  credit_card_id  amount  transaction_timestamp
1               101          1               100     09/25/2022 12:00:00
2               101          1               100     09/25/2022 12:08:00
3               101          1               100     09/25/2022 12:28:00
4               102          2               300     09/25/2022 12:00:00
6               102          2               400     09/25/2022 14:00:00

Example Output:

payment_count
1

Explanation
Within 10 minutes after Transaction 1, Transaction 2 is conducted at Merchant 1 using the same
credit card for the same amount. This is the only instance of repeated payment in the given sample
data.

Since Transaction 3 occurs 20 minutes after Transaction 2 and 28 minutes after Transaction 1, it does not meet the repeated payments' conditions. Transactions 4 and 6 have different amounts, so they do not qualify either.

Solution
Here are the steps to solve this question:

1. Bring together the transactions that are a part of the same group.
2. Work out the time difference between the groups' successive transactions.
3. Identify the identical transactions that took place within ten minutes of one another.

Step 1

For each transaction present in the transactions table, we will obtain the time of the most recent identical transaction. This is possible with the LAG window function, which accesses the specified field's values from the previous rows.

Run this query and we will explain more below.

SELECT
  transaction_id,
  merchant_id,
  credit_card_id,
  amount,
  transaction_timestamp,
  LAG(transaction_timestamp) OVER (
    PARTITION BY merchant_id, credit_card_id, amount
    ORDER BY transaction_timestamp
  ) AS previous_transaction
FROM transactions;
Let's interpret the LAG function:

 The PARTITION BY clause separates the result's rows by the unique merchant ID, credit card ID, and amount.
 The LAG function is applied separately to each partition, and the computation restarts for each partition (merchant ID, credit card ID, and amount).
 Rows in each partition are sorted by transaction_timestamp in the ORDER BY clause.

Showing 4 transactions worth $100 performed at Merchant ID 101 with credit card ID 1:

transaction_id  merchant_id  credit_card_id  amount  transaction_timestamp  previous_transaction
1               101          1               100     09/25/2022 12:00:00
2               101          1               100     09/25/2022 12:08:00    09/25/2022 12:00:00
3               101          1               100     09/25/2022 12:28:00    09/25/2022 12:08:00
5               101          1               100     09/25/2022 13:37:00    09/25/2022 12:28:00

 Transaction ID 1 was the first of the series of payments, therefore it doesn't have
a previous_transaction value from earlier transactions.
 Transaction ID 2 has a previous_transaction of 09/25/2022 12:00:00 which is pulled from the
transaction in transaction ID 1.
Can you follow the pattern of the records in the previous_transaction ?

Step 2

Next, we should evaluate the difference in time between two consecutive identical transactions. We
can simply subtract the previous_transaction values from the transaction_timestamp values
as we now have the previous transaction time for each completed payment.

The following statement can be easily incorporated into the SELECT clause of the previous query.

transaction_timestamp - previous_transaction AS time_difference -- (obtained from the LAG function)
transaction_id  merchant_id  credit_card_id  amount  transaction_timestamp  previous_transaction  time_difference
1               101          1               100     09/25/2022 12:00:00
2               101          1               100     09/25/2022 12:08:00    09/25/2022 12:00:00   "minutes":8
3               101          1               100     09/25/2022 12:28:00    09/25/2022 12:08:00   "minutes":20
5               101          1               100     09/25/2022 13:37:00    09/25/2022 12:28:00   "hours":1,"minutes":9

The time_difference field makes it clear that there is an 8-minute lag between the first and
second transactions. We will now convert this field into minutes format so that we can easily filter
them.

One of the best methods to achieve that is to use the EXTRACT function.

SELECT
merchant_id,
credit_card_id,
amount,
transaction_timestamp,
EXTRACT(EPOCH FROM transaction_timestamp -
LAG(transaction_timestamp) OVER(
PARTITION BY merchant_id, credit_card_id, amount
ORDER BY transaction_timestamp)
)/60 AS minute_difference
FROM transactions;

 EPOCH calculates the total number of seconds in a given interval.


 To calculate the difference in minutes, we divide these seconds by 60 (1 minute = 60
seconds). For example, the time interval for transaction id 5 is 1 hour and 9
minutes. EPOCH calculates its value as 4140 seconds. By dividing it by 60, we arrive at 69
minutes.

This is how the outcome looks:

merchant_id  credit_card_id  amount  transaction_timestamp  minute_difference
101          1               100     09/25/2022 12:00:00
101          1               100     09/25/2022 12:08:00    8
101          1               100     09/25/2022 12:28:00    20
101          1               100     09/25/2022 13:37:00    69

Step 3

The last thing we need to do is to gather all the identical transactions which occurred within a 10-
minute window.

To do that, we must first convert the query into a common table expression (CTE). Then, we will filter the records using the WHERE clause to keep transactions that occurred within a 10-minute window.

Additionally, the COUNT function allows us to count the filtered records.

WITH payments AS (
-- enter the previous query here
)

SELECT COUNT(merchant_id) AS payment_count


FROM payments
WHERE minute_difference <= 10;
Output for merchant ID 101:

payment_count
1

Now, we will eliminate the unnecessary columns from the SELECT clause since they were only there
for exploration purposes.

That's it! We have our final query :)

WITH payments AS (
SELECT
merchant_id,
EXTRACT(EPOCH FROM transaction_timestamp -
LAG(transaction_timestamp) OVER(
PARTITION BY merchant_id, credit_card_id, amount
ORDER BY transaction_timestamp)
)/60 AS minute_difference
FROM transactions)

SELECT COUNT(merchant_id) AS payment_count


FROM payments
WHERE minute_difference <= 10;
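The final query can be reproduced outside PostgreSQL as a sanity check. Here is a runnable sketch using Python's sqlite3 (SQLite lacks EXTRACT(EPOCH ...), so the minute gap is computed from julianday(), which returns days; the rows mirror the example input above in ISO timestamps):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE transactions (
    transaction_id INT, merchant_id INT, credit_card_id INT,
    amount INT, transaction_timestamp TEXT)""")
conn.executemany("INSERT INTO transactions VALUES (?,?,?,?,?)", [
    (1, 101, 1, 100, "2022-09-25 12:00:00"),
    (2, 101, 1, 100, "2022-09-25 12:08:00"),
    (3, 101, 1, 100, "2022-09-25 12:28:00"),
    (4, 102, 2, 300, "2022-09-25 12:00:00"),
    (6, 102, 2, 400, "2022-09-25 14:00:00"),
])

# julianday() returns days, so * 24 * 60 converts the gap to minutes
repeated = conn.execute("""
    WITH payments AS (
        SELECT (julianday(transaction_timestamp)
                - julianday(LAG(transaction_timestamp) OVER (
                      PARTITION BY merchant_id, credit_card_id, amount
                      ORDER BY transaction_timestamp))) * 24 * 60
               AS minute_difference
        FROM transactions
    )
    SELECT COUNT(*) FROM payments WHERE minute_difference <= 10
""").fetchone()[0]
print(repeated)  # 1
```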
Assume you're given a table containing information on Facebook user actions. Write a query to
obtain number of monthly active users (MAUs) in July 2022, including the month in numerical format
"1, 2, 3".

Hint:

 An active user is defined as a user who has performed actions such as 'sign-in', 'like', or
'comment' in both the current month and the previous month.

user_actions Table:

Column Name  Type
user_id      integer
event_id     integer
event_type   string ("sign-in", "like", "comment")
event_date   datetime

user_actions Example Input:

user_id event_id event_type event_date


445 7765 sign-in 05/31/2022 12:00:00
742 6458 sign-in 06/03/2022 12:00:00
445 3634 like 06/05/2022 12:00:00
742 1374 comment 06/05/2022 12:00:00
648 3124 like 06/18/2022 12:00:00

Example Output for June 2022:

month monthly_active_users
6 1

Example

In June 2022, there was only one monthly active user (MAU) with the user_id 445.

Please note that the output provided is for June 2022 as the user_actions table only contains
event dates for that month. You should adapt the solution accordingly for July 2022.

The dataset you are querying against may have different input & output - this is just an example!

Solution
In order to calculate the active user retention, we need to identify users who were active in both the
current month and the previous month. We can approach this in two steps:

Step 1: Find users who were active in the previous month


To accomplish this, we can use the following code:

SELECT last_month.user_id
FROM user_actions AS last_month
WHERE last_month.user_id = curr_month.user_id
  AND EXTRACT(MONTH FROM last_month.event_date) =
      EXTRACT(MONTH FROM curr_month.event_date - interval '1 month')

 We alias the user_actions table as last_month in the FROM clause to indicate that this
table contains information from the previous month.
 In the WHERE clause, we match users in the last_month table to the curr_month table
based on user ID.
 EXTRACT(MONTH FROM last_month.event_date) = EXTRACT(MONTH FROM
curr_month.event_date - interval '1 month') means that a particular user from the
last month exists in the current month, indicating that this user was active in both months.
 - interval '1 month' subtracts one month from the current month's date to give us last
month's date.

Important: Please note that you won't be able to run this query on its own, as it references another table curr_month which is not included in this query. We will provide you with the complete query to run later. For now, focus on understanding the logic behind the solution.

Step 2: Find users who are active in the current month and also existed in
the previous month

To accomplish this, we use the EXISTS operator to check for users in the current month who also
exist in the subquery, which represents active users in the previous month (identified in Step 1). Note
that the user_actions table is aliased as curr_month to indicate that this represents the current
month's user information.

To extract the month information from the event_date column, we use the EXTRACT function.
Then, we use a COUNT DISTINCT over the user_id to obtain the count of monthly active users for
the current month.

Here's the complete query to run in the editor:

SELECT
EXTRACT(MONTH FROM curr_month.event_date) AS mth,
COUNT(DISTINCT curr_month.user_id) AS monthly_active_users
FROM user_actions AS curr_month
WHERE EXISTS (
SELECT last_month.user_id
FROM user_actions AS last_month
WHERE last_month.user_id = curr_month.user_id
AND EXTRACT(MONTH FROM last_month.event_date) =
EXTRACT(MONTH FROM curr_month.event_date - interval '1 month')
)
AND EXTRACT(MONTH FROM curr_month.event_date) = 7
AND EXTRACT(YEAR FROM curr_month.event_date) = 2022
GROUP BY EXTRACT(MONTH FROM curr_month.event_date);
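The same correlated-EXISTS pattern can be exercised against the June sample data. Here is a runnable sketch using Python's sqlite3 (SQLite has no EXTRACT or interval arithmetic, so strftime('%Y-%m', ...) comparisons stand in for the month logic; June is checked against May, matching the example output):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE user_actions (
    user_id INT, event_id INT, event_type TEXT, event_date TEXT)""")
conn.executemany("INSERT INTO user_actions VALUES (?,?,?,?)", [
    (445, 7765, 'sign-in', '2022-05-31'),
    (742, 6458, 'sign-in', '2022-06-03'),
    (445, 3634, 'like',    '2022-06-05'),
    (742, 1374, 'comment', '2022-06-05'),
    (648, 3124, 'like',    '2022-06-18'),
])

# MAU for June 2022: users active this month AND the month before
mau = conn.execute("""
    SELECT CAST(strftime('%m', curr.event_date) AS INTEGER) AS mth,
           COUNT(DISTINCT curr.user_id) AS monthly_active_users
    FROM user_actions AS curr
    WHERE strftime('%Y-%m', curr.event_date) = '2022-06'
      AND EXISTS (
          SELECT 1
          FROM user_actions AS prev
          WHERE prev.user_id = curr.user_id
            AND strftime('%Y-%m', prev.event_date) = '2022-05')
    GROUP BY mth
""").fetchone()
print(mau)  # (6, 1)
```

Only user 445 appears in both May and June, which matches the example output for June 2022.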
Amazon wants to maximize the number of items it can stock in a 500,000 square feet warehouse. It
wants to stock as many prime items as possible, and afterwards use the remaining square footage to
stock the most number of non-prime items.

Write a query to find the number of prime and non-prime items that can be stored in the 500,000
square feet warehouse. Output the item type with prime_eligible followed by not_prime and
the maximum number of items that can be stocked.

Effective April 3rd 2023, we added some new assumptions to the question to provide additional clarity.

Assumptions:

 Prime and non-prime items have to be stored in equal amounts, regardless of their size or
square footage. This implies that prime items will be stored separately from non-prime items
in their respective containers, but within each container, all items must be in the same
amount.
 Non-prime items must always be available in stock to meet customer demand, so the non-
prime item count should never be zero.
 Item count should be whole numbers (integers).

inventory table:

Column Name Type


item_id integer
item_type string
item_category string
square_footage decimal

inventory Example Input:

item_id  item_type       item_category      square_footage
1374     prime_eligible  mini refrigerator  68.00
4245     not_prime       standing lamp      26.40
2452     prime_eligible  television         85.00
3255     not_prime       side table         22.60
1672     prime_eligible  laptop             8.50

Example Output:

item_type       item_count
prime_eligible  9285
not_prime       6
The dataset you are querying against may have different input & output - this is just an example!

Step 1: Summarize square footage and item count by item type

SELECT
  item_type,
  SUM(square_footage) AS total_sqft,
  COUNT(*) AS item_count
FROM inventory
GROUP BY item_type;
This query will return the total square footage and the item count for each item type (prime and
non-prime).

item_type total_sqft item_count


prime_eligible 555.20 6
not_prime 128.50 4

Step 2: Determine the maximum number of prime items that can be stored in the 500,000 sq ft warehouse

Next, we need to figure out the maximum number of prime items that can be stored in the warehouse by dividing the warehouse's area by the total area of prime items and rounding down to the nearest integer.

WITH summary
AS (
SELECT
item_type,
SUM(square_footage) AS total_sqft,
COUNT(*) AS item_count
FROM inventory
GROUP BY item_type
)

SELECT
  item_type,
  total_sqft,
  FLOOR(500000/total_sqft) AS prime_item_combination_count,
  (FLOOR(500000/total_sqft) * item_count) AS prime_item_count
FROM summary
WHERE item_type = 'prime_eligible';

item_type       total_sqft  prime_item_combination_count  prime_item_count
prime_eligible  555.20      900                           5400

The result shows that the warehouse can stock 900 full sets of prime items, which can be mathematically expressed as:

900 sets x 555.20 sq ft = 499,680 sq ft occupied by prime items

The remaining warehouse space to stock non-prime items is 500,000 sq ft - 499,680 sq ft = 320 sq ft.

Step 3: Calculate the maximum number of non-prime items that can be stored

Finally, we need to calculate the maximum number of non-prime items that can fit into the warehouse space left over after the prime items.

(1) Maximum number of prime items

We calculated this in the previous step: 900 sets x 6 items = 5,400 prime items, occupying 499,680 sq ft.

(2) Maximum number of non-prime items

FLOOR((500,000 sq ft - 499,680 sq ft occupied by prime items) / 128.50 sq ft per set of non-prime items) = FLOOR(320 / 128.50) = 2 sets

2 sets x 4 items per set = 8 non-prime items

Output the results with prime items followed by non-prime items.
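The arithmetic above can be checked with a few lines of Python (the square-footage and item-count figures come from the summary table in Step 1):

```python
import math

WAREHOUSE = 500_000
prime_sqft, prime_items = 555.20, 6        # one full set of prime items
nonprime_sqft, nonprime_items = 128.50, 4  # one full set of non-prime items

prime_sets = math.floor(WAREHOUSE / prime_sqft)       # 900 full sets
prime_count = prime_sets * prime_items                # 900 * 6 = 5400 items
leftover = WAREHOUSE - prime_sets * prime_sqft        # ~320 sq ft remaining
nonprime_sets = math.floor(leftover / nonprime_sqft)  # 2 full sets fit
nonprime_count = nonprime_sets * nonprime_items       # 2 * 4 = 8 items
print(prime_count, nonprime_count)  # 5400 8
```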

Here is the final query to accomplish this:

WITH summary AS (
SELECT
item_type,
SUM(square_footage) AS total_sqft,
COUNT(*) AS item_count
FROM inventory
GROUP BY item_type
),
prime_occupied_area AS (
SELECT
item_type,
total_sqft,
FLOOR(500000/total_sqft) AS prime_item_combination_count,
(FLOOR(500000/total_sqft) * item_count) AS prime_item_count
FROM summary
WHERE item_type = 'prime_eligible'
)

SELECT
item_type,
CASE
WHEN item_type = 'prime_eligible'
THEN (FLOOR(500000/total_sqft) * item_count)
WHEN item_type = 'not_prime'
THEN FLOOR((500000 -
(SELECT FLOOR(500000/total_sqft) * total_sqft FROM prime_occupied_area))
/ total_sqft)
* item_count
END AS item_count
FROM summary
ORDER BY item_type DESC;
Final results:

item_type       item_count
prime_eligible  5400
not_prime       8
Method #2: Using FILTER and UNION ALL operator

WITH summary AS (
  SELECT
    SUM(square_footage) FILTER (WHERE item_type = 'prime_eligible') AS prime_sq_ft,
    COUNT(item_id) FILTER (WHERE item_type = 'prime_eligible') AS prime_item_count,
    SUM(square_footage) FILTER (WHERE item_type = 'not_prime') AS not_prime_sq_ft,
    COUNT(item_id) FILTER (WHERE item_type = 'not_prime') AS not_prime_item_count
  FROM inventory
),
prime_occupied_area AS (
SELECT
FLOOR(500000/prime_sq_ft)*prime_sq_ft AS max_prime_area
FROM summary
)

SELECT
'prime_eligible' AS item_type,
FLOOR(500000/prime_sq_ft)*prime_item_count AS item_count
FROM summary

UNION ALL

SELECT
'not_prime' AS item_type,
FLOOR((500000-(SELECT max_prime_area FROM prime_occupied_area))
/ not_prime_sq_ft) * not_prime_item_count AS item_count
FROM summary;
Google's marketing team is making a Superbowl commercial and needs a simple statistic to put on
their TV ad: the median number of searches a person made last year.

However, at Google scale, querying the 2 trillion searches is too costly. Luckily, you have access to
the summary table which tells you the number of searches made last year and how many Google
users fall into that bucket.

Write a query to report the median of searches made by a user. Round the median to one decimal
point.

search_frequency Table:

Column Name  Type
searches     integer
num_users    integer

search_frequency Example Input:

searches  num_users
1         2
2         2
3         3
4         1

Example Output:

median
2.5
By expanding the search_frequency table, we get [1, 1, 2, 2, 3, 3, 3, 4] which has a median of 2.5
searches per user.
The dataset you are querying against may have different input & output - this is just an example!

Solution
Start by forming a common table expression (CTE) or subquery that will expand each search by the
number of users value.

Using the GENERATE_SERIES() function, we can pass in 1 as the start value and the num_users
value as the stop value. This will give us a set of numbers, [1, 1, 2, 2, 3, 3, 3, 4] , for
example. See this link for more information on this function: GENERATE_SERIES()

PostgreSQL does not provide a dedicated median function, so we use the PERCENTILE_CONT() function instead. This function takes the required percentile as an argument, in this case 0.5, i.e. the 50th percentile (the median). The WITHIN GROUP clause creates an ordered subset of the search values returned by the CTE/subquery mentioned above, over which the percentile is computed.

Finally, to make sure our output is rounded to one decimal place, you can use the ROUND() function.
Converting the value returned by the PERCENTILE_CONT to a decimal using ::decimal is
necessary since it may be an integer value which won't allow the ROUND function to work properly.

WITH searches_expanded AS (
SELECT searches
FROM search_frequency
GROUP BY
searches,
GENERATE_SERIES(1, num_users))

SELECT
ROUND(PERCENTILE_CONT(0.50) WITHIN GROUP (
ORDER BY searches)::DECIMAL, 1) AS median
FROM searches_expanded;
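The expansion trick is easy to sanity-check outside the database. Here is a short Python sketch that expands the example buckets the same way GENERATE_SERIES() does and takes the median:

```python
import statistics

# (searches, num_users) pairs from the example input
search_frequency = [(1, 2), (2, 2), (3, 3), (4, 1)]

# Expand each bucket: num_users copies of its searches value
expanded = [s for s, n in search_frequency for _ in range(n)]
print(expanded)  # [1, 1, 2, 2, 3, 3, 3, 4]

median = round(statistics.median(expanded), 1)
print(median)  # 2.5
```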
Assume you're given two tables containing data about Facebook Pages and their respective likes (as
in "Like a Facebook Page").

Write a query to return the IDs of the Facebook pages that have zero likes. The output should be
sorted in ascending order based on the page IDs.

pages Table:

Column Name  Type
page_id      integer
page_name    varchar

pages Example Input:

page_id page_name
20001 SQL Solutions
20045 Brain Exercises
20701 Tips for Data Analysts

page_likes Table:

Column Name Type


user_id integer
page_id integer
liked_date datetime

page_likes Example Input:

user_id  page_id  liked_date
111      20001    04/08/2022 00:00:00
121      20045    03/12/2022 00:00:00
156      20001    07/25/2022 00:00:00

Example Output:

page_id
20701
The dataset you are querying against may have different input & output - this is just an example!
Solution
To find the Facebook pages that do not possess any likes, we can use the EXCEPT operator. This
operator allows us to subtract the rows from one result set that exist in another result set.

Step 1: Retrieve all Facebook page IDs

In this step, we select all the page IDs from the pages table which gives us the initial set of
Facebook pages to consider.

Run the following query to obtain an overview of the data:

SELECT page_id
FROM pages;
Step 2: Retrieve Facebook page IDs with likes

Here, we select the page IDs from the page_likes table, representing the set of Facebook pages
that have received likes.

Execute the following query to get an overview of the data:

SELECT page_id
FROM page_likes;
Step 3: Find Facebook page IDs without likes

Using the EXCEPT operator, we subtract the page IDs with likes from the initial set of all page IDs.
The resulting query will give us the IDs of the Facebook pages that do not possess any likes.

SELECT page_id
FROM pages
EXCEPT
SELECT page_id
FROM page_likes;
Additionally, here are alternative methods to solve the problem:

Method #2: Using LEFT OUTER JOIN

In this method, a LEFT OUTER JOIN is performed between the pages and page_likes tables.

The LEFT OUTER JOIN selects all rows from the left table ( pages ) and the matching rows from
the right table ( page_likes ).

By checking for NULL values in the likes.page_id column, we can identify the Facebook pages
that do not possess any likes.
SELECT pages.page_id
FROM pages
LEFT OUTER JOIN page_likes AS likes
ON pages.page_id = likes.page_id
WHERE likes.page_id IS NULL;
Method #3: Using NOT IN clause

SELECT page_id
FROM pages
WHERE page_id NOT IN (
SELECT page_id
FROM page_likes
WHERE page_id IS NOT NULL
);
Method #4: Using NOT EXISTS clause

This method utilizes the NOT EXISTS clause to check for the non-existence of matching records in
the page_likes table. It efficiently identifies the Facebook pages without any likes.

SELECT page_id
FROM pages
WHERE NOT EXISTS (
  SELECT page_id
  FROM page_likes AS likes
  WHERE likes.page_id = pages.page_id
);
Assume you're given the table on user viewership categorised by device type where the three types
are laptop, tablet, and phone.

Write a query that calculates the total viewership for laptops and mobile devices where mobile is defined as the sum of tablet and phone viewership. Output the total viewership for laptops as laptop_views and the total viewership for mobile devices as mobile_views .

Effective 15 April 2023, the solution has been updated with a more concise and easy-to-understand
approach.

viewership Table

Column Name  Type
user_id      integer
device_type  string ('laptop', 'tablet', 'phone')
view_time    timestamp

viewership Example Input

user_id device_type view_time


123 tablet 01/02/2022 00:00:00
125 laptop 01/07/2022 00:00:00
128 laptop 02/09/2022 00:00:00
129 phone 02/09/2022 00:00:00
145 tablet 02/24/2022 00:00:00

Example Output

laptop_views mobile_views
2 3

Explanation
Based on the example input, there are a total of 2 laptop views and 3 mobile views.

The dataset you are querying against may have different input & output - this is just an example!

Solution
To calculate the viewership on different devices (laptops vs. mobile devices), we can utilize the
aggregate function COUNT() along with the FILTER clause to apply conditional expressions.

SELECT
COUNT(*) FILTER (WHERE conditional_expression)
FROM table_name;
In the given example, the device types 'tablet' and 'phone' are considered as 'mobile' devices, while
'laptop' is treated as a separate device type.

The following query can be used to obtain the desired result:


SELECT
COUNT(*) FILTER (WHERE device_type = 'laptop') AS laptop_views,
COUNT(*) FILTER (WHERE device_type IN ('tablet', 'phone')) AS mobile_views
FROM viewership;
In the first column laptop_views , COUNT(*) FILTER (WHERE device_type =
'laptop') calculates the count of rows where the device type is labeled as 'laptop'.

In the second column mobile_views , COUNT(*) FILTER (WHERE device_type IN ('tablet', 'phone')) counts the number of rows where the device type is a tablet or a phone.

The result would have two columns, laptop_views and mobile_views displaying the respective
counts of views for each device type.

laptop_views mobile_views

2 3

Solution #2: Using SUM() & CASE statement

An equivalent approach uses the SUM() aggregate function with a CASE expression:

SELECT
  SUM(CASE WHEN device_type = 'laptop' THEN 1 ELSE 0 END) AS laptop_views,
  SUM(CASE WHEN device_type IN ('tablet', 'phone') THEN 1 ELSE 0 END) AS mobile_views
FROM viewership;
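As a sanity check, the example rows can be loaded into an in-memory SQLite database with Python's sqlite3 module (a sketch; the table and column names follow the problem statement, and the portable SUM(CASE ...) form is used because SQLite only supports the FILTER clause from version 3.30):

```python
import sqlite3

# In-memory copy of the example viewership table (schema assumed from the problem).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE viewership (user_id INTEGER, device_type TEXT, view_time TEXT)")
conn.executemany(
    "INSERT INTO viewership VALUES (?, ?, ?)",
    [
        (123, "tablet", "2022-01-02 00:00:00"),
        (125, "laptop", "2022-01-07 00:00:00"),
        (128, "laptop", "2022-02-09 00:00:00"),
        (129, "phone", "2022-02-09 00:00:00"),
        (145, "tablet", "2022-02-24 00:00:00"),
    ],
)

# Conditional aggregation: each row contributes 1 to exactly one bucket.
laptop_views, mobile_views = conn.execute(
    """
    SELECT
      SUM(CASE WHEN device_type = 'laptop' THEN 1 ELSE 0 END) AS laptop_views,
      SUM(CASE WHEN device_type IN ('tablet', 'phone') THEN 1 ELSE 0 END) AS mobile_views
    FROM viewership
    """
).fetchone()

print(laptop_views, mobile_views)  # 2 3
```

Running the FILTER version of the query against the same table returns the same pair on SQLite 3.30+.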

As the lead data analyst for a prominent music event management company, you have been
entrusted with a dataset containing concert revenue and detailed information about various artists.

Your mission is to unlock valuable insights by analyzing the concert revenue data and identifying the
top revenue-generating artists within each music genre.

Write a query to rank the artists within each genre based on their revenue per member and extract
the top revenue-generating artist from each genre. Display the output of the artist name, genre,
concert revenue, number of members, and revenue per band member, sorted by the highest revenue
per member within each genre.

concerts Schema:

Column Name        Type          Description
artist_id          integer       A unique identifier for each artist or band performing in the concert.
artist_name        varchar(100)  The name of the artist or band performing in the concert.
genre              varchar(50)   The music genre associated with the concert.
concert_revenue    integer       The total revenue generated from the concert.
year_of_formation  integer       The year that the artist or band was formed.
country            varchar(50)   The country of origin or residence of the artist or band.
number_of_members  integer       The number of members in the band.
album_released     integer       The total number of albums released by the artist or band.
label              varchar(100)  The record label or music company associated with the artist or band.

concerts Example Input:

artist_id  artist_name   genre  concert_revenue  year_of_formation  country         number_of_members  album_released  label
103        Taylor Swift  Pop    700000           2004               United States   1                  9               Republic Records
104        BTS           K-Pop  800000           2013               South Korea     7                  7               Big Hit Music
105        Adele         Pop    600000           2006               United Kingdom  1                  3               Columbia Records
109        Blackpink     K-Pop  450000           2016               South Korea     4                  5               YG Entertainment
110        Maroon 5      Pop    550000           1994               United States   5
Step 1: Ranking Artists within Each Genre by Revenue per Member

Solution #1: Using CTE

WITH ranked_concerts_cte AS (
SELECT
artist_name,
concert_revenue,
genre,
number_of_members,
concert_revenue / number_of_members AS revenue_per_member,
RANK() OVER (
PARTITION BY genre
ORDER BY concert_revenue / number_of_members DESC) AS ranked_concerts
FROM concerts
)

SELECT *
FROM ranked_concerts_cte;
Solution #2: Using Subquery

SELECT
artist_name,
concert_revenue,
genre,
number_of_members,
concert_revenue / number_of_members AS revenue_per_member,
RANK() OVER (
PARTITION BY genre
ORDER BY concert_revenue / number_of_members DESC) AS ranked_concerts
FROM concerts;

Step 2: Selecting the Top Revenue-Generating Artists within Each Genre

In both solutions, we use either CTE or subquery results to extract the top revenue-generating artists
in each music genre.

Solution #1: Using CTE

WITH ranked_concerts_cte AS (
SELECT
artist_name,
concert_revenue,
genre,
number_of_members,
concert_revenue / number_of_members AS revenue_per_member,
RANK() OVER (
PARTITION BY genre
ORDER BY concert_revenue / number_of_members DESC) AS ranked_concerts
FROM concerts
)

SELECT
artist_name,
concert_revenue,
genre,
number_of_members,
revenue_per_member
FROM ranked_concerts_cte
WHERE ranked_concerts = 1
ORDER BY revenue_per_member DESC;
Solution #2: Using Subquery

SELECT
artist_name,
concert_revenue,
genre,
number_of_members,
revenue_per_member
FROM (
-- Subquery Result
SELECT
artist_name,
concert_revenue,
genre,
number_of_members,
concert_revenue / number_of_members AS revenue_per_member,
RANK() OVER (
PARTITION BY genre
ORDER BY concert_revenue / number_of_members DESC) AS ranked_concerts
FROM concerts) AS subquery
WHERE ranked_concerts = 1
ORDER BY revenue_per_member DESC;
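The full pipeline can be verified with Python's sqlite3 (window functions require SQLite 3.25+). This sketch loads only the columns the query touches, using the example rows above:

```python
import sqlite3

# Minimal concerts table with just the columns the ranking query uses.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE concerts (artist_name TEXT, genre TEXT, "
    "concert_revenue INTEGER, number_of_members INTEGER)"
)
conn.executemany(
    "INSERT INTO concerts VALUES (?, ?, ?, ?)",
    [
        ("Taylor Swift", "Pop", 700000, 1),
        ("BTS", "K-Pop", 800000, 7),
        ("Adele", "Pop", 600000, 1),
        ("Blackpink", "K-Pop", 450000, 4),
        ("Maroon 5", "Pop", 550000, 5),
    ],
)

# Rank within each genre, then keep only rank 1 (integer division, as in the solution).
rows = conn.execute(
    """
    WITH ranked_concerts_cte AS (
      SELECT
        artist_name,
        genre,
        concert_revenue / number_of_members AS revenue_per_member,
        RANK() OVER (
          PARTITION BY genre
          ORDER BY concert_revenue / number_of_members DESC) AS ranked_concerts
      FROM concerts
    )
    SELECT artist_name, genre, revenue_per_member
    FROM ranked_concerts_cte
    WHERE ranked_concerts = 1
    ORDER BY revenue_per_member DESC
    """
).fetchall()

print(rows)  # [('Taylor Swift', 'Pop', 700000), ('BTS', 'K-Pop', 114285)]
```

Taylor Swift tops Pop (700000 per member) and BTS tops K-Pop (800000 / 7 = 114285 with integer division).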

WHERE vs. HAVING

When it filters:
  WHERE  filters values BEFORE grouping
  HAVING filters values AFTER grouping

Operates on data from:
  WHERE  operates on individual rows
  HAVING operates on aggregated values from groups of rows

Example:
  WHERE:  SELECT username, followers FROM instagram_data WHERE followers > 1000;
  HAVING: SELECT country FROM instagram_data GROUP BY country HAVING AVG(followers) > 100;
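The distinction can be seen directly on a small in-memory table. The instagram_data table and column names come from the example queries; the sample rows are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instagram_data (username TEXT, country TEXT, followers INTEGER)")
conn.executemany(
    "INSERT INTO instagram_data VALUES (?, ?, ?)",
    [
        ("alice", "US", 1500),  # sample rows invented for this demo
        ("bob", "US", 50),
        ("carol", "UK", 300),
        ("dave", "UK", 80),
    ],
)

# WHERE filters individual rows before any grouping happens.
big_accounts = conn.execute(
    "SELECT username FROM instagram_data WHERE followers > 1000"
).fetchall()

# HAVING filters whole groups after aggregation (sorted for a stable result order).
active_countries = sorted(conn.execute(
    "SELECT country FROM instagram_data GROUP BY country HAVING AVG(followers) > 100"
).fetchall())

print(big_accounts)      # [('alice',)]
print(active_countries)  # [('UK',), ('US',)]
```

Only alice survives the row-level WHERE filter, while both countries pass the group-level HAVING filter because their average follower counts exceed 100.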
SELECT
  CASE
    WHEN condition_1 THEN result_1
    WHEN condition_2 THEN result_2
    WHEN ... THEN ...
    ELSE result_3 -- If no condition matches, return the result in the ELSE clause
  END AS new_column_name

SQL Wildcard Summary


Don't worry about memorizing each one of the SQL wildcard patterns below – it's very
easy to look this up when you need it. Instead, use this table as a reference!

Example in Query                 Definition
WHERE first_name LIKE 'a%'       Finds any values that start with "a"
WHERE first_name LIKE '%a'       Finds any values that end with "a"
WHERE first_name LIKE '%ae%'     Finds any values that have "ae" in the middle
WHERE first_name LIKE '_b%'      Finds any values with "b" in the second position
WHERE first_name LIKE 'a%o'      Finds any values that start with "a" and end with "o"
WHERE first_name LIKE 'a___'     Finds any value that starts with "a" and is exactly 4 characters long
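A quick way to experiment with these patterns is an in-memory SQLite table (the users/first_name names follow the examples; the sample names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (first_name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?)",
    [("alex",), ("maria",), ("abby",), ("antonio",), ("rae",)],
)

def matching(pattern):
    # LIKE patterns can be bound as ordinary query parameters.
    rows = conn.execute(
        "SELECT first_name FROM users WHERE first_name LIKE ?", (pattern,)
    ).fetchall()
    return sorted(name for (name,) in rows)

starts_with_a = matching("a%")     # ['abby', 'alex', 'antonio']
second_is_b = matching("_b%")      # ['abby']
four_chars_a = matching("a___")    # ['abby', 'alex'] -- 'a' plus exactly 3 more characters
```

Each underscore matches exactly one character, which is why 'a___' matches only the four-letter names.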
1) Question: Given a table users with a column full_name, write a SQL query to
split the full_name into two separate columns first_name and last_name. Assume
names are always separated by a space.

Answer:

SELECT
  SUBSTRING(full_name, 1, CHARINDEX(' ', full_name) - 1) AS first_name,
  SUBSTRING(full_name, CHARINDEX(' ', full_name) + 1, LEN(full_name)) AS last_name
FROM users;
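The answer above uses SQL Server's CHARINDEX, SUBSTRING, and LEN. A sketch of the equivalent in SQLite (INSTR in place of CHARINDEX, SUBSTR in place of SUBSTRING), runnable via Python's sqlite3 with invented sample names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (full_name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)",
                 [("Ada Lovelace",), ("Alan Turing",)])

# INSTR returns the 1-based position of the first space;
# everything before it is the first name, everything after it the last name.
rows = conn.execute(
    """
    SELECT
      SUBSTR(full_name, 1, INSTR(full_name, ' ') - 1) AS first_name,
      SUBSTR(full_name, INSTR(full_name, ' ') + 1)    AS last_name
    FROM users
    """
).fetchall()

print(rows)  # [('Ada', 'Lovelace'), ('Alan', 'Turing')]
```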

2) Question: In a table products with a column description, write a query to
find all products where the description starts with 'Toy' and ends with 'Car'.

Answer:

SELECT * FROM products WHERE description LIKE 'Toy%Car';

3) Question: Given a table employees with a column email, write a SQL query to
extract the domain name from each email address.

Answer:

SELECT SUBSTRING(email, CHARINDEX('@', email) + 1, LEN(email) - CHARINDEX('@', email)) AS domain_name
FROM employees;
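Translated to SQLite (INSTR for CHARINDEX; omitting SUBSTR's length argument reads to the end of the string), a sketch with invented addresses:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (email TEXT)")
conn.executemany("INSERT INTO employees VALUES (?)",
                 [("ada@example.com",), ("alan@test.org",)])

# Everything after the '@' is the domain.
domains = conn.execute(
    "SELECT SUBSTR(email, INSTR(email, '@') + 1) AS domain_name FROM employees"
).fetchall()

print(domains)  # [('example.com',), ('test.org',)]
```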

4) Question: In a table books with a column title, write a query to find all
books where the title has the word 'SQL' appearing more than once.

Answer:

SELECT * FROM books
WHERE (LEN(title) - LEN(REPLACE(title, 'SQL', ''))) / LEN('SQL') > 1;
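The same length-difference trick works in SQLite with LENGTH in place of LEN; a sketch with invented sample titles:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT)")
conn.executemany("INSERT INTO books VALUES (?)",
                 [("SQL for SQL Lovers",), ("Learning SQL",), ("Python Basics",)])

# Occurrence count = (length lost by deleting 'SQL') / length of 'SQL'.
rows = conn.execute(
    """
    SELECT title FROM books
    WHERE (LENGTH(title) - LENGTH(REPLACE(title, 'SQL', ''))) / LENGTH('SQL') > 1
    """
).fetchall()

print(rows)  # [('SQL for SQL Lovers',)]
```

"Learning SQL" is excluded because its count is exactly 1, not more than 1.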

5) Question: Given a table messages with a column text, write a SQL query to
replace all occurrences of the word 'urgent' with 'important'.

Answer:

SELECT REPLACE(text, 'urgent', 'important') AS updated_text FROM messages;

6) Question: In a table orders with a column product_codes (comma-separated
product codes in each row), write a SQL query to find all orders that contain
the product code 'P123'.

Answer:

SELECT * FROM orders WHERE ',' + product_codes + ',' LIKE '%,P123,%';
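SQLite concatenates strings with || rather than T-SQL's +; otherwise the comma-wrapping trick is unchanged. A sketch with an invented orders table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, product_codes TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "P123,P456"), (2, "P999"), (3, "P1234,P123")])

# Wrapping both the column and the search code in commas prevents 'P123'
# from falsely matching the longer code 'P1234'.
rows = conn.execute(
    "SELECT order_id FROM orders WHERE ',' || product_codes || ',' LIKE '%,P123,%'"
).fetchall()

print(rows)  # [(1,), (3,)]
```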

7) Question: Given a table comments with a column comment_text, write a SQL
query to find all comments that have more than three words.

Answer:

SELECT * FROM comments
WHERE (LEN(comment_text) - LEN(REPLACE(comment_text, ' ', ''))) + 1 > 3;
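A sketch of the word-count trick in SQLite (LENGTH for LEN), with invented sample comments; it assumes words are separated by single spaces:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (comment_text TEXT)")
conn.executemany("INSERT INTO comments VALUES (?)",
                 [("great post",),
                  ("this is really very helpful",),
                  ("nice work indeed",)])

# Word count = number of spaces + 1, assuming single-space separators.
rows = conn.execute(
    """
    SELECT comment_text FROM comments
    WHERE (LENGTH(comment_text) - LENGTH(REPLACE(comment_text, ' ', ''))) + 1 > 3
    """
).fetchall()

print(rows)  # [('this is really very helpful',)]
```

"nice work indeed" has exactly three words, so it does not satisfy the strictly-greater-than condition.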
