Day77

Scenario-Based Interview Question

Ganesh. R
Problem Statement

Assume you have an events table from Facebook app analytics. Write a query to
calculate the app's click-through rate (CTR) for 2022, rounding the result to
2 decimal places.

Definition and note:
Click-through rate (CTR), as a percentage = 100.0 * Number of clicks / Number of impressions.
To avoid integer division, multiply by 100.0, not 100.
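The integer-division caveat can be checked in plain Python. A minimal sketch with made-up counts (a real query would aggregate the click and impression rows from the events table):

```python
# Hypothetical counts for illustration only; in practice these come
# from counting 'click' and 'impression' events in the events table.
clicks = 187
impressions = 1024

# Integer-style arithmetic truncates the fractional part of the percentage
truncated = 100 * clicks // impressions

# Multiplying by 100.0 forces float division; then round to 2 decimal places
ctr = round(100.0 * clicks / impressions, 2)

print(truncated)  # 18
print(ctr)        # 18.26
```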
Input Table Data

# Define the employee data
data = [
    (1, "Emma Thompson", 3800, 1, 6),
    (2, "Daniel Rodriguez", 2230, 1, 7),
    (3, "Olivia Smith", 7000, 1, 8),
    (4, "Noah Johnson", 6800, 2, 9),
    (5, "Sophia Martinez", 1750, 1, 11),
    (6, "Liam Brown", 13000, 3, None),
    (7, "Ava Garcia", 12500, 3, None),
    (8, "William Davis", 6800, 2, None),
    (9, "Isabella Wilson", 11000, 3, None),
    (10, "James Anderson", 4000, 1, 11),
    (11, "Mia Taylor", 10800, 3, None),
    (12, "Benjamin Hernandez", 9500, 3, 8),
    (13, "Charlotte Miller", 7000, 2, 6),
    (14, "Logan Moore", 8000, 2, 6),
    (15, "Amelia Lee", 4000, 1, 7),
]

# Define the schema
columns = ["employee_id", "name", "salary", "department_id", "manager_id"]

# Create the DataFrame
employee_df = spark.createDataFrame(data, schema=columns)

# Display the DataFrame
employee_df.display()

# Define the department data
department_data = [
    (1, "Data Analytics"),
    (2, "Data Science"),
    (3, "Data Engineering"),
]

# Define the schema for the department DataFrame
department_columns = ["department_id", "department_name"]

# Create the DataFrame for departments
department_df = spark.createDataFrame(department_data, schema=department_columns)

# Show the DataFrame
department_df.display()
Output Table

department_name    name              salary
Data Analytics     Olivia Smith      7000
Data Analytics     Amelia Lee        4000
Data Analytics     James Anderson    4000
Data Analytics     Emma Thompson     3800
Data Engineering   Liam Brown        13000
Data Engineering   Ava Garcia        12500
Data Engineering   Isabella Wilson   11000
Data Science       Logan Moore       8000
Data Science       Charlotte Miller  7000
Data Science       Noah Johnson      6800
Data Science       William Davis     6800
Problem Statement:

As part of an ongoing analysis of salary distribution within the company, your manager has
requested a report identifying high earners in each department. A 'high earner' within a
department is defined as an employee with a salary ranking among the top three salaries within
that department.

You're tasked with identifying these high earners across all departments. Write a query to
display the employee's name along with their department name and salary. Sort the results by
department name in ascending order, then by salary in descending order. If multiple employees
have the same salary, order them alphabetically by name.

Note: Ensure to utilize the appropriate ranking window function to handle duplicate salaries
effectively.

As of June 18th, we have removed the requirement for unique salaries and revised the sorting
order for the results.
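The note about the "appropriate ranking window function" matters because the three ranking functions treat duplicate salaries differently. A pure-Python sketch of the tie-handling semantics on a small, hypothetical salary list (not Spark's actual implementation, just the logic):

```python
vals = [8000, 7000, 7000, 6800]  # sorted descending; two employees tie at 7000

# ROW_NUMBER: every row gets a distinct number, so one of the tied
# employees would be arbitrarily pushed out of a "top 3" cutoff.
row_number = list(range(1, len(vals) + 1))

# RANK: ties share a rank, but the following rank is skipped,
# so 6800 ranks 4th and would miss a "ranking <= 3" filter.
rank = [1 + sum(v > x for v in vals) for x in vals]

# DENSE_RANK: ties share a rank and no ranks are skipped, so the
# filter keeps exactly the top three distinct salary levels.
distinct = sorted(set(vals), reverse=True)
dense_rank = [distinct.index(x) + 1 for x in vals]

print(row_number)  # [1, 2, 3, 4]
print(rank)        # [1, 2, 2, 4]
print(dense_rank)  # [1, 2, 2, 3]
```

This is why the solution below uses DENSE_RANK() rather than RANK() or ROW_NUMBER().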


employee_df.createOrReplaceTempView("employee")
department_df.createOrReplaceTempView("department")

%sql
WITH ranked_salary AS (
SELECT
name,
salary,
department_id,
DENSE_RANK() OVER (
PARTITION BY department_id
ORDER BY
salary DESC
) AS ranking
FROM
employee
)
SELECT
d.department_name,
s.name,
s.salary
FROM
ranked_salary AS s
INNER JOIN department AS d ON s.department_id = d.department_id
WHERE
s.ranking <= 3
ORDER BY
d.department_name ASC,
s.salary DESC,
s.name ASC;

from pyspark.sql import functions as F
from pyspark.sql import Window

# Define a window specification for ranking
window_spec = Window.partitionBy("department_id").orderBy(F.desc("salary"))

# Create a ranked DataFrame
ranked_salary_df = employee_df.withColumn("ranking", F.dense_rank().over(window_spec))

# Join the ranked DataFrame with the department DataFrame,
# keep the top three salary levels, and order the output
result_df = (
    ranked_salary_df.join(department_df, "department_id")
    .filter(ranked_salary_df.ranking <= 3)
    .select(
        department_df.department_name,
        ranked_salary_df.name,
        ranked_salary_df.salary,
    )
    .orderBy(
        department_df.department_name.asc(),
        ranked_salary_df.salary.desc(),
        ranked_salary_df.name.asc(),
    )
)

# Show the result
result_df.display()

Explanation:

Window Specification: A Window is defined to partition the data by department_id and order it
by salary in descending order.

Ranking: The dense_rank() function is used to calculate rankings for each employee within their
department.

Joining: The join() method combines the ranked employee DataFrame with the department
DataFrame on the department_id.

Filtering and Selecting: The filter condition restricts the results to only include the top 3 salaries
in each department, and specific columns are selected for the final output.

Ordering: Finally, the results are ordered according to the department name, salary, and
employee name as required.

You can run this PySpark code in your environment to get the top three employees by salary for
each department along with the department names.
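As a sanity check, the same pipeline can be replayed in plain Python against the sample rows, with no Spark session required (a sketch of the logic, not of Spark's execution):

```python
from collections import defaultdict

# Sample rows: (employee_id, name, salary, department_id, manager_id)
employees = [
    (1, "Emma Thompson", 3800, 1, 6), (2, "Daniel Rodriguez", 2230, 1, 7),
    (3, "Olivia Smith", 7000, 1, 8), (4, "Noah Johnson", 6800, 2, 9),
    (5, "Sophia Martinez", 1750, 1, 11), (6, "Liam Brown", 13000, 3, None),
    (7, "Ava Garcia", 12500, 3, None), (8, "William Davis", 6800, 2, None),
    (9, "Isabella Wilson", 11000, 3, None), (10, "James Anderson", 4000, 1, 11),
    (11, "Mia Taylor", 10800, 3, None), (12, "Benjamin Hernandez", 9500, 3, 8),
    (13, "Charlotte Miller", 7000, 2, 6), (14, "Logan Moore", 8000, 2, 6),
    (15, "Amelia Lee", 4000, 1, 7),
]
departments = {1: "Data Analytics", 2: "Data Science", 3: "Data Engineering"}

# Group salaries by department (the PARTITION BY step)
by_dept = defaultdict(list)
for _, name, salary, dept_id, _ in employees:
    by_dept[dept_id].append((name, salary))

# Dense-rank each department's salaries and keep ranks 1-3
result = []
for dept_id, rows in by_dept.items():
    distinct = sorted({s for _, s in rows}, reverse=True)
    rank_of = {s: i + 1 for i, s in enumerate(distinct)}
    for name, salary in rows:
        if rank_of[salary] <= 3:
            result.append((departments[dept_id], name, salary))

# Department ascending, salary descending, name ascending
result.sort(key=lambda r: (r[0], -r[2], r[1]))
for row in result:
    print(row)
```

The printed rows match the output table above: four Data Analytics rows (the tie at 4000 keeps both Amelia Lee and James Anderson), three Data Engineering rows, and four Data Science rows.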
IF YOU FOUND THIS POST
USEFUL, PLEASE SAVE IT.

Ganesh. R

THANK YOU
For Your Support

I appreciate your support on my account; I will never stop sharing knowledge.

rganesh203 (Ganesh R)