Day 89

The document outlines a method to find employees hired in the last n months using PySpark by calculating the difference in months between the current date and the HireDate. It provides a step-by-step approach including schema definition, DataFrame creation, and filtering based on the calculated month difference. The example includes code snippets demonstrating the implementation of the solution.

Uploaded by

Richard Smith
Copyright
© All Rights Reserved

Scenario-Based Interview Question

Ganesh. R
Problem Statement:

Find the employees hired in the last n months. In PySpark, this can be done by calculating the difference in months between the current date (or a specific reference date) and the HireDate, and then filtering on that difference. The SQL version uses DATEDIFF; the DataFrame version uses months_between. The examples use 2021-02-01 as the reference date and n = 3.

Approach:
1. Use months_between to calculate the difference in months between the HireDate and the current date (or a specified reference date).
2. Filter the DataFrame on whether the difference is less than or equal to n.
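To make the month difference concrete, the following plain-Python sketch approximates how months_between behaves for dates without a time component. It assumes Spark's documented convention: a whole number when both dates fall on the same day of month (or are both month-ends), otherwise a fraction based on 31-day months, rounded to 8 digits. The helper name months_between_approx is hypothetical and is not part of PySpark.

```python
import calendar
from datetime import date


def months_between_approx(end: date, start: date) -> float:
    """Approximate months_between(end, start) for plain dates (no time part)."""
    # Whole calendar months between the two year/month pairs.
    months = (end.year - start.year) * 12 + (end.month - start.month)

    end_is_month_end = end.day == calendar.monthrange(end.year, end.month)[1]
    start_is_month_end = start.day == calendar.monthrange(start.year, start.month)[1]

    # Same day of month, or both month-ends: the result is a whole number.
    if end.day == start.day or (end_is_month_end and start_is_month_end):
        return float(months)

    # Otherwise the day offset is expressed as a fraction of a 31-day month.
    return round(months + (end.day - start.day) / 31.0, 8)


reference = date(2021, 2, 1)
alice = months_between_approx(reference, date(2021, 1, 7))   # a bit under 1 month
ivo = months_between_approx(reference, date(2020, 10, 4))    # a bit under 4 months
print(alice, int(alice))
print(ivo, int(ivo))
```

Truncating with int() mirrors the integer cast used later in the DataFrame solution, which is why Alice ends up with a difference of 0 and Ivo with 3.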
Input Table

FirstName   LastName            HireDate
Alice       Ciccu               2021-01-07
Paula       Barreto de Mattos   2021-01-06
Alejandro   McGuel              2020-12-06
Kendall     Keil                2020-11-05
Ivo         Salmre              2020-10-04
Paul        Komosinski          2020-08-04
Ashvini     Sharma              2020-07-04
Zheng       Mu                  2020-04-03
Stuart      Munson              2019-11-02
Greg        Alderson            2019-10-02
David       Johnson             2019-01-02

Output Table

FirstName   LastName            HireDate     diff_month
Alice       Ciccu               2021-01-07   0
Paula       Barreto de Mattos   2021-01-06   0
Alejandro   McGuel              2020-12-06   1
Kendall     Keil                2020-11-05   2
Ivo         Salmre              2020-10-04   3


from pyspark.sql.types import StructType, StructField, StringType, DateType
from pyspark.sql import functions as F

# Define the schema for the Emp table
emp_schema = StructType(
    [
        StructField("FirstName", StringType(), nullable=False),
        StructField("LastName", StringType(), nullable=False),
        StructField("HireDate", StringType(), nullable=True),  # Keep it as string initially
    ]
)

# Define the data with date strings
data = [
    ("Alice", "Ciccu", "2021-01-07"),
    ("Paula", "Barreto de Mattos", "2021-01-06"),
    ("Alejandro", "McGuel", "2020-12-06"),
    ("Kendall", "Keil", "2020-11-05"),
    ("Ivo", "Salmre", "2020-10-04"),
    ("Paul", "Komosinski", "2020-08-04"),
    ("Ashvini", "Sharma", "2020-07-04"),
    ("Zheng", "Mu", "2020-04-03"),
    ("Stuart", "Munson", "2019-11-02"),
    ("Greg", "Alderson", "2019-10-02"),
    ("David", "Johnson", "2019-01-02"),
]

# Create the DataFrame (`spark` is the SparkSession the notebook provides)
emp_df = spark.createDataFrame(data, schema=emp_schema)

# Convert the HireDate column from string to DateType using the to_date function
emp_df = emp_df.withColumn("HireDate", F.to_date(emp_df["HireDate"], "yyyy-MM-dd"))

# Show the DataFrame to confirm the result
# (display() is a Databricks notebook helper; use emp_df.show() elsewhere)
emp_df.display()

emp_df.printSchema()

root
|-- FirstName: string (nullable = false)
|-- LastName: string (nullable = false)
|-- HireDate: date (nullable = true)

emp_df.createOrReplaceTempView("emp")

%sql
SELECT
    *,
    DATEDIFF(MONTH, HireDate, '2021-02-01') AS diff_month
FROM
    emp
WHERE
    DATEDIFF(MONTH, HireDate, '2021-02-01') <= 3;
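As a cross-check on the query above, this plain-Python sketch assumes that DATEDIFF(MONTH, start, end) counts complete months elapsed, so a month is counted only once the end date's day of month has reached the start date's day of month. The function whole_months_between and the shortened sample list are illustrative assumptions, not part of Spark.

```python
from datetime import date


def whole_months_between(start: date, end: date) -> int:
    """Count complete months from start to end (assumes start <= end)."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    # The last month is incomplete if the end day has not reached the start day.
    if end.day < start.day:
        months -= 1
    return months


# A subset of the sample rows: (FirstName, HireDate).
employees = [
    ("Alice", date(2021, 1, 7)),
    ("Paula", date(2021, 1, 6)),
    ("Alejandro", date(2020, 12, 6)),
    ("Kendall", date(2020, 11, 5)),
    ("Ivo", date(2020, 10, 4)),
    ("Paul", date(2020, 8, 4)),
]

reference = date(2021, 2, 1)
n = 3

# Keep (name, diff_month) pairs whose difference is at most n.
recent = [
    (name, whole_months_between(hired, reference))
    for name, hired in employees
    if whole_months_between(hired, reference) <= n
]
print(recent)
```

Under that assumption the sketch keeps Alice, Paula, Alejandro, Kendall and Ivo with differences 0, 0, 1, 2 and 3, which matches the Output Table, and drops Paul at 5 months.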

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

df_filtered = emp_df.withColumn(
    "months_diff",
    F.months_between(F.lit("2021-02-01"), F.col("HireDate")).cast(IntegerType()),
).filter(F.col("months_diff") <= 3)

df_filtered.display()

Explanation:

F.months_between(F.lit("2021-02-01"), F.col("HireDate")): Calculates the difference in months between the reference date and the HireDate. Use F.current_date() in place of F.lit("2021-02-01") to measure from today.

.cast(IntegerType()): Converts the result to an integer, truncating fractional months.

.filter(F.col("months_diff") <= 3): Keeps the rows where the calculated months difference is less than or equal to n (here n = 3).

This returns the employees hired in the last n months. Replace 3 with any number of months you need.
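One consequence of the integer cast is worth spelling out: because fractional months are truncated, the `<= n` filter keeps hires up to just under n + 1 months old. The sketch below illustrates this in plain Python, assuming the 31-day-month convention for the fractional part; the variable names are illustrative only.

```python
# Fractional month differences relative to 2021-02-01:
# whole calendar months plus (day offset / 31).
fractional = {
    "Kendall": 3 + (1 - 5) / 31,  # hired 2020-11-05, a bit under 3 months
    "Ivo": 4 + (1 - 4) / 31,      # hired 2020-10-04, a bit under 4 months
    "Paul": 6 + (1 - 4) / 31,     # hired 2020-08-04, a bit under 6 months
}

n = 3

# Truncate first, then compare (what the integer cast does).
truncated = [name for name, months in fractional.items() if int(months) <= n]

# Compare the fractional value directly (no truncation).
strict = [name for name, months in fractional.items() if months <= n]

print(truncated)
print(strict)
```

Ivo passes the truncated filter but not the strict one. If "last n months" should mean no more than n months ago, comparing the un-truncated months_between value is the safer choice.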