Day 89 Interview Question
Ganesh. R
Problem Statement:
Find the employees hired in the last n months.
Input Table
Paula | Barreto de Mattos | 2021-01-06 | 0
To find employees hired in the last n months, calculate the difference in months between the
HireDate and the current date (or a specific reference date), then filter on that difference.
In SQL this can be done with DATEDIFF; in PySpark, with months_between.
Approach: Use months_between to calculate the difference in months between the HireDate
and the current date (or a specified reference date). Filter the DataFrame based on whether the
difference is less than or equal to n.
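To make the month arithmetic concrete, here is a plain-Python sketch of the rule Spark documents for months_between, applied to date-only values. The helper names is_last_day and months_between below are illustrative stand-ins, not the Spark functions, and time of day is ignored as a simplifying assumption.

```python
from datetime import date
import calendar


def is_last_day(d: date) -> bool:
    """True when d is the last day of its month."""
    return d.day == calendar.monthrange(d.year, d.month)[1]


def months_between(end: date, start: date) -> float:
    """Approximate Spark's months_between(end, start) for date-only inputs."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    # Same day of month, or both dates are month-ends: a whole number of months.
    if end.day == start.day or (is_last_day(end) and is_last_day(start)):
        return float(months)
    # Otherwise the leftover days count as a fraction of a 31-day month.
    return round(months + (end.day - start.day) / 31, 8)


hire_date = date(2021, 1, 6)   # HireDate from the sample row
reference = date(2021, 2, 1)   # reference date used in the queries

diff = months_between(reference, hire_date)
print(diff)       # fractional months, a little under 1
print(int(diff))  # int() truncates toward zero, so this prints 0
```

An employee hired on 2021-01-06 is less than one full month before 2021-02-01, so the truncated difference is 0 and the row passes any filter of the form difference <= n with n >= 0.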
emp_df.printSchema()
root
|-- FirstName: string (nullable = false)
|-- LastName: string (nullable = false)
|-- HireDate: date (nullable = true)
emp_df.createOrReplaceTempView("emp")
%sql
SELECT
    *,
    DATEDIFF(MONTH, HireDate, '2021-02-01') AS diff_month
FROM emp
WHERE DATEDIFF(MONTH, HireDate, '2021-02-01') <= 3;
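from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Whole months from HireDate to the reference date (the cast drops the fraction)
df_filtered = emp_df.withColumn(
    "months_diff",
    F.months_between(F.lit("2021-02-01"), F.col("HireDate")).cast(IntegerType()),
).filter(F.col("months_diff") <= 3)  # n = 3
df_filtered.display()  # Databricks notebooks; use .show() elsewhere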
Explanation:
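Both versions return employees whose HireDate is at most n whole months before the reference
date; here n = 3 and the reference date is 2021-02-01. Replace 3 with any number of months you
need, and use the current date instead of the fixed reference date to get hires from the last n
months as of today.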
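The filter logic can also be tried without a cluster. The sketch below is plain Python, not PySpark: whole_months and hired_in_last_n_months are hypothetical helpers, whole_months counts completed calendar months as a simplification of what the queries compute, and every row except Paula's is made-up sample data. The lower bound of 0 is an added assumption that keeps out anyone with a HireDate after the reference date, which a bare <= n check would let through.

```python
from datetime import date


def whole_months(start: date, end: date) -> int:
    """Completed calendar months from start to end (negative if start is later)."""
    if start > end:
        return -whole_months(end, start)
    months = (end.year - start.year) * 12 + (end.month - start.month)
    # The final month only counts once the day of month has been reached.
    if end.day < start.day:
        months -= 1
    return months


def hired_in_last_n_months(rows, n, ref_date=None):
    """Keep rows whose hire date is 0 to n whole months before ref_date."""
    ref = date.today() if ref_date is None else ref_date
    return [r for r in rows if 0 <= whole_months(r[2], ref) <= n]


rows = [
    ("Paula", "Barreto de Mattos", date(2021, 1, 6)),  # row from the input table
    ("Ana", "Silva", date(2020, 10, 15)),     # hypothetical: 3 whole months back
    ("Bruno", "Costa", date(2020, 9, 20)),    # hypothetical: 4 whole months back
    ("Carla", "Souza", date(2021, 3, 10)),    # hypothetical: after the reference date
]

result = hired_in_last_n_months(rows, n=3, ref_date=date(2021, 2, 1))
for first, last, hired in result:
    print(first, last, hired)
```

With n = 3 and the reference date 2021-02-01, only Paula and Ana are kept. Ana was hired about three and a half months earlier, which shows that truncating to whole months makes the <= n check slightly more generous than an exact n-month window.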