Pyspark Hands on
WWW.LINKEDIN.COM/IN/AKASHMAHINDRAKAR
PYSPARK LEARNING HUB
OUTPUT
ACTOR_ID DIRECTOR_ID
1 1
schema = StructType([
StructField("ActorId",IntegerType(),True),
StructField("DirectorId",IntegerType(),True),
StructField("timestamp",IntegerType(),True)
])
data = [
(1, 1, 0),
(1, 1, 1),
(1, 1, 2),
(1, 2, 3),
(1, 2, 4),
(2, 1, 5),
(2, 1, 6)
]
df = spark.createDataFrame(data, schema)
df.show()
df_group = df.groupBy('ActorId', 'DirectorId').count()
df_group.show()
# Keep only pairs that cooperated at least three times (second output)
df_group.filter(df_group['count'] >= 3).show()
+-------+----------+-----+
|ActorId|DirectorId|count|
+-------+----------+-----+
| 1| 2| 2|
| 1| 1| 3|
| 2| 1| 2|
+-------+----------+-----+
+-------+----------+-----+
|ActorId|DirectorId|count|
+-------+----------+-----+
| 1| 1| 3|
+-------+----------+-----+
PYSPARK LEARNING HUB : DAY - 2
Ads Performance
Write a PySpark code to find the CTR of each ad. Round the CTR to 2
decimal points. Order the result table by ctr in descending order
and by ad_id in ascending order in case of a tie.
CTR = Clicked / (Clicked + Viewed)
INPUT
AD_ID USER_ID ACTION
1 1 Clicked
2 2 Clicked
3 3 Viewed
5 5 Ignored
1 7 Ignored
2 7 Viewed
3 5 Clicked
1 4 Viewed
2 11 Viewed
1 2 Clicked
OUTPUT
AD_ID CTR
1 0.67
3 0.5
2 0.33
5 0
# Define the data for the Ads table
data = [
(1, 1, 'Clicked'),
(2, 2, 'Clicked'),
(3, 3, 'Viewed'),
(5, 5, 'Ignored'),
(1, 7, 'Ignored'),
(2, 7, 'Viewed'),
(3, 5, 'Clicked'),
(1, 4, 'Viewed'),
(2, 11, 'Viewed'),
(1, 2, 'Clicked')
]
ctr_df = (
    ads_df.groupBy("ad_id")
    .agg(
        F.sum(F.when(ads_df["action"] == "Clicked", 1).otherwise(0)).alias("click_count"),
        F.sum(F.when(ads_df["action"] == "Viewed", 1).otherwise(0)).alias("view_count")
    )
    .withColumn("ctr", F.round(F.col("click_count") / (F.col("click_count") + F.col("view_count")), 2))
)
PYSPARK LEARNING HUB : DAY - 3
Combine Two DF
Write a Pyspark program to report the first name, last name, city, and state of each person in the
Person dataframe. If the address of a personId is not present in the Address dataframe,
report null instead.
INPUT-1 persons
PERSONID LASTNAME FIRSTNAME
1 Wang Allen
2 Alice Bob
INPUT-2 addresses
ADDRESSID PERSONID CITY STATE
OUTPUT
FIRSTNAME LASTNAME CITY STATE
Bob Alice New York City New York
Allen Wang
person_df.show()
address_df.show()
person_df.join(address_df, person_df.personId == address_df.personId, 'left') \
    .select('firstName', 'lastName', 'city', 'state') \
    .show()
PYSPARK LEARNING HUB : DAY - 4
DataFrame:
# Define the schema for the "employees" DataFrame
employees_schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("salary", IntegerType(), True),
StructField("managerId", IntegerType(), True)
])
OUTPUT
NAME
Joe
from pyspark.sql.functions import col
emp_df1 = emp_df.alias("e1")
emp_df2 = emp_df.alias("e2")
# Join each employee to their manager; expose the manager's salary as "msal"
self_joined_df = emp_df1.join(emp_df2, col("e1.managerId") == col("e2.id")) \
    .select(col("e1.name").alias("name"), col("e1.salary").alias("salary"), col("e2.salary").alias("msal"))
self_joined_df.filter(self_joined_df.salary > self_joined_df.msal).select("name").show()
Duplicate Emails
Write a Pyspark program to report all the duplicate emails.
Note that it's guaranteed that the email field is not NULL.
Difficulty Level: EASY
DataFrame:
# Define the schema for the "Person" DataFrame
person_schema = StructType([
StructField("id", IntegerType(), True),
StructField("email", StringType(), True)
])
INPUT
ID EMAIL
1 [email protected]
2 [email protected]
3 [email protected]
OUTPUT
EMAIL
[email protected]
df_group=df.groupby("email").count()
df_group.filter(df_group["count"] > 1).show()
PYSPARK LEARNING HUB : DAY - 6
INPUT - 2 orders
ID CUSTOMERID
1 3
2 1
OUTPUT
NAME
Max
Henry
customers_schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True)
])
# Define data for the "Orders"
orders_data = [
(1, 3),
(2, 1)
]
cus_df.show()
ord_df.show()
cus_df.join(ord_df,cus_df.id == ord_df.customerId,"left_anti")\
.select("name").show()
PYSPARK LEARNING HUB : DAY - 7
07_Rising Temperature
Write a solution to find the ids of all dates with a higher
temperature than the previous date (yesterday).
DataFrame:
# Define the schema for the "Weather" table
weather_schema = StructType([
StructField("id", IntegerType(), True),
StructField("recordDate", StringType(), True),
StructField("temperature", IntegerType(), True)
])
INPUT
ID RECORDDATE TEMPERATURE
1 2015-01-01 10
2 2015-01-02 25
3 2015-01-03 20
4 2015-01-04 30
OUTPUT
ID
2
4
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import lag
from pyspark.sql.window import Window

lag_df = temp_df.withColumn("prev_day", lag(temp_df.temperature).over(Window.orderBy(temp_df.recordDate)))
lag_df.show()
lag_df.filter(lag_df["temperature"] > lag_df["prev_day"]).select("id").show()
PYSPARK LEARNING HUB : DAY - 8
INPUT
PLAYER_ID DEVICE_ID EVENT_DATE GAMES_PLAYED
1 2 2016-03-01 5
1 2 2016-05-02 6
2 3 2017-06-25 1
3 1 2016-03-02 0
3 4 2018-07-03 5
OUTPUT
PLAYER_ID FIRST_LOGIN
1 2016-03-01
2 2017-06-25
3 2016-03-02
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import rank
from pyspark.sql.window import Window

rank_df = activity_df.withColumn("RK", rank().over(Window.partitionBy(activity_df['player_id']).orderBy(activity_df['event_date'])))
rank_df.show()
rank_df.filter(rank_df["RK"] == 1).select("player_id", rank_df["event_date"].alias("First_Login")).show()
PYSPARK LEARNING HUB : DAY - 9
INPUT
PLAYER_ID DEVICE_ID EVENT_DATE GAMES_PLAYED
1 2 2016-03-01 5
1 2 2016-05-02 6
2 3 2017-06-25 1
3 1 2016-03-02 0
3 4 2018-07-03 5
OUTPUT
PLAYER_ID DEVICE_ID
1 2
2 3
3 1
rank_df = df.withColumn("rk", rank().over(Window.partitionBy(df["player_id"]).orderBy(df["event_date"])))
rank_df.show()
rank_df.filter(rank_df["rk"] == 1).select("player_id", "device_id").show()
PYSPARK LEARNING HUB : DAY - 10
10_Employee Bonus
Write a solution to report the name and bonus amount of
each employee with a bonus less than 1000.
Return the result table in any order
# Define data for the "Bonus"
bonus_data = [
(2, 500),
(4, 2000)
]
INPUT-2 BONUS
EMPID BONUS
2 500
4 2000
OUTPUT
NAME BONUS
Brad
John
Dan 500
bonus_schema = StructType([
StructField("empId", IntegerType(), True),
StructField("bonus", IntegerType(), True)
])
join_df = emp_df.join(bonus_df, emp_df.empId == bonus_df.empId, "left")
join_df.show()
PYSPARK LEARNING HUB : DAY - 11
INPUT
ID NAME REFEREE_ID
1 Will
2 Jane
3 Alex 2
4 Bill
5 Zack 1
6 Mark 2
OUTPUT
NAME
Will
Jane
Bill
Zack
+-----+
| name|
+-----+
| Will|
| Jane|
| Bill|
| Zack|
+-----+
PYSPARK LEARNING HUB : DAY - 12
INPUT-1 trade
INPUT - 2 user
OUTPUT
CITY COUNT(*)
San Francisco 3
Boston 2
Denver 1
StructField("city", StringType(), True),
StructField("email", StringType(), True),
StructField("signup_date", StringType(), True)
])
Trade_df=spark.createDataFrame(trades_data,trades_schema)
User_df=spark.createDataFrame(users_data,users_schema)
Trade_df.show()
User_df.show()
join_df = Trade_df.join(User_df, Trade_df['user_id'] == User_df['user_id'], "inner")
join_df.show()
result_df = join_df.filter(join_df['status'] == 'Completed').groupBy(join_df['city']).count()
result_df.orderBy(result_df['count'].desc()).limit(3).show()
PYSPARK LEARNING HUB : DAY - 13
INPUT - 1 PAGES
PAGE_ID PAGE_NAME
INPUT - 2 PAGE_LIKES
OUTPUT
PAGE_ID
20701
(156, 20001, '2022-07-25 00:00:00')
]
page_df = spark.createDataFrame(pages_data, pages_schema)
page_like_df = spark.createDataFrame(page_likes_data, page_likes_schema)
page_df.show()
page_like_df.show()
PYSPARK LEARNING HUB : DAY - 14
# Define data
data = [
(213824, 'printer', 20, "2022-06-27"),
(212312, 'hair dryer', 5, "2022-06-28"),
(132842, 'printer', 18, "2022-06-28"),
(284730, 'standing lamp', 8, "2022-07-05")
]
INPUT
ORDER_ID PRODUCT_TYPE QUANTITY ORDER_DATE
OUTPUT
order_df=spark.createDataFrame(data,schema)
order_df.show()
PYSPARK LEARNING HUB : DAY - 15
INPUT
OUTPUT
SENDER_ID COUNT(*)
3601 2
4500 1
schema = StructType([
StructField("message_id", IntegerType(), True),
StructField("sender_id", IntegerType(), True),
StructField("receiver_id", IntegerType(), True),
StructField("content", StringType(), True),
StructField("sent_date", StringType(), True),
])
from pyspark.sql.functions import desc

teams_df = spark.createDataFrame(data, schema)
teams_df.show()
filter_df = teams_df.filter(teams_df['sent_date'].like("2022-08%"))
filter_df.show()
result_df = filter_df.groupBy(filter_df['sender_id']).count()
result_df = result_df.orderBy(desc(result_df['count'])).limit(2)
result_df.show()
PYSPARK LEARNING HUB : DAY - 16
16_Select in pyspark
Write a PySpark code to perform the functions below:
● Get all employee details.
● Get only the "First_Name" column from emp_df.
● Get First_Name in upper case as "First Name".
● Get First_Name in lower case.
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
emp_df=spark.createDataFrame(data,schema)
emp_df.show()
# Method 1
emp_df.select("First_Name").show()
# Method 2
from pyspark.sql.functions import col
emp_df.select(col("First_Name")).show()
# Method 3
emp_df.createOrReplaceTempView("emp_table")
spark.sql("select First_Name from emp_table").show()
# 3. Get FirstName in upper case as "First Name".
from pyspark.sql.functions import upper
emp_df.select(upper("First_Name").alias("First Name")).show()
PYSPARK LEARNING HUB : DAY - 17
17_Select in pyspark
Write a PySpark code to perform the functions below:
● Combine FirstName and LastName and display it as "Name" (include a white space between first name & last name).
● Select employee details whose name is "Vikas".
● Get all employee details from the EmployeeDetail table whose "FirstName" starts with the letter 'a'.
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
emp_df=spark.createDataFrame(data,schema)
# Method 1
from pyspark.sql.functions import col
emp_df.filter(col("First_Name") == 'Vikas').show(truncate=False)
# Method 2
emp_df.filter(emp_df.First_Name == 'Vikas').show(truncate=False)
# Method 3
emp_df.filter(emp_df['First_Name'] == 'Vikas').show(truncate=False)
# Method 4
emp_df.where(emp_df['First_Name'] == 'Vikas').show(truncate=False)
● Get all employee detail from EmployeeDetail table whose
"FirstName" start with letter 'a'.
# Method 1
from pyspark.sql.functions import lower
emp_df.filter(lower(emp_df['First_Name']).like("a%")).show()
# Method 2
emp_df.filter((emp_df['First_Name'].like("a%")) | (emp_df['First_Name'].like("A%"))).show()
PYSPARK LEARNING HUB : DAY - 18
18_Select in pyspark
Write a PySpark code to perform the functions below:
● Get all employee details from the EmployeeDetail table whose "FirstName" contains 'k'.
● Get all employee details from the EmployeeDetail table whose "FirstName" ends with 'h'.
● Get all employee details from the EmployeeDetail table whose "FirstName" starts with
● Get all employee details from the EmployeeDetail table whose "FirstName" starts with any single character between 'a-p'.
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
]
emp_df=spark.createDataFrame(data,schema)
PYSPARK LEARNING HUB : DAY - 19
19_Select in pyspark
Write a PySpark code to perform the functions below:
● Get all employee details from emp_df whose "Gender" ends with 'le' and contains 4 letters. The underscore (_) wildcard character represents any single character.
● Get all employee details from the EmployeeDetail table whose "FirstName" starts with 'A' and contains 5 letters.
● Get all unique "Department" values from the EmployeeDetail table.
● Get the highest "Salary" from the EmployeeDetail table.
emp_df=spark.createDataFrame(data,schema)
Get all employee details from emp_df whose "Gender" ends with 'le' and contains 4 letters. The underscore (_) wildcard character represents any single character.
# FirstName starts with 'A' and contains 5 letters.
PYSPARK LEARNING HUB : DAY - 20
20_Date in pyspark
Write a PySpark code to perform the functions below:
● Get the lowest "Salary" from the EmployeeDetail table.
● Show "JoiningDate" in "dd mmm yyyy" format, e.g. "15 Feb 2013".
● Show "JoiningDate" in "yyyy/mm/dd" format, e.g. "2013/02/15".
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

emp_df = spark.createDataFrame(data, schema)
PYSPARK LEARNING HUB : DAY - 21
21_Date in pyspark
Write a PySpark code to perform the functions below:
● Get only the Year part of "JoiningDate".
● Get only the Month part of "JoiningDate".
● Get only the Date part of "JoiningDate".
● Get the current system date using the DataFrame API.
● Get the current UTC date and time using the DataFrame API.
emp_df=spark.createDataFrame(data,schema)
PYSPARK LEARNING HUB : DAY - 22
22_Date in pyspark
Write a PySpark code to perform the functions below:
● Get the first name, current date, joining date, and the difference between the current date and the joining date in months.
● Get the first name, current date, joining date, and the difference between the current date and the joining date in days.
● Get all employee details from the EmployeeDetail table whose joining year is 2013.
emp_df=spark.createDataFrame(data,schema)
PYSPARK LEARNING HUB : DAY - 23
23_Date in pyspark
Write a PySpark code to perform the functions below:
● Get all employee details from the EmployeeDetail table whose joining month is Jan (1).
● Get all employee details from the EmployeeDetail table whose joining date is between "2013-01-01" and "2013-12-01".
● Get how many employees exist in the "EmployeeDetail" table.
● Select all employee details with the first name "Vikas", "Ashish", or "Nikhil".
emp_df=spark.createDataFrame(data,schema)
PYSPARK LEARNING HUB : DAY - 24
StructField("Gender", StringType(), True)
])
emp_df=spark.createDataFrame(data,schema)
PYSPARK LEARNING HUB : DAY - 25
25_operator in pyspark
Write a PySpark code to perform the functions below:
● Select the first name from the "EmployeeDetail" table prefixed with "Hello ".
● Get employee details from the "EmployeeDetail" table whose Salary is greater than 600000.
● Get employee details from the "EmployeeDetail" table whose Salary is less than 700000.
● Get employee details from the "EmployeeDetail" table whose Salary is between 500000 and 600000.
● Select the second highest salary from the "EmployeeDetail" table.
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
emp_df=spark.createDataFrame(data,schema)
PYSPARK LEARNING HUB : DAY - 26
26_groupby in pyspark
Write a PySpark code to perform the functions below:
● Write the query to get the department and the department-wise total (sum) salary from the "EmployeeDetail" table.
● Write the query to get the department and the department-wise total (sum) salary, displayed in ascending order of salary.
● Write the query to get the department and the department-wise total (sum) salary, displayed in descending order of salary.
● Write the query to get the department, the total number of employees per department, and the total (sum) salary per department from the "EmployeeDetail" table.
]
emp_df=spark.createDataFrame(data,schema)
PYSPARK LEARNING HUB : DAY - 27
27_groupby in pyspark
Write a PySpark code to perform the functions below:
● 46. Get the department-wise average salary from the "EmployeeDetail" table, ordered by salary ascending.
● 47. Get the department-wise maximum salary from the "EmployeeDetail" table, ordered by salary ascending.
● 48. Get the department-wise minimum salary from the "EmployeeDetail" table, ordered by salary ascending.
StructField("Gender", StringType(), True)
])
emp_df=spark.createDataFrame(data,schema)
PYSPARK LEARNING HUB : DAY - 28
28_Join_in_pyspark
Write a PySpark code to perform the functions below:
● Write the query to fetch the project names assigned to more than one employee.
● Get employee name and project name, ordered by first name, from "EmployeeDetail" and "ProjectDetail" for those employees who have already been assigned a project.
DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
pro_schema = StructType([
StructField("Project_DetailID", IntegerType(), True),
StructField("Employee_DetailID", IntegerType(), True),
StructField("Project_Name", StringType(), True)
])
PYSPARK LEARNING HUB : DAY - 29
29_Join_in_pyspark
Write a PySpark code to perform the functions below:
● 52. Get employee name and project name, ordered by first name, from "EmployeeDetail" and "ProjectDetail" for all employees, even those that have not been assigned a project.
● 53. Get employee name and project name, ordered by first name, from "EmployeeDetail" and "ProjectDetail" for all employees; if no project is assigned, display "-No Project Assigned".
DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
pro_schema = StructType([
StructField("Project_DetailID", IntegerType(), True),
StructField("Employee_DetailID", IntegerType(), True),
StructField("Project_Name", StringType(), True)
])
PYSPARK LEARNING HUB : DAY - 30
30_Join_in_pyspark
Write a PySpark code to perform the functions below:
● 56. Write a PySpark code to find the employee names that have not been assigned any project, and display "-No Project Assigned" (tables: [EmployeeDetail], [ProjectDetail]).
● 57. Write a PySpark code to find the project names that are not assigned to any employee (tables: [EmployeeDetail], [ProjectDetail]).
pro_schema = StructType([
StructField("Project_DetailID", IntegerType(), True),
StructField("Employee_DetailID", IntegerType(), True),
StructField("Project_Name", StringType(), True)
])
PYSPARK LEARNING HUB : DAY - 31
31_Histogram of Tweets
Write a query to obtain a histogram of tweets posted per user in 2022. Output the tweet count per user as the bucket and the number of Twitter users who fall into that bucket.
In other words, group the users by the number of tweets they posted in 2022 and count the number of users in each group.
INPUT
TWEET_ID USER_ID MSG TWEET_DATE
231574 148 I no longer have a manager. I can't be managed 2022-03-23 00:00:00
OUTPUT
BUCKET USER_NUM
1 2
2 1
PYSPARK LEARNING HUB : DAY - 32
32_pyspark_transformation
Write a pyspark code to transform the DataFrame to
display each student's marks in Math and English as
separate columns.
Difficulty Level: EASY
DataFrame:
data=[
('Rudra','math',79),
('Rudra','eng',60),
('Shivu','math', 68),
('Shivu','eng', 59),
('Anu','math', 65),
('Anu','eng',80)
]
schema = StructType([
StructField("Name", StringType(), True),
StructField("Sub", StringType(), True),
StructField("Marks", IntegerType(), True)
])
INPUT
OUTPUT
PYSPARK LEARNING HUB : DAY - 33
# Create a DataFrame
df = spark.createDataFrame(data, ["name", "hobbies"])
OUTPUT
PYSPARK LEARNING HUB : DAY - 34
INPUT
TWEET_ID USER_ID MSG TWEET_DATE
231574 148 I no longer have a manager. I can't be managed 2022-03-23 00:00:00
OUTPUT
BUCKET USER_NUM
1 2
2 1
PYSPARK LEARNING HUB : DAY - 35
INPUT
OUTPUT
PYSPARK LEARNING HUB : DAY - 36
OUTPUT
PYSPARK LEARNING HUB : DAY - 37
A pupil, Tim, gets homework to identify whether three line segments could possibly form a triangle.
However, this assignment is very heavy because there are hundreds of records to calculate.
Could you help Tim by writing a PySpark code to judge whether these three sides can form a triangle, assuming the triangle df holds the lengths of the three sides x, y and z?
OUTPUT
PYSPARK LEARNING HUB : DAY - 38
# Your data
data = [(8,), (8,), (3,), (3,), (1,), (4,), (5,), (6,)]
OUTPUT
PYSPARK LEARNING HUB : DAY - 39
# Your data
data = [
(1, "War", "great 3D", 8.9),
(2, "Science", "fiction", 8.5),
(3, "Irish", "boring", 6.2),
(4, "Ice song", "Fantasy", 8.6),
(5, "House card", "Interesting", 9.1)
]
OUTPUT
PYSPARK LEARNING HUB : DAY - 40
OUTPUT