Pyspark Hands on

PYSPARK LEARNING HUB : DAY - 1

Step - 1 : Problem Statement

01_Actors and Directors Who Cooperated At Least Three Times

Write a PySpark program for a report that provides the pairs
(actor_id, director_id) where the actor has cooperated with
the director at least 3 times.

Difficulty Level : EASY


DataFrame:
schema = StructType([
StructField("ActorId",IntegerType(),True),
StructField("DirectorId",IntegerType(),True),
StructField("timestamp",IntegerType(),True)
])

data = [
(1, 1, 0),
(1, 1, 1),
(1, 1, 2),
(1, 2, 3),
(1, 2, 4),
(2, 1, 5),
(2, 1, 6)
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT
ACTOR_ID DIRECTOR_ID TIMESTAMP
1 1 0
1 1 1
1 1 2
1 2 3
1 2 4
2 1 5
2 1 6

OUTPUT
ACTOR_ID DIRECTOR_ID
1 1


Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session

from pyspark.sql import SparkSession


from pyspark.sql.types import StructType,StructField,IntegerType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

schema = StructType([
StructField("ActorId",IntegerType(),True),
StructField("DirectorId",IntegerType(),True),
StructField("timestamp",IntegerType(),True)
])

data = [
(1, 1, 0),
(1, 1, 1),
(1, 1, 2),
(1, 2, 3),
(1, 2, 4),
(2, 1, 5),
(2, 1, 6)
]


df=spark.createDataFrame(data,schema)
df.show()

df_group=df.groupBy('ActorId','DirectorId').count()
df_group.show()

+-------+----------+-----+
|ActorId|DirectorId|count|
+-------+----------+-----+
| 1| 2| 2|
| 1| 1| 3|
| 2| 1| 2|
+-------+----------+-----+

df_group.filter(df_group['count'] >= 3).show()

+-------+----------+-----+
|ActorId|DirectorId|count|
+-------+----------+-----+
| 1| 1| 3|
+-------+----------+-----+
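The report only needs the (ActorId, DirectorId) pairs, so a final select can drop the helper count column; a small sketch over the same df_group:

df_group.filter(df_group['count'] >= 3).select('ActorId', 'DirectorId').show()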

PYSPARK LEARNING HUB : DAY - 2

Step - 1 : Problem Statement

02_Ads Performance

Write a PySpark program to find the CTR of each ad. Round CTR to 2
decimal points. Order the result table by ctr in descending order
and by ad_id in ascending order in case of a tie.
CTR = Clicked / (Clicked + Viewed)

Difficulty Level : EASY


DataFrame:
# Define the schema for the Ads table
schema=StructType([
StructField('AD_ID',IntegerType(),True)
,StructField('USER_ID',IntegerType(),True)
,StructField('ACTION',StringType(),True)
])

# Define the data for the Ads table


data = [
(1, 1, 'Clicked'),
(2, 2, 'Clicked'),
(3, 3, 'Viewed'),
(5, 5, 'Ignored'),
(1, 7, 'Ignored'),
(2, 7, 'Viewed'),
(3, 5, 'Clicked'),
(1, 4, 'Viewed'),
(2, 11, 'Viewed'),
(1, 2, 'Clicked')
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT
AD_ID USER_ID ACTION
1 1 Clicked
2 2 Clicked
3 3 Viewed
5 5 Ignored
1 7 Ignored
2 7 Viewed
3 5 Clicked
1 4 Viewed
2 11 Viewed
1 2 Clicked
OUTPUT
AD_ID CTR
1 0.67
3 0.5
2 0.33
5 0


Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session

from pyspark.sql import SparkSession


from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import when
from pyspark.sql import functions as F
from pyspark.sql.window import Window

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Define the schema for the Ads table


schema=StructType([
StructField('AD_ID',IntegerType(),True)
,StructField('USER_ID',IntegerType(),True)
,StructField('ACTION',StringType(),True)
])

# Define the data for the Ads table
data = [
(1, 1, 'Clicked'),
(2, 2, 'Clicked'),
(3, 3, 'Viewed'),
(5, 5, 'Ignored'),
(1, 7, 'Ignored'),
(2, 7, 'Viewed'),
(3, 5, 'Clicked'),
(1, 4, 'Viewed'),
(2, 11, 'Viewed'),
(1, 2, 'Clicked')
]

# Create a PySpark DataFrame


df=spark.createDataFrame(data,schema)
df.show()


ctr_df = (
    df.groupBy("ad_id")
    .agg(
        F.sum(F.when(df["action"] == "Clicked", 1).otherwise(0)).alias("click_count"),
        F.sum(F.when(df["action"] == "Viewed", 1).otherwise(0)).alias("view_count")
    )
    # coalesce turns the null CTR of ads with no clicks and no views into 0
    .withColumn("ctr", F.coalesce(
        F.round(F.col("click_count") / (F.col("click_count") + F.col("view_count")), 2),
        F.lit(0.0)))
)

# Order the result table by CTR in descending order and by ad_id in
# ascending order
window_spec = Window.orderBy(F.col("ctr").desc(), F.col("ad_id").asc())
result_df = ctr_df.withColumn("rank", F.rank().over(window_spec))

# Show the result DataFrame, sorted by rank

result_df.orderBy("rank").select('ad_id', 'ctr').show()
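The rank column exists only to impose an order; a leaner sketch over the same ctr_df sorts directly:

ctr_df.orderBy(F.col("ctr").desc(), F.col("ad_id").asc()) \
    .select("ad_id", "ctr").show()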

PYSPARK LEARNING HUB : DAY - 3

Step - 1 : Problem Statement

03_Combine Two DF

Write a PySpark program to report the first name, last name, city, and state of each person in the
Person dataframe. If the address of a personId is not present in the Address dataframe,
report null instead.

Difficulty Level : EASY


DataFrame:
# Define schema for the 'persons' table
persons_schema = StructType([
StructField("personId", IntegerType(), True),
StructField("lastName", StringType(), True),
StructField("firstName", StringType(), True)
])

# Define schema for the 'addresses' table


addresses_schema = StructType([
StructField("addressId", IntegerType(), True),
StructField("personId", IntegerType(), True),
StructField("city", StringType(), True),
StructField("state", StringType(), True)
])

# Define data for the 'persons' table


persons_data = [
(1, 'Wang', 'Allen'),
(2, 'Alice', 'Bob')
]

# Define data for the 'addresses' table


addresses_data = [
(1, 2, 'New York City', 'New York'),
(2, 3, 'Leetcode', 'California')
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT
INPUT-1 persons
PERSONID LASTNAME FIRSTNAME
1 Wang Allen
2 Alice Bob

INPUT-2 addresses
ADDRESSID PERSONID CITY STATE
1 2 New York City New York
2 3 Leetcode California

OUTPUT
FIRSTNAME LASTNAME CITY STATE
Bob Alice New York City New York
Allen Wang


Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session

from pyspark.sql import SparkSession


from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import when
from pyspark.sql import functions as F
from pyspark.sql.window import Window

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Define schema for the 'persons' table
persons_schema = StructType([
StructField("personId", IntegerType(), True),
StructField("lastName", StringType(), True),
StructField("firstName", StringType(), True)
])

# Define schema for the 'addresses' table


addresses_schema = StructType([
StructField("addressId", IntegerType(), True),
StructField("personId", IntegerType(), True),
StructField("city", StringType(), True),
StructField("state", StringType(), True)
])


# Define data for the 'persons' table


persons_data = [
(1, 'Wang', 'Allen'),
(2, 'Alice', 'Bob')
]

# Define data for the 'addresses' table


addresses_data = [
(1, 2, 'New York City', 'New York'),
(2, 3, 'Leetcode', 'California')
]

# Create a PySpark DataFrame


person_df=spark.createDataFrame(persons_data,persons_schema)
address_df=spark.createDataFrame(addresses_data,addresses_schema)

person_df.show()
address_df.show()


# Show the result DataFrame
person_df.join(address_df, person_df.personId == address_df.personId, 'left') \
    .select('firstName', 'lastName', 'city', 'state') \
    .show()
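The same left join reads naturally in Spark SQL as well; a small sketch using temp views (the view names are illustrative):

person_df.createOrReplaceTempView("person")
address_df.createOrReplaceTempView("address")
spark.sql("""
    SELECT p.firstName, p.lastName, a.city, a.state
    FROM person p
    LEFT JOIN address a ON p.personId = a.personId
""").show()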

PYSPARK LEARNING HUB : DAY - 4

Step - 1 : Problem Statement

04_Employees Earning More Than Their Managers

Write a PySpark program to find employees earning more than their
managers.

Difficulty Level : EASY

DataFrame:
# Define the schema for the "employees"
employees_schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("salary", IntegerType(), True),
StructField("managerId", IntegerType(), True)
])

# Define data for the "employees"


employees_data = [
(1, 'Joe', 70000, 3),
(2, 'Henry', 80000, 4),
(3, 'Sam', 60000, None),
(4, 'Max', 90000, None)
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT
ID NAME SALARY MANAGERID
1 Joe 70,000 3
2 Henry 80,000 4
3 Sam 60,000
4 Max 90,000

OUTPUT
NAME
Joe


Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session

from pyspark.sql import SparkSession


from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import when, col
from pyspark.sql import functions as F
from pyspark.sql.window import Window

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Define the schema for the "employees"
employees_schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("salary", IntegerType(), True),
StructField("managerId", IntegerType(), True)
])

# Define data for the "employees"


employees_data = [
(1, 'Joe', 70000, 3),
(2, 'Henry', 80000, 4),
(3, 'Sam', 60000, None),
(4, 'Max', 90000, None)
]


# Create a PySpark DataFrame

emp_df = spark.createDataFrame(employees_data, employees_schema)
emp_df.show()

emp_df1 = emp_df.alias("e1")
emp_df2 = emp_df.alias("e2")

# Self join: e1 is the manager row, e2 the employee row
self_joined_df = emp_df1.join(emp_df2, col("e1.id") == col("e2.managerId"), "inner") \
    .select(col("e2.name"), col("e2.salary"), col("e1.salary").alias("msal"))

self_joined_df.filter(self_joined_df.salary > self_joined_df.msal).select("name").show()

PYSPARK LEARNING HUB : DAY - 5

Step - 1 : Problem Statement

05_Duplicate Emails

Write a PySpark program to report all the duplicate emails.
Note that it's guaranteed that the email field is not NULL.

Difficulty Level : EASY

DataFrame:
# Define the schema for the "employees"
employees_schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("salary", IntegerType(), True),
StructField("managerId", IntegerType(), True)
])

# Define data for the "employees"


employees_data = [
(1, 'Joe', 70000, 3),
(2, 'Henry', 80000, 4),
(3, 'Sam', 60000, None),
(4, 'Max', 90000, None)
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT
ID EMAIL
1 [email protected]
2 [email protected]
3 [email protected]

OUTPUT
EMAIL
[email protected]


Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session

from pyspark.sql import SparkSession


from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import when
from pyspark.sql import functions as F
from pyspark.sql.window import Window

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Define the schema for the "emails" table


emails_schema = StructType([
StructField("id", IntegerType(), True),
StructField("email", StringType(), True)
])

# Define data for the "emails" table


emails_data = [
(1, '[email protected]'),
(2, '[email protected]'),
(3, '[email protected]')
]


# Create a PySpark DataFrame


df=spark.createDataFrame(emails_data,emails_schema)
df.show()

df_group=df.groupby("email").count()
df_group.filter(df_group["count"] > 1).show()
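The output only needs the email column; the same aggregation with the final projection, as a sketch:

df_group.filter(df_group["count"] > 1).select("email").show()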

PYSPARK LEARNING HUB : DAY - 6

Step - 1 : Problem Statement

06_Customers Who Never Order

Write a PySpark program to find all customers who never
order anything.

Difficulty Level : EASY

DataFrame:
# Define the schema for the "Customers"
customers_schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True)
])

# Define data for the "Customers"


customers_data = [
(1, 'Joe'),
(2, 'Henry'),
(3, 'Sam'),
(4, 'Max')
]

# Define the schema for the "Orders"


orders_schema = StructType([
StructField("id", IntegerType(), True),
StructField("customerId", IntegerType(), True)
])

# Define data for the "Orders"


orders_data = [
(1, 3),
(2, 1)
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT
INPUT -1 customers
ID NAME
1 Joe
2 Henry
3 Sam
4 Max

INPUT - 2 orders
ID CUSTOMERID
1 3
2 1

OUTPUT
NAME
Max
Henry


Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

customers_schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True)
])

# Define data for the "Customers"


customers_data = [
(1, 'Joe'),
(2, 'Henry'),
(3, 'Sam'),
(4, 'Max')
]

# Define the schema for the "Orders"


orders_schema = StructType([
StructField("id", IntegerType(), True),
StructField("customerId", IntegerType(), True)
])

# Define data for the "Orders"
orders_data = [
(1, 3),
(2, 1)
]

# Create a PySpark DataFrame

cus_df = spark.createDataFrame(customers_data, customers_schema)
ord_df = spark.createDataFrame(orders_data, orders_schema)

cus_df.show()
ord_df.show()


cus_df.join(ord_df,cus_df.id == ord_df.customerId,"left_anti")\
.select("name").show()

PYSPARK LEARNING HUB : DAY - 7

Step - 1 : Problem Statement

07_Rising Temperature

Write a solution to find the id of all dates with higher
temperatures compared to the previous date (yesterday).

Return the result table in any order.

Difficulty Level : EASY

DataFrame:
# Define the schema for the "Weather" table
weather_schema = StructType([
StructField("id", IntegerType(), True),
StructField("recordDate", StringType(), True),
StructField("temperature", IntegerType(), True)
])

# Define data for the "Weather" table


weather_data = [
(1, '2015-01-01', 10),
(2, '2015-01-02', 25),
(3, '2015-01-03', 20),
(4, '2015-01-04', 30)
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT
ID RECORDDATE TEMPERATURE
1 2015-01-01 10
2 2015-01-02 25
3 2015-01-03 20
4 2015-01-04 30

OUTPUT
ID
2
4

Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession

from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import lag
from pyspark.sql.window import Window

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Define the schema for the "Weather" table


weather_schema = StructType([
StructField("id", IntegerType(), True),
StructField("recordDate", StringType(), True),
StructField("temperature", IntegerType(), True)
])

# Define data for the "Weather" table


weather_data = [
(1, '2015-01-01', 10),
(2, '2015-01-02', 25),
(3, '2015-01-03', 20),
(4, '2015-01-04', 30)
]

# Create a PySpark DataFrame


temp_df=spark.createDataFrame(weather_data,weather_schema)
temp_df.show()


lag_df = temp_df.withColumn("prev_day", lag(temp_df.temperature).over(Window.orderBy(temp_df.recordDate)))
lag_df.show()

lag_df.filter(lag_df["temperature"] > lag_df["prev_day"]).select("id").show()

PYSPARK LEARNING HUB : DAY - 8

Step - 1 : Problem Statement

08_Game Play Analysis I

Write a solution to find the first login date for each player.
Return the result table in any order.

Difficulty Level : EASY

DataFrame:
# Define the schema for the "Activity"
activity_schema = StructType([
StructField("player_id", IntegerType(), True),
StructField("device_id", IntegerType(), True),
StructField("event_date", StringType(), True),
StructField("games_played", IntegerType(), True)
])

# Define data for the "Activity"


activity_data = [
(1, 2, '2016-03-01', 5),
(1, 2, '2016-05-02', 6),
(2, 3, '2017-06-25', 1),
(3, 1, '2016-03-02', 0),
(3, 4, '2018-07-03', 5)
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT
PLAYER_ID DEVICE_ID EVENT_DATE GAMES_PLAYED
1 2 2016-03-01 5
1 2 2016-05-02 6
2 3 2017-06-25 1
3 1 2016-03-02 0
3 4 2018-07-03 5

OUTPUT
PLAYER_ID FIRST_LOGIN
1 2016-03-01
2 2017-06-25
3 2016-03-02

Step - 3 : Writing the pyspark code to solve the problem

# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import rank
from pyspark.sql.window import Window

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Define the schema for the "Activity"


activity_schema = StructType([
StructField("player_id", IntegerType(), True),
StructField("device_id", IntegerType(), True),
StructField("event_date", StringType(), True),
StructField("games_played", IntegerType(), True)
])

# Define data for the "Activity"


activity_data = [
(1, 2, '2016-03-01', 5),
(1, 2, '2016-05-02', 6),
(2, 3, '2017-06-25', 1),
(3, 1, '2016-03-02', 0),
(3, 4, '2018-07-03', 5)
]

# Create a PySpark DataFrame


activity_df=spark.createDataFrame(activity_data,activity_schema)
activity_df.show()


rank_df = activity_df.withColumn("RK", rank().over(Window.partitionBy(activity_df['player_id']).orderBy(activity_df['event_date'])))
rank_df.show()


rank_df.filter(rank_df["RK"] == 1).select("player_id", rank_df["event_date"].alias("First_Login")).show()
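Since only the minimum date per player is needed, a plain aggregation is a simpler sketch of the same result:

from pyspark.sql import functions as F

activity_df.groupBy("player_id") \
    .agg(F.min("event_date").alias("first_login")) \
    .show()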

PYSPARK LEARNING HUB : DAY - 9

Step - 1 : Problem Statement

09_Game Play Analysis II

Write a PySpark program that reports the device that is first
logged in for each player.
Return the result table in any order.

Difficulty Level : EASY

DataFrame:
# Define the schema for the "Activity"
activity_schema = StructType([
StructField("player_id", IntegerType(), True),
StructField("device_id", IntegerType(), True),
StructField("event_date", StringType(), True),
StructField("games_played", IntegerType(), True)
])

# Define data for the "Activity"


activity_data = [
(1, 2, '2016-03-01', 5),
(1, 2, '2016-05-02', 6),
(2, 3, '2017-06-25', 1),
(3, 1, '2016-03-02', 0),
(3, 4, '2018-07-03', 5)
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT
PLAYER_ID DEVICE_ID EVENT_DATE GAMES_PLAYED
1 2 2016-03-01 5
1 2 2016-05-02 6
2 3 2017-06-25 1
3 1 2016-03-02 0
3 4 2018-07-03 5

OUTPUT
PLAYER_ID DEVICE_ID
1 2
2 3
3 1


Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import rank
from pyspark.sql.window import Window

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Define the schema for the "Activity"


activity_schema = StructType([
StructField("player_id", IntegerType(), True),
StructField("device_id", IntegerType(), True),
StructField("event_date", StringType(), True),
StructField("games_played", IntegerType(), True)
])

# Define data for the "Activity"


activity_data = [
(1, 2, '2016-03-01', 5),
(1, 2, '2016-05-02', 6),
(2, 3, '2017-06-25', 1),
(3, 1, '2016-03-02', 0),
(3, 4, '2018-07-03', 5)
]


# Create a PySpark DataFrame

df = spark.createDataFrame(activity_data, activity_schema)
df.show()

rank_df = df.withColumn("rk", rank().over(Window.partitionBy(df["player_id"]).orderBy(df["event_date"])))
rank_df.show()


rank_df.filter(rank_df["rk"] == 1).select("player_id", "device_id").show()
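first over the same window is another sketch of the idea; it tags every row with the player's first device and then deduplicates:

from pyspark.sql import functions as F

df.withColumn("first_device",
              F.first("device_id").over(Window.partitionBy("player_id").orderBy("event_date"))) \
    .select("player_id", "first_device").distinct().show()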

PYSPARK LEARNING HUB : DAY - 10

Step - 1 : Problem Statement

10_Employee Bonus

Write a solution to report the name and bonus amount of
each employee with a bonus less than 1000.
Return the result table in any order.

Difficulty Level : EASY

DataFrame:
# Define the schema for the "Employee"
employee_schema = StructType([
StructField("empId", IntegerType(), True),
StructField("name", StringType(), True),
StructField("supervisor", IntegerType(), True),
StructField("salary", IntegerType(), True)
])

# Define data for the "Employee"


employee_data = [
(3, 'Brad', None, 4000),
(1, 'John', 3, 1000),
(2, 'Dan', 3, 2000),
(4, 'Thomas', 3, 4000)
]

# Define the schema for the "Bonus"


bonus_schema = StructType([
StructField("empId", IntegerType(), True),
StructField("bonus", IntegerType(), True)
])

# Define data for the "Bonus"
bonus_data = [
(2, 500),
(4, 2000)
]

Step - 2 : Identifying The Input Data And Expected Output

INPUT
INPUT-1 EMPLOYEE
EMPID NAME SUPERVISOR SALARY
3 Brad 4,000
1 John 3 1,000
2 Dan 3 2,000
4 Thomas 3 4,000

INPUT-2 BONUS
EMPID BONUS
2 500
4 2,000

OUTPUT
NAME BONUS
Brad
John
Dan 500


Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import col

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Define the schema for the "Employee"


employee_schema = StructType([
StructField("empId", IntegerType(), True),
StructField("name", StringType(), True),
StructField("supervisor", IntegerType(), True),
StructField("salary", IntegerType(), True)
])

# Define data for the "Employee"


employee_data = [
(3, 'Brad', None, 4000),
(1, 'John', 3, 1000),
(2, 'Dan', 3, 2000),
(4, 'Thomas', 3, 4000)
]

# Define the schema for the "Bonus"

bonus_schema = StructType([
StructField("empId", IntegerType(), True),
StructField("bonus", IntegerType(), True)
])

# Define data for the "Bonus"


bonus_data = [
(2, 500),
(4, 2000)
]

# Create a PySpark DataFrame

emp_df = spark.createDataFrame(employee_data, employee_schema)
bonus_df = spark.createDataFrame(bonus_data, bonus_schema)
emp_df.show()
bonus_df.show()

join_df = emp_df.join(bonus_df, emp_df.empId == bonus_df.empId, "left")
join_df.show()

join_df.filter((join_df.bonus < 1000) | col("bonus").isNull()).select("name", "bonus").show()
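The isNull branch is required because a comparison against a null bonus evaluates to null and the row would be dropped. A sketch of the same predicate with coalesce, treating a missing bonus as 0:

from pyspark.sql import functions as F

join_df.filter(F.coalesce(col("bonus"), F.lit(0)) < 1000) \
    .select("name", "bonus").show()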

PYSPARK LEARNING HUB : DAY - 11

Step - 1 : Problem Statement

11_Find Customer Referee

Find the names of the customers that are not referred by the
customer with id = 2.
Return the result table in any order.

Difficulty Level : EASY

DataFrame:

# Define the schema for the Customer table


schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("referee_id", IntegerType(), True)
])

# Create an RDD with the data


data = [
(1, 'Will', None),
(2, 'Jane', None),
(3, 'Alex', 2),
(4, 'Bill', None),
(5, 'Zack', 1),
(6, 'Mark', 2)
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT
ID NAME REFEREE_ID
1 Will
2 Jane
3 Alex 2
4 Bill
5 Zack 1
6 Mark 2

OUTPUT
NAME
Will
Jane
Bill
Zack


Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import col

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Define the schema for the Customer table


schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("referee_id", IntegerType(), True)
])

# Create an RDD with the data


data = [
(1, 'Will', None),
(2, 'Jane', None),
(3, 'Alex', 2),
(4, 'Bill', None),
(5, 'Zack', 1),
(6, 'Mark', 2)
]


# Create a PySpark DataFrame


customer_df = spark.createDataFrame(data ,schema )

# Filter customers not referred by customer with id = 2

result_df = customer_df.filter((col("referee_id").isNull()) | (col("referee_id") != 2))

# Select only the 'name' column

result_df = result_df.select("name")
result_df.show()

+-----+
| name|
+-----+
| Will|
| Jane|
| Bill|
| Zack|
+-----+
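The explicit isNull check matters: referee_id != 2 evaluates to null, not true, for null referees. A null-safe equality sketch expresses the same intent in one predicate:

customer_df.filter(~col("referee_id").eqNullSafe(2)).select("name").show()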

PYSPARK LEARNING HUB : DAY - 12

Step - 1 : Problem Statement

12_Cities With Completed Trades

Write a PySpark program to retrieve the top three cities that
have the highest number of completed trade orders listed in
descending order. Output the city name and the
corresponding number of completed trade orders.

Difficulty Level : EASY

DataFrame:
# Define the schema for the trades
trades_schema = StructType([
StructField("order_id", IntegerType(), True),
StructField("user_id", IntegerType(), True),
StructField("price", FloatType(), True),
StructField("quantity", IntegerType(), True),
StructField("status", StringType(), True),
StructField("timestamp", StringType(), True)
])

# Define the schema for the users


users_schema = StructType([
StructField("user_id", IntegerType(), True),
StructField("city", StringType(), True),
StructField("email", StringType(), True),
StructField("signup_date", StringType(), True)
])

# Create an RDD with the data for trades


trades_data = [
(100101, 111, 9.80, 10, 'Cancelled', '2022-08-17 12:00:00'),
(100102, 111, 10.00, 10, 'Completed', '2022-08-17 12:00:00'),
(100259, 148, 5.10, 35, 'Completed', '2022-08-25 12:00:00'),
(100264, 148, 4.80, 40, 'Completed', '2022-08-26 12:00:00'),
(100305, 300, 10.00, 15, 'Completed', '2022-09-05 12:00:00'),
(100400, 178, 9.90, 15, 'Completed', '2022-09-09 12:00:00'),
(100565, 265, 25.60, 5, 'Completed', '2022-12-19 12:00:00')
]


# Create an RDD with the data for users

users_data = [
    (111, 'San Francisco', '[email protected]', '2021-08-03 12:00:00'),
    (148, 'Boston', '[email protected]', '2021-08-20 12:00:00'),
    (178, 'San Francisco', '[email protected]', '2022-01-05 12:00:00'),
    (265, 'Denver', '[email protected]', '2022-02-26 12:00:00'),
    (300, 'San Francisco', '[email protected]', '2022-06-30 12:00:00')
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT

INPUT-1 trade
ORDER_ID USER_ID PRICE QUANTITY STATUS TIMESTAMP
100101 111 9.8 10 Cancelled 2022-08-17 12:00:00
100102 111 10 10 Completed 2022-08-17 12:00:00
100259 148 5.1 35 Completed 2022-08-25 12:00:00
100264 148 4.8 40 Completed 2022-08-26 12:00:00
100305 300 10 15 Completed 2022-09-05 12:00:00
100400 178 9.9 15 Completed 2022-09-09 12:00:00
100565 265 25.6 5 Completed 2022-12-19 12:00:00

INPUT-2 user
USER_ID CITY EMAIL SIGNUP_DATE
111 San Francisco [email protected] 2021-08-03 12:00:00
148 Boston [email protected] 2021-08-20 12:00:00
178 San Francisco [email protected] 2022-01-05 12:00:00
265 Denver [email protected] 2022-02-26 12:00:00
300 San Francisco [email protected] 2022-06-30 12:00:00

OUTPUT
CITY COUNT(*)
San Francisco 3
Boston 2
Denver 1

Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType,FloatType
from pyspark.sql.functions import desc

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Define the schema for the trades


trades_schema = StructType([
StructField("order_id", IntegerType(), True),
StructField("user_id", IntegerType(), True),
StructField("price", FloatType(), True),
StructField("quantity", IntegerType(), True),
StructField("status", StringType(), True),
StructField("timestamp", StringType(), True)
])
# Define the schema for the users
users_schema = StructType([
StructField("user_id", IntegerType(), True),

StructField("city", StringType(), True),
StructField("email", StringType(), True),
StructField("signup_date", StringType(), True)
])

# Create an RDD with the data for trades


trades_data = [
(100101, 111, 9.80, 10, 'Cancelled', '2022-08-17 12:00:00'),
(100102, 111, 10.00, 10, 'Completed', '2022-08-17 12:00:00'),
(100259, 148, 5.10, 35, 'Completed', '2022-08-25 12:00:00'),
(100264, 148, 4.80, 40, 'Completed', '2022-08-26 12:00:00'),
(100305, 300, 10.00, 15, 'Completed', '2022-09-05 12:00:00'),
(100400, 178, 9.90, 15, 'Completed', '2022-09-09 12:00:00'),
(100565, 265, 25.60, 5, 'Completed', '2022-12-19 12:00:00')
]

# Create an RDD with the data for users

users_data = [
    (111, 'San Francisco', '[email protected]', '2021-08-03 12:00:00'),
    (148, 'Boston', '[email protected]', '2021-08-20 12:00:00'),
    (178, 'San Francisco', '[email protected]', '2022-01-05 12:00:00'),
    (265, 'Denver', '[email protected]', '2022-02-26 12:00:00'),
    (300, 'San Francisco', '[email protected]', '2022-06-30 12:00:00')
]

Trade_df=spark.createDataFrame(trades_data,trades_schema)
User_df=spark.createDataFrame(users_data,users_schema)
Trade_df.show()
User_df.show()


join_df = Trade_df.join(User_df, Trade_df['user_id'] == User_df['user_id'], "inner")
join_df.show()


# Keep only completed orders, count per city, then take the top three
join_df.filter(join_df['status'] == 'Completed') \
    .groupby(join_df['city']).count() \
    .orderBy(desc("count")).limit(3) \
    .show()

PYSPARK LEARNING HUB : DAY - 13

Step - 1 : Problem Statement

13_Page With No Likes

Write a PySpark program to return the IDs of the Facebook pages
that have zero likes. The output should be sorted in
ascending order based on the page IDs.

Difficulty Level : EASY

DataFrame:
# Define the schema for the pages
pages_schema = StructType([
StructField("page_id", IntegerType(), True),
StructField("page_name", StringType(), True)
])
# Define the schema for the page_likes table
page_likes_schema = StructType([
StructField("user_id", IntegerType(), True),
StructField("page_id", IntegerType(), True),
StructField("liked_date", StringType(), True)
])
# Create an RDD with the data for pages
pages_data = [
(20001, 'SQL Solutions'),
(20045, 'Brain Exercises'),
(20701, 'Tips for Data Analysts')
]
# Create an RDD with the data for page_likes table
page_likes_data = [
(111, 20001, '2022-04-08 00:00:00'),
(121, 20045, '2022-03-12 00:00:00'),
(156, 20001, '2022-07-25 00:00:00')
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT

INPUT - 1 PAGES
PAGE_ID PAGE_NAME
20001 SQL Solutions
20045 Brain Exercises
20701 Tips for Data Analysts

INPUT - 2 PAGE_LIKES
USER_ID PAGE_ID LIKED_DATE
111 20001 2022-04-08 0:00:00
121 20045 2022-03-12 0:00:00
156 20001 2022-07-25 0:00:00

OUTPUT
PAGE_ID
20701


Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Define the schema for the pages
pages_schema = StructType([
StructField("page_id", IntegerType(), True),
StructField("page_name", StringType(), True)
])

# Define the schema for the page_likes table


page_likes_schema = StructType([
StructField("user_id", IntegerType(), True),
StructField("page_id", IntegerType(), True),
StructField("liked_date", StringType(), True)
])

# Create an RDD with the data for pages


pages_data = [
(20001, 'SQL Solutions'),
(20045, 'Brain Exercises'),
(20701, 'Tips for Data Analysts')
]
# Create an RDD with the data for page_likes table
page_likes_data = [
(111, 20001, '2022-04-08 00:00:00'),
(121, 20045, '2022-03-12 00:00:00'),

(156, 20001, '2022-07-25 00:00:00')
]

page_df = spark.createDataFrame(pages_data, pages_schema)
page_like_df = spark.createDataFrame(page_likes_data, page_likes_schema)
page_df.show()
page_like_df.show()

# Perform a left anti join to get pages with zero likes

zero_likes_pages = page_df.join(page_like_df, 'page_id', 'left_anti')

# Select and sort the result
result = zero_likes_pages.select("page_id").orderBy("page_id")

# Show the result
result.show()

+-------+
|page_id|
+-------+
|  20701|
+-------+
PYSPARK LEARNING HUB : DAY - 14

Step - 1 : Problem Statement

14_Purchasing Activity by Product Type

We have been given a purchasing activity DF and we need
to find the cumulative purchases of each product over
time.

Difficulty Level : EASY

DataFrame:
# Define schema for the DataFrame
schema = StructType([
StructField("order_id", IntegerType(), True),
StructField("product_type", StringType(), True),
StructField("quantity", IntegerType(), True),
StructField("order_date", StringType(), True),
])

# Define data
data = [
(213824, 'printer', 20, "2022-06-27 "),
(212312, 'hair dryer', 5, "2022-06-28 "),
(132842, 'printer', 18, "2022-06-28 "),
(284730, 'standing lamp', 8, "2022-07-05 ")
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT
ORDER_ID PRODUCT_TYPE QUANTITY ORDER_DATE
213824 printer 20 2022-06-27 12:00:00
212312 hair dryer 5 2022-06-28 12:00:00
132842 printer 18 2022-06-28 12:00:00
284730 standing lamp 8 2022-07-05 12:00:00

OUTPUT
ORDER_DATE PRODUCT_TYPE CUM_PURCHASED
2022-06-27 12:00:00 printer 20
2022-06-28 12:00:00 hair dryer 5
2022-06-28 12:00:00 printer 38
2022-07-05 12:00:00 standing lamp 8


Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql import functions as F
from pyspark.sql.window import Window

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Define schema for the DataFrame


schema = StructType([
StructField("order_id", IntegerType(), True),
StructField("product_type", StringType(), True),
StructField("quantity", IntegerType(), True),
StructField("order_date", StringType(), True),
])

# Define data
data = [
(213824, 'printer', 20, "2022-06-27 "),
(212312, 'hair dryer', 5, "2022-06-28 "),
(132842, 'printer', 18, "2022-06-28 "),
(284730, 'standing lamp', 8, "2022-07-05 ")
]


order_df=spark.createDataFrame(data,schema)
order_df.show()

# Define a Window specification: per product_type, ordered by order_date,
# with a frame from the first row up to the current row

window_spec = Window.partitionBy("product_type").orderBy("order_date").rowsBetween(Window.unboundedPreceding, 0)

# Add a new column 'cumulative_purchases' representing the cumulative sum
result_df = order_df.withColumn("cumulative_purchases", F.sum("quantity").over(window_spec))
result_df.show()
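To present the result in the shape of the expected output (ordered by date, with only the three report columns), a final projection sketch:

result_df.select("order_date", "product_type",
                 F.col("cumulative_purchases").alias("cum_purchased")) \
    .orderBy("order_date").show()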

PYSPARK LEARNING HUB : DAY - 15

Step - 1 : Problem Statement


15_Teams Power Users

Write a PySpark program to identify the top 2 power users
who sent the highest number of messages on Microsoft
Teams in August 2022. Display the IDs of these 2 users
along with the total number of messages they sent.
Output the results in descending order based on the
count of the messages.

Difficulty Level : EASY

DataFrame:
schema = StructType([
StructField("message_id", IntegerType(), True),
StructField("sender_id", IntegerType(), True),
StructField("receiver_id", IntegerType(), True),
StructField("content", StringType(), True),
StructField("sent_date", StringType(), True),
])

# Define the data


data = [
(901, 3601, 4500, 'You up?', '2022-08-03 00:00:00'),
(902, 4500, 3601, 'Only if you\'re buying', '2022-08-03 00:00:00'),
(743, 3601, 8752, 'Let\'s take this offline', '2022-06-14 00:00:00'),
(922, 3601, 4500, 'Get on the call', '2022-08-10 00:00:00'),
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT

MESSAGE_ID SENDER_ID RECEIVER_ID CONTENT SENT_DATE
901 3601 4500 You up? 2022-08-03 0:00:00
902 4500 3601 Only if you're buying 2022-08-03 0:00:00
743 3601 8752 Let's take this offline 2022-06-14 0:00:00
922 3601 4500 Get on the call 2022-08-10 0:00:00

OUTPUT
SENDER_ID COUNT(*)
3601 2
4500 1


Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

schema = StructType([
StructField("message_id", IntegerType(), True),
StructField("sender_id", IntegerType(), True),
StructField("receiver_id", IntegerType(), True),
StructField("content", StringType(), True),
StructField("sent_date", StringType(), True),
])

# Define the data


data = [
(901, 3601, 4500, 'You up?', '2022-08-03 00:00:00'),
(902, 4500, 3601, 'Only if you\'re buying', '2022-08-03 00:00:00'),
(743, 3601, 8752, 'Let\'s take this offline', '2022-06-14 00:00:00'),
(922, 3601, 4500, 'Get on the call', '2022-08-10 00:00:00'),
]


teams_df = spark.createDataFrame(data,schema)
teams_df.show()

filter_df=teams_df.filter(teams_df['sent_date'].like("2022-08%"))
filter_df.show()

result_df = filter_df.groupby(filter_df['sender_id']).count()
result_df = result_df.orderBy(result_df['count'].desc()).limit(2)
result_df.show()
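The like("2022-08%") prefix test works because sent_date is stored as a string. A sketch that filters on real date parts instead (year/month accept timestamp-castable strings):

from pyspark.sql.functions import year, month, col

teams_df.filter((year(col("sent_date")) == 2022) & (month(col("sent_date")) == 8)) \
    .groupBy("sender_id").count() \
    .orderBy(col("count").desc()).limit(2).show()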

PYSPARK LEARNING HUB : DAY - 16

Step - 1 : Problem Statement

16_Select in pyspark

Write a PySpark program to perform the below functions:
● Write a pyspark code to get all employee detail.
● Write a query to get only the "FirstName" column from emp_df.
● Write a pyspark code to get FirstName in upper case as "First Name".
● Write a pyspark code to get FirstName in lower case.

Difficulty Level : EASY

DataFrame:
data = [
    [1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
    [2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
    [3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
    [4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
    [5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])


Step - 2 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType,DoubleType
from pyspark.sql.functions import col, upper

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Create a list of employee rows

data = [
    [1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
    [2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
    [3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
    [4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
    [5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]

# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),

StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

emp_df=spark.createDataFrame(data,schema)

#1. Write a pyspark code to get all employee detail

emp_df.show()

# 2. Write a query to get only "FirstName" column from emp_df

# Method 1
emp_df.select("First_Name").show()

# Method 2
emp_df.select(col("First_Name")).show()

# Method 3
emp_df.createOrReplaceTempView("emp_table")
spark.sql("select First_Name from emp_table").show()

# 3. Write a Pyspark code to get FirstName in upper case as "First
Name".

emp_df.select(upper("First_Name")).show()

#4. Write a pyspark code to get FirstName in lower case

from pyspark.sql.functions import lower


emp_df.select(lower("First_Name")).show()

PYSPARK LEARNING HUB : DAY - 17

Step - 1 : Problem Statement

17_Select in pyspark

Write a PySpark program to perform the below functions:
● Combine FirstName and LastName and display it as "Name"
(also include white space between first name & last name)
● Select employee detail whose name is "Vikas"
● Get all employee detail from EmployeeDetail table whose
"FirstName" starts with letter 'a'.

Difficulty Level : EASY

DataFrame:
data = [
    [1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
    [2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
    [3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
    [4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
    [5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),

StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

Step - 2 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType,DoubleType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Create a list of employee rows

data = [
    [1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
    [2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
    [3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
    [4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
    [5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]

# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

emp_df=spark.createDataFrame(data,schema)

# 1. Combine FirstName and LastName and display it as "Name"
# (also include white space between first name & last name)

from pyspark.sql.functions import concat_ws


emp_df.select(concat_ws(" ","First_Name","Last_Name")\
.alias("Name")).show()


# 2. Select employee detail whose name is "Vikas"

# Methos 1
from pyspark.sql.functions import col
emp_df.filter(col("First_Name") == 'Vikas' ).show(truncate=False)

# Methos 2
emp_df.filter(emp_df.First_Name == 'Vikas' ).show(truncate=False)

# Methos 3
emp_df.filter(emp_df['First_Name'] == 'Vikas' ).show(truncate=False)

# Methos 4
emp_df.where(emp_df['First_Name'] == 'Vikas' ).show(truncate=False)

# 3. Get all employee detail from EmployeeDetail table whose
# "FirstName" starts with letter 'a'.

# Method 1
from pyspark.sql.functions import lower
emp_df.filter(lower(emp_df['First_Name']).like("a%")).show()

# Method 2

emp_df.filter((emp_df['First_Name'].like("a%")) |
(emp_df['First_Name'].like("A%")) ).show()

PYSPARK LEARNING HUB : DAY - 18

Step - 1 : Problem Statement

18_Select in pyspark

Write a PySpark program to perform the below functions:
● Get all employee details from EmployeeDetail table whose
"FirstName" contains 'k'
● Get all employee details from EmployeeDetail table whose
"FirstName" ends with 'h'
● Get all employee detail from EmployeeDetail table whose
"FirstName" does not start with any single character between 'a-p'
● Get all employee detail from EmployeeDetail table whose
"FirstName" starts with any single character between 'a-p'

Difficulty Level : EASY
DataFrame:
data = [
    [1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
    [2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
    [3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
    [4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
    [5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),

StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

Step - 2 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType,DoubleType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Create a list of employee rows

data = [
    [1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
    [2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
    [3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
    [4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
    [5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]

# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

emp_df=spark.createDataFrame(data,schema)

# 1. Get all employee details from EmployeeDetail table
# whose "FirstName" contains 'k'

from pyspark.sql.functions import col

emp_df.filter(emp_df["First_Name"].like("%k%")).show()

# 2. Get all employee details from EmployeeDetail table whose
# "FirstName" ends with 'h'
emp_df.filter(emp_df["First_Name"].like("%h")).show()


# 3. Get all employee detail from EmployeeDetail table whose
# "FirstName" does not start with any single character between 'a-p'
# (the regex is anchored to the first character with ^)
emp_df.filter(emp_df["First_Name"].rlike("^[^a-pA-P]")).show()

# 4. Get all employee detail from EmployeeDetail table whose
# "FirstName" starts with any single character between 'a-p'
emp_df.filter(emp_df["First_Name"].rlike("^[a-pA-P]")).show()

PYSPARK LEARNING HUB : DAY - 19

Step - 1 : Problem Statement

19_Select in pyspark

Write a PySpark program to perform the below functions:
● Get all employee detail from emp_df whose "Gender" ends with
'le' and contains 4 letters. The Underscore(_) Wildcard
Character represents any single character.
● Get all employee detail from EmployeeDetail table whose
"FirstName" starts with 'A' and contains 5 letters.
● Get all unique "Department" from EmployeeDetail table.
● Get the highest "Salary" from EmployeeDetail table.

Difficulty Level : EASY

DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])


Step - 2 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType,DoubleType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Create a list of rows from the image


data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]

# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])


emp_df=spark.createDataFrame(data,schema)

Get all employee detail from emp_df whose "Gender" end with
'le'and contain 4 letters. The Underscore(_) Wildcard Character
represents any single character.
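The source page shows no code here; a minimal sketch using two underscore wildcards followed by 'le':

emp_df.filter(emp_df["Gender"].like("__le")).show()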

# 2. Get all employee detail from EmployeeDetail table whose
# "FirstName" starts with 'A' and contains 5 letters.
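A sketch on the same pattern: 'A' followed by exactly four single-character wildcards.

emp_df.filter(emp_df["First_Name"].like("A____")).show()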

# 3. Get all unique "Department" from EmployeeDetail table.
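A sketch with distinct:

emp_df.select("Department").distinct().show()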

# 4. Get the highest "Salary" from EmployeeDetail table.
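A sketch with an aggregate:

from pyspark.sql import functions as F

emp_df.select(F.max("Salary").alias("Highest_Salary")).show()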

PYSPARK LEARNING HUB : DAY - 20

Step - 1 : Problem Statement

20_Date in pyspark

Write a PySpark program to perform the below functions:
● Get the lowest "Salary" from EmployeeDetail table.
● Show "JoiningDate" in "dd mmm yyyy" format, ex- "15 Feb 2013"
● Show "JoiningDate" in "yyyy/mm/dd" format, ex- "2013/02/15"

Difficulty Level : EASY

DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

Step - 2 : Writing the pyspark code to solve the problem

# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Create the rows for the DataFrame


data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]

# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

emp_df=spark.createDataFrame(data,schema)
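
The remaining solution pages are missing from this copy; a minimal sketch of the three queries, assuming the emp_df built above (Spark casts the string Joining_Date to a timestamp for date_format):

from pyspark.sql.functions import col, min as min_, date_format

# Lowest salary
emp_df.select(min_("Salary").alias("Lowest_Salary")).show()

# Joining date as "dd MMM yyyy", e.g. "15 Feb 2013"
emp_df.select("First_Name",
              date_format(col("Joining_Date"), "dd MMM yyyy").alias("Joining_Date")).show()

# Joining date as "yyyy/MM/dd", e.g. "2013/02/15"
emp_df.select("First_Name",
              date_format(col("Joining_Date"), "yyyy/MM/dd").alias("Joining_Date")).show()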


PYSPARK LEARNING HUB : DAY - 21

Step - 1 : Problem Statement

21_Date in pyspark
Write PySpark code to perform the functions below:
● Get only the year part of "Joining_Date".
● Get only the month part of "Joining_Date".
● Get only the day part of "Joining_Date".
● Get the current system date using the DataFrame API.
● Get the current UTC date and time using the DataFrame API.

Difficulty Level : EASY


DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])


Step - 2 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Create the rows for the DataFrame


data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]

# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

emp_df=spark.createDataFrame(data,schema)
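
The solution pages did not survive extraction; one possible sketch, assuming the emp_df above. The time zone passed to to_utc_timestamp is an assumption about the session's local zone, so adjust it to your environment:

from pyspark.sql.functions import year, month, dayofmonth, current_date, current_timestamp, to_utc_timestamp

# Year, month and day parts of the joining date
emp_df.select(
    "First_Name",
    year("Joining_Date").alias("Join_Year"),
    month("Joining_Date").alias("Join_Month"),
    dayofmonth("Joining_Date").alias("Join_Day")
).show()

# Current system date
emp_df.select(current_date().alias("Today")).show(1)

# Current UTC date and time ("Asia/Kolkata" assumed as the local session zone)
emp_df.select(to_utc_timestamp(current_timestamp(), "Asia/Kolkata").alias("UTC_Now")).show(1, truncate=False)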


PYSPARK LEARNING HUB : DAY - 22

Step - 1 : Problem Statement

22_Date in pyspark
Write PySpark code to perform the functions below:
● Get the first name, current date, joining date and the difference
between the current date and the joining date in months.
● Get the first name, current date, joining date and the difference
between the current date and the joining date in days.
● Get all employee details from the EmployeeDetail DataFrame whose
joining year is 2013.

Difficulty Level : EASY


DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])


Step - 2 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Create the rows for the DataFrame


data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]

# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

emp_df=spark.createDataFrame(data,schema)
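
The worked answer is missing here; a minimal sketch, assuming the emp_df above:

from pyspark.sql.functions import col, current_date, datediff, months_between, to_date, year

emp_dates = emp_df.withColumn("Join_Date", to_date(col("Joining_Date")))

# Difference between the current date and the joining date in months
emp_dates.select(
    "First_Name",
    current_date().alias("Today"),
    "Join_Date",
    months_between(current_date(), col("Join_Date")).alias("Months_Diff")
).show()

# The same difference in days
emp_dates.select(
    "First_Name",
    current_date().alias("Today"),
    "Join_Date",
    datediff(current_date(), col("Join_Date")).alias("Days_Diff")
).show()

# Employees whose joining year is 2013
emp_dates.filter(year("Join_Date") == 2013).show()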


PYSPARK LEARNING HUB : DAY - 23

Step - 1 : Problem Statement

23_Date in pyspark
Write PySpark code to perform the functions below:
● Get all employee details from the EmployeeDetail DataFrame whose
joining month is Jan (1).
● Get all employee details from the EmployeeDetail DataFrame whose
joining date is between "2013-01-01" and "2013-12-01".
● Get how many employees exist in the EmployeeDetail DataFrame.
● Select all employee details with first name "Vikas", "Ashish", or
"Nikhil".

Difficulty Level : EASY


DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])


Step - 2 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Create the rows for the DataFrame


data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]

# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])


emp_df=spark.createDataFrame(data,schema)
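
The solution pages are missing; one way to do it, assuming the emp_df above:

from pyspark.sql.functions import col, month, to_date

emp_dates = emp_df.withColumn("Join_Date", to_date(col("Joining_Date")))

# Employees who joined in January
emp_dates.filter(month("Join_Date") == 1).show()

# Joining date between 2013-01-01 and 2013-12-01
emp_dates.filter(col("Join_Date").between("2013-01-01", "2013-12-01")).show()

# How many employees exist
print(emp_df.count())

# First name is Vikas, Ashish or Nikhil
emp_df.filter(col("First_Name").isin("Vikas", "Ashish", "Nikhil")).show()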


PYSPARK LEARNING HUB : DAY - 24

Step - 1 : Problem Statement

24_Trim and case in pyspark


Write PySpark code to perform the functions below:
● Select all employee details with first name not in
"Vikas", "Ashish", and "Nikhil".
● Select the first name from the EmployeeDetail DataFrame after
removing white space from the right side.
● Select the first name from the EmployeeDetail DataFrame after
removing white space from the left side.
● Display the first name and Gender as M/F (if Male then M, if
Female then F).

Difficulty Level : EASY


DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)

])

Step - 2 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Create the rows for the DataFrame


data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]

# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),

StructField("Gender", StringType(), True)
])

emp_df=spark.createDataFrame(data,schema)
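
The solution code did not survive in this copy; a minimal sketch, assuming the emp_df above:

from pyspark.sql.functions import col, ltrim, rtrim, when

# First name not in Vikas, Ashish, Nikhil
emp_df.filter(~col("First_Name").isin("Vikas", "Ashish", "Nikhil")).show()

# Remove white space from the right, then from the left, of First_Name
emp_df.select(rtrim(col("First_Name")).alias("First_Name")).show()
emp_df.select(ltrim(col("First_Name")).alias("First_Name")).show()

# Gender as M/F
emp_df.select(
    "First_Name",
    when(col("Gender") == "Male", "M")
        .when(col("Gender") == "Female", "F")
        .alias("Gender")
).show()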


PYSPARK LEARNING HUB : DAY - 25

Step - 1 : Problem Statement

25_operator in pyspark
Write PySpark code to perform the functions below:
● Select the first name from the EmployeeDetail DataFrame prefixed
with "Hello ".
● Get employee details from the EmployeeDetail DataFrame whose
salary is greater than 600000.
● Get employee details from the EmployeeDetail DataFrame whose
salary is less than 700000.
● Get employee details from the EmployeeDetail DataFrame whose
salary is between 500000 and 600000.
● Select the second-highest salary from the EmployeeDetail DataFrame.

Difficulty Level : EASY


DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]

# Create a schema for the DataFrame

schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

Step - 2 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Create the rows for the DataFrame


data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]

# Create a schema for the DataFrame

schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

emp_df=spark.createDataFrame(data,schema)
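
The solution pages are missing; one possible sketch, assuming the emp_df above. The second-highest salary uses dense_rank so ties do not skip a rank:

from pyspark.sql.functions import col, concat, dense_rank, lit
from pyspark.sql.window import Window

# First name prefixed with "Hello "
emp_df.select(concat(lit("Hello "), col("First_Name")).alias("Greeting")).show()

# Salary greater than 600000, less than 700000, and between 500000 and 600000
emp_df.filter(col("Salary") > 600000).show()
emp_df.filter(col("Salary") < 700000).show()
emp_df.filter(col("Salary").between(500000, 600000)).show()

# Second-highest salary
w = Window.orderBy(col("Salary").desc())
emp_df.withColumn("rnk", dense_rank().over(w)) \
      .filter(col("rnk") == 2).select("Salary").show()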


PYSPARK LEARNING HUB : DAY - 26

Step - 1 : Problem Statement

26_groupby in pyspark
Write PySpark code to perform the functions below:
● Get the department and the department-wise
total (sum) salary from the EmployeeDetail DataFrame.
● Get the department and the department-wise
total (sum) salary, displayed in ascending order of
salary.
● Get the department and the department-wise
total (sum) salary, displayed in descending order of
salary.
● Get the department, the employee count per
department, and the total (sum) salary per department
from the EmployeeDetail DataFrame.

Difficulty Level : EASY


DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]


# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

Step - 2 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Create the rows for the DataFrame


data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],

]

# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

emp_df=spark.createDataFrame(data,schema)
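
The grouped queries themselves are missing; a minimal sketch, assuming the emp_df above:

from pyspark.sql.functions import col, count, sum as sum_

# Department-wise total salary
dept_sum = emp_df.groupBy("Department").agg(sum_("Salary").alias("Total_Salary"))
dept_sum.show()

# The same totals, in ascending and then descending order of salary
dept_sum.orderBy(col("Total_Salary").asc()).show()
dept_sum.orderBy(col("Total_Salary").desc()).show()

# Department, employee count per department, and total salary
emp_df.groupBy("Department").agg(
    count("*").alias("Emp_Count"),
    sum_("Salary").alias("Total_Salary")
).show()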


PYSPARK LEARNING HUB : DAY - 27

Step - 1 : Problem Statement

27_groupby in pyspark
Write PySpark code to perform the functions below:
● Get the department-wise average salary from the
EmployeeDetail DataFrame, ordered by salary ascending.
● Get the department-wise maximum salary from the
EmployeeDetail DataFrame, ordered by salary ascending.
● Get the department-wise minimum salary from the
EmployeeDetail DataFrame, ordered by salary ascending.

Difficulty Level : EASY


DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]

# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])


Step - 2 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Create the rows for the DataFrame


data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]

# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),

StructField("Gender", StringType(), True)
])

emp_df=spark.createDataFrame(data,schema)
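
The aggregation code is missing from this copy; one way to write it, assuming the emp_df above:

from pyspark.sql.functions import avg, col, max as max_, min as min_

# Department-wise average salary, ordered ascending
emp_df.groupBy("Department").agg(avg("Salary").alias("Avg_Salary")) \
      .orderBy(col("Avg_Salary").asc()).show()

# Department-wise maximum salary, ordered ascending
emp_df.groupBy("Department").agg(max_("Salary").alias("Max_Salary")) \
      .orderBy(col("Max_Salary").asc()).show()

# Department-wise minimum salary, ordered ascending
emp_df.groupBy("Department").agg(min_("Salary").alias("Min_Salary")) \
      .orderBy(col("Min_Salary").asc()).show()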


PYSPARK LEARNING HUB : DAY - 28

Step - 1 : Problem Statement

28_Join_in_pyspark
Write PySpark code to perform the functions below:
● Fetch the project names assigned to more
than one employee.
● Get employee name and project name, ordered by first name, from
"EmployeeDetail" and "ProjectDetail" for those employees who
already have a project assigned.

Difficulty Level : EASY

DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]


# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

pro_schema = StructType([
StructField("Project_DetailID", IntegerType(), True),
StructField("Employee_DetailID", IntegerType(), True),
StructField("Project_Name", StringType(), True)
])

# Create the data as a list of tuples


pro_data = [
(1, 1, "Task Track"),
(2, 1, "CLP"),
(3, 1, "Survey Management"),
(4, 2, "HR Management"),
(5, 3, "Task Track"),
(6, 3, "GRS"),
(7, 3, "DDS"),
(8, 4, "HR Management"),
(9, 6, "GL Management")
]


Step - 2 : Writing the pyspark code to solve the problem
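
The code pages for this day are blank in this copy; the sketch below assumes the SparkSession boilerplate from the earlier days and the data, schema, pro_data and pro_schema defined above:

emp_df = spark.createDataFrame(data, schema)
pro_df = spark.createDataFrame(pro_data, pro_schema)

from pyspark.sql.functions import col, count

# Project names assigned to more than one employee
pro_df.groupBy("Project_Name") \
      .agg(count("*").alias("Emp_Count")) \
      .filter(col("Emp_Count") > 1) \
      .select("Project_Name").show()

# Employee and project names for employees who already have a project;
# an inner join drops employees without a match
emp_df.join(pro_df, emp_df.EmployeeID == pro_df.Employee_DetailID, "inner") \
      .select("First_Name", "Last_Name", "Project_Name") \
      .orderBy("First_Name").show()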


PYSPARK LEARNING HUB : DAY - 29

Step - 1 : Problem Statement

29_Join_in_pyspark
Write PySpark code to perform the functions below:
● Get employee name and project name, ordered by first name, from
"EmployeeDetail" and "ProjectDetail" for all employees, even
those with no project assigned.
● Get employee name and project name, ordered by first name, from
"EmployeeDetail" and "ProjectDetail" for all employees; if no
project is assigned, display "-No Project Assigned".

Difficulty Level : EASY

DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]


# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

pro_schema = StructType([
StructField("Project_DetailID", IntegerType(), True),
StructField("Employee_DetailID", IntegerType(), True),
StructField("Project_Name", StringType(), True)
])

# Create the data as a list of tuples


pro_data = [
(1, 1, "Task Track"),
(2, 1, "CLP"),
(3, 1, "Survey Management"),
(4, 2, "HR Management"),
(5, 3, "Task Track"),
(6, 3, "GRS"),
(7, 3, "DDS"),
(8, 4, "HR Management"),
(9, 6, "GL Management")
]

Step - 2 : Writing the pyspark code to solve the problem
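
As with Day 28, the code pages are blank here; a sketch assuming the same session and the data, schema, pro_data and pro_schema above:

emp_df = spark.createDataFrame(data, schema)
pro_df = spark.createDataFrame(pro_data, pro_schema)

from pyspark.sql.functions import coalesce, col, lit

# A left join keeps every employee, with a null Project_Name where nothing matches
joined = emp_df.join(pro_df, emp_df.EmployeeID == pro_df.Employee_DetailID, "left")

joined.select("First_Name", "Last_Name", "Project_Name").orderBy("First_Name").show()

# Same result, but nulls replaced by a placeholder
joined.select(
    "First_Name", "Last_Name",
    coalesce(col("Project_Name"), lit("-No Project Assigned")).alias("Project_Name")
).orderBy("First_Name").show()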


PYSPARK LEARNING HUB : DAY - 30

Step - 1 : Problem Statement

30_Join_in_pyspark
Write PySpark code to perform the functions below:
● Find the names of employees who have not been
assigned any project, and display "-No Project
Assigned" (tables: [EmployeeDetail], [ProjectDetail]).
● Find the project names that are not
assigned to any employee (tables:
[EmployeeDetail], [ProjectDetail]).

Difficulty Level : EASY


DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]


# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

pro_schema = StructType([
StructField("Project_DetailID", IntegerType(), True),
StructField("Employee_DetailID", IntegerType(), True),
StructField("Project_Name", StringType(), True)
])

# Create the data as a list of tuples


pro_data = [
(1, 1, "Task Track"),
(2, 1, "CLP"),
(3, 1, "Survey Management"),
(4, 2, "HR Management"),
(5, 3, "Task Track"),
(6, 3, "GRS"),
(7, 3, "DDS"),
(8, 4, "HR Management"),
(9, 6, "GL Management")
]


Step - 2 : Writing the pyspark code to solve the problem
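
The code pages are blank in this copy; a minimal sketch using anti joins, assuming the session and the data, schema, pro_data and pro_schema above:

emp_df = spark.createDataFrame(data, schema)
pro_df = spark.createDataFrame(pro_data, pro_schema)

from pyspark.sql.functions import lit

# Employees with no project: a left anti join keeps only unmatched rows
emp_df.join(pro_df, emp_df.EmployeeID == pro_df.Employee_DetailID, "left_anti") \
      .select("First_Name", "Last_Name",
              lit("-No Project Assigned").alias("Project_Name")).show()

# Projects not assigned to any employee: the anti join in the other direction
pro_df.join(emp_df, pro_df.Employee_DetailID == emp_df.EmployeeID, "left_anti") \
      .select("Project_Name").show()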


PYSPARK LEARNING HUB : DAY - 31

Step - 1 : Problem Statement

31_Histogram of Tweets
Write a PySpark program to obtain a histogram of tweets posted per user in 2022. Output the
tweet count per user as the bucket and the number of Twitter users who fall into that
bucket.

In other words, group the users by the number of tweets they posted in 2022 and
count the number of users in each group.

Difficulty Level : EASY


DataFrame:
schema = StructType([
StructField("tweet_id", IntegerType(), True),
StructField("user_id", IntegerType(), True),
StructField("msg", StringType(), True),
StructField("tweet_date", StringType(), True)
])

# Define the data


data = [
    (214252, 111, 'Am considering taking Tesla private at $420. Funding secured.', '2021-12-30 00:00:00'),
    (739252, 111, 'Despite the constant negative press covfefe', '2022-01-01 00:00:00'),
    (846402, 111, 'Following @NickSinghTech on Twitter changed my life!', '2022-02-14 00:00:00'),
    (241425, 254, 'If the salary is so competitive why won’t you tell me what it is?', '2022-03-01 00:00:00'),
    (231574, 148, 'I no longer have a manager. I can\'t be managed', '2022-03-23 00:00:00')
]


Step - 2 : Identifying The Input Data And Expected Output
INPUT

TWEET_ID  USER_ID  MSG                                                                 TWEET_DATE
214252    111      Am considering taking Tesla private at $420. Funding secured.      2021-12-30 00:00:00
739252    111      Despite the constant negative press covfefe                        2022-01-01 00:00:00
846402    111      Following @NickSinghTech on Twitter changed my life!               2022-02-14 00:00:00
241425    254      If the salary is so competitive why won’t you tell me what it is?  2022-03-01 00:00:00
231574    148      I no longer have a manager. I can't be managed                     2022-03-23 00:00:00

OUTPUT

BUCKET  USER_NUM
1       2
2       1


Step - 3 : Writing the pyspark code to solve the problem
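
The solution pages are missing; one two-step aggregation that produces the histogram, assuming the data and schema above and an active SparkSession:

from pyspark.sql.functions import col, count, year

df = spark.createDataFrame(data, schema)

# Tweets per user in 2022 (tweet_date is a string, so cast it first)
per_user = df.filter(year(col("tweet_date").cast("timestamp")) == 2022) \
             .groupBy("user_id").agg(count("*").alias("bucket"))

# Number of users that fall into each tweet-count bucket
per_user.groupBy("bucket").agg(count("*").alias("user_num")) \
        .orderBy("bucket").show()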


PYSPARK LEARNING HUB : DAY - 32

Step - 1 : Problem Statement

32_pyspark_transformation
Write PySpark code to transform the DataFrame so that
each student's marks in Math and English appear as
separate columns.

Difficulty Level : EASY
DataFrame:
data=[
('Rudra','math',79),
('Rudra','eng',60),
('Shivu','math', 68),
('Shivu','eng', 59),
('Anu','math', 65),
('Anu','eng',80)
]

schema = StructType([
StructField("Name", StringType(), True),
StructField("Sub", StringType(), True),
StructField("Marks", IntegerType(), True)
])

Step - 2 : Identifying The Input Data And Expected Output

INPUT

Name   Sub   Marks
Rudra  math  79
Rudra  eng   60
Shivu  math  68
Shivu  eng   59
Anu    math  65
Anu    eng   80

OUTPUT

Name   math  eng
Rudra  79    60
Shivu  68    59
Anu    65    80

Step - 3 : Writing the pyspark code to solve the problem
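
The solution did not survive extraction; a pivot does the whole transformation, assuming the data and schema above and an active SparkSession:

df = spark.createDataFrame(data, schema)

# Pivot the subject values into columns, one row per student; listing the
# pivot values explicitly avoids an extra pass over the data
df.groupBy("Name").pivot("Sub", ["math", "eng"]).sum("Marks").show()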


PYSPARK LEARNING HUB : DAY - 33

Step - 1 : Problem Statement

33_Hobbies Data Transformation


Problem Statement:
Transform a dataset with individuals' names and
associated hobbies into a new format using PySpark.
Convert the comma-separated hobbies into separate
rows, creating a DataFrame with individual rows for
each person and their respective hobbies.
Difficulty Level : EASY
DataFrame:
# Sample input data
data = [("Alice", "badminton,tennis"),
("Bob", "tennis,cricket"),
("Julie", "cricket,carroms")]

# Create a DataFrame
df = spark.createDataFrame(data, ["name", "hobbies"])


Step - 2 : Identifying The Input Data And Expected Output
INPUT

name   hobbies
Alice  badminton,tennis
Bob    tennis,cricket
Julie  cricket,carroms

OUTPUT

name   hobby
Alice  badminton
Alice  tennis
Bob    tennis
Bob    cricket
Julie  cricket
Julie  carroms


Step - 3 : Writing the pyspark code to solve the problem
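
The solution is missing from this copy; split plus explode covers it, assuming the df created above:

from pyspark.sql.functions import col, explode, split

# split turns the comma-separated string into an array,
# and explode emits one row per array element
df.withColumn("hobby", explode(split(col("hobbies"), ","))) \
  .select("name", "hobby").show()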


PYSPARK LEARNING HUB : DAY - 34

Step - 1 : Problem Statement

34_ Histogram of Tweets


Write PySpark code to obtain a histogram of tweets
posted per user in 2022. Output the tweet count per
user as the bucket and the number of Twitter users who
fall into that bucket. In other words, group the users by
the number of tweets they posted in 2022 and count the
number of users in each group.

Difficulty Level : EASY
DataFrame:
# Define the schema for the tweets DataFrame
schema = StructType([
StructField("tweet_id", IntegerType(), True),
StructField("user_id", IntegerType(), True),
StructField("msg", StringType(), True),
StructField("tweet_date", StringType(), True)
])

# Create the tweets DataFrame


data = [
    (214252, 111, 'Am considering taking Tesla private at $420. Funding secured.', '2021-12-30 00:00:00'),
    (739252, 111, 'Despite the constant negative press covfefe', '2022-01-01 00:00:00'),
    (846402, 111, 'Following @NickSinghTech on Twitter changed my life!', '2022-02-14 00:00:00'),
    (241425, 254, 'If the salary is so competitive why won’t you tell me what it is?', '2022-03-01 00:00:00'),
    (231574, 148, 'I no longer have a manager. I can\'t be managed', '2022-03-23 00:00:00')
]


Step - 2 : Identifying The Input Data And Expected Output
INPUT

TWEET_ID  USER_ID  MSG                                                                 TWEET_DATE
214252    111      Am considering taking Tesla private at $420. Funding secured.      2021-12-30 00:00:00
739252    111      Despite the constant negative press covfefe                        2022-01-01 00:00:00
846402    111      Following @NickSinghTech on Twitter changed my life!               2022-02-14 00:00:00
241425    254      If the salary is so competitive why won’t you tell me what it is?  2022-03-01 00:00:00
231574    148      I no longer have a manager. I can't be managed                     2022-03-23 00:00:00

OUTPUT

BUCKET  USER_NUM
1       2
2       1


Step - 3 : Writing the pyspark code to solve the problem
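
This repeats Day 31, and the solution pages are again missing; the same two-step aggregation, assuming the data and schema above and an active SparkSession:

from pyspark.sql.functions import col, count, year

df = spark.createDataFrame(data, schema)

# Count tweets per user for 2022, then count users per tweet-count bucket
df.filter(year(col("tweet_date").cast("timestamp")) == 2022) \
  .groupBy("user_id").agg(count("*").alias("bucket")) \
  .groupBy("bucket").agg(count("*").alias("user_num")) \
  .orderBy("bucket").show()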


PYSPARK LEARNING HUB : DAY - 35

Step - 1 : Problem Statement

35_Classes More Than 5 Students


Write PySpark code to find all the classes that have at
least five students. Return the result table in any order.

Difficulty Level : EASY
DataFrame:
# Define the schema for the DataFrame
schema = StructType([
StructField("StudentID", StringType(), True),
StructField("ClassName", StringType(), True)
])
# Data to be inserted into the DataFrame
data = [
('A', 'Math'),
('B', 'English'),
('C', 'Math'),
('D', 'Biology'),
('E', 'Math'),
('F', 'Computer'),
('G', 'Math'),
('H', 'Math'),
('I', 'Math')
]

Step - 2 : Identifying The Input Data And Expected Output

INPUT

StudentID  ClassName
A          Math
B          English
C          Math
D          Biology
E          Math
F          Computer
G          Math
H          Math
I          Math

OUTPUT

ClassName
Math

Step - 3 : Writing the pyspark code to solve the problem
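
The solution is missing in this copy; a minimal sketch, assuming the data and schema above and an active SparkSession:

from pyspark.sql.functions import col, count

df = spark.createDataFrame(data, schema)

# Count students per class and keep the classes with at least five
df.groupBy("ClassName").agg(count("StudentID").alias("Student_Count")) \
  .filter(col("Student_Count") >= 5) \
  .select("ClassName").show()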


PYSPARK LEARNING HUB : DAY - 36

Step - 1 : Problem Statement

36_Rank Scores Problem


Write PySpark code to rank scores. If there is a tie between
two scores, both should have the same ranking. Note that
after a tie, the next ranking number should be the next
consecutive integer value. In other words, there should be no
“holes” between ranks.

Difficulty Level : MEDIUM


DataFrame:
# Define the schema for the DataFrame
schema = StructType([
StructField("Id", IntegerType(), True),
StructField("Score", FloatType(), True)
])

# Data to be inserted into the DataFrame


data = [
(1, 3.50),
(2, 3.65),
(3, 4.00),
(4, 3.85),
(5, 4.00),
(6, 3.65)
]


Step - 2 : Identifying The Input Data And Expected Output
INPUT

Id  Score
1   3.50
2   3.65
3   4.00
4   3.85
5   4.00
6   3.65

OUTPUT

Score  Rank
4.00   1
4.00   1
3.85   2
3.65   3
3.65   3
3.50   4


Step - 3 : Writing the pyspark code to solve the problem
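
The solution pages are missing; dense_rank matches the "no holes" requirement exactly, assuming the data and schema above and an active SparkSession:

from pyspark.sql.functions import col, dense_rank
from pyspark.sql.window import Window

df = spark.createDataFrame(data, schema)

# Tied scores share a rank, and the next rank is the next consecutive integer
w = Window.orderBy(col("Score").desc())
df.withColumn("Rank", dense_rank().over(w)).select("Score", "Rank").show()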


PYSPARK LEARNING HUB : DAY - 37

Step - 1 : Problem Statement

37_Triangle Judgement Problem

A pupil, Tim, has homework to identify whether three line segments could
possibly form a triangle. However, the assignment is heavy because there are
hundreds of records to check. Help Tim by writing PySpark code that judges
whether the three sides can form a triangle, assuming the DataFrame holds
the lengths of the three sides in columns x, y and z.

Difficulty Level : EASY


DataFrame:
# Define the schema for the DataFrame
schema = StructType([
StructField("x", IntegerType(), True),
StructField("y", IntegerType(), True),
StructField("z", IntegerType(), True)
])

# Data to be inserted into the DataFrame


data = [
(13, 15, 30),
(10, 20, 15)
]


Step - 2 : Identifying The Input Data And Expected Output
INPUT

x   y   z
13  15  30
10  20  15

OUTPUT

x   y   z   triangle
13  15  30  No
10  20  15  Yes


Step - 3 : Writing the pyspark code to solve the problem
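
The solution is missing here; the triangle inequality in a single when expression, assuming the data and schema above and an active SparkSession:

from pyspark.sql.functions import col, when

df = spark.createDataFrame(data, schema)

# Each pair of sides must sum to more than the third side
df.withColumn(
    "triangle",
    when((col("x") + col("y") > col("z")) &
         (col("x") + col("z") > col("y")) &
         (col("y") + col("z") > col("x")), "Yes").otherwise("No")
).show()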


PYSPARK LEARNING HUB : DAY - 38

Step - 1 : Problem Statement

38_Biggest Single Number Problem

The DataFrame contains many numbers in column num, including duplicates.

Write PySpark code to find the biggest number that appears only once.

Difficulty Level : EASY


DataFrame:
# Define the schema for the DataFrame
schema = StructType([StructField("num", IntegerType(), True)])

# Your data
data = [(8,), (8,), (3,), (3,), (1,), (4,), (5,), (6,)]

# Create a PySpark DataFrame


df = spark.createDataFrame(data, schema=schema)


Step - 2 : Identifying The Input Data And Expected Output
INPUT

num
8
8
3
3
1
4
5
6

OUTPUT

num
6


Step - 3 : Writing the pyspark code to solve the problem
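
The solution pages are missing; a minimal sketch, assuming the df created above:

from pyspark.sql.functions import col, count, max as max_

# Keep only the numbers that appear exactly once, then take their maximum
df.groupBy("num").agg(count("*").alias("cnt")) \
  .filter(col("cnt") == 1) \
  .agg(max_("num").alias("num")).show()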


PYSPARK LEARNING HUB : DAY - 39

Step - 1 : Problem Statement

39_Not Boring Movies Problem


X city opened a new cinema, and many people would like to go. The cinema
gives out a poster indicating the movies’ ratings and descriptions. Write
PySpark code to output movies with an odd-numbered ID and a description
that is not ‘boring’. Order the result by rating.

Difficulty Level : EASY


DataFrame:
# Define the schema for the DataFrame
schema = StructType([
StructField("id", IntegerType(), True),
StructField("movie", StringType(), True),
StructField("description", StringType(), True),
StructField("rating", FloatType(), True)
])

# Your data
data = [
(1, "War", "great 3D", 8.9),
(2, "Science", "fiction", 8.5),
(3, "Irish", "boring", 6.2),
(4, "Ice song", "Fantasy", 8.6),
(5, "House card", "Interesting", 9.1)
]


Step - 2 : Identifying The Input Data And Expected Output
INPUT

id  movie       description  rating
1   War         great 3D     8.9
2   Science     fiction      8.5
3   Irish       boring       6.2
4   Ice song    Fantasy      8.6
5   House card  Interesting  9.1

OUTPUT

id  movie       description  rating
5   House card  Interesting  9.1
1   War         great 3D     8.9


Step - 3 : Writing the pyspark code to solve the problem
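
The solution did not survive in this copy; one way to write the filter, assuming the data and schema above and an active SparkSession (descending rating is assumed, as in the usual version of this problem):

from pyspark.sql.functions import col

df = spark.createDataFrame(data, schema)

# Odd id (id % 2 == 1) and a description other than 'boring',
# ordered by rating from highest to lowest
df.filter((col("id") % 2 == 1) & (col("description") != "boring")) \
  .orderBy(col("rating").desc()).show()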


PYSPARK LEARNING HUB : DAY - 40

Step - 1 : Problem Statement

40_Swap Gender Problem


Given a salary DataFrame, such as the one below, that has m = male
and f = female values, swap all f and m values (i.e., change all f
values to m and vice versa).

Difficulty Level : EASY


DataFrame:
# Define the schema for the DataFrame
schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("sex", StringType(), True),
StructField("salary", IntegerType(), True),
])

# Define the data


data = [
(1, "A", "m", 2500),
(2, "B", "f", 1500),
(3, "C", "m", 5500),
(4, "D", "f", 500),
]


Step - 2 : Identifying The Input Data And Expected Output
INPUT

id  name  sex  salary
1   A     m    2500
2   B     f    1500
3   C     m    5500
4   D     f    500

OUTPUT

id  name  sex  salary
1   A     f    2500
2   B     m    1500
3   C     f    5500
4   D     m    500


Step - 3 : Writing the pyspark code to solve the problem
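
The solution is missing here; a single conditional update does the swap, assuming the data and schema above and an active SparkSession:

from pyspark.sql.functions import col, when

df = spark.createDataFrame(data, schema)

# Change every 'm' to 'f' and everything else (here only 'f') to 'm'
df.withColumn("sex", when(col("sex") == "m", "f").otherwise("m")).show()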
