Pyspark Hands on

PYSPARK LEARNING HUB : DAY - 1

Step - 1 : Problem Statement

01_Actors and Directors Who Cooperated At Least Three Times

Write a PySpark program for a report that provides the pairs
(actor_id, director_id) where the actor has cooperated with
the director at least 3 times.

Difficulty Level : EASY


DataFrame:
schema = StructType([
StructField("ActorId",IntegerType(),True),
StructField("DirectorId",IntegerType(),True),
StructField("timestamp",IntegerType(),True)
])

data = [
(1, 1, 0),
(1, 1, 1),
(1, 1, 2),
(1, 2, 3),
(1, 2, 4),
(2, 1, 5),
(2, 1, 6)
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT
ACTOR_ID DIRECTOR_ID TIMESTAMP
1 1 0
1 1 1
1 1 2
1 2 3
1 2 4
2 1 5
2 1 6

OUTPUT
ACTOR_ID DIRECTOR_ID
1 1


Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session

from pyspark.sql import SparkSession


from pyspark.sql.types import StructType,StructField,IntegerType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

schema = StructType([
StructField("ActorId",IntegerType(),True),
StructField("DirectorId",IntegerType(),True),
StructField("timestamp",IntegerType(),True)
])

data = [
(1, 1, 0),
(1, 1, 1),
(1, 1, 2),
(1, 2, 3),
(1, 2, 4),
(2, 1, 5),
(2, 1, 6)
]


df=spark.createDataFrame(data,schema)
df.show()

df_group=df.groupBy('ActorId','DirectorId').count()
df_group.show()

+-------+----------+-----+
|ActorId|DirectorId|count|
+-------+----------+-----+
| 1| 2| 2|
| 1| 1| 3|
| 2| 1| 2|
+-------+----------+-----+

df_group.filter(df_group['count'] >= 3).show()

+-------+----------+-----+
|ActorId|DirectorId|count|
+-------+----------+-----+
| 1| 1| 3|
+-------+----------+-----+
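The report only needs the (ActorId, DirectorId) pairs, so a final select can drop the helper count column; a small sketch over the same df_group:

df_group.filter(df_group['count'] >= 3).select('ActorId', 'DirectorId').show()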

PYSPARK LEARNING HUB : DAY - 2

Step - 1 : Problem Statement

02_Ads Performance

Write a PySpark program to find the CTR of each ad. Round CTR to 2
decimal points. Order the result table by ctr in descending order
and by ad_id in ascending order in case of a tie.
CTR = Clicked / (Clicked + Viewed)

Difficulty Level : EASY


DataFrame:
# Define the schema for the Ads table
schema=StructType([
StructField('AD_ID',IntegerType(),True)
,StructField('USER_ID',IntegerType(),True)
,StructField('ACTION',StringType(),True)
])

# Define the data for the Ads table


data = [
(1, 1, 'Clicked'),
(2, 2, 'Clicked'),
(3, 3, 'Viewed'),
(5, 5, 'Ignored'),
(1, 7, 'Ignored'),
(2, 7, 'Viewed'),
(3, 5, 'Clicked'),
(1, 4, 'Viewed'),
(2, 11, 'Viewed'),
(1, 2, 'Clicked')
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT
AD_ID USER_ID ACTION
1 1 Clicked
2 2 Clicked
3 3 Viewed
5 5 Ignored
1 7 Ignored
2 7 Viewed
3 5 Clicked
1 4 Viewed
2 11 Viewed
1 2 Clicked
OUTPUT
AD_ID CTR
1 0.67
3 0.5
2 0.33
5 0


Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session

from pyspark.sql import SparkSession


from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import when
from pyspark.sql import functions as F
from pyspark.sql.window import Window

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Define the schema for the Ads table


schema=StructType([
StructField('AD_ID',IntegerType(),True)
,StructField('USER_ID',IntegerType(),True)
,StructField('ACTION',StringType(),True)
])

# Define the data for the Ads table
data = [
(1, 1, 'Clicked'),
(2, 2, 'Clicked'),
(3, 3, 'Viewed'),
(5, 5, 'Ignored'),
(1, 7, 'Ignored'),
(2, 7, 'Viewed'),
(3, 5, 'Clicked'),
(1, 4, 'Viewed'),
(2, 11, 'Viewed'),
(1, 2, 'Clicked')
]

# Create a PySpark DataFrame


df=spark.createDataFrame(data,schema)
df.show()


ctr_df = (
    df.groupBy("ad_id")
    .agg(
        F.sum(F.when(df["action"] == "Clicked", 1).otherwise(0)).alias("click_count"),
        F.sum(F.when(df["action"] == "Viewed", 1).otherwise(0)).alias("view_count")
    )
    # coalesce turns the null CTR of ads with no clicks and no views into 0
    .withColumn("ctr", F.coalesce(
        F.round(F.col("click_count") / (F.col("click_count") + F.col("view_count")), 2),
        F.lit(0.0)))
)

# Order the result table by CTR in descending order and by ad_id in
# ascending order
window_spec = Window.orderBy(F.col("ctr").desc(), F.col("ad_id").asc())
result_df = ctr_df.withColumn("rank", F.rank().over(window_spec))

# Show the result DataFrame, sorted by rank

result_df.orderBy("rank").select('ad_id', 'ctr').show()
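The rank column exists only to impose an order; a leaner sketch over the same ctr_df sorts directly:

ctr_df.orderBy(F.col("ctr").desc(), F.col("ad_id").asc()) \
    .select("ad_id", "ctr").show()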

PYSPARK LEARNING HUB : DAY - 3

Step - 1 : Problem Statement

03_Combine Two DF

Write a PySpark program to report the first name, last name, city, and state of each person in the
Person dataframe. If the address of a personId is not present in the Address dataframe,
report null instead.

Difficulty Level : EASY


DataFrame:
# Define schema for the 'persons' table
persons_schema = StructType([
StructField("personId", IntegerType(), True),
StructField("lastName", StringType(), True),
StructField("firstName", StringType(), True)
])

# Define schema for the 'addresses' table


addresses_schema = StructType([
StructField("addressId", IntegerType(), True),
StructField("personId", IntegerType(), True),
StructField("city", StringType(), True),
StructField("state", StringType(), True)
])

# Define data for the 'persons' table


persons_data = [
(1, 'Wang', 'Allen'),
(2, 'Alice', 'Bob')
]

# Define data for the 'addresses' table


addresses_data = [
(1, 2, 'New York City', 'New York'),
(2, 3, 'Leetcode', 'California')
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT
INPUT-1 persons
PERSONID LASTNAME FIRSTNAME
1 Wang Allen
2 Alice Bob

INPUT-2 addresses
ADDRESSID PERSONID CITY STATE
1 2 New York City New York
2 3 Leetcode California

OUTPUT
FIRSTNAME LASTNAME CITY STATE
Bob Alice New York City New York
Allen Wang


Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session

from pyspark.sql import SparkSession


from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import when
from pyspark.sql import functions as F
from pyspark.sql.window import Window

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Define schema for the 'persons' table
persons_schema = StructType([
StructField("personId", IntegerType(), True),
StructField("lastName", StringType(), True),
StructField("firstName", StringType(), True)
])

# Define schema for the 'addresses' table


addresses_schema = StructType([
StructField("addressId", IntegerType(), True),
StructField("personId", IntegerType(), True),
StructField("city", StringType(), True),
StructField("state", StringType(), True)
])


# Define data for the 'persons' table


persons_data = [
(1, 'Wang', 'Allen'),
(2, 'Alice', 'Bob')
]

# Define data for the 'addresses' table


addresses_data = [
(1, 2, 'New York City', 'New York'),
(2, 3, 'Leetcode', 'California')
]

# Create a PySpark DataFrame


person_df=spark.createDataFrame(persons_data,persons_schema)
address_df=spark.createDataFrame(addresses_data,addresses_schema)

person_df.show()
address_df.show()


# Show the result DataFrame
person_df.join(address_df, person_df.personId == address_df.personId, 'left') \
    .select('firstName', 'lastName', 'city', 'state') \
    .show()
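The same left join reads naturally in Spark SQL as well; a small sketch using temp views (the view names are illustrative):

person_df.createOrReplaceTempView("person")
address_df.createOrReplaceTempView("address")
spark.sql("""
    SELECT p.firstName, p.lastName, a.city, a.state
    FROM person p
    LEFT JOIN address a ON p.personId = a.personId
""").show()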

PYSPARK LEARNING HUB : DAY - 4

Step - 1 : Problem Statement

04_Employees Earning More Than Their Managers

Write a PySpark program to find employees earning more than their
managers.

Difficulty Level : EASY

DataFrame:
# Define the schema for the "employees"
employees_schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("salary", IntegerType(), True),
StructField("managerId", IntegerType(), True)
])

# Define data for the "employees"


employees_data = [
(1, 'Joe', 70000, 3),
(2, 'Henry', 80000, 4),
(3, 'Sam', 60000, None),
(4, 'Max', 90000, None)
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT
ID NAME SALARY MANAGERID
1 Joe 70,000 3
2 Henry 80,000 4
3 Sam 60,000
4 Max 90,000

OUTPUT
NAME
Joe


Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session

from pyspark.sql import SparkSession


from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import when, col
from pyspark.sql import functions as F
from pyspark.sql.window import Window

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Define the schema for the "employees"
employees_schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("salary", IntegerType(), True),
StructField("managerId", IntegerType(), True)
])

# Define data for the "employees"


employees_data = [
(1, 'Joe', 70000, 3),
(2, 'Henry', 80000, 4),
(3, 'Sam', 60000, None),
(4, 'Max', 90000, None)
]


# Create a PySpark DataFrame

emp_df = spark.createDataFrame(employees_data, employees_schema)
emp_df.show()

emp_df1 = emp_df.alias("e1")
emp_df2 = emp_df.alias("e2")

# Self join: e1 is the manager row, e2 the employee row
self_joined_df = emp_df1.join(emp_df2, col("e1.id") == col("e2.managerId"), "inner") \
    .select(col("e2.name"), col("e2.salary"), col("e1.salary").alias("msal"))

self_joined_df.filter(self_joined_df.salary > self_joined_df.msal).select("name").show()

PYSPARK LEARNING HUB : DAY - 5

Step - 1 : Problem Statement

05_Duplicate Emails

Write a PySpark program to report all the duplicate emails.
Note that it's guaranteed that the email field is not NULL.

Difficulty Level : EASY

DataFrame:
# Define the schema for the "employees"
employees_schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("salary", IntegerType(), True),
StructField("managerId", IntegerType(), True)
])

# Define data for the "employees"


employees_data = [
(1, 'Joe', 70000, 3),
(2, 'Henry', 80000, 4),
(3, 'Sam', 60000, None),
(4, 'Max', 90000, None)
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT
ID EMAIL
1 [email protected]
2 [email protected]
3 [email protected]

OUTPUT
EMAIL
[email protected]


Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session

from pyspark.sql import SparkSession


from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import when
from pyspark.sql import functions as F
from pyspark.sql.window import Window

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Define the schema for the "emails" table


emails_schema = StructType([
StructField("id", IntegerType(), True),
StructField("email", StringType(), True)
])

# Define data for the "emails" table


emails_data = [
(1, '[email protected]'),
(2, '[email protected]'),
(3, '[email protected]')
]


# Create a PySpark DataFrame


df=spark.createDataFrame(emails_data,emails_schema)
df.show()

df_group=df.groupby("email").count()
df_group.filter(df_group["count"] > 1).show()
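The output only needs the email column; the same aggregation with the final projection, as a sketch:

df_group.filter(df_group["count"] > 1).select("email").show()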

PYSPARK LEARNING HUB : DAY - 6

Step - 1 : Problem Statement

06_Customers Who Never Order

Write a PySpark program to find all customers who never
order anything.

Difficulty Level : EASY

DataFrame:
# Define the schema for the "Customers"
customers_schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True)
])

# Define data for the "Customers"


customers_data = [
(1, 'Joe'),
(2, 'Henry'),
(3, 'Sam'),
(4, 'Max')
]

# Define the schema for the "Orders"


orders_schema = StructType([
StructField("id", IntegerType(), True),
StructField("customerId", IntegerType(), True)
])

# Define data for the "Orders"


orders_data = [
(1, 3),
(2, 1)
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT
INPUT -1 customers
ID NAME
1 Joe
2 Henry
3 Sam
4 Max

INPUT - 2 orders
ID CUSTOMERID
1 3
2 1

OUTPUT
NAME
Max
Henry


Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

customers_schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True)
])

# Define data for the "Customers"


customers_data = [
(1, 'Joe'),
(2, 'Henry'),
(3, 'Sam'),
(4, 'Max')
]

# Define the schema for the "Orders"


orders_schema = StructType([
StructField("id", IntegerType(), True),
StructField("customerId", IntegerType(), True)
])

# Define data for the "Orders"
orders_data = [
(1, 3),
(2, 1)
]

# Create a PySpark DataFrame

cus_df = spark.createDataFrame(customers_data, customers_schema)
ord_df = spark.createDataFrame(orders_data, orders_schema)

cus_df.show()
ord_df.show()


cus_df.join(ord_df,cus_df.id == ord_df.customerId,"left_anti")\
.select("name").show()

PYSPARK LEARNING HUB : DAY - 7

Step - 1 : Problem Statement

07_Rising Temperature

Write a solution to find the id of all dates with higher
temperatures compared to the previous date (yesterday).

Return the result table in any order.

Difficulty Level : EASY

DataFrame:
# Define the schema for the "Weather" table
weather_schema = StructType([
StructField("id", IntegerType(), True),
StructField("recordDate", StringType(), True),
StructField("temperature", IntegerType(), True)
])

# Define data for the "Weather" table


weather_data = [
(1, '2015-01-01', 10),
(2, '2015-01-02', 25),
(3, '2015-01-03', 20),
(4, '2015-01-04', 30)
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT
ID RECORDDATE TEMPERATURE
1 2015-01-01 10
2 2015-01-02 25
3 2015-01-03 20
4 2015-01-04 30

OUTPUT
ID
2
4

Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession

from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import lag
from pyspark.sql.window import Window

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Define the schema for the "Weather" table


weather_schema = StructType([
StructField("id", IntegerType(), True),
StructField("recordDate", StringType(), True),
StructField("temperature", IntegerType(), True)
])

# Define data for the "Weather" table


weather_data = [
(1, '2015-01-01', 10),
(2, '2015-01-02', 25),
(3, '2015-01-03', 20),
(4, '2015-01-04', 30)
]

# Create a PySpark DataFrame


temp_df=spark.createDataFrame(weather_data,weather_schema)
temp_df.show()


lag_df = temp_df.withColumn("prev_day", lag(temp_df.temperature).over(Window.orderBy(temp_df.recordDate)))
lag_df.show()

lag_df.filter(lag_df["temperature"] > lag_df["prev_day"]).select("id").show()

PYSPARK LEARNING HUB : DAY - 8

Step - 1 : Problem Statement

08_Game Play Analysis I

Write a solution to find the first login date for each player.
Return the result table in any order.

Difficulty Level : EASY

DataFrame:
# Define the schema for the "Activity"
activity_schema = StructType([
StructField("player_id", IntegerType(), True),
StructField("device_id", IntegerType(), True),
StructField("event_date", StringType(), True),
StructField("games_played", IntegerType(), True)
])

# Define data for the "Activity"


activity_data = [
(1, 2, '2016-03-01', 5),
(1, 2, '2016-05-02', 6),
(2, 3, '2017-06-25', 1),
(3, 1, '2016-03-02', 0),
(3, 4, '2018-07-03', 5)
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT
PLAYER_ID DEVICE_ID EVENT_DATE GAMES_PLAYED
1 2 2016-03-01 5
1 2 2016-05-02 6
2 3 2017-06-25 1
3 1 2016-03-02 0
3 4 2018-07-03 5

OUTPUT
PLAYER_ID FIRST_LOGIN
1 2016-03-01
2 2017-06-25
3 2016-03-02

Step - 3 : Writing the pyspark code to solve the problem

# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import rank
from pyspark.sql.window import Window

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Define the schema for the "Activity"


activity_schema = StructType([
StructField("player_id", IntegerType(), True),
StructField("device_id", IntegerType(), True),
StructField("event_date", StringType(), True),
StructField("games_played", IntegerType(), True)
])

# Define data for the "Activity"


activity_data = [
(1, 2, '2016-03-01', 5),
(1, 2, '2016-05-02', 6),
(2, 3, '2017-06-25', 1),
(3, 1, '2016-03-02', 0),
(3, 4, '2018-07-03', 5)
]

# Create a PySpark DataFrame


activity_df=spark.createDataFrame(activity_data,activity_schema)
activity_df.show()


rank_df = activity_df.withColumn("RK", rank().over(Window.partitionBy(activity_df['player_id']).orderBy(activity_df['event_date'])))
rank_df.show()


rank_df.filter(rank_df["RK"] == 1).select("player_id", rank_df["event_date"].alias("First_Login")).show()
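Since only the minimum date per player is needed, a plain aggregation is a simpler sketch of the same result:

from pyspark.sql import functions as F

activity_df.groupBy("player_id") \
    .agg(F.min("event_date").alias("first_login")) \
    .show()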

PYSPARK LEARNING HUB : DAY - 9

Step - 1 : Problem Statement

09_Game Play Analysis II

Write a PySpark program that reports the device that is first
logged in for each player.
Return the result table in any order.

Difficulty Level : EASY

DataFrame:
# Define the schema for the "Activity"
activity_schema = StructType([
StructField("player_id", IntegerType(), True),
StructField("device_id", IntegerType(), True),
StructField("event_date", StringType(), True),
StructField("games_played", IntegerType(), True)
])

# Define data for the "Activity"


activity_data = [
(1, 2, '2016-03-01', 5),
(1, 2, '2016-05-02', 6),
(2, 3, '2017-06-25', 1),
(3, 1, '2016-03-02', 0),
(3, 4, '2018-07-03', 5)
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT
PLAYER_ID DEVICE_ID EVENT_DATE GAMES_PLAYED
1 2 2016-03-01 5
1 2 2016-05-02 6
2 3 2017-06-25 1
3 1 2016-03-02 0
3 4 2018-07-03 5

OUTPUT
PLAYER_ID DEVICE_ID
1 2
2 3
3 1


Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import rank
from pyspark.sql.window import Window

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Define the schema for the "Activity"


activity_schema = StructType([
StructField("player_id", IntegerType(), True),
StructField("device_id", IntegerType(), True),
StructField("event_date", StringType(), True),
StructField("games_played", IntegerType(), True)
])

# Define data for the "Activity"


activity_data = [
(1, 2, '2016-03-01', 5),
(1, 2, '2016-05-02', 6),
(2, 3, '2017-06-25', 1),
(3, 1, '2016-03-02', 0),
(3, 4, '2018-07-03', 5)
]


# Create a PySpark DataFrame

df = spark.createDataFrame(activity_data, activity_schema)
df.show()

rank_df = df.withColumn("rk", rank().over(Window.partitionBy(df["player_id"]).orderBy(df["event_date"])))
rank_df.show()


rank_df.filter(rank_df["rk"] == 1).select("player_id", "device_id").show()
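first over the same window is another sketch of the idea; it tags every row with the player's first device and then deduplicates:

from pyspark.sql import functions as F

df.withColumn("first_device",
              F.first("device_id").over(Window.partitionBy("player_id").orderBy("event_date"))) \
    .select("player_id", "first_device").distinct().show()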

PYSPARK LEARNING HUB : DAY - 10

Step - 1 : Problem Statement

10_Employee Bonus

Write a solution to report the name and bonus amount of
each employee with a bonus less than 1000.
Return the result table in any order.

Difficulty Level : EASY

DataFrame:
# Define the schema for the "Employee"
employee_schema = StructType([
StructField("empId", IntegerType(), True),
StructField("name", StringType(), True),
StructField("supervisor", IntegerType(), True),
StructField("salary", IntegerType(), True)
])

# Define data for the "Employee"


employee_data = [
(3, 'Brad', None, 4000),
(1, 'John', 3, 1000),
(2, 'Dan', 3, 2000),
(4, 'Thomas', 3, 4000)
]

# Define the schema for the "Bonus"


bonus_schema = StructType([
StructField("empId", IntegerType(), True),
StructField("bonus", IntegerType(), True)
])

# Define data for the "Bonus"
bonus_data = [
(2, 500),
(4, 2000)
]

Step - 2 : Identifying The Input Data And Expected Output

INPUT
INPUT-1 EMPLOYEE
EMPID NAME SUPERVISOR SALARY
3 Brad 4,000
1 John 3 1,000
2 Dan 3 2,000
4 Thomas 3 4,000

INPUT-2 BONUS
EMPID BONUS
2 500
4 2,000

OUTPUT
NAME BONUS
Brad
John
Dan 500


Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import col

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Define the schema for the "Employee"


employee_schema = StructType([
StructField("empId", IntegerType(), True),
StructField("name", StringType(), True),
StructField("supervisor", IntegerType(), True),
StructField("salary", IntegerType(), True)
])

# Define data for the "Employee"


employee_data = [
(3, 'Brad', None, 4000),
(1, 'John', 3, 1000),
(2, 'Dan', 3, 2000),
(4, 'Thomas', 3, 4000)
]

# Define the schema for the "Bonus"

bonus_schema = StructType([
StructField("empId", IntegerType(), True),
StructField("bonus", IntegerType(), True)
])

# Define data for the "Bonus"


bonus_data = [
(2, 500),
(4, 2000)
]

# Create a PySpark DataFrame

emp_df = spark.createDataFrame(employee_data, employee_schema)
bonus_df = spark.createDataFrame(bonus_data, bonus_schema)
emp_df.show()
bonus_df.show()

join_df = emp_df.join(bonus_df, emp_df.empId == bonus_df.empId, "left")
join_df.show()

join_df.filter((join_df.bonus < 1000) | col("bonus").isNull()).select("name", "bonus").show()
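The isNull branch is required because a comparison against a null bonus evaluates to null and the row would be dropped. A sketch of the same predicate with coalesce, treating a missing bonus as 0:

from pyspark.sql import functions as F

join_df.filter(F.coalesce(col("bonus"), F.lit(0)) < 1000) \
    .select("name", "bonus").show()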

PYSPARK LEARNING HUB : DAY - 11

Step - 1 : Problem Statement

11_Find Customer Referee

Find the names of the customers that are not referred by the
customer with id = 2.
Return the result table in any order.

Difficulty Level : EASY

DataFrame:

# Define the schema for the Customer table


schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("referee_id", IntegerType(), True)
])

# Create an RDD with the data


data = [
(1, 'Will', None),
(2, 'Jane', None),
(3, 'Alex', 2),
(4, 'Bill', None),
(5, 'Zack', 1),
(6, 'Mark', 2)
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT
ID NAME REFEREE_ID
1 Will
2 Jane
3 Alex 2
4 Bill
5 Zack 1
6 Mark 2

OUTPUT
NAME
Will
Jane
Bill
Zack


Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import col

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Define the schema for the Customer table


schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("referee_id", IntegerType(), True)
])

# Create an RDD with the data


data = [
(1, 'Will', None),
(2, 'Jane', None),
(3, 'Alex', 2),
(4, 'Bill', None),
(5, 'Zack', 1),
(6, 'Mark', 2)
]


# Create a PySpark DataFrame


customer_df = spark.createDataFrame(data ,schema )

# Filter customers not referred by customer with id = 2

result_df = customer_df.filter((col("referee_id").isNull()) | (col("referee_id") != 2))

# Select only the 'name' column

result_df = result_df.select("name")
result_df.show()

+-----+
| name|
+-----+
| Will|
| Jane|
| Bill|
| Zack|
+-----+
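The explicit isNull check matters: referee_id != 2 evaluates to null, not true, for null referees. A null-safe equality sketch expresses the same intent in one predicate:

customer_df.filter(~col("referee_id").eqNullSafe(2)).select("name").show()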

PYSPARK LEARNING HUB : DAY - 12

Step - 1 : Problem Statement

12_Cities With Completed Trades

Write a PySpark program to retrieve the top three cities that
have the highest number of completed trade orders listed in
descending order. Output the city name and the
corresponding number of completed trade orders.

Difficulty Level : EASY

DataFrame:
# Define the schema for the trades
trades_schema = StructType([
StructField("order_id", IntegerType(), True),
StructField("user_id", IntegerType(), True),
StructField("price", FloatType(), True),
StructField("quantity", IntegerType(), True),
StructField("status", StringType(), True),
StructField("timestamp", StringType(), True)
])

# Define the schema for the users


users_schema = StructType([
StructField("user_id", IntegerType(), True),
StructField("city", StringType(), True),
StructField("email", StringType(), True),
StructField("signup_date", StringType(), True)
])

# Create an RDD with the data for trades


trades_data = [
(100101, 111, 9.80, 10, 'Cancelled', '2022-08-17 12:00:00'),
(100102, 111, 10.00, 10, 'Completed', '2022-08-17 12:00:00'),
(100259, 148, 5.10, 35, 'Completed', '2022-08-25 12:00:00'),
(100264, 148, 4.80, 40, 'Completed', '2022-08-26 12:00:00'),
(100305, 300, 10.00, 15, 'Completed', '2022-09-05 12:00:00'),
(100400, 178, 9.90, 15, 'Completed', '2022-09-09 12:00:00'),
(100565, 265, 25.60, 5, 'Completed', '2022-12-19 12:00:00')
]


# Create an RDD with the data for users

users_data = [
    (111, 'San Francisco', '[email protected]', '2021-08-03 12:00:00'),
    (148, 'Boston', '[email protected]', '2021-08-20 12:00:00'),
    (178, 'San Francisco', '[email protected]', '2022-01-05 12:00:00'),
    (265, 'Denver', '[email protected]', '2022-02-26 12:00:00'),
    (300, 'San Francisco', '[email protected]', '2022-06-30 12:00:00')
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT

INPUT-1 trade
ORDER_ID USER_ID PRICE QUANTITY STATUS TIMESTAMP
100101 111 9.8 10 Cancelled 2022-08-17 12:00:00
100102 111 10 10 Completed 2022-08-17 12:00:00
100259 148 5.1 35 Completed 2022-08-25 12:00:00
100264 148 4.8 40 Completed 2022-08-26 12:00:00
100305 300 10 15 Completed 2022-09-05 12:00:00
100400 178 9.9 15 Completed 2022-09-09 12:00:00
100565 265 25.6 5 Completed 2022-12-19 12:00:00

INPUT-2 user
USER_ID CITY EMAIL SIGNUP_DATE
111 San Francisco [email protected] 2021-08-03 12:00:00
148 Boston [email protected] 2021-08-20 12:00:00
178 San Francisco [email protected] 2022-01-05 12:00:00
265 Denver [email protected] 2022-02-26 12:00:00
300 San Francisco [email protected] 2022-06-30 12:00:00

OUTPUT
CITY COUNT(*)
San Francisco 3
Boston 2
Denver 1

Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType,FloatType
from pyspark.sql.functions import desc

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Define the schema for the trades


trades_schema = StructType([
StructField("order_id", IntegerType(), True),
StructField("user_id", IntegerType(), True),
StructField("price", FloatType(), True),
StructField("quantity", IntegerType(), True),
StructField("status", StringType(), True),
StructField("timestamp", StringType(), True)
])
# Define the schema for the users
users_schema = StructType([
StructField("user_id", IntegerType(), True),

StructField("city", StringType(), True),
StructField("email", StringType(), True),
StructField("signup_date", StringType(), True)
])

# Create an RDD with the data for trades


trades_data = [
(100101, 111, 9.80, 10, 'Cancelled', '2022-08-17 12:00:00'),
(100102, 111, 10.00, 10, 'Completed', '2022-08-17 12:00:00'),
(100259, 148, 5.10, 35, 'Completed', '2022-08-25 12:00:00'),
(100264, 148, 4.80, 40, 'Completed', '2022-08-26 12:00:00'),
(100305, 300, 10.00, 15, 'Completed', '2022-09-05 12:00:00'),
(100400, 178, 9.90, 15, 'Completed', '2022-09-09 12:00:00'),
(100565, 265, 25.60, 5, 'Completed', '2022-12-19 12:00:00')
]

# Create an RDD with the data for users

users_data = [
    (111, 'San Francisco', '[email protected]', '2021-08-03 12:00:00'),
    (148, 'Boston', '[email protected]', '2021-08-20 12:00:00'),
    (178, 'San Francisco', '[email protected]', '2022-01-05 12:00:00'),
    (265, 'Denver', '[email protected]', '2022-02-26 12:00:00'),
    (300, 'San Francisco', '[email protected]', '2022-06-30 12:00:00')
]

Trade_df=spark.createDataFrame(trades_data,trades_schema)
User_df=spark.createDataFrame(users_data,users_schema)
Trade_df.show()
User_df.show()


join_df = Trade_df.join(User_df, Trade_df['user_id'] == User_df['user_id'], "inner")
join_df.show()


# Keep only completed orders, count per city, then take the top three
join_df.filter(join_df['status'] == 'Completed') \
    .groupby(join_df['city']).count() \
    .orderBy(desc("count")).limit(3) \
    .show()

PYSPARK LEARNING HUB : DAY - 13

Step - 1 : Problem Statement

13_Page With No Likes

Write a PySpark program to return the IDs of the Facebook pages
that have zero likes. The output should be sorted in
ascending order based on the page IDs.

Difficulty Level : EASY

DataFrame:
# Define the schema for the pages
pages_schema = StructType([
StructField("page_id", IntegerType(), True),
StructField("page_name", StringType(), True)
])
# Define the schema for the page_likes table
page_likes_schema = StructType([
StructField("user_id", IntegerType(), True),
StructField("page_id", IntegerType(), True),
StructField("liked_date", StringType(), True)
])
# Create an RDD with the data for pages
pages_data = [
(20001, 'SQL Solutions'),
(20045, 'Brain Exercises'),
(20701, 'Tips for Data Analysts')
]
# Create an RDD with the data for page_likes table
page_likes_data = [
(111, 20001, '2022-04-08 00:00:00'),
(121, 20045, '2022-03-12 00:00:00'),
(156, 20001, '2022-07-25 00:00:00')
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT

INPUT - 1 PAGES
PAGE_ID PAGE_NAME
20001 SQL Solutions
20045 Brain Exercises
20701 Tips for Data Analysts

INPUT - 2 PAGE_LIKES
USER_ID PAGE_ID LIKED_DATE
111 20001 2022-04-08 0:00:00
121 20045 2022-03-12 0:00:00
156 20001 2022-07-25 0:00:00

OUTPUT
PAGE_ID
20701


Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Define the schema for the pages
pages_schema = StructType([
StructField("page_id", IntegerType(), True),
StructField("page_name", StringType(), True)
])

# Define the schema for the page_likes table


page_likes_schema = StructType([
StructField("user_id", IntegerType(), True),
StructField("page_id", IntegerType(), True),
StructField("liked_date", StringType(), True)
])

# Create an RDD with the data for pages


pages_data = [
(20001, 'SQL Solutions'),
(20045, 'Brain Exercises'),
(20701, 'Tips for Data Analysts')
]
# Create an RDD with the data for page_likes table
page_likes_data = [
(111, 20001, '2022-04-08 00:00:00'),
(121, 20045, '2022-03-12 00:00:00'),

(156, 20001, '2022-07-25 00:00:00')
]

page_df = spark.createDataFrame(pages_data, pages_schema)
page_like_df = spark.createDataFrame(page_likes_data, page_likes_schema)
page_df.show()
page_like_df.show()

# Perform a left anti join to get pages with zero likes

zero_likes_pages = page_df.join(page_like_df, 'page_id', 'left_anti')

# Select and sort the result
result = zero_likes_pages.select("page_id").orderBy("page_id")

# Show the result
result.show()

+-------+
|page_id|
+-------+
|  20701|
+-------+
PYSPARK LEARNING HUB : DAY - 14

Step - 1 : Problem Statement

14_Purchasing Activity by Product Type

We have been given a purchasing activity DF and we need
to find the cumulative purchases of each product over
time.

Difficulty Level : EASY

DataFrame:
# Define schema for the DataFrame
schema = StructType([
StructField("order_id", IntegerType(), True),
StructField("product_type", StringType(), True),
StructField("quantity", IntegerType(), True),
StructField("order_date", StringType(), True),
])

# Define data
data = [
(213824, 'printer', 20, "2022-06-27 "),
(212312, 'hair dryer', 5, "2022-06-28 "),
(132842, 'printer', 18, "2022-06-28 "),
(284730, 'standing lamp', 8, "2022-07-05 ")
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT
ORDER_ID PRODUCT_TYPE QUANTITY ORDER_DATE
213824 printer 20 2022-06-27 12:00:00
212312 hair dryer 5 2022-06-28 12:00:00
132842 printer 18 2022-06-28 12:00:00
284730 standing lamp 8 2022-07-05 12:00:00

OUTPUT
ORDER_DATE PRODUCT_TYPE CUM_PURCHASED
2022-06-27 12:00:00 printer 20
2022-06-28 12:00:00 hair dryer 5
2022-06-28 12:00:00 printer 38
2022-07-05 12:00:00 standing lamp 8


Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql import functions as F
from pyspark.sql.window import Window

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Define schema for the DataFrame


schema = StructType([
StructField("order_id", IntegerType(), True),
StructField("product_type", StringType(), True),
StructField("quantity", IntegerType(), True),
StructField("order_date", StringType(), True),
])

# Define data
data = [
(213824, 'printer', 20, "2022-06-27 "),
(212312, 'hair dryer', 5, "2022-06-28 "),
(132842, 'printer', 18, "2022-06-28 "),
(284730, 'standing lamp', 8, "2022-07-05 ")
]


order_df=spark.createDataFrame(data,schema)
order_df.show()

# Define a Window specification: per product_type, ordered by order_date,
# with a frame from the first row up to the current row

window_spec = Window.partitionBy("product_type").orderBy("order_date").rowsBetween(Window.unboundedPreceding, 0)

# Add a new column 'cumulative_purchases' representing the cumulative sum
result_df = order_df.withColumn("cumulative_purchases", F.sum("quantity").over(window_spec))
result_df.show()
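To present the result in the shape of the expected output (ordered by date, with only the three report columns), a final projection sketch:

result_df.select("order_date", "product_type",
                 F.col("cumulative_purchases").alias("cum_purchased")) \
    .orderBy("order_date").show()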

PYSPARK LEARNING HUB : DAY - 15

Step - 1 : Problem Statement


15_Teams Power Users

Write a PySpark program to identify the top 2 power users
who sent the highest number of messages on Microsoft
Teams in August 2022. Display the IDs of these 2 users
along with the total number of messages they sent.
Output the results in descending order based on the
count of the messages.

Difficulty Level : EASY

DataFrame:
schema = StructType([
StructField("message_id", IntegerType(), True),
StructField("sender_id", IntegerType(), True),
StructField("receiver_id", IntegerType(), True),
StructField("content", StringType(), True),
StructField("sent_date", StringType(), True),
])

# Define the data


data = [
(901, 3601, 4500, 'You up?', '2022-08-03 00:00:00'),
(902, 4500, 3601, 'Only if you\'re buying', '2022-08-03 00:00:00'),
(743, 3601, 8752, 'Let\'s take this offline', '2022-06-14 00:00:00'),
(922, 3601, 4500, 'Get on the call', '2022-08-10 00:00:00'),
]


Step - 2 : Identifying The Input Data And Expected Output

INPUT

MESSAGE_ID SENDER_ID RECEIVER_ID CONTENT SENT_DATE
901 3601 4500 You up? 2022-08-03 0:00:00
902 4500 3601 Only if you're buying 2022-08-03 0:00:00
743 3601 8752 Let's take this offline 2022-06-14 0:00:00
922 3601 4500 Get on the call 2022-08-10 0:00:00

OUTPUT
SENDER_ID COUNT(*)
3601 2
4500 1


Step - 3 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

schema = StructType([
StructField("message_id", IntegerType(), True),
StructField("sender_id", IntegerType(), True),
StructField("receiver_id", IntegerType(), True),
StructField("content", StringType(), True),
StructField("sent_date", StringType(), True),
])

# Define the data


data = [
(901, 3601, 4500, 'You up?', '2022-08-03 00:00:00'),
(902, 4500, 3601, 'Only if you\'re buying', '2022-08-03 00:00:00'),
(743, 3601, 8752, 'Let\'s take this offline', '2022-06-14 00:00:00'),
(922, 3601, 4500, 'Get on the call', '2022-08-10 00:00:00'),
]


teams_df = spark.createDataFrame(data,schema)
teams_df.show()

filter_df=teams_df.filter(teams_df['sent_date'].like("2022-08%"))
filter_df.show()

result_df = filter_df.groupby(filter_df['sender_id']).count()
result_df = result_df.orderBy(result_df['count'].desc()).limit(2)
result_df.show()
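The like("2022-08%") prefix test works because sent_date is stored as a string. A sketch that filters on real date parts instead (year/month accept timestamp-castable strings):

from pyspark.sql.functions import year, month, col

teams_df.filter((year(col("sent_date")) == 2022) & (month(col("sent_date")) == 8)) \
    .groupBy("sender_id").count() \
    .orderBy(col("count").desc()).limit(2).show()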

PYSPARK LEARNING HUB : DAY - 16

Step - 1 : Problem Statement

16_Select in pyspark

Write a PySpark program to perform the below functions:
● Write a pyspark code to get all employee detail.
● Write a query to get only the "FirstName" column from emp_df.
● Write a pyspark code to get FirstName in upper case as "First Name".
● Write a pyspark code to get FirstName in lower case.

Difficulty Level : EASY

DataFrame:
data = [
    [1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
    [2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
    [3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
    [4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
    [5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])


Step - 2 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType,DoubleType
from pyspark.sql.functions import col, upper

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Create a list of employee rows

data = [
    [1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
    [2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
    [3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
    [4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
    [5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]

# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),

StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

emp_df=spark.createDataFrame(data,schema)

#1. Write a pyspark code to get all employee detail

emp_df.show()

# 2. Write a query to get only "FirstName" column from emp_df

# Method 1
emp_df.select("First_Name").show()

# Method 2
emp_df.select(col("First_Name")).show()

# Method 3
emp_df.createOrReplaceTempView("emp_table")
spark.sql("select First_Name from emp_table").show()

# 3. Write a Pyspark code to get FirstName in upper case as "First
Name".

emp_df.select(upper("First_Name")).show()

#4. Write a pyspark code to get FirstName in lower case

from pyspark.sql.functions import lower


emp_df.select(lower("First_Name")).show()

PYSPARK LEARNING HUB : DAY - 17

Step - 1 : Problem Statement

17_Select in pyspark

Write a PySpark program to perform the below functions:
● Combine FirstName and LastName and display it as "Name"
(also include white space between first name & last name)
● Select employee detail whose name is "Vikas"
● Get all employee detail from EmployeeDetail table whose
"FirstName" starts with letter 'a'.

Difficulty Level : EASY

DataFrame:
data = [
    [1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
    [2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
    [3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
    [4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
    [5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),

StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

Step - 2 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType,DoubleType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Create a list of employee rows

data = [
    [1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
    [2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
    [3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
    [4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
    [5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]

# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

emp_df=spark.createDataFrame(data,schema)

# 1. Combine FirstName and LastName and display it as "Name"
# (also include white space between first name & last name)

from pyspark.sql.functions import concat_ws


emp_df.select(concat_ws(" ","First_Name","Last_Name")\
.alias("Name")).show()


# 2. Select employee detail whose name is "Vikas"

# Methos 1
from pyspark.sql.functions import col
emp_df.filter(col("First_Name") == 'Vikas' ).show(truncate=False)

# Methos 2
emp_df.filter(emp_df.First_Name == 'Vikas' ).show(truncate=False)

# Methos 3
emp_df.filter(emp_df['First_Name'] == 'Vikas' ).show(truncate=False)

# Methos 4
emp_df.where(emp_df['First_Name'] == 'Vikas' ).show(truncate=False)

# 3. Get all employee detail from EmployeeDetail table whose
# "FirstName" starts with letter 'a'.

# Method 1
from pyspark.sql.functions import lower
emp_df.filter(lower(emp_df['First_Name']).like("a%")).show()

# Method 2

emp_df.filter((emp_df['First_Name'].like("a%")) |
(emp_df['First_Name'].like("A%")) ).show()

PYSPARK LEARNING HUB : DAY - 18

Step - 1 : Problem Statement

18_Select in pyspark

Write a PySpark program to perform the below functions:
● Get all employee details from EmployeeDetail table whose
"FirstName" contains 'k'
● Get all employee details from EmployeeDetail table whose
"FirstName" ends with 'h'
● Get all employee detail from EmployeeDetail table whose
"FirstName" does not start with any single character between 'a-p'
● Get all employee detail from EmployeeDetail table whose
"FirstName" starts with any single character between 'a-p'

Difficulty Level : EASY
DataFrame:
data = [
    [1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
    [2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
    [3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
    [4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
    [5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),

StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

Step - 2 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType,DoubleType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Create a list of employee rows

data = [
    [1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
    [2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
    [3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
    [4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
    [5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]

# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

emp_df=spark.createDataFrame(data,schema)

# 1. Get all employee details from EmployeeDetail table
# whose "FirstName" contains 'k'

from pyspark.sql.functions import col

emp_df.filter(emp_df["First_Name"].like("%k%")).show()

# 2. Get all employee details from EmployeeDetail table whose
# "FirstName" ends with 'h'
emp_df.filter(emp_df["First_Name"].like("%h")).show()


# 3. Get all employee detail from EmployeeDetail table whose
# "FirstName" does not start with any single character between 'a-p'
# (the regex is anchored to the first character with ^)
emp_df.filter(emp_df["First_Name"].rlike("^[^a-pA-P]")).show()

# 4. Get all employee detail from EmployeeDetail table whose
# "FirstName" starts with any single character between 'a-p'
emp_df.filter(emp_df["First_Name"].rlike("^[a-pA-P]")).show()

PYSPARK LEARNING HUB : DAY - 19

Step - 1 : Problem Statement

19_Select in pyspark

Write a PySpark program to perform the below functions:
● Get all employee detail from emp_df whose "Gender" ends with
'le' and contains 4 letters. The Underscore(_) Wildcard
Character represents any single character.
● Get all employee detail from EmployeeDetail table whose
"FirstName" starts with 'A' and contains 5 letters.
● Get all unique "Department" from EmployeeDetail table.
● Get the highest "Salary" from EmployeeDetail table.

Difficulty Level : EASY

DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])


Step - 2 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType,DoubleType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Create a list of rows from the image


data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]

# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])


emp_df=spark.createDataFrame(data,schema)

Get all employee detail from emp_df whose "Gender" end with
'le'and contain 4 letters. The Underscore(_) Wildcard Character
represents any single character.
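The source page shows no code here; a minimal sketch using two underscore wildcards followed by 'le':

emp_df.filter(emp_df["Gender"].like("__le")).show()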

# 2. Get all employee detail from EmployeeDetail table whose
# "FirstName" starts with 'A' and contains 5 letters.
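A sketch on the same pattern: 'A' followed by exactly four single-character wildcards.

emp_df.filter(emp_df["First_Name"].like("A____")).show()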

# 3. Get all unique "Department" from EmployeeDetail table.
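A sketch with distinct:

emp_df.select("Department").distinct().show()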

# 4. Get the highest "Salary" from EmployeeDetail table.
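A sketch with an aggregate:

from pyspark.sql import functions as F

emp_df.select(F.max("Salary").alias("Highest_Salary")).show()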

PYSPARK LEARNING HUB : DAY - 20

Step - 1 : Problem Statement

20_Date in pyspark

Write a PySpark program to perform the below functions:
● Get the lowest "Salary" from EmployeeDetail table.
● Show "JoiningDate" in "dd mmm yyyy" format, ex- "15 Feb 2013"
● Show "JoiningDate" in "yyyy/mm/dd" format, ex- "2013/02/15"

Difficulty Level : EASY

DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

Step - 2 : Writing the pyspark code to solve the problem

# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Create the rows for the DataFrame


data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]

# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

emp_df=spark.createDataFrame(data,schema)
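
The remaining solution pages are missing from this copy; a minimal sketch of the three queries, assuming the emp_df built above (Spark casts the string Joining_Date to a timestamp for date_format):

from pyspark.sql.functions import col, min as min_, date_format

# Lowest salary
emp_df.select(min_("Salary").alias("Lowest_Salary")).show()

# Joining date as "dd MMM yyyy", e.g. "15 Feb 2013"
emp_df.select("First_Name",
              date_format(col("Joining_Date"), "dd MMM yyyy").alias("Joining_Date")).show()

# Joining date as "yyyy/MM/dd", e.g. "2013/02/15"
emp_df.select("First_Name",
              date_format(col("Joining_Date"), "yyyy/MM/dd").alias("Joining_Date")).show()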


PYSPARK LEARNING HUB : DAY - 21

Step - 1 : Problem Statement

21_Date in pyspark
Write PySpark code to perform the functions below:
● Get only the year part of "Joining_Date".
● Get only the month part of "Joining_Date".
● Get only the day part of "Joining_Date".
● Get the current system date using the DataFrame API.
● Get the current UTC date and time using the DataFrame API.

Difficulty Level : EASY


DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])


Step - 2 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Create the rows for the DataFrame


data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]

# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

emp_df=spark.createDataFrame(data,schema)
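
The solution pages did not survive extraction; one possible sketch, assuming the emp_df above. The time zone passed to to_utc_timestamp is an assumption about the session's local zone, so adjust it to your environment:

from pyspark.sql.functions import year, month, dayofmonth, current_date, current_timestamp, to_utc_timestamp

# Year, month and day parts of the joining date
emp_df.select(
    "First_Name",
    year("Joining_Date").alias("Join_Year"),
    month("Joining_Date").alias("Join_Month"),
    dayofmonth("Joining_Date").alias("Join_Day")
).show()

# Current system date
emp_df.select(current_date().alias("Today")).show(1)

# Current UTC date and time ("Asia/Kolkata" assumed as the local session zone)
emp_df.select(to_utc_timestamp(current_timestamp(), "Asia/Kolkata").alias("UTC_Now")).show(1, truncate=False)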


PYSPARK LEARNING HUB : DAY - 22

Step - 1 : Problem Statement

22_Date in pyspark
Write PySpark code to perform the functions below:
● Get the first name, current date, joining date and the difference
between the current date and the joining date in months.
● Get the first name, current date, joining date and the difference
between the current date and the joining date in days.
● Get all employee details from the EmployeeDetail DataFrame whose
joining year is 2013.

Difficulty Level : EASY


DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])


Step - 2 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Create the rows for the DataFrame


data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]

# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

emp_df=spark.createDataFrame(data,schema)
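
The worked answer is missing here; a minimal sketch, assuming the emp_df above:

from pyspark.sql.functions import col, current_date, datediff, months_between, to_date, year

emp_dates = emp_df.withColumn("Join_Date", to_date(col("Joining_Date")))

# Difference between the current date and the joining date in months
emp_dates.select(
    "First_Name",
    current_date().alias("Today"),
    "Join_Date",
    months_between(current_date(), col("Join_Date")).alias("Months_Diff")
).show()

# The same difference in days
emp_dates.select(
    "First_Name",
    current_date().alias("Today"),
    "Join_Date",
    datediff(current_date(), col("Join_Date")).alias("Days_Diff")
).show()

# Employees whose joining year is 2013
emp_dates.filter(year("Join_Date") == 2013).show()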


PYSPARK LEARNING HUB : DAY - 23

Step - 1 : Problem Statement

23_Date in pyspark
Write PySpark code to perform the functions below:
● Get all employee details from the EmployeeDetail DataFrame whose
joining month is Jan (1).
● Get all employee details from the EmployeeDetail DataFrame whose
joining date is between "2013-01-01" and "2013-12-01".
● Get how many employees exist in the EmployeeDetail DataFrame.
● Select all employee details with first name "Vikas", "Ashish", or
"Nikhil".

Difficulty Level : EASY


DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])


Step - 2 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Create the rows for the DataFrame


data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]

# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])


emp_df=spark.createDataFrame(data,schema)
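
The solution pages are missing; one way to do it, assuming the emp_df above:

from pyspark.sql.functions import col, month, to_date

emp_dates = emp_df.withColumn("Join_Date", to_date(col("Joining_Date")))

# Employees who joined in January
emp_dates.filter(month("Join_Date") == 1).show()

# Joining date between 2013-01-01 and 2013-12-01
emp_dates.filter(col("Join_Date").between("2013-01-01", "2013-12-01")).show()

# How many employees exist
print(emp_df.count())

# First name is Vikas, Ashish or Nikhil
emp_df.filter(col("First_Name").isin("Vikas", "Ashish", "Nikhil")).show()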


PYSPARK LEARNING HUB : DAY - 24

Step - 1 : Problem Statement

24_Trim and case in pyspark


Write PySpark code to perform the functions below:
● Select all employee details with first name not in
"Vikas", "Ashish", and "Nikhil".
● Select the first name from the EmployeeDetail DataFrame after
removing white space from the right side.
● Select the first name from the EmployeeDetail DataFrame after
removing white space from the left side.
● Display the first name and Gender as M/F (if Male then M, if
Female then F).

Difficulty Level : EASY


DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)

])

Step - 2 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Create the rows for the DataFrame


data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]

# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),

StructField("Gender", StringType(), True)
])

emp_df=spark.createDataFrame(data,schema)
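
The solution code did not survive in this copy; a minimal sketch, assuming the emp_df above:

from pyspark.sql.functions import col, ltrim, rtrim, when

# First name not in Vikas, Ashish, Nikhil
emp_df.filter(~col("First_Name").isin("Vikas", "Ashish", "Nikhil")).show()

# Remove white space from the right, then from the left, of First_Name
emp_df.select(rtrim(col("First_Name")).alias("First_Name")).show()
emp_df.select(ltrim(col("First_Name")).alias("First_Name")).show()

# Gender as M/F
emp_df.select(
    "First_Name",
    when(col("Gender") == "Male", "M")
        .when(col("Gender") == "Female", "F")
        .alias("Gender")
).show()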


PYSPARK LEARNING HUB : DAY - 25

Step - 1 : Problem Statement

25_operator in pyspark
Write PySpark code to perform the functions below:
● Select the first name from the EmployeeDetail DataFrame prefixed
with "Hello ".
● Get employee details from the EmployeeDetail DataFrame whose
salary is greater than 600000.
● Get employee details from the EmployeeDetail DataFrame whose
salary is less than 700000.
● Get employee details from the EmployeeDetail DataFrame whose
salary is between 500000 and 600000.
● Select the second-highest salary from the EmployeeDetail DataFrame.

Difficulty Level : EASY


DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]

# Create a schema for the DataFrame

schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

Step - 2 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Create the rows for the DataFrame


data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]

# Create a schema for the DataFrame

schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

emp_df=spark.createDataFrame(data,schema)
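
The solution pages are missing; one possible sketch, assuming the emp_df above. The second-highest salary uses dense_rank so ties do not skip a rank:

from pyspark.sql.functions import col, concat, dense_rank, lit
from pyspark.sql.window import Window

# First name prefixed with "Hello "
emp_df.select(concat(lit("Hello "), col("First_Name")).alias("Greeting")).show()

# Salary greater than 600000, less than 700000, and between 500000 and 600000
emp_df.filter(col("Salary") > 600000).show()
emp_df.filter(col("Salary") < 700000).show()
emp_df.filter(col("Salary").between(500000, 600000)).show()

# Second-highest salary
w = Window.orderBy(col("Salary").desc())
emp_df.withColumn("rnk", dense_rank().over(w)) \
      .filter(col("rnk") == 2).select("Salary").show()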


PYSPARK LEARNING HUB : DAY - 26

Step - 1 : Problem Statement

26_groupby in pyspark
Write PySpark code to perform the functions below:
● Get the department and the department-wise
total (sum) salary from the EmployeeDetail DataFrame.
● Get the department and the department-wise
total (sum) salary, displayed in ascending order of
salary.
● Get the department and the department-wise
total (sum) salary, displayed in descending order of
salary.
● Get the department, the employee count per
department, and the total (sum) salary per department
from the EmployeeDetail DataFrame.

Difficulty Level : EASY


DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]


# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

Step - 2 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Create the rows for the DataFrame


data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],

]

# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

emp_df=spark.createDataFrame(data,schema)
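
The grouped queries themselves are missing; a minimal sketch, assuming the emp_df above:

from pyspark.sql.functions import col, count, sum as sum_

# Department-wise total salary
dept_sum = emp_df.groupBy("Department").agg(sum_("Salary").alias("Total_Salary"))
dept_sum.show()

# The same totals, in ascending and then descending order of salary
dept_sum.orderBy(col("Total_Salary").asc()).show()
dept_sum.orderBy(col("Total_Salary").desc()).show()

# Department, employee count per department, and total salary
emp_df.groupBy("Department").agg(
    count("*").alias("Emp_Count"),
    sum_("Salary").alias("Total_Salary")
).show()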


PYSPARK LEARNING HUB : DAY - 27

Step - 1 : Problem Statement

27_groupby in pyspark
Write PySpark code to perform the functions below:
● Get the department-wise average salary from the
EmployeeDetail DataFrame, ordered by salary ascending.
● Get the department-wise maximum salary from the
EmployeeDetail DataFrame, ordered by salary ascending.
● Get the department-wise minimum salary from the
EmployeeDetail DataFrame, ordered by salary ascending.

Difficulty Level : EASY


DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]

# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])


Step - 2 : Writing the pyspark code to solve the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Create the rows for the DataFrame


data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]

# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),

StructField("Gender", StringType(), True)
])

emp_df=spark.createDataFrame(data,schema)
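
The aggregation code is missing from this copy; one way to write it, assuming the emp_df above:

from pyspark.sql.functions import avg, col, max as max_, min as min_

# Department-wise average salary, ordered ascending
emp_df.groupBy("Department").agg(avg("Salary").alias("Avg_Salary")) \
      .orderBy(col("Avg_Salary").asc()).show()

# Department-wise maximum salary, ordered ascending
emp_df.groupBy("Department").agg(max_("Salary").alias("Max_Salary")) \
      .orderBy(col("Max_Salary").asc()).show()

# Department-wise minimum salary, ordered ascending
emp_df.groupBy("Department").agg(min_("Salary").alias("Min_Salary")) \
      .orderBy(col("Min_Salary").asc()).show()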


PYSPARK LEARNING HUB : DAY - 28

Step - 1 : Problem Statement

28_Join_in_pyspark
Write PySpark code to perform the functions below:
● Fetch the project names assigned to more
than one employee.
● Get employee name and project name, ordered by first name, from
"EmployeeDetail" and "ProjectDetail" for those employees who
already have a project assigned.

Difficulty Level : EASY

DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]


# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

pro_schema = StructType([
StructField("Project_DetailID", IntegerType(), True),
StructField("Employee_DetailID", IntegerType(), True),
StructField("Project_Name", StringType(), True)
])

# Create the data as a list of tuples


pro_data = [
(1, 1, "Task Track"),
(2, 1, "CLP"),
(3, 1, "Survey Management"),
(4, 2, "HR Management"),
(5, 3, "Task Track"),
(6, 3, "GRS"),
(7, 3, "DDS"),
(8, 4, "HR Management"),
(9, 6, "GL Management")
]


Step - 2 : Writing the pyspark code to solve the problem
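
The code pages for this day are blank in this copy; the sketch below assumes the SparkSession boilerplate from the earlier days and the data, schema, pro_data and pro_schema defined above:

emp_df = spark.createDataFrame(data, schema)
pro_df = spark.createDataFrame(pro_data, pro_schema)

from pyspark.sql.functions import col, count

# Project names assigned to more than one employee
pro_df.groupBy("Project_Name") \
      .agg(count("*").alias("Emp_Count")) \
      .filter(col("Emp_Count") > 1) \
      .select("Project_Name").show()

# Employee and project names for employees who already have a project;
# an inner join drops employees without a match
emp_df.join(pro_df, emp_df.EmployeeID == pro_df.Employee_DetailID, "inner") \
      .select("First_Name", "Last_Name", "Project_Name") \
      .orderBy("First_Name").show()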


PYSPARK LEARNING HUB : DAY - 29

Step - 1 : Problem Statement

29_Join_in_pyspark
Write PySpark code to perform the functions below:
● Get employee name and project name, ordered by first name, from
"EmployeeDetail" and "ProjectDetail" for all employees, even
those with no project assigned.
● Get employee name and project name, ordered by first name, from
"EmployeeDetail" and "ProjectDetail" for all employees; if no
project is assigned, display "-No Project Assigned".

Difficulty Level : EASY

DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]


# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

pro_schema = StructType([
StructField("Project_DetailID", IntegerType(), True),
StructField("Employee_DetailID", IntegerType(), True),
StructField("Project_Name", StringType(), True)
])

# Create the data as a list of tuples


pro_data = [
(1, 1, "Task Track"),
(2, 1, "CLP"),
(3, 1, "Survey Management"),
(4, 2, "HR Management"),
(5, 3, "Task Track"),
(6, 3, "GRS"),
(7, 3, "DDS"),
(8, 4, "HR Management"),
(9, 6, "GL Management")
]

Step - 2 : Writing the pyspark code to solve the problem
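
As with Day 28, the code pages are blank here; a sketch assuming the same session and the data, schema, pro_data and pro_schema above:

emp_df = spark.createDataFrame(data, schema)
pro_df = spark.createDataFrame(pro_data, pro_schema)

from pyspark.sql.functions import coalesce, col, lit

# A left join keeps every employee, with a null Project_Name where nothing matches
joined = emp_df.join(pro_df, emp_df.EmployeeID == pro_df.Employee_DetailID, "left")

joined.select("First_Name", "Last_Name", "Project_Name").orderBy("First_Name").show()

# Same result, but nulls replaced by a placeholder
joined.select(
    "First_Name", "Last_Name",
    coalesce(col("Project_Name"), lit("-No Project Assigned")).alias("Project_Name")
).orderBy("First_Name").show()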


PYSPARK LEARNING HUB : DAY - 30

Step - 1 : Problem Statement

30_Join_in_pyspark
Write PySpark code to perform the functions below:
● Find the names of employees who have not been
assigned any project, and display "-No Project
Assigned" (tables: [EmployeeDetail], [ProjectDetail]).
● Find the project names that are not
assigned to any employee (tables:
[EmployeeDetail], [ProjectDetail]).

Difficulty Level : EASY


DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]


# Create a schema for the DataFrame


schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])

pro_schema = StructType([
StructField("Project_DetailID", IntegerType(), True),
StructField("Employee_DetailID", IntegerType(), True),
StructField("Project_Name", StringType(), True)
])

# Create the data as a list of tuples


pro_data = [
(1, 1, "Task Track"),
(2, 1, "CLP"),
(3, 1, "Survey Management"),
(4, 2, "HR Management"),
(5, 3, "Task Track"),
(6, 3, "GRS"),
(7, 3, "DDS"),
(8, 4, "HR Management"),
(9, 6, "GL Management")
]


Step - 2 : Writing the pyspark code to solve the problem
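
The code pages are blank in this copy; a minimal sketch using anti joins, assuming the session and the data, schema, pro_data and pro_schema above:

emp_df = spark.createDataFrame(data, schema)
pro_df = spark.createDataFrame(pro_data, pro_schema)

from pyspark.sql.functions import lit

# Employees with no project: a left anti join keeps only unmatched rows
emp_df.join(pro_df, emp_df.EmployeeID == pro_df.Employee_DetailID, "left_anti") \
      .select("First_Name", "Last_Name",
              lit("-No Project Assigned").alias("Project_Name")).show()

# Projects not assigned to any employee: the anti join in the other direction
pro_df.join(emp_df, pro_df.Employee_DetailID == emp_df.EmployeeID, "left_anti") \
      .select("Project_Name").show()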


PYSPARK LEARNING HUB : DAY - 31

Step - 1 : Problem Statement

31_Histogram of Tweets
Write a PySpark program to obtain a histogram of tweets posted per user in 2022. Output the
tweet count per user as the bucket and the number of Twitter users who fall into that
bucket.

In other words, group the users by the number of tweets they posted in 2022 and
count the number of users in each group.

Difficulty Level : EASY


DataFrame:
schema = StructType([
StructField("tweet_id", IntegerType(), True),
StructField("user_id", IntegerType(), True),
StructField("msg", StringType(), True),
StructField("tweet_date", StringType(), True)
])

# Define the data


data = [
    (214252, 111, 'Am considering taking Tesla private at $420. Funding secured.', '2021-12-30 00:00:00'),
    (739252, 111, 'Despite the constant negative press covfefe', '2022-01-01 00:00:00'),
    (846402, 111, 'Following @NickSinghTech on Twitter changed my life!', '2022-02-14 00:00:00'),
    (241425, 254, 'If the salary is so competitive why won’t you tell me what it is?', '2022-03-01 00:00:00'),
    (231574, 148, 'I no longer have a manager. I can\'t be managed', '2022-03-23 00:00:00')
]


Step - 2 : Identifying The Input Data And Expected Output
INPUT

TWEET_ID  USER_ID  MSG                                                                 TWEET_DATE
214252    111      Am considering taking Tesla private at $420. Funding secured.      2021-12-30 00:00:00
739252    111      Despite the constant negative press covfefe                        2022-01-01 00:00:00
846402    111      Following @NickSinghTech on Twitter changed my life!               2022-02-14 00:00:00
241425    254      If the salary is so competitive why won’t you tell me what it is?  2022-03-01 00:00:00
231574    148      I no longer have a manager. I can't be managed                     2022-03-23 00:00:00

OUTPUT

BUCKET  USER_NUM
1       2
2       1


Step - 3 : Writing the pyspark code to solve the problem
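
The solution pages are missing; one two-step aggregation that produces the histogram, assuming the data and schema above and an active SparkSession:

from pyspark.sql.functions import col, count, year

df = spark.createDataFrame(data, schema)

# Tweets per user in 2022 (tweet_date is a string, so cast it first)
per_user = df.filter(year(col("tweet_date").cast("timestamp")) == 2022) \
             .groupBy("user_id").agg(count("*").alias("bucket"))

# Number of users that fall into each tweet-count bucket
per_user.groupBy("bucket").agg(count("*").alias("user_num")) \
        .orderBy("bucket").show()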


PYSPARK LEARNING HUB : DAY - 32

Step - 1 : Problem Statement

32_pyspark_transformation
Write PySpark code to transform the DataFrame so that
each student's marks in Math and English appear as
separate columns.

Difficulty Level : EASY
DataFrame:
data=[
('Rudra','math',79),
('Rudra','eng',60),
('Shivu','math', 68),
('Shivu','eng', 59),
('Anu','math', 65),
('Anu','eng',80)
]

schema = StructType([
StructField("Name", StringType(), True),
StructField("Sub", StringType(), True),
StructField("Marks", IntegerType(), True)
])

Step - 2 : Identifying The Input Data And Expected Output

INPUT

Name   Sub   Marks
Rudra  math  79
Rudra  eng   60
Shivu  math  68
Shivu  eng   59
Anu    math  65
Anu    eng   80

OUTPUT

Name   math  eng
Rudra  79    60
Shivu  68    59
Anu    65    80

Step - 3 : Writing the pyspark code to solve the problem
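
The solution did not survive extraction; a pivot does the whole transformation, assuming the data and schema above and an active SparkSession:

df = spark.createDataFrame(data, schema)

# Pivot the subject values into columns, one row per student; listing the
# pivot values explicitly avoids an extra pass over the data
df.groupBy("Name").pivot("Sub", ["math", "eng"]).sum("Marks").show()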


PYSPARK LEARNING HUB : DAY - 33

Step - 1 : Problem Statement

33_Hobbies Data Transformation


Problem Statement:
Transform a dataset with individuals' names and
associated hobbies into a new format using PySpark.
Convert the comma-separated hobbies into separate
rows, creating a DataFrame with individual rows for
each person and their respective hobbies.
Difficulty Level : EASY
DataFrame:
# Sample input data
data = [("Alice", "badminton,tennis"),
("Bob", "tennis,cricket"),
("Julie", "cricket,carroms")]

# Create a DataFrame
df = spark.createDataFrame(data, ["name", "hobbies"])


Step - 2 : Identifying The Input Data And Expected Output
INPUT

name   hobbies
Alice  badminton,tennis
Bob    tennis,cricket
Julie  cricket,carroms

OUTPUT

name   hobby
Alice  badminton
Alice  tennis
Bob    tennis
Bob    cricket
Julie  cricket
Julie  carroms


Step - 3 : Writing the pyspark code to solve the problem
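
The solution is missing from this copy; split plus explode covers it, assuming the df created above:

from pyspark.sql.functions import col, explode, split

# split turns the comma-separated string into an array,
# and explode emits one row per array element
df.withColumn("hobby", explode(split(col("hobbies"), ","))) \
  .select("name", "hobby").show()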


PYSPARK LEARNING HUB : DAY - 34

Step - 1 : Problem Statement

34_ Histogram of Tweets


Write PySpark code to obtain a histogram of tweets
posted per user in 2022. Output the tweet count per
user as the bucket and the number of Twitter users who
fall into that bucket. In other words, group the users by
the number of tweets they posted in 2022 and count the
number of users in each group.

Difficulty Level : EASY
DataFrame:
# Define the schema for the tweets DataFrame
schema = StructType([
StructField("tweet_id", IntegerType(), True),
StructField("user_id", IntegerType(), True),
StructField("msg", StringType(), True),
StructField("tweet_date", StringType(), True)
])

# Create the tweets DataFrame


data = [
    (214252, 111, 'Am considering taking Tesla private at $420. Funding secured.', '2021-12-30 00:00:00'),
    (739252, 111, 'Despite the constant negative press covfefe', '2022-01-01 00:00:00'),
    (846402, 111, 'Following @NickSinghTech on Twitter changed my life!', '2022-02-14 00:00:00'),
    (241425, 254, 'If the salary is so competitive why won’t you tell me what it is?', '2022-03-01 00:00:00'),
    (231574, 148, 'I no longer have a manager. I can\'t be managed', '2022-03-23 00:00:00')
]


Step - 2 : Identifying The Input Data And Expected Output
INPUT

TWEET_ID  USER_ID  MSG                                                                 TWEET_DATE
214252    111      Am considering taking Tesla private at $420. Funding secured.      2021-12-30 00:00:00
739252    111      Despite the constant negative press covfefe                        2022-01-01 00:00:00
846402    111      Following @NickSinghTech on Twitter changed my life!               2022-02-14 00:00:00
241425    254      If the salary is so competitive why won’t you tell me what it is?  2022-03-01 00:00:00
231574    148      I no longer have a manager. I can't be managed                     2022-03-23 00:00:00

OUTPUT

BUCKET  USER_NUM
1       2
2       1


Step - 3 : Writing the pyspark code to solve the problem
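
This repeats Day 31, and the solution pages are again missing; the same two-step aggregation, assuming the data and schema above and an active SparkSession:

from pyspark.sql.functions import col, count, year

df = spark.createDataFrame(data, schema)

# Count tweets per user for 2022, then count users per tweet-count bucket
df.filter(year(col("tweet_date").cast("timestamp")) == 2022) \
  .groupBy("user_id").agg(count("*").alias("bucket")) \
  .groupBy("bucket").agg(count("*").alias("user_num")) \
  .orderBy("bucket").show()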


PYSPARK LEARNING HUB : DAY - 35

Step - 1 : Problem Statement

35_Classes More Than 5 Students


Write PySpark code to find all the classes that have at
least five students. Return the result table in any order.

Difficulty Level : EASY
DataFrame:
# Define the schema for the DataFrame
schema = StructType([
StructField("StudentID", StringType(), True),
StructField("ClassName", StringType(), True)
])
# Data to be inserted into the DataFrame
data = [
('A', 'Math'),
('B', 'English'),
('C', 'Math'),
('D', 'Biology'),
('E', 'Math'),
('F', 'Computer'),
('G', 'Math'),
('H', 'Math'),
('I', 'Math')
]

Step - 2 : Identifying The Input Data And Expected Output

INPUT

StudentID  ClassName
A          Math
B          English
C          Math
D          Biology
E          Math
F          Computer
G          Math
H          Math
I          Math

OUTPUT

ClassName
Math

Step - 3 : Writing the pyspark code to solve the problem
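
The solution is missing in this copy; a minimal sketch, assuming the data and schema above and an active SparkSession:

from pyspark.sql.functions import col, count

df = spark.createDataFrame(data, schema)

# Count students per class and keep the classes with at least five
df.groupBy("ClassName").agg(count("StudentID").alias("Student_Count")) \
  .filter(col("Student_Count") >= 5) \
  .select("ClassName").show()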


PYSPARK LEARNING HUB : DAY - 36

Step - 1 : Problem Statement

36_Rank Scores Problem


Write PySpark code to rank scores. If there is a tie between
two scores, both should have the same ranking. Note that
after a tie, the next ranking number should be the next
consecutive integer value. In other words, there should be no
“holes” between ranks.

Difficulty Level : MEDIUM


DataFrame:
# Define the schema for the DataFrame
schema = StructType([
StructField("Id", IntegerType(), True),
StructField("Score", FloatType(), True)
])

# Data to be inserted into the DataFrame


data = [
(1, 3.50),
(2, 3.65),
(3, 4.00),
(4, 3.85),
(5, 4.00),
(6, 3.65)
]


Step - 2 : Identifying The Input Data And Expected Output
INPUT

Id  Score
1   3.50
2   3.65
3   4.00
4   3.85
5   4.00
6   3.65

OUTPUT

Score  Rank
4.00   1
4.00   1
3.85   2
3.65   3
3.65   3
3.50   4


Step - 3 : Writing the pyspark code to solve the problem
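
The solution pages are missing; dense_rank matches the "no holes" requirement exactly, assuming the data and schema above and an active SparkSession:

from pyspark.sql.functions import col, dense_rank
from pyspark.sql.window import Window

df = spark.createDataFrame(data, schema)

# Tied scores share a rank, and the next rank is the next consecutive integer
w = Window.orderBy(col("Score").desc())
df.withColumn("Rank", dense_rank().over(w)).select("Score", "Rank").show()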


PYSPARK LEARNING HUB : DAY - 37

Step - 1 : Problem Statement

37_Triangle Judgement Problem

A pupil, Tim, has homework to identify whether three line segments could
possibly form a triangle. However, the assignment is heavy because there are
hundreds of records to check. Help Tim by writing PySpark code that judges
whether the three sides can form a triangle, assuming the DataFrame holds
the lengths of the three sides in columns x, y and z.

Difficulty Level : EASY


DataFrame:
# Define the schema for the DataFrame
schema = StructType([
StructField("x", IntegerType(), True),
StructField("y", IntegerType(), True),
StructField("z", IntegerType(), True)
])

# Data to be inserted into the DataFrame


data = [
(13, 15, 30),
(10, 20, 15)
]


Step - 2 : Identifying The Input Data And Expected Output
INPUT

x   y   z
13  15  30
10  20  15

OUTPUT

x   y   z   triangle
13  15  30  No
10  20  15  Yes


Step - 3 : Writing the pyspark code to solve the problem
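
The solution is missing here; the triangle inequality in a single when expression, assuming the data and schema above and an active SparkSession:

from pyspark.sql.functions import col, when

df = spark.createDataFrame(data, schema)

# Each pair of sides must sum to more than the third side
df.withColumn(
    "triangle",
    when((col("x") + col("y") > col("z")) &
         (col("x") + col("z") > col("y")) &
         (col("y") + col("z") > col("x")), "Yes").otherwise("No")
).show()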


PYSPARK LEARNING HUB : DAY - 38

Step - 1 : Problem Statement

38_Biggest Single Number Problem

The DataFrame contains many numbers in column num, including duplicates.

Write PySpark code to find the biggest number that appears only once.

Difficulty Level : EASY


DataFrame:
# Define the schema for the DataFrame
schema = StructType([StructField("num", IntegerType(), True)])

# Your data
data = [(8,), (8,), (3,), (3,), (1,), (4,), (5,), (6,)]

# Create a PySpark DataFrame


df = spark.createDataFrame(data, schema=schema)


Step - 2 : Identifying The Input Data And Expected Output
INPUT

num
8
8
3
3
1
4
5
6

OUTPUT

num
6


Step - 3 : Writing the pyspark code to solve the problem
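
The solution pages are missing; a minimal sketch, assuming the df created above:

from pyspark.sql.functions import col, count, max as max_

# Keep only the numbers that appear exactly once, then take their maximum
df.groupBy("num").agg(count("*").alias("cnt")) \
  .filter(col("cnt") == 1) \
  .agg(max_("num").alias("num")).show()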


PYSPARK LEARNING HUB : DAY - 39

Step - 1 : Problem Statement

39_Not Boring Movies Problem


X city opened a new cinema, and many people would like to go. The cinema
gives out a poster indicating the movies’ ratings and descriptions. Write
PySpark code to output movies with an odd-numbered ID and a description
that is not ‘boring’. Order the result by rating.

Difficulty Level : EASY


DataFrame:
# Define the schema for the DataFrame
schema = StructType([
StructField("id", IntegerType(), True),
StructField("movie", StringType(), True),
StructField("description", StringType(), True),
StructField("rating", FloatType(), True)
])

# Your data
data = [
(1, "War", "great 3D", 8.9),
(2, "Science", "fiction", 8.5),
(3, "Irish", "boring", 6.2),
(4, "Ice song", "Fantasy", 8.6),
(5, "House card", "Interesting", 9.1)
]


Step - 2 : Identifying The Input Data And Expected Output
INPUT

id  movie       description  rating
1   War         great 3D     8.9
2   Science     fiction      8.5
3   Irish       boring       6.2
4   Ice song    Fantasy      8.6
5   House card  Interesting  9.1

OUTPUT

id  movie       description  rating
5   House card  Interesting  9.1
1   War         great 3D     8.9


Step - 3 : Writing the pyspark code to solve the problem
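
The solution did not survive in this copy; one way to write the filter, assuming the data and schema above and an active SparkSession (descending rating is assumed, as in the usual version of this problem):

from pyspark.sql.functions import col

df = spark.createDataFrame(data, schema)

# Odd id (id % 2 == 1) and a description other than 'boring',
# ordered by rating from highest to lowest
df.filter((col("id") % 2 == 1) & (col("description") != "boring")) \
  .orderBy(col("rating").desc()).show()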


PYSPARK LEARNING HUB : DAY - 40

Step - 1 : Problem Statement

40_Swap Gender Problem


Given a salary DataFrame, such as the one below, that has m = male
and f = female values, swap all f and m values (i.e., change all f
values to m and vice versa).

Difficulty Level : EASY


DataFrame:
# Define the schema for the DataFrame
schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("sex", StringType(), True),
StructField("salary", IntegerType(), True),
])

# Define the data


data = [
(1, "A", "m", 2500),
(2, "B", "f", 1500),
(3, "C", "m", 5500),
(4, "D", "f", 500),
]


Step - 2 : Identifying The Input Data And Expected Output
INPUT

id  name  sex  salary
1   A     m    2500
2   B     f    1500
3   C     m    5500
4   D     f    500

OUTPUT

id  name  sex  salary
1   A     f    2500
2   B     m    1500
3   C     f    5500
4   D     m    500


Step - 3 : Writing the pyspark code to solve the problem
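
The solution is missing here; a single conditional update does the swap, assuming the data and schema above and an active SparkSession:

from pyspark.sql.functions import col, when

df = spark.createDataFrame(data, schema)

# Change every 'm' to 'f' and everything else (here only 'f') to 'm'
df.withColumn("sex", when(col("sex") == "m", "f").otherwise("m")).show()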
