0% found this document useful (0 votes)
90 views

Pyspark Learning Hub

Uploaded by

Sozha Vendhan
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
90 views

Pyspark Learning Hub

Uploaded by

Sozha Vendhan
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 7

PYSPARK LEARNING HUB : DAY - 9

WWW.LINKEDIN.COM/IN/AKASHMAHINDRAKAR
PYSPARK LEARNING HUB : DAY - 9

Step - 1 : Problem Statement

09_Game Play Analysis II


Write a pyspark code that reports the device that is first
logged in for each player.
Return the result table in any order.
Difficult Level : EASY
DataFrame:
# Define the schema for the "Activity"
activity_schema = StructType([
StructField("player_id", IntegerType(), True),
StructField("device_id", IntegerType(), True),
StructField("event_date", StringType(), True),
StructField("games_played", IntegerType(), True)
])

# Define data for the "Activity"


activity_data = [
(1, 2, '2016-03-01', 5),
(1, 2, '2016-05-02', 6),
(2, 3, '2017-06-25', 1),
(3, 1, '2016-03-02', 0),
(3, 4, '2018-07-03', 5)
]

WWW.LINKEDIN.COM/IN/AKASHMAHINDRAKAR
PYSPARK LEARNING HUB : DAY - 9

Step - 2 : Identifying The Input Data And Expected


Output
INPUT

INPUT
PLAYER_ID DEVICE_ID EVENT_DATE GAMES_PLAYED
1 2 2016-03-01 5
1 2 2016-05-02 6
2 3 2017-06-25 1
3 1 2016-03-02 0
3 4 2018-07-03 5

OUTPUT

OUTPUT
PLAYER_ID DEVICE_ID
1 2
2 3
3 1

WWW.LINKEDIN.COM/IN/AKASHMAHINDRAKAR
PYSPARK LEARNING HUB : DAY - 9

Step - 3 : Writing the pyspark code to solve


the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import
StructType,StructField,IntegerType,StringType

#creating spark session


spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()

# Define the schema for the "Activity"


activity_schema = StructType([
StructField("player_id", IntegerType(), True),
StructField("device_id", IntegerType(), True),
StructField("event_date", StringType(), True),
StructField("games_played", IntegerType(), True)
])

# Define data for the "Activity"


activity_data = [
(1, 2, '2016-03-01', 5),
(1, 2, '2016-05-02', 6),
(2, 3, '2017-06-25', 1),
(3, 1, '2016-03-02', 0),
(3, 4, '2018-07-03', 5)
]

WWW.LINKEDIN.COM/IN/AKASHMAHINDRAKAR
PYSPARK LEARNING HUB : DAY - 9

# Create a PySpark DataFrame


df=spark.createDataFrame(activity_data,activity_schema)
df.show()

rank_df=df.withColumn("rk",rank().over(Window.partitionBy(df["pla
yer_id"]).orderBy(df["event_date"])))
rank_df.show()

WWW.LINKEDIN.COM/IN/AKASHMAHINDRAKAR
PYSPARK LEARNING HUB : DAY - 9

rank_df.filter(rank_df["rk"] ==
1).select("player_id","device_id").show()

WWW.LINKEDIN.COM/IN/AKASHMAHINDRAKAR
PYSPARK LEARNING HUB : DAY - 9

WWW.LINKEDIN.COM/IN/AKASHMAHINDRAKAR

You might also like