
CS/2018/042

44033 Big Data Analysis


Assignment 01

G.D.Vindika shehan


Question 01
(a) Consider the data file ‘Words.txt’ and write PySpark code segments to do the following.
(b) Read the data in the file and count how many times each word appears using RDDs.

Code
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Read the file and split each line into words
input_rdd = sc.textFile("sample_data/Words.txt")
words_rdd = input_rdd.flatMap(lambda line: line.split(" "))

# countByValue() returns a {word: count} dict to the driver
word_count_rdd = words_rdd.countByValue()
for value, count in word_count_rdd.items():
    print(f"{value}: {count}")

Output

Question 02
(a) Create a DataFrame from the following employee data.
FName LName Age Salary Country
Jane Doe 34 123900 USA
Harvey Spectur 28 234590 USA
Jaing Xu 32 1090890 China
Won Gu 25 1903490 China

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("DataFrame to Table Example").getOrCreate()

employeeData = [("Jane", "Doe", 34, 123900, "USA"),
                ("Harvey", "Spectur", 28, 234590, "USA"),
                ("Jaing", "Xu", 32, 1090890, "China"),
                ("Won", "Gu", 25, 1903490, "China")]

# Define the schema
tableStructure = StructType([
    StructField("FName", StringType(), True),
    StructField("LName", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Salary", IntegerType(), True),
    StructField("Country", StringType(), True)
])

# Create a DataFrame
dataFrame = spark.createDataFrame(employeeData, tableStructure)

(b) Write PySpark code to retrieve the following.

(i) Display the schema of the DataFrame created in part (a).

Code
dataFrame.printSchema()

Output

(ii) Filter employees whose age is above 30 years.

Code
# Note: this query needs the temporary view 'people',
# which is registered in part (iii) below.
result1 = spark.sql("SELECT * FROM people WHERE Age > 30")
result1.show()

Output

(iii) Create a temporary SQL table from the DataFrame created.

Code
dataFrame.createOrReplaceTempView("people")

(iv) Write SparkSQL to retrieve employees who earn the highest salary among the ones who live in USA.

Code
result2 = spark.sql("SELECT * FROM people WHERE Country='USA' ORDER BY Salary DESC").limit(1)
result2.show()

Output

Full Code

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("DataFrame to Table Example").getOrCreate()

employeeData = [("Jane", "Doe", 34, 123900, "USA"),
                ("Harvey", "Spectur", 28, 234590, "USA"),
                ("Jaing", "Xu", 32, 1090890, "China"),
                ("Won", "Gu", 25, 1903490, "China")]

# Define the schema
tableStructure = StructType([
    StructField("FName", StringType(), True),
    StructField("LName", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Salary", IntegerType(), True),
    StructField("Country", StringType(), True)
])

# Create a DataFrame
dataFrame = spark.createDataFrame(employeeData, tableStructure)

dataFrame.printSchema()

# Register a temporary view so the DataFrame can be queried with SQL
dataFrame.createOrReplaceTempView("people")

result1 = spark.sql("SELECT * FROM people WHERE Age > 30")
result1.show()

result2 = spark.sql("SELECT * FROM people WHERE Country='USA' ORDER BY Salary DESC").limit(1)
result2.show()

Output
