Unit 4: Spark SQL

The document outlines various methods to create DataFrames in Spark, including converting RDDs to DataFrames and reading from CSV files. It also provides examples of using Spark SQL for querying data, performing joins, and aggregating results. Additionally, it includes practice questions with solutions related to DataFrame operations and SQL queries in Spark.

Uploaded by varunjas21

Methods to create a DataFrame
Example
import spark.implicits._   // import implicit conversions (needed for toDF)

val col = Seq("team", "matches")   // column names

val data = Seq(("India", 300), ("Australia", 280), ("England", 275))   // sample data

Let's create an RDD from the above data:

val rdd = spark.sparkContext.parallelize(data)
Method 1: toDF()
val df1 = rdd.toDF()   // convert the RDD to a DataFrame

df1.show()   // display the data, similar to take() on an RDD

This gives the output in rows and columns, but the columns get the default
names _1 and _2 because we have not provided any.

val df1 = rdd.toDF("team", "matches")

df1.show()   // DataFrame with column names


Method 2: createDataFrame()
val df2 = spark.createDataFrame(rdd).toDF(col: _*)

Here rdd is the input data, and col is the sequence of column names passed to toDF().

df2.show()   // displays the resulting DataFrame


Method 3: Create a DataFrame from a CSV file
val df3 = spark.read.csv("file path")
Example:
val df3 = spark.read.csv("C:/Users/Dell/Desktop/a.csv")
df3.show()
• Note:
To create a DataFrame from a JSON file, use .json instead of .csv in the
syntax above. For a text file, use .text.
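As a sketch of the note above (the file paths are hypothetical placeholders):

```scala
// Read a JSON file (by default Spark expects one JSON object per line)
val dfJson = spark.read.json("C:/Users/Dell/Desktop/a.json")
dfJson.show()

// Read a plain text file; each line becomes a row in a single "value" column
val dfText = spark.read.text("C:/Users/Dell/Desktop/a.txt")
dfText.show()
```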
Create a SparkSession
// Import SparkSession
import org.apache.spark.sql.SparkSession

Enter paste mode in the Spark shell using :paste

// Create a SparkSession
val spark = SparkSession.builder().master("local[1]")
  .appName("SparkExample")
  .getOrCreate()
Create a DataFrame from CSV

Execute the code below in paste mode:

val df = spark.read.option("header", true)
  .csv("C:/Users/Dell/Desktop/a.csv")
df.printSchema()
df.show()
Convert a CSV file into a table using views
• Execute the code below in paste mode:

df.createOrReplaceTempView("Zipcodes")
Use of SQL Queries in Spark
• Execute the code below in paste mode:

spark.sql("SELECT country, city, zipcode FROM ZIPCODES")
  .show(5)

spark.sql(""" SELECT country, city, zipcode FROM ZIPCODES
              WHERE country = 'US' """)
  .show(5)

spark.sql(""" SELECT country, city, zipcode FROM ZIPCODES
              WHERE country IN ('IND') ORDER BY city """)
  .show(5)

spark.sql(""" SELECT city, COUNT(*) AS count FROM ZIPCODES
              GROUP BY city """)
  .show()
Practice Question 1
Consider a dataset named employee_data with the following schema:
| Name | Age | Department | Salary |
|----------|-----|------------|--------|
| John | 30 | IT | 50000 |
| Alice | 35 | HR | 60000 |
| Bob | 40 | Finance | 70000 |
Write SparkSQL queries to perform the following tasks:
1. Load the data into a Spark DataFrame.
2. Find the average age of employees.
3. Retrieve the names of employees who belong to the Finance
department.
4. Calculate the total salary budget of the company.
Solution
import spark.implicits._
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().master("local[1]")
.appName("SparkExample")
.getOrCreate()
val rdd = spark.sparkContext.parallelize(Seq(
("John", 30, "IT", 50000),
("Alice", 35, "HR", 60000),
("Bob", 40, "Finance", 70000)
))
val df = rdd.toDF("Name", "Age", "Department", "Salary")
Solution (Cont.)
df.createOrReplaceTempView("emp")

val avgAge = spark.sql("SELECT AVG(Age) AS Avg_Age FROM emp")
avgAge.show()

val financeEmployees = spark.sql("SELECT Name FROM emp WHERE Department = 'Finance'")
financeEmployees.show()

val totalSalary = spark.sql("SELECT SUM(Salary) AS Total_Salary_Budget FROM emp")
totalSalary.show()
Practice Question 2
Consider a dataset named sales_data with the following schema:
| Transaction_ID | Product | Amount |
|----------------|---------|--------|
| 1              | Laptop  | 1000   |
| 2              | Phone   | 500    |
| 3              | Laptop  | 1200   |
| 4              | TV      | 800    |
Write SparkSQL queries to perform the following tasks:
1. Load the data into a Spark DataFrame.
2. Find the total amount earned from Laptop sales.
3. Calculate the average amount of all transactions.
4. Count the number of transactions where the amount is greater than 1000.
Solution
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
.appName("SparkSQL Example")
.master("local[*]")
.getOrCreate()

import spark.implicits._
// Use tuples (not Row) so the RDD can be converted with toDF()
val rdd = spark.sparkContext.parallelize(Seq(
  (1, "Laptop", 1000),
  (2, "Phone", 500),
  (3, "Laptop", 1200),
  (4, "TV", 800)
))
Solution (Contd.)
val df = rdd.toDF("id", "product", "amount")

df.createOrReplaceTempView("sales_data")

val laptopSales = spark.sql("SELECT SUM(amount) AS Total_Laptop_Sales FROM sales_data WHERE product = 'Laptop'")
laptopSales.show()

val avgAmount = spark.sql("SELECT AVG(amount) AS Average_Transaction_Amount FROM sales_data")
avgAmount.show()

val countTransactions = spark.sql("SELECT COUNT(*) AS Transactions_Greater_Than_1000 FROM sales_data WHERE amount > 1000")
countTransactions.show()
Joins in Spark SQL
Example
• Let's create the first DataFrame:

val emp1 = sc.parallelize(List((10, "Inventory", "Hybd"),
  (20, "Finance", "bglr"), (30, "HR", "Mumbai"),
  (40, "Admin", "che"))).toDF("Deptno", "Dname", "Loc")

emp1.show()
Example (continued)
• Let's create the second DataFrame:

val emp2 = sc.parallelize(List((111, "Saketh", "analyst", 444, 10),
  (222, "Sudha", "clerk", 333, 20), (333, "Jagan", "Manager", 111, 10),
  (444, "madhu", "engineer", 222, 40)))
  .toDF("Empno", "Ename", "job", "Mgr", "DeptNo")

emp2.show()
Inner Join
• An inner join of emp1 and emp2 returns the rows common to both tables:

emp1.join(emp2, "Deptno").show()

OR

emp1.join(emp2, Seq("Deptno"), "inner").show()

• Both forms perform an inner join.
Left Outer Join
• It displays every row of the left table along with the matches.

emp1.join(emp2, Seq("Deptno"), "left_outer").show()

• This displays the inner-join result plus the unmatched rows of the left
table.
Right Outer Join
• It displays every row of the right table along with the matches.

emp1.join(emp2, Seq("Deptno"), "right_outer").show()

• This displays the inner-join result plus the unmatched rows of the right
table.
Full Outer Join
• It displays the content of both tables.

emp1.join(emp2, Seq("Deptno"), "full_outer").show()

• This displays all rows of both emp1 and emp2, matched where possible.
Cross Join
• A cross join is generally not recommended because of its cost on large
tables.

emp1.crossJoin(emp2).show()

• This join pairs every row of one table with every row of the other (a
Cartesian product), which usually produces far more rows than needed.
NOTE
• All the previous joins rely on a column common to both tables.

• If two tables share no common column, we cannot use the shorthand joins
above. Instead, we join the tables on an explicit condition.

• In a self join, a table is joined to itself.
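As a sketch of joining on an explicit condition (assuming the emp1 and emp2 DataFrames defined earlier; pairing Mgr with Empno in the self join is an illustrative assumption):

```scala
import org.apache.spark.sql.functions.col

// Join on an explicit condition rather than a shared column name.
val byCondition = emp1.join(emp2, emp1("Deptno") === emp2("DeptNo"), "inner")
byCondition.show()

// Self join: alias the same DataFrame twice, then join each employee
// to the row of their manager (Mgr matched against Empno).
val selfJoined = emp2.as("e").join(emp2.as("m"), col("e.Mgr") === col("m.Empno"))
selfJoined.show()
```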


Practice Question
Create two tables:
employees:
| employee_id | name    | department_id | salary |
|-------------|---------|---------------|--------|
| 1           | Alice   | 101           | 60000  |
| 2           | Bob     | 102           | 55000  |
| 3           | Charlie | 101           | 62000  |
| 4           | David   | 103           | 58000  |
departments:
| department_id | department_name |
|---------------|-----------------|
| 101 | HR |
| 102 | Finance |
| 103 | Marketing |
1) Creating the employees table:
val employeesData = Seq(
(1, "Alice", 101, 60000),
(2, "Bob", 102, 55000),
(3, "Charlie", 101, 62000),
(4, "David", 103, 58000)
)
val employeesDF = spark.createDataFrame(employeesData).toDF("employee_id",
"name", "department_id", "salary")

employeesDF.createOrReplaceTempView("employees")
2) Creating the departments table:
val departmentsData = Seq(
(101, "HR"),
(102, "Finance"),
(103, "Marketing")
)
val departmentsDF = spark.createDataFrame(departmentsData)
  .toDF("department_id", "department_name")

departmentsDF.createOrReplaceTempView("departments")
Practice Question 1
1. Retrieve all the columns from the employees and departments
tables where there is a matching key 'department_id' between
them.
Solution ( inner join)
val innerJoinResult = spark.sql(
"""
SELECT e.*, d.*
FROM employees e
INNER JOIN departments d
ON e.department_id = d.department_id
"""
)
innerJoinResult.show()
Practice Question 2
• Retrieve all records from the employees table and matching records
from the departments table based on the key 'department_id'. If
there is no match in the departments table, display NULL values for
the corresponding department columns.
Solution (Left Outer Join)

val leftOuterJoinResult = spark.sql(
  """
SELECT e.*, d.*
FROM employees e
LEFT OUTER JOIN departments d
ON e.department_id = d.department_id
"""
)
leftOuterJoinResult.show()
Practice Question 3
• Retrieve all records from the departments table and matching records
from the employees table based on the key 'department_id'. If there
is no match in the employees table, display NULL values for the
corresponding employee columns.
Solution ( Right Outer Join)
val rightOuterJoinResult = spark.sql(
"""
SELECT e.*, d.*
FROM employees e
RIGHT OUTER JOIN departments d
ON e.department_id = d.department_id
"""
)
rightOuterJoinResult.show()
Practice Question 4
• Retrieve all records from both the employees and departments
tables. If there's no match for a record in either table, display NULL
values for the corresponding columns.
Solution ( Full Outer Join)

val fullOuterJoinResult = spark.sql(
  """
SELECT e.*, d.*
FROM employees e
FULL OUTER JOIN departments d
ON e.department_id = d.department_id
"""
)
fullOuterJoinResult.show()
Practice Question 5
• Generate all possible combinations of records between the
employees and departments tables.
Solution (Cross Join)
val crossJoinResult = spark.sql(
"""
SELECT e.*, d.*
FROM employees e
CROSS JOIN departments d
"""
)
crossJoinResult.show()
Practice Question 6
• Retrieve records from the employees table where the value in column
'salary' is greater than 50000 and there is a matching record in the
departments table based on the key 'department_id'.
Solution
val filteredJoinResult = spark.sql(
"""
SELECT e.*, d.*
FROM employees e
JOIN departments d
ON e.department_id = d.department_id
WHERE e.salary > 50000
"""
)
filteredJoinResult.show()
Practice Question 7
• Join the employees, departments, and managers tables based on the
keys 'department_id' and 'manager_id', where
employees.department_id = departments.department_id and
employees.manager_id = managers.manager_id.
Solution
(This assumes the employees data also includes a manager_id column and that a
managers DataFrame has already been created.)
managers.createOrReplaceTempView("managers")

val multipleJoinResult = spark.sql(
  """
SELECT e.*, d.*, m.*
FROM employees e
JOIN departments d ON e.department_id = d.department_id
JOIN managers m ON e.manager_id = m.manager_id
"""
)
multipleJoinResult.show()
Practice Question 8
• Join the employees table to itself to compare records based on the
manager's ID.
Solution
(As in Question 7, this assumes the employees table has a manager_id column.)
val selfJoinResult = spark.sql(
"""
SELECT e1.*, e2.*
FROM employees e1
JOIN employees e2 ON e1.manager_id = e2.employee_id
"""
)
selfJoinResult.show()
Practice Question 9
• Join the employees and departments tables and calculate the average
salary for each department.
Solution

val aggregationJoinResult = spark.sql(
  """
SELECT d.department_id, d.department_name, AVG(e.salary) AS avg_salary
FROM employees e
JOIN departments d ON e.department_id = d.department_id
GROUP BY d.department_id, d.department_name
"""
)
aggregationJoinResult.show()
Practice Question 10
• Join the employees and departments tables with aliases for better
readability and clarity.
Solution
val aliasJoinResult = spark.sql(
"""
SELECT e.*, d.*
FROM employees e
JOIN departments d
ON e.department_id = d.department_id
"""
)
aliasJoinResult.show()
