Unit 4: Spark SQL
Convert an RDD to a Data Frame
Example
import spark.implicits._ // import implicits so that toDF() is available on the RDD
Calling rdd.toDF() with no arguments returns the data as rows and columns, but the
columns get default names (_1, _2, ...) because we have not specified any. Column names can be passed explicitly:
val df1 = rdd.toDF("team", "matches")
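A minimal end-to-end sketch of the conversion, assuming an existing SparkSession named spark and hypothetical (team, matches) data that the original slide does not show:
import spark.implicits._
// hypothetical data, assumed for illustration
val rdd = spark.sparkContext.parallelize(Seq(("India", 5), ("Australia", 4)))
val df0 = rdd.toDF()                   // default column names _1, _2
val df1 = rdd.toDF("team", "matches")  // explicit column names
df0.printSchema()
df1.show()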
// Create a SparkSession (the entry point for the DataFrame and SQL APIs)
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().master("local[1]")
.appName("SparkExample")
.getOrCreate()
Create a Data Frame from CSV
val df = spark.read.option("header",true)
.csv("C:/Users/Dell/Desktop/a.csv")
df.printSchema()
df.show()
Convert a CSV File into a Table Using Views
• Execute the code below in paste mode:
df.createOrReplaceTempView("Zipcodes")
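Once the view exists, it can be queried with ordinary SQL. A quick sanity check (the column names in a.csv are not shown, so the query simply selects everything):
spark.sql("SELECT * FROM Zipcodes").show()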
Use of SQL Queries in Spark
• Execute the code below in paste mode:
import spark.implicits._
// use tuples (not Row objects) so that toDF() can infer the schema
val rdd = spark.sparkContext.parallelize(Seq(
(1, "Laptop", 1000),
(2, "Phone", 500),
(3, "Laptop", 1200),
(4, "TV", 800)
))
Solution (Contd.)
val df=rdd.toDF("id","product","amount")
df.createOrReplaceTempView("sales_data")
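The query that completes this solution is not shown on the slide; a plausible sketch, assuming the task is to total the sales amount per product:
val result = spark.sql("""
SELECT product, SUM(amount) AS total_amount
FROM sales_data
GROUP BY product
""")
result.show()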
emp1.show()
Example (continued)
• Let's create the second table, emp2:
emp2.show()
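The creation code for emp1 and emp2 is not shown on these slides; a minimal sketch with assumed data, giving both tables a Deptno column (and deliberately leaving one department unmatched on each side so that the outer joins differ from the inner join):
import spark.implicits._
// assumed employee table: (EmpNo, EmpName, Deptno)
val emp1 = Seq(
(1, "Amit", 10),
(2, "Neha", 20),
(3, "Ravi", 30)
).toDF("EmpNo", "EmpName", "Deptno")
// assumed department table: (Deptno, DeptName)
val emp2 = Seq(
(10, "Sales"),
(20, "HR"),
(40, "IT")
).toDF("Deptno", "DeptName")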
Inner Join
• An inner join of emp1 and emp2 returns only the rows whose Deptno value
appears in both tables:
emp1.join(emp2, "Deptno").show()
OR
emp1.join(emp2, Seq("Deptno"), "inner").show()
• Both forms perform an inner join.
Left Outer Join
• A left outer join returns every row of the left table together with the matching rows from the right table.
emp1.join(emp2, Seq("Deptno"), "left_outer").show()
• This displays the inner-join result plus the remaining rows of the left table, with NULLs in the right-hand columns.
Right Outer Join
• A right outer join returns every row of the right table together with the matching rows from the left table.
emp1.join(emp2, Seq("Deptno"), "right_outer").show()
• This displays the inner-join result plus the remaining rows of the right table, with NULLs in the left-hand columns.
Full Outer Join
• A full outer join returns all rows from both tables; where there is no match, the columns from the other table are NULL.
emp1.join(emp2, Seq("Deptno"), "full_outer").show()
Cross Join
emp1.crossJoin(emp2).show()
• A cross join pairs every row of one table with every row of the other (a Cartesian product), which usually produces far more rows than are actually needed.
NOTE
• All of the previous joins rely on a column that is common to both tables.
• If two tables do not share a common column, they cannot be joined with the shorthand forms above; instead, the join must be expressed with an explicit condition, as sketched below.
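When the key columns have different names, the join condition is written explicitly instead of passing a column name. A sketch, assuming the emp1 table above and a hypothetical dept2 table whose key column is called DeptId:
import spark.implicits._
val dept2 = Seq((10, "Pune"), (20, "Mumbai")).toDF("DeptId", "Location")
// explicit join condition instead of a shared column name
emp1.join(dept2, emp1("Deptno") === dept2("DeptId"), "inner").show()
The practice questions below use an employees table whose creation is not shown on these slides. A sketch of what it might look like, with assumed columns matching the queries that follow (one department_id deliberately has no match in departments):
1) Creating the employees table:
val employeesData = Seq(
(1, "Alice", 101),
(2, "Bob", 102),
(3, "Charlie", 104)
)
val employeesDF = spark.createDataFrame(employeesData).toDF("employee_id", "name", "department_id")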
employeesDF.createOrReplaceTempView("employees")
2) Creating the departments table:
val departmentsData = Seq(
(101, "HR"),
(102, "Finance"),
(103, "Marketing")
)
val departmentsDF = spark.createDataFrame(departmentsData).toDF("department_id", "department_name")
departmentsDF.createOrReplaceTempView("departments")
Practice Question 1
1. Retrieve all the columns from the employees and departments
tables where there is a matching key 'department_id' between
them.
Solution (Inner Join)
val innerJoinResult = spark.sql(
"""
SELECT e.*, d.*
FROM employees e
INNER JOIN departments d
ON e.department_id = d.department_id
"""
)
innerJoinResult.show()
Practice Question 2
• Retrieve all records from the employees table and matching records
from the departments table based on the key 'department_id'. If
there is no match in the departments table, display NULL values for
the corresponding department columns.
Solution (Left Outer Join)
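The query for this solution is not shown on this slide; a sketch following the same pattern as the inner-join solution above:
val leftJoinResult = spark.sql(
"""
SELECT e.*, d.*
FROM employees e
LEFT OUTER JOIN departments d
ON e.department_id = d.department_id
"""
)
leftJoinResult.show()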