Spark
Spark builds a DAG of stages for each job. Narrow transformations (eg map) stay within the same stage.
If a task needs to shuffle data between the nodes, a new stage is created - eg countByValue.
Each stage is broken into tasks, which may be distributed across the cluster.
Finally, the tasks are scheduled across the cluster and executed.
Maps can also create key-value pairs - rdd.map(x => (x, 1))
MyAvgFriends -
val mapval = filtereddata.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)) // (age, (totalFriends, count))
val xyz = mapval.mapValues(x => x._1 / x._2) // (age, average number of friends)
2. groupByKey() - group values with the same key
5. SQL-style joins - join, rightOuterJoin, leftOuterJoin, cogroup, subtractByKey (see the sketch below)
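A minimal sketch of the SQL-style joins on pair RDDs (the data here is made up):
val ages = sc.parallelize(Seq(("Will", 33), ("Jean-Luc", 26)))
val friends = sc.parallelize(Seq(("Will", 385), ("Picard", 2)))
val joined = ages.join(friends)            // only keys present in both sides: RDD[(String, (Int, Int))]
val withAll = ages.leftOuterJoin(friends)  // keeps every key from ages; values become (Int, Option[Int])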
Wordcount program -
val lines = sc.textFile("file.txt")          // gives each line from the file
val data2 = lines.flatMap(x => x.split(" ")) // split each line into words
// Alternative: count with reduceByKey and sort by count
// val data3 = data2.map(x => (x, 1))
// val data4 = data3.reduceByKey((x, y) => x + y).sortBy(_._2)
val data4 = data2.countByValue()
flatMap - maps each element of the collection to zero or more new elements.
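Case classes compare by value, so two instances with the same field values are equal. A minimal sketch (the Person case class and the instances here are assumed):
case class Person(name: String, age: Int)
val person1 = Person("Will", 33)
val person2 = Person("Will", 33)      // same field values as person1
val person3 = Person("Jean-Luc", 26)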
println(person1.equals(person2)) // true
println(person1.equals(person3)) // false
Broadcast variable - if the dataset is small enough to fit in memory, we can load it in the driver program and Spark automatically forwards it to each executor when needed.
sc.broadcast() to broadcast the object
.value to get the object back on the executors
But if the tables are massive, we'd only want to transfer them once to each executor and keep them there.
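A minimal sketch of a broadcast lookup (the lookup map and ratings data here are made up):
val movieNames = Map(1 -> "Toy Story", 2 -> "GoldenEye")   // small lookup table loaded in the driver
val nameDict = sc.broadcast(movieNames)
val ratings = sc.parallelize(Seq((1, 5.0), (2, 3.5)))
val withNames = ratings.map { case (movieID, rating) =>
  (nameDict.value(movieID), rating)                        // .value gets the broadcast object back on the executor
}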
UDF - In Spark Scala, a User-Defined Function (UDF) allows you to define custom functions
that can be applied to DataFrame columns or used in SQL expressions. UDFs provide flexibility
in performing custom operations on data within Spark.
import org.apache.spark.sql.functions.{udf, col}
// Define the UDF
val myUDF = udf((input: DataType) => {
  // Custom logic to process the input and return a result
  result
})
val df: DataFrame = ... // Your DataFrame
val resultDF = df.withColumn("newColumn", myUDF(col("inputColumn")))
resultDF.show()
Example -
val squareUDF = udf((x: Int) => x * x)
// Apply the UDF to a column
val resultDF = df.withColumn("squared", squareUDF(df("number")))
Spark-submit
1. Create the jar: Project Structure => Artifacts => include dependencies + code (without the Spark libs) => Build
2. This will create an out folder with the .jar file in it.
3. Navigate to the folder where your jar is placed (on Ubuntu) and run the spark-submit command -
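For example (the class and jar names here are placeholders):
spark-submit --class com.sundogsoftware.MovieSimilarities1MDataset MovieSimilarities1MDataset.jar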
By SBT -
Remove code that is not suitable for the cluster, e.g. the hard-coded master local[*] (it can't take advantage of the cluster) and local file paths.
build.sbt file -
name := "MovieSimilarities1MDataset"
version := "1.0"
organization := "com.sundogsoftware"
scalaVersion := "2.11.8"
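The Spark dependency itself also needs to be declared; a minimal sketch (the version is an assumption compatible with Scala 2.11, and "provided" is used because the cluster supplies Spark at runtime):
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.8" % "provided"
)
Then build the jar with sbt package (or sbt assembly if the assembly plugin is set up) and submit it with spark-submit as above.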
RDD vs DataFrame vs Dataset -

|                     | RDD                                    | DataFrame                             | Dataset                              |
|---------------------|----------------------------------------|---------------------------------------|--------------------------------------|
| API                 | Lower-level API with more flexibility  | Higher-level API with ease of use     | Higher-level API with type safety    |
| Serialization       | Uses Java serialization by default     | Uses optimized serialization formats  | Uses optimized serialization formats |
| Interoperability    | Compatible with any JVM language       | Compatible with any JVM language      | Compatible with any JVM language     |
| Type Safety         | Not type-safe                          | Partially type-safe                   | Type-safe (based on DataFrame)       |
| Compile-time checks | No                                     | No                                    | Yes                                  |
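A minimal sketch of the type-safety difference (the Friend case class and the data here are made up):
import spark.implicits._
case class Friend(name: String, age: Int)
val ds = Seq(Friend("Will", 33)).toDS()
val df = ds.toDF()
df.select("agee")        // a column-name typo only fails at runtime (AnalysisException)
ds.map(f => f.age + 1)   // field access on the Dataset is checked at compile time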
RDD creation -
import org.apache.spark.SparkContext
1. From an in-memory collection:
val data = Seq(1, 2, 3, 4, 5)
val rdd = sparkContext.parallelize(data)
2. From a text file:
val sc = new SparkContext("local[*]", "myavgfriend")
val data = sc.textFile("data/fakefriends-noheader.csv")
3. Spark also supports reading data from various external data sources, such as HDFS, Amazon S3, Apache Cassandra, JDBC databases, etc.
Dataframe creation -
import org.apache.spark.sql.SparkSession

// Build the SparkSession first (the appName and master here are placeholders)
val spark = SparkSession.builder
  .appName("DataFrameExample")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = spark.read.format("csv")
  .option("header", "true")
  .load("/path/to/file.csv")
Dataset creation -
1. DF to DS
case class UserRatings(userID: Int, movieID: Int, rating: Int, timestamp: Long)
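With the case class defined, an existing DataFrame whose columns match the field names can be converted to a Dataset; a minimal sketch (the df here is assumed to have columns userID, movieID, rating, timestamp):
import spark.implicits._
val ratingsDS = df.as[UserRatings]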
Example 1 - Ratings counter
// Convert each line to a string, split it out by tabs, and extract the third field.
// (The file format is userID, movieID, rating, timestamp)
val ratings = lines.map(x => x.split("\t")(2))
val results = ratings.countByValue()   // count how many times each rating occurs
Sample output (rating, count):
(5,21201)
(4,34174)
(3,27145)
(2,11370)
Example 2 - Avg friends
/** A function that splits a line of input into (age, numFriends) tuples. */
/**
0,Will,33,385
1,Jean-Luc,26,2
*/
def parseLine(line: String): (Int, Int) = {
  // Split by commas
  val fields = line.split(",")
  // Extract the age and numFriends fields, and convert to integers
  val age = fields(2).toInt
  val numFriends = fields(3).toInt
  // Create a tuple that is our result.
  (age, numFriends)
}
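averagesByAge is built from parseLine with the same mapValues/reduceByKey pattern shown earlier; a minimal sketch (the file path is an assumption):
val lines = sc.textFile("data/fakefriends-noheader.csv")
val rdd = lines.map(parseLine)                                  // RDD[(age, numFriends)]
val totalsByAge = rdd.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
val averagesByAge = totalsByAge.mapValues(x => x._1 / x._2)     // (age, average number of friends)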
// Collect the results from the RDD (this kicks off computing the DAG and actually executes the job)
val results = averagesByAge.collect()