Spark
Spark builds a DAG of stages for each job. Narrow transformations (eg map) stay within the same stage.
If a task needs to shuffle data between the nodes, a new stage is created - eg countByValue.
Each stage is broken into tasks, which may be distributed across the cluster.
Finally, the tasks are scheduled across the cluster and executed.
Maps can also create key-value pairs - rdd.map(x => (x, 1))
MyAvgFriends -
val mapval = filtereddata.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)) // (age, (totalFriends, count))
val xyz = mapval.mapValues(x => x._1 / x._2) // (age, average number of friends)
2. groupByKey() - group values with the same key
5. SQL-style joins - join, rightOuterJoin, leftOuterJoin, cogroup, subtractByKey (see the sketch below)
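A minimal sketch of the SQL-style joins on pair RDDs (the data here is made up):
val ages = sc.parallelize(Seq(("Will", 33), ("Jean-Luc", 26)))
val friends = sc.parallelize(Seq(("Will", 385), ("Picard", 2)))
val joined = ages.join(friends)            // only keys present in both sides: RDD[(String, (Int, Int))]
val withAll = ages.leftOuterJoin(friends)  // keeps every key from ages; values become (Int, Option[Int])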
Wordcount program -
val lines = sc.textFile("file.txt")          // gives each line from the file
val data2 = lines.flatMap(x => x.split(" ")) // split each line into words
// Alternative: count with reduceByKey and sort by count
// val data3 = data2.map(x => (x, 1))
// val data4 = data3.reduceByKey((x, y) => x + y).sortBy(_._2)
val data4 = data2.countByValue()
flatMap - maps each element of the collection to zero or more new elements.
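Case classes compare by value, so two instances with the same field values are equal. A minimal sketch (the Person case class and the instances here are assumed):
case class Person(name: String, age: Int)
val person1 = Person("Will", 33)
val person2 = Person("Will", 33)      // same field values as person1
val person3 = Person("Jean-Luc", 26)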
println(person1.equals(person2)) // true
println(person1.equals(person3)) // false
Broadcast variable - if the dataset is small enough to fit in memory, we can load it in the driver program and Spark automatically forwards it to each executor when needed.
sc.broadcast() to broadcast the object
.value to get the object back on the executors
But if the tables are massive, we'd only want to transfer them once to each executor and keep them there.
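A minimal sketch of a broadcast lookup (the lookup map and ratings data here are made up):
val movieNames = Map(1 -> "Toy Story", 2 -> "GoldenEye")   // small lookup table loaded in the driver
val nameDict = sc.broadcast(movieNames)
val ratings = sc.parallelize(Seq((1, 5.0), (2, 3.5)))
val withNames = ratings.map { case (movieID, rating) =>
  (nameDict.value(movieID), rating)                        // .value gets the broadcast object back on the executor
}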
UDF - In Spark Scala, a User-Defined Function (UDF) allows you to define custom functions
that can be applied to DataFrame columns or used in SQL expressions. UDFs provide flexibility
in performing custom operations on data within Spark.
import org.apache.spark.sql.functions.{udf, col}
// Define the UDF
val myUDF = udf((input: DataType) => {
  // Custom logic to process the input and return a result
  result
})
val df: DataFrame = ... // Your DataFrame
val resultDF = df.withColumn("newColumn", myUDF(col("inputColumn")))
resultDF.show()
Example -
val squareUDF = udf((x: Int) => x * x)
// Apply the UDF to a column
val resultDF = df.withColumn("squared", squareUDF(df("number")))
Spark-submit
1. Create the jar: Project Structure => Artifacts => include dependencies + code (without the Spark libs) => Build
2. This will create an out folder with the .jar file in it.
3. Navigate to the folder where your jar is placed (on Ubuntu) and run the spark-submit command -
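For example (the class and jar names here are placeholders):
spark-submit --class com.sundogsoftware.MovieSimilarities1MDataset MovieSimilarities1MDataset.jar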
By SBT -
Remove code that is not suitable for the cluster, e.g. the hard-coded master local[*] (it can't take advantage of the cluster) and local file paths.
build.sbt file -
name := "MovieSimilarities1MDataset"
version := "1.0"
organization := "com.sundogsoftware"
scalaVersion := "2.11.8"
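The Spark dependency itself also needs to be declared; a minimal sketch (the version is an assumption compatible with Scala 2.11, and "provided" is used because the cluster supplies Spark at runtime):
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.8" % "provided"
)
Then build the jar with sbt package (or sbt assembly if the assembly plugin is set up) and submit it with spark-submit as above.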
RDD vs DataFrame vs Dataset -

|                     | RDD                                    | DataFrame                             | Dataset                              |
|---------------------|----------------------------------------|---------------------------------------|--------------------------------------|
| API                 | Lower-level API with more flexibility  | Higher-level API with ease of use     | Higher-level API with type safety    |
| Serialization       | Uses Java serialization by default     | Uses optimized serialization formats  | Uses optimized serialization formats |
| Interoperability    | Compatible with any JVM language       | Compatible with any JVM language      | Compatible with any JVM language     |
| Type Safety         | Not type-safe                          | Partially type-safe                   | Type-safe (based on DataFrame)       |
| Compile-time checks | No                                     | No                                    | Yes                                  |
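A minimal sketch of the type-safety difference (the Friend case class and the data here are made up):
import spark.implicits._
case class Friend(name: String, age: Int)
val ds = Seq(Friend("Will", 33)).toDS()
val df = ds.toDF()
df.select("agee")        // a column-name typo only fails at runtime (AnalysisException)
ds.map(f => f.age + 1)   // field access on the Dataset is checked at compile time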
RDD creation -
import org.apache.spark.SparkContext
1. From an in-memory collection:
val data = Seq(1, 2, 3, 4, 5)
val rdd = sparkContext.parallelize(data)
2. From a text file:
val sc = new SparkContext("local[*]", "myavgfriend")
val data = sc.textFile("data/fakefriends-noheader.csv")
3. Spark also supports reading data from various external data sources, such as HDFS, Amazon S3, Apache Cassandra, JDBC databases, etc.
Dataframe creation -
import org.apache.spark.sql.SparkSession

// Build the SparkSession first (the appName and master here are placeholders)
val spark = SparkSession.builder
  .appName("DataFrameExample")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = spark.read.format("csv")
  .option("header", "true")
  .load("/path/to/file.csv")
Dataset creation -
1. DF to DS
case class UserRatings(userID: Int, movieID: Int, rating: Int, timestamp: Long)
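With the case class defined, an existing DataFrame whose columns match the field names can be converted to a Dataset; a minimal sketch (the df here is assumed to have columns userID, movieID, rating, timestamp):
import spark.implicits._
val ratingsDS = df.as[UserRatings]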
Example 1 - Ratings counter
// Convert each line to a string, split it out by tabs, and extract the third field.
// (The file format is userID, movieID, rating, timestamp)
val ratings = lines.map(x => x.split("\t")(2))
val results = ratings.countByValue()   // count how many times each rating occurs
Sample output (rating, count):
(5,21201)
(4,34174)
(3,27145)
(2,11370)
Example 2 - Avg friends
/** A function that splits a line of input into (age, numFriends) tuples. */
/**
0,Will,33,385
1,Jean-Luc,26,2
*/
def parseLine(line: String): (Int, Int) = {
  // Split by commas
  val fields = line.split(",")
  // Extract the age and numFriends fields, and convert to integers
  val age = fields(2).toInt
  val numFriends = fields(3).toInt
  // Create a tuple that is our result.
  (age, numFriends)
}
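averagesByAge is built from parseLine with the same mapValues/reduceByKey pattern shown earlier; a minimal sketch (the file path is an assumption):
val lines = sc.textFile("data/fakefriends-noheader.csv")
val rdd = lines.map(parseLine)                                  // RDD[(age, numFriends)]
val totalsByAge = rdd.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
val averagesByAge = totalsByAge.mapValues(x => x._1 / x._2)     // (age, average number of friends)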
// Collect the results from the RDD (this kicks off computing the DAG and actually executes the job)
val results = averagesByAge.collect()