
Introduction to Spark

Sajan Kedia
Agenda
• What is Spark
• Why Spark
• Spark Framework
• RDD
• Immutability
• Lazy Evaluation
• Dataframe
• Dataset
• Spark SQL
• Architecture
• Cluster Manager
What is Spark?
• Apache Spark is a fast, in-memory data processing engine
• With its development APIs, it allows you to execute streaming, machine learning, or SQL workloads.
• Fast, expressive cluster computing system compatible with Apache Hadoop
• Improves efficiency through:
• In-memory computing primitives
• General computation graphs (DAG)
• Up to 100× faster
• Improves usability through:
• Rich APIs in Java, Scala, Python
• Interactive shell
• Often 2-10× less code
• Open-source parallel processing framework primarily used for data engineering and analytics.
About Apache Spark
• Initially started at UC Berkeley in 2009
• Open source cluster computing framework
• Written in Scala (gives power of functional Programming)
• Provides high level APIs in
• Java
• Scala
• Python
• R
• Integrates with Hadoop and its ecosystem and can read existing Hadoop data.
• Designed to be fast for iterative algorithms and interactive queries, for which MapReduce is inefficient.
• Most popular for running iterative machine learning algorithms.
• Supports in-memory storage and efficient fault recovery.
• Up to 10x faster on disk and 100x faster in memory.
Why Spark?
• Most machine learning algorithms are iterative, because each iteration can improve the results
• With a disk-based approach, each iteration's output is written to disk, making it slow

Hadoop execution flow vs. Spark execution flow (diagrams)

Spark Framework
Spark Core Engine (diagram)
RDD (Resilient Distributed Dataset)
• Key Spark Construct
• A distributed collection of elements
• Each RDD is split into multiple partitions which may be computed on different nodes of the cluster
• Spark automatically distributes the data contained in RDDs across the cluster and parallelizes the operations
• RDDs have the following properties:
○ Immutable
○ Lazy evaluated
○ Cacheable
○ Type inferred
RDD Operations
• How to create RDD:
• Loading external data sources
• lines = sc.textFile("readme.txt")
• Parallelizing a collection in a driver program
• lines = sc.parallelize(["pandas", "I like pandas"])
• Transformation
• Transforms an RDD into another RDD by applying some function
• Lineage graph (DAG): keeps track of the dependencies between transformed RDDs, so that new RDDs can be computed on demand and lost parts of a persisted RDD can be recovered in case of failure.
• Examples: map, filter, flatMap, distinct, sample, union, intersection, subtract, cartesian, etc.
• Action
• The actual output is generated from the transformed RDDs only once an action is applied
• Returns values to the driver program or writes data to external storage
• The entire RDD gets recomputed from scratch on each new action call if intermediate results are not persisted.
• Examples: reduce, collect, count, countByValue, take, top, takeSample, aggregate, foreach, etc. (see the PySpark sketch below)
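To make these operations concrete, here is a minimal PySpark sketch (illustrative, not from the original slides; the data is made up) that creates an RDD, applies transformations, and triggers an action:

from pyspark import SparkContext

sc = SparkContext(appName="rdd-basics")              # entry point for the RDD API
lines = sc.parallelize(["pandas", "I like pandas", "I like spark"])
words = lines.flatMap(lambda line: line.split(" "))  # transformation: lazy, nothing runs yet
pairs = words.map(lambda w: (w, 1))                  # transformation: still lazy
counts = pairs.reduceByKey(lambda a, b: a + b)       # transformation: still lazy
print(counts.collect())                              # action: triggers the actual computation
sc.stop()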
Immutability
● Immutability means once created it never changes
● Big data by default immutable in nature
● Immutability helps to
○ Parallelize
○ Caching
Immutability in action
const int a = 0;   // immutable
int b = 0;         // mutable
Updation:
b++;               // in place, mutates b
int c = a + 1;     // a new value is created; a itself never changes
Immutability is about the value, not about the reference
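The same idea in Spark terms, as a small illustrative sketch (assuming the SparkContext sc from the earlier example): transformations never modify an existing RDD, they always return a new one.

rdd = sc.parallelize([1, 2, 3, 4])
incremented = rdd.map(lambda x: x + 1)   # returns a NEW RDD
print(rdd.collect())                     # [1, 2, 3, 4] -- original RDD is unchanged
print(incremented.collect())             # [2, 3, 4, 5]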
Challenges of Immutability
● Immutability is great for parallelism but not good for space
● Doing multiple transformations results in
○ Multiple copies of data
○ Multiple passes over data
● In big data, multiple copies and multiple passes will have poor performance characteristics.
Lazy Evaluation
• Laziness means not computing a transformation until it is needed
• Once any action is performed, the actual computation starts
• A DAG (Directed acyclic graph) will be created for the tasks
• Catalyst Engine is used to optimize the tasks & queries
• It helps reduce the number of passes
• Laziness in action
val c1 = collection.map(value => value + 1)  // does not compute anything yet
val c2 = c1.map(value => value + 2)          // still not computed
print(c2)                                    // now the pipeline is transformed into:

Multiple transformations are combined into one

val c2 = collection.map(value => {
  var result = value + 1
  result = result + 2
  result
})
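The same laziness can be observed in PySpark; a minimal sketch (assuming the SparkContext sc from earlier):

data = sc.parallelize(range(1, 1000001))
c1 = data.map(lambda v: v + 1)    # no job runs here
c2 = c1.map(lambda v: v + 2)      # still no job
print(c2.take(5))                 # action: Spark fuses both maps into one pass -> [4, 5, 6, 7, 8]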
Challenges of Laziness

• Laziness poses challenges in terms of data type


• If laziness defers execution, determining the type of the variable becomes challenging
• If we can't determine the right type, semantic issues can slip in
• Running big data programs and getting semantic errors is not fun.
Type inference
• Type inference is the part of the compiler that determines the type from the value
• As all transformations are side-effect free, we can determine the type from the operation
• Every transformation has a specific return type
• Having type inference relieves you from thinking about the representation for many transforms.
• Example:
• c3 = c2.count( ) // inferred as Int
• collection = [1,2,4,5] // explicit type Array
Caching
• Immutable data allows you to cache data for a long time
• Lazy transformations allow data to be recreated on failure
• Transformations can also be saved
• Caching data improves execution engine performance
• Reduces a lot of I/O operations from reading/writing data to HDFS
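A minimal caching sketch (illustrative; the input path is hypothetical and sc is the SparkContext from earlier). cache() marks the RDD to be kept in memory after its first computation:

logs = sc.textFile("hdfs:///data/logs.txt")               # hypothetical input path
errors = logs.filter(lambda line: "ERROR" in line).cache()
print(errors.count())     # first action: computes and caches the filtered RDD
print(errors.take(10))    # second action: served from the in-memory cache, no re-read from HDFS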
What Spark gives Hadoop?
• Spark's machine learning module delivers capabilities that are not easily achieved in Hadoop.
• In-memory processing of sizeable data volumes remains an important contribution to the capabilities of a Hadoop cluster.
• Valuable for enterprise use cases
• Spark’s SQL capabilities for interactive analysis over big data
• Streaming capabilities (Spark Streaming)
• Graph processing capabilities (GraphX)
What Hadoop gives Spark?
• YARN resource manager
• DFS
• Disaster Recovery capabilities
• Data Security
• A distributed data platform
• Leverage existing clusters
• Data locality
• Manage workloads using advanced policies
• Allocate shares to different teams and users
• Hierarchical queues
• Queue placement policies
• Take advantage of Hadoop’s security
• Run on Kerberized clusters
Data Frames in Spark
• Unlike an RDD, data is organized into named columns.
• Allows developers to impose a structure onto a distributed collection of data.
• Enables wider audiences beyond “Big Data” engineers to leverage the power of distributed
processing
• Allows Spark to manage the schema and only pass the data between nodes, in a much more efficient way than using Java serialization.
• Spark can serialize the data into off-heap storage in a binary format and then perform many transformations directly on this off-heap memory.
• Custom memory management
• Data is stored in off-heap memory in binary format.
• No garbage collection is involved, due to the avoidance of serialization into JVM objects.
• Query optimization plan
• Catalyst Query optimizer
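An illustrative DataFrame sketch (not from the slides; the column names and values are made up). It shows data organized into named columns with a schema managed by Spark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-example").getOrCreate()
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"])                          # named columns
df.printSchema()                              # Spark tracks the schema
df.filter(df.age > 30).select("name").show() # optimized by the Catalyst engine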
Datasets in Spark
• Aims to provide the best of both worlds
• RDDs – OOP style and compile-time safety
• DataFrames – Catalyst query optimizer, custom memory management
• Where Datasets score over DataFrames is an additional feature they have: Encoders.
• Encoders act as the interface between JVM objects and the off-heap custom binary-format data.
• Encoders generate byte code to interact with off-heap data and provide on-demand access to
individual attributes without having to deserialize an entire object.
Spark SQL
• Spark SQL is a Spark module for structured data processing
• It lets you query structured data inside Spark programs, using SQL or a familiar
DataFrame API.
• Connect to any data source the same way
• Hive, Avro, Parquet, ORC, JSON, and JDBC.
• You can even join data across these sources.
• Run SQL or HiveQL queries on existing warehouses.
• A server mode provides industry standard JDBC and ODBC connectivity for business
intelligence tools.
• Writing code with the RDD API in Scala can be difficult; with Spark SQL, simple SQL-style code can be written, which is internally converted to the Spark API and optimized using the DAG and the Catalyst engine.
• There is no reduction in performance.
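A minimal Spark SQL sketch (illustrative; it reuses the spark session and df DataFrame assumed in the DataFrame example above):

df.createOrReplaceTempView("people")    # expose the DataFrame as a SQL table
result = spark.sql("SELECT name, age FROM people WHERE age > 30")
result.show()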
PySpark Code - Hands On
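As a hands-on illustration (not from the original deck), a small end-to-end PySpark word-count sketch; all paths and names are made up, and it would typically be launched with spark-submit:

# word_count.py -- run with e.g.: spark-submit --master yarn --deploy-mode cluster word_count.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///data/readme.txt")          # hypothetical input path
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:///data/word_counts")       # action: triggers the whole job
spark.stop()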
Spark Runtime Architecture
Spark Runtime Architecture - Driver
• Master node
• Process where “main” method runs
• Runs user code that
• creates a SparkContext
• Performs RDD operations
• When run, it performs two main duties:
• Converting the user program into tasks
• Logical DAG of operations -> physical execution plan.
• Optimization example: pipelining map transforms together to merge them and converting the execution graph into a set of stages.
• Bundles up the tasks and sends them to the cluster.
• Scheduling tasks on executors
• Executors register themselves with the driver
• Looks at the current set of executors and tries to schedule each task in an appropriate location, based on data placement.
• Tracks the location of cached data (executors cache data as a side effect of running tasks) and uses it to schedule future tasks that access that data.
Spark Runtime Architecture - Spark Context
• Driver accesses Spark functionality through SC object
• Represents a connection to the computing cluster
• Used to build RDDs
• Works with the cluster manager
• Manages executors running on worker nodes
• Splits jobs into parallel tasks and executes them on worker nodes
• Partitions RDDs and distributes them on the cluster
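For illustration, a minimal way to obtain the SparkContext described above (the configuration values are placeholders):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("my-app").setMaster("local[*]")   # or "yarn" on a cluster
sc = SparkContext(conf=conf)                    # the driver's connection to the cluster
rdd = sc.parallelize(range(100), numSlices=4)   # SparkContext builds and partitions RDDs
print(rdd.getNumPartitions())                   # 4
sc.stop()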
Spark Runtime Architecture - Executor
• Runs individual tasks in a given Spark job
• Launched once at the beginning of spark application
• Two main roles
• Runs the tasks and returns the results to the driver
• Provides in-memory storage for RDDs that are cached by user programs, through a service called the Block Manager that lives within each executor.
Spark Runtime Architecture - Cluster Manager

• Launches Executors and sometimes driver


• Allows Spark to run on top of different external cluster managers
• YARN
• Mesos
• Spark's built-in standalone cluster manager
• Deploy modes
• Client mode
• Cluster mode
Spark + YARN (Cluster Deployment mode)
• Driver runs inside a YARN container
Spark + YARN (Client Deployment Mode)
• Driver runs on the machine from which you submitted the application (e.g. locally on your laptop)
Steps: running spark application on cluster
• Submit an application using spark-submit
• spark-submit launches the driver program and invokes its main() method
• The driver program contacts the cluster manager to ask for resources to launch executors
• The cluster manager launches executors on behalf of the driver program.
• The driver process runs through the user application and sends work to the executors in the form of tasks.
• Executors run the tasks and save the results.
• When the driver's main() method exits or SparkContext.stop() is called, the executors are terminated and the cluster resources are released.
Thanks :)
