Introduction to Spark
Sajan Kedia
Agenda
• What is Spark
• Why Spark
• Spark Framework
• RDD
• Immutability
• Lazy Evaluation
• Dataframe
• Dataset
• Spark SQL
• Architecture
• Cluster Manager
What is Spark?
• Apache Spark is a fast, in-memory data processing engine
• With its development APIs, it allows execution of streaming, machine learning, or SQL workloads.
• Fast, expressive cluster computing system compatible with Apache Hadoop
• Improves efficiency through:
• In-memory computing primitives
• General computation graphs (DAG)
• Up to 100× faster (in memory) than MapReduce
• Improves usability through:
• Rich APIs in Java, Scala, Python
• Interactive shell
• Often 2-10× less code (see the word-count sketch below)
• Open-source parallel processing framework primarily used for data engineering and analytics.
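A minimal PySpark word count, as a flavour of the "rich APIs" and "2-10× less code" points above (the input file name is only a placeholder):

# Minimal word count in PySpark; "input.txt" is a hypothetical local file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count").getOrCreate()
sc = spark.sparkContext

counts = (sc.textFile("input.txt")                   # read lines into an RDD
            .flatMap(lambda line: line.split())      # split lines into words
            .map(lambda word: (word, 1))             # pair each word with a count of 1
            .reduceByKey(lambda a, b: a + b))        # sum the counts per word

print(counts.take(10))                               # trigger execution, peek at results
spark.stop()

The same lines can be typed one by one into the interactive shell (pyspark), which is the "interactive shell" point above.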
About Apache Spark
• Initially started at UC Berkeley in 2009
• Open source cluster computing framework
• Written in Scala (gives the power of functional programming)
• Provides high level APIs in
• Java
• Scala
• Python
• R
• Integrates with Hadoop and its ecosystem and can read existing Hadoop data (see the sketch below).
• Designed to be fast for iterative algorithms and interactive queries, for which MapReduce is inefficient.
• Most popular for running iterative machine learning algorithms.
• Supports in-memory storage and efficient fault recovery.
• Up to 10× faster on disk and up to 100× faster in memory than MapReduce
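A small sketch of the Hadoop integration and in-memory storage points above; the HDFS path is a hypothetical placeholder:

# Read data that already lives in HDFS and keep it cached in memory,
# so repeated interactive queries do not re-read it from disk.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-cache").getOrCreate()

logs = spark.read.text("hdfs:///data/logs/*.log")    # hypothetical existing Hadoop data
logs.cache()                                          # keep the DataFrame in memory

# Both queries below reuse the cached data after the first action materialises it.
print(logs.filter(logs.value.contains("ERROR")).count())
print(logs.filter(logs.value.contains("WARN")).count())
spark.stop()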
Why Spark?
• Most machine learning algorithms are iterative, because each iteration can improve the results
• With a disk-based approach (as in MapReduce), each iteration's output is written to disk, making it slow (see the sketch below)
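A toy sketch of the in-memory alternative: the data is cached once and every iteration reuses it, instead of writing intermediate results to disk (the data and learning rate are made up for illustration):

# Toy gradient descent for y ≈ w * x; each iteration reuses the cached RDD.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-sketch").getOrCreate()
sc = spark.sparkContext

# (x, y) pairs; a real job would load these from storage once and cache them.
points = sc.parallelize([(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]).cache()

w = 0.0
for i in range(10):
    # Gradient of the squared error, computed over the in-memory data.
    grad = points.map(lambda p: (w * p[0] - p[1]) * p[0]).mean()
    w -= 0.1 * grad

print("learned weight:", w)   # converges towards roughly 2
spark.stop()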