
Spark Details

Spark is an expressive computing system that facilitates in-memory computing to avoid storing intermediate results to disk. It introduces the RDD abstraction of partitioned and distributed datasets that can be cached in memory across a cluster. RDDs support transformations and actions, where transformations build new RDDs and actions trigger execution by returning values or exporting data. Jobs are executed through a DAG scheduler and task scheduler to optimize partitioning and execution across worker nodes.


Spark

Spark ideas
• expressive computing system, not limited to
the map-reduce model
• exploit main memory
– avoid saving intermediate results to disk
– cache data for repetitive queries (e.g. for machine
learning; see the sketch below)
• compatible with Hadoop
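
A minimal sketch of the caching idea: a hypothetical gradient-descent-style loop that reuses one dataset many times (assumes spark-shell, where `sc` is already a SparkContext; the HDFS path is a placeholder):

// cache once, then iterate over the in-memory data
val points = sc.textFile("hdfs://...")
  .map(_.split(",").map(_.toDouble))
  .cache()
var w = 0.0
for (i <- 1 to 10) {
  // each pass reads points from memory instead of re-reading HDFS
  w -= 0.1 * points.map(p => (w * p(0) - p(1)) * p(0)).mean()
}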
RDD abstraction
• Resilient Distributed Datasets
• partitioned collection of records
• spread across the cluster
• read-only
• can be cached in memory
– different storage levels available (see the sketch below)
– fallback to disk possible
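
A short sketch of the storage levels (assumes `sc` from spark-shell; paths are placeholders as in the slides):

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs://...")
// MEMORY_ONLY is the default level used by cache()
lines.persist(StorageLevel.MEMORY_ONLY)
// MEMORY_AND_DISK spills partitions that do not fit in memory to local disk
val lower = lines.map(_.toLowerCase)
lower.persist(StorageLevel.MEMORY_AND_DISK)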
RDD operations
• transformations build new RDDs through
deterministic operations on other RDDs
– transformations include map, filter, join
– lazy: nothing is computed until an action runs
• actions return a value or export data
– actions include count, collect, save
– actions trigger execution (see the sketch below)
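
A sketch of laziness vs. execution (placeholder paths, `sc` as above):

// transformations only build new RDDs; nothing runs yet
val words = sc.textFile("hdfs://...").flatMap(_.split(" "))
val long = words.filter(_.length > 5)
// actions trigger execution
long.count()                      // returns a value to the driver
long.saveAsTextFile("hdfs://...") // exports data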
Job example
Driver:
val log = sc.textFile("hdfs://...")
val errors = log.filter(_.contains("ERROR"))
errors.cache()
// actions trigger execution:
errors.filter(_.contains("I/O")).count()
errors.filter(_.contains("timeout")).count()

[figure: three workers, each holding a cached partition of errors (Cache 1, 2, 3) next to its HDFS block (Block 1, 2, 3)]


RDD partition-level view
Dataset-level view:
– log: HadoopRDD (path = hdfs://...)
– errors: FilteredRDD (func = _.contains(…), shouldCache = true)
Partition-level view:
– one task per partition: Task 1, Task 2, ...

source: https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
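
The lineage from the figure can be inspected in the interpreter; a small sketch (the exact RDD class names printed, e.g. FilteredRDD, depend on the Spark version):

val log = sc.textFile("hdfs://...")
val errors = log.filter(_.contains("ERROR"))
println(errors.toDebugString)     // lineage: a filtered RDD on top of a HadoopRDD
println(errors.partitions.length) // by default one partition per HDFS block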
Job scheduling
RDD Objects → DAGScheduler → TaskScheduler → Worker

• RDD Objects: build the operator DAG, e.g.
rdd1.join(rdd2).groupBy(…).filter(…)
• DAGScheduler: split the graph into stages of tasks;
submit each stage as ready
• TaskScheduler: launch tasks (as a TaskSet) via the cluster manager;
retry failed or straggling tasks
• Worker: execute tasks in threads;
store and serve blocks via the block manager

source: https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
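
The chain from the figure, written as runnable code (the input pair RDDs are made up for illustration); each shuffle (join, groupBy) starts a new stage, while filter stays inside its stage:

val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b")))
val rdd2 = sc.parallelize(Seq((1, "x"), (2, "y")))
val result = rdd1.join(rdd2)              // shuffle: stage boundary
  .groupBy { case (k, _) => k % 2 }       // shuffle: another stage
  .filter { case (_, g) => g.nonEmpty }   // narrow: same stage
result.collect()                          // action: DAGScheduler submits the stages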
Available APIs
• You can write in Java, Scala or Python
• interactive interpreter: Scala & Python only
• standalone applications: any of the three (see the sketch below)
• performance: Java & Scala are faster thanks to
static typing
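
A minimal standalone application in Scala (Spark 1.x style, matching the spark-submit example later; the object name and argument handling are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object WordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    // classic word count: input path in args(0), output path in args(1)
    sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(args(1))
    sc.stop()
  }
}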
Hands-on – interpreter
• script
http://cern.ch/kacper/spark.txt

• run the Scala Spark interpreter

$ spark-shell

• or the Python interpreter
$ pyspark
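
The exercises themselves are in the script linked above; as a first sanity check, a session might look like this (illustrative only):

scala> val nums = sc.parallelize(1 to 1000)
scala> nums.filter(_ % 2 == 0).count()
res0: Long = 500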
Hands-on – build and submission
• download and unpack the source code
wget http://cern.ch/kacper/GvaWeather.tar.gz; tar -xzf GvaWeather.tar.gz
• build definition in
GvaWeather/gvaweather.sbt (a sketch follows after this list)

• source code
GvaWeather/src/main/scala/GvaWeather.scala

• building
cd GvaWeather
sbt package

• job submission
spark-submit --master local --class GvaWeather \
target/scala-2.10/gva-weather_2.10-1.0.jar
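
For reference, a typical gvaweather.sbt for this layout might look as follows; this is a hypothetical reconstruction (the real file ships in the tarball), with versions guessed from the scala-2.10 jar name:

// hypothetical build definition; exact versions may differ
name := "GVA Weather"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"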
Summary
• concept not limited to single-pass map-reduce
• avoid storing intermediate results on disk or
HDFS
• speed up computations when reusing datasets
