Apache Spark

The document provides an introduction to Apache Spark, including comparisons to Apache Hadoop. It outlines Spark's advantages over Hadoop in allowing interactive queries and iterative algorithms on large datasets stored in memory across clusters. It describes Spark's machine learning library MLlib and the ML API built on DataFrames. Finally, it mentions the SparkR and PySpark interfaces for using Spark from R and Python.


FIRST STEPS ON APACHE SPARK

BIG DATA II Jesús Maillo ([email protected]) 14/03/2019


Outline
2

• Apache Hadoop VS Apache Spark
• Weaknesses of Hadoop
• Apache Spark
• MLlib: Machine Learning on Spark
• ML: API on top of DataFrame
• SparkR & PySpark


Hadoop VS Spark
3-12

(Slides 3-12: comparison figures of Hadoop MapReduce versus Apache Spark, drawn from the article below.)

https://www.dezyre.com/article/hadoop-mapreduce-vs-apache-spark-who-wins-the-battle/83


Weaknesses of Hadoop
14

• Relies on HDD disk storage
• Must be programmed in Java
• No interactive shell
• No efficient way to iterate over the data
• Nevertheless, it is widely used because of its great advantages
Weaknesses of Hadoop
15

Open Source Community

• 1000+ meetup members
• 70+ contributors from 20 companies
• In use at Intel, Yahoo!, Adobe, etc.

Weaknesses of Hadoop
16

• A wide variety of solutions (shown as a figure in the original slide)


Apache Spark
18

Spark retains the attractive properties of MapReduce:

• Fault tolerance
• Data locality
• Scalability

Solution: augment the data flow model with "resilient distributed datasets" (RDDs)
Apache Spark
19

What is an RDD?

• An RDD is an immutable, partitioned, logical collection of records
• Built using transformations over other RDDs
• Can be cached for future reuse
• Partitioning can be based on a key in each record
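A minimal PySpark sketch of these ideas (the HDFS path and the word-count logic are illustrative assumptions, not taken from the slides):

    from pyspark import SparkContext

    sc = SparkContext(appName="rdd-example")   # entry point for the RDD API

    # Each transformation returns a new, immutable RDD; nothing is modified in place.
    lines = sc.textFile("hdfs:///user/someuser/input.txt")   # hypothetical path
    words = lines.flatMap(lambda line: line.split())
    pairs = words.map(lambda w: (w, 1))                      # key each record by word
    counts = pairs.reduceByKey(lambda a, b: a + b)           # partitioned by key

    counts.cache()   # keep the partitions in memory for future reuse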


Apache Spark
20

Transformations (define a new RDD)

map   flatMap   mapPartitions
filter   repartition   sample
union   intersection   distinct
aggregateByKey   reduceByKey

http://spark.apache.org/docs/latest/programming-guide.html#transformations
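As a hedged illustration (the data and variable names are made up), a few of these transformations in PySpark; they are lazy and only define new RDDs:

    # Assumes an existing SparkContext `sc`, as in the RDD sketch above.
    nums = sc.parallelize([1, 2, 3, 4, 5, 6], numSlices=3)

    evens = nums.filter(lambda x: x % 2 == 0)     # keep matching records
    squares = nums.map(lambda x: x * x)           # one output per input
    combined = evens.union(squares).distinct()    # set-style transformations
    # Nothing has executed yet: transformations only build the lineage graph.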
Apache Spark
21

Actions (return a result to the driver program)

count   first   take   collect
reduce   takeSample   takeOrdered
saveAsTextFile   saveAsSequenceFile

http://spark.apache.org/docs/latest/programming-guide.html#actions
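Continuing the hypothetical RDDs from the previous sketch, actions are what trigger execution and return (or save) results:

    print(squares.count())                       # number of elements
    print(squares.take(3))                       # first 3 elements as a Python list
    print(squares.reduce(lambda a, b: a + b))    # aggregate down to a single value

    # Write the result out; the output directory is an assumption.
    squares.saveAsTextFile("hdfs:///user/someuser/squares_out")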
Apache Spark
22

Iterative Algorithms
Apache Spark
23

HDFS

• Hadoop Distributed File System
• Commodity hardware
• HDD disks
Apache Spark
24

HDFS
/home/antoniolopez/ (local file system)  vs.  /user/antoniolopez/ (HDFS)

HDFS storage is separate from the user's local storage


Outline
25

 Apache Hadoop VS Apache Spark


 Weaknesses of Hadoop

 Apache Spark

 MLlib: Machine Learning on Spark

 ML: API on top of DataFrame

 SparkR & PySpark


MLlib: Machine Learning on Spark
26

MLlib is Spark's scalable machine learning library

• Easy to deploy
• Takes advantage of the Hadoop environment
• Contains many algorithms and utilities

https://spark.apache.org/docs/latest/mllib-guide.html
MLlib: Machine Learning on Spark
27

Algorithms and utilities

• Classification: Naive Bayes, Decision Tree classifier, Random Forest
• Regression: Linear regression, Decision Tree regression
• Clustering: k-means
• Statistics: summary statistics, Pearson's test for independence
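A small sketch of one of these algorithms, k-means from the RDD-based MLlib API (the toy points and parameters are assumptions for illustration):

    from pyspark.mllib.clustering import KMeans

    # Hypothetical RDD of feature vectors; in practice it would be parsed from HDFS.
    points = sc.parallelize([[0.0, 0.0], [0.1, 0.2], [9.0, 9.1], [9.2, 8.9]])

    model = KMeans.train(points, k=2, maxIterations=10)
    print(model.clusterCenters)          # the learned centroids
    print(model.predict([0.05, 0.1]))    # cluster index assigned to a new point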
MLlib: Read & Write
28

• Reading from HDFS into main memory
• Writing intermediate or final results to HDFS
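For example, a typical job reads its input from HDFS and writes results back; a minimal sketch (both paths are hypothetical):

    raw = sc.textFile("hdfs:///user/someuser/dataset.csv")         # HDFS -> main memory
    cleaned = raw.filter(lambda line: line.strip() != "")          # drop empty lines
    cleaned.saveAsTextFile("hdfs:///user/someuser/dataset_clean")  # main memory -> HDFS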


MLlib: Cache
29

• The cache() operation forces Spark to distribute the data and keep it in main memory across the cluster
• Important for reusing data efficiently, e.g. in iterative algorithms
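A short sketch of why cache() matters for iterative work (the file path and the statistic computed are illustrative assumptions):

    points = sc.textFile("hdfs:///user/someuser/points.txt") \
               .map(lambda line: [float(x) for x in line.split()]) \
               .cache()                      # keep the parsed data in distributed memory

    n = points.count()                       # first action: reads from HDFS, then caches
    for _ in range(10):                      # later passes read from memory, not disk
        mean_x = points.map(lambda p: p[0]).sum() / n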
MLlib: Web UI
30

Web UI (http://hadoop.ugr.es:8079/cluster)
MLlib: Web UI
31

12 workers
MLlib: Web UI
32

Running & completed applications




ML: API on top of DataFrame
34

"Spark ML standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow."

https://spark.apache.org/docs/latest/ml-guide.html
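A minimal sketch of such a pipeline (the tiny training DataFrame and the chosen stages are assumptions, following the pattern in the guide linked above):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("ml-pipeline-example").getOrCreate()
    training = spark.createDataFrame(
        [("spark is fast", 1.0), ("hadoop uses disk", 0.0)], ["text", "label"])

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashing_tf = HashingTF(inputCol="words", outputCol="features")
    lr = LogisticRegression(maxIter=10)

    # The pipeline chains the three stages into a single workflow.
    model = Pipeline(stages=[tokenizer, hashing_tf, lr]).fit(training)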
ML: API on top of DataFrame
35

"A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood."

http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes
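A brief PySpark sketch of working with named columns (the JSON path and its fields are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dataframe-example").getOrCreate()

    df = spark.read.json("hdfs:///user/someuser/people.json")   # rows with named columns
    df.printSchema()                                             # column names and types
    df.select("name", "age").filter(df["age"] > 21).show()      # column-based operations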
ML: API on top of DataFrame
36

SQL Queries

• SQL queries are written as plain strings
• The result is returned as a new DataFrame

http://spark.apache.org/docs/latest/sql-programming-guide.html#running-sql-queries-programmatically
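A minimal sketch, reusing the hypothetical DataFrame `df` and SparkSession `spark` from the previous example:

    df.createOrReplaceTempView("people")                               # expose df to SQL
    adults = spark.sql("SELECT name, age FROM people WHERE age > 21")  # query as a string
    adults.show()                                                      # the result is a new DataFrame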
ML: API on top of DataFrame
37

Interoperating with RDDs

• An RDD can be transformed into a DataFrame
• Built like an RDD, but with column names attached

http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
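A small sketch of going from an RDD to a DataFrame (the records and column names are made up; assumes the SparkSession `spark` from the earlier examples):

    from pyspark.sql import Row

    rdd = spark.sparkContext.parallelize([("Ana", 34), ("Luis", 28)])
    people_df = rdd.map(lambda p: Row(name=p[0], age=p[1])).toDF()   # attach column names
    # Equivalent: spark.createDataFrame(rdd, ["name", "age"])
    people_df.show()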


SparkR & PySpark
39

PySpark: https://spark.apache.org/docs/latest/api/python/index.html

SparkR: https://spark.apache.org/docs/latest/sparkr.html
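A minimal PySpark program (the application name is arbitrary); SparkR offers the analogous DataFrame API from R:

    # Run interactively with `pyspark` or submit with `spark-submit script.py`.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pyspark-example").getOrCreate()
    df = spark.range(1000)        # DataFrame with a single "id" column
    print(df.count())
    spark.stop()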
FIRST STEPS ON APACHE SPARK
BIG DATA II Jesús Maillo ([email protected]) 14/03/2019
