Apache Spark

The document provides an introduction to Apache Spark, including comparisons to Apache Hadoop. It outlines Spark's advantages over Hadoop in allowing interactive queries and iterative algorithms on large datasets stored in memory across clusters. It describes Spark's machine learning library MLlib and the ML API built on DataFrames. Finally, it mentions the SparkR and PySpark interfaces for using Spark from R and Python.


FIRST STEPS ON APACHE SPARK

BIG DATA II Jesús Maillo ([email protected]) 14/03/2019


Outline
2

• Apache Hadoop VS Apache Spark
• Weaknesses of Hadoop
• Apache Spark
• MLlib: Machine Learning on Spark
• ML: API on top of DataFrame
• SparkR & PySpark


Hadoop VS Spark
3-12

(Slides 3-12: comparison figures of Hadoop MapReduce versus Apache Spark, drawn from the article below.)

https://www.dezyre.com/article/hadoop-mapreduce-vs-apache-spark-who-wins-the-battle/83


Weaknesses of Hadoop
14

• Relies on HDD disk storage
• Must be programmed in Java
• No interactive shell
• No efficient way to iterate over the data
• Nevertheless, it is widely used because of its great advantages
Weaknesses of Hadoop
15

Open Source Community

• 1000+ meetup members
• 70+ contributors from 20 companies
• In use at Intel, Yahoo!, Adobe, etc.

Weaknesses of Hadoop
16

• A wide variety of solutions (shown as a figure in the original slide)


Apache Spark
18

Spark retains the attractive properties of MapReduce:

• Fault tolerance
• Data locality
• Scalability

Solution: augment the data flow model with "resilient distributed datasets" (RDDs)
Apache Spark
19

What is an RDD?

• An RDD is an immutable, partitioned, logical collection of records
• Built using transformations over other RDDs
• Can be cached for future reuse
• Partitioning can be based on a key in each record
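A minimal PySpark sketch of these ideas (the HDFS path and the word-count logic are illustrative assumptions, not taken from the slides):

    from pyspark import SparkContext

    sc = SparkContext(appName="rdd-example")   # entry point for the RDD API

    # Each transformation returns a new, immutable RDD; nothing is modified in place.
    lines = sc.textFile("hdfs:///user/someuser/input.txt")   # hypothetical path
    words = lines.flatMap(lambda line: line.split())
    pairs = words.map(lambda w: (w, 1))                      # key each record by word
    counts = pairs.reduceByKey(lambda a, b: a + b)           # partitioned by key

    counts.cache()   # keep the partitions in memory for future reuse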


Apache Spark
20

Transformations (define a new RDD)

map   flatMap   mapPartitions
filter   repartition   sample
union   intersection   distinct
aggregateByKey   reduceByKey

http://spark.apache.org/docs/latest/programming-guide.html#transformations
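As a hedged illustration (the data and variable names are made up), a few of these transformations in PySpark; they are lazy and only define new RDDs:

    # Assumes an existing SparkContext `sc`, as in the RDD sketch above.
    nums = sc.parallelize([1, 2, 3, 4, 5, 6], numSlices=3)

    evens = nums.filter(lambda x: x % 2 == 0)     # keep matching records
    squares = nums.map(lambda x: x * x)           # one output per input
    combined = evens.union(squares).distinct()    # set-style transformations
    # Nothing has executed yet: transformations only build the lineage graph.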
Apache Spark
21

Actions (return a result to the driver program)

count   first   take   collect
reduce   takeSample   takeOrdered
saveAsTextFile   saveAsSequenceFile

http://spark.apache.org/docs/latest/programming-guide.html#actions
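Continuing the hypothetical RDDs from the previous sketch, actions are what trigger execution and return (or save) results:

    print(squares.count())                       # number of elements
    print(squares.take(3))                       # first 3 elements as a Python list
    print(squares.reduce(lambda a, b: a + b))    # aggregate down to a single value

    # Write the result out; the output directory is an assumption.
    squares.saveAsTextFile("hdfs:///user/someuser/squares_out")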
Apache Spark
22

Iterative Algorithms
Apache Spark
23

HDFS

• Hadoop Distributed File System
• Commodity hardware
• HDD disks
Apache Spark
24

HDFS
/home/antoniolopez/ (local file system)  vs.  /user/antoniolopez/ (HDFS)

HDFS storage is separate from the user's local storage


Outline
25

 Apache Hadoop VS Apache Spark


 Weaknesses of Hadoop

 Apache Spark

 MLlib: Machine Learning on Spark

 ML: API on top of DataFrame

 SparkR & PySpark


MLlib: Machine Learning on Spark
26

MLlib is Spark's scalable machine learning library

• Easy to deploy
• Takes advantage of the Hadoop environment
• Contains many algorithms and utilities

https://spark.apache.org/docs/latest/mllib-guide.html
MLlib: Machine Learning on Spark
27

Algorithms and utilities

• Classification: Naive Bayes, Decision Tree classifier, Random Forest
• Regression: Linear regression, Decision Tree regression
• Clustering: k-means
• Statistics: summary statistics, Pearson's test for independence
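A small sketch of one of these algorithms, k-means from the RDD-based MLlib API (the toy points and parameters are assumptions for illustration):

    from pyspark.mllib.clustering import KMeans

    # Hypothetical RDD of feature vectors; in practice it would be parsed from HDFS.
    points = sc.parallelize([[0.0, 0.0], [0.1, 0.2], [9.0, 9.1], [9.2, 8.9]])

    model = KMeans.train(points, k=2, maxIterations=10)
    print(model.clusterCenters)          # the learned centroids
    print(model.predict([0.05, 0.1]))    # cluster index assigned to a new point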
MLlib: Read & Write
28

• Reading from HDFS into main memory
• Writing intermediate or final results to HDFS
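For example, a typical job reads its input from HDFS and writes results back; a minimal sketch (both paths are hypothetical):

    raw = sc.textFile("hdfs:///user/someuser/dataset.csv")         # HDFS -> main memory
    cleaned = raw.filter(lambda line: line.strip() != "")          # drop empty lines
    cleaned.saveAsTextFile("hdfs:///user/someuser/dataset_clean")  # main memory -> HDFS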


MLlib: Cache
29

• The cache() operation forces Spark to distribute the data and keep it in main memory across the cluster
• Important for reusing data efficiently, e.g. in iterative algorithms
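A short sketch of why cache() matters for iterative work (the file path and the statistic computed are illustrative assumptions):

    points = sc.textFile("hdfs:///user/someuser/points.txt") \
               .map(lambda line: [float(x) for x in line.split()]) \
               .cache()                      # keep the parsed data in distributed memory

    n = points.count()                       # first action: reads from HDFS, then caches
    for _ in range(10):                      # later passes read from memory, not disk
        mean_x = points.map(lambda p: p[0]).sum() / n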
MLlib: Web UI
30

Web UI (http://hadoop.ugr.es:8079/cluster)
MLlib: Web UI
31

12 workers
MLlib: Web UI
32

Running & completed applications




ML: API on top of DataFrame
34

"Spark ML standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow."

https://spark.apache.org/docs/latest/ml-guide.html
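A minimal sketch of such a pipeline (the tiny training DataFrame and the chosen stages are assumptions, following the pattern in the guide linked above):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("ml-pipeline-example").getOrCreate()
    training = spark.createDataFrame(
        [("spark is fast", 1.0), ("hadoop uses disk", 0.0)], ["text", "label"])

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashing_tf = HashingTF(inputCol="words", outputCol="features")
    lr = LogisticRegression(maxIter=10)

    # The pipeline chains the three stages into a single workflow.
    model = Pipeline(stages=[tokenizer, hashing_tf, lr]).fit(training)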
ML: API on top of DataFrame
35

"A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood."

http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes
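A brief PySpark sketch of working with named columns (the JSON path and its fields are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dataframe-example").getOrCreate()

    df = spark.read.json("hdfs:///user/someuser/people.json")   # rows with named columns
    df.printSchema()                                             # column names and types
    df.select("name", "age").filter(df["age"] > 21).show()      # column-based operations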
ML: API on top of DataFrame
36

SQL Queries

• SQL queries are written as plain strings
• The result is returned as a new DataFrame

http://spark.apache.org/docs/latest/sql-programming-guide.html#running-sql-queries-programmatically
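A minimal sketch, reusing the hypothetical DataFrame `df` and SparkSession `spark` from the previous example:

    df.createOrReplaceTempView("people")                               # expose df to SQL
    adults = spark.sql("SELECT name, age FROM people WHERE age > 21")  # query as a string
    adults.show()                                                      # the result is a new DataFrame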
ML: API on top of DataFrame
37

Interoperating with RDDs

• An RDD can be transformed into a DataFrame
• Built like an RDD, but with column names attached

http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
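A small sketch of going from an RDD to a DataFrame (the records and column names are made up; assumes the SparkSession `spark` from the earlier examples):

    from pyspark.sql import Row

    rdd = spark.sparkContext.parallelize([("Ana", 34), ("Luis", 28)])
    people_df = rdd.map(lambda p: Row(name=p[0], age=p[1])).toDF()   # attach column names
    # Equivalent: spark.createDataFrame(rdd, ["name", "age"])
    people_df.show()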


SparkR & PySpark
39

PySpark: https://spark.apache.org/docs/latest/api/python/index.html

SparkR: https://spark.apache.org/docs/latest/sparkr.html
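A minimal PySpark program (the application name is arbitrary); SparkR offers the analogous DataFrame API from R:

    # Run interactively with `pyspark` or submit with `spark-submit script.py`.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pyspark-example").getOrCreate()
    df = spark.range(1000)        # DataFrame with a single "id" column
    print(df.count())
    spark.stop()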
FIRST STEPS ON APACHE SPARK
BIG DATA II Jesús Maillo ([email protected]) 14/03/2019
