Databricks On AWS 01 Getting Started Apache Spark Slides
About Your Instructor
Apache Spark - Genesis and Open Source
VISION: Accelerate innovation by unifying data science, engineering, and business
With Transactions
Apache Spark
HOW TO PROCESS LOTS OF DATA?
M&Ms
Spark Cluster
Spark APIs
● RDD
● DataFrame
● Dataset
RDD
Resilient: Fault-tolerant
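Fault tolerance in an RDD comes from lineage: each dataset remembers how it was derived, so a lost partition can be recomputed from its parent instead of being restored from a replica. The following is a minimal pure-Python sketch of that idea, not Spark code; the `Partitioned` class and its methods are hypothetical names invented for illustration.

```python
class Partitioned:
    """Hypothetical toy dataset that keeps its parent and the function that derived it."""

    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = list(partitions)  # each partition is a list, or None if lost
        self.parent = parent                # lineage: the dataset this one came from
        self.fn = fn                        # lineage: the function applied per element

    def map(self, fn):
        derived = [[fn(x) for x in part] for part in self.partitions]
        return Partitioned(derived, parent=self, fn=fn)

    def get(self, i):
        if self.partitions[i] is None:
            # Rebuild only the lost partition from the parent's matching partition
            self.partitions[i] = [self.fn(x) for x in self.parent.get(i)]
        return self.partitions[i]


base = Partitioned([[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
doubled.partitions[1] = None   # simulate losing one partition
recovered = doubled.get(1)     # recomputed from base via the recorded function
```

Only the missing partition is recomputed; the intact partition of `doubled` is left untouched.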
Transformations and Actions
Transformations    Actions
Filter             Count
Sample             Take
Union              Collect
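The distinction matters because transformations are lazy: they only describe a new dataset, while actions trigger the computation and return a result to the driver. The following is a minimal pure-Python sketch of that behavior, not Spark code; the `LazyDataset` class is a hypothetical stand-in whose methods are modeled on the RDD operations `filter`, `union`, `count`, `take`, and `collect`.

```python
from itertools import chain, islice


class LazyDataset:
    """Hypothetical stand-in for an RDD: records a recipe, runs it only on an action."""

    def __init__(self, make_iter):
        # make_iter is a zero-argument function that returns a fresh iterator
        self._make_iter = make_iter

    @classmethod
    def from_list(cls, items):
        return cls(lambda: iter(items))

    # Transformations: return a new LazyDataset and compute nothing yet
    def filter(self, pred):
        return LazyDataset(lambda: (x for x in self._make_iter() if pred(x)))

    def union(self, other):
        return LazyDataset(lambda: chain(self._make_iter(), other._make_iter()))

    # Actions: run the recipe and return a plain Python value
    def count(self):
        return sum(1 for _ in self._make_iter())

    def take(self, n):
        return list(islice(self._make_iter(), n))

    def collect(self):
        return list(self._make_iter())


calls = []  # records every time the predicate actually runs


def is_even(x):
    calls.append(x)
    return x % 2 == 0


evens = LazyDataset.from_list([1, 2, 3, 4]).filter(is_even)
calls_before_action = len(calls)  # filter has not evaluated the predicate yet
result = evens.union(LazyDataset.from_list([10])).collect()
```

In this sketch the predicate runs only once `collect` is called, which mirrors how Spark defers work until an action is requested.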
DataFrame
Datasets
DataFrame vs. Dataset
DATAFRAMES
Why Switch to DataFrames?
● User-friendly API
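# RDD: average age per name (assumes dataRDD holds (name, age) pairs)
(dataRDD.map(lambda kv: (kv[0], (kv[1], 1)))              # (name, (age, 1))
    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))  # (name, (sum, count))
    .mapValues(lambda s: s[0] / s[1]))                     # (name, average)
# DataFrame: the same aggregation
from pyspark.sql.functions import avg
dataDF = dataRDD.toDF(["name", "age"])
dataDF.groupBy("name").agg(avg("age"))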
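To see what the RDD version is doing by hand, the same (sum, count) pattern can be traced in plain Python without a cluster. This is a sketch under assumptions: the sample records and the helper name `average_by_key` are made up for illustration, and a dictionary update stands in for Spark's `reduceByKey`.

```python
def average_by_key(pairs):
    """Average the values for each key using the (sum, count) pattern."""
    totals = {}
    for key, value in pairs:
        mapped = (value, 1)  # map step: value -> (value, 1)
        if key in totals:
            s, c = totals[key]
            totals[key] = (s + mapped[0], c + mapped[1])  # reduce step: add sums and counts
        else:
            totals[key] = mapped
    # final step: divide each sum by its count
    return {key: s / c for key, (s, c) in totals.items()}


records = [("ann", 30), ("bob", 40), ("ann", 50)]  # made-up sample data
averages = average_by_key(records)
```

The DataFrame version expresses the same result as a single `groupBy` plus `avg`, which is the readability argument the slide is making.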
Why Switch to DataFrames?
● User-friendly API
Benefits:
■ SQL/DataFrame queries
■ Tungsten and Catalyst optimizations
■ Uniform APIs across languages
Catalyst: Under the Hood
Still Not Convinced?
Structured APIs in Spark
WHY SWITCH FROM MAPREDUCE TO SPARK?
Spark vs. MapReduce
When to Use Spark
Spark References
● Databricks
● Apache Spark ML Programming Guide
● Scala API Docs
● Python API Docs
● Spark Key Terms
Questions?