
Databricks On AWS 01 Getting Started Apache Spark Slides

This document provides an introduction and overview of Apache Spark. It discusses how Spark was created at UC Berkeley to bring data and machine learning together, describes Spark's unified analytics engine with APIs for SQL, streaming, machine learning, and graph processing, and summarizes the benefits of using DataFrames in Spark, such as optimized performance and a more user-friendly API compared to lower-level RDDs.


Getting Started with Apache Spark


Welcome and Housekeeping

● You should have received instructions on how to participate in the training session
● If you have questions, you can use the Q&A window in GoToWebinar
● The slides, as well as a recording of the session, will be made available to you after the event
About Your Instructor

Doug Bateman is Director of Training and Education at Databricks. Prior to this role, he was Director of Training at NewCircle.
Apache Spark - Genesis and Open Source

● Spark was originally created at the AMPLab at UC Berkeley. The original creators went on to found Databricks.
● Spark was created to bring data and machine learning together.
● Spark was donated to the Apache Software Foundation, creating the Apache Spark open source project.
VISION: Accelerate innovation by unifying data science, engineering and business

SOLUTION: Unified Analytics Platform

WHO WE ARE:
● Original creators of Apache Spark
● 2000+ global companies use our platform across the big data & machine learning lifecycle

Introducing Delta Lake

A New Standard for Building Data Lakes

● Open format based on Parquet
● With transactions
● Apache Spark APIs
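
A minimal sketch of what this looks like in practice, assuming an existing SparkSession named spark, Delta Lake installed on the cluster, and /tmp/events as a writable path (all three are illustrative assumptions, not from the slides):

df = spark.range(100)                                            # hypothetical example data
df.write.format("delta").mode("overwrite").save("/tmp/events")   # transactional, Parquet-based write
events = spark.read.format("delta").load("/tmp/events")          # read it back through the ordinary Spark API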


Apache Spark - A Unified Analytics Engine

Apache Spark

“Unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing”

● Research project at UC Berkeley in 2009
● APIs: Scala, Java, Python, R, and SQL
● Built by more than 1,200 developers from more than 200 companies
HOW TO PROCESS LOTS OF DATA?
M&Ms

Spark Cluster

One Driver and many Executor JVMs
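
As a minimal sketch (assuming a local PySpark installation; the app name and thread count are illustrative), the SparkSession below lives in the driver, while the work is farmed out to executors, simulated here by local threads:

from pyspark.sql import SparkSession

# The driver hosts the SparkSession; "local[4]" stands in for a real
# cluster manager URL (e.g. YARN) and runs 4 executor threads in-process.
spark = (SparkSession.builder
         .appName("GettingStarted")   # illustrative app name
         .master("local[4]")
         .getOrCreate())

print(spark.sparkContext.defaultParallelism)  # available parallel task slots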

Spark APIs

● RDD
● DataFrame
● Dataset

RDD

Resilient: Fault-tolerant

Distributed: Computed across multiple nodes

Dataset: Collection of partitioned data

● Immutable once constructed
● Tracks lineage information
● Operations on a collection of elements in parallel
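
A small sketch of these properties (reusing the spark session from the earlier sketch): each transformation returns a new, immutable RDD, and the lineage is recorded so lost partitions can be recomputed:

rdd = spark.sparkContext.parallelize(range(10), 4)  # 4 partitions
doubled = rdd.map(lambda x: x * 2)                  # new RDD; rdd itself is unchanged
evens = doubled.filter(lambda x: x % 4 == 0)
print(evens.toDebugString())                        # the recorded lineage
print(evens.collect())                              # [0, 4, 8, 12, 16]
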
Transformations and Actions

Transformations    Actions
filter             count
sample             take
union              collect
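
A brief sketch distinguishing the two (the sample data is illustrative): transformations are lazy and only describe the computation, while actions trigger it:

nums = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
extra = spark.sparkContext.parallelize([6, 7])

# Transformations: lazy, nothing executes yet
odds = nums.filter(lambda x: x % 2 == 1)
sampled = nums.sample(False, 0.5)     # random ~50% sample, without replacement
combined = odds.union(extra)

# Actions: these trigger the actual job
print(combined.count())               # 5
print(combined.take(2))               # [1, 3]
print(combined.collect())             # [1, 3, 5, 6, 7]
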
DataFrame

Data with columns (built on RDDs)

Improved performance via optimizations
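
A minimal sketch (the names and ages are illustrative) of creating a DataFrame directly and working with its columns:

df = spark.createDataFrame([("Jim", 20), ("Anne", 31)], ["name", "age"])
df.printSchema()                  # name: string, age: long
df.filter(df.age > 25).show()     # only Anne's row
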
Datasets

A type-safe, object-oriented API available in Scala and Java; a DataFrame is a Dataset of Row objects
DataFrame vs. Dataset

DataFrames are untyped (column types are checked at runtime); Datasets add compile-time type safety, at the cost of being limited to Scala and Java

DATAFRAMES
Why Switch to DataFrames?

● User-friendly API

dataRDD = sc.parallelize([("Jim", 20), ("Anne", 31), ("Jim", 30)])  # sc is the SparkContext provided by the shell

# RDD: average age per name, tracked by hand as (sum, count) pairs
avgByName = (dataRDD.map(lambda kv: (kv[0], (kv[1], 1)))
             .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
             .map(lambda kv: (kv[0], kv[1][0] / kv[1][1])))

# DataFrame: the same computation, expressed declaratively
from pyspark.sql.functions import avg

dataDF = dataRDD.toDF(["name", "age"])
avgByNameDF = dataDF.groupBy("name").agg(avg("age"))

Why Switch to DataFrames?

● User-friendly API

Benefits:

■ SQL/DataFrame queries (see the sketch below)
■ Tungsten and Catalyst optimizations
■ Uniform APIs across languages
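
A minimal sketch of the SQL route, assuming the dataDF defined on the previous slide: the same aggregation can be written as SQL against a temporary view, and both forms go through the same optimizer:

dataDF.createOrReplaceTempView("people")
spark.sql("SELECT name, AVG(age) AS avg_age FROM people GROUP BY name").show()
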
Why Switch to DataFrames?

DataFrame operations act as a wrapper that builds a logical plan
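
One way to peek at that plan (again assuming the dataDF and avg import from earlier): explain(True) prints the parsed, analyzed, and optimized logical plans along with the physical plan:

dataDF.groupBy("name").agg(avg("age")).explain(True)
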
Catalyst: Under the Hood

Still Not Convinced?

Structured APIs in Spark

WHY SWITCH FROM MAPREDUCE TO SPARK?
Spark vs. MapReduce

Spark keeps intermediate results in memory rather than writing them to disk between stages, which makes iterative and interactive workloads far faster than MapReduce
When to Use Spark

● Scale out: Model or data too large to process on a single machine
● Speed up: Benefit from faster results
Spark References

● Databricks
● Apache Spark ML Programming Guide
● Scala API Docs
● Python API Docs
● Spark Key Terms

Questions?

Further Training Options: http://bit.ly/DBTrng

● Live Onsite Training
● Live Online
● Self Paced

Meet one of our Spark experts: http://bit.ly/ContactUsDB
