
Databricks On AWS 01 Getting Started Apache Spark Slides

This document provides an introduction and overview of Apache Spark. It discusses how Spark was created at UC Berkeley to bring data and machine learning together, describes Spark's unified analytics engine with APIs for SQL, streaming, machine learning, and graph processing, and summarizes the benefits of using DataFrames in Spark, such as optimized performance and a more user-friendly API compared to lower-level RDDs.


Getting Started with Apache Spark


Welcome and Housekeeping

● You should have received instructions on how to participate in the training session
● If you have questions, you can use the Q&A window in GoToWebinar
● The slides, as well as a recording of the session, will be made available to you after the event
About Your Instructor

Doug Bateman is Director of Training and Education at Databricks. Prior to this role, he was Director of Training at NewCircle.
Apache Spark - Genesis and Open Source

● Spark was originally created at the AMPLab at UC Berkeley. The original creators went on to found Databricks.
● Spark was created to bring data and machine learning together.
● Spark was donated to the Apache Software Foundation, creating the Apache Spark open source project.
VISION: Accelerate innovation by unifying data science, engineering and business

SOLUTION: Unified Analytics Platform

WHO WE ARE:
● Original creators of Apache Spark
● 2000+ global companies use our platform across the big data & machine learning lifecycle

Introducing Delta Lake

A New Standard for Building Data Lakes

● Open format based on Parquet
● With transactions
● Apache Spark APIs
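
A minimal sketch of what this looks like in practice, assuming an existing SparkSession named spark, Delta Lake installed on the cluster, and /tmp/events as a writable path (all three are illustrative assumptions, not from the slides):

df = spark.range(100)                                            # hypothetical example data
df.write.format("delta").mode("overwrite").save("/tmp/events")   # transactional, Parquet-based write
events = spark.read.format("delta").load("/tmp/events")          # read it back through the ordinary Spark API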


Apache Spark - A Unified Analytics Engine

Apache Spark

“Unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing”

● Research project at UC Berkeley in 2009
● APIs: Scala, Java, Python, R, and SQL
● Built by more than 1,200 developers from more than 200 companies
HOW TO PROCESS LOTS OF DATA?
M&Ms

Spark Cluster

One Driver and many Executor JVMs
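
As a minimal sketch (assuming a local PySpark installation; the app name and thread count are illustrative), the SparkSession below lives in the driver, while the work is farmed out to executors, simulated here by local threads:

from pyspark.sql import SparkSession

# The driver hosts the SparkSession; "local[4]" stands in for a real
# cluster manager URL (e.g. YARN) and runs 4 executor threads in-process.
spark = (SparkSession.builder
         .appName("GettingStarted")   # illustrative app name
         .master("local[4]")
         .getOrCreate())

print(spark.sparkContext.defaultParallelism)  # available parallel task slots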

Spark APIs

● RDD
● DataFrame
● Dataset

RDD

Resilient: Fault-tolerant

Distributed: Computed across multiple nodes

Dataset: Collection of partitioned data

● Immutable once constructed
● Tracks lineage information
● Operations on a collection of elements in parallel
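
A small sketch of these properties (reusing the spark session from the earlier sketch): each transformation returns a new, immutable RDD, and the lineage is recorded so lost partitions can be recomputed:

rdd = spark.sparkContext.parallelize(range(10), 4)  # 4 partitions
doubled = rdd.map(lambda x: x * 2)                  # new RDD; rdd itself is unchanged
evens = doubled.filter(lambda x: x % 4 == 0)
print(evens.toDebugString())                        # the recorded lineage
print(evens.collect())                              # [0, 4, 8, 12, 16]
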
Transformations and Actions

Transformations    Actions
filter             count
sample             take
union              collect
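
A brief sketch distinguishing the two (the sample data is illustrative): transformations are lazy and only describe the computation, while actions trigger it:

nums = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
extra = spark.sparkContext.parallelize([6, 7])

# Transformations: lazy, nothing executes yet
odds = nums.filter(lambda x: x % 2 == 1)
sampled = nums.sample(False, 0.5)     # random ~50% sample, without replacement
combined = odds.union(extra)

# Actions: these trigger the actual job
print(combined.count())               # 5
print(combined.take(2))               # [1, 3]
print(combined.collect())             # [1, 3, 5, 6, 7]
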
DataFrame

Data with columns (built on RDDs)

Improved performance via optimizations
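
A minimal sketch (the names and ages are illustrative) of creating a DataFrame directly and working with its columns:

df = spark.createDataFrame([("Jim", 20), ("Anne", 31)], ["name", "age"])
df.printSchema()                  # name: string, age: long
df.filter(df.age > 25).show()     # only Anne's row
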
Datasets

A type-safe, object-oriented API available in Scala and Java; a DataFrame is a Dataset of Row objects
DataFrame vs. Dataset

DataFrames are untyped (column types are checked at runtime); Datasets add compile-time type safety, at the cost of being limited to Scala and Java

DATAFRAMES
Why Switch to DataFrames?

● User-friendly API

dataRDD = sc.parallelize([("Jim", 20), ("Anne", 31), ("Jim", 30)])  # sc is the SparkContext provided by the shell

# RDD: average age per name, tracked by hand as (sum, count) pairs
avgByName = (dataRDD.map(lambda kv: (kv[0], (kv[1], 1)))
             .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
             .map(lambda kv: (kv[0], kv[1][0] / kv[1][1])))

# DataFrame: the same computation, expressed declaratively
from pyspark.sql.functions import avg

dataDF = dataRDD.toDF(["name", "age"])
avgByNameDF = dataDF.groupBy("name").agg(avg("age"))

Why Switch to DataFrames?

● User-friendly API

Benefits:

■ SQL/DataFrame queries (see the sketch below)
■ Tungsten and Catalyst optimizations
■ Uniform APIs across languages
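
A minimal sketch of the SQL route, assuming the dataDF defined on the previous slide: the same aggregation can be written as SQL against a temporary view, and both forms go through the same optimizer:

dataDF.createOrReplaceTempView("people")
spark.sql("SELECT name, AVG(age) AS avg_age FROM people GROUP BY name").show()
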
Why Switch to DataFrames?

DataFrame operations act as a wrapper that builds a logical plan
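
One way to peek at that plan (again assuming the dataDF and avg import from earlier): explain(True) prints the parsed, analyzed, and optimized logical plans along with the physical plan:

dataDF.groupBy("name").agg(avg("age")).explain(True)
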
Catalyst: Under the Hood

Still Not Convinced?

Structured APIs in Spark

WHY SWITCH FROM MAPREDUCE TO SPARK?
Spark vs. MapReduce

Spark keeps intermediate results in memory rather than writing them to disk between stages, which makes iterative and interactive workloads far faster than MapReduce
When to Use Spark

● Scale out: Model or data too large to process on a single machine
● Speed up: Benefit from faster results
Spark References

● Databricks
● Apache Spark ML Programming Guide
● Scala API Docs
● Python API Docs
● Spark Key Terms

Questions?

Further Training Options: http://bit.ly/DBTrng

● Live Onsite Training
● Live Online
● Self Paced

Meet one of our Spark experts: http://bit.ly/ContactUsDB
