A Brief Introduction To Apache Spark
Apache Spark™
In the industry, there is a need for a general-purpose cluster computing tool, because each existing engine covers only one style of processing:
Hadoop MapReduce can only perform batch processing.
Apache Storm / S4 can only perform stream processing.
Apache Impala / Apache Tez can only perform interactive processing.
Neo4j / Apache Giraph can only perform graph processing.
Apache Spark is a powerful open-source engine that provides real-time stream
processing, interactive processing, graph processing, in-memory processing, and
batch processing, combining high speed, ease of use, and a standard
interface.
Hadoop vs. Spark - An Answer to the Wrong
Question
Spark is not, despite the hype, a replacement for Hadoop. Nor is MapReduce dead.
Spark can run on top of Hadoop, benefiting from Hadoop’s cluster manager (YARN) and
underlying storage (HDFS, HBase, etc.). Spark can also run completely separately from
Hadoop, integrating with alternative cluster managers like Mesos and alternative storage
platforms like Cassandra and Amazon S3.
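To make those deployment options concrete, here is a minimal Scala sketch of how an application selects a cluster manager through the master URL passed to SparkSession. The application name, the "local[*]" master, and the commented-out storage paths are placeholders chosen for illustration, not values taken from this document.

import org.apache.spark.sql.SparkSession

object ClusterManagerExample {
  def main(args: Array[String]): Unit = {
    // The master URL decides which cluster manager Spark talks to:
    // "yarn" for a Hadoop/YARN cluster, a "mesos://host:port" URL for Mesos,
    // or "local[*]" to run everything inside a single JVM for testing.
    val spark = SparkSession.builder()
      .appName("cluster-manager-demo")
      .master("local[*]")   // swap for "yarn" or a mesos:// URL on a real cluster
      .getOrCreate()

    // Storage is equally pluggable: the same read API works against HDFS, S3, etc.
    // (the paths below are placeholders, not real datasets)
    // val fromHdfs = spark.read.text("hdfs:///data/events.log")
    // val fromS3   = spark.read.text("s3a://my-bucket/events.log")

    spark.stop()
  }
}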
Earlier, Hadoop relied on MapReduce for the bulk of its data processing. Hadoop
MapReduce also managed scheduling and task allocation within the cluster; even
workloads that were not well suited to batch processing were forced through Hadoop's
MapReduce engine, adding complexity and reducing performance. MapReduce is really a
programming model. In Hadoop MapReduce, multiple MapReduce jobs would be strung
together to create a data pipeline. Between every stage of that pipeline, the MapReduce
code would read data from disk and, when finished, write the data back to
disk. This was inefficient because every stage of the process had to begin by reading
all of its data from disk.
Spark offers a far faster way to process data than passing it through unnecessary Hadoop
MapReduce processes.
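The difference is easiest to see in code. Below is a minimal Scala sketch of a multi-stage pipeline in which an intermediate result is cached in executor memory and reused by two later steps, instead of being written to disk between stages as a chain of MapReduce jobs would do. The input path, the "ERROR" filter, and the grouping key are hypothetical, used only for illustration.

import org.apache.spark.sql.SparkSession

object InMemoryPipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("in-memory-pipeline")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input path, for illustration only.
    val lines = sc.textFile("hdfs:///logs/access.log")

    // Each transformation is one step of the pipeline. Unlike a chain of
    // MapReduce jobs, intermediate results are not written back to HDFS
    // between steps; cache() pins the filtered data in executor memory so
    // that both actions below can reuse it.
    val errors = lines.filter(_.contains("ERROR")).cache()

    val errorCount = errors.count()          // first action materializes the cache
    val byPrefix = errors
      .map(line => (line.take(10), 1))       // crude grouping key, illustration only
      .reduceByKey(_ + _)                    // second pass reuses the cached data
    byPrefix.take(10).foreach(println)

    println(s"total errors: $errorCount")
    spark.stop()
  }
}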
The Spark Ecosystem
1. Apache Spark Core: All the functionality provided by Apache Spark
is built on top of Spark Core. It delivers speed through its in-memory
computation capability.
2. Apache Spark SQL: The Spark SQL component is a distributed framework
for structured data processing. With Spark SQL, Spark has more information
about the structure of the data and of the computation, which it can use to
optimize queries (see the sketch after this list).
3. Apache Spark Streaming: An add-on to the core Spark API that enables
scalable, high-throughput, fault-tolerant stream processing of live data
streams. Spark can ingest data from sources such as Kafka, Flume, Kinesis, or TCP
sockets.
4. MLlib: Spark's ability to keep data in memory and rapidly rerun
queries makes it well suited to training machine learning algorithms.
5. GraphX: Spark is also well suited to processing graph data.
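As a concrete illustration of item 2, here is a minimal Spark SQL sketch in Scala: the same question is asked once in SQL and once through the DataFrame API, and in both cases Spark can use its knowledge of the data's schema to plan the query. The tiny inline dataset, the view name, and the column names are made up for this example.

import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A tiny in-memory dataset stands in for a real structured table.
    val people = Seq(("alice", 34), ("bob", 29), ("carol", 41)).toDF("name", "age")

    // Because Spark SQL knows the schema, it can plan and optimize the query
    // instead of running opaque user functions over raw records.
    people.createOrReplaceTempView("people")
    val adultsSql = spark.sql("SELECT name FROM people WHERE age >= 30")

    // The same query expressed with the DataFrame API.
    val adultsDf = people.filter($"age" >= 30).select("name")

    adultsSql.show()
    adultsDf.show()

    spark.stop()
  }
}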
Spark Architecture
A Sample Program
Use Case: A Simple Word Count Program
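Below is a minimal word-count sketch in Scala using the RDD API: read a text file, split each line into words, and sum the occurrences of each word. The input path and the object name are placeholders; point the path at any text file reachable by the cluster.

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("word-count")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.textFile("input.txt")      // placeholder path
      .flatMap(_.split("\\s+"))                // split each line into words
      .filter(_.nonEmpty)                      // drop empty tokens
      .map(word => (word.toLowerCase, 1))      // pair each word with a count of 1
      .reduceByKey(_ + _)                      // sum the counts per word

    counts.take(20).foreach { case (word, n) => println(s"$word\t$n") }

    spark.stop()
  }
}

The same logic can be run interactively by pasting the body of main into spark-shell, or submitted to a cluster with spark-submit after packaging it as a jar.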
Question and Answer Session