
A Brief Introduction to Apache Spark
Apache Spark™

 Apache Spark is an open-source, distributed, general-purpose cluster computing framework with a (mostly) in-memory data processing engine. It can do ETL, analytics, machine learning, and graph processing on large volumes of data at rest (batch processing) or in motion (stream processing), with rich, concise, high-level APIs for Scala, Python, Java, R, and SQL.
 In contrast to Hadoop’s two-stage, disk-based MapReduce computation engine, Spark’s multi-stage (mostly) in-memory computing engine runs most computations in memory and hence, most of the time, provides better performance.
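
A minimal sketch of what those high-level APIs look like in Scala; the data/events.txt path and the "ERROR" filter are illustrative assumptions:

import org.apache.spark.sql.SparkSession

object SparkIntro {
  def main(args: Array[String]): Unit = {
    // Entry point to Spark's high-level Dataset/DataFrame API.
    val spark = SparkSession.builder()
      .appName("SparkIntro")
      .master("local[*]")   // run locally, one task slot per core
      .getOrCreate()

    // Read a text file and run a simple query over it.
    val lines = spark.read.textFile("data/events.txt")
    val errorCount = lines.filter(_.contains("ERROR")).count()
    println(s"ERROR lines: $errorCount")

    spark.stop()
  }
}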
Why Apache Spark

In the industry, there is a need for a general-purpose cluster computing tool, because:
 Hadoop MapReduce can only perform “batch processing”.
 Apache Storm / S4 can only perform “stream processing”.
 Apache Impala / Apache Tez can only perform “interactive processing”.
 Neo4j / Apache Giraph can only perform “graph processing”.

Apache Spark is a powerful open-source engine that provides real-time stream processing, interactive processing, graph processing, and in-memory processing as well as batch processing, with very high speed, ease of use, and a standard interface.
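
To make that “one engine for every workload” point concrete, here is a sketch in which the same groupBy/count query runs once as a batch job and once as a structured streaming job; the data/clicks/ directory, its JSON layout, and the userId column are assumptions made up for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("UnifiedEngine").master("local[*]").getOrCreate()

// Batch: read a directory of JSON files once and aggregate.
val batchDf = spark.read.json("data/clicks/")
batchDf.groupBy("userId").count().show()

// Streaming: watch the same directory for new files; the query shape is identical.
spark.readStream.schema(batchDf.schema).json("data/clicks/")
  .groupBy("userId").count()
  .writeStream
  .outputMode("complete")   // emit the full running aggregate on each trigger
  .format("console")
  .start()
  .awaitTermination()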
Hadoop vs. Spark - An Answer to the Wrong Question

Spark is not, despite the hype, a replacement for Hadoop, nor is MapReduce dead. Spark can run on top of Hadoop, benefiting from Hadoop’s cluster manager (YARN) and underlying storage (HDFS, HBase, etc.). Spark can also run completely separately from Hadoop, integrating with alternative cluster managers like Mesos and alternative storage platforms like Cassandra and Amazon S3.

Earlier, Hadoop relied on MapReduce for the bulk of its data processing. Hadoop MapReduce also managed scheduling and task allocation within the cluster; even workloads that were not well suited to batch processing were passed through Hadoop’s MapReduce engine, adding complexity and reducing performance. MapReduce is really a programming model. In Hadoop MapReduce, multiple MapReduce jobs would be strung together to create a data pipeline. Between every stage of that pipeline, the MapReduce code would read data from disk and, when completed, write it back to disk. This process was inefficient because all the data had to be read from disk at the beginning of each stage.

Spark offers a far faster way to process data than passing it through unnecessary Hadoop MapReduce stages.
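
As a sketch of that flexibility, the same application can target YARN, a standalone Spark cluster, or Mesos just by changing the master URL; the environment variable and host names below are assumptions (in practice the master is usually passed to spark-submit rather than hard-coded):

import org.apache.spark.sql.SparkSession

// Pick the cluster manager at launch time; defaults to a local run.
val masterUrl = sys.env.getOrElse("SPARK_MASTER", "local[*]")
// "yarn"                     -> run inside a Hadoop/YARN cluster
// "spark://master-host:7077" -> standalone Spark cluster, no Hadoop needed
// "mesos://master-host:5050" -> Mesos cluster
val spark = SparkSession.builder()
  .appName("PortableApp")
  .master(masterUrl)
  .getOrCreate()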
What Hadoop Gives Spark

 YARN resource manager, which takes responsibility for scheduling tasks across available nodes in the cluster;
 Distributed File System (HDFS), which stores data when the cluster runs out of free memory, and which persistently stores historical data when Spark is not running;
 Disaster Recovery capabilities, inherent to Hadoop, which enable recovery of data when individual nodes fail. These capabilities include basic (but reliable) data mirroring across the cluster and richer snapshot and mirroring capabilities such as those offered by the MapR Data Platform;
 Data Security, which becomes increasingly important as Spark tackles production workloads in regulated industries such as healthcare and financial services.
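
As a sketch of the storage point above, Spark running alongside Hadoop reads and writes HDFS paths directly; the namenode host and port, the paths, and the status column are assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("HdfsIO").getOrCreate()

// Read Parquet data stored persistently in HDFS.
val events = spark.read.parquet("hdfs://namenode:8020/warehouse/events/")

// Write a filtered copy back to HDFS for later runs.
events.filter(col("status") === "failed")
  .write.mode("overwrite")
  .parquet("hdfs://namenode:8020/warehouse/failed_events/")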
Apache Spark Characteristics

 Lightning-Fast Processing: Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and up to 10x faster even when running on disk.
 Usability: It lets you quickly write applications in Java, Scala, Python, and R.
 In-Memory Computing: Keeping data in the servers’ RAM makes accessing stored data fast.
 Variation in Data Processing: It can handle stream processing as well as batch processing.
 Compatibility: Spark is compatible with YARN as well as Mesos; it can also run via SIMR (Spark in MapReduce).
 Various Functionality: Spark ships with components such as Spark SQL, Spark Streaming, MLlib, GraphX, and SparkR.
 Lazy Evaluation: Also known as call-by-need or memoization. Spark waits until an action demands a final result before executing transformations, which saves significant time (see the sketch after this list).
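
A small sketch of lazy evaluation and in-memory computing together; the log file path is an assumption. Transformations only build an execution plan, and nothing runs until an action such as count() is called:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("LazyDemo").master("local[*]").getOrCreate()

val lines = spark.read.textFile("data/app.log")   // no job runs yet
val errors = lines.filter(_.contains("ERROR"))    // still no job: just a plan
errors.cache()                                    // mark the result for in-memory reuse

println(errors.count())   // first action: reads the file and caches the result
println(errors.count())   // second action: served from memory, no re-read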
Various Functionality of Apache Spark

 1. Apache Spark Core: All the functionality provided by Apache Spark is built on top of Spark Core. It delivers speed by providing in-memory computation capability.
 2. Apache Spark SQL: The Spark SQL component is a distributed framework for structured data processing. Using Spark SQL, Spark gets more information about the structure of the data and the computation (see the sketch after this list).
 3. Apache Spark Streaming: An add-on to the core Spark API that allows scalable, high-throughput, fault-tolerant stream processing of live data streams. Spark can ingest data from sources like Kafka, Flume, Kinesis, or a TCP socket.
 4. MLlib: Spark’s ability to keep data in memory and rapidly run repeated queries makes it well suited to training machine learning algorithms.
 5. GraphX: Spark is well suited to processing graph data as well.
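
A minimal Spark SQL sketch, using toy data made up for illustration: a DataFrame is registered as a temporary view and then queried with plain SQL:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SqlDemo").master("local[*]").getOrCreate()
import spark.implicits._

// Build a tiny DataFrame and expose it to SQL as a view.
val people = Seq(("Alice", 34), ("Bob", 29)).toDF("name", "age")
people.createOrReplaceTempView("people")

// Spark SQL now knows the structure of the data and can optimize the query.
spark.sql("SELECT name FROM people WHERE age > 30").show()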
Spark Architecture
[Architecture diagram in the original slide]
A Sample Program
 Use Case: a simple Word Count program, shown below.
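
A sketch of the classic word count using the RDD API; input and output paths are assumptions. Each line is split into words, each word is paired with a count of 1, and the counts are summed per word:

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCount").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    sc.textFile("data/input.txt")                // one record per line
      .flatMap(_.split("\\s+"))                  // split lines into words
      .map(word => (word, 1))                    // pair each word with a count of 1
      .reduceByKey(_ + _)                        // sum the counts per word
      .saveAsTextFile("data/wordcount-output")   // write (word, count) pairs

    spark.stop()
  }
}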
Question and Answer Session
