A Brief Introduction To Apache Spark
Apache Spark™
In the industry, there is a need for a general-purpose cluster computing tool, because each existing engine covers only one style of processing:
Hadoop MapReduce can only perform batch processing.
Apache Storm / S4 can only perform stream processing.
Apache Impala / Apache Tez can only perform interactive processing.
Neo4j / Apache Giraph can only perform graph processing.
Apache Spark is a powerful open-source engine that provides real-time stream
processing, interactive processing, graph processing, in-memory processing, and
batch processing, combining high speed, ease of use, and a standard
interface.
Hadoop vs. Spark - An Answer to the Wrong
Question
Spark is not, despite the hype, a replacement for Hadoop. Nor is MapReduce dead.
Spark can run on top of Hadoop, benefiting from Hadoop’s cluster manager (YARN) and
underlying storage (HDFS, HBase, etc.). Spark can also run completely separately from
Hadoop, integrating with alternative cluster managers like Mesos and alternative storage
platforms like Cassandra and Amazon S3.
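To make those deployment options concrete, here is a minimal Scala sketch of how an application selects a cluster manager through the master URL passed to SparkSession. The application name, the "local[*]" master, and the commented-out storage paths are placeholders chosen for illustration, not values taken from this document.

import org.apache.spark.sql.SparkSession

object ClusterManagerExample {
  def main(args: Array[String]): Unit = {
    // The master URL decides which cluster manager Spark talks to:
    // "yarn" for a Hadoop/YARN cluster, a "mesos://host:port" URL for Mesos,
    // or "local[*]" to run everything inside a single JVM for testing.
    val spark = SparkSession.builder()
      .appName("cluster-manager-demo")
      .master("local[*]")   // swap for "yarn" or a mesos:// URL on a real cluster
      .getOrCreate()

    // Storage is equally pluggable: the same read API works against HDFS, S3, etc.
    // (the paths below are placeholders, not real datasets)
    // val fromHdfs = spark.read.text("hdfs:///data/events.log")
    // val fromS3   = spark.read.text("s3a://my-bucket/events.log")

    spark.stop()
  }
}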
Earlier, Hadoop relied on MapReduce for the bulk of its data processing. Hadoop
MapReduce also managed scheduling and task allocation within the cluster; even
workloads that were not well suited to batch processing were forced through Hadoop's
MapReduce engine, adding complexity and reducing performance. MapReduce is really a
programming model. In Hadoop MapReduce, multiple MapReduce jobs would be strung
together to create a data pipeline. Between every stage of that pipeline, the MapReduce
code would read data from disk and, when finished, write the data back to
disk. This was inefficient because every stage of the process had to begin by reading
all of its data from disk.
Spark offers a far faster way to process data than passing it through unnecessary Hadoop
MapReduce processes.
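The difference is easiest to see in code. Below is a minimal Scala sketch of a multi-stage pipeline in which an intermediate result is cached in executor memory and reused by two later steps, instead of being written to disk between stages as a chain of MapReduce jobs would do. The input path, the "ERROR" filter, and the grouping key are hypothetical, used only for illustration.

import org.apache.spark.sql.SparkSession

object InMemoryPipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("in-memory-pipeline")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input path, for illustration only.
    val lines = sc.textFile("hdfs:///logs/access.log")

    // Each transformation is one step of the pipeline. Unlike a chain of
    // MapReduce jobs, intermediate results are not written back to HDFS
    // between steps; cache() pins the filtered data in executor memory so
    // that both actions below can reuse it.
    val errors = lines.filter(_.contains("ERROR")).cache()

    val errorCount = errors.count()          // first action materializes the cache
    val byPrefix = errors
      .map(line => (line.take(10), 1))       // crude grouping key, illustration only
      .reduceByKey(_ + _)                    // second pass reuses the cached data
    byPrefix.take(10).foreach(println)

    println(s"total errors: $errorCount")
    spark.stop()
  }
}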
The Spark Ecosystem
1. Apache Spark Core: All the functionality provided by Apache Spark
is built on top of Spark Core. It delivers speed through its in-memory
computation capability.
2. Apache Spark SQL: The Spark SQL component is a distributed framework
for structured data processing. With Spark SQL, Spark has more information
about the structure of the data and of the computation, which it can use to
optimize queries (see the sketch after this list).
3. Apache Spark Streaming: An add-on to the core Spark API that enables
scalable, high-throughput, fault-tolerant stream processing of live data
streams. Spark can ingest data from sources such as Kafka, Flume, Kinesis, or TCP
sockets.
4. MLlib: Spark's ability to keep data in memory and rapidly rerun
queries makes it well suited to training machine learning algorithms.
5. GraphX: Spark is also well suited to processing graph data.
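As a concrete illustration of item 2, here is a minimal Spark SQL sketch in Scala: the same question is asked once in SQL and once through the DataFrame API, and in both cases Spark can use its knowledge of the data's schema to plan the query. The tiny inline dataset, the view name, and the column names are made up for this example.

import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A tiny in-memory dataset stands in for a real structured table.
    val people = Seq(("alice", 34), ("bob", 29), ("carol", 41)).toDF("name", "age")

    // Because Spark SQL knows the schema, it can plan and optimize the query
    // instead of running opaque user functions over raw records.
    people.createOrReplaceTempView("people")
    val adultsSql = spark.sql("SELECT name FROM people WHERE age >= 30")

    // The same query expressed with the DataFrame API.
    val adultsDf = people.filter($"age" >= 30).select("name")

    adultsSql.show()
    adultsDf.show()

    spark.stop()
  }
}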
Spark Architecture
A Sample Program
Use Case: A Simple Word Count Program
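Below is a minimal word-count sketch in Scala using the RDD API: read a text file, split each line into words, and sum the occurrences of each word. The input path and the object name are placeholders; point the path at any text file reachable by the cluster.

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("word-count")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.textFile("input.txt")      // placeholder path
      .flatMap(_.split("\\s+"))                // split each line into words
      .filter(_.nonEmpty)                      // drop empty tokens
      .map(word => (word.toLowerCase, 1))      // pair each word with a count of 1
      .reduceByKey(_ + _)                      // sum the counts per word

    counts.take(20).foreach { case (word, n) => println(s"$word\t$n") }

    spark.stop()
  }
}

The same logic can be run interactively by pasting the body of main into spark-shell, or submitted to a cluster with spark-submit after packaging it as a jar.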
Question and Answer Session