Hadoop Vs Apache Spark
Hadoop and Spark are popular Apache projects in the big data ecosystem. Apache Spark is an open-source platform, based on the original Hadoop ecosystem. Here we come up with a comparative analysis between Hadoop and Apache Spark in terms of performance, storage, reliability, architecture, etc.
Hadoop Common: These are Java libraries and utilities required for running other Hadoop modules. These libraries provide OS-level and filesystem abstractions and contain the necessary Java files and scripts required to start and run Hadoop.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop - Architecture
[Architecture diagram: node status reports and resource requests exchanged through YARN]
Hadoop YARN
YARN (Yet Another Resource Negotiator), a central component in the Hadoop ecosystem, is a framework for job scheduling and
cluster resource management. The basic idea of YARN is to split up the functionalities of resource management and job
scheduling/monitoring into separate daemons.
Hadoop MapReduce
MapReduce is a programming model and an associated implementation for processing and generating large datasets with a parallel, distributed algorithm on a cluster. The mapper maps each input key/value pair to a set of intermediate pairs; the reducer takes these intermediate pairs and processes them to produce the required output values. Mappers run in parallel across the cluster's nodes, and reducers run on any available node as directed by YARN.
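The map/shuffle/reduce flow described above can be sketched in plain Python. This is a toy single-process simulation of the model, not the Hadoop API; all function names are illustrative:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit an intermediate (word, 1) pair for every word.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce phase: sum all counts for one intermediate key.
    return (word, sum(counts))

def map_reduce(lines):
    intermediate = [pair for line in lines for pair in mapper(line)]
    # Shuffle phase: group intermediate pairs by key, as the framework
    # does between the map and reduce stages.
    intermediate.sort(key=itemgetter(0))
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(intermediate, key=itemgetter(0))
    )

print(map_reduce(["big data big cluster", "data"]))
# → {'big': 2, 'cluster': 1, 'data': 2}
```

In a real cluster the mappers would run on many nodes in parallel and the shuffle would move intermediate data over the network.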
Hadoop - Advantages:
For organizations considering a big data implementation, Hadoop has many features that make it worth considering, including:
Flexibility – Hadoop enables businesses to easily access new data sources and tap into different types of data to
generate value from that data.
Scalability – Hadoop is a highly scalable storage platform, as it can store and distribute very large datasets across
hundreds of inexpensive commodity servers that operate in parallel.
Affordability – Hadoop is open source and runs on commodity hardware, including infrastructure from cloud providers like Amazon.
Fault-Tolerance – Automatic replication and the ability to handle failures at the application layer make Hadoop fault-tolerant by default.
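The replication behind that fault tolerance can be illustrated with a small simulation. This is only a sketch of the idea, not HDFS itself; the node names and five-node layout are made up:

```python
import random

REPLICATION_FACTOR = 3  # HDFS replicates each block 3 times by default

def place_replicas(blocks, nodes, rf=REPLICATION_FACTOR):
    # Store each block on `rf` distinct nodes, like HDFS block placement.
    return {block: random.sample(nodes, rf) for block in blocks}

def blocks_surviving(placement, failed_node):
    # A block survives as long as at least one replica is on a live node.
    return [block for block, replicas in placement.items()
            if any(node != failed_node for node in replicas)]

nodes = ["node0", "node1", "node2", "node3", "node4"]
placement = place_replicas(["blk_1", "blk_2", "blk_3"], nodes)

# With 3 replicas per block on distinct nodes, losing any single
# node loses no data.
assert blocks_surviving(placement, "node0") == ["blk_1", "blk_2", "blk_3"]
```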
Apache Spark - Overview
Apache Spark is a framework for data analytics on a distributed computing cluster. It provides in-memory computation, increasing processing speed over MapReduce. It utilizes the Hadoop Distributed File System (HDFS) and runs
on top of an existing Hadoop cluster. It can also process both structured data in Hive
and streaming data from different sources like HDFS, Flume, Kafka, and Twitter.
Spark Streaming
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams. Input data can come from many sources, such as web streams (TCP sockets), Flume, and Kafka, and can be processed using complex algorithms expressed with high-level functions like map, reduce, and join. Finally, processed data can be pushed out to filesystems (HDFS), databases, and live dashboards. Spark's graph processing and machine learning algorithms can also be applied to data streams.
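Spark Streaming works by discretizing the live stream into micro-batches and running the same batch-style operations on each. A toy pure-Python sketch of that micro-batch model (not the PySpark API; function names are illustrative):

```python
from collections import Counter

def micro_batches(events, batch_size):
    # Discretize the incoming stream into small batches, as Spark
    # Streaming slices a live stream into short time intervals.
    for i in range(0, len(events), batch_size):
        yield events[i:i + batch_size]

def running_word_count(stream, batch_size=3):
    # Each micro-batch is processed with the same map/reduce-style
    # logic used for stored data; totals accumulate across batches.
    totals = Counter()
    for batch in micro_batches(stream, batch_size):
        totals.update(word for line in batch for word in line.split())
    return dict(totals)

print(running_word_count(["a b", "a", "c", "b b"]))
# → {'a': 2, 'b': 3, 'c': 1}
```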
Spark SQL
Apache Spark provides a separate module, Spark SQL, for processing structured data. Spark SQL's interfaces give Spark detailed information about the structure of the data and of the computation being performed. Internally, Spark SQL uses this additional information to perform extra optimizations.
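One example of such an optimization is column pruning: because the engine knows the schema, it can scan only the columns a query touches. A toy pure-Python sketch of the idea over column-oriented data (illustrative only, not the Spark SQL API):

```python
# Column-oriented table: knowing the structure up front lets the
# engine read only the columns a query needs (column pruning).
table = {
    "name": ["ann", "bob", "cal"],
    "age":  [34, 41, 29],
    "city": ["oslo", "rome", "lima"],
}

def select(table, columns, predicate):
    # Only the requested columns are scanned; unrequested columns
    # (e.g. "city" in the query below) are never read.
    rows = zip(*(table[col] for col in columns))
    return [row for row in rows if predicate(dict(zip(columns, row)))]

print(select(table, ["name", "age"], lambda r: r["age"] > 30))
# → [('ann', 34), ('bob', 41)]
```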
Speed – Spark processes data in-memory, which can make applications run up to 100x faster than in Hadoop clusters, and up to 10x faster even when running on disk. Storing intermediate processing data in-memory also reduces the number of reads and writes to disk.
Ease of Use – Spark lets developers write applications in Java, Python, or Scala, so they can create and run their applications in programming languages they already know.
Combines SQL, Streaming and Complex Analytics – In addition to simple “map” and “reduce” operations, Spark supports
streaming data, SQL queries, and complex analytics such as graph algorithms and machine learning out-of-the-box.
Runs Everywhere – Spark runs on Hadoop, standalone, or in the cloud, and can access diverse data sources including HDFS, HBase, S3, and Cassandra.
Conclusion
Hadoop stores data on disk, whereas Spark stores data in-memory. Hadoop uses replication to achieve fault tolerance, whereas Spark uses resilient distributed datasets (RDDs), which provide fault tolerance while minimizing network I/O. Spark can run on
top of HDFS along with other Hadoop components. Spark has become one more data processing engine in Hadoop ecosystem,
providing more capabilities to Hadoop stack for businesses.
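The RDD approach mentioned above can be sketched in a few lines: instead of replicating data over the network, an RDD remembers the lineage of transformations that produced it, so a lost partition can be recomputed from its parent. A toy illustration, not the real Spark API:

```python
class RDD:
    # Toy lineage-based dataset; the real Spark RDD is lazy and distributed.
    def __init__(self, partitions=None, parent=None, fn=None):
        self.partitions = partitions       # materialized data; may be lost
        self.parent, self.fn = parent, fn  # lineage: how to rebuild it

    def map(self, fn):
        child = RDD(parent=self, fn=fn)
        child.partitions = [[fn(x) for x in p] for p in self.compute()]
        return child

    def compute(self):
        if self.partitions is not None:
            return self.partitions
        # Fault tolerance: recompute lost partitions from the lineage
        # instead of fetching replicas over the network.
        return [[self.fn(x) for x in p] for p in self.parent.compute()]

base = RDD(partitions=[[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
doubled.partitions = None        # simulate a node failure losing the data
print(doubled.compute())         # → [[2, 4], [6, 8]], rebuilt from lineage
```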
For developers, the main difference between the two lies in the programming model. Hadoop is a framework in which one writes MapReduce jobs using Java classes, whereas Spark is a library that enables writing code for parallel computation via function calls. For administrators running a cluster, there is an overlap in general skills, such as cluster management, code deployment, monitoring, and configuration.
Difficulty – MapReduce is difficult to program and needs abstractions, whereas Spark is easy to program and does not require any abstractions.
Streaming – Hadoop MapReduce only processes batches of large stored data, whereas Spark can process data in real time through Spark Streaming.
Performance – MapReduce does not leverage the memory of the Hadoop cluster to the maximum, whereas Spark has been said to execute batch processing jobs about 10 to 100 times faster than Hadoop MapReduce.
ACL Digital is a design-led leader in Digital Experience, Product Innovation, Engineering, and Enterprise IT offerings. From strategy to design, implementation, and management, we help accelerate innovation and transform businesses.
ACL Digital is a part of ALTEN group, a leader in technology consulting and engineering services.
Proprietary content. No content of this document can be reproduced without the prior written agreement of ACL Digital.
To know more about how ACL can partner with you to help create Digital Transformation, connect with: [email protected]