
Innovate, Digitize, Transform

Hadoop vs Apache Spark


Introduction

“Any sufficiently advanced technology is indistinguishable from magic,” said Arthur C. Clarke. Big data technologies and implementations are gaining traction and moving at a fast pace, with novel innovations happening in the space. Hadoop, NoSQL, MongoDB and Apache Spark are the buzzwords of big data technologies, as every individual leaves a digital trace of data that can be used for analysis. The future of the big data analytics market will revolve around the Internet of Things (IoT), sentiment analysis, trend analysis, the rise of sensor-driven wearables, and more.

Hadoop and Spark are popular Apache projects in the big data ecosystem. Apache Spark is an open-source platform, based on the original Hadoop MapReduce component of the Hadoop ecosystem. Here we present a comparative analysis of Hadoop and Apache Spark in terms of performance, storage, reliability, architecture, etc.

Comparative Analysis: Hadoop vs Apache Spark

Hadoop - Overview

Apache developed the Hadoop project as open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows distributed processing of large datasets across clusters of computers using simple programming models. Hadoop scales easily to multi-machine clusters, with each node offering local storage and computation. The Hadoop libraries are designed to detect failed nodes at the application layer and handle those failures themselves, which ensures high availability by default.

The project includes these modules:

Hadoop Common: The Java libraries and utilities required to run other Hadoop modules. These libraries provide OS-level and filesystem abstractions and contain the necessary Java files and scripts required to start and run Hadoop.

Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.

Hadoop YARN: A framework for job scheduling and cluster resource management.

Hadoop MapReduce: A YARN-based system for parallel processing of large datasets.

Hadoop - Architecture

Hadoop MapReduce, HDFS and YARN provide a scalable, fault-tolerant and distributed platform for the storage and processing of very large datasets across clusters of commodity computers. Hadoop uses the same set of nodes for data storage and for performing computations, which improves the performance of large-scale computations by co-locating computation with storage.
[Figure: YARN job execution. A client submits a job to the YARN ResourceManager; NodeManagers report node status, request resources and launch containers, which run the MapReduce Application Master alongside the Map and Reduce tasks.]

Hadoop Distributed File System – HDFS

HDFS is a distributed filesystem designed to store large volumes of data reliably. HDFS splits a single large file into blocks stored on different nodes across a cluster of commodity machines, overlaying the existing filesystem on each node. Data is stored in fine-grained blocks, with a default block size of 128 MB. HDFS also stores redundant copies of these data blocks on multiple nodes to ensure reliability and fault tolerance, making it a distributed, reliable and scalable file system.
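
To make the block and replication behaviour concrete, here is a minimal back-of-the-envelope sketch in Python. The 128 MB block size matches the default described above; the replication factor of 3 is an assumption (a common default, but set per cluster in practice).

# Back-of-the-envelope sketch of HDFS storage accounting.
# BLOCK_SIZE matches the 128 MB default described above; REPLICATION = 3
# is an assumed (but common) replication factor, configured per cluster.

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB default block size
REPLICATION = 3                  # assumed replication factor

def hdfs_footprint(file_size_bytes):
    """Return (block_count, raw_bytes_stored) for one file."""
    blocks = -(-file_size_bytes // BLOCK_SIZE)    # ceiling division
    return blocks, file_size_bytes * REPLICATION

blocks, raw = hdfs_footprint(1024 ** 3)           # a 1 GB file
print(blocks, raw)                                # 8 blocks, 3 GB of raw storage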

Hadoop YARN
YARN (Yet Another Resource Negotiator), a central component in the Hadoop ecosystem, is a framework for job scheduling and
cluster resource management. The basic idea of YARN is to split up the functionalities of resource management and job
scheduling/monitoring into separate daemons.

Hadoop MapReduce
MapReduce is a programming model and an associated implementation for processing and generating large datasets with a parallel, distributed algorithm on a cluster. The Mapper maps each input key/value pair to a set of intermediate pairs; the Reducer takes those intermediate pairs and processes them to output the required values. Mappers run in parallel across the cluster, and Reducers run on any available node as directed by YARN.
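
As an illustration of the model, here is the classic word count written for Hadoop Streaming in Python. This is a minimal sketch, not the whitepaper's own example: Hadoop Streaming pipes text through the scripts on stdin/stdout and sorts mapper output by key before it reaches the reducer.

# mapper.py -- reads raw text on stdin, emits one "word<TAB>1" pair per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py (a separate file) -- Hadoop Streaming sorts mapper output by key,
# so all pairs for one word arrive contiguously and can be summed in one pass.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")

The two scripts would be submitted with the hadoop-streaming JAR that ships with a Hadoop distribution, passing HDFS input and output paths on the command line; the exact JAR path varies by installation.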

Hadoop - Advantages
For organizations considering a big data implementation, Hadoop has many features that make it worth considering, including:

Flexibility – Hadoop enables businesses to easily access new data sources and tap into different types of data to generate value from that data.

Scalability – Hadoop is a highly scalable storage platform: it can store and distribute very large datasets across hundreds of inexpensive commodity servers operating in parallel.

Affordability – Hadoop is open source and runs on commodity hardware, including infrastructure from cloud providers such as Amazon.

Fault-Tolerance – Automatic replication and the ability to handle failures at the application layer make Hadoop fault-tolerant by default.
Apache Spark - Overview
Apache Spark is a framework for running data analytics on a distributed computing cluster. It provides in-memory computation, increasing the speed of data processing over MapReduce. It utilizes the Hadoop Distributed File System (HDFS) and runs on top of an existing Hadoop cluster. It can also process both structured data in Hive and streaming data from different sources such as HDFS, Flume, Kafka, and Twitter.
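
As a minimal sketch of the "runs on top of an existing Hadoop cluster" point, the following PySpark snippet starts a session against YARN and reads a file directly from HDFS. The application name and the hdfs:// path are illustrative placeholders.

# Minimal PySpark sketch: Spark submitted to the existing Hadoop/YARN cluster,
# reading input directly from HDFS. The path below is a placeholder.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hadoop-vs-spark-demo")   # illustrative name
         .master("yarn")                    # reuse the existing Hadoop cluster
         .getOrCreate())

logs = spark.read.text("hdfs:///data/logs/access.log")
print(logs.count())                         # actions run as YARN containers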

Spark Streaming
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams. Input data can come from sources such as web streams (TCP sockets), Flume and Kafka, and can be processed using complex algorithms expressed with high-level functions like map, reduce and join. Finally, processed data can be pushed out to filesystems (HDFS), databases and live dashboards. Spark's graph processing algorithms and machine learning can also be applied to data streams.
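
A minimal sketch of that pipeline, using the classic DStream word count: it consumes a TCP text stream, applies map/reduce-style transformations, and prints per-batch counts. The host, port and batch interval are illustrative.

# Sketch of a Spark Streaming job: read a TCP socket stream, count words per
# 5-second micro-batch, and print the results (a stand-in for a dashboard).
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-wordcount")
ssc = StreamingContext(sc, batchDuration=5)        # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)    # illustrative host/port
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()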

Spark SQL
Apache Spark provides a separate module, Spark SQL, for processing structured data. Spark SQL's interfaces carry detailed information about the structure of the data and the computation being performed, and internally Spark SQL uses this additional information to perform extra optimizations.
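
A minimal sketch of that workflow: build a small DataFrame, register it as a temporary view, and query it with SQL. The sample rows and names are illustrative; the schema attached to the DataFrame is what lets Spark SQL optimize the query.

# Minimal Spark SQL sketch: the schema attached to the DataFrame is the
# "additional information" Spark SQL uses to optimize the query plan.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

people = spark.createDataFrame(
    [("alice", 34), ("bob", 29)],
    ["name", "age"])                     # illustrative rows and column names
people.createOrReplaceTempView("people")

spark.sql("SELECT name FROM people WHERE age > 30").show()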

Datasets and DataFrames

A Dataset in Apache Spark is a distributed collection of data. A Dataset provides the benefits of RDDs while also utilizing Spark SQL's optimized execution engine. A Dataset can be constructed from objects and then manipulated using functional transformations.

A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a relational database table or an R/Python data frame, but with richer optimizations under the hood. A DataFrame can be constructed from various data sources, such as structured data files, Hive tables, external databases or existing RDDs.
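
Two of those construction routes in a minimal PySpark sketch: building a DataFrame from an existing RDD, and loading one from a structured data file. The rows and the hdfs:// path are illustrative.

# Two construction routes for a DataFrame: from an existing RDD of Rows,
# and from a structured data file. Data and path are placeholders.
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

rdd = spark.sparkContext.parallelize(
    [Row(name="alice", dept="eng"), Row(name="bob", dept="sales")])
df_from_rdd = spark.createDataFrame(rdd)          # schema inferred from Rows

df_from_file = spark.read.json("hdfs:///data/employees.json")  # schema inferred

df_from_rdd.groupBy("dept").count().show()        # named columns drive queries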

Resilient Distributed Datasets (RDDs)

Spark works on fault-tolerant collections of elements that can be operated on in parallel, a concept called the resilient distributed dataset (RDD). RDDs can be created in two ways: by parallelizing an existing collection in the driver program, or by referencing a dataset in an external storage system such as a shared filesystem, HDFS or HBase.
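
Both creation routes in a minimal sketch (the HDFS path is illustrative):

# The two RDD creation routes: parallelize a driver-side collection, or
# reference a dataset in external storage such as HDFS (path is a placeholder).
from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")

squares = sc.parallelize(range(10)).map(lambda x: x * x)   # route 1
print(squares.collect())

lines = sc.textFile("hdfs:///data/logs/access.log")        # route 2
print(lines.count())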
Apache Spark - Advantages

Speed – Spark processes data in-memory, running workloads up to 100x faster than applications on Hadoop clusters, and up to 10x faster even when running on disk. Keeping intermediate processing data in-memory also reduces the number of reads and writes to disk (see the caching sketch after this list).

Ease of Use – Spark lets developers write applications in Java, Python or Scala, so they can create and run applications in the programming languages they are familiar with.

Combines SQL, Streaming and Complex Analytics – In addition to simple "map" and "reduce" operations, Spark supports streaming data, SQL queries, and complex analytics such as graph algorithms and machine learning out of the box.

Runs Everywhere – Spark runs on Hadoop, standalone, or in the cloud, and can access diverse data sources including HDFS, HBase, S3 and Cassandra.
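
The in-memory reuse behind the speed claim can be shown in a short sketch: cache() keeps a dataset in executor memory, so subsequent actions skip the disk read. The path is illustrative.

# Sketch of in-memory reuse: cache() keeps the parsed dataset in executor
# memory, so the second action avoids re-reading and re-parsing from disk.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

events = spark.read.json("hdfs:///data/events.json").cache()  # placeholder path

events.count()                              # first action: reads disk, fills cache
events.filter("status = 'error'").count()   # second action: served from memory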
Conclusion
Hadoop stores data on disk, whereas Spark stores data in-memory. Hadoop uses replication to achieve fault tolerance, whereas Spark uses resilient distributed datasets (RDDs), which guarantee fault tolerance while minimizing network I/O. Spark can run on top of HDFS alongside other Hadoop components, and has become one more data processing engine in the Hadoop ecosystem, adding capabilities to the Hadoop stack for businesses.

For developers, the difference lies mainly in the programming model: Hadoop is a framework in which one writes MapReduce jobs using Java classes, whereas Spark is a library that enables writing parallel computations via function calls. For administrators running a cluster, there is an overlap in general skills, such as cluster management, code deployment, monitoring and configuration.

Here is a quick comparison before concluding.

Aspects | Hadoop | Apache Spark
Difficulty | MapReduce is difficult to program and needs abstractions. | Spark is easy to program and does not require any abstractions.
Interactive Mode | There is no built-in interactive mode, except via Pig and Hive. | It has an interactive mode.
Streaming | Hadoop MapReduce only processes batches of stored data. | Spark can process data in real time through Spark Streaming.
Performance | MapReduce does not leverage the memory of the Hadoop cluster to the maximum. | Spark has been reported to execute batch processing jobs about 10 to 100 times faster than Hadoop MapReduce.
Latency | MapReduce is completely disk-oriented. | Spark ensures lower-latency computations by caching partial results in memory across its distributed workers.
Ease of Coding | Writing Hadoop MapReduce pipelines is a complex and lengthy process. | Writing Spark code is always more compact.

ACL Digital is a design-led leader in Digital Experience, Product Innovation, Engineering and Enterprise IT offerings. From strategy to design, implementation and management, we help accelerate innovation and transform businesses.
ACL Digital is a part of ALTEN group, a leader in technology consulting and engineering services.

Proprietary content. No content of this document can be reproduced without the prior written agreement of ACL Digital.

To know more about how ACL can partner with you to help create Digital Transformation, connect with: [email protected]

www.acldigital.com USA | UK | France | India

© ACL Digital. All rights Reserved.
