A Comparative Between Hadoop MapReduce and Apache Spark on HDFS

Mohamed Saouabi
Lavete Laboratory
University Hassan 1st
Settat, Morocco
[email protected]

Abdellah Ezzati
Lavete Laboratory
University Hassan 1st
Settat, Morocco
[email protected]
ABSTRACT
Data is now growing at very high speed and in very large volumes. Spark and MapReduce both provide a processing model for analyzing and managing this large data -Big Data- stored on HDFS. In this paper, we present a comparative study of Apache Spark and Hadoop MapReduce using the machine learning algorithms k-means and logistic regression. The comparison is carried out through two experiments: the first uses the same programming language, Java, for both frameworks, and the second uses different programming languages.

KEYWORDS
Hadoop MapReduce, Apache Spark, Big Data, HDFS, k-means, Logistic Regression, Machine learning

1 INTRODUCTION
Hadoop is an Apache open source framework that supports the processing of large datasets in a distributed computing environment. It consists of two main parts: HDFS, which allows us to store data in parallel in a cluster, and MapReduce, which processes that data in parallel. HDFS can handle a large amount of data, and it offers easy and fast access to that data, in parallel and in a fault-tolerant manner. It is based on a master-slave architecture:

Figure 1: HDFS architecture

The master nodes supervise the two cores of Hadoop: storing data in parallel in HDFS and processing all that data using MapReduce. The Name Node supervises and coordinates the data storage function in HDFS, while the Job Tracker supervises and coordinates the parallel processing of data using MapReduce.
Slave nodes make up the large majority of machines and do all the work of storing the data and running the computations. The slaves run the Data Node and Task Tracker daemons, which take instructions from the master nodes. The Data Node daemon is a slave to the Name Node, and the Task Tracker daemon is a slave to the Job Tracker.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
IML '17, October 17-18, 2017, Liverpool, United Kingdom
© 2017 Association for Computing Machinery.
ACM ISBN 978-1-4503-5243-7/17/10…$15.00
http://dx.doi.org/10.1145/3109761.3109775

2 RELATED WORKS
Dr. E. Laxmi Lydia, Dr. A. Krishna Mohan, and S. Naga Mallik Raj [7] propose a comparison between Apache Spark and Hadoop MapReduce using the machine learning algorithm k-means. The comparison covers just one algorithm (k-means), which is not entirely sufficient to reach a general result, because every algorithm works differently; trying more than one algorithm may give a better idea of the performance.
Meenakshi Sharma, Vaishali Chauhan, and Keshav Kishore [8] give a comparison between Apache Spark and Hadoop MapReduce without using an algorithm or an experimental use case, just by listing the advantages and disadvantages of each.
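For reference, k-means, one of the two algorithms used throughout these comparisons and in our experiments below, can be sketched in a few lines of plain Java. This is a hypothetical single-machine illustration on one-dimensional data, not the distributed Spark or MapReduce implementations that are benchmarked; all class and method names are our own.

```java
import java.util.*;

// Plain-Java sketch of k-means on 1-D points: assign each point to its
// nearest centroid, move each centroid to the mean of its cluster,
// and repeat until the centroids stop moving.
public class KMeans1D {

    static double[] fit(double[] points, double[] initialCentroids, int maxIters) {
        double[] centroids = initialCentroids.clone();
        for (int iter = 0; iter < maxIters; iter++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (double p : points) {               // assignment step
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) best = c;
                }
                sum[best] += p;
                count[best]++;
            }
            boolean moved = false;                  // update step
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] == 0) continue;        // keep empty clusters in place
                double mean = sum[c] / count[c];
                if (mean != centroids[c]) { centroids[c] = mean; moved = true; }
            }
            if (!moved) break;                      // converged
        }
        return centroids;
    }

    public static void main(String[] args) {
        // Two obvious clusters: points near 1 and points near 9.
        double[] points = {1.0, 1.5, 0.5, 9.0, 9.5, 8.5};
        double[] centroids = fit(points, new double[]{0.0, 10.0}, 100);
        System.out.println(Arrays.toString(centroids)); // [1.0, 9.0]
    }
}
```

The distributed versions differ mainly in where the assignment step runs: Spark keeps the points cached in memory across iterations, while MapReduce re-reads them from HDFS on every pass, which is the cost the experiments below measure.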
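Logistic regression, the other algorithm used in the comparisons, can likewise be sketched in plain Java. This is a hypothetical batch-gradient-descent illustration on a single feature, not the distributed implementations measured in our experiments; the class and method names are our own.

```java
// Plain-Java sketch of logistic regression on one feature, trained by
// batch gradient descent; purely illustrative, single-machine code.
public class LogReg1D {
    double w = 0.0, b = 0.0; // model: p(y=1|x) = sigmoid(w*x + b)

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    double predict(double x) {
        return sigmoid(w * x + b);
    }

    // One epoch = compute the average gradient of the log loss, step downhill.
    void fit(double[] xs, int[] ys, double lr, int epochs) {
        int n = xs.length;
        for (int e = 0; e < epochs; e++) {
            double gw = 0.0, gb = 0.0;
            for (int i = 0; i < n; i++) {
                double err = predict(xs[i]) - ys[i]; // dLoss/dz for the log loss
                gw += err * xs[i];
                gb += err;
            }
            w -= lr * gw / n;
            b -= lr * gb / n;
        }
    }

    public static void main(String[] args) {
        // Tiny separable dataset: negatives near 0, positives near 5.
        double[] xs = {0.0, 1.0, 4.0, 5.0};
        int[] ys   = {0,   0,   1,   1};
        LogReg1D model = new LogReg1D();
        model.fit(xs, ys, 0.5, 2000);
        System.out.println(model.predict(0.5) < 0.5); // negative-like point scores low
        System.out.println(model.predict(4.5) > 0.5); // positive-like point scores high
    }
}
```

Like k-means, this training loop is iterative, which is why the in-memory caching model of Spark pays off so visibly in the timing tables below.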
Jai Prakash Verma and Atul Patel [10] discuss a comparison of the two programming frameworks, Apache Spark and Hadoop MapReduce, using the wordcount program for the experimental analysis. In this sort of comparison, it might be preferable to use heavier algorithms to really test the analysis performance on Big Data; wordcount is probably not the best algorithm to use, and machine learning algorithms may give a clearer picture of the analysis capacity and performance.

3 HADOOP MAPREDUCE
MapReduce is a programming paradigm that allows the processing of distributed data across multiple servers in a cluster. MapReduce has become dominant for batch processing; the principal idea is to divide the input datasets into chunks that are processed in a completely parallel manner.

Spark in MapReduce: Spark in MapReduce is used to launch Spark jobs in addition to the Standalone deployment. In this case, the user can start Spark and use its shell without any administrative access.

5 COMPARING HADOOP MAPREDUCE AND APACHE SPARK
Now, we will compare Hadoop MapReduce and Apache Spark in different fields:

5.1 Performance
One of the foundations of Apache Spark is in-memory processing, which avoids writing to and reading from the hard drive and saves a lot of time. To do so, Apache Spark uses RDDs to keep the data in memory and process it there. It uses a lot of memory, and if the data are too large to fit in memory, this can cause problems for Spark.
On the other hand, MapReduce kills its processes when a job is done, which means the cluster can easily run other services. It is designed for batch processing, and it can process really big data in the cluster with no problems.
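The divide-into-chunks idea behind the MapReduce paradigm can be sketched in plain Java. This is a minimal single-machine illustration of the map, shuffle, and reduce phases, not the Hadoop API; wordcount is used here only because it is short, and all names are our own.

```java
import java.util.*;

// Minimal single-machine sketch of the MapReduce phases.
// Real Hadoop distributes the chunks across task trackers on many nodes
// and spills intermediate results to disk between the phases.
public class MiniMapReduce {

    // Map phase: each input chunk independently emits (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : chunk.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Shuffle phase: group the emitted values by key.
    // Reduce phase: sum each group.
    static Map<String, Integer> run(List<String> chunks) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String chunk : chunks) {               // mappers run in parallel in Hadoop
            for (Map.Entry<String, Integer> kv : map(chunk)) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;    // the reducer for this key
            result.put(e.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("big data big", "spark and big data");
        System.out.println(run(chunks)); // {and=1, big=3, data=2, spark=1}
    }
}
```

The disk traffic between the phases is exactly what Spark's in-memory RDD model eliminates, which motivates the comparison that follows.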
5.5 Fault tolerance
Hadoop MapReduce uses the replication of data across the cluster to ensure fault tolerance. Spark is also fault tolerant: it uses RDDs to re-calculate and re-create the data. But if a process goes down, Spark needs to restart the processing from the beginning, while MapReduce can resume from where it stopped, since it uses the hard drive.

6 EXPERIMENTAL CASE STUDY COMPARING APACHE SPARK AND HADOOP MAPREDUCE
In the first experiment, we compare Hadoop MapReduce and Apache Spark using two machine learning algorithms, k-means and logistic regression, with the same language: Java.
The machine used had the following configuration:
- 4 GB RAM
- Linux
- 64 GB hard drive
- Dataset size: 4 MB

6.1 Results for k-means and logistic regression using the same programming language java with Spark and MapReduce

Table 2: Results for k-means and logistic regression using the same programming language java with Spark and MapReduce

            Logistic regression   K-means
Spark       8 seconds             11 seconds
MapReduce   41 seconds            83 seconds

Figure 3: Results for k-means and logistic regression using the same programming language java with Spark and MapReduce (bar chart of the times in Table 2)

Using the same language, Java, the results show clearly that the performance of Spark is far better than that of MapReduce in terms of processing time.

6.2 Results for k-means and logistic regression using java with MapReduce and Scala with Spark

In this experiment, we use different languages: Scala for Spark and Java for MapReduce, with the same algorithms, k-means and logistic regression.

Table 3: Results for k-means and logistic regression with Spark and MapReduce

            Logistic regression   K-means
Spark       2 seconds             3 seconds
MapReduce   41 seconds            83 seconds

Figure 4: Results for k-means and logistic regression with Spark and MapReduce (bar chart of the times in Table 3)

This experiment shows that Spark is significantly faster than MapReduce, taking the datasets and times into consideration, because MapReduce performs its read and write operations on the hard drive, which makes it slow compared to Spark, which uses memory.

7 CONCLUSIONS
This paper presents a comparison of Apache Spark and Hadoop MapReduce using the analysis algorithms k-means and logistic regression. The outcome of this analysis shows clearly that Spark is faster than MapReduce. Both Spark and MapReduce provide an effective solution to handle Big Data. Spark has excellent performance, offering in-memory processing of data, which makes the work faster than MapReduce, which uses the hard drive to store and process the data.

REFERENCES
[1] Apache Spark Documentation, https://spark.apache.org/docs/0.9.1/hardware-provisioning.html.
[2] Big Data Analytics, http://www.sas.com/en_us/insights/analytics/big-data-analytics.html.
[3] Putting Spark to Use, http://blog.cloudera.com/blog/2013/11/putting-spark-to-use-fast-in-memory-computing-for-your-big-data-applications/.
[4] Amol Bansod: Efficient Big Data Analysis with Apache Spark in HDFS, 2015.
[5] Meenakshi Sharma, Vaishali Chauhan, Keshav Kishore: A Review: MapReduce and Spark for Big Data Analytics, 2016.