A Comparative Between Hadoop MapReduce and Apache Spark on HDFS

Mohamed Saouabi
Lavete Laboratory
University Hassan 1st
Settat, Morocco
[email protected]

Abdellah Ezzati
Lavete Laboratory
University Hassan 1st
Settat, Morocco
[email protected]
ABSTRACT
Data is now growing at very high speed and in very large volumes. Spark and MapReduce both provide a processing model for analyzing and managing this large data -Big Data- stored on HDFS. In this paper, we present a comparative study of Apache Spark and Hadoop MapReduce using the machine learning algorithms k-means and logistic regression. The comparison is carried out through two experiments: the first uses the same programming language, Java, for both frameworks, and the second uses different programming languages.

KEYWORDS
Hadoop MapReduce, Apache Spark, Big Data, HDFS, k-means, Logistic Regression, Machine learning

1 INTRODUCTION
Hadoop is an Apache open source framework that supports the processing of large datasets in a distributed computing environment. It consists of two main parts: HDFS, which allows us to store data in parallel in a cluster, and MapReduce, which processes that data in parallel. HDFS can handle a large amount of data, and it offers easy and fast access to that data, in parallel and in a fault-tolerant manner. It is based on a master-slave architecture:

Figure 1: HDFS architecture

The master nodes supervise the two cores of Hadoop: storing data in parallel in HDFS and processing all that data using MapReduce. The Name Node supervises and coordinates the data storage function in HDFS, while the Job Tracker supervises and coordinates the parallel processing of data using MapReduce.
Slave nodes make up the large majority of machines and do all the work of storing the data and running the computations. The slaves run the Data Node and Task Tracker daemons, which take instructions from the master nodes. The Data Node daemon is a slave to the Name Node, and the Task Tracker daemon is a slave to the Job Tracker.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
IML '17, October 17-18, 2017, Liverpool, United Kingdom
© 2017 Association for Computing Machinery.
ACM ISBN 978-1-4503-5243-7/17/10…$15.00
http://dx.doi.org/10.1145/3109761.3109775

2 RELATED WORKS
Dr. E. Laxmi Lydia, Dr. A. Krishna Mohan, and S. Naga Mallik Raj [7] propose a comparison between Apache Spark and Hadoop MapReduce using the machine learning algorithm k-means. The comparison covers just one algorithm (k-means), which is not entirely sufficient to reach a general result, because every algorithm works differently; trying more than one algorithm may give a better idea of the performance.
Meenakshi Sharma, Vaishali Chauhan, and Keshav Kishore [8] give a comparison between Apache Spark and Hadoop MapReduce without using an algorithm or an experimental use case, just by listing the advantages and disadvantages of each.
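For reference, k-means, one of the two algorithms used throughout these comparisons and in our experiments below, can be sketched in a few lines of plain Java. This is a hypothetical single-machine illustration on one-dimensional data, not the distributed Spark or MapReduce implementations that are benchmarked; all class and method names are our own.

```java
import java.util.*;

// Plain-Java sketch of k-means on 1-D points: assign each point to its
// nearest centroid, move each centroid to the mean of its cluster,
// and repeat until the centroids stop moving.
public class KMeans1D {

    static double[] fit(double[] points, double[] initialCentroids, int maxIters) {
        double[] centroids = initialCentroids.clone();
        for (int iter = 0; iter < maxIters; iter++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (double p : points) {               // assignment step
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) best = c;
                }
                sum[best] += p;
                count[best]++;
            }
            boolean moved = false;                  // update step
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] == 0) continue;        // keep empty clusters in place
                double mean = sum[c] / count[c];
                if (mean != centroids[c]) { centroids[c] = mean; moved = true; }
            }
            if (!moved) break;                      // converged
        }
        return centroids;
    }

    public static void main(String[] args) {
        // Two obvious clusters: points near 1 and points near 9.
        double[] points = {1.0, 1.5, 0.5, 9.0, 9.5, 8.5};
        double[] centroids = fit(points, new double[]{0.0, 10.0}, 100);
        System.out.println(Arrays.toString(centroids)); // [1.0, 9.0]
    }
}
```

The distributed versions differ mainly in where the assignment step runs: Spark keeps the points cached in memory across iterations, while MapReduce re-reads them from HDFS on every pass, which is the cost the experiments below measure.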
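Logistic regression, the other algorithm used in the comparisons, can likewise be sketched in plain Java. This is a hypothetical batch-gradient-descent illustration on a single feature, not the distributed implementations measured in our experiments; the class and method names are our own.

```java
// Plain-Java sketch of logistic regression on one feature, trained by
// batch gradient descent; purely illustrative, single-machine code.
public class LogReg1D {
    double w = 0.0, b = 0.0; // model: p(y=1|x) = sigmoid(w*x + b)

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    double predict(double x) {
        return sigmoid(w * x + b);
    }

    // One epoch = compute the average gradient of the log loss, step downhill.
    void fit(double[] xs, int[] ys, double lr, int epochs) {
        int n = xs.length;
        for (int e = 0; e < epochs; e++) {
            double gw = 0.0, gb = 0.0;
            for (int i = 0; i < n; i++) {
                double err = predict(xs[i]) - ys[i]; // dLoss/dz for the log loss
                gw += err * xs[i];
                gb += err;
            }
            w -= lr * gw / n;
            b -= lr * gb / n;
        }
    }

    public static void main(String[] args) {
        // Tiny separable dataset: negatives near 0, positives near 5.
        double[] xs = {0.0, 1.0, 4.0, 5.0};
        int[] ys   = {0,   0,   1,   1};
        LogReg1D model = new LogReg1D();
        model.fit(xs, ys, 0.5, 2000);
        System.out.println(model.predict(0.5) < 0.5); // negative-like point scores low
        System.out.println(model.predict(4.5) > 0.5); // positive-like point scores high
    }
}
```

Like k-means, this training loop is iterative, which is why the in-memory caching model of Spark pays off so visibly in the timing tables below.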
Jai Prakash Verma and Atul Patel [10] discuss a comparison of the two programming frameworks, Apache Spark and Hadoop MapReduce, using the wordcount program for the experimental analysis. In this sort of comparison, it might be preferable to use heavier algorithms to really test the analysis performance on Big Data; wordcount is probably not the best algorithm to use, and machine learning algorithms may give a clearer picture of the analysis capacity and performance.

3 HADOOP MAPREDUCE
MapReduce is a programming paradigm that allows the processing of distributed data across multiple servers in a cluster. MapReduce has become dominant for batch processing; the principal idea is to divide the input datasets into chunks that are processed in a completely parallel manner.

Spark in MapReduce: Spark in MapReduce is used to launch Spark jobs in addition to the Standalone deployment. In this case, the user can start Spark and use its shell without any administrative access.

5 COMPARING HADOOP MAPREDUCE AND APACHE SPARK
Now, we will compare Hadoop MapReduce and Apache Spark in different fields:

5.1 Performance
One of the foundations of Apache Spark is in-memory processing, which avoids writing to and reading from the hard drive and saves a lot of time. To do so, Apache Spark uses RDDs to keep the data in memory and process it there. It uses a lot of memory, and if the data are too large to fit in memory, this can cause problems for Spark.
On the other hand, MapReduce kills its processes when a job is done, which means the cluster can easily run other services. It is designed for batch processing, and it can process really big data in the cluster with no problems.
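The divide-into-chunks idea behind the MapReduce paradigm can be sketched in plain Java. This is a minimal single-machine illustration of the map, shuffle, and reduce phases, not the Hadoop API; wordcount is used here only because it is short, and all names are our own.

```java
import java.util.*;

// Minimal single-machine sketch of the MapReduce phases.
// Real Hadoop distributes the chunks across task trackers on many nodes
// and spills intermediate results to disk between the phases.
public class MiniMapReduce {

    // Map phase: each input chunk independently emits (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : chunk.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Shuffle phase: group the emitted values by key.
    // Reduce phase: sum each group.
    static Map<String, Integer> run(List<String> chunks) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String chunk : chunks) {               // mappers run in parallel in Hadoop
            for (Map.Entry<String, Integer> kv : map(chunk)) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;    // the reducer for this key
            result.put(e.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("big data big", "spark and big data");
        System.out.println(run(chunks)); // {and=1, big=3, data=2, spark=1}
    }
}
```

The disk traffic between the phases is exactly what Spark's in-memory RDD model eliminates, which motivates the comparison that follows.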
5.5 Fault tolerance
Hadoop MapReduce uses the replication of data across the cluster to ensure fault tolerance. Spark is also fault tolerant: it uses RDDs to re-calculate and re-create the data. But if a process goes down, Spark needs to restart the processing from the beginning, while MapReduce can resume from where it stopped, since it uses the hard drive.

6 EXPERIMENTAL CASE STUDY COMPARING APACHE SPARK AND HADOOP MAPREDUCE
In the first experiment, we compare Hadoop MapReduce and Apache Spark using two machine learning algorithms, k-means and logistic regression, with the same language: Java.
The machine used had the following configuration:
- 4 GB RAM
- Linux
- 64 GB hard drive
- Dataset size: 4 MB

6.1 Results for k-means and logistic regression using the same programming language java with Spark and MapReduce

Table 2: Results for k-means and logistic regression using the same programming language java with Spark and MapReduce

            Logistic regression   K-means
Spark       8 seconds             11 seconds
MapReduce   41 seconds            83 seconds

Figure 3: Results for k-means and logistic regression using the same programming language java with Spark and MapReduce (bar chart of the times in Table 2)

Using the same language, Java, the results show clearly that the performance of Spark is far better than that of MapReduce in terms of processing time.

6.2 Results for k-means and logistic regression using java with MapReduce and Scala with Spark

In this experiment, we use different languages: Scala for Spark and Java for MapReduce, with the same algorithms, k-means and logistic regression.

Table 3: Results for k-means and logistic regression with Spark and MapReduce

            Logistic regression   K-means
Spark       2 seconds             3 seconds
MapReduce   41 seconds            83 seconds

Figure 4: Results for k-means and logistic regression with Spark and MapReduce (bar chart of the times in Table 3)

This experiment shows that Spark is significantly faster than MapReduce, taking the datasets and times into consideration, because MapReduce performs its read and write operations on the hard drive, which makes it slow compared to Spark, which uses memory.

7 CONCLUSIONS
This paper presents a comparison of Apache Spark and Hadoop MapReduce using the analysis algorithms k-means and logistic regression. The outcome of this analysis shows clearly that Spark is faster than MapReduce. Both Spark and MapReduce provide an effective solution to handle Big Data. Spark has excellent performance, offering in-memory processing of data, which makes the work faster than MapReduce, which uses the hard drive to store and process the data.

REFERENCES
[1] Apache Spark Documentation, https://spark.apache.org/docs/0.9.1/hardware-provisioning.html.
[2] Big Data Analytics, http://www.sas.com/en_us/insights/analytics/big-data-analytics.html.
[3] Putting Spark to Use, http://blog.cloudera.com/blog/2013/11/putting-spark-to-use-fast-in-memory-computing-for-your-big-data-applications/.
[4] Amol Bansod: Efficient Big Data Analysis with Apache Spark in HDFS, 2015.
[5] Meenakshi Sharma, Vaishali Chauhan, Keshav Kishore: A Review: MapReduce and Spark for Big Data Analytics, 2016.