
6th IEEE International Conference on Signal Processing, Computing and Control (ISPCC 2k21), Oct 07-09, 2021, JUIT, Solan, India

A Critical Analysis of Apache Hadoop and Spark for Big Data Processing

978-1-6654-2554-4/21/$31.00 ©2021 IEEE | DOI: 10.1109/ISPCC53510.2021.9609518

Piyush Sewal
Computer Science & Engineering Department
Jaypee University of Information Technology
Solan, Himachal Pradesh, India
[email protected]

Hari Singh
Computer Science & Engineering Department
Jaypee University of Information Technology
Solan, Himachal Pradesh, India
[email protected]

Abstract— The emergence of big data processing platforms that can work globally in an integrated manner and process huge datasets efficiently has become very significant. This paper presents a critical analysis of two big data processing platforms, Apache Hadoop MapReduce and Apache Spark. Hadoop MapReduce was earlier one of the most popular platforms for batch processing of huge datasets, but with the variation in the nature of data from static to dynamic, Apache Spark proves to be better for iterative jobs and live data streams. This paper aims to critically compare and analyze Hadoop-1.x, 2.x and 3.x and Spark-1.x, 2.x and 3.x on well-known key parameters such as components, storage system, resource management, fault tolerance, data processing, scalability and performance.

Keywords: Big data, Hadoop MapReduce, Batch Processing, Stream processing, Spark.

I. INTRODUCTION

Data is important because results are derived from its processing, and decisions are made based on those results. Although data itself is the same as before, the word "big" has been attached to it and it has now become "big data" [1]. The term big data can be explained using the concept of HACE [2], which means that this type of data is Huge, Autonomous, Complex and Evolving. Different sources such as social media platforms, the internet, mobile and multimedia devices, and sensors are generating huge amounts of data. The point is not only to store and retrieve this data but also to analyze it and use it for decision making and planning strategies. Along with this, big data also faces the challenges of the five V's [3].

From the processing point of view, data is classified into two major categories: batch data and streaming data. Batch processing involves processing data collected over a time interval, whereas streaming data is collected, processed and analyzed on a real-time basis. Although Hadoop is very suitable for batch-processing jobs, interactive jobs, call log streams, click streams, message streams, view streams and real-time queries are presently in greater demand. Apache Hadoop has proven very effective for batch processing [4]–[7], while Apache Spark is one of the most effective frameworks for handling iterative, interactive and streaming data [8].

There are various platforms available for data handling and processing, such as Apache Hadoop, Apache Spark, Apache Storm, Apache Cassandra, Flink, MongoDB, Kafka, Tableau, RapidMiner and R programming. This paper discusses the Apache Hadoop and Apache Spark frameworks for their critical evaluation and comparison. The key parameters of discussion are their architecture, components, storage, resource management, file system, fault tolerance and scalability.

The distribution of the paper is as follows: Section II gives the background and insights of the Apache Hadoop architecture, the MapReduce framework and YARN, followed by a comparison among Hadoop-1.x, Hadoop-2.x and Hadoop-3.x. Section III gives the background of Apache Spark and its high-level architectural details, the concepts of in-memory processing, RDDs, DAG, DataFrames and Datasets, and a comparison among Spark-1.x, 2.x and 3.x. Section IV covers the related work and Section V concludes the paper.

II. APACHE HADOOP

This section describes the background and framework of Apache Hadoop and critically compares different versions of Apache Hadoop.

A. Background

Apache Hadoop is an open-source cloud computing platform of the Apache Software Foundation and is considered one of the most popular frameworks for handling big data in a distributed environment [1], [9]–[11]. It is characterized by fault tolerance, scalability, parallel processing and distributed computation. Hadoop has the capability to store and process massive amounts of structured and unstructured data.

It uses the Hadoop Distributed File System (HDFS) [12] for storage handling and the MapReduce framework for data processing [13], [14]. It is a cost-effective platform that can be realized through commodity hardware.

Hadoop also offers data locality optimization: there is no need to transfer the whole dataset to a central processing system, which saves a huge amount of bandwidth and time. Hadoop creates clusters of different machines and coordinates the work among them. If any machine fails during execution, Hadoop continues processing without any loss of data or interruption by shifting the processing task to another machine in the cluster. For managing storage on clusters, HDFS breaks incoming files into pieces known as "blocks", stores them across a pool of servers, and keeps three complete copies of each block on three different servers [12], [15].
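As a concrete illustration of these storage concepts, the following is a minimal sketch, assuming a reachable HDFS (configured via core-site.xml/hdfs-site.xml on the classpath) and the hadoop-client dependency; the file path is hypothetical, not from the paper:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsReplicationSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()       // picks up core-site.xml / hdfs-site.xml
    val fs = FileSystem.get(conf)

    val file = new Path("/user/demo/input.txt")  // hypothetical path
    val status = fs.getFileStatus(file)
    println(s"block size: ${status.getBlockSize} bytes, " +
            s"replication: ${status.getReplication}")

    // Request the default 3-way replication described above.
    fs.setReplication(file, 3.toShort)
    fs.close()
  }
}
```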

B. Hadoop MapReduce Framework

MapReduce is the basic processing pillar of the Hadoop ecosystem [16]. It is based on a master-slave architecture, shown in Figure I. As the name suggests, it basically performs two operations: map and reduce. First, the map function takes key-value pairs as its input and generates intermediate key-value pairs.

Fig. I: Hadoop's Master-Slave Architecture

After that, the reduce function merges all the intermediate values associated with the same intermediate key. Different kinds of operations are written as map and reduce jobs using languages and tools such as Java, Hive and Pig, and the output of these jobs can be written back to HDFS [13], [17]. However, the MapReduce architecture introduced in Hadoop-1.x has certain limitations with respect to the shuffling phase and the task-scheduling system. Between the map and reduce functions, intermediate data is shuffled from map tasks to reduce tasks, as shown in Figure II. This data shuffling causes a large number of disk accesses and consumes a lot of I/O bandwidth.

Fig. II: The Map-Reduce Framework

These issues have been addressed using virtual shuffling and adaptive pre-shuffling [18]–[20]. Hadoop-1.x is also unable to utilize resources efficiently [21]; this issue is addressed in Hadoop-2.x by introducing a separate Resource Manager and YARN.
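To make the two roles concrete, the following minimal sketch expresses word count in plain Scala; local collections stand in for the distributed shuffle, so this illustrates the programming model rather than Hadoop's actual Java API:

```scala
object WordCountSketch {
  // Map: emit an intermediate (word, 1) pair for every word in a line.
  def map(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\W+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  // Reduce: merge all intermediate values that share one intermediate key.
  def reduce(word: String, counts: Seq[Int]): (String, Int) = (word, counts.sum)

  def main(args: Array[String]): Unit = {
    val lines = Seq("big data needs big platforms",
                    "spark and hadoop process big data")
    val intermediate = lines.flatMap(map)        // map phase
    val shuffled = intermediate.groupBy(_._1)    // shuffle: group by key
    val results = shuffled.map { case (w, pairs) => reduce(w, pairs.map(_._2)) }
    results.toSeq.sortBy(p => -p._2).foreach(println)
  }
}
```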

C. Comparison among Hadoop-1.x, Hadoop-2.x and Hadoop-3.x

The Apache Software Foundation released Apache Hadoop Version 1.x (Hadoop-1.x) in December 2011, Hadoop Version 2.x (Hadoop-2.x) in August 2013 and Hadoop Version 3.x (Hadoop-3.x) in December 2017. Each successive version addresses the limitations of the previous version and includes additional functionality to improve performance. A comparison of Hadoop-1.x, Hadoop-2.x and Hadoop-3.x is presented in Table I.

The three versions of Hadoop are analyzed on the parameters daemons, components, storage, resource management and data processing, fault tolerance, load balancing, scalability, support for Java and Windows, file system and data analytics. On comparing Hadoop-1.x, Hadoop-2.x and Hadoop-3.x, it can be concluded that each version of Hadoop improves flexibility, scalability, resource utilization and performance with respect to big data analytics.

Table I: Comparison among Hadoop-1.x, Hadoop-2.x and Hadoop-3.x

| Parameter | Hadoop-1.x | Hadoop-2.x | Hadoop-3.x |
|---|---|---|---|
| Daemons [9] | NameNode, DataNode, Secondary NameNode, JobTracker, TaskTracker | NameNode, DataNode, Secondary NameNode, Resource Manager, Node Manager | NameNode, DataNode, Secondary NameNode, Resource Manager, Node Manager, WebAppProxy |
| Components [9], [13] | HDFS, MapReduce | HDFS, YARN/MRv2, YARN Timeline service v.1 | HDFS, YARN, YARN Timeline service v.2 |
| Storage [9], [22], [23] | HDFS storage management | HDFS with 3x replication scheme | HDFS with erasure coding |
| Resource management and data processing [9], [24], [25] | MapReduce: resource management + data processing | YARN v.1: resource management; MapReduce and other types of jobs: data processing | YARN v.2: resource management; MapReduce and other types of jobs: data processing |
| Fault tolerance [9], [22], [23], [26]–[28] | Single point of failure | Handled using replication | Handled using erasure coding |
| Load balancing [29]–[31] | HDFS Balancer | HDFS Balancer | HDFS Disk Balancer CLI |
| Resource utilization [9], [21], [24] | Low | Comparatively high | Comparatively high |
| Scalability [24] | Up to 4,000 nodes per cluster | Up to 10,000 nodes per cluster | More than 10,000 nodes per cluster |
| Minimum supported Java version [9] | Java 6 | Java 7 | Java 8 |
| Implementation [9], [21], [24] | Concept of slots | Concept of containers | Concept of containers |
| Support for Windows [9] | Not supported | Supported | Supported |
| Compatible file systems [9], [32] | HDFS, FTP file system, Amazon S3 | HDFS, FTP file system, Amazon S3, Windows Azure Storage Blobs (WASB) | HDFS, FTP file system, Amazon S3, Microsoft Azure Data Lake file system |
| Data analytics [11], [24], [33] | No platform for event processing, streaming and real-time operations | Platform available for event processing, streaming and real-time operations | Platform available for event processing, streaming and real-time operations |
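The storage and fault-tolerance rows of Table I hide simple overhead arithmetic worth making explicit. A minimal sketch, assuming Hadoop 3's commonly used RS(6,3) Reed-Solomon erasure-coding policy (an assumption, not stated in the paper), compares the extra storage each scheme needs:

```scala
object StorageOverhead {
  // 3-way replication stores two extra copies: (replicas - 1) overhead.
  def replicationOverhead(replicas: Int): Double = replicas - 1.0
  // RS(k, m) erasure coding stores m parity blocks per k data blocks.
  def ecOverhead(dataBlocks: Int, parityBlocks: Int): Double =
    parityBlocks.toDouble / dataBlocks

  def main(args: Array[String]): Unit = {
    println(f"3x replication overhead: ${replicationOverhead(3) * 100}%.0f%%") // 200%
    println(f"RS(6,3) erasure coding overhead: ${ecOverhead(6, 3) * 100}%.0f%%") // 50%
  }
}
```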

III. APACHE SPARK

This section describes the background and framework of Apache Spark and critically compares different versions of Apache Spark.

A. Background

Apache Spark is also an open-source data processing and cluster computing platform; it can run on top of Hadoop and use the functionality of HDFS. It can process large datasets in parallel by distributing them across multiple nodes [34]. It supports SQL queries and machine learning applications for the execution of workloads. Spark has a number of libraries, and a key strength is that they can all be combined in the same application using a single processing engine. Besides its standalone cluster mode, Spark can be run on different cluster managers.

Spark extends the existing MapReduce framework with additional computational capabilities for interactive queries and stream processing. The Spark engine supports different types of workloads, such as batch processing, interactive queries, iterative algorithms and streaming, within one framework. Although Spark was originally written in Scala, it supports a wide range of APIs including R, SQL, Python and Java [35].

B. Apache Spark Architecture

The Spark framework contains a tight integration of components. The Spark architecture can be divided into four segments: Spark Core, upper-level libraries, cluster managers and storage [36]. Spark Core is the main component of Spark; it acts as the "computational engine" and performs tasks such as task scheduling, fault recovery, memory management and interaction with storage systems. Spark has a number of libraries, which include Spark SQL, Spark MLlib, GraphX and Spark Streaming. The Spark SQL package is used to work with structured data; it allows queries over data using SQL or Hive Query Language (HQL) from different data sources such as Hive tables, JSON and Parquet. Spark MLlib is the machine learning library, which provides various machine learning algorithms such as regression, classification, filtering and clustering. The GraphX library is used for graph manipulation; it allows the user to create, manipulate and compute over graphs and to use common graph algorithms such as PageRank. The Spark Streaming package is used to process live streams of data generated from different sources such as social media, search engines and log files from web servers. It can receive live data streams from many different sources, such as Kafka, Flume and Kinesis.
To achieve the flexibility of scaling from a standalone node to thousands of computing nodes, Spark can use different cluster managers, including Hadoop YARN, Apache Mesos, Amazon EC2 and Spark's own Standalone cluster manager. The fourth segment of Spark is the storage segment, which is used to create distributed datasets. Different storage systems used by Spark include Hadoop's own HDFS, Hive, HBase, Cassandra and Amazon S3 [37].

Two key features of Spark are in-memory processing of large datasets and the concept of Resilient Distributed Datasets (RDDs). In-memory processing means that data processing takes place in main memory instead of secondary storage such as disk. RDDs enable programmers to perform in-memory processing on large clusters with faster execution and a fault-tolerant mechanism. RDDs are best suited for computing-framework applications that handle interactive data-mining tools and iterative algorithms, because keeping the working data in main memory increases performance compared to the disk-based processing used by Hadoop. RDDs provide shared memory in a restricted form, which makes them fault tolerant. They are called resilient because they are immutable and cannot be modified, although a new RDD can be created at any time from existing RDDs or from external data sources. RDDs support two types of operations: transformations and actions. Transformations include operations such as map, filter and join; they are known as lazy operations because they only define a new RDD. Actions perform the computation and return a value to the program. Each RDD is represented through a common interface by five pieces of information: partitions, preferred locations, dependencies, an iterator and metadata regarding its partitioning scheme. Spark exposes RDDs through language-integrated APIs, where an object represents the dataset and transformations are applied to these objects. One or more RDDs can be defined from existing RDDs using transformations, and these RDDs can then be used in actions. Actions are operations that return output to the application or export data to a storage system; for example, count returns the number of dataset elements, collect returns the elements themselves, and save writes the output to a storage system. RDDs have various advantages over DSM (Distributed Shared Memory) in terms of performance and faster execution [38].

In-memory processing is one of the key features of Spark that distinguishes it from the disk-based processing of Hadoop's MapReduce model and makes it highly efficient for iterative data-processing tasks. Spark stores intermediate results in main memory instead of on disk and then performs transformations and actions over them. Spark uses a DAG (directed acyclic graph) model, a logical flow of operations built up during transformation operations. When an action is invoked, the DAG is submitted to the DAG scheduler and jobs are performed as per the sequence of the DAG. A DAG is basically an arrangement of vertices and edges, where vertices represent RDDs and edges represent the operations executed over them. The DAG scheduler is the part of Spark's scheduling layer that maintains jobs and stages; when an action is executed, Spark calls the DAG scheduler to execute the submitted task.

Although the RDD API is very useful, it lacks automatic optimization because it has no information about the underlying data structure or user functions. Data is stored by RDDs as collections of Java objects, runtime errors are hard to debug, and there are performance and scalability issues. To overcome this, DataFrames were introduced in Spark-1.3. DataFrames are distributed data collections organized in the form of rows and columns. Spark DataFrames can be created from various sources such as log tables, Hive tables, existing RDDs and external databases, and they can be integrated with other big data tools for processing huge datasets at once [36]. Spark introduced the Dataset as an extension of the DataFrame API in Spark-1.6 to provide an object-oriented programming interface. A Dataset is a type-safe and immutable collection of objects mapped using relational schemas. Datasets benefit from Tungsten's fast in-memory encoding and Spark's Catalyst optimizer to expose data fields and expressions to the query planner. Hence Spark applications can be written very efficiently, with almost 2x processing speed and 75% less memory usage compared to RDDs [39].
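The following sketch contrasts the lazy RDD transformations and actions described above with the typed Dataset API; the Person schema and the sample values are assumptions for illustration, not from the paper:

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)  // illustrative schema

object RddVsDatasetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-vs-dataset").master("local[*]").getOrCreate()
    import spark.implicits._

    // RDD: transformations (filter, map) only define new RDDs lazily;
    // the action (count) triggers the actual job.
    val rdd = spark.sparkContext.parallelize(Seq(("ann", 34), ("bob", 28), ("cat", 41)))
    val adults = rdd.filter(_._2 > 30).map(_._1)  // transformations: no work yet
    println(s"adults: ${adults.count()}")         // action: job is executed

    // Dataset: typed objects with a relational schema, benefiting from the
    // Catalyst optimizer and Tungsten encoding mentioned above.
    val people = Seq(Person("ann", 34), Person("bob", 28)).toDS()
    people.filter(_.age > 30).show()

    spark.stop()
  }
}
```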
C. Comparison among Spark-1.x, Spark-2.x and Spark-3.x

At a time when Hadoop MapReduce was one of the most dominant platforms for processing big data on clusters of more than a thousand nodes, the Apache Spark project was started in 2009. The MapReduce engine was inefficient for building large applications for streaming data and machine learning algorithms, where 10 to 20 passes over the data may be required and each pass has to be written as a separate MapReduce job, consuming a great deal of processing time. To address these issues, the Spark team first built an API for functional programming suited to batch processing. To handle interactive queries, the Shark engine, which could run SQL queries over Spark, was launched in 2011. Focus then shifted to developing the Spark libraries following the "standard library" approach. After the initial releases based on "functional operations", Spark was released with a set of four integrated libraries and large API support. The Apache Software Foundation officially released Spark 1.0.0 on May 30, 2014, Spark 2.0.0 on July 26, 2016 and Spark 3.0.0 on June 18, 2020. Each successive version of Spark brings performance improvements and overcomes the shortcomings of previous versions. The three versions of Spark are analyzed on the parameters daemons, components, API support, Spark SQL, MLlib, GraphX, streaming, performance and scalability, data sources, cluster managers, support for TPC-DS queries, and support for Windows and UNIX, as represented in Table II.
Table II: Comparison among Spark-1.x, Spark-2.x and Spark-3.x

| Parameter | Spark-1.x | Spark-2.x | Spark-3.x |
|---|---|---|---|
| Daemons [35] | Master daemon, worker daemon | Master daemon, worker daemon | Master daemon, worker daemon |
| Components [36] | Spark Core, APIs, libraries, storage system, cluster manager | Spark Core, APIs, libraries, storage, cluster manager | Spark Core, APIs, libraries, storage, cluster manager |
| API support [35], [39] | Scala, Python, Java, SQL and R | Scala, Python, Java, SQL and R | Scala, Python, Java, SQL and R |
| Spark SQL [36], [40] | Support for loading and manipulating structured data in Spark; RDDs | Improved SQL functionality with SQL 2003 support; DataFrames API | Adaptive query execution; dynamic partition pruning; Dataset API |
| MLlib [41] | Support for sparse feature vectors in Scala, Java and Python | DataFrame-based primary API | Performance improvements and support for deep learning |
| GraphX [42] | Performance improvements in loading of graphs, reversal of edges and neighborhood computation | Performance improvements | SparkGraph with Cypher query language |
| Streaming [43], [44] | Performance optimizations for stream transformation | High-level streaming API on top of Spark SQL and the Catalyst optimizer | New Spark UI |
| Performance and scalability [35], [39] | Good | Comparatively better than Spark-1.x | Comparatively better than Spark-1.x and 2.x |
| Data sources [35], [39] | HDFS, Cassandra, HBase, Alluxio (Tachyon), MongoDB | HDFS, Cassandra, HBase, Alluxio (Tachyon), MongoDB, Kafka, Elasticsearch | HDFS, Cassandra, HBase, Alluxio (Tachyon), MongoDB, Kafka, Elasticsearch |
| Cluster managers [35], [39] | Spark Standalone, Hadoop YARN, Apache Mesos, Amazon EC2 | Spark Standalone, Hadoop YARN, Apache Mesos, Amazon EC2, Kubernetes | Spark Standalone, Hadoop YARN, Apache Mesos, Amazon EC2, Kubernetes |
| Support for TPC-DS queries [39] | 55 out of 99 TPC-DS queries (Spark 1.6) | All 99 TPC-DS queries | 2x to 17x performance improvement over Spark 2.4 for all 99 TPC-DS queries |
| Support for Windows and UNIX [35] | Supported | Supported | Supported |
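For the streaming rows of Table II, a minimal Structured Streaming sketch illustrates the high-level streaming API built on Spark SQL, assuming a local socket source fed by something like `nc -lk 9999` (the host, port and word-count task are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("streaming-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Read a live text stream from a local socket source.
    val lines = spark.readStream.format("socket")
      .option("host", "localhost").option("port", "9999").load()

    // Running word count over the stream, expressed with the SQL engine.
    val counts = lines.as[String].flatMap(_.split("\\s+")).groupBy("value").count()

    val query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()
  }
}
```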

IV. RELATED WORK

Various researchers have covered different big data processing tools and highlighted their important features and key components. The nature of big data, its important components and the different tools and technologies widely used for big data processing and analysis are covered in [2]. Similar work has been carried out in [1], along with the challenges and open research issues in big data analytics. Owing to the high popularity and efficiency of Hadoop MapReduce and Apache Spark, various researchers have performed feature-wise in-depth analyses of the performance and efficiency of each platform, as well as comparative analyses of both. The development of Hadoop is covered in a systematic manner, along with the component-wise contributors of the Hadoop ecosystem and related studies and publications, in [10]. Hadoop MapReduce, with its programming model and implementations, is covered in [13]. Other researchers have covered the Hadoop ecosystem with YARN, Mahout and machine learning in [11], [25], [45], [46]. Similarly, researchers have covered Apache Spark with its core concepts and libraries [36], [38], [40], [42], [47], [48].

V. CONCLUSIONS

The objective of this paper is to critically analyze various aspects of the two most popular big data processing platforms, Hadoop MapReduce and Spark, so as to select the appropriate platform for big data processing and analytics. After providing a brief background and the architecture of each platform, this paper provides an overview of the different versions of Apache Hadoop and Apache Spark on parameters widely discussed in the literature, to understand the strengths and weaknesses of each version.

After an exhaustive literature survey, it can be concluded that the introduction of YARN Timeline Service v.2, erasure coding in HDFS, the Disk Balancer CLI and compatibility with the Microsoft Azure Data Lake file system make Hadoop-3.x a very effective data processing tool in terms of processing performance and memory utilization. Similarly, features such as the use of Datasets in place of RDDs, adaptive query execution, dynamic partition pruning and improvements in the pandas API make Spark 3.x a versatile, flexible, fast and memory-efficient data processing and analytical tool, especially for streaming and unstructured data. Hence, it can be concluded that the Hadoop MapReduce platform is very effective for batch-processing tasks, whereas Apache Spark is one of the most effective platforms for the processing and analysis of streaming data.

REFERENCES

[1] D. P. Acharjya and K. Ahmed P, "A Survey on Big Data Analytics: Challenges, Open Research Issues and Tools," Int. J. Comput. Sci. Eng., vol. 6, no. 6, pp. 1238–1244, 2018.
[2] T. R. Rao, P. Mitra, R. Bhatt, and A. Goswami, "The big data system, components, tools, and technologies: a survey," Knowl. Inf. Syst., vol. 60, no. 3, pp. 1165–1245, Sep. 2019.
[3] Ishwarappa and J. Anuradha, "A brief introduction on big data 5Vs characteristics and hadoop technology," Procedia Comput. Sci., vol. 48, pp. 319–324, 2015.
[4] H. Singh and S. Bawa, "A MapReduce-based scalable discovery and indexing of structured big data," Futur. Gener. Comput. Syst., vol. 73, pp. 32–43, 2017.
[5] M. Mittal, H. Singh, K. K. Paliwal, and L. M. Goyal, "Efficient random data accessing in MapReduce," in 2017 International Conference on Infocom Technologies and Unmanned Systems (Trends and Future Directions) (ICTUS), 2017, pp. 552–556.
[6] H. Singh and S. Bawa, "Scalability and Fault Tolerance of MapReduce for Spatial data," Glob. J. Eng. Sci. Res. Manag., vol. 3, no. 8, pp. 97–103, 2016.
[7] H. Singh and S. Bawa, "Spatial data analysis with ArcGIS and MapReduce," in Proc. IEEE International Conference on Computing, Communication and Automation (ICCCA 2016), 2017, pp. 45–49.
[8] S. Shahrivari, "Beyond Batch Processing: Towards Real-Time and Streaming Big Data," Computers, vol. 3, no. 4, pp. 117–129, Oct. 2014.
[9] "Apache Hadoop." [Online]. Available: https://hadoop.apache.org/. [Accessed: 23-Dec-2020].
[10] I. Polato, R. Ré, A. Goldman, and F. Kon, "A comprehensive view of Hadoop research—A systematic literature review," J. Netw. Comput. Appl., vol. 46, pp. 1–25, Nov. 2014.
[11] S. Landset, T. M. Khoshgoftaar, A. N. Richter, and T. Hasanin, "A survey of open source tools for machine learning with big data in the Hadoop ecosystem," J. Big Data, vol. 2, no. 1, pp. 1–36, 2015.
[12] D. Borthakur, "HDFS Architecture Guide," The Apache Software Foundation, 2008, pp. 1–14.
[13] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008.
[14] S. Sakr, A. Liu, and A. G. Fayoumi, "The family of mapreduce and large-scale data processing systems," ACM Comput. Surv., vol. 46, no. 1, pp. 1–44, 2013.
[15] H. S. Bhosale and D. P. Gadekar, "A Review Paper on Big Data and Hadoop," Int. J. Sci. Res. Publ., vol. 4, no. 1, pp. 2250–3153, 2014.
[16] C. S. Rani and B. Rama, "MapReduce with Hadoop for Simplified Analysis of Big Data," Int. J. Adv. Res. Comput. Sci., vol. 8, no. 5, pp. 2015–2018, 2017.
[17] K. Shim, "Databases in Networked Information Systems," Proc. VLDB Endow., vol. 5, no. 12, pp. 2016–2017, 2010.
[18] J. Xie, Y. Tian, S. Yin, J. Zhang, X. Ruan, and X. Qin, "Adaptive preshuffling in Hadoop Clusters," Procedia Comput. Sci., vol. 18, pp. 2458–2467, 2013.
[19] W. Yu, Y. Wang, X. Que, and C. Xu, "Virtual Shuffling for Efficient Data Movement in MapReduce," IEEE Trans. Comput., vol. 64, no. 2, pp. 556–568, 2015.
[20] Q. Zhang, M. F. Zhani, Y. Yang, R. Boutaba, and B. Wong, "PRISM: Fine-grained resource-aware scheduling for MapReduce," IEEE Trans. Cloud Comput., vol. 3, no. 2, pp. 182–194, 2015.
[21] J. F. Weets, M. K. Kakhani, and A. Kumar, "Limitations and challenges of HDFS and MapReduce," in Proc. 2015 Int. Conf. Green Comput. Internet Things (ICGCIoT 2015), vol. 2, pp. 545–549, 2016.
[22] R. J. Chansler, "Data Availability and Durability with the Hadoop Distributed File System," ;login:, vol. 37, no. 1, pp. 16–22, 2012.
[23] Y. Liu and W. Wei, "A Replication-Based Mechanism for Fault Tolerance in MapReduce Framework," Math. Probl. Eng., vol. 2015, pp. 1–7, 2015.
[24] V. K. Vavilapalli et al., "Apache Hadoop YARN," in Proceedings of the 4th Annual Symposium on Cloud Computing, 2013, pp. 1–16.
[25] A. P. Kulkami and M. Khandewal, "Survey on Hadoop and Introduction to YARN," Int. J. Emerg. Technol. Adv. Eng., vol. 4, no. 5, pp. 82–87, 2014.
[26] M. M. Shetty and D. H. Manjaiah, "Data security in Hadoop distributed file system," in Proc. IEEE Int. Conf. Emerg. Technol. Trends Comput. Commun. Electr. Eng. (ICETT 2016), pp. 939–944, 2017.
[27] A. C. Ko and W. T. Zaw, "Fault tolerant erasure coded replication for HDFS based cloud storage," in Proc. 4th IEEE Int. Conf. Big Data Cloud Comput., pp. 104–109, 2014.
[28] A. Chiniah and A. Mungur, "Dynamic Erasure Coding Policy Allocation (DECPA) in Hadoop 3.0," in Proc. 6th IEEE Int. Conf. Cyber Secur. Cloud Comput. (CSCloud) / 5th IEEE Int. Conf. Edge Comput. Scalable Cloud (EdgeCom), 2019, pp. 29–33.
[29] L. Kolb, A. Thor, and E. Rahm, "Load Balancing for MapReduce-based Entity Resolution," in 2012 IEEE 28th International Conference on Data Engineering, 2012, pp. 618–629.
[30] C. Y. Lin and Y. C. Lin, "An overall approach to achieve load balancing for Hadoop Distributed File System," Int. J. Web Grid Serv., vol. 13, no. 4, pp. 448–466, 2017.
[31] R. W. A. Fazul, P. V. Cardoso, and P. P. Barcelos, "Improving Data Availability in HDFS through Replica Balancing," in 2019 9th Latin-American Symposium on Dependable Computing (LADC), 2019, pp. 1–6.
[32] "Cloud Computing Services | Microsoft Azure." [Online]. Available: https://azure.microsoft.com/en-in/. [Accessed: 13-Jul-2021].
[33] K. Aziz, D. Zaidouni, and M. Bellafkih, "Real-time data analysis using Spark and Hadoop," in Proc. 2018 Int. Conf. Optim. Appl. (ICOA 2018), pp. 1–6, 2018.
[34] J. G. Shanahan and L. Dai, "Large scale distributed data science using Apache Spark," in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 2323–2324.
[35] "Apache Spark™ - Unified Analytics Engine for Big Data." [Online]. Available: https://spark.apache.org/. [Accessed: 05-Jan-2021].
[36] S. Salloum, R. Dautov, X. Chen, P. X. Peng, and J. Z. Huang, "Big data analytics on Apache Spark," Int. J. Data Sci. Anal., vol. 1, no. 3–4, pp. 145–164, 2016.
[37] H. Karau, A. Konwinski, P. Wendell, and M. Zaharia, Learning Spark: Lightning-Fast Data Analysis, 1st ed. O'Reilly Media, 2015.
[38] M. Zaharia et al., "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proc. 9th USENIX Symp. Networked Syst. Des. Implement. (NSDI 2012), pp. 15–28, 2012.
[39] "The Databricks Blog." [Online]. Available: https://databricks.com/blog. [Accessed: 13-Jul-2021].
[40] M. Armbrust et al., "Spark SQL: Relational data processing in Spark," in Proceedings of the ACM SIGMOD International Conference on Management of Data, 2015, pp. 1383–1394.
[41] M. Assefi, E. Behravesh, G. Liu, and A. P. Tafti, "Big data machine learning using Apache Spark MLlib," in Proceedings - 2017 IEEE International Conference on Big Data, 2017, pp. 3492–3498.
[42] R. S. Xin, J. E. Gonzalez, M. J. Franklin, and I. Stoica, "GraphX," in First International Workshop on Graph Data Management Experiences and Systems, 2013, pp. 1–6.
[43] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica, "Discretized streams," in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, 2013, pp. 423–438.
[44] M. Petrov, N. Butakov, D. Nasonov, and M. Melnik, "Adaptive performance model for dynamic scaling Apache Spark Streaming," Procedia Comput. Sci., vol. 136, pp. 109–117, 2018.
[45] P. D. Hung and D. Le Huynh, "E-Commerce Recommendation System Using Mahout," in 2019 IEEE 4th International Conference on Computer and Communication Systems (ICCCS), 2019, pp. 86–90.
[46] V. K. Vavilapalli et al., "Apache Hadoop YARN," in Proceedings of the 4th Annual Symposium on Cloud Computing, 2013, pp. 1–16.
[47] X. Liao, Z. Gao, W. Ji, and Y. Wang, "An enforcement of real time scheduling in Spark Streaming," in 2015 Sixth International Green and Sustainable Computing Conference (IGSC), 2015, pp. 1–6.
[48] X. Meng et al., "MLlib: Machine learning in Apache Spark," J. Mach. Learn. Res., vol. 17, pp. 1–7, 2016.
