A Critical Analysis of Apache Hadoop and Spark For Big Data Processing

Hari Singh
Computer Science & Engineering Department
Jaypee University of Information Technology
Solan, Himachal Pradesh, India
[email protected]
structured and unstructured data. It uses the Hadoop Distributed File System (HDFS) [12] for storage handling and the MapReduce framework for data processing [13], [14]. It is a cost-effective platform that can be realized on commodity hardware.

Hadoop also offers data locality optimization: there is no need to transfer the whole dataset to a central processing system, which saves a huge amount of bandwidth and time. If any machine fails during execution, Hadoop continues processing the cluster without any loss of data or interruption by shifting the processing task to some other machine in the cluster. For managing storage on clusters, HDFS breaks incoming files into pieces known as "blocks" and stores them across the pool of servers; finally, three complete copies of each block are stored on three different servers [12].

Different kinds of operations are written as map and reduce jobs using programming languages and tools such as Java, Hive and Pig. The map function processes the input and emits intermediate key-value pairs; after that, the reduce function performs its operation and merges all the intermediate values associated with the same intermediate key. The output of these jobs can be written back to HDFS [13], [17]. However, the MapReduce architecture introduced in Hadoop-1.x has certain limitations with respect to the shuffling phase and the task scheduling system. Between the map and reduce functions, intermediate data is shuffled from map tasks to reduce tasks, as shown in Figure II. This data shuffling causes a large number of disk accesses and consumes a lot of I/O bandwidth.
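To make the map and reduce roles concrete, the classic word-count job can be sketched through Hadoop Streaming, which accepts map and reduce tasks written as ordinary scripts rather than Java classes. The sketch below is illustrative only and is not taken from the paper; the Python file names and the input/output paths are hypothetical.

mapper.py:

    #!/usr/bin/env python3
    # Map task: read raw text from stdin, emit one (word, 1) pair per word.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

reducer.py:

    #!/usr/bin/env python3
    # Reduce task: the shuffle phase delivers pairs grouped and sorted by
    # key, so all counts for one word arrive as consecutive lines.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

A job of this shape would be submitted with the streaming jar, for example: hadoop jar hadoop-streaming.jar -input /data/in -output /data/out -mapper mapper.py -reducer reducer.py. Its output is written back to HDFS, and the intermediate (word, 1) pairs are exactly the data moved during the shuffle phase discussed above.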
III. APACHE SPARK

This section describes the background and framework of Apache Spark and critically compares different versions of Apache Spark.
Table I: Comparison among Hadoop-1.x, Hadoop-2.x and Hadoop-3.x

Compatible File System [9], [32]
  Hadoop-1.x: HDFS, FTP File System, Amazon S3
  Hadoop-2.x: HDFS, FTP File System, Amazon S3, Windows Azure Storage Blobs (WASB)
  Hadoop-3.x: HDFS, FTP File System, Amazon S3, Microsoft Azure Data Lake file system

Data analytics [11], [24], [33]
  Hadoop-1.x: No platform for event processing, streaming and real-time operations
  Hadoop-2.x: Platform available for event processing, streaming and real-time operations
  Hadoop-3.x: Platform available for event processing, streaming and real-time operations
different sources like Kafka, Flume and Kinesis, etc. To achieve the flexibility of scaling from a standalone node to thousands of computing nodes, Spark can use different cluster managers, including Hadoop YARN, Apache Mesos, Amazon EC2 and Spark's own Standalone Cluster Manager. The fourth segment of Spark is the storage segment, which is used to create distributed datasets. Different storage systems used by Spark are Hadoop's own HDFS, Hive, HBase, Cassandra, Amazon S3, etc. [37].
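As a brief illustration of how the cluster-manager and storage segments come together, the following PySpark sketch (not from the paper; the master URL and HDFS path are hypothetical) starts a session against Spark's standalone cluster manager and creates a dataset from HDFS:

    from pyspark.sql import SparkSession

    # Point the application at Spark's standalone cluster manager;
    # "yarn" or a mesos:// URL could be used here instead.
    spark = (SparkSession.builder
             .master("spark://master-host:7077")   # hypothetical master URL
             .appName("storage-demo")
             .getOrCreate())

    # Create a distributed dataset from a supported storage system
    # (HDFS here; S3, Hive, HBase or Cassandra connectors behave similarly).
    lines = spark.read.text("hdfs://namenode:9000/data/input.txt")
    print(lines.count())

Swapping the master URL is all that is needed to move the same application from a single node to a YARN or Mesos cluster.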
Two key features of Spark are the in-memory processing of large datasets and the concept of Resilient Distributed Datasets (RDDs). In-memory processing means that data processing takes place in main memory rather than in secondary storage such as disk. RDDs enable programmers to perform in-memory processing on large clusters with faster execution and a fault-tolerant mechanism. RDDs are best suited to computing-framework applications such as interactive data-mining tools and iterative algorithms, because keeping the processing data in main memory increases performance compared to the disk-based processing system used by Hadoop. RDDs provide shared memory in a restricted form, which is what makes them fault tolerant: they are called resilient because they are immutable and cannot be modified, although a new RDD can always be created from existing RDDs or from external data sources. RDDs support two types of operations: transformations and actions. Transformations include map, filter and join; they are known as lazy operations because they only define a new RDD. Actions perform the computations and return a value to the program. Each RDD is represented through a common interface by five pieces of information: its partitions, preferred locations, dependencies, an iterator and metadata about its partitioning scheme. Spark exposes RDDs through language-integrated APIs in which an object represents a dataset and transformations are applied over these objects. One or more RDDs can be defined from existing RDDs using transformations, and these RDDs can then be used in actions, the operations that return output to the application or export data to a storage system. For example, count returns the number of dataset elements, collect returns the elements themselves, and save writes the output to a storage system. RDDs have various advantages over DSM (Distributed Shared Memory) in terms of performance and execution speed [38].
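The laziness of transformations and the role of actions can be seen in a short PySpark sketch; it is illustrative only, and the output path is hypothetical:

    from pyspark import SparkContext

    sc = SparkContext(appName="rdd-demo")

    # Transformations are lazy: each call only defines a new immutable RDD.
    nums    = sc.parallelize(range(1, 1001))
    evens   = nums.filter(lambda x: x % 2 == 0)   # filter transformation
    squares = evens.map(lambda x: x * x)          # map transformation
    squares.cache()   # keep the partitions in main memory for reuse

    # Actions trigger the actual computation and return a value.
    print(squares.count())        # number of dataset elements
    print(squares.collect()[:5])  # return elements to the driver
    squares.saveAsTextFile("hdfs://namenode:9000/out/squares")  # export to storage

Nothing is computed until count is called; the cached RDD is then reused by the subsequent collect and saveAsTextFile actions instead of being recomputed.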
In-memory processing of data is one of the key features of Spark that differentiates it from the disk-based processing of Hadoop's MapReduce model, and it makes Spark particularly efficient for iterative data processing tasks. Spark stores intermediate results in main memory instead of on disk and then performs transformations and actions over them. Spark uses a DAG (directed acyclic graph) model, a logical flow of operations that is built up while transformation operations are defined. When an action operation is invoked, the DAG is submitted to the DAG scheduler and the jobs are performed as per the sequence of the DAG. A DAG is basically an arrangement of vertices and edges, in which vertices represent the different RDDs and edges represent the operations executed over those RDDs. The DAG scheduler is the part of Spark's scheduling layer that maintains jobs and stages; when an action operation is executed, Spark calls the DAG scheduler to execute the submitted task.

Although the RDD API was very useful, it lacked automatic optimization because it carries no information about the structure of the data or the user functions. RDDs store data as collections of Java objects, runtime errors could not be debugged easily, and there were performance and scalability issues. To overcome this, DataFrames were introduced in Spark 1.4. A DataFrame is a distributed data collection organized into rows and columns. Spark DataFrames can be created from various sources such as log tables, Hive tables, existing RDDs and external databases, and they can be integrated with other big data tools to process huge datasets at once [36]. Spark then introduced the Dataset as an extension of the DataFrame API in Spark 1.6 to provide an object-oriented programming interface. A Dataset is a type-safe, immutable collection of objects mapped to a relational schema. Datasets benefit from Tungsten's fast in-memory encoding and from Spark's Catalyst optimizer, which exposes data fields and expressions to the query planner. Hence Spark applications can be written very efficiently, with almost 2x processing speed and 75% less memory usage compared to RDDs [39].
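A minimal DataFrame sketch follows; it is illustrative, with a hypothetical Hive table and column names. The typed Dataset API is exposed in Scala and Java, so the Python example below uses the untyped DataFrame, which goes through the same Catalyst optimizer:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

    # A DataFrame is a distributed collection organized into rows and
    # columns; here one is created from an existing Hive table.
    logs = spark.table("web_logs")   # hypothetical table

    # Because the schema is known, Catalyst can optimize the whole plan.
    errors_per_host = (logs
                       .filter(F.col("status") >= 500)
                       .groupBy("host")
                       .agg(F.count("*").alias("errors")))

    errors_per_host.show()

Because the query is expressed over named columns rather than opaque Java objects, the planner can apply optimizations that the RDD API cannot.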
C. Comparison among Spark-1.x, Spark-2.x and Spark-3.x

At a time when Hadoop MapReduce was one of the most dominant platforms for processing big data on clusters of more than a thousand nodes, the Apache Spark project was started in 2009. The MapReduce engine was inefficient for building large applications for streaming data and machine learning algorithms, in which 10 to 20 passes over the data are required and each pass needs to be written as a separate MapReduce job, consuming a great deal of processing time. To address these issues, the Spark team first built an API for functional programming suitable for batch processing. To handle interactive queries, the Shark engine, which can run SQL queries over Spark, was launched in 2011. The focus then shifted to developing the Spark libraries by following the "standard library" approach.
After the initial releases based on "functional operations", Spark was released with a set of four integrated libraries and large API support. The Apache Software Foundation officially released Spark 1.0.0 on May 30, 2014, Spark 2.0.0 on July 26, 2016 and Spark 3.0.0 on June 18, 2020. Each successive version of Spark brings performance improvements and overcomes the shortcomings of the previous versions. The three versions of Spark are analysed on the following parameters: Daemons, Components, API Support, Spark SQL, MLlib, GraphX, Streaming, Performance and Scalability, Data Sources, Cluster Managers, Support for TPC-DS queries, and Support for Windows and UNIX; the comparison is presented in Table II.
Table II: Comparison among Spark-1.x, Spark-2.x and Spark-3.x

Daemons [35]
  Spark-1.x: Master Daemon, Worker Daemon
  Spark-2.x: Master Daemon, Worker Daemon
  Spark-3.x: Master Daemon, Worker Daemon

Components [36]
  Spark-1.x: Spark Core, APIs, Libraries, Storage system, Cluster manager
  Spark-2.x: Spark Core, APIs, Libraries, Storage, Cluster manager
  Spark-3.x: Spark Core, APIs, Libraries, Storage, Cluster manager

API Support [35], [39]
  Spark-1.x: Scala, Python, Java, SQL and R
  Spark-2.x: Scala, Python, Java, SQL and R
  Spark-3.x: Scala, Python, Java, SQL and R

Spark SQL [36], [40]
  Spark-1.x: Support for loading and manipulating structured data in Spark, RDDs
  Spark-2.x: Improved SQL functionalities with SQL 2003 support, DataFrames API
  Spark-3.x: Adaptive query execution, Dynamic Partition Pruning, Dataset API

MLlib [41]
  Spark-1.x: Support for sparse feature vectors in Scala, Java and Python
  Spark-2.x: DataFrame-based primary API
  Spark-3.x: Performance improvement and support for Deep Learning

GraphX [42]
  Spark-1.x: Performance improvement in loading of graphs, reversal of edges and neighborhood computation
  Spark-2.x: Performance improvement
  Spark-3.x: SparkGraph with Cypher query language

Streaming [43], [44]
  Spark-1.x: Performance optimizations for stream transformation
  Spark-2.x: High-level streaming API on top of Spark SQL and the Catalyst optimizer
  Spark-3.x: New Spark UI

Performance and Scalability [35], [39]
  Spark-1.x: Good
  Spark-2.x: Comparatively better than Spark-1.x
  Spark-3.x: Comparatively better than Spark-1.x and 2.x

Data Sources [35], [39]
  Spark-1.x: HDFS, Cassandra, HBase, Alluxio (Tachyon), MongoDB
  Spark-2.x: HDFS, Cassandra, HBase, Alluxio (Tachyon), MongoDB, Kafka, ElasticSearch
  Spark-3.x: HDFS, Cassandra, HBase, Alluxio (Tachyon), MongoDB, Kafka, ElasticSearch

Cluster Managers [35], [39]
  Spark-1.x: Spark Standalone, Hadoop YARN, Apache Mesos, Amazon EC2
  Spark-2.x: Spark Standalone, Hadoop YARN, Apache Mesos, Amazon EC2, Kubernetes
  Spark-3.x: Spark Standalone, Hadoop YARN, Apache Mesos, Amazon EC2, Kubernetes

Support for TPC-DS queries [39]
  Spark-1.x: 55 out of 99 TPC-DS queries (Spark 1.6)
  Spark-2.x: All 99 TPC-DS queries
  Spark-3.x: 2x to 17x performance improvement over Spark 2.4 for all 99 TPC-DS queries

Support for Windows and UNIX [35]
  Spark-1.x: Supported
  Spark-2.x: Supported
  Spark-3.x: Supported
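Two of the Spark-3.x entries in Table II, adaptive query execution and dynamic partition pruning, are controlled through ordinary session configuration. A minimal sketch follows (application name hypothetical; both keys are standard Spark 3.x settings, set explicitly here even though recent releases enable them by default):

    from pyspark.sql import SparkSession

    # Adaptive query execution re-optimizes the plan at runtime using
    # actual statistics; dynamic partition pruning skips partitions
    # that a join cannot touch.
    spark = (SparkSession.builder
             .appName("spark3-features")
             .config("spark.sql.adaptive.enabled", "true")
             .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
             .getOrCreate())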
compatibility with the Microsoft Azure Data Lake file system makes Hadoop-3.x a very effective data processing tool in terms of processing performance and memory utilization. Similarly, features such as the use of Datasets in place of RDDs, adaptive query execution, dynamic partition pruning and the improved Pandas API make Spark 3.x a versatile, flexible, fast and memory-efficient data processing and analytical tool, especially for streaming and unstructured data. Hence, it can be concluded that the Hadoop MapReduce platform is very effective for batch-processing tasks, whereas Apache Spark is one of the most effective platforms for the processing and analysis of streaming data.

REFERENCES

[1] D. P. Acharjya and K. Ahmed P, "A Survey on Big Data Analytics: Challenges, Open Research Issues and Tools," Int. J. Comput. Sci. Eng., vol. 6, no. 6, pp. 1238–1244, 2018.
[2] T. R. Rao, P. Mitra, R. Bhatt, and A. Goswami, "The big data system, components, tools, and technologies: a survey," Knowl. Inf. Syst., vol. 60, no. 3, pp. 1165–1245, Sep. 2019.
[3] Ishwarappa and J. Anuradha, "A brief introduction on big data 5Vs characteristics and Hadoop technology," Procedia Comput. Sci., vol. 48, pp. 319–324, 2015.
[4] H. Singh and S. Bawa, "A MapReduce-based scalable discovery and indexing of structured big data," Futur. Gener. Comput. Syst., vol. 73, pp. 32–43, 2017.
[5] M. Mittal, H. Singh, K. K. Paliwal, and L. M. Goyal, "Efficient random data accessing in MapReduce," in 2017 International Conference on Infocom Technologies and Unmanned Systems (Trends and Future Directions) (ICTUS), 2017, pp. 552–556.
[6] H. Singh and S. Bawa, "Scalability and Fault Tolerance of MapReduce for Spatial Data," Glob. J. Eng. Sci. Res. Manag., vol. 3, no. 8, pp. 97–103, 2016.
[7] H. Singh and S. Bawa, "Spatial data analysis with ArcGIS and MapReduce," in Proc. IEEE International Conference on Computing, Communication and Automation (ICCCA 2016), 2017, pp. 45–49.
[8] S. Shahrivari, "Beyond Batch Processing: Towards Real-Time and Streaming Big Data," Computers, vol. 3, no. 4, pp. 117–129, Oct. 2014.
[9] "Apache Hadoop." [Online]. Available: https://hadoop.apache.org/. [Accessed: 23-Dec-2020].
[10] I. Polato, R. Ré, A. Goldman, and F. Kon, "A comprehensive view of Hadoop research—A systematic literature review," J. Netw. Comput. Appl., vol. 46, pp. 1–25, Nov. 2014.
[11] S. Landset, T. M. Khoshgoftaar, A. N. Richter, and T. Hasanin, "A survey of open source tools for machine learning with big data in the Hadoop ecosystem," J. Big Data, vol. 2, no. 1, pp. 1–36, 2015.
[12] D. Borthakur, "HDFS Architecture Guide," The Apache Software Foundation, 2008, pp. 1–14.
[13] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008.
[14] S. Sakr, A. Liu, and A. G. Fayoumi, "The family of MapReduce and large-scale data processing systems," ACM Comput. Surv., vol. 46, no. 1, pp. 1–44, 2013.
[15] H. S. Bhosale and D. P. Gadekar, "A Review Paper on Big Data and Hadoop," Int. J. Sci. Res. Publ., vol. 4, no. 1, pp. 2250–3153, 2014.
[16] C. S. Rani and B. Rama, "MapReduce with Hadoop for Simplified Analysis of Big Data," Int. J. Adv. Res. Comput. Sci., vol. 8, no. 5, pp. 2015–2018, 2017.
[17] K. Shim, "Databases in Networked Information Systems," Proc. VLDB Endow., vol. 5, no. 12, pp. 2016–2017, 2010.
[18] J. Xie, Y. Tian, S. Yin, J. Zhang, X. Ruan, and X. Qin, "Adaptive preshuffling in Hadoop Clusters," Procedia Comput. Sci., vol. 18, pp. 2458–2467, 2013.
[19] W. Yu, Y. Wang, X. Que, and C. Xu, "Virtual Shuffling for Efficient Data Movement in MapReduce," IEEE Trans. Comput., vol. 64, no. 2, pp. 556–568, 2015.
[20] Q. Zhang, M. F. Zhani, Y. Yang, R. Boutaba, and B. Wong, "PRISM: Fine-grained resource-aware scheduling for MapReduce," IEEE Trans. Cloud Comput., vol. 3, no. 2, pp. 182–194, 2015.
[21] J. F. Weets, M. K. Kakhani, and A. Kumar, "Limitations and challenges of HDFS and MapReduce," in Proc. 2015 Int. Conf. Green Comput. Internet Things (ICGCIoT 2015), vol. 2, pp. 545–549, 2016.
[22] R. J. Chansler, "Data Availability and Durability with the Hadoop Distributed File System," ;login:, vol. 37, no. 1, pp. 16–22, 2012.
[23] Y. Liu and W. Wei, "A Replication-Based Mechanism for Fault Tolerance in MapReduce Framework," Math. Probl. Eng., vol. 2015, pp. 1–7, 2015.
[24] V. K. Vavilapalli et al., "Apache Hadoop YARN," in Proceedings of the 4th Annual Symposium on Cloud Computing, 2013, pp. 1–16.
[25] A. P. Kulkami and M. Khandewal, "Survey on Hadoop and Introduction to YARN," Int. J. Emerg. Technol. Adv. Eng., vol. 4, no. 5, pp. 82–87, 2014.
[26] M. M. Shetty and D. H. Manjaiah, "Data security in Hadoop distributed file system," in Proc. IEEE Int. Conf. Emerg. Technol. Trends Comput. Commun. Electr. Eng. (ICETT 2016), pp. 939–944, 2017.
[27] A. C. Ko and W. T. Zaw, "Fault tolerant erasure coded replication for HDFS based cloud storage," in Proc. 4th IEEE Int. Conf. Big Data Cloud Comput., pp. 104–109, 2014.
[28] A. Chiniah and A. Mungur, "Dynamic Erasure Coding Policy Allocation (DECPA) in Hadoop 3.0," in Proc. 6th IEEE Int. Conf. Cyber Secur. Cloud Comput. (CSCloud 2019) / 5th IEEE Int. Conf. Edge Comput. Scalable Cloud (EdgeCom 2019), pp. 29–33, 2019.
[29] L. Kolb, A. Thor, and E. Rahm, "Load Balancing for MapReduce-based Entity Resolution," in 2012 IEEE 28th International Conference on Data Engineering, 2012, pp. 618–629.
[30] C. Y. Lin and Y. C. Lin, "An overall approach to achieve load balancing for Hadoop Distributed File System," Int. J. Web Grid Serv., vol. 13, no. 4, pp. 448–466, 2017.
[31] R. W. A. Fazul, P. V. Cardoso, and P. P. Barcelos, "Improving Data Availability in HDFS through Replica Balancing," in 2019 9th Latin-American Symposium on Dependable Computing (LADC), 2019, pp. 1–6.
[32] "Cloud Computing Services | Microsoft Azure." [Online]. Available: https://azure.microsoft.com/en-in/. [Accessed: 13-Jul-2021].
[33] K. Aziz, D. Zaidouni, and M. Bellafkih, "Real-time data analysis using Spark and Hadoop," in Proc. 2018 Int. Conf. Optim. Appl. (ICOA 2018), pp. 1–6, 2018.
[34] J. G. Shanahan and L. Dai, "Large scale distributed data science using Apache Spark," in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 2323–2324.
[35] "Apache Spark™ - Unified Analytics Engine for Big Data." [Online]. Available: https://spark.apache.org/. [Accessed: 05-Jan-2021].
[36] S. Salloum, R. Dautov, X. Chen, P. X. Peng, and J. Z. Huang, "Big data analytics on Apache Spark," International Journal of Data Science and Analytics, vol. 1, no. 3–4, pp. 145–164, 2016.
[37] H. Karau, A. Konwinski, P. Wendell, and M. Zaharia, Learning Spark: Lightning-Fast Data Analysis, 1st ed. O'Reilly Media, 2015.
[38] M. Zaharia et al., "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proc. 9th USENIX Symp. Networked Syst. Des. Implement. (NSDI 2012), pp. 15–28, 2012.
[39] "The Databricks Blog." [Online]. Available: https://databricks.com/blog. [Accessed: 13-Jul-2021].
[40] M. Armbrust et al., "Spark SQL: Relational data processing in Spark," in Proceedings of the ACM SIGMOD International Conference on Management of Data, 2015, pp. 1383–1394.
[41] M. Assefi, E. Behravesh, G. Liu, and A. P. Tafti, "Big data machine learning using Apache Spark MLlib," in Proc. 2017 IEEE International Conference on Big Data (Big Data 2017), 2017, pp. 3492–3498.
[42] R. S. Xin, J. E. Gonzalez, M. J. Franklin, and I. Stoica, "GraphX," in First International Workshop on Graph Data Management Experiences and Systems, 2013, pp. 1–6.
[43] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica, "Discretized streams," in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, 2013, pp. 423–438.
[44] M. Petrov, N. Butakov, D. Nasonov, and M. Melnik, "Adaptive performance model for dynamic scaling Apache Spark Streaming," Procedia Comput. Sci., vol. 136, pp. 109–117, 2018.
[45] P. D. Hung and D. Le Huynh, "E-Commerce Recommendation System Using Mahout," in 2019 IEEE 4th International Conference on Computer and Communication Systems (ICCCS), 2019, pp. 86–90.
[46] V. K. Vavilapalli et al., "Apache Hadoop YARN," in Proceedings of the 4th Annual Symposium on Cloud Computing, 2013, pp. 1–16.
[47] X. Liao, Z. Gao, W. Ji, and Y. Wang, "An enforcement of real time scheduling in Spark Streaming," in 2015 Sixth International Green and Sustainable Computing Conference (IGSC), 2015, pp. 1–6.
[48] X. Meng et al., "MLlib: Machine learning in Apache Spark," J. Mach. Learn. Res., vol. 17, pp. 1–7, 2016.