Apache Spark vs MapReduce

Apache Spark

-Apache Spark is an open-source, distributed, general-purpose cluster computing engine designed for large-scale data processing. The framework can run in standalone mode, in the cloud, or under a cluster manager such as Apache Mesos, Hadoop YARN, or Kubernetes. It is designed for fast performance and uses RAM for caching and processing data.

-The Spark engine was created to improve on the efficiency of MapReduce while keeping its benefits. Although Spark does not have its own file system, it can access data on many different storage systems. The core data structure that Spark uses is called the Resilient Distributed Dataset (RDD).

-Main Features of Apache Spark

1-In-Memory Processing: One of Spark's standout features is its ability to perform operations entirely in memory, reducing the time spent reading from and writing to disk.
2-Unified Data Processing Engine: Spark provides a unified framework that handles a variety of data processing needs (a minimal sketch follows this list):
- Batch Processing using Spark Core
- Stream Processing using Spark Streaming
- Machine Learning using MLlib
- Graph Processing using GraphX
- SQL Queries using Spark SQL
3-Ease of Use: With high-level APIs in languages such as Python, Scala, Java, and R, Spark is known for its simplicity and flexibility.
4-Fault Tolerance: Like Hadoop, Spark is fault-tolerant. It automatically recovers data and computations from node failures using RDD lineage, which ensures that no data is lost and that all tasks are properly executed in the event of hardware or network issues.
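
To make features 1-3 concrete, here is a minimal PySpark sketch (assuming a local pyspark installation; the file name sales.csv and its columns are hypothetical) that combines batch reading, in-memory caching, and a Spark SQL query in one application:

    from pyspark.sql import SparkSession

    # Start a session; "local[*]" runs Spark on all local cores (standalone mode).
    spark = SparkSession.builder.master("local[*]").appName("unified-demo").getOrCreate()

    # Batch processing with the DataFrame API (hypothetical sales.csv with columns region, amount).
    df = spark.read.csv("sales.csv", header=True, inferSchema=True)

    # Cache the DataFrame in memory so repeated queries avoid re-reading from disk.
    df.cache()

    # Query the same cached data with Spark SQL.
    df.createOrReplaceTempView("sales")
    spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()

    spark.stop()

The same SparkSession would also serve Spark Streaming and MLlib workloads, which is the point of the unified engine.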

Key Differences Between Hadoop and Spark


1. Processing Model:
- MapReduce processes data in batches, writing the results of intermediate steps to disk. This causes significant I/O overhead, especially for iterative tasks (such as machine learning algorithms).
- Apache Spark, on the other hand, uses in-memory processing wherever possible. This lets Spark keep data in RAM, reducing the need for time-consuming disk read/write operations. As a result, Spark can run jobs much faster than MapReduce, especially for iterative algorithms.
2. Speed:
- MapReduce suffers from high latency because of the continuous disk read/write operations between the Map and Reduce stages. This approach is less efficient for iterative tasks, where the same data must be processed multiple times.
- Spark improves speed significantly by keeping data in memory, enabling faster computation, especially for iterative tasks. In some cases, Spark can be up to 100 times faster than MapReduce (a caching sketch follows this list).
3. Fault Tolerance:
- Both systems ensure fault tolerance. However, MapReduce relies on re-reading data from disk when a task fails, which can slow down recovery.
- Spark uses RDD lineage for fault tolerance. It can recompute only the lost partitions of the data without reloading the entire dataset from disk, which ensures quicker recovery after node failures (a lineage sketch follows this list).
4. Cost:
- Hadoop: An open-source platform that is less expensive to run. It uses affordable commodity hardware, and trained Hadoop professionals are easier to find.
- Spark: Also open source, but it relies on memory for computation, which considerably increases running costs.
5. Scalability:
- Hadoop: Easily scaled by adding nodes and disks for storage. It supports tens of thousands of nodes without a known limit.
- Spark: Somewhat more challenging to scale because it relies on RAM for computation. It supports thousands of nodes in a cluster.
6. Security:
Comparing Hadoop and Spark on security, Hadoop is the clear winner. Above all, Spark's security is off by default, which leaves a setup exposed unless you address the issue. You can improve Spark's security by introducing authentication via a shared secret and by enabling event logging, but that is not enough for production workloads.
In contrast, Hadoop works with multiple authentication and access-control methods. The most difficult to implement is Kerberos authentication. If Kerberos is too much to handle, Hadoop also supports Ranger, LDAP, ACLs, inter-node encryption, standard file permissions on HDFS, and Service Level Authorization.
7. Ease of Use and Language Support:
Spark may be the newer framework, with fewer available experts than Hadoop, but it is known to be more user-friendly. Spark supports multiple languages alongside its native Scala: Java, Python, R, and Spark SQL. This lets developers use the programming language they prefer.
In addition to its multi-language APIs, Spark wins the ease-of-use comparison with its interactive mode. You can use the Spark shell to analyze data interactively in Scala or Python, and the shell gives instant feedback to queries, which makes Spark easier to use than Hadoop MapReduce.
Another thing that gives Spark the upper hand is that programmers can reuse existing code where applicable, which reduces application-development time. Historical and stream data can be combined to make this process even more effective.

8. Machine Learning:
- Hadoop: Slower than Spark; data fragments can be too large and create bottlenecks. Mahout is the main library.
- Spark: Much faster thanks to in-memory processing. Uses MLlib for computations (an MLlib sketch follows this list).
9. Scheduling and Resource Management:
- Hadoop: Uses external solutions. YARN is the most common option for resource management.
- Spark: Has built-in tools for resource allocation, scheduling, and monitoring (a configuration sketch follows this list).
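
Caching sketch (differences 1 and 2). A minimal PySpark illustration, with invented data, of why keeping a dataset in RAM helps iterative work; MapReduce would write intermediate results to disk between comparable passes:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("caching-demo").getOrCreate()

    # A small numeric dataset; in a real iterative algorithm this would be training data.
    data = spark.range(1000000).withColumnRenamed("id", "x")

    # Persist the dataset in memory so each pass reads RAM, not disk.
    data.cache()

    # Repeated passes over the same data, as an iterative algorithm would make.
    for i in range(5):
        total = data.selectExpr("sum(x) AS s").first()["s"]
        print("pass", i, "sum =", total)

    spark.stop()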
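
Lineage sketch (difference 3). A minimal look at RDD lineage, with illustrative data: toDebugString prints the chain of transformations Spark would replay to rebuild a lost partition, which is why it never needs to reload the whole dataset:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("lineage-demo").getOrCreate()
    sc = spark.sparkContext

    # Build an RDD through a chain of transformations.
    rdd = sc.parallelize(range(100)).map(lambda x: x * 2).filter(lambda x: x % 3 == 0)

    # The lineage records these steps; if a node holding a partition fails,
    # Spark recomputes only that partition from this recipe.
    print(rdd.toDebugString().decode("utf-8"))

    spark.stop()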
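
MLlib sketch (difference 8). A minimal example of MLlib's DataFrame-based API; the four labeled feature vectors are invented for illustration:

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.master("local[*]").appName("mllib-demo").getOrCreate()

    # A tiny toy training set of (label, features) rows.
    train = spark.createDataFrame(
        [(0.0, Vectors.dense([0.0, 1.1])),
         (1.0, Vectors.dense([2.0, 1.0])),
         (0.0, Vectors.dense([0.5, 0.9])),
         (1.0, Vectors.dense([2.2, 1.3]))],
        ["label", "features"])

    # Fit a logistic-regression model in memory and inspect its coefficients.
    model = LogisticRegression(maxIter=10).fit(train)
    print(model.coefficients)

    spark.stop()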
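
Configuration sketch (difference 9). A hedged example of handing resource settings to Spark's built-in scheduler at session start; the values are placeholders, and the same keys apply when the master is a YARN or standalone cluster instead of local mode:

    from pyspark.sql import SparkSession

    # Resource settings for Spark's scheduler; spark.executor.memory and
    # spark.executor.cores are standard Spark configuration properties.
    spark = (SparkSession.builder
             .master("local[*]")  # or "yarn" / "spark://host:7077"
             .appName("resource-demo")
             .config("spark.executor.memory", "2g")
             .config("spark.executor.cores", "2")
             .getOrCreate())

    print(spark.sparkContext.getConf().get("spark.executor.memory"))
    spark.stop()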
