
Industries use Hadoop to analyze their data sets because the Hadoop framework is based on a simple programming model (MapReduce) and enables a computing solution that is scalable, flexible, fault-tolerant, and cost-effective. The main concern, however, is maintaining speed when processing large datasets, both in the waiting time between queries and in the time it takes to run a program.
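As a rough illustration of the MapReduce programming model mentioned above, here is a word-count sketch in plain Python (a toy simulation of the map, shuffle, and reduce phases, not Hadoop itself; the function and variable names are illustrative):

```python
from itertools import groupby

def map_phase(line):
    # Map: emit a (word, 1) pair for each word in a line.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Shuffle/sort: group pairs by key, then reduce each group by summing counts.
    pairs = sorted(pairs)
    return {
        word: sum(count for _, count in group)
        for word, group in groupby(pairs, key=lambda p: p[0])
    }

lines = ["spark speeds up hadoop", "hadoop stores data", "spark processes data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(mapped)
print(counts["hadoop"])  # 2
print(counts["spark"])   # 2
```

In a real Hadoop job, the map and reduce phases run in parallel across a cluster, and the intermediate pairs are written to disk between phases; that disk traffic is the speed cost Spark was designed to reduce.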

The Apache Software Foundation introduced Spark to speed up Hadoop's computational process.

Spark is not a modified version of Hadoop, and it does not depend on Hadoop, because it has its own cluster management. Hadoop is just one of the ways to implement Spark.

Spark uses Hadoop in two ways: one is storage and the other is processing. Since Spark has its own cluster-management computation, it typically uses Hadoop for storage only.

Apache Spark

Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is based on the Hadoop MapReduce model and extends it to efficiently support more types of computations, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.

Spark is designed to cover a wide range of workloads such as batch applications, interactive
queries, and streaming.

Features of Apache Spark

Apache Spark has the following features.


• Speed − Spark can run an application in a Hadoop cluster up to 100 times faster in memory and 10 times faster on disk. It achieves this by reducing the number of read/write operations to disk, storing intermediate processing data in memory instead.

• Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python, so you can write applications in different languages. Spark also comes with 80 high-level operators for interactive querying.

• Advanced Analytics − Spark supports not only ‘Map’ and ‘Reduce’ but also SQL queries, streaming data, machine learning (ML), and graph algorithms.

• Flexibility: Spark supports more than one language, permitting developers to write applications in Python, R, Scala, or Java.

• In-memory computing: Apache Spark can store data in the servers' RAM, which permits quick access.

• Real-time processing: Apache Spark can process streaming data in real time. Unlike MapReduce, which processes only stored data, Spark can act on data as it arrives and therefore produce instant results.
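The speed and in-memory computing points above can be sketched in plain Python (a toy cache standing in for a cached dataset, not Spark's actual engine; all names here are illustrative): keeping an intermediate result in memory means later queries avoid recomputing or re-reading it.

```python
compute_calls = 0

def expensive_transform(data):
    # Stand-in for a costly disk read or recomputation.
    global compute_calls
    compute_calls += 1
    return [x * x for x in data]

cache = {}  # in-memory store for intermediate results, like a cached dataset

def cached_transform(key, data):
    # Compute once, then serve every later query from memory.
    if key not in cache:
        cache[key] = expensive_transform(data)
    return cache[key]

data = list(range(5))
first = cached_transform("squares", data)   # computes the result
second = cached_transform("squares", data)  # served from memory
print(compute_calls)  # 1
print(second)         # [0, 1, 4, 9, 16]
```

In Spark the analogous step is explicitly caching a dataset in cluster memory so that iterative algorithms and repeated interactive queries reuse it without touching disk.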

Components of Spark

Spark consists of the following components.
Spark Core:
• The foundation of Spark, providing distributed task scheduling, memory management,
fault recovery, and storage.
Spark SQL:
• A module for working with structured and semi-structured data.
• Allows querying data using SQL-like syntax and supports integration with Hive.
Spark Streaming:
• Enables processing of real-time data streams.
• Ideal for tasks like log processing, event detection, and real-time analytics.
MLlib:
• A library for machine learning algorithms, such as classification, regression,
clustering, and recommendation.
GraphX:
• A library for graph-based computations and analytics.
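The components above are driven through chains of high-level operators. The dataflow style can be mimicked in plain Python's map/filter/reduce (a sketch of the operator-chaining shape, not the Spark API; the event records are made up for illustration):

```python
from functools import reduce

# A small dataset of raw event records, stand-ins for rows in a Spark job.
events = ["click 3", "view 1", "click 7", "view 2", "click 5"]

# map: parse each record into a (type, value) pair;
# filter: keep only the click events;
# reduce: aggregate the click values into one result --
# the same transformation/action shape used with Spark's operators.
parsed = map(lambda e: (e.split()[0], int(e.split()[1])), events)
clicks = filter(lambda pair: pair[0] == "click", parsed)
total = reduce(lambda acc, pair: acc + pair[1], clicks, 0)
print(total)  # 15
```

In Spark, the same pipeline would run distributed across the cluster, with Spark Core scheduling the tasks and the transformations evaluated lazily until an action (like the final reduce) triggers execution.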
