CISD 42: Introduction to Spark, Spark Transformations, and Spark Actions

Apache Spark is an open-source distributed computing system designed for big data processing, offering high-speed, in-memory capabilities for various data tasks including machine learning and real-time processing. It consists of several modules such as Spark SQL, Spark Streaming, and MLlib, and operates on a master-slave architecture with features like fault tolerance and lazy evaluation. Spark provides two main data structures, RDDs and DataFrames, each with distinct use cases, advantages, and limitations, with DataFrames generally preferred for structured data due to their optimization capabilities.


INTRODUCTION TO SPARK

SPARK TRANSFORMATION
SPARK ACTIONS
WHAT IS APACHE SPARK?

• Open-source distributed computing system for big data processing
• Unified analytics engine for processing large-scale datasets
• Offers high-speed, in-memory processing capabilities
• Handles complex data processing tasks: data transformation, machine learning, graph processing, and real-time stream processing
• Used for large-scale data analytics, machine learning, and AI applications
• Known for its speed, flexibility, in-memory computing, real-time processing, and richer analytics
SPARK MODULES

• Spark Core
• Spark SQL
• Spark Streaming
• Spark MLlib
• Spark GraphX
SPARK MODULES (CON’T)
• These modules enhance the capabilities of Apache Spark, an open-source technology for big
data processing:
• Spark SQL: An API that converts SQL queries and actions into Spark tasks. It provides a
programming abstraction called DataFrames and can run Hadoop Hive queries faster.
• Spark Streaming: A solution for processing live data streams.
• MLlib and ML: Machine learning modules for designing pipelines used for feature engineering
and algorithm training.
• GraphX: A graph processing solution that extends RDDs into resilient distributed property graphs (RDPGs).
SPARK ECOSYSTEM
APACHE SPARK ARCHITECTURE

• Spark works in a master-slave architecture where the master is called the “Driver” and the slaves are called “Workers”.
• When you run a Spark application, the Spark Driver creates a context that is the entry point to your application; all operations (transformations and actions) are executed on worker nodes, and the resources are managed by the Cluster Manager.
FEATURES OF APACHE SPARK

• In-memory computation
• Distributed processing using parallelize()
• Can be used with many cluster managers (Spark Standalone, YARN, Mesos, etc.)
• Fault-tolerant
• Immutable
• Lazy evaluation
• Cache & persistence
• Built-in optimization when using DataFrames
• Supports ANSI SQL

Lazy evaluation in Spark is a feature that defers the execution of operations until they are absolutely necessary, optimizing performance by reducing the amount of data processed.
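
A minimal sketch of lazy evaluation (the file path and column name are made-up placeholders): the transformations below only record a plan, and nothing runs until an action is called.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LazyEvalDemo").getOrCreate()

# Reading and filtering are transformations: Spark only records the plan here.
logs = spark.read.json("events.json")                 # hypothetical input file
errors = logs.filter(logs["level"] == "ERROR")        # nothing executed yet

# count() is an action: it triggers the optimized plan and actually reads the data.
print(errors.count())

spark.stop()
```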
DIFFERENT IMPLEMENTATIONS OF SPARK.

• Spark – default interface for Scala and Java
• PySpark – Python interface for Spark
• sparklyr – R interface for Spark
PYSPARK

• PySpark is the Python API for Apache Spark. It enables you to perform real-time, large-scale data processing in a distributed environment using Python. It also provides a PySpark shell for interactively analyzing your data.
• PySpark combines Python’s learnability and ease of use with the power of Apache
Spark to enable processing and analysis of data at any size for everyone familiar with
Python.
• PySpark supports all of Spark’s features such as Spark SQL, DataFrames, Structured
Streaming, Machine Learning (MLlib) and Spark Core.
SPARK CORE AND RDDS

• Spark Core is the underlying general execution engine for the Spark platform that all other
functionality is built on top of.
• It provides RDDs (Resilient Distributed Datasets) and in-memory computing capabilities.
• Note that the RDD API is a low-level API which can be difficult to use and you do not get the
benefit of Spark’s automatic query optimization capabilities.
• We recommend using DataFrames instead of RDDs, as they let you express what you want more easily and let Spark automatically construct the most efficient query for you.

RESILIENT DISTRIBUTED DATASETS

• Commonly known as RDDs
• The primary data structure in Spark
• Distributed, fault-tolerant, and parallelizable data structure
• Efficiently processes large datasets across a cluster
• RDDs that are stored across nodes can be accessed in parallel
RDD CHARACTERISTICS:

• Immutable: RDDs can’t be modified once created. Instead, transformations on RDDs create new RDDs, allowing us to apply a series of transformations to the data.
• Distributed: data is partitioned and processed in parallel. RDDs are distributed across multiple machines in a cluster. Spark automatically partitions the data and distributes it across the nodes, enabling parallel processing.
• Resilient: RDDs are designed to recover from failures. Spark keeps track of the lineage of transformations applied to an RDD, allowing it to recover lost data and continue computations when executor nodes fail.
• Lazily evaluated: the execution plan is optimized, and transformations are evaluated only when necessary rather than executed immediately.
• Fault-tolerant operations: map, filter, reduce, collect, count, save, etc.
APACHE SPARK SUPPORTS TWO TYPES OF OPERATIONS

Transformations
• Create new RDDs from existing RDDs by applying computation/manipulation on the data.
• Lazily evaluated (lineage graph): they don’t execute immediately; instead they build a lineage graph to track the applied transformations.
• Examples: map, filter, flatMap, reduceByKey, sortBy, join.
• These transformations allow us to perform computations and manipulations on the data within RDDs.

Actions
• Operations that return results or perform actions on an RDD, triggering execution of all preceding transformations.
• Eagerly evaluated: they compute the result immediately and may involve data movement and computation across the cluster.
• Examples: collect, count, first, take, save, foreach.
• These actions provide us with final results or allow us to write the data to an external storage system.
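
A small, hedged sketch of transformations followed by actions on an RDD (the word list and app name are invented for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDDemo").getOrCreate()
sc = spark.sparkContext

# parallelize() distributes a local collection as an RDD.
words = sc.parallelize(["spark", "rdd", "spark", "action", "rdd", "spark"])

# Transformations: lazily build a lineage graph; nothing runs yet.
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# Actions: trigger execution of the whole lineage.
print(counts.collect())   # e.g. [('spark', 3), ('rdd', 2), ('action', 1)]
print(counts.count())     # number of distinct words

spark.stop()
```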
SPARK DATAFRAMES

• Spark DataFrames: raw data organized into rows and columns, allowing filtering, aggregating,
etc.
• The data contained in a DataFrame is physically located across multiple nodes of the Spark cluster, but it appears as a cohesive unit of data without exposing the complexity of the underlying operations.
• In simple terms, a DataFrame is like a table in a relational database, where data is organized
into rows and columns.
• DataFrames offer schema information, allowing Spark to optimize the execution of queries and
apply various optimizations.
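
A hedged sketch of creating a DataFrame and applying filtering and aggregation (the column names and rows are made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DataFrameDemo").getOrCreate()

# Build a small DataFrame from in-memory rows; the schema gives Spark column names and types.
people = spark.createDataFrame(
    [("Alice", "Sales", 3000), ("Bob", "Sales", 4000), ("Cara", "HR", 3500)],
    ["name", "dept", "salary"],
)

# Filter and aggregate with the SQL-like DataFrame API.
(people
    .filter(F.col("salary") > 3000)
    .groupBy("dept")
    .agg(F.avg("salary").alias("avg_salary"))
    .show())

spark.stop()
```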
DATAFRAME ADVANTAGES OVER RDD

• Optimized Execution: DataFrames provide Spark with schema information, enabling it to optimize the execution of queries and perform predicate pushdown, leading to faster and more efficient processing.
• Ease of Use: DataFrames offer a high-level, SQL-like interface, making it easier for developers to interact with data compared to the more complex RDD transformations and actions.
• Integration with Ecosystem: DataFrames seamlessly integrate with Spark's ecosystem, including Spark SQL, MLlib, and GraphX, enabling users to leverage various libraries and functionalities.
• Built-in Optimization: DataFrames leverage Spark's Catalyst optimizer, which performs advanced optimizations such as predicate pushdown and constant folding.
• Interoperability: DataFrames can be easily converted to and from other data formats, such as Pandas DataFrames in Python, enabling seamless integration with other data processing tools.
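
As a brief illustration of that interoperability (assuming pandas is available in the environment; the data is invented), a Spark DataFrame can be converted to and from a pandas DataFrame:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PandasInterop").getOrCreate()

# pandas -> Spark: distribute a local pandas DataFrame across the cluster.
pdf = pd.DataFrame({"name": ["Alice", "Bob"], "score": [91, 84]})
sdf = spark.createDataFrame(pdf)

# Spark -> pandas: collect a (small!) result back to the driver for local tools.
local = sdf.filter(sdf["score"] > 85).toPandas()
print(local)

spark.stop()
```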
WHEN TO USE RDDS

• Use RDDs (Resilient Distributed Datasets) when you need low-level control over data manipulation, are working with unstructured data, or require custom transformations.
• Prefer DataFrames for structured data, high-level operations, and when you want to leverage SQL-like syntax and built-in optimizations for better performance.
• In most cases, DataFrames are the recommended choice due to their ease of use and optimization capabilities.
WHEN TO USE RDDS:

• Unstructured data: Processing text streams, raw logs, or data without a clear schema.
• Custom transformations: When you need fine-grained control over data manipulation
with complex logic.
• Low-level control: Situations where you want to directly manage data partitioning and
operations.
WHEN TO USE DATAFRAMES:

• Structured data:
• Working with data that fits neatly into a relational table with defined columns and data
types.

• SQL-like queries:
• When you want to use familiar SQL syntax to filter, aggregate, and join data.

• Performance optimization:
• Taking advantage of Spark's built-in optimizations for large-scale data processing.
KEY DIFFERENCES:

• Data Structure:
• RDDs are more flexible, handling any kind of data, while DataFrames require a defined
schema with columns and data types.
• Operations:
• RDDs use functional programming constructs for transformations, whereas DataFrames offer
a more intuitive, SQL-like interface.
• Optimization:
• DataFrames benefit from Spark's Catalyst optimizer, enabling significant performance gains
for complex queries, while RDDs have less optimization potential.
BENEFITS OF RDD

• Low-level Transformation Control: RDDs provide fine-grained control over data transformations, allowing for complex custom processing.
• Fault-tolerance: RDDs are inherently fault-tolerant, automatically recovering from node failures through lineage information.
• Immutability: RDDs are immutable, which ensures consistency and simplifies debugging.
• Flexibility: Suitable for handling unstructured and semi-structured data, offering versatility in processing diverse data types.
LIMITATIONS OF RDD

• Lack of Optimization: RDDs lack built-in optimization mechanisms, requiring developers to optimize code for performance manually.
• Complex API: The low-level API can be more complicated and cumbersome, making development and debugging more challenging.
• No Schema Enforcement: RDDs do not enforce schemas, which can lead to potential data consistency issues and complicate the processing of structured data.
USE CASES OF RDD
• Iterative Machine Learning Algorithms: RDDs are particularly useful for algorithms that
require multiple passes over the data, such as K-means clustering or logistic regression.
Their immutability and fault tolerance are beneficial in these scenarios.
• Data Pipeline Construction: RDDs can be used to build complex data pipelines where data
undergoes several stages of transformations and actions, such as ETL (Extract, Transform,
Load) processes.
• Interactive Data Analysis: RDDs are suitable for interactive analysis of large datasets due
to their in-memory processing capabilities. They allow users to prototype and test their data
analysis workflows quickly.
USE CASES OF RDD (CON’T)

• Unstructured and Semi-structured Data Processing: RDDs are flexible enough to handle
various data formats, including JSON, XML, and CSV, making them suitable for processing
unstructured and semi-structured data.
• Streaming Data Processing: With Spark Streaming, you can use RDDs to process real-time
data streams, enabling use cases such as real-time analytics and monitoring.
• Graph Processing: RDDs can be used with GraphX, Spark’s API for graph processing, to
perform operations on large-scale graph data, like social network analysis.
SPARKCONTEXT VS SPARKSESSION

SparkContext
• Represents the connection to a Spark cluster
• Coordinates task execution across the cluster
• Entry point in earlier versions of Spark (1.x)

SparkSession
• Introduced in Spark 2.0
• Unified entry point for interacting with Spark
• Combines the functionalities of SparkContext, SQLContext, HiveContext, and StreamingContext
• Supports multiple programming languages (Scala, Java, Python, R)
• Seamlessly integrates various Spark features
FUNCTIONALITY DIFFERENCES BETWEEN SPARKCONTEXT AND SPARKSESSION

SparkContext
• Core functionality for low-level programming and cluster interaction
• Creates RDDs (Resilient Distributed Datasets)
• Performs transformations and defines actions

SparkSession
• Extends SparkContext functionality
• Higher-level abstractions like DataFrames and Datasets
• Supports structured querying using SQL or the DataFrame API
• Provides data source APIs, machine learning algorithms, and streaming capabilities
• Offers a higher-level API for structured data processing using DataFrames and Datasets
SPARKCONTEXT VS SPARKSESSION (CON’T)

• SparkContext and SparkSession have different purposes and functionality
• SparkContext is low-level, while SparkSession is higher-level
• SparkSession simplifies interaction and supports structured data processing
• SparkContext is still supported for backward compatibility
• SparkSession is the recommended entry point for Spark applications
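
A brief, hedged sketch of the two entry points (the app name is arbitrary): in modern code you create a SparkSession and reach the underlying SparkContext through it.

```python
from pyspark.sql import SparkSession

# SparkSession: the unified, recommended entry point (Spark 2.0+).
spark = SparkSession.builder.appName("EntryPointDemo").getOrCreate()

# The low-level SparkContext is still available via the session for RDD work.
sc = spark.sparkContext
rdd = sc.parallelize(range(10))
print(rdd.sum())                      # RDD (SparkContext-level) API

# Structured processing goes through the session itself.
df = spark.range(10)
df.selectExpr("sum(id)").show()       # DataFrame (SparkSession-level) API

spark.stop()
```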
REFERENCES

• https://spark.apache.org/docs/latest/api/python/index.html
• https://sparkbyexamples.com/
• https://medium.com/@akhilasaineni7/exploring-sparkcontext-and-sparksession-8369e60f658e
• https://youtu.be/aq68N-FH1yY?si=f_SStYHe0-rDUF7N
• https://www.analyticsvidhya.com/blog/2020/11/what-is-the-difference-between-rdds-dataframes-and-datasets/
