CISD 42 Introduction to Spark_Spark Transformation_Spark Actions
SPARK TRANSFORMATION
SPARK ACTIONS
WHAT IS APACHE SPARK?
• Spark Core
• Spark SQL
• Spark Streaming
• Spark MLlib
• Spark GraphX
SPARK MODULES FOR APACHE SPARK (CONT'D)
• These modules enhance the capabilities of Apache Spark, an open-source technology for big
data processing:
• Spark SQL: An API that converts SQL queries and actions into Spark tasks. It provides a
programming abstraction called DataFrames and can run Hadoop Hive queries faster.
• Spark Streaming: A solution for processing live data streams.
• MLlib and ML: Machine learning modules for designing pipelines used for feature engineering
and algorithm training.
• GraphX: A graph processing solution that converts RDDs to resilient distributed property graphs
(RDPGs).
SPARK ECOSYSTEM
APACHE SPARK ARCHITECTURE
• In-memory computation
• Distributed parallel processing across a cluster
• Can be used with many cluster managers (Spark Standalone, YARN, Mesos, etc.)
• Fault-tolerant
• Immutable
• Lazy evaluation
• Cache & persistence
• Built-in optimization when using DataFrames
• Supports ANSI SQL
Lazy evaluation in Spark is a feature that defers the execution of operations until they are absolutely
necessary, optimizing performance by reducing the amount of data processing.
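A minimal PySpark sketch of this behavior (the app name and numbers are invented for illustration): the transformations below only build a plan, and nothing executes until the action at the end.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("LazyEvalDemo").getOrCreate()

    df = spark.range(1_000_000)                       # transformation: no job runs yet
    evens = df.filter(df.id % 2 == 0)                 # still no job: the plan just grows
    doubled = evens.selectExpr("id * 2 AS doubled")   # still nothing executed

    print(doubled.count())   # action: Spark optimizes the whole plan and runs it now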
DIFFERENT IMPLEMENTATIONS OF SPARK
• PySpark is the Python API for Apache Spark. It enables you to perform real-time, large-
scale data processing in a distributed environment using Python. It also provides a
PySpark shell for interactively analyzing your data.
• PySpark combines Python’s learnability and ease of use with the power of Apache
Spark to enable processing and analysis of data at any size for everyone familiar with
Python.
• PySpark supports all of Spark’s features such as Spark SQL, DataFrames, Structured
Streaming, Machine Learning (MLlib) and Spark Core.
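As a quick, hedged illustration of the PySpark API (the app name and sample rows are invented), a minimal program looks roughly like this:

    from pyspark.sql import SparkSession

    # Every PySpark program starts from a SparkSession
    spark = SparkSession.builder.appName("IntroToPySpark").getOrCreate()

    # Build a small DataFrame from local Python data
    df = spark.createDataFrame([("Ana", 21), ("Bo", 25)], ["name", "age"])

    df.show()      # display the rows as a table
    spark.stop()   # release cluster resources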
SPARK CORE AND RDDS
• Spark Core is the underlying general execution engine for the Spark platform that all other
functionality is built on top of.
• It provides RDDs (Resilient Distributed Datasets) and in-memory computing capabilities.
• Note that the RDD API is a low-level API which can be difficult to use and you do not get the
benefit of Spark’s automatic query optimization capabilities.
• We recommend using DataFrames instead of RDDs as it allows you to express what you
want more easily and lets Spark automatically construct the most efficient query for you.
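To make that recommendation concrete, here is a hedged word-count sketch in both APIs (the sample lines are invented): the RDD version spells out every map/reduce step, while the DataFrame version declares the result and lets Spark plan the execution.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("RDDvsDataFrame").getOrCreate()
    lines = ["spark makes big data simple", "big data big results"]

    # Low-level RDD API: manual steps, no automatic query optimization
    rdd_counts = (spark.sparkContext.parallelize(lines)
                  .flatMap(lambda line: line.split())
                  .map(lambda word: (word, 1))
                  .reduceByKey(lambda a, b: a + b))
    print(rdd_counts.collect())

    # DataFrame API: declarative, optimized automatically by Spark
    df = spark.createDataFrame([(line,) for line in lines], ["line"])
    (df.select(F.explode(F.split("line", " ")).alias("word"))
       .groupBy("word").count()
       .show())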
RESILIENT DISTRIBUTED DATASETS
• Immutable: RDDs can't be modified once created. Instead, transformations on RDDs create new
RDDs, allowing us to apply a series of transformations to the data.
• Distributed: data is partitioned and processed in parallel. RDDs are distributed across multiple
machines in a cluster. Spark automatically partitions the data and distributes it across the nodes,
enabling parallel processing.
• Resilient: RDDs are designed to recover quickly from failures. Spark keeps track of the lineage of
transformations applied to an RDD, allowing it to recover lost data and continue computations in
case of failures.
• Lazily evaluated: the execution plan is optimized, and transformations are evaluated only when
necessary, meaning they are not executed immediately.
• Fault-tolerant operations: map, filter, reduce, collect, count, save, etc.
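A small sketch of the lineage idea behind resilience (the numbers are invented). toDebugString() is the RDD method for inspecting recorded lineage; in PySpark it returns bytes, hence the decode().

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("LineageDemo").getOrCreate()

    rdd = spark.sparkContext.parallelize(range(10))
    transformed = rdd.map(lambda x: x * x).filter(lambda x: x > 10)

    # The recorded lineage is what allows lost partitions to be recomputed
    print(transformed.toDebugString().decode())

    print(transformed.collect())   # [16, 25, 36, 49, 64, 81]
    print(transformed.count())     # 6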
APACHE SPARK SUPPORTS TWO TYPES OF OPERATIONS

Transformations:
• Create new RDDs from existing RDDs by applying computations/manipulations on the data.
• Lazily evaluated, with a lineage graph: they don't execute immediately; instead they build a
lineage graph to track the applied transformations.
• Examples: map, filter, flatMap, reduceByKey, sortBy, join.
• These transformations allow us to perform computations and manipulations on the data within
RDDs.

Actions:
• Operations that return results or perform actions on an RDD, triggering execution of all
preceding transformations.
• Eagerly evaluated: they compute the result immediately and may involve data movement and
computation across the cluster.
• Examples: collect, count, first, take, save, foreach.
• These actions provide us with final results or allow us to write the data to an external storage
system (see the sketch below).
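A hedged sketch of the two operation types together (the pair data is invented): the transformations only record lineage, and the actions at the end trigger the actual computation.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("TransformVsAction").getOrCreate()
    pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])

    # Transformations: lazily recorded in the lineage graph, nothing runs yet
    summed = pairs.reduceByKey(lambda a, b: a + b)
    filtered = summed.filter(lambda kv: kv[1] > 2)

    # Actions: trigger execution of all preceding transformations
    print(filtered.collect())   # [('a', 4)]
    print(filtered.count())     # 1
    print(summed.take(1))       # one element, e.g. [('a', 4)] or [('b', 2)]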
SPARK DATAFRAMES
• Spark DataFrames: raw data organized into rows and columns, allowing filtering, aggregating,
etc.
• The data contained in a DataFrame is physically located across the multiple nodes of the Spark
cluster, but it appears to be a cohesive unit of data without exposing the complexity of the
underlying operations.
• In simple terms, a DataFrame is like a table in a relational database, where data is organized
into rows and columns.
• DataFrames offer schema information, allowing Spark to optimize the execution of queries and
apply various optimizations.
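A brief DataFrame sketch (the table contents are made up): rows and columns like a relational table, with a schema Spark can use to optimize the query.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("DataFrameDemo").getOrCreate()

    df = spark.createDataFrame(
        [("Ana", "Sales", 50000), ("Bo", "Sales", 60000), ("Cy", "IT", 70000)],
        ["name", "dept", "salary"],
    )

    df.printSchema()   # the schema information Spark optimizes against

    # Filtering and aggregating without managing partitions by hand
    df.filter(df.salary > 55000).groupBy("dept").agg(F.avg("salary")).show()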
DATAFRAME ADVANTAGES OVER RDD
• Use RDDs (Resilient Distributed Datasets) when you need low-level control over data
manipulation, are working with unstructured data, or require custom transformations.
• DataFrames are preferred for structured data, high-level operations, and when you want to
leverage SQL-like syntax and built-in optimizations for better performance.
• In most cases, DataFrames are the recommended choice due to their ease of use and
optimization capabilities.
WHEN TO USE RDDS:
• Unstructured data: Processing text streams, raw logs, or data without a clear schema.
• Custom transformations: When you need fine-grained control over data manipulation
with complex logic.
• Low-level control: Situations where you want to directly manage data partitioning and
operations.
WHEN TO USE DATAFRAMES:
• Structured data: working with data that fits neatly into a relational table with defined columns
and data types.
• SQL-like queries: when you want to use familiar SQL syntax to filter, aggregate, and join data
(see the sketch below).
• Performance optimization: taking advantage of Spark's built-in optimizations for large-scale
data processing.
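A minimal sketch of the SQL-like path (the view name and rows are invented): register the DataFrame as a temporary view, then query it with ordinary SQL.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SqlDemo").getOrCreate()
    df = spark.createDataFrame([("Ana", 21), ("Bo", 25), ("Cy", 30)], ["name", "age"])

    # Expose the DataFrame to SQL under a view name
    df.createOrReplaceTempView("people")

    spark.sql("SELECT COUNT(*) AS adults FROM people WHERE age >= 25").show()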
KEY DIFFERENCES:
• Data structure: RDDs are more flexible, handling any kind of data, while DataFrames require a
defined schema with columns and data types.
• Operations: RDDs use functional programming constructs for transformations, whereas
DataFrames offer a more intuitive, SQL-like interface.
• Optimization: DataFrames benefit from Spark's Catalyst optimizer, enabling significant
performance gains for complex queries, while RDDs have less optimization potential.
BENEFITS OF RDD
• Unstructured and Semi-structured Data Processing: RDDs are flexible enough to handle
various data formats, including JSON, XML, and CSV, making them suitable for processing
unstructured and semi-structured data.
• Streaming Data Processing: With Spark Streaming, you can use RDDs to process real-time
data streams, enabling use cases such as real-time analytics and monitoring.
• Graph Processing: RDDs can be used with GraphX, Spark’s API for graph processing, to
perform operations on large-scale graph data, like social network analysis.
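For the unstructured-data case, a hedged sketch (the log lines are invented): raw text with no schema is straightforward to handle with custom RDD logic.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("RawLogDemo").getOrCreate()

    # Raw log lines with no fixed schema
    logs = spark.sparkContext.parallelize([
        "2024-01-01 ERROR disk full",
        "2024-01-01 INFO job started",
        "2024-01-02 ERROR timeout",
    ])

    # Custom parsing logic expressed directly as RDD transformations
    errors = (logs.filter(lambda line: " ERROR " in line)
                  .map(lambda line: line.split(" ", 2)))
    print(errors.collect())   # [['2024-01-01', 'ERROR', 'disk full'], ['2024-01-02', 'ERROR', 'timeout']]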
SPARKCONTEXT VS SPARKSESSION
SparkContext:
• Core functionality for low-level programming and cluster interaction.
• Creates RDDs (Resilient Distributed Datasets).
• Performs transformations and defines actions.

SparkSession:
• Extends SparkContext functionality.
• Provides higher-level abstractions like DataFrames and Datasets.
• Supports structured querying using SQL or the DataFrames API.
• Provides data source APIs, machine learning algorithms, and streaming capabilities.
• Offers a higher-level API for structured data processing using DataFrames and Datasets.
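A short sketch tying the two together (the app name is invented): SparkSession is the modern entry point, and it still exposes the lower-level SparkContext underneath.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ContextVsSession").getOrCreate()

    # The low-level SparkContext is available from the session
    sc = spark.sparkContext
    rdd = sc.parallelize([1, 2, 3])                    # RDD work via SparkContext
    df = spark.createDataFrame([(1,), (2,)], ["n"])    # structured work via SparkSession

    print(rdd.sum())   # 6
    df.show()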
REFERENCES
• https://spark.apache.org/docs/latest/api/python/index.html
• https://sparkbyexamples.com/
• https://medium.com/@akhilasaineni7/exploring-sparkcontext-and-sparksession-8369e60f658e
• https://youtu.be/aq68N-FH1yY?si=f_SStYHe0-rDUF7N
• https://www.analyticsvidhya.com/blog/2020/11/what-is-the-difference-between-rdds-dataframes-and-datasets/