CISD 42: Introduction to Spark, Spark Transformations, and Spark Actions

Apache Spark is an open-source distributed computing system designed for big data processing, offering high-speed, in-memory capabilities for various data tasks including machine learning and real-time processing. It consists of several modules such as Spark SQL, Spark Streaming, and MLlib, and operates on a master-slave architecture with features like fault tolerance and lazy evaluation. Spark provides two main data structures, RDDs and DataFrames, each with distinct use cases, advantages, and limitations, with DataFrames generally preferred for structured data due to their optimization capabilities.


INTRODUCTION TO SPARK

SPARK TRANSFORMATION
SPARK ACTIONS
WHAT IS APACHE SPARK?

• Open-source distributed computing system for big data processing
• Unified analytics engine for processing large-scale datasets
• Offers high-speed, in-memory processing capabilities
• Handles complex data processing tasks: data transformation, machine learning, graph processing, and real-time stream processing
• Used for large-scale data analytics, machine learning, and AI applications
• Known for its speed, flexibility, in-memory computing, real-time processing, and richer analytics
SPARK MODULES

• Spark Core
• Spark SQL
• Spark Streaming
• Spark MLlib
• Spark GraphX
SPARK MODULES (CON’T)
• These modules enhance the capabilities of Apache Spark, an open-source technology for big
data processing:
• Spark SQL: An API that converts SQL queries and actions into Spark tasks. It provides a
programming abstraction called DataFrames and can run Hadoop Hive queries faster.
• Spark Streaming: A solution for processing live data streams.
• MLlib and ML: Machine learning modules for designing pipelines used for feature engineering
and algorithm training.
• GraphX: A graph processing solution that extends RDDs into resilient distributed property graphs (RDPGs).
SPARK ECOSYSTEM
APACHE SPARK ARCHITECTURE

• Spark works in a master-slave architecture where the master is called the “Driver” and the slaves are called “Workers”.
• When you run a Spark application, the Spark Driver creates a context that is the entry point to your application; all operations (transformations and actions) are executed on worker nodes, and the resources are managed by the Cluster Manager.
FEATURES OF APACHE SPARK

• In-memory computation
• Distributed processing using parallelize()
• Can be used with many cluster managers (Spark Standalone, YARN, Mesos, etc.)
• Fault-tolerant
• Immutable
• Lazy evaluation
• Cache & persistence
• Built-in optimization when using DataFrames
• Supports ANSI SQL

Lazy evaluation in Spark is a feature that defers the execution of operations until they are absolutely necessary, optimizing performance by reducing the amount of data processed.
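
A minimal sketch of lazy evaluation (the file path and column name are made-up placeholders): the transformations below only record a plan, and nothing runs until an action is called.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LazyEvalDemo").getOrCreate()

# Reading and filtering are transformations: Spark only records the plan here.
logs = spark.read.json("events.json")                 # hypothetical input file
errors = logs.filter(logs["level"] == "ERROR")        # nothing executed yet

# count() is an action: it triggers the optimized plan and actually reads the data.
print(errors.count())

spark.stop()
```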
DIFFERENT IMPLEMENTATIONS OF SPARK.

• Spark – default interface for Scala and Java
• PySpark – Python interface for Spark
• sparklyr – R interface for Spark
PYSPARK

• PySpark is the Python API for Apache Spark. It enables you to perform real-time, large-scale data processing in a distributed environment using Python. It also provides a PySpark shell for interactively analyzing your data.
• PySpark combines Python’s learnability and ease of use with the power of Apache
Spark to enable processing and analysis of data at any size for everyone familiar with
Python.
• PySpark supports all of Spark’s features such as Spark SQL, DataFrames, Structured
Streaming, Machine Learning (MLlib) and Spark Core.
SPARK CORE AND RDDS

• Spark Core is the underlying general execution engine for the Spark platform that all other
functionality is built on top of.
• It provides RDDs (Resilient Distributed Datasets) and in-memory computing capabilities.
• Note that the RDD API is a low-level API which can be difficult to use and you do not get the
benefit of Spark’s automatic query optimization capabilities.
• We recommend using DataFrames instead of RDDs, as they let you express what you want more easily and let Spark automatically construct the most efficient query for you.

RESILIENT DISTRIBUTED DATASETS

• Commonly known as RDDs
• The primary data structure in Spark
• Distributed, fault-tolerant, and parallelizable data structure
• Efficiently processes large datasets across a cluster
• RDDs that are stored across nodes can be accessed in parallel
RDD CHARACTERISTICS:

• Immutable: RDDs can’t be modified once created. Instead, transformations on RDDs create new RDDs, allowing us to apply a series of transformations to the data.
• Distributed: data is partitioned and processed in parallel. RDDs are distributed across multiple machines in a cluster. Spark automatically partitions the data and distributes it across the nodes, enabling parallel processing.
• Resilient: RDDs are designed to recover from failures. Spark keeps track of the lineage of transformations applied to an RDD, allowing it to recover lost data and continue computations when executor nodes fail.
• Lazily evaluated: the execution plan is optimized, and transformations are evaluated only when necessary rather than executed immediately.
• Fault-tolerant operations: map, filter, reduce, collect, count, save, etc.
APACHE SPARK SUPPORTS TWO TYPES OF OPERATIONS

Transformations
• Create new RDDs from existing RDDs by applying computation/manipulation on the data.
• Lazily evaluated (lineage graph): they don’t execute immediately; instead they build a lineage graph to track the applied transformations.
• Examples: map, filter, flatMap, reduceByKey, sortBy, join.
• These transformations allow us to perform computations and manipulations on the data within RDDs.

Actions
• Operations that return results or perform actions on an RDD, triggering execution of all preceding transformations.
• Eagerly evaluated: they compute the result immediately and may involve data movement and computation across the cluster.
• Examples: collect, count, first, take, save, foreach.
• These actions provide us with final results or allow us to write the data to an external storage system.
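
A small, hedged sketch of transformations followed by actions on an RDD (the word list and app name are invented for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDDemo").getOrCreate()
sc = spark.sparkContext

# parallelize() distributes a local collection as an RDD.
words = sc.parallelize(["spark", "rdd", "spark", "action", "rdd", "spark"])

# Transformations: lazily build a lineage graph; nothing runs yet.
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# Actions: trigger execution of the whole lineage.
print(counts.collect())   # e.g. [('spark', 3), ('rdd', 2), ('action', 1)]
print(counts.count())     # number of distinct words

spark.stop()
```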
SPARK DATAFRAMES

• Spark DataFrames: raw data organized into rows and columns, allowing filtering, aggregating,
etc.
• The data contained in a DataFrame is physically located across multiple nodes of the Spark cluster, but it appears as a cohesive unit of data without exposing the complexity of the underlying operations.
• In simple terms, a DataFrame is like a table in a relational database, where data is organized
into rows and columns.
• DataFrames offer schema information, allowing Spark to optimize the execution of queries and
apply various optimizations.
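
A hedged sketch of creating a DataFrame and applying filtering and aggregation (the column names and rows are made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DataFrameDemo").getOrCreate()

# Build a small DataFrame from in-memory rows; the schema gives Spark column names and types.
people = spark.createDataFrame(
    [("Alice", "Sales", 3000), ("Bob", "Sales", 4000), ("Cara", "HR", 3500)],
    ["name", "dept", "salary"],
)

# Filter and aggregate with the SQL-like DataFrame API.
(people
    .filter(F.col("salary") > 3000)
    .groupBy("dept")
    .agg(F.avg("salary").alias("avg_salary"))
    .show())

spark.stop()
```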
DATAFRAME ADVANTAGES OVER RDD

• Optimized Execution: DataFrames provide Spark with schema information, enabling it to optimize the execution of queries and perform predicate pushdown, leading to faster and more efficient processing.
• Ease of Use: DataFrames offer a high-level, SQL-like interface, making it easier for developers to interact with data compared to the more complex RDD transformations and actions.
• Integration with Ecosystem: DataFrames seamlessly integrate with Spark's ecosystem, including Spark SQL, MLlib, and GraphX, enabling users to leverage various libraries and functionalities.
• Built-in Optimization: DataFrames leverage Spark's Catalyst optimizer, which performs advanced optimizations such as predicate pushdown and constant folding.
• Interoperability: DataFrames can be easily converted to and from other data formats, such as Pandas DataFrames in Python, enabling seamless integration with other data processing tools.
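
As a brief illustration of that interoperability (assuming pandas is available in the environment; the data is invented), a Spark DataFrame can be converted to and from a pandas DataFrame:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PandasInterop").getOrCreate()

# pandas -> Spark: distribute a local pandas DataFrame across the cluster.
pdf = pd.DataFrame({"name": ["Alice", "Bob"], "score": [91, 84]})
sdf = spark.createDataFrame(pdf)

# Spark -> pandas: collect a (small!) result back to the driver for local tools.
local = sdf.filter(sdf["score"] > 85).toPandas()
print(local)

spark.stop()
```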
WHEN TO USE RDDS

• Use RDDs (Resilient Distributed Datasets) when you need low-level control over data manipulation, are working with unstructured data, or require custom transformations.
• Prefer DataFrames for structured data, high-level operations, and when you want to leverage SQL-like syntax and built-in optimizations for better performance.
• In most cases, DataFrames are the recommended choice due to their ease of use and optimization capabilities.
WHEN TO USE RDDS:

• Unstructured data: Processing text streams, raw logs, or data without a clear schema.
• Custom transformations: When you need fine-grained control over data manipulation
with complex logic.
• Low-level control: Situations where you want to directly manage data partitioning and
operations.
WHEN TO USE DATAFRAMES:

• Structured data:
• Working with data that fits neatly into a relational table with defined columns and data
types.

• SQL-like queries:
• When you want to use familiar SQL syntax to filter, aggregate, and join data.

• Performance optimization:
• Taking advantage of Spark's built-in optimizations for large-scale data processing.
KEY DIFFERENCES:

• Data Structure:
• RDDs are more flexible, handling any kind of data, while DataFrames require a defined
schema with columns and data types.
• Operations:
• RDDs use functional programming constructs for transformations, whereas DataFrames offer
a more intuitive, SQL-like interface.
• Optimization:
• DataFrames benefit from Spark's Catalyst optimizer, enabling significant performance gains
for complex queries, while RDDs have less optimization potential.
BENEFITS OF RDD

• Low-level Transformation Control: RDDs provide fine-grained control over data transformations, allowing for complex custom processing.
• Fault-tolerance: RDDs are inherently fault-tolerant, automatically recovering from node failures through lineage information.
• Immutability: RDDs are immutable, which ensures consistency and simplifies debugging.
• Flexibility: Suitable for handling unstructured and semi-structured data, offering versatility in processing diverse data types.
LIMITATIONS OF RDD

• Lack of Optimization: RDDs lack built-in optimization mechanisms, requiring developers to optimize code for performance manually.
• Complex API: The low-level API can be more complicated and cumbersome, making development and debugging more challenging.
• No Schema Enforcement: RDDs do not enforce schemas, which can lead to potential data consistency issues and complicate the processing of structured data.
USE CASES OF RDD
• Iterative Machine Learning Algorithms: RDDs are particularly useful for algorithms that
require multiple passes over the data, such as K-means clustering or logistic regression.
Their immutability and fault tolerance are beneficial in these scenarios.
• Data Pipeline Construction: RDDs can be used to build complex data pipelines where data
undergoes several stages of transformations and actions, such as ETL (Extract, Transform,
Load) processes.
• Interactive Data Analysis: RDDs are suitable for interactive analysis of large datasets due
to their in-memory processing capabilities. They allow users to prototype and test their data
analysis workflows quickly.
USE CASES OF RDD (CON’T)

• Unstructured and Semi-structured Data Processing: RDDs are flexible enough to handle
various data formats, including JSON, XML, and CSV, making them suitable for processing
unstructured and semi-structured data.
• Streaming Data Processing: With Spark Streaming, you can use RDDs to process real-time
data streams, enabling use cases such as real-time analytics and monitoring.
• Graph Processing: RDDs can be used with GraphX, Spark’s API for graph processing, to
perform operations on large-scale graph data, like social network analysis.
SPARKCONTEXT VS SPARKSESSION

SparkContext
• Represents the connection to a Spark cluster
• Coordinates task execution across the cluster
• Entry point in earlier versions of Spark (1.x)

SparkSession
• Introduced in Spark 2.0
• Unified entry point for interacting with Spark
• Combines the functionalities of SparkContext, SQLContext, HiveContext, and StreamingContext
• Supports multiple programming languages (Scala, Java, Python, R)
• Seamlessly integrates various Spark features
FUNCTIONALITY DIFFERENCES BETWEEN SPARKCONTEXT AND SPARKSESSION

SparkContext
• Core functionality for low-level programming and cluster interaction
• Creates RDDs (Resilient Distributed Datasets)
• Performs transformations and defines actions

SparkSession
• Extends SparkContext functionality
• Higher-level abstractions like DataFrames and Datasets
• Supports structured querying using SQL or the DataFrame API
• Provides data source APIs, machine learning algorithms, and streaming capabilities
• Offers a higher-level API for structured data processing using DataFrames and Datasets
SPARKCONTEXT VS SPARKSESSION (CON’T)

• SparkContext and SparkSession have different purposes and functionality
• SparkContext is low-level, while SparkSession is higher-level
• SparkSession simplifies interaction and supports structured data processing
• SparkContext is still supported for backward compatibility
• SparkSession is the recommended entry point for Spark applications
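
A brief, hedged sketch of the two entry points (the app name is arbitrary): in modern code you create a SparkSession and reach the underlying SparkContext through it.

```python
from pyspark.sql import SparkSession

# SparkSession: the unified, recommended entry point (Spark 2.0+).
spark = SparkSession.builder.appName("EntryPointDemo").getOrCreate()

# The low-level SparkContext is still available via the session for RDD work.
sc = spark.sparkContext
rdd = sc.parallelize(range(10))
print(rdd.sum())                      # RDD (SparkContext-level) API

# Structured processing goes through the session itself.
df = spark.range(10)
df.selectExpr("sum(id)").show()       # DataFrame (SparkSession-level) API

spark.stop()
```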
REFERENCES

• https://spark.apache.org/docs/latest/api/python/index.html
• https://sparkbyexamples.com/
• https://medium.com/@akhilasaineni7/exploring-sparkcontext-and-sparksession-8369e60f658e
• https://youtu.be/aq68N-FH1yY?si=f_SStYHe0-rDUF7N
• https://www.analyticsvidhya.com/blog/2020/11/what-is-the-difference-between-rdds-dataframes-and-datasets/
