Cse3002 Big Data m3 Detailed
Spark started in 2009 as a research project in the UC Berkeley RAD Lab, later
to become the AMPLab. The researchers in the lab had previously been working
on Hadoop MapReduce, and observed that MapReduce was inefficient for
iterative and interactive computing jobs.
Thus, from the beginning, Spark was designed to be fast for interactive queries
and iterative algorithms, bringing in ideas like support for in-memory storage
and efficient fault recovery.
Spark Versions and Releases
• Since its creation, Spark has been a very active project and
community, with the number of contributors growing with each
release.
• Spark 1.0 had over 100 individual contributors. Though the level of
activity has rapidly grown, the community continues to release
updated versions of Spark on a regular schedule.
• (Figure: Spark versions and release history)
Storage Layers for Spark
• Spark can create distributed datasets from any file stored in the
Hadoop distributed filesystem (HDFS) or other storage systems
supported by the Hadoop APIs (including your local filesystem,
Amazon S3, Cassandra, Hive, HBase, etc.).
• It’s important to remember that Spark does not require Hadoop; it
simply has support for storage systems implementing the Hadoop
APIs. Spark supports text files, Sequence Files, Avro, Parquet, and any
other Hadoop InputFormat.
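• A minimal sketch of loading data from these storage layers in PySpark; the paths, hostname, and bucket name below are hypothetical, and only the URI scheme changes between systems:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("StorageLayersDemo")
    sc = SparkContext(conf=conf)

    # Local filesystem (hypothetical path)
    local_rdd = sc.textFile("file:///home/user/data/README.md")

    # HDFS (hypothetical namenode host/port and path)
    hdfs_rdd = sc.textFile("hdfs://namenode:9000/data/logs.txt")

    # Amazon S3 (hypothetical bucket; needs the Hadoop S3 connector on the classpath)
    s3_rdd = sc.textFile("s3a://my-bucket/data/events.txt")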
Programming with RDDs
• Transformations construct a new RDD from an existing one, while actions compute a result based on an RDD and either return it to the driver program or save it to an external storage system (e.g., HDFS).
• One example of an action is first(), which returns the first element in an RDD; it is demonstrated in Example 3-3 (a similar sketch appears under “RDD Basics” below).
RDD Basics
• Transformations and actions are different because of the way Spark computes
RDDs.
• Although you can define new RDDs any time, Spark computes them only in a lazy
fashion — that is, the first time they are used in an action.
• This approach makes a lot of sense when you are working with Big Data.
• For instance, in Example 3-2 and Example 3-3, we defined a text file and then
filtered the lines that include Python.
• If Spark were to load and store all the lines in the file as soon as we wrote lines =
sc.textFile(…), it would waste a lot of storage space, given that we then
immediately filter out many lines.
• Instead, once Spark sees the whole chain of transformations, it can compute just
the data needed for its result.
• In fact, for the first() action, Spark scans the file only until it finds the first matching line; it doesn’t even read the whole file.
• Spark’s RDDs are by default recomputed each time you run an action on them.
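• A minimal sketch along the lines of Examples 3-2 and 3-3, which are referenced above but not reproduced on these slides (assuming sc from the pyspark shell or the setup in the earlier sketch):

    lines = sc.textFile("README.md")                            # transformation: nothing is read yet
    pythonLines = lines.filter(lambda line: "Python" in line)   # transformation: still lazy
    print(pythonLines.first())                                  # action: Spark scans the file only until
                                                                # it finds the first matching line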
RDD Basics
• If you would like to reuse an RDD in multiple actions, you can ask Spark to persist it
using RDD.persist().
• In practice, you will often use persist() to load a subset of your data into memory and query it repeatedly.
• For example, if we knew that we wanted to compute multiple results about the
README lines that contain Python, we could write the script as shown below.
• The behavior of not persisting by default makes a lot of sense for big datasets: if you will not
reuse the RDD, there’s no reason to waste storage space when Spark could instead stream
through the data once and just compute the result.
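• A minimal sketch of the script referred to above, assuming the pythonLines RDD from the previous sketch:

    pythonLines.persist()        # ask Spark to keep this RDD around after it is first computed
    print(pythonLines.count())   # first action: computes the RDD and caches it in memory
    print(pythonLines.first())   # second action: reuses the cached data instead of re-reading the file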
RDD Basics
To summarize, every Spark program and shell session will work as follows:
1. Create some input RDDs from external data.
2. Transform them to define new RDDs using transformations like filter().
3. Ask Spark to persist() any intermediate RDDs that will need to be reused.
4. Launch actions such as count() and first() to kick off a parallel
computation, which is then optimized and executed by Spark.
Creating RDDs
• Spark provides two ways to create RDDs: loading an external dataset and parallelizing
a collection in your driver program.
• The simplest way to create RDDs is to take an existing collection in your program and
pass it to SparkContext’s parallelize() method, as shown in Examples 3-5 through 3-7.
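• A minimal sketch of parallelize(); the collection here is made up for illustration (assuming sc from the pyspark shell):

    # Create an RDD from an in-memory collection in the driver program.
    lines = sc.parallelize(["pandas", "i like pandas"])
    print(lines.count())   # 2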
RDD Operations
• As you derive new RDDs from each other using transformations, Spark keeps
track of the set of dependencies between different RDDs, called the lineage
graph.
• It uses this information to compute each RDD on demand and to recover lost data if part of a persistent RDD is lost.
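• As a rough illustration (the log file is hypothetical), building several RDDs from one input creates a lineage graph, and PySpark’s toDebugString() prints the lineage Spark has recorded; the exact output format varies by Spark version:

    inputRDD = sc.textFile("log.txt")                              # hypothetical input file
    errorsRDD = inputRDD.filter(lambda line: "error" in line)      # depends on inputRDD
    warningsRDD = inputRDD.filter(lambda line: "warning" in line)  # also depends on inputRDD
    badLinesRDD = errorsRDD.union(warningsRDD)                     # depends on both

    # Spark uses these recorded dependencies to compute badLinesRDD on demand
    # and to recompute lost partitions if needed.
    print(badLinesRDD.toDebugString())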
Actions
• Actions are the second type of RDD operation.
• They are the operations that return a final value to the driver program or write data
to an external storage system.
• Actions force the evaluation of the transformations required for the RDD they were
called on, since they need to actually produce output.
Actions
• In this example (sketched after this list), we use take() to retrieve a small number of elements in the RDD at the driver program.
• We then iterate over them locally to print out information at the driver.
• RDDs also have a collect() function to retrieve the entire RDD.
• This can be useful if your program filters an RDD down to a very small size and you’d like to deal with it locally.
• The entire dataset must fit in memory on a single machine to use collect() on it, so collect() shouldn’t be used on large datasets.
• In most cases RDDs can’t just be collect()ed to the driver because they are too large.
• In these cases, it’s common to write data out to a distributed storage system such as
HDFS or Amazon S3.
• You can save the contents of an RDD using the saveAsTextFile() action,
saveAsSequenceFile(), or any of a number of actions for various built-in formats.
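• The example the bullets above refer to is not reproduced on these slides; a minimal sketch, with a hypothetical input file and output path (assuming sc from the pyspark shell):

    badLinesRDD = sc.textFile("log.txt").filter(lambda line: "error" in line)

    # take(): bring a small sample back to the driver and inspect it locally.
    for line in badLinesRDD.take(10):
        print(line)

    # collect() would return the entire RDD to the driver; only safe for small results.
    # allBadLines = badLinesRDD.collect()

    # For large results, write to distributed storage instead.
    badLinesRDD.saveAsTextFile("hdfs://namenode:9000/output/bad-lines")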
Passing Functions to Spark
• If the function you pass is a member of an object, or references fields of an object, Spark sends the entire containing object to the worker nodes, which can be much larger than the piece of information you actually need. Instead, just extract the fields you need from your object into a local variable and pass that in (see the sketch below).
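• A minimal sketch of this pattern; the class and field names are made up for illustration:

    class WordFunctions(object):
        def __init__(self, query):
            self.query = query

        def get_matches_whole_object(self, rdd):
            # Problematic: the lambda references self, so Spark serializes the entire object.
            return rdd.filter(lambda line: self.query in line)

        def get_matches_local_copy(self, rdd):
            # Better: copy just the field we need into a local variable and pass that in.
            query = self.query
            return rdd.filter(lambda line: query in line)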
Common Transformations and Actions
Basic RDDs
We will begin by describing what transformations and actions we can perform on all
RDDs regardless of the data.
Element-wise transformations
• The two most common transformations you will likely be using are map() and filter().
• The map() transformation takes in a function and applies it to each element in the
RDD with the result of the function being the new value of each element in the
resulting RDD.
• The filter() transformation takes in a function and returns an RDD that has only the elements that pass the filter() function.
• We can use map() to do any number of things, from fetching the website associated
with each URL in our collection to just squaring the numbers.
• It is useful to note that map()’s return type does not have to be the same as its input type, so if we had an RDD of strings and our map() function were to parse the strings and return a Double, our input RDD type would be RDD[String] and the resulting RDD type would be RDD[Double].
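• A minimal sketch of map() and filter(), squaring numbers as mentioned above (assuming sc from the pyspark shell; the input collections are made up):

    nums = sc.parallelize([1, 2, 3, 4])

    # map(): apply a function to every element; its result becomes the new element.
    squared = nums.map(lambda x: x * x)
    print(squared.collect())   # [1, 4, 9, 16]

    # filter(): keep only the elements for which the function returns True.
    evens = nums.filter(lambda x: x % 2 == 0)
    print(evens.collect())     # [2, 4]

    # map() can change the element type, e.g. an RDD of strings to an RDD of floats.
    doubles = sc.parallelize(["1.5", "2.5"]).map(lambda s: float(s))
    print(doubles.collect())   # [1.5, 2.5]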
Common Transformations and Actions
• Sometimes we want to produce multiple output elements for each input element.
• The operation to do this is called flatMap(). As with map(), the function we provide to
flatMap() is called individually for each element in our input RDD.
• Instead of returning a single element, we return an iterator with our return values.
• Rather than producing an RDD of iterators, we get back an RDD that consists of the
elements from all of the iterators. A simple usage of flatMap() is splitting up an input
string into words, as shown in Examples 3-29 through 3-31.
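• A minimal word-splitting sketch in the spirit of Examples 3-29 through 3-31, which are referenced above but not reproduced here (assuming sc from the pyspark shell):

    lines = sc.parallelize(["hello world", "hi"])
    words = lines.flatMap(lambda line: line.split(" "))
    print(words.first())   # "hello"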
Common Transformations and Actions
• We illustrate the difference between flatMap() and map() in Figure 3-3.
• You can think of flatMap() as “flattening” the iterators returned to it, so that instead of ending up with an RDD of lists we have an RDD of the elements in those lists.
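• As a rough illustration of the difference Figure 3-3 depicts (the figure itself is not reproduced here):

    lines = sc.parallelize(["hello world", "hi"])

    print(lines.map(lambda line: line.split(" ")).collect())
    # [['hello', 'world'], ['hi']]  -> an RDD of lists

    print(lines.flatMap(lambda line: line.split(" ")).collect())
    # ['hello', 'world', 'hi']      -> the lists are "flattened" into their elements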