
Big Data Fundamentals

Unit 3: Data Parallelization

Ronal Muresano, PhD


Unit Content

Distributed Programming
Introduction to Spark
File types (CSV, Parquet) and processing management
Analysing massive data with Spark
Spark Streaming using queues
Installation Steps
https://sparkbyexamples.com/pyspark/install-pyspark-in-anaconda-jupyter-notebook/

Notebook to Test
Parallel Programming
Parallel Processing
Parallel processing is a method in computing in which two or more processors (CPUs) handle separate parts of an overall task. Breaking a task into parts and dividing them among multiple processors helps reduce the time needed to run a program. Any system with more than one CPU can perform parallel processing, as can the multi-core processors commonly found in today's computers.

Distributed System:
A distributed system is one in which components located at networked computers communicate and coordinate their actions only by passing messages.
Centralized vs Distributed System

Centralized:
● One component with non-autonomous parts
● Component shared by users all the time
● All resources accessible
● Software runs in a single process
● Single point of control
● Single point of failure

Distributed:
● Multiple autonomous components
● Components are not shared by all users
● Resources may not be accessible
● Software runs in concurrent processes on different processors
● Multiple points of control
● Multiple points of failure
Big data and parallel programming

• Domain Decomposition: the data associated with a problem is decomposed, and each parallel task then works on a portion of the data, as in the sketch below.
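A minimal sketch of domain decomposition in plain Python (illustrative only; the chunk sizes and the sum_of_squares worker function are assumptions made for the example, not part of the original slides):

from multiprocessing import Pool

def sum_of_squares(chunk):
    # Each worker handles only its own portion of the data
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4

    # Domain decomposition: split the data into one chunk per worker
    chunk_size = len(data) // n_workers
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    with Pool(n_workers) as pool:
        partial_results = pool.map(sum_of_squares, chunks)

    print(sum(partial_results))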
Current hardware for BigData
Spark as a distributed environment

Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine
learning on single-node machines or clusters.

Apache Spark is a computing infrastructure that allows us to load and process big data workloads. It is very fast because it manages data storage and memory dynamically. It is compatible with many databases and can be used for ad hoc, batch, and streaming processing.

Note: Apache Spark should not be confused with the SPARK language, a formally verifiable subset of Ada designed for critical software development; the two are unrelated.
Benefits of Spark

Fast and expressive cluster computing system, interoperable with Apache Hadoop

Improves efficiency through:
● In-memory computing primitives
● General computation graphs

Improves usability through:
● Rich APIs in Scala, Java, Python
● Interactive shell

Spark is not:
● a modified version of Hadoop
● dependent on Hadoop, because it has its own cluster management

Spark uses Hadoop for storage purposes only.


Benefits of Spark

Simplicity: Spark’s capabilities are accessible via a set of rich APIs, all designed specifically for interacting quickly
and easily with data at scale. These APIs are well-documented and structured in a way that makes it straightforward
for data scientists and application developers to quickly put Spark to work.

Speed: Spark is designed for speed, operating both in memory and on disk.

Support: Spark supports a range of programming languages, including Java, Python, R, and Scala. Spark includes
support for tight integration with a number of leading storage solutions in the Hadoop ecosystem and beyond,
including HPE Ezmeral Data Fabric (file system, database, and event store), Apache Hadoop (HDFS), Apache
HBase, and Apache Cassandra

Source: https://developer.hpe.com/blog/spark-101-what-is-it-what-it-does-and-why-it-matters/
Spark Components
Why use Spark
● A Spark application runs as independent processes, coordinated by the SparkSession object in the
driver program.
● The resource or cluster manager assigns tasks to workers, one task per partition.
● A task applies its unit of work to the dataset in its partition and outputs a new partition dataset.
Because iterative algorithms apply operations repeatedly to data, they benefit from caching datasets
across iterations.
● Results are sent back to the driver application or can be saved to disk.
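A small illustration of the caching point above, as a sketch assuming an existing SparkContext `sc` (the dataset and the number of iterations are made up for the example):

# Keep the transformed partitions in memory so iterations reuse them
data = sc.parallelize(range(1_000_000))
cached = data.map(lambda x: x * 2).cache()

# An iterative algorithm reuses the cached dataset instead of recomputing it
for _ in range(10):
    total = cached.reduce(lambda a, b: a + b)   # action: result sent back to the driver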

Spark supports the following resource/cluster managers:

● Spark Standalone – a simple cluster manager included with Spark
● Apache Mesos – a general cluster manager that can also run Hadoop applications
● Apache Hadoop YARN – the resource manager in Hadoop 2
● Kubernetes – an open source system for automating deployment, scaling, and management of containerized applications

Spark also has a local mode, where the driver and executors run as threads on
your computer instead of a cluster, which is useful for developing your
applications from a personal computer.
Source: https://developer.hpe.com/blog/spark-101-what-is-it-what-it-does-and-why-it-matters/
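A minimal sketch of starting Spark in local mode from a notebook (the application name is an illustrative value, not from the slides):

from pyspark.sql import SparkSession

# local[*] runs the driver and executors as threads on this machine,
# using as many worker threads as there are CPU cores
spark = (SparkSession.builder
         .master("local[*]")
         .appName("BigDataFundamentals")   # illustrative application name
         .getOrCreate())

sc = spark.sparkContext   # the underlying SparkContext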
What can Spark do?
Spark is capable of handling several petabytes of data at a time, distributed across a cluster of thousands of cooperating physical or virtual servers. It has an extensive set of developer libraries and APIs and supports languages such as Java, Python, R, and Scala.

Source: https://developer.hpe.com/blog/spark-101-what-is-it-what-it-does-and-why-it-matters/
What can Spark do?

Typical use cases include:

Stream processing: From log files to sensor data, application developers are increasingly having to cope with
"streams" of data. This data arrives in a steady stream, often from multiple sources simultaneously. While it is
certainly feasible to store these data streams on disk and analyze them retrospectively, it can sometimes be sensible
or important to process and act upon the data as it arrives. Streams of data related to financial transactions, for
example, can be processed in real time to identify, and refuse, potentially fraudulent transactions.

Machine learning: As data volumes grow, machine learning approaches become more feasible and increasingly
accurate. Software can be trained to identify and act upon triggers within well-understood data sets before applying
the same solutions to new and unknown data. Spark’s ability to store data in memory and rapidly run repeated
queries makes it a good choice for training machine learning algorithms. Running broadly similar queries again and
again, at scale, significantly reduces the time required to go through a set of possible solutions in order to find the
most efficient algorithms.
What can Spark do?

Interactive analytics: Rather than running pre-defined queries to create static dashboards of sales or production line
productivity or stock prices, business analysts and data scientists want to explore their data by asking a question, viewing
the result, and then either altering the initial question slightly or drilling deeper into results. This interactive query process
requires systems such as Spark that are able to respond and adapt quickly.

Data integration: Data produced by different systems across a business is rarely clean or consistent enough to simply
and easily be combined for reporting or analysis. Extract, transform, and load (ETL) processes are often used to pull data
from different systems, clean and standardize it, and then load it into a separate system for analysis. Spark (and Hadoop)
are increasingly being used to reduce the cost and time required for this ETL process.
Spark API
SparkContext
SparkContext is the entry point to any Spark functionality. When we run a Spark application, a driver program starts; it contains the main function, and the SparkContext is initialized there. The driver program then runs the operations inside the executors on worker nodes.
SparkContext uses Py4J to launch a JVM and creates a JavaSparkContext. By default, PySpark has a SparkContext available as 'sc', so creating a new SparkContext won't work.
Parameters
The following are the parameters of a SparkContext:

● master − the URL of the cluster it connects to.
● appName − the name of your job.
● sparkHome − the Spark installation directory.
● pyFiles − the .zip or .py files to send to the cluster and add to the PYTHONPATH.
● environment − environment variables for the worker nodes.
● batchSize − the number of Python objects represented as a single Java object. Set it to 1 to disable batching, 0 to automatically choose the batch size based on object sizes, or -1 to use an unlimited batch size.
● serializer − the RDD serializer.
● conf − an object of L{SparkConf} to set all the Spark properties.
● gateway − use an existing gateway and JVM, otherwise initialize a new JVM.
● jsc − the JavaSparkContext instance.
● profiler_cls − a class of custom Profiler used to do profiling (the default is pyspark.profiler.BasicProfiler).
SparkContext Example
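A minimal sketch of creating a SparkContext with the master and appName parameters described above (the values and the sample data are illustrative):

from pyspark import SparkContext

# In the PySpark shell a context already exists as `sc`;
# in a standalone script you create it yourself.
sc = SparkContext(master="local[2]", appName="SparkContextExample")

words = sc.parallelize(["spark", "hadoop", "spark", "python"])
print(words.count())   # 4

sc.stop()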
Resilient Distributed Dataset (RDD)

A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations on either data in stable storage or other RDDs. An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.
Spark uses the concept of RDDs to achieve faster and more efficient MapReduce operations.
RDD Benefits

In-Memory Processing: PySpark loads the data from disk, processes it in memory, and keeps the data in memory; this is the main difference between PySpark and MapReduce (which is I/O intensive). In between transformations, we can also cache/persist the RDD in memory to reuse previous computations.

Immutability: RDDs are immutable in nature, meaning that once RDDs are created you cannot modify them. When we apply transformations on an RDD, PySpark creates a new RDD and maintains the RDD lineage.

Fault Tolerance: PySpark operates on fault-tolerant data stores such as HDFS and S3, so if any RDD operation fails, it automatically reloads the data from other partitions. Also, when PySpark applications run on a cluster, task failures are automatically recovered a certain number of times (as per the configuration) and the application finishes seamlessly.

Lazy Evaluation: PySpark does not evaluate RDD transformations as they are encountered by the driver; instead it keeps all transformations as it encounters them (as a DAG) and evaluates them only when it sees the first RDD action.

Partitioning: When you create an RDD from data, PySpark partitions its elements by default. By default it creates as many partitions as there are available cores.
Source: https://sparkbyexamples.com/pyspark-rdd/
Parallelized Collections

Parallelized collections are created by calling SparkContext’s parallelize method on an existing iterable or collection in your
driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For
example, here is how to create a parallelized collection holding the numbers 1 to 5:
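The example referred to above, as a minimal sketch assuming an existing SparkContext `sc`:

data = [1, 2, 3, 4, 5]
dist_data = sc.parallelize(data)

# The distributed dataset can now be operated on in parallel
print(dist_data.reduce(lambda a, b: a + b))   # 15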
Transformation and Action in Spark

Transformation − These are operations applied to an RDD to create a new RDD. filter, groupBy, and map are examples of transformations. Transformations are lazy in nature, i.e., they only execute when we call an action; they are not executed immediately. The two most basic transformations are map() and filter().

Action − These are operations applied to an RDD that instruct Spark to perform the computation and send the result back to the driver.
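A short sketch of the difference, assuming an existing SparkContext `sc` (the numbers are illustrative):

nums = sc.parallelize([1, 2, 3, 4, 5])

# Transformation: lazy, it only describes a new RDD, nothing runs yet
squares = nums.map(lambda x: x * x)

# Actions: trigger the computation and return results to the driver
print(squares.collect())   # [1, 4, 9, 16, 25]
print(squares.count())     # 5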

Transformation

Action
Filter an RDD

Filter:
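A minimal filter sketch (the original slide's code is not in the text; this assumes an existing SparkContext `sc` and made-up log lines):

lines = sc.parallelize([
    "error: disk full",
    "info: job started",
    "error: timeout",
])

# filter() keeps only the elements for which the predicate returns True
errors = lines.filter(lambda line: line.startswith("error"))
print(errors.collect())   # ['error: disk full', 'error: timeout']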
Spark Transformations

Narrow transformation – In a narrow transformation, all the elements required to compute the records in a single partition live in a single partition of the parent RDD. A limited subset of partitions is used to calculate the result. Narrow transformations are the result of map() and filter().

Wide transformation – In a wide transformation, the elements required to compute the records in a single partition may live in many partitions of the parent RDD. Wide transformations are the result of groupByKey() and reduceByKey().

Source: https://data-flair.training/blogs/spark-rdd-operations-transformations-actions/
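A sketch contrasting the two kinds of transformations, assuming an existing SparkContext `sc` (the key-value pairs are made up):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# Narrow: map() works independently within each partition, no data movement
upper = pairs.map(lambda kv: (kv[0].upper(), kv[1]))
print(upper.collect())   # [('A', 1), ('B', 2), ('A', 3), ('B', 4)]

# Wide: reduceByKey() shuffles values with the same key to the same partition
sums = pairs.reduceByKey(lambda a, b: a + b)
print(sums.collect())    # e.g. [('a', 4), ('b', 6)]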
Read a file from Local System
Spark Reading Operations

SparkContext (sc) is the main entry point to Spark functionality:

> sc.parallelize([1, 2, 3])
# Load a text file from the local FS, HDFS, or S3
> sc.textFile("file.txt")
> sc.textFile("directory/*.txt")
> sc.textFile("hdfs://namenode:9000/path/file")
Spark Reading/Writing Operations
Sample files: https://filesamples.com/formats/txt
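A sketch of reading and writing the file types mentioned in the unit content (CSV and Parquet); the paths are illustrative and `spark` is assumed to be an existing SparkSession:

# Read a CSV file into a DataFrame, inferring column types from the data
df = spark.read.csv("data/sample.csv", header=True, inferSchema=True)

# Write it out as Parquet, a compressed columnar format
df.write.mode("overwrite").parquet("data/sample_parquet")

# Parquet files keep their schema, so they can be read back directly
df2 = spark.read.parquet("data/sample_parquet")
df2.show(5)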

Inserting into HDFS

Spark Operations

Copy from HDFS to the Local System
Some Spark Operations
Key-Value Operations: groupByKey
Key-Value Operations: reduceByKey
Key-Value Operations: sortByKey
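A combined sketch of the three operations above, assuming an existing SparkContext `sc` (the pairs are illustrative):

sales = sc.parallelize([("pen", 2), ("book", 5), ("pen", 3), ("book", 1)])

# groupByKey: gather all values for each key (shuffles every value)
print(sales.groupByKey().mapValues(list).collect())

# reduceByKey: combine values per key, pre-aggregating within each partition
print(sales.reduceByKey(lambda a, b: a + b).collect())   # e.g. [('pen', 5), ('book', 6)]

# sortByKey: return the pairs ordered by key ("book" pairs before "pen" pairs)
print(sales.sortByKey().collect())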
Example: Word Count
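The classic word-count pipeline, as a sketch (the input file name is an assumption; `sc` is an existing SparkContext):

lines = sc.textFile("file.txt")   # hypothetical input file

counts = (lines.flatMap(lambda line: line.split())   # split each line into words
               .map(lambda word: (word, 1))          # pair each word with a count of 1
               .reduceByKey(lambda a, b: a + b))     # sum the counts per word

for word, count in counts.collect():
    print(word, count)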
Other Key-Value Operations
Big Data as a Service: A New Approach for SMEs

Thank you very much for your attention

Ronal Muresano
[email protected]
[email protected]
