
Big Data Fundamentals

Unit 3: Data Parallelization

Ronal Muresano, PhD


Unit Content

Distributed Programming
Introduction to Spark
File types (CSV, Parquet) and processing management
Analysing massive data with Spark
Spark Streaming using queues
Installation Steps
https://sparkbyexamples.com/pyspark/install-pyspark-in-anaconda-jupyter-notebook/

Notebook to Test
Parallel Programming
Parallel Processing
Parallel processing is a method in computing in which two or more processors (CPUs) handle separate parts of an overall task. Breaking a task into parts and dividing them among multiple processors helps reduce the time needed to run a program. Any system with more than one CPU can perform parallel processing, as can the multi-core processors commonly found in today's computers.

Distributed System:
A distributed system is one in which components located at networked computers communicate and coordinate their actions only by passing messages.
Centralized vs Distributed System

Centralized:
● One component with non-autonomous parts
● Component shared by users all the time
● All resources accessible
● Software runs in a single process
● Single point of control
● Single point of failure

Distributed:
● Multiple autonomous components
● Components are not shared by all users
● Resources may not be accessible
● Software runs in concurrent processes on different processors
● Multiple points of control
● Multiple points of failure
Big data and parallel programming

• Domain Decomposition: the data associated with a problem is decomposed, and each parallel task then works on a portion of the data, as in the sketch below.
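A minimal sketch of domain decomposition in plain Python (illustrative only; the chunk sizes and the sum_of_squares worker function are assumptions made for the example, not part of the original slides):

from multiprocessing import Pool

def sum_of_squares(chunk):
    # Each worker handles only its own portion of the data
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4

    # Domain decomposition: split the data into one chunk per worker
    chunk_size = len(data) // n_workers
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    with Pool(n_workers) as pool:
        partial_results = pool.map(sum_of_squares, chunks)

    print(sum(partial_results))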
Current hardware for BigData
Spark as a distributed environment

Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine
learning on single-node machines or clusters.

Apache Spark is a computing infrastructure that allows us to load and process big data workloads. It is very fast because it manages data storage and memory dynamically. It is compatible with many databases and can be used for ad hoc, batch, and streaming processing.

Note: Apache Spark should not be confused with the SPARK language, a formally verifiable subset of Ada designed for critical software development; the two are unrelated.
Benefits of Spark

Fast and expressive cluster computing system, interoperable with Apache Hadoop

Improves efficiency through:
● In-memory computing primitives
● General computation graphs

Improves usability through:
● Rich APIs in Scala, Java, Python
● Interactive shell

Spark is not:
● a modified version of Hadoop
● dependent on Hadoop, because it has its own cluster management

Spark uses Hadoop for storage purposes only.


Benefits of Spark

Simplicity: Spark’s capabilities are accessible via a set of rich APIs, all designed specifically for interacting quickly
and easily with data at scale. These APIs are well-documented and structured in a way that makes it straightforward
for data scientists and application developers to quickly put Spark to work.

Speed: Spark is designed for speed, operating both in memory and on disk.

Support: Spark supports a range of programming languages, including Java, Python, R, and Scala. Spark includes
support for tight integration with a number of leading storage solutions in the Hadoop ecosystem and beyond,
including HPE Ezmeral Data Fabric (file system, database, and event store), Apache Hadoop (HDFS), Apache
HBase, and Apache Cassandra

Source: https://developer.hpe.com/blog/spark-101-what-is-it-what-it-does-and-why-it-matters/
Spark Components
Why use Spark
● A Spark application runs as independent processes, coordinated by the SparkSession object in the
driver program.
● The resource or cluster manager assigns tasks to workers, one task per partition.
● A task applies its unit of work to the dataset in its partition and outputs a new partition dataset.
Because iterative algorithms apply operations repeatedly to data, they benefit from caching datasets
across iterations.
● Results are sent back to the driver application or can be saved to disk.
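A small illustration of the caching point above, as a sketch assuming an existing SparkContext `sc` (the dataset and the number of iterations are made up for the example):

# Keep the transformed partitions in memory so iterations reuse them
data = sc.parallelize(range(1_000_000))
cached = data.map(lambda x: x * 2).cache()

# An iterative algorithm reuses the cached dataset instead of recomputing it
for _ in range(10):
    total = cached.reduce(lambda a, b: a + b)   # action: result sent back to the driver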

Spark supports the following resource/cluster managers:

● Spark Standalone – a simple cluster manager included with Spark
● Apache Mesos – a general cluster manager that can also run Hadoop applications
● Apache Hadoop YARN – the resource manager in Hadoop 2
● Kubernetes – an open source system for automating deployment, scaling, and management of containerized applications

Spark also has a local mode, where the driver and executors run as threads on
your computer instead of a cluster, which is useful for developing your
applications from a personal computer.
Source: https://developer.hpe.com/blog/spark-101-what-is-it-what-it-does-and-why-it-matters/
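A minimal sketch of starting Spark in local mode from a notebook (the application name is an illustrative value, not from the slides):

from pyspark.sql import SparkSession

# local[*] runs the driver and executors as threads on this machine,
# using as many worker threads as there are CPU cores
spark = (SparkSession.builder
         .master("local[*]")
         .appName("BigDataFundamentals")   # illustrative application name
         .getOrCreate())

sc = spark.sparkContext   # the underlying SparkContext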
What can Spark do?
Spark is capable of handling several petabytes of data at a time, distributed across a cluster of thousands of cooperating physical or virtual servers. It has an extensive set of developer libraries and APIs and supports languages such as Java, Python, R, and Scala.

Source: https://developer.hpe.com/blog/spark-101-what-is-it-what-it-does-and-why-it-matters/
What can Spark do?

Typical use cases include:

Stream processing: From log files to sensor data, application developers are increasingly having to cope with
"streams" of data. This data arrives in a steady stream, often from multiple sources simultaneously. While it is
certainly feasible to store these data streams on disk and analyze them retrospectively, it can sometimes be sensible
or important to process and act upon the data as it arrives. Streams of data related to financial transactions, for
example, can be processed in real time to identify, and refuse, potentially fraudulent transactions.

Machine learning: As data volumes grow, machine learning approaches become more feasible and increasingly
accurate. Software can be trained to identify and act upon triggers within well-understood data sets before applying
the same solutions to new and unknown data. Spark’s ability to store data in memory and rapidly run repeated
queries makes it a good choice for training machine learning algorithms. Running broadly similar queries again and
again, at scale, significantly reduces the time required to go through a set of possible solutions in order to find the
most efficient algorithms.
What can Spark do?

Interactive analytics: Rather than running pre-defined queries to create static dashboards of sales or production line
productivity or stock prices, business analysts and data scientists want to explore their data by asking a question, viewing
the result, and then either altering the initial question slightly or drilling deeper into results. This interactive query process
requires systems such as Spark that are able to respond and adapt quickly.

Data integration: Data produced by different systems across a business is rarely clean or consistent enough to simply
and easily be combined for reporting or analysis. Extract, transform, and load (ETL) processes are often used to pull data
from different systems, clean and standardize it, and then load it into a separate system for analysis. Spark (and Hadoop)
are increasingly being used to reduce the cost and time required for this ETL process.
Spark API
SparkContext
SparkContext is the entry point to any Spark functionality. When we run a Spark application, a driver program starts; it contains the main function, and the SparkContext is initialized there. The driver program then runs the operations inside the executors on worker nodes.
SparkContext uses Py4J to launch a JVM and creates a JavaSparkContext. By default, PySpark has a SparkContext available as 'sc', so creating a new SparkContext won't work.
Parameters
The following are the parameters of a SparkContext:

● master − the URL of the cluster it connects to.
● appName − the name of your job.
● sparkHome − the Spark installation directory.
● pyFiles − the .zip or .py files to send to the cluster and add to the PYTHONPATH.
● environment − environment variables for the worker nodes.
● batchSize − the number of Python objects represented as a single Java object. Set it to 1 to disable batching, 0 to automatically choose the batch size based on object sizes, or -1 to use an unlimited batch size.
● serializer − the RDD serializer.
● conf − an object of L{SparkConf} to set all the Spark properties.
● gateway − use an existing gateway and JVM, otherwise initialize a new JVM.
● jsc − the JavaSparkContext instance.
● profiler_cls − a class of custom Profiler used to do profiling (the default is pyspark.profiler.BasicProfiler).
SparkContext Example
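A minimal sketch of creating a SparkContext with the master and appName parameters described above (the values and the sample data are illustrative):

from pyspark import SparkContext

# In the PySpark shell a context already exists as `sc`;
# in a standalone script you create it yourself.
sc = SparkContext(master="local[2]", appName="SparkContextExample")

words = sc.parallelize(["spark", "hadoop", "spark", "python"])
print(words.count())   # 4

sc.stop()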
Resilient Distributed Dataset (RDD)

A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations on either data in stable storage or other RDDs. An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.
Spark uses the concept of RDDs to achieve faster and more efficient MapReduce operations.
RDD Benefits

In-Memory Processing: PySpark loads the data from disk, processes it in memory, and keeps the data in memory; this is the main difference between PySpark and MapReduce (which is I/O intensive). In between transformations, we can also cache/persist the RDD in memory to reuse previous computations.

Immutability: RDDs are immutable in nature, meaning that once RDDs are created you cannot modify them. When we apply transformations on an RDD, PySpark creates a new RDD and maintains the RDD lineage.

Fault Tolerance: PySpark operates on fault-tolerant data stores such as HDFS and S3, so if any RDD operation fails, it automatically reloads the data from other partitions. Also, when PySpark applications run on a cluster, task failures are automatically recovered a certain number of times (as per the configuration) and the application finishes seamlessly.

Lazy Evaluation: PySpark does not evaluate RDD transformations as they are encountered by the driver; instead it keeps all transformations as it encounters them (as a DAG) and evaluates them only when it sees the first RDD action.

Partitioning: When you create an RDD from data, PySpark partitions its elements by default. By default it creates as many partitions as there are available cores.
Source: https://sparkbyexamples.com/pyspark-rdd/
Parallelized Collections

Parallelized collections are created by calling SparkContext’s parallelize method on an existing iterable or collection in your
driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For
example, here is how to create a parallelized collection holding the numbers 1 to 5:
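The example referred to above, as a minimal sketch assuming an existing SparkContext `sc`:

data = [1, 2, 3, 4, 5]
dist_data = sc.parallelize(data)

# The distributed dataset can now be operated on in parallel
print(dist_data.reduce(lambda a, b: a + b))   # 15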
Transformation and Action in Spark

Transformation − These are operations applied to an RDD to create a new RDD. filter, groupBy, and map are examples of transformations. Transformations are lazy in nature, i.e., they only execute when we call an action; they are not executed immediately. The two most basic transformations are map() and filter().

Action − These are operations applied to an RDD that instruct Spark to perform the computation and send the result back to the driver.
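A short sketch of the difference, assuming an existing SparkContext `sc` (the numbers are illustrative):

nums = sc.parallelize([1, 2, 3, 4, 5])

# Transformation: lazy, it only describes a new RDD, nothing runs yet
squares = nums.map(lambda x: x * x)

# Actions: trigger the computation and return results to the driver
print(squares.collect())   # [1, 4, 9, 16, 25]
print(squares.count())     # 5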

Transformation

Action
Filter an RDD

Filter:
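A minimal filter sketch (the original slide's code is not in the text; this assumes an existing SparkContext `sc` and made-up log lines):

lines = sc.parallelize([
    "error: disk full",
    "info: job started",
    "error: timeout",
])

# filter() keeps only the elements for which the predicate returns True
errors = lines.filter(lambda line: line.startswith("error"))
print(errors.collect())   # ['error: disk full', 'error: timeout']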
Spark Transformations

Narrow transformation – In a narrow transformation, all the elements required to compute the records in a single partition live in a single partition of the parent RDD. A limited subset of partitions is used to calculate the result. Narrow transformations are the result of map() and filter().

Wide transformation – In a wide transformation, the elements required to compute the records in a single partition may live in many partitions of the parent RDD. Wide transformations are the result of groupByKey() and reduceByKey().

Source: https://data-flair.training/blogs/spark-rdd-operations-transformations-actions/
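A sketch contrasting the two kinds of transformations, assuming an existing SparkContext `sc` (the key-value pairs are made up):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# Narrow: map() works independently within each partition, no data movement
upper = pairs.map(lambda kv: (kv[0].upper(), kv[1]))
print(upper.collect())   # [('A', 1), ('B', 2), ('A', 3), ('B', 4)]

# Wide: reduceByKey() shuffles values with the same key to the same partition
sums = pairs.reduceByKey(lambda a, b: a + b)
print(sums.collect())    # e.g. [('a', 4), ('b', 6)]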
Read a file from Local System
Spark Reading Operations

SparkContext (sc) is the main entry point to Spark functionality:

> sc.parallelize([1, 2, 3])
# Load a text file from the local FS, HDFS, or S3
> sc.textFile("file.txt")
> sc.textFile("directory/*.txt")
> sc.textFile("hdfs://namenode:9000/path/file")
Spark Reading/Writing Operations
Sample files: https://filesamples.com/formats/txt
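A sketch of reading and writing the file types mentioned in the unit content (CSV and Parquet); the paths are illustrative and `spark` is assumed to be an existing SparkSession:

# Read a CSV file into a DataFrame, inferring column types from the data
df = spark.read.csv("data/sample.csv", header=True, inferSchema=True)

# Write it out as Parquet, a compressed columnar format
df.write.mode("overwrite").parquet("data/sample_parquet")

# Parquet files keep their schema, so they can be read back directly
df2 = spark.read.parquet("data/sample_parquet")
df2.show(5)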

Inserting into HDFS

Spark Operations

Copy from HDFS to the Local System
Some Spark Operations
Key-Value Operations: groupByKey
Key-Value Operations: reduceByKey
Key-Value Operations: sortByKey
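A combined sketch of the three operations above, assuming an existing SparkContext `sc` (the pairs are illustrative):

sales = sc.parallelize([("pen", 2), ("book", 5), ("pen", 3), ("book", 1)])

# groupByKey: gather all values for each key (shuffles every value)
print(sales.groupByKey().mapValues(list).collect())

# reduceByKey: combine values per key, pre-aggregating within each partition
print(sales.reduceByKey(lambda a, b: a + b).collect())   # e.g. [('pen', 5), ('book', 6)]

# sortByKey: return the pairs ordered by key ("book" pairs before "pen" pairs)
print(sales.sortByKey().collect())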
Example: Word Count
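The classic word-count pipeline, as a sketch (the input file name is an assumption; `sc` is an existing SparkContext):

lines = sc.textFile("file.txt")   # hypothetical input file

counts = (lines.flatMap(lambda line: line.split())   # split each line into words
               .map(lambda word: (word, 1))          # pair each word with a count of 1
               .reduceByKey(lambda a, b: a + b))     # sum the counts per word

for word, count in counts.collect():
    print(word, count)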
Other Key-Value Operations
Big Data as a Service: A New Approach for SMEs

Thank you very much for your attention

Ronal Muresano
[email protected]
[email protected]
