

Big Data with Apache Spark and Python: From Zero to Expert

Introduction to Apache Spark


Apache Spark

Spark is an open source Big Data solution, developed in 2009 at UC Berkeley's RAD Lab (which later became the AMPLab). It has become one of the most widely used environments in Big Data.

Apache Spark vs MapReduce

Spark is easier to use and faster than Hadoop MapReduce.

Differences:
• Spark is faster because it processes data in RAM (in memory), while Hadoop MapReduce reads and writes files to HDFS (on disk)
• Spark is optimized for better parallelism, CPU utilization, and faster startup
• Spark has a richer functional programming model
• Spark is especially useful for iterative algorithms


How Spark works in a cluster

• A Spark application runs as a set of independent processes, coordinated by the SparkSession object in the driver program.
• The resource or cluster manager assigns tasks to workers, one task per partition.
• A task applies its unit of work to the dataset in its partition and outputs a new partition dataset. Because iterative algorithms apply operations repeatedly to data, they benefit from caching datasets across iterations.
• Results are sent back to the driver application or can be saved to disk.

Spark Components

Spark provides a very complete ecosystem of tools.

• Core: Contains the basic functionality of Spark. It is also home to the API that defines RDDs.
• SQL: Package for working with structured data. It allows querying data via SQL or Hive and supports various data sources.
• Streaming: Enables processing of live streams of data. Spark Streaming provides an API for manipulating data streams that is similar to Spark Core's RDD API.
• MLlib: Provides multiple types of machine learning algorithms, such as classification, regression, clustering, etc.
• GraphX: Library for manipulating graphs and performing graph-parallel computations.


PySpark

PySpark is the open source Python API for Apache Spark, a distributed computing framework for Big Data processing. Advantages of PySpark:
• Easy to learn
• Extensive set of libraries for Machine Learning and Data Science
• Great support from the community

PySpark Architecture

Apache Spark works on a master-slave architecture. Operations are executed on workers, and the
Cluster Manager manages resources.


Types of cluster managers

Spark supports the following cluster managers:

• Standalone: Spark's simple built-in cluster manager
• Apache Mesos: a general cluster manager that can also run Hadoop MapReduce and PySpark applications
• Hadoop YARN: the resource manager in Hadoop 2
• Kubernetes: automates the deployment and management of containerized applications
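As an illustration, the cluster manager is selected through the master URL when the application is created. A minimal sketch (the hosts and ports shown in the comments are illustrative):

from pyspark.sql import SparkSession

# The master URL decides which cluster manager runs the application:
#   local[*]                 -> run locally using all cores (for testing)
#   spark://host:7077        -> Spark standalone cluster manager
#   yarn                     -> Hadoop YARN
#   k8s://https://host:6443  -> Kubernetes
spark = (SparkSession.builder
         .appName("cluster-manager-demo")
         .master("local[*]")
         .getOrCreate())
print(spark.sparkContext.master)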

Installing Apache Spark


Steps to install Spark (1)

1. Download Spark from https://spark.apache.org/downloads.html
2. In the conf folder, copy log4j.properties.template to log4j.properties and change log4j.rootCategory from INFO to ERROR.
3. Install Anaconda from https://www.anaconda.com/
4. Download winutils.exe, a Hadoop binary for Windows, from this GitHub repository:
https://github.com/steveloughran/winutils/ Select the Hadoop version that matches your Spark distribution and take winutils.exe from its /bin folder.


Steps to install Spark (2)

5. If you do not have Java, or your Java version is 7.x or lower, download and install Java from Oracle:
https://www.oracle.com/java/technologies/javase/javase8-archive-downloads.html
6. Unzip Spark into C:\spark
7. Put the downloaded winutils.exe into a winutils folder on C:. It should look like this:
C:\winutils\bin\winutils.exe
8. From cmd run "cd C:\winutils\bin" and then: winutils.exe chmod 777 \tmp\hive
9. Add the environment variables:
• HADOOP_HOME -> C:\winutils
• SPARK_HOME -> C:\spark
• JAVA_HOME -> C:\jdk
• Path -> %SPARK_HOME%\bin
• Path -> %JAVA_HOME%\bin


Validating the Spark Installation

1. From the Anaconda prompt run "cd C:\spark" and then "pyspark". You should see the PySpark welcome banner.
2. From a Jupyter notebook, install findspark with "pip install findspark" and run the following code:

import findspark
findspark.init()                                  # make the Spark installation visible to Python

import pyspark
sc = pyspark.SparkContext(appName="myAppName")    # create a SparkContext for this application
sc                                                # displaying it confirms that Spark is running


Resilient Distributed Datasets (RDDs)


Apache Spark RDDs

RDDs are the building blocks of any Spark application. RDD stands for:
• Resilient: fault tolerant; lost partitions can be rebuilt in case of failure
• Distributed: data is distributed across multiple nodes in a cluster
• Dataset: a collection of partitioned data


Operations in RDDs

With RDDs, you can perform two types of operations:

• Transformations: operations applied to an RDD to create a new RDD; they are evaluated lazily. filter, groupBy and map are examples of transformations.
• Actions: operations that instruct Spark to perform the computation on an RDD and send the result back to the driver. collect is an example of an action.
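A minimal sketch of both kinds of operations; nothing is computed until the action collect() is called:

from pyspark import SparkContext

sc = SparkContext(appName="rdd-operations")

numbers = sc.parallelize(range(10))           # create an RDD from a Python range
evens = numbers.filter(lambda x: x % 2 == 0)  # transformation (lazy)
squares = evens.map(lambda x: x * x)          # transformation (lazy)

print(squares.collect())                      # action: triggers the computation -> [0, 4, 16, 36, 64]
sc.stop()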


DataFrames on Apache Spark


Introduction to DataFrames

DataFrames are tabular structures. A DataFrame can contain several data types across its columns (heterogeneous), while each column usually holds a single data type (homogeneous).
DataFrames are similar to SQL tables or Excel spreadsheets.


Advantages of DataFrames

Some of the advantages of working with Dataframes in Spark are:


• Process large amounts of structured or semi-structured data
• Easy data handling and imputation of missing values
• Multiple formats as data sources
• Multi-language support


Features of DataFrames

Spark DataFrames are characterized by being distributed, lazily evaluated, immutable and fault tolerant.


DataFrames Data Sources

DataFrames in PySpark can be created in several ways: from files, from RDDs, or from databases.
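A minimal sketch of the three options; the file path, table name and connection details are illustrative:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("dataframe-sources").getOrCreate()

# 1. From a file
df_csv = spark.read.csv("data/people.csv", header=True, inferSchema=True)

# 2. From an RDD of Rows
rdd = spark.sparkContext.parallelize([Row(name="Ana", age=34), Row(name="Luis", age=29)])
df_rdd = spark.createDataFrame(rdd)

# 3. From a database over JDBC
df_jdbc = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://localhost:5432/mydb")
           .option("dbtable", "public.people")
           .option("user", "user")
           .option("password", "secret")
           .load())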


Advanced Spark Features


Advanced features

Spark contains numerous advanced features to optimize performance and perform complex transformations on data. Some of them are user-defined functions (UDFs), cache(), etc.
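For example, a user-defined function (UDF) wraps a plain Python function so it can be applied to a DataFrame column; a minimal sketch with an illustrative toy DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()
df = spark.createDataFrame([("ana",), ("luis",)], ["name"])

# Register a Python function as a UDF and apply it to a column
capitalize = udf(lambda s: s.capitalize(), StringType())
df.withColumn("name_cap", capitalize(col("name"))).show()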


Performance optimization

Two of the main optimization techniques are the cache() and persist() methods. These methods store an intermediate result of an RDD, DataFrame or Dataset so that it can be reused in subsequent actions.
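A minimal sketch of how caching avoids recomputing a DataFrame between actions:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "value")

df.cache()                           # mark the DataFrame to be kept in memory
df.count()                           # first action materializes the cache
df.filter("value % 2 = 0").count()   # subsequent actions reuse the cached data

df.unpersist()
df.persist(StorageLevel.DISK_ONLY)   # persist() lets you choose the storage level explicitly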


Advanced Analytics with Spark


Functions for data analytics

To train a model or perform statistical analysis on our data, the following functions and tasks are necessary (a sketch putting them together follows the list):
• Generate a Spark session
• Import the data and generate the correct schema
• Methods for inspecting data
• Data and column transformation
• Dealing with missing values
• Execute queries (SQL, Python, PySpark…)
• Data visualization
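A minimal sketch of this workflow, assuming an illustrative sales.csv file with country and amount columns:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("analytics").getOrCreate()          # Spark session

df = spark.read.csv("data/sales.csv", header=True, inferSchema=True)     # import data and infer the schema

df.printSchema()                                                         # inspect the data
df.describe().show()

df = (df.withColumn("amount", col("amount").cast("double"))              # transform a column
        .na.fill({"amount": 0.0}))                                       # deal with missing values

df.createOrReplaceTempView("sales")                                      # execute SQL queries
spark.sql("SELECT country, SUM(amount) AS total FROM sales GROUP BY country").show()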


Data visualization

PySpark supports numerous Python data visualization libraries such as seaborn, matplotlib, bokeh, etc.
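Since these libraries run on the driver, the usual pattern is to aggregate or sample in Spark and then convert the (small) result to pandas for plotting; a minimal sketch with matplotlib:

import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("viz").getOrCreate()
df = spark.range(100).selectExpr("id", "id * id AS squared")

pdf = df.toPandas()                          # bring the small result to the driver
pdf.plot(x="id", y="squared", kind="line")   # plot with matplotlib via pandas
plt.show()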


Apache Spark Koalas


Introduction to Koalas

Koalas provides a drop-in replacement for pandas, allowing pandas code to scale efficiently to hundreds of nodes for data science and machine learning.
pandas by itself doesn't scale to Big Data.
The PySpark DataFrame is more SQL-like, while the Koalas DataFrame is closer to pandas and Python.


Koalas and PySpark DataFrames

Koalas and PySpark DataFrames are different. Koalas DataFrames follow the structure of pandas and implement an index. The PySpark DataFrame is closer to a table in a relational database and has no index. Koalas translates the pandas API into a Spark SQL logical plan.


Example: Feature Engineering with Koalas

In data science, the get_dummies() function of pandas is often needed to encode categorical variables as dummy (numerical) variables.
Thanks to Koalas you can do this in Spark with almost no code changes.

(Figure: the same get_dummies code written with pandas and with Koalas)
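A minimal sketch with an illustrative toy DataFrame (the databricks.koalas package was merged into pyspark.pandas from Spark 3.2):

import databricks.koalas as ks

kdf = ks.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})

# Same call as pandas.get_dummies, but executed on Spark under the hood
dummies = ks.get_dummies(kdf, columns=["color"])
print(dummies.head())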


Example: Feature Engineering with Koalas

In data science you often need to work with time (datetime) data. pandas makes working with this type of data easy; with plain PySpark it is more complicated, but Koalas brings the pandas datetime API to Spark.

(Figure: the same datetime-handling code written with pandas and with Koalas)
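A minimal sketch, again with an illustrative toy DataFrame:

import databricks.koalas as ks

kdf = ks.DataFrame({"date": ["2022-11-09", "2022-12-01"], "value": [10, 20]})

# pandas-style datetime parsing and accessors work on the Spark-backed DataFrame
kdf["date"] = ks.to_datetime(kdf["date"])
kdf["year"] = kdf["date"].dt.year
kdf["month"] = kdf["date"].dt.month
print(kdf.head())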


Machine Learning with Spark


Spark Machine Learning

Machine Learning is the construction of algorithms that can learn from data and make predictions on it. Spark ML provides machine learning algorithms and supporting functions.


Spark Machine Learning Tools

Spark ML libraries:
• spark.mllib contains the original API, built on top of RDDs
• spark.ml provides a higher-level API built on top of DataFrames for building ML pipelines; it is the main ML API

Resource: https://www.r-bloggers.com/


Spark Machine Learning Components

Spark ML provides the following tools:

• ML algorithms: common machine learning algorithms such as classification, regression, clustering, and collaborative filtering.
• Preprocessing functions: feature extraction, transformation, dimensionality reduction and feature selection.
• Pipelines: tools for building ML models in stages.
• Persistence: to save and load algorithms, models and pipelines.
• Utilities: for linear algebra, statistics and data management.


Machine Learning Process

Resource: https://www.r-bloggers.com/


Feature Engineering with Spark

The most commonly used data preprocessing techniques in Spark are:

• VectorAssembler
• Bucketing (grouping continuous values into categories)
• Scaling and normalization
• Working with categorical features
• Text data transformers
• Feature manipulation
• PCA


Feature Engineering with Spark

• VectorAssembler: used to concatenate features into a single vector that can be passed to an estimator or ML algorithm.
• Bucketing: the simplest method for converting continuous variables into categorical variables. It can be done with the Bucketizer class.
• Scaling and standardization: another common task for numerical variables; it rescales the data to a common range or to zero mean and unit variance.
• StandardScaler standardizes variables to a mean of zero and a standard deviation of one; MinMaxScaler rescales them to a fixed range (typically [0, 1]).
• StringIndexer: converts categorical variables into numerical indices.
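A minimal sketch combining these transformers on an illustrative toy DataFrame:

from pyspark.ml.feature import Bucketizer, StandardScaler, StringIndexer, VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feature-eng").getOrCreate()
df = spark.createDataFrame(
    [("red", 1.0, 10.0), ("blue", 3.0, 25.0), ("red", 7.0, 40.0)],
    ["color", "x1", "x2"])

indexer = StringIndexer(inputCol="color", outputCol="color_idx")        # categorical -> numerical index
bucketizer = Bucketizer(splits=[0.0, 2.0, 5.0, float("inf")],
                        inputCol="x1", outputCol="x1_bucket")           # continuous -> buckets
assembler = VectorAssembler(inputCols=["color_idx", "x1_bucket", "x2"],
                            outputCol="features")                       # concatenate into one vector
scaler = StandardScaler(inputCol="features", outputCol="features_std")  # unit standard deviation
                                                                        # (set withMean=True for zero mean)
df = indexer.fit(df).transform(df)
df = bucketizer.transform(df)
df = assembler.transform(df)
df = scaler.fit(df).transform(df)
df.select("features_std").show(truncate=False)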


Pipelines in PySpark

A pipeline groups the different stages of a machine learning workflow into a single entity that runs as one uninterrupted workflow. Each stage is a Transformer or an Estimator. Stages run in sequence, and the input data is transformed as it passes through each stage.
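A minimal sketch of a pipeline with indexing, assembling and a logistic regression, on an illustrative toy DataFrame:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipeline-demo").getOrCreate()
df = spark.createDataFrame(
    [("red", 1.0, 0.0), ("blue", 3.0, 1.0), ("red", 7.0, 1.0), ("blue", 2.0, 0.0)],
    ["color", "x1", "label"])

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="color", outputCol="color_idx"),
    VectorAssembler(inputCols=["color_idx", "x1"], outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])

model = pipeline.fit(df)                                  # runs the stages in sequence
model.transform(df).select("label", "prediction").show()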


Spark Streaming


Spark Streaming Fundamentals

PySpark Streaming is a scalable and fault-tolerant system that follows the RDD batch paradigm. It operates in batch intervals, receiving a stream of continuous input data from sources such as Apache Flume, Kinesis, Kafka, TCP sockets, etc.
The Spark engine then processes these batches.


How Spark Streaming Works

Spark Streaming receives data from multiple sources and groups it into small batches (DStreams) over a time interval that the user can define. Each input batch forms an RDD and is processed by Spark jobs to create other RDDs.


Example: Counting Words
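A minimal sketch of the classic streaming word count, assuming a TCP socket source on localhost:9999 (fed, for example, with "nc -lk 9999"):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-word-count")
ssc = StreamingContext(sc, 5)                        # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)      # continuous input stream
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                      # print each batch's counts

ssc.start()
ssc.awaitTermination()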


Output modes

Spark uses several output modes to store the data:

• Complete: the entire result table is written each time.
• Append: only the new rows from the last trigger are added. Only for queries in which existing rows are not expected to change.
• Update: only the rows that were updated since the last trigger are written. If the query does not contain aggregations, it is equivalent to append mode.
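In Structured Streaming the output mode is chosen on the writer; a minimal sketch with an illustrative socket source and the console sink:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("output-modes").getOrCreate()

lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

word_counts = lines.groupBy("value").count()          # an aggregation query

query = (word_counts.writeStream
         .outputMode("complete")                      # "complete", "append" or "update"
         .format("console")
         .start())
query.awaitTermination()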


Types of transformations

To provide fault tolerance, the data is replicated on two nodes, and there is also a mechanism called checkpointing. Transformations can be grouped into:
• Stateless transformations: each micro-batch of data does not depend on the previous data batches, so each batch is fully independent of whatever batches of data preceded it.
• Stateful transformations: each micro-batch of data depends partially or wholly on the previous batches of data, so each batch considers what happened before it and uses that information while being processed.


Spark Streaming Capabilities


Introduction to Databricks


Introduction to Databricks

Databricks is the Apache Spark-based data analytics platform developed by the creators of Spark.
Databricks enables advanced analytics, Big Data and ML in a simple and collaborative way.
It is available as a cloud service on Azure, AWS, and GCP.


Features of Databricks

Databricks auto-scales and sizes Spark environments in a simple way. It facilitates deployments and accelerates the installation and configuration of Big Data environments.


Databricks Architecture


Databricks Community

Databricks Community Edition is the free version. It allows you to use a small cluster with limited resources and non-collaborative notebooks. The paid version has more capabilities.


Terminology

Important terms to know:

1. Workspaces
2. Notebooks
3. Libraries
4. Tables
5. Clusters
6. Jobs


Delta Lake

Delta Lake is the open source storage layer developed for Spark and Databricks. It provides ACID transactions and advanced metadata management.
It includes a Spark-compatible query engine that accelerates operations and improves performance.
The data is stored in Parquet format.
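A minimal sketch of writing and reading a Delta table, assuming a Spark session with the Delta Lake library configured (for example on Databricks); the path is illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()
df = spark.range(5).withColumnRenamed("id", "value")

df.write.format("delta").mode("overwrite").save("/tmp/delta/demo")   # ACID write in Parquet-based Delta format
spark.read.format("delta").load("/tmp/delta/demo").show()            # read the table back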


Resources


Resources:
• https://spark.apache.org/docs/2.2.0/index.html Official Spark documentation
• https://colab.research.google.com/ Google Colab, for additional computing capacity
