Big Data with Apache Spark 3 and Python: From Zero to Expert
Apache Spark
Spark is an open-source Big Data solution, developed at UC Berkeley's RAD Lab (later the AMPLab) in 2009. It has become one of the most widely used frameworks for Big Data processing.
Spark Components
PySpark
PySpark is the open-source Python API for Apache Spark, a distributed computing framework for Big Data processing. Advantages of PySpark:
• Easy to learn
• Extensive set of libraries for Machine Learning and Data Science
• Great support from the community
PySpark Architecture
Apache Spark follows a master-worker architecture: the driver program coordinates the application, operations are executed on the workers, and the cluster manager allocates resources.
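As a minimal sketch (not part of the course material), this is how a PySpark application typically starts; the local[*] master URL and the application name are assumptions for illustration.

from pyspark.sql import SparkSession

# Build a SparkSession: the driver coordinates the application and the
# cluster manager (here simply local mode) allocates the executors.
spark = (
    SparkSession.builder
    .master("local[*]")        # assumption: run locally using all cores
    .appName("IntroToSpark")   # assumed application name
    .getOrCreate()
)

print(spark.version)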
Installing Spark on Windows
1. If you do not have Java, or your Java version is 7.x or lower, download and install Java 8 from Oracle: https://www.oracle.com/java/technologies/javase/javase8-archive-downloads.html
2. Unzip Spark in C:\spark
3. Place the downloaded winutils.exe in a winutils folder on C:. It should look like this: C:\winutils\bin\winutils.exe.
4. From cmd run "cd C:\winutils\bin" and then "winutils.exe chmod 777 \tmp\hive"
5. Add the environment variables:
• HADOOP_HOME -> C:\winutils
• SPARK_HOME -> C:\spark
• JAVA_HOME -> C:\jdk
• Path -> %SPARK_HOME%\bin
• Path -> %JAVA_HOME%\bin
1. From the Anaconda prompt run: "cd C:\spark" and then "pyspark". You should see something like
picture 1.
2. From a Jupyter notebook, install findspark with "pip install findspark" and run the following code:
import findspark
findspark.init()  # make the Spark installation (SPARK_HOME) visible to Python

import pyspark
sc = pyspark.SparkContext(appName="myAppName")  # create the SparkContext
sc  # displaying it confirms that Spark is up and running
RDDs
RDDs are the building blocks of any Spark application. RDD stands for:
• Resilient: fault tolerant; partitions can be rebuilt in case of failure
• Distributed: data is distributed across multiple nodes in a cluster
• Dataset: a collection of partitioned data
Operations in RDDs
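An illustrative sketch of the two kinds of RDD operations, lazy transformations versus actions that trigger execution; the data is made up and sc is the SparkContext from the earlier snippet.

numbers = sc.parallelize([1, 2, 3, 4, 5])        # create an RDD from a Python list

squares = numbers.map(lambda x: x * x)           # transformation: lazy, returns a new RDD
evens = squares.filter(lambda x: x % 2 == 0)     # transformation: lazy

print(evens.collect())   # action: triggers the computation -> [4, 16]
print(squares.count())   # action: -> 5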
Introduction to DataFrames
DataFrames are tabular structures. They can combine several data types within the same table (heterogeneous across columns), while each column holds values of a single data type (homogeneous within a column).
DataFrames are similar to SQL tables or Excel spreadsheets.
Advantages of DataFrames
Features of DataFrames
Spark DataFrames are distributed, lazily evaluated, immutable, and fault tolerant.
DataFrames in PySpark can be created in several ways: from files, from RDDs, or from databases, as sketched below.
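A hedged sketch of the three creation paths; an existing SparkSession named spark is assumed, and the file path, JDBC connection details, and column names are assumptions.

from pyspark.sql import Row

# 1. From a file (path and options are assumptions)
df_csv = spark.read.csv("data/people.csv", header=True, inferSchema=True)

# 2. From an RDD of Row objects
rdd = spark.sparkContext.parallelize([Row(name="Ana", age=34), Row(name="Luis", age=28)])
df_rdd = spark.createDataFrame(rdd)

# 3. From a database via JDBC (URL, table, and credentials are assumptions)
df_db = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/demo")
    .option("dbtable", "public.people")
    .option("user", "user")
    .option("password", "password")
    .load()
)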
Advanced features
Spark includes numerous advanced features to optimize performance and perform complex transformations on data, such as user-defined functions (UDFs), cache(), etc.
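A minimal sketch of a user-defined function (UDF); the spark session, the column name, and the sample data are assumptions.

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

# UDF that upper-cases a text column
to_upper = udf(lambda s: s.upper() if s is not None else None, StringType())

df = spark.createDataFrame([("ana",), ("luis",)], ["name"])  # assumed sample data
df.withColumn("name_upper", to_upper(col("name"))).show()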
Performance optimization
Two of these optimization techniques are the cache() and persist() methods. They store an intermediate result of an RDD, DataFrame, or Dataset so that it can be reused in subsequent actions.
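A hedged sketch of reusing an intermediate result with cache() and persist(); the spark session and the DataFrame contents are illustrative assumptions.

from pyspark import StorageLevel

df = spark.range(1_000_000)                      # example DataFrame
filtered = df.filter("id % 2 = 0")

filtered.cache()                                 # keep the intermediate result
print(filtered.count())                          # the first action materialises the cache
print(filtered.selectExpr("max(id)").first())    # this action reuses the cached data

filtered.unpersist()
filtered.persist(StorageLevel.MEMORY_AND_DISK)   # persist() lets you pick the storage level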
In order to train a model or perform statistical analysis on our data, the following functions and tasks are needed (see the sketch after this list):
• Generate a Spark session
• Import the data and generate the correct schema
• Methods for inspecting data
• Data and column transformation
• Dealing with missing values
• Execute queries (SQL, Python, PySpark…)
• Data visualization
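A compact sketch stringing these tasks together; the file name, column names, and conversion factor are assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("DataPrep").getOrCreate()        # 1. Spark session

df = spark.read.csv("data/sales.csv", header=True, inferSchema=True)  # 2. import the data

df.printSchema()                                                      # 3. inspect the data
df.show(5)
df.describe().show()

df = df.withColumn("amount_eur", col("amount") * 0.92)                # 4. transform (assumed rate)

df = df.na.drop(subset=["amount"])                                    # 5. missing values

df.createOrReplaceTempView("sales")                                   # 6. SQL queries
spark.sql("SELECT country, SUM(amount) AS total FROM sales GROUP BY country").show()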
Data visualization
PySpark works with numerous Python data visualization libraries such as seaborn, matplotlib, bokeh, etc.
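Plots are usually drawn by aggregating in Spark and converting the (small) result to pandas; a minimal matplotlib sketch, with the df DataFrame and its column names assumed.

import matplotlib.pyplot as plt

pdf = df.groupBy("country").count().toPandas()   # aggregate in Spark, plot in pandas

pdf.plot(kind="bar", x="country", y="count", legend=False)
plt.ylabel("rows")
plt.tight_layout()
plt.show()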
Introduction to Koalas
Koalas provides a drop-in replacement for pandas, allowing efficient scaling to hundreds of nodes for data science and machine learning.
pandas does not scale to Big Data.
The PySpark DataFrame is closer to SQL, while the Koalas DataFrame is closer to Python and pandas.
Koalas and PySpark DataFrames are different. Koalas DataFrames follow the structure of pandas and implement an index. The PySpark DataFrame is closer to tables in relational databases and has no index. Koalas translates the pandas API into Spark SQL logical plans.
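A minimal Koalas sketch (in Spark 3.2+ the same API is available as pyspark.pandas); the data values are illustrative.

import databricks.koalas as ks   # Spark 3.2+: import pyspark.pandas as ps

# pandas-style API, backed by Spark
kdf = ks.DataFrame({"city": ["Madrid", "Lyon", "Porto"],
                    "population": [3_300_000, 520_000, 230_000]})

print(kdf.head(2))                # pandas-like methods and an index
print(kdf["population"].mean())   # executed as Spark jobs under the hood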
In data science, the pandas get_dummies() function is often needed to encode categorical variables as dummy (numerical) variables.
Thanks to Koalas you can do this in Spark with only a few changes, as sketched below.
[Screenshots: equivalent get_dummies() calls in pandas and in Koalas]
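A hedged sketch of one-hot encoding with Koalas; the column and its values are illustrative.

import databricks.koalas as ks

kdf = ks.DataFrame({"color": ["red", "blue", "red", "green"]})

# Same call as pandas.get_dummies(), but executed on Spark
encoded = ks.get_dummies(kdf, columns=["color"])
print(encoded.head())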
In data science you often need to work with time data. pandas makes this easy, whereas with plain PySpark it is more complicated; Koalas brings the pandas datetime API to Spark, as sketched below.
[Screenshots: equivalent datetime handling in pandas and in Koalas]
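A minimal sketch of pandas-style datetime handling in Koalas; the dates and column names are illustrative.

import databricks.koalas as ks

kdf = ks.DataFrame({"order_date": ["2022-01-15", "2022-02-03", "2022-03-21"]})
kdf["order_date"] = ks.to_datetime(kdf["order_date"])

# pandas-style .dt accessors, executed on Spark
kdf["year"] = kdf["order_date"].dt.year
kdf["month"] = kdf["order_date"].dt.month
print(kdf.head())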
Machine Learning is the construction of algorithms that can learn from data and make predictions from it. Spark ML provides machine learning algorithms and related utilities.
Spark ML libraries:
• spark.mllib contains the original API, built on top of RDDs
• spark.ml provides a higher-level API built on top of DataFrames for building ML pipelines; it is the main ML API
Resource: https://www.r-bloggers.com/
Resource: https://www.r-bloggers.com/
• VectorAssembler: used to concatenate features into a single vector that can be passed to the estimator or the ML algorithm.
• Bucketing: the simplest method for converting continuous variables into categorical ones; it can be done with the Bucketizer class.
• Scaling and standardization: another common task for numerical variables; it transforms the data so that variables are on a comparable scale.
• StandardScaler: standardizes variables to a mean of zero and a standard deviation of 1; MinMaxScaler rescales variables to a given range ([0, 1] by default).
• StringIndexer: converts categorical variables to numerical indices.
A sketch combining these transformers follows below.
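A hedged sketch combining these transformers; the column names (city, age, income) are assumptions.

from pyspark.ml.feature import (VectorAssembler, Bucketizer, StandardScaler,
                                MinMaxScaler, StringIndexer)

# StringIndexer: categorical column -> numeric index
indexer = StringIndexer(inputCol="city", outputCol="city_idx")

# Bucketizer: continuous 'age' -> buckets defined by the splits
bucketizer = Bucketizer(splits=[0, 18, 40, 65, float("inf")],
                        inputCol="age", outputCol="age_bucket")

# VectorAssembler: concatenate the features into a single vector column
assembler = VectorAssembler(inputCols=["city_idx", "age_bucket", "income"],
                            outputCol="features")

# StandardScaler: zero mean, unit standard deviation
scaler = StandardScaler(inputCol="features", outputCol="features_std",
                        withMean=True, withStd=True)
# MinMaxScaler(inputCol="features", outputCol="features_01") would rescale to [0, 1]

These stages are normally chained and fitted through a Pipeline, as shown in the next section.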
Pipelines in PySpark
In pipelines, the different stages of a machine learning workflow are grouped together as a single entity and run as an uninterrupted workflow. Each stage is a Transformer or an Estimator. Stages run in sequence, and the input data is transformed as it passes through each stage.
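A minimal pipeline sketch with a StringIndexer, a VectorAssembler, and a logistic regression estimator; the spark session, the column names, and the tiny training/test data are assumptions.

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Assumed toy data with columns city, income, label
train_df = spark.createDataFrame(
    [("Madrid", 30000.0, 1.0), ("Lyon", 25000.0, 0.0), ("Madrid", 42000.0, 1.0)],
    ["city", "income", "label"])
test_df = spark.createDataFrame([("Lyon", 31000.0, 0.0)], ["city", "income", "label"])

indexer = StringIndexer(inputCol="city", outputCol="city_idx")
assembler = VectorAssembler(inputCols=["city_idx", "income"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

# The pipeline runs the stages in sequence as a single workflow
pipeline = Pipeline(stages=[indexer, assembler, lr])
model = pipeline.fit(train_df)
model.transform(test_df).select("label", "prediction").show()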
Spark Streaming
PySpark Streaming is a scalable and fault-tolerant system that follows the RDD batch paradigm. It operates in batch intervals, receiving a stream of continuous input data from sources such as Apache Flume, Kinesis, Kafka, TCP sockets, etc., which the Spark engine then processes.
Spark Streaming receives data from multiple sources and groups it into small batches (DStreams) over a time interval that the user can define. Each input batch forms an RDD and is processed with Spark jobs to create other RDDs.
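A minimal DStream word-count sketch; sc is an existing SparkContext, and the 2-second batch interval and the localhost:9999 socket source are assumptions.

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, batchDuration=2)       # 2-second batch interval (assumption)

# Each batch of lines read from the socket becomes an RDD inside the DStream
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()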
Output modes
• Complete: the entire result table is written out after every trigger
• Append: only the new rows added since the last trigger are written
• Update: only the rows that changed since the last trigger are written
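In Structured Streaming the output mode is set on the writer; a hedged sketch using the built-in rate test source and the console sink (both assumptions), with an existing spark session. Note that append requires a watermark when aggregations are involved.

from pyspark.sql.functions import window

stream_df = (spark.readStream.format("rate")      # built-in test source
                  .option("rowsPerSecond", 5)
                  .load())

counts = stream_df.groupBy(window("timestamp", "10 seconds")).count()

query = (counts.writeStream
               .outputMode("complete")            # rewrite the full result table each trigger
               .format("console")
               .start())
query.awaitTermination()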
Types of transformations
To provide fault tolerance, the data is replicated across two nodes, and there is also a mechanism called checkpointing. Transformations can be grouped into:
• Stateless transformations: each micro-batch of data does not depend on the previous batches, so each batch is processed completely independently.
• Stateful transformations: each micro-batch depends partially or wholly on the previous batches, so each batch takes into account what happened before and uses that information while being processed.
A sketch contrasting both types follows below.
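A hedged DStream sketch contrasting a stateless per-batch count with the stateful updateStateByKey, which needs checkpointing; the checkpoint directory and the socket source are assumptions, and sc is an existing SparkContext.

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 5)
ssc.checkpoint("/tmp/spark-checkpoint")           # required for stateful transformations

lines = ssc.socketTextStream("localhost", 9999)
pairs = lines.flatMap(lambda l: l.split(" ")).map(lambda w: (w, 1))

# Stateless: each micro-batch is counted independently of the previous ones
per_batch = pairs.reduceByKey(lambda a, b: a + b)

# Stateful: a running total that accumulates across all batches seen so far
def update_total(new_values, running):
    return sum(new_values) + (running or 0)

running_total = pairs.updateStateByKey(update_total)

per_batch.pprint()
running_total.pprint()
ssc.start()
ssc.awaitTermination()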
Introduction to Databricks
Databricks is the Apache Spark-based data analytics platform developed by the creators of Spark. It enables advanced analytics, Big Data and ML workloads in a simple and collaborative way, and it is available as a cloud service on Azure, AWS, and GCP.
Features of Databricks
Databricks auto-scales and sizes Spark environments in a simple way. It facilitates deployments and accelerates the installation and configuration of Big Data environments.
Databricks Architecture
Databricks Community
Databricks Community is the free version. It provides a small cluster with limited resources and non-collaborative notebooks. The paid version has more capabilities.
Terminology
1. Workspaces
2. Notebooks
3. Libraries
4. Tables
5. Clusters
6. Jobs
Delta Lake
Delta Lake is the open-source storage layer developed for Spark and Databricks. It provides ACID transactions and advanced metadata management, and includes a Spark-compatible query engine that accelerates operations and improves performance. The data is stored in Parquet format.
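A minimal Delta Lake sketch; it assumes a Spark environment with Delta Lake configured (for example Databricks), an existing spark session, and an illustrative path.

# Write a DataFrame as a Delta table (Parquet files plus a transaction log)
df = spark.range(100)
df.write.format("delta").mode("overwrite").save("/tmp/delta/numbers")

# Read it back with ACID guarantees
delta_df = spark.read.format("delta").load("/tmp/delta/numbers")
print(delta_df.count())

# Time travel: read an earlier version of the table
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/numbers")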
Resources
• https://spark.apache.org/docs/2.2.0/index.html: official Spark documentation
• https://colab.research.google.com/: Google Colab, for additional computing capacity