Spark Introduction
PySpark Architecture:
1. Apache Spark works in a master-slave architecture.
2. The master is called the “Driver” and the slaves are called
“Workers”.
3. When you run a Spark application, the Spark Driver creates a
context that is the entry point to your application; all
operations (transformations and actions) are executed on the
worker nodes, and the resources are managed by the Cluster
Manager.
4. Cluster Manager (a minimal session setup is sketched after
this list):
Standalone – a simple cluster manager included with
Spark that makes it easy to set up a cluster.
Apache Mesos – a cluster manager that can also run
Hadoop MapReduce and PySpark applications.
Hadoop YARN – the resource manager in Hadoop 2. This is
the most commonly used cluster manager.
Kubernetes – an open-source system for automating
deployment, scaling, and management of containerized
applications.
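A minimal sketch of how the master URL selects the cluster manager
when creating a session; the app name and the local[*] master are
only assumptions for a single-machine demo:

    from pyspark.sql import SparkSession

    # The Driver creates the SparkSession/SparkContext, the entry point
    # of the application. The master URL decides which cluster manager
    # is used, e.g. "local[*]" (all local cores), "yarn",
    # "spark://host:7077" (Standalone) or "k8s://https://host:443".
    spark = (SparkSession.builder
             .appName("ArchitectureDemo")   # hypothetical app name
             .master("local[*]")            # assumption: local demo run
             .getOrCreate())

    print(spark.sparkContext.master)        # shows the selected master

The later snippets in these notes reuse this spark session.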
PySpark Installation:
The installation guide is in a separate Word file.
PySpark RDD:
The RDD (Resilient Distributed Dataset) is the fundamental
building block of PySpark; it is fault-tolerant and immutable.
In other words, RDDs are a collection of objects similar to a list
in Python, with the difference being that an RDD is computed on
several processes scattered across multiple physical servers (also
called nodes) in a cluster, while a Python collection lives and is
processed in just one process.
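A minimal sketch of creating an RDD from a Python list, assuming the
spark session from the architecture sketch above:

    # parallelize() splits the local list into partitions that are
    # distributed across the worker nodes.
    sc = spark.sparkContext
    numbers = [1, 2, 3, 4, 5]        # ordinary Python list, one process
    rdd = sc.parallelize(numbers)    # RDD: same data, now distributed
    print(rdd.getNumPartitions())    # how many partitions were created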
On a PySpark RDD, you can perform two kinds of operations.
1. Transformation - instead of updating the current RDD, these
operations return a new RDD (see the sketch after this list).
2. Action - operations that trigger computation and return values.
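For example, both kinds of operations on the rdd created above
(the variable names are illustrative):

    squared = rdd.map(lambda x: x * x)             # transformation: new RDD, lazy
    evens = squared.filter(lambda x: x % 2 == 0)   # transformation: still lazy
    print(evens.collect())                         # action: runs the job -> [4, 16]
    print(squared.count())                         # action: returns a value -> 5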
PySpark DataFrame:
A PySpark DataFrame is very similar to a pandas DataFrame, with
the exception that PySpark DataFrames are distributed across the
cluster.
Due to parallel execution on all cores of multiple machines,
PySpark runs operations faster than pandas.
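A small sketch of creating a distributed DataFrame from local data;
the column names and rows are made up, and the spark session from
above is assumed:

    data = [("Alice", 34), ("Bob", 45), ("Carol", 29)]   # hypothetical rows
    df = spark.createDataFrame(data, ["name", "age"])    # distributed DataFrame
    df.show()            # action: prints the rows
    df.printSchema()     # column names and types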
PySpark SQL:
Once you have a DataFrame created, you can interact with the data by
using SQL syntax.
In order to use SQL, first create a temporary view on the DataFrame
using the createOrReplaceTempView() function. Once created, this
view can be accessed throughout the SparkSession using sql(), and it
is dropped when the SparkSession terminates.
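A sketch of this workflow, reusing the hypothetical df from the
DataFrame example above:

    df.createOrReplaceTempView("people")     # register a temporary view
    adults = spark.sql(
        "SELECT name, age FROM people WHERE age > 30")   # SQL syntax
    adults.show()    # the view lives only as long as the SparkSession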