
Spark Introduction

• Developed at the AMPLab at UC Berkeley; development is now driven largely by Databricks (databricks.com).


• The main feature is its in-memory cluster computing, which in turn
  increases processing speed.
  In-memory computing means using a type of middleware software that allows
  one to store data in RAM across a cluster of computers and process it in
  parallel.

• Spark was first coded as a Scala project.


• But it has since become a polyglot framework that the user can interface
  with using Scala, Java, Python, or R.
PySpark:
1. What:
   • It is a Python API for interacting with Spark.
   • PySpark is a Spark library written in Python that runs Python
     applications using Apache Spark's capabilities; using PySpark we can
     run applications in parallel on a distributed cluster (multiple
     nodes). A minimal example follows this list.
   • PySpark is made available to Python via Py4J. Py4J is a Java library
     integrated within PySpark that allows Python to dynamically interface
     with JVM objects.
2. Why:
   a. PySpark is used heavily in the machine learning and data science
      communities, thanks to Python's vast machine learning libraries.
      Spark runs operations on billions and trillions of records on
      distributed clusters, up to 100 times faster than traditional
      single-process Python applications.
3. Who:
   PySpark is used by Netflix, Amazon, and other big companies that have a
   lot of real-time data to process. It is also used for IoT devices and
   satellite communication, where a high-speed cluster and fast processing
   are required.
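
A minimal sketch of what a PySpark program looks like, assuming only that the
pyspark package is installed (the app name and numbers are made-up
placeholders, not taken from this document):

    from pyspark.sql import SparkSession

    # SparkSession is the entry point of a PySpark application.
    spark = SparkSession.builder.appName("IntroExample").getOrCreate()

    # Distribute a small Python list across the cluster and process it in parallel.
    rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
    print(rdd.map(lambda x: x * x).sum())   # 55

    spark.stop()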

PySpark Architecture:
1. Apache Spark works in a master-slave architecture.
2. The master is called the “Driver” and the slaves are called “Workers”.
3. When you run a Spark application, the Spark Driver creates a context
   that is the entry point to your application; all operations
   (transformations and actions) are executed on worker nodes, and the
   resources are managed by the Cluster Manager.
4. Cluster Manager (chosen via the master URL, as sketched after this list):
   • Standalone – a simple cluster manager included with Spark that makes
     it easy to set up a cluster.
   • Apache Mesos – a cluster manager that can also run Hadoop MapReduce
     and PySpark applications.
   • Hadoop YARN – the resource manager in Hadoop 2; this is the most
     commonly used cluster manager.
   • Kubernetes – an open-source system for automating deployment,
     scaling, and management of containerized applications.
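
As a rough sketch, the cluster manager is selected by the master URL supplied
when the SparkSession is built; the URL formats below are the standard Spark
ones, but the host names and ports are placeholders, not values from this
document:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("ArchitectureExample")
             # .master("spark://host:7077")        # Spark standalone cluster manager
             # .master("mesos://host:5050")        # Apache Mesos
             # .master("yarn")                     # Hadoop YARN
             # .master("k8s://https://host:6443")  # Kubernetes
             .master("local[*]")                   # run locally on all cores (no cluster manager)
             .getOrCreate())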
PySpark Installation:
The installation guide is in a separate Word file.

PySpark RDD:
The RDD (Resilient Distributed Dataset) is the fundamental building block of
PySpark; it is fault-tolerant and immutable.
In other words, RDDs are collections of objects similar to a list in Python,
with the difference that an RDD is computed by several processes scattered
across multiple physical servers (also called nodes) in a cluster, while a
Python collection lives and is processed in just one process.
On a PySpark RDD, you can perform two kinds of operations (illustrated in the
sketch below):
1. Transformation - instead of updating the current RDD, these operations
   return a new RDD.
2. Action - operations that trigger computation and return values.
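
A small sketch contrasting the two, assuming spark is an existing
SparkSession (the numbers are placeholders):

    rdd = spark.sparkContext.parallelize([1, 2, 3, 4])

    # Transformation: returns a new RDD; nothing is computed yet (lazy).
    squared = rdd.map(lambda x: x * x)

    # Actions: trigger the computation and return values to the driver.
    print(squared.collect())   # [1, 4, 9, 16]
    print(squared.count())     # 4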

PySpark RDD Limitations


PySpark RDDs are not well suited to applications that make updates to a
state store, such as the storage systems behind a web application. For these
applications, it is more efficient to use systems that perform traditional
update logging and data checkpointing, such as databases. The goal of RDDs is
to provide an efficient programming model for batch analytics and to leave
such asynchronous applications to systems designed for them.

PySpark DataFrame:
• A PySpark DataFrame is very similar to a pandas DataFrame, with the
  exception that PySpark DataFrames are distributed across the cluster.
• Due to parallel execution on all cores across multiple machines, PySpark
  runs operations faster than pandas.

We can get the schema of the DataFrame using df.printSchema().

df.show() displays the first 20 rows of the DataFrame.
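
A minimal sketch putting these together, assuming an existing SparkSession
named spark (the column names and rows are made up for illustration):

    data = [("James", 30), ("Anna", 25)]
    df = spark.createDataFrame(data, ["name", "age"])

    df.printSchema()   # prints the column names and types
    df.show()          # displays the first 20 rows (here, both rows)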

PySpark SQL:
Once you have a DataFrame created, you can interact with the data by using
SQL syntax.
To use SQL, first create a temporary view on the DataFrame using the
createOrReplaceTempView() function. Once created, this view can be accessed
throughout the SparkSession using sql(), and it is dropped when the
SparkSession (and its underlying SparkContext) terminates.
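
A short sketch of the SQL workflow, reusing the hypothetical df and spark from
the DataFrame example above (the view name "people" is a placeholder):

    # Register a session-scoped temporary view over the DataFrame.
    df.createOrReplaceTempView("people")

    # Query it with standard SQL; the result is itself a DataFrame.
    spark.sql("SELECT name FROM people WHERE age > 26").show()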
