
Introduction to Apache Spark

(Spark)
-By Praveen
PART 1 lasts around 20 to 30 minutes …

We are going to answer these questions …

What is Spark ?
When to use Spark?
How to use Spark?

PART 2 lasts around 10 to 15 minutes …

Overview of Spark Software


The activity is.. Filter the data by country

Data size .. > 5MB


# Records .. ~40K
#Attributes / columns .. 15

Schema of the Data:


userID, visited_page, country, device_type, time, os_type, interaction_type, pincode, ……
The activity .. Filter the data by country

Data size .. > 5MB


# Records .. ~40K
#Attributes / columns .. 15

Apply the Filter function
The activity .. Filter the data by country

What if the data size exceeds the capacity of Excel .. i.e ~1M rows ?

Data size .. > 50GB


# Records .. ~x Millions
#Attributes / columns .. 15
The activity .. Filter the data by country

What if the data size exceeds the capacity of Excel .. i.e ~1M rows ?

HERE IS THE SOLUTION ….

Data size .. > 50GB


# Records .. ~x Millions
#Attributes / columns .. 15

*May take a long time (~10 to 12 hrs) to finish the “FILTER” process … due to the unavailability of enough computing resources
The activity .. Filter the data by country

What if the data size exceeds the capacity (RAM, PROCESSOR & Disk) of the existing Server?

HERE IS THE SOLUTION …

Either increase the capacity of the existing server (Scale-up) OR replace the existing server with a brand-new, higher-capacity server

Data size .. > 50GB, then > 250GB, then > 500GB (and growing)
# Records .. ~x Billions
#Attributes / columns .. 15

Expensive & single point of failure ...


The activity .. Filter the data by country
What if the Server goes down during the processing? (Single Point of Failure)
E.g. after processing ~498GB over some ~12 hours

Data size .. > 500GB


# Records .. ~x Billions
#Attributes / columns .. 15

High Capacity Server


The activity .. Filter the data by country
What if the data size exceeds the capacity of the Server? OR
What if the Server goes down during the processing? (Single Point of Failure)

HERE IS THE ULTIMATE (CHEAPER & MORE RELIABLE) SOLUTION ….

The solution is ... Build a “Cluster Computing System ” (Scale-out)

CLUSTER SOFTWARES: HADOOP, SPARK, NOSQL DATABASES

Cluster Capacity = sum of the allocated capacities of every individual server/computer in the cluster (here: a SPARK CLUSTER)

Solution .. Use the “Spark cluster computing software” to solve the large-volume data (Big Data) problem
OUR STRATEGY STEPS …

STEP 1): LOCAL CLUSTER / DEVELOPMENT CLUSTER (To deal with some sample data)
 Set up a local Spark Cluster on your computer / server (Spark setup details available in an upcoming video)
 Write a small pySpark / SparkR program to filter the data (a sketch follows after these steps)
 Test and make sure your pySpark/SparkR program works correctly over the sample data on your local computer/server

STEP 2): PRODUCTION CLUSTER (To deal with large-volume data)


 Assume you already have access to your org’s / client’s production Spark cluster
 Ask your client, admin, or manager for the data path in the cluster
 Usually the data will reside in Hadoop’s HDFS, Hive, AWS S3, or NoSQL DBs like Cassandra
 Add this data path to your program and make minor configuration changes if required
 Submit your program to the cluster and wait a while to get the result/output
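To make STEP 1 concrete, here is a minimal pySpark sketch of the filter-by-country program. The file name, the column name "country", and the value "IN" are illustrative assumptions based on the sample schema shown earlier, not details from the slides.

```python
# filter_by_country.py -- a minimal sketch of the STEP 1 program.
# Assumptions: the data is a CSV file with a header row, it has a "country"
# column (as in the sample schema), and "IN" is just a made-up example value.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("FilterByCountry")
         .getOrCreate())

# On the local/development cluster this can point to a small sample file;
# on the production cluster, replace it with the HDFS/S3/Hive path provided
# by your admin (STEP 2).
df = spark.read.csv("sample_clickstream.csv", header=True, inferSchema=True)

# The same "filter by country" operation done earlier in Excel, now expressed
# as a Spark transformation that runs in parallel across the cluster.
filtered = df.filter(df["country"] == "IN")

# Write the result; the write is also distributed across the cluster.
filtered.write.mode("overwrite").parquet("filtered_by_country")

spark.stop()
```

On the production cluster the same script would typically be handed to spark-submit, for example `spark-submit --master yarn filter_by_country.py` (the yarn master is an assumption; the actual cluster manager depends on how the organisation’s cluster is set up).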
Some facts about existing “production” clusters around the world …
 Usually organisations enable the “Spark Processing Engine” in their existing Hadoop Cluster
 Remember, Spark is independent; Hadoop is not a prerequisite for setting up a Spark Cluster
 We can use Spark without Hadoop

 To save infrastructure & maintenance cost, organisations may enable the Spark service in their existing Hadoop Cluster
 If you ever come across the term “MapReduce Processing Engine (MR Engine)”, note that the “Spark Processing Engine” is an alternative to the MR Engine
 Writing programs for the “Spark Processing Engine” is much easier than writing programs for the “MR Engine”
 The Spark Processing Engine is much faster than the MR Engine
 Very often, Hadoop’s HDFS / Hive / HBase, AWS S3, Vertica, Cassandra etc. are the data sources for our Spark programs in Production Clusters (a small sketch follows below)
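As a hedged illustration of that last point, the snippet below shows how a Spark program might read from a few of these sources. All paths, bucket names, and table names are placeholders; Hive access needs enableHiveSupport() on the session builder, and Cassandra access needs the separate spark-cassandra-connector package, neither of which is assumed here to be configured.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataSourcesDemo").getOrCreate()

# HDFS: a Parquet dataset (placeholder path).
hdfs_df = spark.read.parquet("hdfs:///data/clickstream/")

# AWS S3: a CSV dataset (placeholder bucket; S3A credentials must be configured).
s3_df = spark.read.csv("s3a://my-bucket/clickstream/", header=True)

# Hive: an existing table (needs .enableHiveSupport() when building the session).
# hive_df = spark.table("analytics.clickstream")

# Cassandra: via the spark-cassandra-connector package (placeholder keyspace/table).
# cassandra_df = (spark.read.format("org.apache.spark.sql.cassandra")
#                 .options(keyspace="analytics", table="clickstream")
#                 .load())
```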
Now, Let’s answer these questions …

What is Spark ?
When to use Spark?
How to use Spark?
The Answers …

 What is Spark ?
 Spark is cluster software (to be specific, it is a general-purpose “large-scale data processing engine”)
 When to use Spark?
 Whenever you have to process large volumes of data
 Whenever you have to process high-velocity streaming data (a small streaming sketch follows below)
 Also to implement ML / AI solutions
 How to use Spark?
 To make use of a Spark Cluster, as a developer/analyst, you need to write your programs/queries in your favourite programming language following Spark’s programming guidelines
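For the streaming use case mentioned above, here is a minimal Structured Streaming sketch, essentially the word-count example from the Spark documentation; the socket host and port are illustrative placeholders, and a real pipeline would more likely read from Kafka or files.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

# Read a text stream from a socket (host/port are placeholders for a demo source).
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Split each line into words and keep a running count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the updated counts to the console.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())

query.awaitTermination()
```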
Have a cup of Coffee .. Let’s continue with Part 2 …
Introduction to Apache Spark
(Spark) - Part 2 of 2
PART 1
We are going to answer these questions …
 What is Spark ?
 When to use Spark?
 How to use Spark?

PART 2
Lasts around 10 to 15 minutes …

Overview of Spark Software


Overview of Spark Software -
When we install/set-up the Spark software …
 We get 4 built-in Libraries/Modules ..
 SparkSQL & DataFrames
 Spark Streaming
 MLlib (Machine Learning Library)
 GraphX (Graph)

 We get 4 built-in APIs ..


 Scala API
 Java API
 Python API
 R API
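As a small hedged sketch of the SparkSQL & DataFrames module through the Python API (the data and column names are made-up placeholders mimicking the earlier schema), the same question is answered once with the DataFrame API and once with SQL; the equivalent program could also be written with the Scala, Java, or R APIs.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLDemo").getOrCreate()

# A tiny in-memory DataFrame (placeholder data).
df = spark.createDataFrame(
    [("u1", "IN", "mobile"), ("u2", "US", "desktop"), ("u3", "IN", "desktop")],
    ["userID", "country", "device_type"],
)

# Register the DataFrame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("visits")

# DataFrame API: filter by country.
df.filter(df.country == "IN").show()

# SparkSQL: aggregate visits per country.
spark.sql("SELECT country, COUNT(*) AS visits FROM visits GROUP BY country").show()
```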
Now let’s us land on Apache Spark home page
... to learn more about Spark …

https://spark.apache.org/
Let’s connect at “Contact class” to learn more …
