Introduction to Apache Spark (Spark) - by Praveen
PART 1 runs for around 20 to 30 minutes …
What is Spark?
When to use Spark?
How to use Spark?
What if the data size exceeds the capacity of Excel, i.e. ~1M rows?
The activity: filter the data by country.
* The “FILTER” process may take a long time (~10 to 12 hours) to finish, due to a lack of sufficient computing resources.
What if the data size exceeds the capacity (RAM, processor & disk) of the existing server?
CLUSTER SOFTWARE (e.g. HADOOP, SPARK, NOSQL DATABASES)
SPARK CLUSTER: cluster capacity = sum of the allocated capacities available in every individual server/computer
Solution: use the Spark cluster-computing software to solve the large-volume data (Big Data) problem.
OUR STRATEGY STEPS …
STEP 1: LOCAL CLUSTER / DEVELOPMENT CLUSTER (to deal with some sample data)
Set up a local Spark cluster on your computer/server (Spark setup details are available in an upcoming video)
Write a PySpark/SparkR program to filter the data (see the sketch after this list)
Test and make sure your PySpark/SparkR program works correctly over the sample data on your local computer/server
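Below is a minimal PySpark sketch of Step 1. The file name "sample_data.csv", the "country" column, and the filter value "India" are illustrative assumptions, not details from the original material:

# Minimal local-cluster sketch: filter a large CSV by country with PySpark.
# "sample_data.csv", the "country" column and "India" are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")              # all local cores act as a development cluster
         .appName("filter-by-country")
         .getOrCreate())

df = spark.read.csv("sample_data.csv", header=True, inferSchema=True)
filtered = df.filter(df["country"] == "India")    # the FILTER activity
filtered.write.csv("filtered_output", header=True)

spark.stop()

Once this works on the sample data locally, the same program can be submitted to a production cluster.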
To save infrastructure and maintenance costs, organisations may enable the Spark service in their existing Hadoop cluster.
If you ever come across the term “MapReduce Processing Engine (MR Engine)”, note that the Spark processing engine is an alternative to the MR engine.
Writing programs for the Spark processing engine is much easier than writing programs for the MR engine.
The Spark processing engine is also significantly faster than the MR engine, largely because it can keep intermediate data in memory instead of writing it to disk between stages.
Often, Hadoop’s HDFS/Hive/HBase, AWS S3, Vertica, Cassandra, etc. are the data sources for our Spark programs in production clusters.
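As a hedged sketch of how one Spark program can read from several of these sources (all paths, bucket names, and table names below are illustrative assumptions):

# Reading from different production data sources with PySpark.
# Every path, bucket, and table name here is an illustrative assumption.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("prod-data-sources")
         .enableHiveSupport()      # required for querying Hive tables
         .getOrCreate())

df_hdfs = spark.read.parquet("hdfs:///data/sales/")               # Hadoop HDFS
df_s3 = spark.read.csv("s3a://my-bucket/sales.csv", header=True)  # AWS S3 (needs the hadoop-aws connector)
df_hive = spark.sql("SELECT * FROM sales_db.sales")               # Hive table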
Now, let’s answer these questions …
What is Spark?
When to use Spark?
How to use Spark?
The Answers …
What is Spark?
Spark is cluster software (to be specific, it is a general-purpose, large-scale data processing engine).
When to use Spark?
Whenever you have to process large volumes of data
Whenever you have to process high-velocity streaming data
Also to implement ML / AI solutions
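For the streaming case, here is a minimal hedged sketch using Spark Structured Streaming; the socket source on localhost:9999 is an assumption for illustration:

# Minimal Structured Streaming sketch: read a live text stream and print each micro-batch.
# The socket source on localhost:9999 is an illustrative assumption.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("streaming-example")
         .getOrCreate())

lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

query = (lines.writeStream
         .format("console")       # print each micro-batch to the console
         .outputMode("append")
         .start())
query.awaitTermination()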
How to use Spark?
To make use of a Spark cluster, as a developer/analyst, you write your programs/queries in your favourite supported programming language (e.g. Python, Scala, R, SQL, or Java), following Spark’s programming guidelines; a small sketch follows.
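As a hedged sketch of the two most common styles, the DataFrame API and plain SQL (the file name, the view name "records", and the "country" column are illustrative assumptions):

# The same "filter by country" activity written two ways in PySpark.
# "sample_data.csv", "records", and "country" are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("two-styles")
         .getOrCreate())

df = spark.read.csv("sample_data.csv", header=True, inferSchema=True)

# Style 1: the DataFrame API
df.filter(df["country"] == "India").show()

# Style 2: plain SQL over a temporary view
df.createOrReplaceTempView("records")
spark.sql("SELECT * FROM records WHERE country = 'India'").show()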
Have a cup of coffee … Let’s continue with Part 2 …
Introduction to Apache Spark
(Spark) - Part 2 of 2
PART 1
We answered these questions …
What is Spark?
When to use Spark?
How to use Spark?
PART 2
Runs for around 10 to 15 minutes …
https://spark.apache.org/
Let’s connect at “Contact class” to learn more …