Bdhs - Ebook
Course Introduction
Course Objectives
• Learn how to navigate the Hadoop Ecosystem and understand how to optimize
its use
• Ingest data using Sqoop, Flume, and Kafka
• Implement partitioning, bucketing, and indexing in Hive
• Work with RDD in Apache Spark
• Process real-time streaming data
• Perform DataFrame operations in Spark using SQL queries
• Implement User-Defined Functions (UDF) and User-Defined Aggregate Functions (UDAF) in Spark
Course Prerequisites
• Hadoop Architecture, Distributed Storage (HDFS), and YARN
• Distributed Processing: MapReduce Framework and Pig
• NoSQL Databases: HBase
• Apache Spark: Next Generation Big Data Framework
• Spark SQL: Processing DataFrames
• Stream Processing Frameworks and Spark Streaming
Project Highlights
• Use Hadoop features to predict patterns and share actionable insights for a car insurance company.
• Use Hive features for data engineering and analysis of New York Stock Exchange data.
Skills Covered
• HDFS
• MapReduce
• Flume
• Kafka
• Hive
Big Data is the data that has high volume, variety, velocity, veracity, and value.
Industries leveraging Big Data include banking, manufacturing, consumer, healthcare, technology, and energy.
According to the U.S. Bureau of Labor Statistics, Big Data alone will create about 11.5 million jobs by 2026.
Traditional Decision-Making
Takes a long time to arrive at a decision, thereby losing the competitive advantage
Provides a limited scope of data analytics, that is, only a bird's-eye view
Decision-making is based on what you know, which in turn is based on data analytics.
Solution
Big Data analytics enables faster decision-making, improving the competitive advantage and saving time and energy.
Case Study: Google’s Self-Driving Car
Technical Data
Community Data
Personal Data
Big Data Analytics Pipeline
What Is Big Data?
What Is Big Data?
Big data refers to extremely large data sets that may be analyzed computationally to reveal
patterns, trends, and associations, especially relating to human behavior and interactions.
Big Data at a Glance
Different Types of Data
Growth in Data
Volume, Variety, Velocity, and Veracity
Velocity:
• More than 50,000 Google searches are completed
• More than 125,000 YouTube videos are viewed
• 7,000 tweets are sent out
• More than 2 million e-mails are sent
Veracity: inherent discrepancies in the collected data result in inaccurate predictions.
Unstructured Data Conundrum
Social Media
Case Study: Royal Bank of Scotland
Case Study: Royal Bank of Scotland
100% of this data could be processed whereas only 3% could be processed earlier
with traditional systems.
Case Study: Royal Bank of Scotland
The Royal Bank of Scotland case study yielded the following three outcomes: sentiment analysis, improved customer satisfaction, and reduced processing time.
Challenges of Traditional System
Challenges of Traditional Systems (RDBMS and DWH)
Data size: data ranges from terabytes (10^12 bytes) to exabytes (10^18 bytes).
Growth rate: RDBMS systems are designed for steady data retention rather than rapid growth.
Unstructured data: relational databases can't categorize unstructured data.
Advantages of Big Data
Better decision-making, thanks to Hadoop
Companies Using Big Data
Big Data: Case Study
How often do they pause a program?
How often do they re-watch a program?
Multiple systems
Since multiple computers are used in a distributed system, there is a high chance of:
Doug Cutting created Hadoop and named it after his son's yellow toy elephant. It was inspired by technical papers published by Google.
Characteristics of Hadoop
The four key characteristics of Hadoop are:
Scalable: supports both horizontal and vertical scaling
Reliable: stores copies of the data on different machines and is resistant to hardware failure
Flexible: can store huge amounts of data and decide how to use it later
Economical: can use ordinary computers for data processing
Traditional Database Systems vs. Hadoop
VS.
Hadoop Core: storage, resource management (YARN), and data processing
Components of Hadoop Ecosystem
Components of Hadoop Ecosystem
Sqoop and Flume (data ingestion); YARN (cluster resource management)
Components of Hadoop Ecosystem
HDFS (HADOOP DISTRIBUTED FILE SYSTEM)
Provides streaming access to file system data, along with file permissions and authentication
Components of Hadoop Ecosystem
HBase
Apache Spark
Components of Hadoop Ecosystem
HADOOP MAP-REDUCE
An alternative to writing
Map-Reduce code
An open source
dataflow system
Similar to Impala
Components of Hadoop Ecosystem
HUE (HADOOP USER EXPERIENCE)
HDInsight
Big Data Processing
Problem Statement: In this demonstration, we will walk you through the Simplilearn cloud lab.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Key Takeaways
Unstructured data comprises data that is usually not easily searchable, including formats like audio, video, and social media postings.
Knowledge
Check A bank wants to process 1000 transactions per second.
Which one of the following Vs reflects this real-world use case?
2
a. Volume
b. Variety
c. Velocity
d. Veracity
Velocity is the frequency of incoming data that needs to be processed. The given use case is an example of an application that handles the velocity of data.
Knowledge
Check Why has the popularity of big data increased tremendously in recent years?
3
Unstructured data is growing at astronomical rates, contributing to the big data deluge that's sweeping across enterprise data
storage environments.
Knowledge
Check
What is Hadoop?
4
Hadoop is a framework that allows distributed processing of large datasets across clusters of commodity computers using a
simple programming model.
Knowledge
Check Which of the following is a column-oriented NoSQL database that runs on
top of HDFS?
5
a. MongoDB
b. Flume
c. Ambari
d. HBase
Apache HBase is a NoSQL database that runs on top of Hadoop as a distributed and scalable big data store.
Knowledge
Check
Sqoop is used to _______.
6
a. Import data from relational databases to Hadoop HDFS and export from the Hadoop file system to relational databases
c. Enable nontechnical users to search and explore data stored in or ingested into Hadoop and HBase
Sqoop is used to import data from relational databases to Hadoop HDFS and export from the Hadoop file system to relational databases.
Thank You
Big Data Hadoop and Spark Developer
Hadoop Architecture, Distributed Storage (HDFS),
and YARN
Learning Objectives
Demonstrate how to use Hue, YARN Web UI, and the YARN
command to monitor the cluster
Hadoop Distributed File System (HDFS)
What Is HDFS?
HDFS is a distributed file system that provides access to data across Hadoop clusters.
In the traditional system, storing and retrieving volumes of data had three major issues:
HDFS resolves all the three major issues of the traditional file system.
Cost Reliability
Speed
A patron gifts his popular book collection to a college library. The librarian decides to arrange the books on a small rack. Also, he distributes multiple copies of each book on other racks.
Regular File System vs. HDFS
Size of Data
Metadata
HDFS stores files in a number of blocks. Metadata keeps information about the blocks and their replication; it is stored in the NameNode.
(Diagram: a very large data file is split into blocks B1–B4, which are distributed and replicated across DataNodes — Node A through Node E — while the NameNode keeps the metadata.)
Fault tolerant
(Diagram: HDFS High Availability — an Active NameNode and a Standby NameNode share edit logs in a shared metadata store, exchange heartbeats with the ZooKeeper service for leader election, and have their health monitored.)
Types of HDFS HA Architecture
• Quorum-based storage
• Shared storage using NFS
HDFS Component: NameNode
(Diagram: the NameNode holds file system metadata — for example, File.txt mapping to blocks stored on DataNodes DN1–DN3 — while clients issue block operations to the NameNode and read from or write to DataNodes across racks.)
HDFS Component: Zookeeper
ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace.
The ZooKeeper implementation is simple and replicated, and it puts a premium on high performance, high availability, and strictly ordered access.
HDFS Component: Zookeeper
NameNode Operation
Data Block Split
Each file is split into one or more blocks which are stored and replicated in DataNodes.
(Diagram: a file is split into blocks b1, b2, b3, …, and each block is replicated across multiple DataNodes, with the NameNode tracking the placement.)
The data block approach provides simplified replication, fault-tolerance, and reliability.
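To make the block split concrete, here is a small Python sketch (an illustration, not course code) that computes how many blocks a file occupies, assuming the common HDFS default block size of 128 MB:

import math

BLOCK_SIZE_MB = 128  # common HDFS default; the actual value comes from dfs.blocksize

def blocks_needed(file_size_mb: float) -> int:
    """Number of HDFS blocks a file of the given size occupies."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

# A 514 MB file spans 5 blocks: four full 128 MB blocks plus one 2 MB block.
print(blocks_needed(514))          # 5
print(blocks_needed(100 * 1024))   # a 100 GB file -> 800 blocks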
Block Replication Architecture
(Diagram: blocks B1–B3 are replicated across DataNodes on server 1 and server 2; when a server error occurs, the JobTracker resubmits Job 1 and block replication restores the lost copies.)
Replication Method
Each file is split into a sequence of blocks.
Except for the last one, all blocks in the file are of the same size.
(Diagram: the NameNode records, for example, /foo/data0 with replication 2 on blocks {1, 4} and /foo/data1 with replication 2 on blocks {2, 3}; DataNodes send heartbeats and block reports, and the NameNode issues block operations such as create block, delete block, and replicate block.)
Data Replication Topology
(Diagram: a client writes through the NameNode, and replicas of block B1 are placed on nodes in different racks — for example, R1N3, R2N1, and R2N3.)
HDFS Access
• Web GUI utilized through an HTTP browser
• FS shell for executing commands on HDFS
HDFS Command Line
• Copy file simplilearn.txt from the local disk to the user's directory in HDFS: $ hdfs dfs -put simplilearn.txt simplilearn.txt (this copies the file to /user/username/simplilearn.txt)
• Display a list of the contents of the directory path provided by the user, showing the names, permissions, owner, size, and modification date for each entry: $ hdfs dfs -ls /user/simpl/test
• Create a directory called test under the user's home directory: $ hdfs dfs -mkdir /user/simpl/test
• Delete the directory testing and all its contents: $ hdfs dfs -rm -r testing
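These commands can also be scripted. Below is a minimal Python sketch (not part of the course material) that shells out to the hdfs client, assuming hdfs is available on the PATH:

import subprocess

def hdfs(*args: str) -> str:
    """Run an 'hdfs dfs' subcommand and return its standard output."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Mirror the commands above: create a directory, upload a file, and list it.
hdfs("-mkdir", "-p", "/user/simpl/test")
hdfs("-put", "simplilearn.txt", "simplilearn.txt")
print(hdfs("-ls", "/user/simpl/test"))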
Assisted Practice
Problem Statement: In this demonstration, you will explore a few basic HDFS command-line operations.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Unassisted Practice
Problem Statement: Using command lines of HDFS, perform the below tasks:
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Unassisted Practice
Steps to Perform
• Create a directory named “Simplilearn"
hdfs dfs -mkdir Simplilearn
• Transfer a sample text file from your local filesystem to HDFS directory
hdfs dfs -put /home/simplilearn_learner/test.txt Simplilearn
The file browser in Hue lets you view and manage your HDFS directories and files.
Additionally, you can create, move, rename, modify, upload, download, and delete directories and files.
Assisted Practice
Problem Statement: In this demonstration, you will learn how to access HDFS using Hue.
You will also learn how to view and manage your HDFS directories and files using Hue.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
YARN: Introduction
What Is YARN?
(Diagram: in Hadoop 1, MapReduce handled both cluster resource management and data processing on top of HDFS, the redundant, reliable storage. In Hadoop 2, YARN takes over cluster resource management, so MapReduce and other frameworks handle only data processing over HDFS.)
YARN: Use Case
• Yahoo was the first company to embrace Hadoop, and it is a trendsetter within the
Hadoop ecosystem. In late 2012, it struggled to handle iterative and stream
processing of data on Hadoop infrastructure due to MapReduce limitations.
• After implementing YARN in the first quarter of 2013, Yahoo has installed more than
30,000 production nodes on
o Spark for iterative processing
o Storm for stream processing
o Hadoop for batch processing
• Such a solution was possible only after YARN was introduced and multiple
processing frameworks were implemented.
YARN: Advantages
• Lower operational costs
(Diagram: Hadoop 2.7 — MapReduce and other frameworks handle data processing on top of YARN.)
(Diagram: YARN architecture — clients submit applications to the ResourceManager, which contains the Scheduler and the ApplicationsManager; NodeManagers on each cluster node host containers and ApplicationMasters alongside the DataNodes.)
• Each ApplicationMaster requests resources from the ResourceManager, then works with containers provided by NodeManagers.
• The ApplicationMaster negotiates resources for a single application. The application runs in the first container allotted to it.
• A container is a fraction of the NodeManager (NM) capacity and is used by the client for running a program.
• Each NodeManager takes instructions from the ResourceManager, reports to it, and handles containers on a single node.
• The NodeManager (NM) is the slave. When it starts, it announces itself to the ResourceManager (RM) and offers resources to the cluster.
YARN Architecture Element: ResourceManager
The RM mediates the available resources in the cluster among competing applications for
maximum cluster utilization.
The ResourceManager has two main components: the Scheduler and the ApplicationsManager.
The Scheduler has a policy plug-in to partition cluster resources among various applications.
The ApplicationsManager is an interface which maintains a list of applications that have been submitted, are currently running, or have completed.
The ApplicationsManager accepts job submissions, negotiates the first container for executing the application, and restarts the ApplicationMaster container on failure.
How a ResourceManager Operates
The figure shown here displays all the internal components of the ResourceManager.
ResourceManager
(Internal components include: AdminService, ResourceTrackerService, NMLivelinessMonitor, ApplicationsManager, ApplicationMasterService, AMLivelinessMonitor, ApplicationMasterLauncher, ContainerAllocationExpirer, and Security.)
ResourceManager in High Availability Mode
Before Hadoop 2.4, the ResourceManager was the single point of failure in a YARN cluster.
The High Availability (HA) feature adds an Active/Standby ResourceManager pair to remove this single point of failure.
(Diagram: automatic failover between an Active ResourceManager and a Standby ResourceManager, coordinated through ZooKeeper-based electors and the RM state store; clients, ApplicationMasters, and NodeManagers connect to whichever ResourceManager is active.)
YARN Architecture Element: ApplicationMaster
The ApplicationMaster in YARN is a framework-specific library which negotiates resources from the RM and
works with the NodeManager or Managers to execute and monitor containers and their resource consumption.
The ApplicationMaster:
When a container is leased to an application, the NodeManager sets up the container’s environment
including the resource constraints specified in the lease and any dependencies.
A YARN container is a collection of a specific set of resources to use in certain numbers on a specific node.
(Diagram: workloads running on YARN — batch (MapReduce), interactive (Tez), online (HBase), streaming (Storm, S4, …), graph (Giraph), in-memory (Spark), HPC MPI (OpenMPI), and other (search, Weave, …) — with YARN providing cluster resource management on top of the distributed file system.)
How YARN Runs an Application
There are five steps involved in running an application by YARN:
(Diagram: Step 1 — the client submits an application, such as my-Hadoop-app, to the ResourceManager; NodeManagers and DataNodes run on the cluster nodes alongside the NameNode, and an ApplicationMaster is started for the application.)
Step2: ResourceManager Allocates a Container
When the ResourceManager accepts a new application submission,
one of the first decisions the Scheduler makes is selecting a container.
(Diagram: the ApplicationMaster sends a resource request to the ResourceManager — for example, 1 x Node1 / 1 GB / 1 core and 1 x Node2 / 1 GB / 1 core.)
Step3: ApplicationMaster Contacts NodeManager
After a container is allocated, the ApplicationMaster asks the NodeManager managing the host on
which the container was allocated to use these resources to launch an application-specific task.
Step4: ResourceManager Launches a Container
The NodeManager does not monitor tasks; it only monitors the resource usage in the containers.
Step5: Container Executes the ApplicationMaster
After the application is complete, the ApplicationMaster shuts itself down and releases its own container.
Tools for YARN Developers
YARN Web UI, Hue Job Browser, and the YARN command line
YARN Web UI
Hue Job Browser: allows you to monitor the status of a job, kill a running job, and view logs.
YARN Command Line
Most of the YARN commands are for the administrator rather than the developer.
- yarn -help: lists all YARN commands
- yarn -version: prints the version
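As a small illustration (not from the course), the same monitoring can be scripted; the sketch below assumes the yarn client is on the PATH and uses the standard application and logs subcommands (the application ID shown is hypothetical):

import subprocess

def yarn(*args: str) -> str:
    """Run a yarn CLI subcommand and return its standard output."""
    return subprocess.run(["yarn", *args], capture_output=True,
                          text=True, check=True).stdout

# List the applications currently known to the ResourceManager.
print(yarn("application", "-list"))

# Fetch the aggregated logs of a finished application (hypothetical ID).
print(yarn("logs", "-applicationId", "application_1234567890123_0001"))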
Problem Statement: In this demonstration, you will learn how to use YARN.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Key Takeaways
Demonstrate how to use Hue, YARN Web UI, and the YARN
command to monitor the cluster
Quiz
Knowledge
Check Which of the following statements best describes how a large (100 GB) file is stored in HDFS?
1
a. The file is replicated three times by default. Each copy of the file is stored on a separate DataNode.
b. The master copy of the file is stored on a single DataNode. The replica copies are divided into fixed-size blocks which are stored on multiple DataNodes.
c. The file is divided into fixed-size blocks which are stored on multiple DataNodes. Each block is replicated three times by default. Multiple blocks from the same file might reside on the same DataNode.
d. The file is divided into fixed-size blocks which are stored on multiple DataNodes. Each block is replicated three times by default. HDFS guarantees that different blocks from the same file are never on the same DataNode.
Knowledge
Check How many blocks are required to store a file of size 514 MB in HDFS using
default block size configuration?
2
a. 4
b. 5
c. 6
d. 7
c. Both A and B
a. Master/Slave
c. Peer to Peer
a. FIFO scheduler
b. Capacity scheduler
c. Fair scheduler
a. Node Manager
b. Resource Manager
c. Application Master
Problem Statement:
PV Consulting is one of the top consulting firms for big data projects.
They mostly help big and small companies to analyze their data.
For Spark and Hadoop MapReduce applications, they started using YARN as the resource manager. Your
task is to provide the following information for any job that is submitted to YARN, using the
YARN console and the YARN Cluster UI:
Data Lake
Data sources include log files, data from click-streams, social media, and internet connected devices.
Data Lake vs. Data Warehouse
Data Quality: a data lake holds any data that may or may not be curated (i.e., raw data), whereas a data warehouse holds highly curated data that serves as the central version of the truth.
01 Big data ingestion involves transferring data, especially unstructured data from where
it originated, into a system where it can be stored and analyzed such as Hadoop.
03 In scenarios where the source and destination do not have the same data format or protocol,
data transformation or conversion is done to make the data usable by the destination system.
Big Data Ingestion Tools
Choosing an appropriate data ingestion tool is important; the choice depends on factors like the data source, target, and required transformations.
Data ingestion tools provide users with a data ingestion framework that makes it easier to
extract data from different types of sources and support a range of data transport protocols.
Data ingestion tools also eliminate the need for manually coding individual data pipelines for every data source and accelerate data processing by helping you deliver data efficiently to ETL tools.
Apache Sqoop
What Is Sqoop?
• Exports can be used to put data from Hadoop into a relational database.
Why Sqoop?
While companies across industries were trying to move from structured relational databases to
Hadoop, there were concerns about the ease of transitioning existing databases.
Production system
resource consumption
User consideration
Sqoop is an Apache Hadoop Ecosystem project whose responsibility is to import or export operations
across relational databases. The reasons for using Sqoop are as follows:
Sqoop is required when a database is imported from a Relational Database (RDB) to Hadoop or vice versa.
It has access to the Hadoop core, which helps in using mappers to slice the incoming data into
unstructured formats and place the data in HDFS.
It exports data back into the RDB, ensuring that the schema of the data in the database is
maintained.
Sqoop Execution Process
To import data present in MySQL database using Sqoop, use the following command:
Assisted Practice
Problem Statement: In this demonstration, you will learn how to list the tables of a MySQL database through Sqoop.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Sqoop Import Process
The imported data is saved in a directory on HDFS based on the table being imported.
Users can:
• Specify any alternative directory where the files should be populated
• Override the format in which data is copied by explicitly specifying the
field separator and recording terminator characters
• Import data in the Avro data format by specifying the --as-avrodatafile option with the import command
Sqoop supports different data formats for importing data and provides several options for
tuning the import operation.
Exporting Data from Hadoop Using Sqoop
Use the following command to export data from Hadoop using Sqoop:
Exporting Data from Hadoop Using Sqoop
Perform the following steps to export data from Hadoop using Sqoop:
1 2
Introspect the database
Transfer the data from
for metadata and
HDFS to DB
transfer the data
Default
Sqoop
connector
• By default, Sqoop typically imports data using four parallel tasks called mappers
• Increasing the number of tasks might improve import speed
• You can influence the number of tasks using the -m or --num-mappers option
Sample Sqoop Commands
$ sqoop import --driver com.mysql.jdbc.Driver --connect jdbc:mysql://localhost/a10 --username root --password root --table Sqoop_demo --target-dir /user/Sqoop_batch6 -m 1 --as-textfile

$ sqoop import --driver com.mysql.jdbc.Driver --connect jdbc:mysql://localhost/a10 --username root --password root --table Sqoop_demo --target-dir /user/Sqoop_batch6 -m 1 --as-textfile --where "id>2"

$ sqoop import --driver com.mysql.jdbc.Driver --connect jdbc:mysql://localhost/a10 --username root --password root --target-dir /user/Sqoop_batch6 -e "select * from Sqoop_demo where id=13 AND \$CONDITIONS" -m 1 --as-textfile

$ sqoop import --driver com.mysql.jdbc.Driver --connect jdbc:mysql://localhost/a10 --username root --password root --table Sqoop_demo --target-dir /user/Sqoop_batch6 -m 1 --as-textfile --split-by id

$ sqoop export --connect jdbc:mysql://localhost/a2 --username root --password root --table report1 --export-dir /user/hive/warehouse/mobile_vs_sentiments1/ --input-fields-terminated-by '\001'
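For completeness, the first sample import can also be assembled programmatically. The sketch below is an illustration rather than course code; it assumes sqoop and the MySQL JDBC driver are installed and simply builds the command with its standard long options before running it:

import subprocess

# Connection details mirror the sample commands above.
cmd = [
    "sqoop", "import",
    "--driver", "com.mysql.jdbc.Driver",
    "--connect", "jdbc:mysql://localhost/a10",
    "--username", "root",
    "--password", "root",
    "--table", "Sqoop_demo",
    "--target-dir", "/user/Sqoop_batch6",
    "-m", "1",
    "--as-textfile",
]
subprocess.run(cmd, check=True)  # raises CalledProcessError if the import fails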
Exploring a Database with Sqoop
use Simplilearn;
show tables;
Limitations of Sqoop
• Sqoop is not well suited to NoSQL databases because it is tightly coupled with JDBC semantics.
Assisted Practice
Problem Statement: In this demonstration, you will learn how to use Sqoop commands to import and export data from MySQL to HDFS and vice versa.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Unassisted Practice
Problem Statement: Using SQL and Sqoop commands, perform the below tasks:
• Create a database
• Create a table “employee” with the following fields: Id, Name, Salary, Department, and Designation
Id is the primary key for the table
• Insert at least 5 records into the table
• Import the database and table into Sqoop
• Import only the records for which Salary is greater than 50000
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Unassisted Practice
Steps to Perform
• MySQL
mysql -u labuser -p
Password : simplilearn
create table employee(Id INT NOT NULL, Name VARCHAR(100) NOT NULL, Salary INT NOT NULL,
Department VARCHAR(100) NOT NULL, Designation VARCHAR(100) NOT NULL, PRIMARY KEY(Id));
Steps to Perform
• Sqoop
Apache Flume is a distributed and reliable service for efficiently collecting, aggregating, and moving large
amounts of streaming data into the Hadoop Distributed File System (HDFS).
It has a simple and flexible architecture which is robust and fault-tolerant based on streaming data flows.
Log Data
Source Sink
The current issue involves determining how to send the logs to a setup that has
Hadoop. The channel or method used for the sending process must be reliable,
scalable, extensive, and manageable.
Why Flume?
To solve this problem, a log aggregation tool called Flume can be used. Apache Sqoop and Flume are the tools that are used to gather data from different sources and load it into HDFS. Sqoop is used to extract structured data from databases like Teradata, Oracle, and so on, whereas Flume sources data stored in different systems and deals with unstructured data.
Flume Model
Flume Model
Agent: Source (e.g., tail of Apache HTTPD logs) → Sink (downstream processor node)
Processor: Source → Decorator (e.g., extract the browser name from the log string and attach it to the event) → Sink (downstream processor node)
Collector: Source → Sink
Achieve a scalable data path that can be used to form a topology of agents
Extensibility:
Flume can be extended by adding Sources and Sinks to existing storage layers or data platforms
• General Sources include data from files, syslog, and standard output from any Linux process
(Diagram: Flume reads from sources such as log files and writes to sinks such as HDFS.)
Scalability in Flume
Flume has a horizontally scalable data path which helps in achieving load balance
in case of higher load in the production environment.
(Diagram: collectors gather data from sensor data, log files, Unix syslog, network sockets, and status updates.)
Flume Data Flow
Flume Agent
• Source: receives events from the external actor that generates them
• Sink: sends an event to its destination; it stores the data into centralized stores like HDFS and HBase
• Channel: buffers events from the source until they are drained by the sink; it acts as a bridge between the sources and the sinks
• Agent: a Java process that configures and hosts the source, channel, and sink
(Diagram: Web Server → Source → Channel → Sink → HDFS.)
Flume Source
Netcat:
Listens on a given port and turns each line of text into an event
Kafka:
Receives events as messages from a Kafka topic
Syslog:
Captures messages from UNIX syslog daemon over the network
Spooldir:
Used for ingesting data by placing files to be ingested into a "spooling" directory on disk
Flume Sink
Following are the types of Flume Sink:
Null: discards all events received (the Flume equivalent of /dev/null)
HDFS: writes events to a file in the specified directory in HDFS
HBaseSink: stores events in HBase
Flume Channels
Following are the types of Flume channel:
Memory
• Stores events in the machine’s RAM
• Extremely fast, but not reliable as memory is volatile
File
• Stores events on the machine’s local disk
• Slower than RAM, but more reliable as data is written to disk
JDBC
• Stores events in a database table using JDBC
• Slower than file channel
Flume Agent Configuration File
All Flume agents can be configured in a single properties file.
Example: configure a Flume agent to collect data from remote spool directories and save it to HDFS through a memory channel.
(Diagram: agent1 — spool directory /var/Flume/incoming → memory channel → HDFS path /simplilearn/logdata.)
Example : Configuring Flume Configuration
agent1.sources = src1
agent1.sinks = sink1
agent1.channels = ch1

agent1.channels.ch1.type = memory

agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /var/Flume/incoming
agent1.sources.src1.channels = ch1          # connect source and channel

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /simplilearn/logdata
agent1.sinks.sink1.channel = ch1            # connect sink and channel
Flume: Sample Use Cases
Flume can be used for a variety of use cases:
Problem Statement: In this demonstration, you will learn how to ingest Twitter data using Apache Flume and load it into HDFS.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Apache Kafka
Apache Kafka: Introduction
Kafka is a high-performance, real-time messaging system. It is an open source tool and is a part of Apache projects.
• It is highly fault-tolerant.
• It is highly scalable.
• It can process and send millions of messages per second to several receivers.
Apache Kafka: Use Cases
Messaging service: Kafka can be used to send and receive millions of messages in real time.
Real-time stream processing: Kafka can be used to process a continuous stream of information in real time and pass it to stream processing systems such as Storm.
Log aggregation: Kafka can be used to collect physical log files from multiple systems and store them in a central location such as HDFS.
Commit log service: Kafka can be used as an external commit log for distributed systems.
Event sourcing: Kafka can be used to maintain a time-ordered sequence of events.
Website activity tracking: Kafka can be used to process real-time website activity such as page views, searches, or other actions users may take.
Aggregating User Activity Using Kafka
Kafka can be used to aggregate user activity data, such as clicks, navigation, and searches, from different websites of an organization; such user activity can be sent to a real-time monitoring system and to a Hadoop system for offline processing.
(Diagram: Customer Portals 1–3 → Kafka cluster → real-time monitoring system and Hadoop offline processing.)
Kafka in LinkedIn
Monitoring: collect metrics; create monitoring dashboards
Messaging: used for message queues in content feeds; used as a publish-subscribe system for searches and content feeds
• Messages represent information such as lines in a log file, a row of stock market data, or an error message.
• Messages are grouped into categories called topics. Example: LogMessage and StockMessage.
(Diagram: producers write messages to topics on brokers in the Kafka cluster, and consumers read them. A topic such as "simple" has partitions — Partition 0 and Partition 1 — each holding a numbered, ordered sequence of messages; writes append to a partition and reads consume from it.)
A topic is divided into one or more partitions, each consisting of an ordered set of messages.
Topics
Topics are divided into partitions, which are the unit of parallelism in Kafka.
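The partition count is fixed when the topic is created. As a hedged illustration (not part of the course), the sketch below uses the third-party kafka-python package's admin client to create a two-partition topic matching the diagram; the broker address is assumed:

from kafka.admin import KafkaAdminClient, NewTopic

# Assumes a broker is reachable at localhost:9092.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Create the topic "simple" with two partitions (the unit of parallelism)
# and replication factor 1, suitable for a single-broker development setup.
admin.create_topics([NewTopic(name="simple", num_partitions=2, replication_factor=1)])
admin.close()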
Partition Distribution
• One server is marked as a leader for the partition and the others are marked as followers.
• The leader controls the read and write for the partition, whereas, the followers replicate the data.
(Diagram: Partition 0 hosted on Server 1 and Partition 1 on Server 2, each serving writes and reads.)
Producers
• The producers also decide which partition to place the message into.
Apache Kafka Architecture
Kafka Architecture
Given below is a Kafka cluster architecture:
Problem Statement: In this demonstration, you will learn how to set up a Kafka cluster on CloudLab.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Kafka APIs
Kafka has four core APIs:
The four core APIs are the Producer, Consumer, Streams, and Connector APIs.
The Consumer API allows applications to read streams of data from topics in the Kafka cluster.
The producer side APIs provide an interface to connect to the cluster and insert messages into a topic.
The steps involved in programming are as follows:
1. Set up the producer configuration
2. Get a handle to the producer connection
3. Create messages as key-value pairs
4. Submit the messages to a particular topic
5. Close the connection
By default, a message is submitted to a particular partition of the topic based on the hash value of the key.
A programmer can override this with a custom partitioner.
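The course's own example uses the Java producer API. As an alternative hedged sketch, the same five steps look like this with the third-party kafka-python client (broker address and topic name are illustrative):

from kafka import KafkaProducer

# Steps 1-2: set up the producer configuration and get a handle to the connection.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Steps 3-4: create messages as key-value pairs and submit them to a topic.
# With no custom partitioner, the hash of the key decides the partition.
producer.send("simple", key=b"user42", value=b"page_view:front_page")

# Step 5: flush any pending messages and close the connection.
producer.flush()
producer.close()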
Producer Side API Example: Step 1
The consumer side APIs provide an interface to connect to the cluster and get messages from a topic.
The steps involved in programming are:
1. Set up the consumer configuration
2. Get a handle to the consumer connection
3. Get a stream of messages for a topic
4. Loop over the messages and process them
5. Close the connection
Messages can be read from a particular partition or from all the partitions.
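Again as a hedged sketch with the third-party kafka-python client (the course's own example uses the Java consumer API), the five consumer steps map onto the following; the broker address, group ID, and timeout are illustrative:

from kafka import KafkaConsumer

# Steps 1-3: configure the consumer, connect, and subscribe to a topic stream.
consumer = KafkaConsumer("simple",
                         bootstrap_servers="localhost:9092",
                         group_id="demo-group",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=10000)  # stop iterating after 10s idle

# Step 4: loop over the messages and process them.
for message in consumer:
    print(message.partition, message.offset, message.key, message.value)

# Step 5: close the connection.
consumer.close()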
Consumer Side API Example: Step 1
Confluent is an alternative to plain Kafka Connect; it comes with some additional tools and clients, as well as some additional pre-built connectors.
Problem Statement: In this demonstration, you will learn how to create a sample Kafka data pipeline using a producer and a consumer.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Key Takeaways
a. Apache Flume
b. Apache Kafka
c. Apache Sqoop
Knowledge
Check
Which of the following is a Flume Source?
2
a. Null
b. HDFS
c. Spooldir
d. Sink
c. Both A and B
a. Sqoop
b. Flume
c. Kafka
Knowledge
Check The parallelism of a Kafka topic can be set using the parameter:
5
a. replication_factor
b. partitions
c. threads
d. concurrent
Problem Statement:
Recently, they started getting a lot of errors on the portal. They have collected these errors
from all the applications and compiled them in a text file. Processing logs is a big task as an
application can generate a lot of logs in a single day. They want to send all logs to HDFS so
they can check which are the most frequent errors they are getting.
You are given an error log file containing the details below.
1. Dates
2. Server
3. Error message
You must read data from the text file and send it to Kafka and to a Flume script.
Also, you should be able to read data from Kafka and push it into HDFS.
Understand Pig
MapReduce is a programming model that simultaneously processes and analyzes huge data sets by logically splitting them across separate nodes of a cluster. The Map phase sorts the data, and the Reduce phase segregates it into logical groups, removing the bad data and retaining the necessary information.
Why MapReduce?
Election results
Polling booth
Tellers
ballots Poll count Total count for each
in each booth candidate
MapReduce: Analogy
Map phase: reads the assigned input split from HDFS, parses the input into records as key-value pairs, applies the map function to each record, and informs the master node of its completion.
Partition phase: each mapper must determine which reducer will receive each of the outputs; for any key, the destination partition is the same, and the number of partitions equals the number of reducers.
Shuffle phase: fetches input data from all map tasks for the portion corresponding to the reduce task's bucket.
Sort phase: merge-sorts all map outputs into a single run.
Reduce phase: applies the user-defined reduce function to the merged run.
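To tie the phases together, here is a small local simulation in Python (an illustration, not course code) of the map, shuffle/sort, and reduce phases for a word count; the input lines are made up:

from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: parse each record and emit (word, 1) key-value pairs."""
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_phase(sorted_pairs):
    """Reduce: apply the user-defined function (a sum) to each merged key group."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data big insights", "hadoop processes big data"]

# The sort stands in for the partition, shuffle, and sort phases that Hadoop
# performs between mappers and reducers.
pairs = sorted(map_phase(lines), key=itemgetter(0))

for word, total in reduce_phase(pairs):
    print(word, total)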
Map Execution: Distributed Two Node Environment
(Diagram: on each of two nodes, files are loaded from local HDFS stores, an InputFormat defines the splits, RecordReaders (RR) produce input (k, v) pairs, and multiple map tasks process them in parallel.)
Each reduce task reads from every map task, and each read returns the record groups for that reduce task.
Reduce phase cannot start until all mappers have finished processing.
MapReduce Jobs
MapReduce Jobs
A job is a MapReduce program that causes multiple map and reduce functions to run in parallel over the life of the program.
Map process: an initial ingestion and transformation step where initial input records are processed in parallel.
Node Manager: keeps track of individual map tasks, which can run in parallel.
Hadoop MapReduce Job Work Interaction
MapReduce characteristics: leverages commodity hardware and storage; allows parallelism.
The user or developer is required to set the framework with the following parameters:
Responsibilities
Developer Framework
Writable data types: In the Hadoop environment,
objects that can be put to or received from files and
across the network must obey the Writable interface.
Interfaces
The interfaces in Hadoop are as follows:
The table lists a few important data types and their functions:
MapReduce can specify how its input is to be read by defining an InputFormat. The table lists some of the
classes of InputFormats provided by the Hadoop framework:
The table lists some of the key classes of OutputFormats provided by the Hadoop framework:
TextOutputFormat: the default OutputFormat; writes records as lines of text, with each key-value pair separated by a TAB character. This can be customized using the mapred.textoutputformat.separator property. The corresponding InputFormat is KeyValueTextInputFormat.
SequenceFileOutputFormat: writes sequence files to save the output; it is compact and compressed.
SequenceFileAsBinaryOutputFormat: writes keys and values in raw binary format into a sequence file container.
Helps to boost efficiency when a map or a reduce task needs access to common data.
Allows a cluster node to read the imported files from its local file system instead
of retrieving the files from other cluster nodes.
Allows both single files and archives (such as zip and tar.gz).
Copies files only to slave nodes. If there are no slave nodes in the cluster,
distributed cache copies the files to the master node.
Allows access to the cached files from mapper or reducer applications to make sure that
the current working directory (./) is added into the application path.
Allows one to reference the cached files as though they are present in the
current working directory.
Using Distributed Cache - Step 1
Composite join: a map-side join on very large formatted input datasets sorted and partitioned by a foreign key.
Replicated join: a map-side join that works in situations where one of the datasets is small enough to cache.
Reduce Side Join
(Diagram: input splits from data set A — e.g., (Bob, "md") — and data set B are processed by join mappers, which build a temporary lookup list and write joined output parts.)
A reduce side join works in the following ways:
• It reads all files from the distributed cache and stores them in in-memory lookup tables.
• The mapper processes each record and joins it with the data stored in memory.
(Diagram: each dataset — e.g., sorted partitions A1 and B1 — is divided into composite input splits by hash(fk) % number of partitions, and mappers join the matching splits to produce output parts.)
• All datasets are divided into the same number of partitions, sorted and partitioned by the foreign key.
Composite Join
Use a composite join when all datasets are sufficiently large and when there is a need for an inner join or a full outer join.
SQL analogy:
SELECT users.ID, users.Location, comments.upVotes
FROM users
[INNER|LEFT|RIGHT] JOIN comments
ON users.ID = comments.UserID
Extensible
Self-optimizing
Easily programmed
Pig is a scripting platform designed to process and analyze large data sets, and it runs
on Hadoop clusters.
Pig—Example
Yahoo has scientists who use grid tools to scan through petabytes of data.
Components of Pig
Pig Operations
Three stages: load data and write the Pig script; Pig operations (parse, optimize, plan); and execution of the plan.

A = LOAD 'myfile' AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH A GENERATE x, COUNT(B);
STORE D INTO 'output';

Pig then:
• Parses and checks the script
• Optimizes the script
• Plans execution
• Submits to Hadoop
• Monitors job progress
• Results are dumped on screen or stored in HDFS
Salient Features
Developers and analysts like to use Pig as it offers many features.
• Schemas can be assigned dynamically
• Step-by-step procedural control
• Supports UDFs and data types
Pig Data Model
Data model: advantages of a nested data model; atomic values; SQL vs. Pig comparison.
Use the following URLs to download different datasets for Pig development:
Datasets URL
Logical Plan
(Diagram: a statement passes through the query parser and a type check with schema to produce a logical plan; the optimizer produces an optimized logical plan; the logical-to-physical translator produces a physical plan; the physical-to-MapReduce translator produces a MapReduce plan; and the MapReduce launcher executes it.)
Various Relations Performed by Developers
Some of the relations performed by Big Data and Hadoop developers are as follows:
• Filtering: filtering of data based on a conditional clause, such as grade or pay.
• Transforming: making data presentable for the extraction of logical data.
• Sorting: arranging the data in either ascending or descending order.
• Combining: performing a union operation on the data stored in the variable.
Problem Statement: In this demonstration, you will analyze web log data using MapReduce and solve
various real-world KPIs.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Assisted Practice
Problem Statement: In this demonstration, you will analyze sales data and solve KPIs using Pig.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Unassisted Practice
Problem Statement: Use Pig to count the number of words in a text file.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Unassisted Practice
Steps to Perform
• Pig
$ pig -x local
Understand Pig
a. Pig
b. Pig -x MapReduce
c. Pig -x local
Pig or Pig -x MapReduce command can be used to run Pig in MapReduce mode.
Knowledge
Check Which of the following commands is used to start Pig in local mode?
2
a. Pig -x local
b. Pig -x MapReduce
c. Pig
d. Both b and c
Knowledge
Check How many phases exist in MapReduce?
3
a. 4
b. 5
c. 6
d. 2
Knowledge
Check In how many ways can Pig Latin program be written?
4
a. 2
b. 4
c. 3
d. 5
a. Enterprise Analytics
b. Gaussian analysis
Problem Statement:
The US Department of Transport collects statistics from all the airlines, which include airline
details, airport details, and flight journey details.
These airlines have global presence and they operate almost in every country.
Flight data can help to decide which airline provides better service and find the routes in
which flights are getting delayed.
The data collected is present in the files: flight.csv, airline.csv, and airport.csv.
You are hired as a big data consultant to provide important insights.
Hive provides a SQL-like interface for users to extract data from the Hadoop system.
(Diagram: Hive runs on top of resource management and storage layers — HDFS and HBase.)
Features of Hive
Uses HiveQL
The organization analyzes positive, negative, and neutral reviews using Hive.
Hive Architecture
Hive Architecture
The major components of Hive architecture are: Hadoop core components, Metastore, Driver, and Hive clients.
(Diagram: Hive clients connect via JDBC/ODBC drivers; the driver's parser, planner, optimizer, and execution engine run over MapReduce and HDFS, with the Metastore backed by an RDBMS.)
Job Execution Flow in Hive
1 Parse HiveQL
2 Make optimizations
3 Plan execution
5 Monitor progress
• Command-line shell: Beeline
• Hue Web UI: Hive Query Editor, Metastore Manager
• ODBC / JDBC
Connecting with Hive
To execute a file with the -f option: beeline -u ... -f simplilearn.hql
To use HiveQL directly from the command line with the -e option: beeline -u ... -e 'SELECT * FROM users'
To continue running a script even after an error: beeline -u ... --force=true
Running Hive query
SQL commands are terminated with a semicolon (;)
Hive Metastore
Managing Data with Hive
Hive uses Metastore service to store metadata for Hive tables.
• Hive Tables are stored in HDFS and the relevant metadata is stored in the Metastore
What Is Hive Metastore?
The Metastore is the component that stores the system catalog which
contains metadata about tables, columns, and partitions.
Use of Metastore in Hive
Hive tables are stored by default under /user/hive/warehouse on HDFS.
• Each table is a directory within the default location having one or more files
(Example: a customers table with columns customer_id, name, and country is stored at /user/hive/warehouse/customers.)
In HDFS, Hive data can be split into more than one file.
Hive DDL and DML
Defining Database and Table
• Databases and tables are created and managed using the DDL (Data Definition Language) of HiveQL
/user/hive/warehouse/dbname.db/tablename
Table Creation: Example
• String: STRING
Changing Table Data Location
• Use LOCATION to specify the directory where you want to reside your data in HDFS
sqoop import \
  --connect jdbc:mysql://localhost/simplilearn \
  --username training \
  --password training \
  --fields-terminated-by '\t' \
  --table employees \
  --hive-import

--hive-import creates a table accessible in Hive.
What Is HCatalog
Problem Statement: In this demonstration, you will learn how to use the Hive query editor for real-time analysis and data filtering.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Assisted Practice
Problem Statement: In this demonstration, you will learn how to use the Hive editor in web console.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Assisted Practice
Problem Statement: In this demonstration, you will learn how to use Hive to import data from an external
source and perform data representation and analysis.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
File Format Types
File Format Types
Formats to create Hive tables in HDFS include text files and the Parquet format (which is not human-readable).
File Format: Avro File Format
Avro file format: efficient storage; embeds schema metadata in the file.
File Format: Parquet File Format
It can be serialized as 4 bytes when stored as a Java int and 9 bytes when stored as a Java String.
Data Serialization Framework
Offers compatibility
Data Types Supported in Avro
boolean: a binary value
record: a user-defined type composed of one or more named fields
Hive Table - CREATE TABLE orders (id INT, name STRING, title STRING)
{"namespace":"com.simplilearn",
"type":"record",
"name":"orders",
"fields":[
Avro Schema {"name":"id", "type":"int"},
{"name":"name", "type":"string"},
{"name":"title", "type":"string"}]
}
Other Avro Operations
{"namespace":"com.simplilearn",
"type":"record",
"name":"orders",
"fields":[
{"name":"id", "type":"int"},
{"name":"name", "type":"string", "default":"simplilearn"},
{"name":"title", "type":"string","default":"bigdata"}]
}
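As a hedged illustration of this schema in use (not part of the course), the third-party fastavro package can serialize and read records against it in memory:

from io import BytesIO
import fastavro

schema = {
    "namespace": "com.simplilearn",
    "type": "record",
    "name": "orders",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string", "default": "simplilearn"},
        {"name": "title", "type": "string", "default": "bigdata"},
    ],
}
parsed = fastavro.parse_schema(schema)

# Write two records into an in-memory Avro container, then read them back.
buf = BytesIO()
fastavro.writer(buf, parsed, [{"id": 1, "name": "alice", "title": "engineer"},
                              {"id": 2, "name": "bob", "title": "analyst"}])
buf.seek(0)
for record in fastavro.reader(buf):
    print(record)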
Create New Table with Parquet
Reading Parquet Files Using Tools
Hive Optimization: Partitioning, Bucketing, and Sampling
Data Storage
All files in a data set are stored in a single Hadoop Distributed File System or HDFS directory.
(Diagram: a Hive database (DB) contains tables stored as HDFS directories, which in turn contain buckets (files); data file partitioning reduces query time.)
Example of a Non-Partitioned Table
A partition column is a "virtual column" where data is not actually stored in the file.
Data insertion into partitioned tables can be done in two ways or modes:
• Static partitioning
• Dynamic partitioning
Static Partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
Viewing Partitions
Commands that are supported on Hive partitioned tables to view and delete partitions.
Following are the instances when you need to use partitioning for tables:
Following are the instances when you should avoid using partitioning:
• When columns have too many unique rows
• When creating a dynamic partition, as it can lead to a high number of partitions
• When the partition is less than 20k
Bucketing
Bucketing in Hive
What Do Buckets Do?
Buckets distribute the data load into user-defined set of clusters by calculating
the hash code of the key mentioned in the query.
CREATE TABLE page_views (user_id INT, session_id BIGINT, url STRING)
PARTITIONED BY (day INT)
CLUSTERED BY (user_id) INTO 100 BUCKETS;

The processor will first calculate the hash number of the user_id in the query and will look for only that bucket.
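As a rough illustration of the lookup (Python's built-in hash is not the hash function Hive actually uses), picking the bucket amounts to a modulo over the clustering column:

NUM_BUCKETS = 100  # matches CLUSTERED BY (user_id) INTO 100 BUCKETS above

def bucket_for(user_id: int) -> int:
    """Pick the single bucket that can contain rows for this user_id."""
    return hash(user_id) % NUM_BUCKETS

# A query filtering on user_id = 1101 only needs to scan this one bucket.
print(bucket_for(1101))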
Hive Query Language: Introduction
HiveQL is a SQL-like query language for Hive to process and analyze structured data in a Metastore.
MetaStore
SELECT
dt,
COUNT (DISTINCT (user_id))
FROM events
GROUP BY dt;
HiveQL: Extensibility
An important principle of HiveQL is its extensibility. HiveQL can be extended in multiple ways:
• Pluggable user-defined functions
• Pluggable user-defined types
• Pluggable data formats
• Pluggable MapReduce scripts
Hive Analytics: UDF and UDAF
User-Defined Function
Hive has the ability to define a function. UDFs extend the functionality of Hive, with a function
written in Java, that can be evaluated in HiveQL statements.
All UDFs extend the Hive UDF class. After that, a UDF sub-class implements one or more methods named evaluate.
package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Lower extends UDF {
  public Text evaluate(final Text s) {
    if (s == null) { return null; }
    return new Text(s.toString().toLowerCase());
  }
}
Built-in Functions of Hive
Writing functions in Java lets you create your own UDFs. Hive also provides some built-in functions that can be used to avoid creating your own UDFs.
Mathematical: round, floor, ceil, rand, and exp
Collection: size, map_keys, map_values, and array_contains
Conditional: if, case, and coalesce
String: length, reverse, upper, and trim
Other Functions of Hive
Lateral view: creates the output if the full set of data is given.
Table-generating functions turn a single input row into multiple output rows, in contrast to aggregate functions.
(Example: a table with columns pageid (String) and adid_list (Array<int>) — ("front_page", [1,2,3]), ("contact_page", [3,4,5]) — can be exploded into rows such as ("front_page", 1), ("front_page", 2), and so on.)
MapReduce Scripts
MapReduce scripts are written in scripting languages, such as Python.
Example: my_append.py

import sys

i = 0
for line in sys.stdin:
    line = line.strip()
    key = line.split('\t')[0]
    value = line.split('\t')[1]
    print(key + str(i) + '\t' + value + str(i))
    i = i + 1
Using the function:
a. Hive
b. RDBMS
c. Both A and B
a. /hive
b. /user/hive/
c. /user/hive/warehouse
c. Not human-readable
c. Be cautious while creating dynamic partition as it can lead to high number of partitions
a. File Formatting
b. Data Serialization
c. Both A and B
Problem Statement:
Everybody loves movies. Nowadays, more movies are released per year than in earlier days because of an increase in the number of production houses. A few giants, like Netflix and Amazon, have started creating their own content as well.
Hollywood is spreading its wings in most countries because of its graphics, stories, and actors.
In Hollywood, a few directors have made a great impact on audiences. Among these, a few have received nominations and won awards.
Before watching a movie, people tend to validate the director's credentials, like what kind of movies they have made in the past and whether they have won any awards.
The given data set has details about the movie directors and whether they have
received nominations and won awards.
1. Directors who were nominated and have won awards in the year 2011
2. Award categories available in the Berlin International Film Festival
3. Directors who won awards for making movies in French
4. Directors who have won awards more than 10 times
Thank You
Big Data Hadoop and Spark Developer
NoSQL Databases: HBase
Learning Objectives
DB NoSQL
Structured Unstructured
Why NoSQL?
With the explosion of social media sites, such as Facebook and Twitter, the demand to manage
large data has grown tremendously.
Graph databases. Example: records are stored as nodes; nodes are organized by relationships; both nodes and relationships have properties.
Problem Statement: In this demonstration, you will learn how to tune YARN and allow HBase to run smoothly without being resource starved.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
HBase Overview
What Is HBase?
HBase can store huge amounts of data in tabular format for extremely fast reads and writes.
It is mostly used in scenarios that require regular and consistent inserting and overwriting of data.
Why HBase?
Therefore, a solution is required to access, read, or write data anytime regardless of its sequence in the
clusters of data.
Characteristics of HBase
HBase is a database in which tables have no schema. At the time of table creation, column families are
defined, not columns.
HBase: Real-Life Connect
Facebook’s messenger platform needs to store over 135 trillion messages every month.
HBase has two types of nodes: Master and RegionServer. Their characteristics are as follows:
Master:
• A single Master node runs at a time
• Manages cluster operations
• Is not a part of the read or write path

RegionServer:
• One or more RegionServers run at a time
• Hosts tables, performs reads, and buffers writes
• Clients communicate with RegionServers to read and write

A region in HBase is a subset of a table's rows. The Master node detects the status of RegionServers and assigns regions to them.
HBase Components
HDFS
Storage Model of HBase
Partitioning:
• A table is horizontally partitioned into regions.
• Each region is managed by a RegionServer.
• A RegionServer may hold multiple regions.
For example, the logical view of all rows in a table (row keys from A1 onward) is split into regions by row-key range: (null → A3), (A3 → F34), (F34 → K80), (K80 → O95), and (O95 → null).

Within a region, each row consists of a row key and cells addressed by column family and column qualifier, for example rowkey, CF1:C1, CF1:C2, CF1:C3, CF2:C1, CF1:C8.
Cells within a column family are sorted physically. The table is very sparse, as most cells have NULL values.
Row Key
The table shows a comparison between HBase and a Relational Database Management System (RDBMS):
HBase:
• Scales linearly and automatically with new nodes
• Leverages batch processing with MapReduce distributed processing

RDBMS:
• Usually scales vertically by adding more hardware resources
• Relies on multiple threads or processes rather than MapReduce distributed processing
Connecting to HBase
Connecting to HBase
Clients connect to HBase through several paths: MapReduce jobs, Hive/Pig/HCatalog/Hue, Java applications using the Java API, and a REST/Thrift gateway. ZooKeeper coordinates the cluster, and HBase stores its data on HDFS.
HBase Shell Commands
Common commands include, but are not limited to, the following:
drop: Drops the named table; the table must first be disabled. Example: hbase> drop 't1'
delete: Deletes a cell value
put: Puts a cell value
Problem Statement: Create a sample HBase table on the cluster, enter some data, query the table, then
clean up the data and exit.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Unassisted Practice
Steps to Perform
• HBase Shell
// Create a table called simplilearn with one column family named stats:
create 'simplilearn', 'stats'
// Add a test value to the daily column in the stats column family for row 1:
put 'simplilearn', 'row1', 'stats:daily', 'test-daily-value'
Unassisted Practice
Steps to Perform
• HBase Shell
// Add a test value to the weekly column in the stats column family for row 1:
put 'simplilearn', 'row1', 'stats:weekly', 'test-weekly-value'
// Add a test value to the weekly column in the stats column family for row 2:
put 'simplilearn', 'row2', 'stats:weekly', 'test-weekly-value'
c. In variable schema
Global Transport Private Limited works in transport analytics and is keen to ensure the
safety of people. As the population increases, accidents are becoming more and more
frequent. Accidents occur mostly when the route is long, the driver is drunk,
or the roads are damaged. The company collects data on all accidents and provides
important insights that can reduce the number of accidents. The company wants to create a
public portal where anyone can see aggregated accident data.
Your task is to suggest a suitable database and design a schema which can cover most of the
use cases.
You are given a file that contains details about the various parameter of accidents.
The column details are as follows:
1. Year
2. TYPE
3. 0-3 hrs. (Night)
4. 3-6 hrs. (Night)
5. 6-9 hrs (Day)
6. 9-12 hrs (Day)
7. 12-15 hrs (Day)
8. 15-18 hrs (Day)
9. 18-21 hrs (Night)
10. 21-24 hrs (Night)
11. Total
Lesson-End-Project
Problem Statement:
You have to save the given data in HBase in such a way that you can solve the below queries.
Please mention what you are selecting as a row key and why.
Why is Spark
programmed in Scala?
Why Is Spark Programmed in Scala?
Assisted Practice 1

Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Functional Programming
Functional Programming
Functional programming languages are designed on the concept of mathematical functions that use
conditional expressions and recursion to perform computation.
Functional programming: flow control is done using function calls and function calls with recursion; it supports both "Abstraction over Data" and "Abstraction over Behavior".
Imperative programming: flow control is done using loops and conditional statements; it supports only "Abstraction over Data".
The basic literals used in Scala are listed below with examples:

Integer literals
Example:
scala> val hex = 0x6; output - hex: Int = 6

Floating-point literals
Examples:
scala> val big = 1.2345; output - big: Double = 1.2345
scala> val little = 1.2345F; output - little: Float = 1.2345

Character Literals
Example:
scala> val a = 'A'

Boolean literals
Examples:
scala> val bool = true; output - bool: Boolean = true
scala> val fool = false; output - fool: Boolean = false

Symbol literals
Example:
scala> updateRecordByName('favoriteBook, "Spark in Action")

String Literals
Examples:
scala> val hello = "hello"; output - hello: java.lang.String = hello

println("""Welcome to Ultamix 3000
Type "HELP" for help.""")

Output:
Welcome to Ultamix 3000
Type "HELP" for help
Introduction to Operators
The class Int includes a method called + that takes an Int and provides an Int as a result. To invoke the +
method, you need to add two Ints as follows:
scala> val sum = 2 + 1 //Scala invokes (2)+(1)
Types of Operators
Arithmetic Operators
Bitwise Operators
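As a brief hedged illustration (not from the original slides), a few arithmetic and bitwise operators evaluated in Scala might look like this:

// Arithmetic operators
val sum = 7 + 3        // 10
val diff = 7 - 3       // 4
val quotient = 7 / 3   // 2 (integer division)
val remainder = 7 % 3  // 1

// Bitwise operators
val andBits = 6 & 3    // 2 (0110 & 0011 = 0010)
val orBits  = 6 | 3    // 7 (0110 | 0011 = 0111)
val xorBits = 6 ^ 3    // 5 (0110 ^ 0011 = 0101)
val shifted = 1 << 3   // 8 (shift left by three bits)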
Basic Literals and Arithmetic Operators Duration: 5 mins
Problem Statement: In this demonstration, you will use basic literals and arithmetic operators in Scala.
Assisted Practice 2

Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Logical Operators Duration: 5 mins
Problem Statement: In this demonstration, you will use the logical operators in Scala.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Type Inference, Classes, Objects, and Functions in Scala
Type Inference
Type inference is a built-in mechanism that allows you to omit type annotations and method
return types when the compiler can deduce them.
Example:
object InferenceTest1 extends Application {
val x = 1 + 2 * 3 // the type of x is Int
val y = x.toString() // the type of y is String
def succ(x: Int) = x + 1 // method succ returns Int values
}
Objects
Scala has singleton objects instead of static members. A singleton object is a class with exactly one
instance and is created using the keyword object.

package logging

object Logger {
  def info(message: String): Unit = println(s"INFO: $message")
}
Classes
Classes in Scala are blueprints for creating objects. They can contain methods, values, variables,
types, objects, and traits which are collectively called members.
Class User
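The Class User example above is truncated; a minimal hedged sketch of such a class (the field and method names here are assumptions) is:

class User(val name: String) {
  // A simple member method
  def greet(): String = s"Hello, $name"
}

val user1 = new User("Alice")
println(user1.greet())   // Hello, Alice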
Scala provides a rich set of built-in functions and also allows you to create user-defined functions.
In Scala, functions are first-class values: they can be returned as results or passed as parameters.
Example:
(y: Int) => y * y
Higher-Order Functions
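A higher-order function takes other functions as parameters or returns a function as a result. As a brief hedged illustration (not from the original slides):

// map is a higher-order function: it takes the anonymous function (y: Int) => y * y as an argument
val numbers = List(1, 2, 3, 4)
val squares = numbers.map((y: Int) => y * y)   // List(1, 4, 9, 16)

// A user-defined higher-order function that applies f twice
def applyTwice(f: Int => Int, x: Int): Int = f(f(x))
println(applyTwice(_ + 3, 10))                 // 16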
Problem Statement: In this demonstration, you will learn how to define type inference, functions,
anonymous functions, and class in Scala.
Assisted Practice 4

Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Collections
Collections
As discussed, a map is an
example of higher-order
function. But what exactly is
a map?
Collections
Collections are containers of things, that can be sequenced as linear sets of items.
Types of Collections
1. Lists
2. Sets
3. Maps
4. Tuples
5. Options
6. Iterators
In Scala lists, all the elements have the same data type. They are immutable and represent a linked
list.
//List of strings
val fruit: List[String] = List("apples", "oranges", "pears")
//List of integers
val nums: List[Int] = List(1, 2, 3, 4)
Sets
Common operations on sets include tests (such as membership), additions, and transformations.

Maps
// Empty hash table whose keys are strings and values are integers:
var A: Map[String, Int] = Map()
Tuples
In Scala, a tuple is a value that contains a fixed number of elements, each with a distinct type.
//Tuple of integers
val t = (1,2,3,4)
An Option[T] is a container for zero or one element, which represents a possibly missing value.
//Define an option
val x:Option[Int] = Some(5)
Iterators
An iterator is a way to access the elements of a collection one by one, using the methods next() and hasNext().
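A short hedged sketch of using an iterator (the element values here are illustrative assumptions):

val it = Iterator("Simplilearn", "is", "an", "educational", "revolution")
while (it.hasNext) {
  println(it.next())   // prints each element exactly once
}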
Problem Statement: In this demonstration, you will learn how to use different types of collections
such as List, Set, Map, Tuple, and Option in Scala.
Assisted Practice 5

Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Perform Different Operations on List Duration: 10 mins
Problem Statement: In this demonstration, you will learn how to perform different operations on list.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Scala REPL
Scala REPL
Scala REPL is a tool for evaluating expressions in Scala. The REPL reads expressions at the prompt,
wraps them in an executable template, and then compiles and executes the result.
Useful REPL values and commands include:
• $intp and lastException
• //print<tab>
• :help, :load, and :paste
• :power, :settings, and :replay
• -Yrepl-outdir
Implementation Notes
User code can be wrapped in either an object or a class. The switch is -Yrepl-
class-based.
Problem Statement: In this demonstration, you will demonstrate the features of Scala REPL.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Key Takeaways
Knowledge Check 1
scala> val hex = 0x6; output - hex: Int = 6 is an example of which kind of literal?
a. Character literal
b. Integer literal
c. Floating literal
d. None of the above

Answer: b. scala> val hex = 0x6; output - hex: Int = 6 is an example of an integer literal.
Knowledge Check 2
_____________ is a value that contains a fixed number of elements, each with a distinct type.
a. Map
b. Tuple
c. List
d. Set

Answer: b. A tuple is a value that contains a fixed number of elements, each with a distinct type.
Knowledge Check 3
_______________ is a way to access the elements of a collection one by one.
a. Option
b. Iterator
c. Map
d. Set

Answer: b. An iterator is a way to access the elements of a collection one by one.
The following details are given for the list of companies in a CSV file.
1. Name
2. Domain
3. Year founded
4. Industry
5. Size range
6. Country
7. LinkedIn URL
8. Current employee estimate
9. Total employee estimate
Filename: companies.csv
You must read the file and use Scala collections to solve the following
problems:
1. Companies which were founded before 1980.
2. Top 5 companies which have the maximum number of employees.
3. Top Industry in which the maximum number of companies were founded.
Thank You
Big Data Hadoop and Spark Developer
Apache Spark - Next Generation Big Data Framework
Learning Objectives
Timeline: Spark was open-sourced in 2010 and became a top-level Apache project in 2014.
Batch Processing:
• A large group of data or transactions is processed in a single run.
• Jobs are run without any manual intervention.
• The entire data is pre-selected and fed using command-line parameters and scripts.
• It is used to execute multiple operations, handle heavy data loads, reporting, and offline data workflows.
• Example: regular reports that require decision-making

Real-Time Processing:
• Data processing takes place instantaneously upon data entry or command receipt.
• It must execute in real time within stringent constraints.
• Example: fraud detection
Limitations of MapReduce in Hadoop
Is suitable for real-time processing, trivial operations, and processing larger data on a
network
Provides up to 100 times faster performance for a few applications with in-memory
primitives, compared to the two-stage disk-based MapReduce paradigm of Hadoop
Is suitable for machine learning algorithms, as it allows programs to load and query
data repeatedly
Apache Spark is made up of Spark Core and Resilient Distributed Datasets (RDDs), Spark SQL, Spark Streaming, the Machine Learning Library (MLlib), and GraphX.
Components of Spark
Components of a Spark Project
The components of a Spark project are explained below:
Spark Core and As the foundation, it provides basic I/O, distributed task dispatching,
RDDs and scheduling. RDDs can be created by applying coarse-grained
transformations or referencing external datasets.
Spark Streaming It leverages the fast scheduling capability of Spark Core, ingests data
in small batches, and performs RDD transformations on them.
In column-centric databases, similar information can be stored together. The working of in-memory processing can be explained as follows:
The entire information is loaded into memory, eliminating the need for indexes,
aggregates, optimized databases, star schemas, and cubes.
Compression algorithms are used by most of the in-memory tools, thereby reducing
the in-memory size.
Querying the data loaded into the memory is different from caching.
With in-memory tools, the analysis of data can be flexible in size and can be accessed
within seconds by concurrent users with an excellent analytics potential.
Hadoop Ecosystem
Hadoop Ecosystem vs. Spark
You can perform every type of data processing using Spark that you execute in Hadoop. They are:
Machine Learning Analysis: MLlib can be used for clustering, recommendations, and
classification.
Interactive SQL Analysis: Spark SQL can be used in place of Stinger, Tez, or Impala.
Real-Time Streaming Data Analysis: Spark Streaming can be used in place of a specialized
library like Storm.
Advantages of Spark
Advantages of Spark
Speed: Extends the MapReduce model to support computations like stream processing and
interactive queries
Combination: Covers various workloads that require different distributed systems, which
makes it easy to combine different processing types and allows easy tools management
Hadoop Support: Allows creation of distributed datasets from any file stored in the
Hadoop Distributed File System (HDFS) or any other supported storage systems
Why does unification matter?
• Developers need to learn only one platform
• Users can take their apps everywhere
Apache Spark runs on top of Hadoop, Mesos, and NoSQL systems.
Advantages of Spark
A Spark application consists of a Driver Program whose SparkContext connects to a Cluster Manager; the cluster manager allocates Worker Nodes, and each worker node runs an Executor with a cache that executes tasks.
Spark Execution: Automatic Parallelization
Spark as Standalone
Can be launched manually by using launch scripts, or starting a master and workers;
used for development and testing
Spark on Mesos
Has advantages like scalable partitioning among different Spark instances and
dynamic partitioning between Spark and other frameworks
Spark on YARN
Has all parallel processing and benefits of the Hadoop cluster
Spark on EC2
Has key-value pair benefits of Amazon
Spark Shell
$ spark-shell (Scala)    $ pyspark (Python)
SparkContext
• It is the main entry point of Spark API. Every Spark application requires a SparkContext.
• Spark Shell provides a preconfigured SparkContext called sc.
$ spark-shell (Scala)    $ pyspark (Python)
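For instance, a hedged sketch of using the preconfigured sc inside spark-shell:

// sc is already available in the shell
val data = sc.parallelize(1 to 100)    // create an RDD from a local collection
val evens = data.filter(_ % 2 == 0)    // transformation
println(evens.count())                 // action: prints 50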
Assisted Practice
Problem Statement: In this demonstration, you will learn how to run a Scala program in Spark shell.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Assisted Practice
Problem Statement: In this demonstration, you will learn how to set up an execution environment in IDE.
Assisted Practice 2

Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Assisted Practice
Problem Statement: In this demonstration, you will understand the various components of Spark web UI.
Assisted Practice 3

Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Key Takeaways
Which of the following are components of a Spark project?
a. Spark Core and RDDs
b. Spark SQL
c. Spark Streaming

Answer: Spark Core and RDDs, Spark SQL, and Spark Streaming are all components of a Spark project.
Knowledge
Check
Spark was started in the year_______.
2
a. 2009
b. 2010
c. 2013
d. 2014
Spark was started in the year 2009 at UC Berkeley AMPLab by Matei Zaharia.
Knowledge
Check
Which of the following are the supported Cluster Managers?
3
a. Standalone
b. Apache Mesos
c. Hadoop Yarn
Standalone, Apache Mesos, and Hadoop Yarn are all supported Cluster Managers.
Lesson-End Project
Problem Statement:
You have enrolled as a trainee in one of the top training institutes which provides training on
Big Data. You have learned Spark and Hadoop and where to use them.
Based on that knowledge, your task is to solve the below use cases using Hadoop or Spark.
Use-cases:
1. An E-commerce company wants to show the most trending brands in the last 1 hour on
their web portal.
2. An E-commerce company wants to calculate orders for the last 5 years in the
mobile category.
3. You have been given the product data of clicks and impressions. Click means
when a user clicks on the product and goes to the product page.
Impression refers to the product landing page on Amazon. You have to create a
model that can predict if any product on the portal is eligible for click or not.
Spark Resilient Distributed Dataset (RDD) is an immutable collection of objects which defines the data
structure of Spark.
(Scala, the language in which Spark is written, was created and developed by Martin Odersky.)
The following are the features of Spark RDD:
• In-Memory Computation
• Immutable
• Fault-Tolerant
• Location-Stickiness
RDD in Spark
RDDs are well suited to iterative algorithms and trade off performance against storage.
Data Types Supported by RDD
Creating Spark RDD
Creating Spark RDD
Parallelized
Collections
Existing External
RDDs Data
Parallelized Collections
RDDs are created by parallelizing an existing collection in your driver program or referencing a dataset in an
external storage system.
RDDs are created by taking the existing collection and passing it to SparkContext parallelize() method.
val data=spark.sparkContext.parallelize(Seq((“physics",78),(“chemistry",48),(“biology",73),
(“english",54),("maths",77)))
val sorted = data.sortByKey()
sorted.foreach(println)
Creating RDD from Collections
File: Simplilearn.txt contents:
Simplilearn
is
an
educational
revolution

In [4]: data = ["Simplilearn", "is", "an", "educational", "revolution"]
In [5]: rdd1 = sc.parallelize(data)
In [6]: rdd1.take(2)
["Simplilearn", "is"]
Existing RDDs
RDDs can be created from existing RDDs by transforming one RDD into another RDD.
wordPair.foreach(println)
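The wordPair.foreach(println) call above is shown without its setup; a hedged sketch of how such a derived RDD might be built (the words RDD and its contents are assumptions) is:

val words = spark.sparkContext.parallelize(Seq("spark", "scala", "hadoop", "hive"))
// Transform one RDD into another: pair each word with its length
val wordPair = words.map(w => (w, w.length))
wordPair.foreach(println)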
External Data
In Spark, a dataset can be created from any other dataset. The other dataset must be supported by Hadoop,
including the local file system, HDFS, Cassandra, HBase, and many more.
Data frame reader interface can be used to load dataset from an external storage system in the following formats:
To create a file-based RDD, you can use the command SparkContext.textFile or sc.textfile, and pass one or more
file names.
sc.textFile("simplilearn/*.log")
sc.textFile("simplilearn.txt")
sc.textFile("simplilearn1.txt,simplilearn2.txt")
Creating RDD from a Text File
{ "firstname": "Rahul", "lastname": "Gupta", "customerid": "001" }
{ "firstname": "Rita", "lastname": "John", "customerid": "002" }
{ "firstname": "Sam", "lastname": "Grant", "customerid": "002" }
Pair RDDs

Double RDDs
Double RDDs are RDDs that hold numerical data (for example: 001, 004; 002, 005; 003, 006). Some of
the functions that can be performed with Double RDDs are Distinct and Sum.
Creating Pair RDD
To create a Pair RDD, use functions such as Map, flatMap or flatMapValues, and keyBy.
Language: Python

users = sc.textFile(file) \
    .map(lambda line: line.split('\t')) \
    .map(lambda fields: (fields[0], fields[1]))
# Example output record: (Cust001, Deepak Mehta)
Transformation Action
Transformation
Narrow transformation is self-sufficient. It is the result of map and filter, such that the data is from a
single partition only.
Narrow transformations include map, filter, flatMap, sample, mapPartition, and union.
Wide Transformation
Wide transformation is not self-sufficient. It is the result of GroupByKey() and ReduceByKey() like
functions, such that the data can be from multiple partitions.
Wide transformations include cartesian, reduceByKey, join, groupByKey, intersection, coalesce, repartition, and distinct.
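A hedged illustration of the difference (the example data is assumed):

val nums = sc.parallelize(Seq(1, 2, 3, 4, 5, 6))
// Narrow transformations: each output partition depends on a single input partition
val doubled = nums.map(_ * 2)
val evens = nums.filter(_ % 2 == 0)

// Wide transformation: data is shuffled across partitions
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val totals = pairs.reduceByKey(_ + _)   // ("a", 4), ("b", 2)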
Spark Transformation: Detailed Exploration Duration: 10 mins
Assisted Practice 1

Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Action
Action allows us to return a value to the driver program, after running a computation on the dataset.
Actions are the RDD operations that produce non-RDD values.
reduce: an action that aggregates all the elements of the RDD using some function.
Other common actions include collect() and count().
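For example, a small hedged sketch of these actions:

val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
println(nums.reduce(_ + _))      // 15: aggregates all elements
println(nums.count())            // 5: number of elements
println(nums.collect().toList)   // List(1, 2, 3, 4, 5): brings all elements to the driver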
Spark Action: Detailed Exploration Duration: 10 mins
Assisted Practice 2

Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Caching and Persistence
Caching and Persistence
Caching and Persistence are the techniques used to store the result of RDD evaluation. They are also
used by developers to enhance the performance of applications.
Cost Efficient
Features of RDD Persistence
cache() persist()
Storage Levels
1. MEMORY_ONLY
2. MEMORY_AND_DISK
3. MEMORY_ONLY_SER
4. MEMORY_AND_DISK_SER
5. DISK_ONLY
6. OFF_HEAP
Marking an RDD for Persistence
Step 04
RDD[3] (myrdd2)
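A brief hedged sketch of marking an RDD for persistence with an explicit storage level (the file path is an assumption):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///data/app.log")
val errors = logs.filter(_.contains("ERROR"))
errors.persist(StorageLevel.MEMORY_AND_DISK)   // keep in memory, spill to disk if needed
println(errors.count())   // the first action materializes and caches the RDD
println(errors.count())   // later actions reuse the persisted data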
Changing Persistence Options
RDD Lineage is a graph that contains the existing RDD and the new RDD created from the existing one as a
result of transformation.
r00 r01
r20
DAG
Directed Acyclic Graph (DAG) is a graph where RDDs and the operations to be performed on RDDs are
represented in the form of vertices and edges, respectively.
Filter Filter
DAG was introduced in Spark to overcome the limitations of MapReduce in Hadoop. The computation in
MapReduce is done as:
The driver program creates a SparkSession and SparkContext (spark = SparkSession.builder…,
spark.sparkContext….), builds up RDD transformations such as rdd = spark.read.textFile…, rdd.filter(…),
and rdd.map, and finally triggers an action such as rdd.count. The SparkContext requests resources from
the cluster manager, which allocates worker nodes and instructs them to execute the job; each worker
node runs an executor that executes the tasks in the cluster.
Testing
Fixing Isolating
Attaching a Debugger to a Spark Application
The following are the steps for attaching a debugger to a Spark application:
Step 1 Upload the JAR file to the remote cluster and run it
Hash Partitioning
Range Partitioning
Custom Partitioning
Partitions from Single File
• Partitioning can be done based on size
• Partitioning can be done by specifying the minimum number of partitions, as in textFile(file, minPartitions)
Cluster
MasterNode
sc.textFile("mydir/*") sc.wholeTextFiles(“mydir”)
Operations on Partitions
foreachPartition mapPartitions
mapPartitionsWithIndex
Operation Stages
An operation is broken into stages (for example, Stage 1 and Stage 2), and each stage runs as a set of
parallel tasks (Task 1, Task 2, and so on).
Parallel Operations on Stages
Examples include map, flatMap, filter, distinct, reduceByKey, sortByKey, join, and groupByKey.
RDDs and HDFS: when an RDD such as mydata is created from a file in HDFS, the driver program's
SparkContext assigns one task per HDFS block (Block 1, Block 2, Block 3), and each executor processes
the block that is local to it.
Scheduling in Spark
Scheduling in Spark
Scheduling is the process in which resources are allocated to different tasks by an operating system.
In Spark, each application has its own JVM that runs tasks and stores data.
Static partitioning is the best approach for allocating resources to each application in the following Spark modes:
YARN
Scheduling within Application
Multiple jobs can run simultaneously inside a Spark application, if they are submitted from separate
threads.
Each job is divided into stages and resources are allocated in FIFO fashion.
Since Spark 0.8, it is possible to enable the fair scheduler, which uses a round-robin fashion to allocate tasks
between jobs.
conf.set("spark.scheduler.mode", "FAIR")
Shuffling is an operation which requires one node that will fetch data from other nodes to have data
for computing the result.
Spark provides three shuffle implementations: Hash Shuffle, Sort Shuffle, and Unsafe Shuffle.
Hash Shuffle
In hash shuffle, each task will write the output into multiple files.
Advantages of Hash Shuffle
• Fast
• No memory overhead
• No IO overhead
Sort Shuffle
In sort shuffle, each task spills a single shuffle file containing segments, along with one index file.
In unsafe shuffle, the records are serialized once and then stored in memory pages.
Serialized records are stored in memory pages (Memory Page 1, Memory Page 2, and so on), with an
8-byte entry per record, and are then sorted as an array.
Query Execution in Spark
1. analyzed
2. withCachedData
3. optimizedPlan
4. sparkPlan
5. executedPlan
6. toRdd
Execution in Spark
A Dataset or SQL query passes through the Analyzer, the Logical Optimizer, and the Physical Planner (with its optimizer) before being executed as RDDs.
Aggregating Data with Pair RDD
Aggregation
To perform aggregations, the datasets must be described in the form of key-value pairs.
The following are the functions that can be used for aggregation:
reduceByKey() foldByKey()
mapValues()
Aggregation
Input (Key, Value): (Panda, 0), (Pink, 3), (Pirate, 3), (Panda, 1), (Pink, 4)
After mapValues (Key, (value, 1)): (Panda, (0, 1)), (Panda, (1, 1)), (Pink, (3, 1)), (Pink, (4, 1)), (Pirate, (3, 1))
After reduceByKey (Key, (sum, count)): (Panda, (1, 2)), (Pink, (7, 2)), (Pirate, (3, 1))
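A hedged sketch of the code behind this example (the RDD name is an assumption):

val scores = sc.parallelize(Seq(("Panda", 0), ("Pink", 3), ("Pirate", 3), ("Panda", 1), ("Pink", 4)))
// Pair each value with a count of 1, then sum values and counts per key
val sumCounts = scores.mapValues(v => (v, 1))
                      .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
// Per-key averages can then be derived with another mapValues
val averages = sumCounts.mapValues { case (sum, count) => sum.toDouble / count }
averages.foreach(println)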
Spark Application with Data Written Back to HDFS and Spark UI Duration: 10 mins
Problem Statement: In this demonstration, you will write the data to HDFS and Spark UI.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Changing Spark Application Parameters Duration: 10 mins
Problem Statement: In this demonstration, you will change Spark application parameters.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Handling Different File Formats Duration: 10 mins
Problem Statement: In this demonstration, you will handle different file formats.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Spark RDD with Real-World Application Duration: 10 mins
Problem Statement: In this demonstration, you will understand Spark RDD with a real-world application.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Optimizing Spark Jobs Duration: 10 mins
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Key Takeaways
Knowledge Check 1
Which feature of Spark RDD applies to all elements in datasets through map, filter, or group-by operations?
a. Lazy Evaluation
b. Coarse-Grained Operation
c. Fault-Tolerant
d. Immutable

Answer: b. The coarse-grained operation feature of Spark RDD applies to all elements in datasets through
map, filter, or group-by operations.
Knowledge
Check
Which of the following transformations is not self-sufficient?
2
a. Narrow Transformation
b. Wide Transformation
c. Both a and b
Answer: b. Wide transformation is not self-sufficient.
a. toRdd
b. executedPlan
c. withCachedData
d. optimizedPlan
Problem Statement: The New York School authority collects data from all schools
that provide bus facilities to students. This data helps to understand if buses are
reaching on time or not. This also helps to understand if there is a specific route
where buses are taking more time so that it can be improved.
You are given a dataset of buses that broke down or ran late. You have
been given the below data set:
Filename: bus-breakdown-and-delays.csv
1. School_Year
a. Indicates the year the record refers to
2. Run_Type
a. Designates whether a breakdown or delay occurred on a specific category
of bus services
3. Bus_No
4. Route_Number
5. Reason
a. Reason for delay as entered by the staff employed by the reporting bus
vendor
6. Occurred_On
7. Number_Of_Students_On_The_Bus
Lesson-End Project
You are hired as a big data consultant to provide important insights. You
must write a Spark job using the above data and need to provide the
following:
1. Most common reasons for either a delay or breaking down of bus
2. Top 5 route numbers where the bus was either delayed or broke down
3. The total number of incidents, year-wise, when the students were
a. In the bus
b. Not in the bus
4. The year in which accidents were the fewest
Thank You
Big Data Hadoop and Spark Developer
Spark SQL - Processing DataFrames
Learning Objectives
Spark SQL is a module for structured data processing that is built on top of core Spark.
Spark SQL
Spark SQL provides four main capabilities for using structured and semi-structured data.
Hive Compatibility: Compatible with the existing Hive queries, UDFs, and data
Unified Data Access: Loads and queries data from different sources
The below diagram shows the typical architecture and interfaces of Spark SQL.
JDBC clients, the console, and user programs (Scala, Python, Java) interface with Spark SQL, which runs on Spark and its Resilient Distributed Datasets.
SQLContext
The SQLContext class or any of its descendants acts as the entry point into all functionalities.
Q
How to get the benefit of a superset of the basic SQLContext functionality?
Build a HiveContext to:
• Use the writing ability for queries
• Access Hive UDFs and read data from Hive tables
Points to Remember:
• You can use the spark.sql.dialect option to select the specific variant of SQL
used for parsing queries
DataFrames represent a distributed collection of data in which data is organized into columns that are named.
To convert them to RDDs, call the rdd method, which returns the DataFrame content as an RDD of rows.
In earlier versions of the Spark SQL API, DataFrame was known as SchemaRDD.
Creating DataFrames
val df = sqlContext.read.json("examples/src/main/resources/customers.json")
// Displays the content of the DataFrame to stdout
df.show()
Assisted Practice
Problem Statement: In this demonstration, you will learn how to handle various data formats.
Assisted Practice 1

Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Using DataFrame Operations
Problem Statement: In this demonstration, you will learn how to implement various DataFrame
operations like filter, aggregates, joins, count, and sort.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
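A hedged sketch of the kinds of DataFrame operations this practice covers (the column names and aggregations here are assumptions):

import sqlContext.implicits._

val customers = sqlContext.read.json("examples/src/main/resources/customers.json")
customers.filter(customers("age") > 30).show()                // filter
customers.groupBy("city").count().sort($"count".desc).show()  // aggregate and sort
println(customers.count())                                    // count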
UDFs: User Defined Functions
The example below defines a UDF to convert a given text to upper case.
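The original code for this example is not preserved here; a hedged sketch of such a UDF (column names are assumptions) might look like this:

import org.apache.spark.sql.functions.udf

// df is assumed to be a DataFrame with a string column "name"
val df = sqlContext.read.json("examples/src/main/resources/customers.json")
val toUpper = udf((s: String) => if (s == null) null else s.toUpperCase)
val result = df.withColumn("name_upper", toUpper(df("name")))
result.show()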
UDAFs are very useful when performing aggregations across groups or columns.
UDAF: User-Defined Aggregate Functions
In order to write a custom UDAF, you need to extend
UserDefinedAggregateFunction and define four methods.
Initialize: On a given node, this method is called once for each group.
Update: For a given group, Spark will call "update" for each input record of that group.
Merge: If the function supports partial aggregates, Spark computes partial results and combines them together.
Evaluate: Once all the entries for a group are exhausted, Spark will call "evaluate" to get the final result.
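A compact hedged skeleton of a custom UDAF (a simple sum over a Long column; the class and column names are assumptions) could look like this:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class SumUDAF extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", LongType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("total", LongType) :: Nil)
  def dataType: DataType = LongType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0L
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
  def evaluate(buffer: Row): Any = buffer.getLong(0)
}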
Assisted Practice
Problem Statement: In this demonstration, you will learn how to create UDF and UDAF.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Interoperating with RDDs
To convert existing RDDs into DataFrames, Spark SQL supports two methods:
Reflection-Based:
• Infers an RDD schema containing specific types of objects
• Works well when the schema is already known while writing the Spark application

Programmatic:
• Lets you build a schema and apply it to an already existing RDD
• Allows you to build DataFrames when you do not know the columns and their types until runtime
Using the Reflection-Based Approach
For Spark SQL, the Scala interface allows users to convert an RDD
with case classes to a DataFrame automatically.
1. The case class defines the table schema, where the argument names to the case class are read using reflection.
2. The case class can be nested and can contain complex types such as sequences or arrays.
3. The Scala interface implicitly converts the resultant RDD to a DataFrame and registers it as a table.
Using the Reflection-Based Approach
// Define the schema with a case class and create a DataFrame from an RDD of Person objects
case class Person(name: String, age: Int)
import sqlContext.implicits._
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")
// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")
// Rows can also be accessed by field name:
teenagers.map(_.getValuesMap[Any](List("name", "age"))).collect().foreach(println)
Using the Programmatic Approach
This method is used when you cannot define case classes ahead of time.
For example, when the records structure is encoded in a text dataset or a string.
Apply the schema to the RDD of rows using the createDataFrame method
Using the Programmatic Approach
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}
// Generate the schema based on the string of schema, convert records of the RDD (people)
// to Rows, and apply the schema to the RDD.
peopleDataFrame.registerTempTable("people")
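The snippet above omits the intermediate steps; a hedged sketch of the full flow (the schema string and file path are assumptions) is:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val people = sc.textFile("examples/src/main/resources/people.txt")
val schemaString = "name age"
// Generate the schema based on the string of schema
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
// Convert records of the RDD (people) to Rows
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
// Apply the schema to the RDD of Rows
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
peopleDataFrame.registerTempTable("people")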
Assisted Practice
Problem Statement: In this demonstration, you will learn how to process DataFrame(s) using SQL
queries.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
RDD vs. DataFrame vs. Dataset
Apache Spark provides three types of APIs: RDD, DataFrame, and Dataset.
RDD (2011), DataFrame (2013), and Dataset (2015)
Example: Filter By Attribute
Given below are the various ways to filter an attribute using the three APIs.
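The original code for this comparison is not preserved here; a hedged sketch of filtering by an attribute with each API (the case class, data, and a Spark 2.x SparkSession named spark are assumptions) is:

case class Person(name: String, age: Int)
import spark.implicits._

// RDD: filter with a lambda over objects
val rdd = spark.sparkContext.parallelize(Seq(Person("Alice", 29), Person("Bob", 17)))
val adultsRdd = rdd.filter(p => p.age >= 18)

// DataFrame: filter with an untyped expression
val df = rdd.toDF()
val adultsDf = df.filter("age >= 18")

// Dataset: filter with a typed lambda
val ds = rdd.toDS()
val adultsDs = ds.filter(p => p.age >= 18)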
Use Case: RDD API
• Low-level transformations
• Unstructured data

Use Case: DataFrame or Dataset API
• High-level abstractions and expressions
• Unified access
• Type safety
Unassisted Practice
• Create the case classes with the following fields: Department, Employee, and DepartmentWithEmployees.
Note: Create the DepartmentWithEmployees instances from Departments and Employees.
Insert at least four values.
• Create two DataFrames from the list of the above case classes.
• Combine the two DataFrames and write the combined DataFrame to a parquet file.
• Use filter() or where() clause to return the rows whose first name is either “Alice” or “John”.
Note: Use first name filter as per your entries.
Steps to Perform
// Create the case classes
Steps to Perform
Steps to Perform
unionDF.write.parquet("/user/simpli_learn/simplitest")
Steps to Perform
// Using filter() to return the rows where first name is either 'Alice' or 'John'
Steps to Perform
import org.apache.spark.sql.functions._
a. Hive
b. Spark SQL
c. MapReduce
c. It provides the Catalyst Optimizer along with SQL engine and CLI.
Knowledge Check 3
Which of the following represents a distributed collection of data in which data is organized into columns that are named?
a. Spark SQL
b. SparkContext
c. DataFrames
d. Data Organizer

Answer: c. DataFrames represent a distributed collection of data in which data is organized into named columns.
a. Programmatic
b. Reflection-Based
c. Both a and b
Problem Statement:
Every country has data for each of the companies that are operating in that country.
Registering a company is mandatory as it has to provide information about its profit/loss and
other details. “People data Labs” is one of the biggest data companies which collects and
provides data. They recently open-sourced the datasets of “Global companies”, which
operate in various countries. Data can be used to find a company in any specific industry,
their employee count, and website details.
1. Name
2. Domain
3. Year founded in
4. Industry
5. Size range
6. Country
7. LinkedIn URL
8. Current employee estimate
9. Total employee estimate
Lesson-End Project
Identify the skills required to become a data scientist and data analyst
A data scientist is the person who gathers data from multiple sources and applies machine learning,
predictive analytics, and sentiment analysis to extract critical information from the collected data sets.
Skills Required to Become a Data Scientist
Knowledge of Machine
Learning
A data analyst is the person who can do basic descriptive statistics, visualize data, and
communicate data points for conclusions.
Skills Required to Become a Data Analyst
Portable General
Types of Analytics
Descriptive Predictive
Analytics Analytics
Prescriptive Analytics
Descriptive Analytics
The type of analytics that describes the past and answers the question: “What has happened?”.
Descriptive Analytics
Data Mining
Predictive Analytics
The type of analytics that has the ability to understand the future and answer the question: “What might
happen?”.
Predictive Analytics
Forecast Technique
Prescriptive Analytics
The type of analytics that is used to advice the users on possible outcome and answer the question: “What should
be done?”.
Prescriptive Analytics
Simulation
Algorithms
Machine Learning
What Is Machine Learning?
A training input is fed to an ML algorithm to build a model; the model is evaluated, and a successful model is then used to produce output for new input.
Relationship between Machine Learning and Data Science
Large-scale machine learning involves large data which has large number of
training, features, or classes.
Large-Scale Machine Learning Tools
Applications of Machine Learning
Image Processing
Healthcare Robotics
ML
Video Games
Types of ML Algorithms
• Supervised Learning
• Unsupervised Learning
• Semi-Supervised Learning
• Reinforcement Learning
Supervised Learning
Supervised Learning
A supervisor provides labeled examples (for instance, an image labeled "It is a dog"). The model is trained on the input dataset and then produces an output for new input.
Supervised Learning: Example
Dataset
Netflix uses supervised learning algorithms to recommend users the shows they
may watch based on the viewing history and ratings by similar classes of users.
Predicted
New input
outcome
Algorithm trained on
historical data
Supervised Learning Algorithms
Classification Regression
Classification with Real-World Problem
• Duration: 10 mins
Problem Statement: In this demonstration, you will perform classification with real-world problem.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Linear Regression with Real-World Problem
• Duration: 10 mins
Problem Statement: In this demonstration, you will perform linear regression with real-world problem.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Unsupervised Learning
Unsupervised Learning
Unlabeled input is fed to the algorithm, which builds a model by understanding and processing the data, and the trained model produces the output.
Unsupervised Learning: Example
Unlabeled Data
Unsupervised Learning
Unsupervised Learning Algorithms
Clustering Reduction
Unsupervised Learning with Real-World Problem
• Duration: 10 mins
Problem Statement: In this demonstration, you will perform unsupervised learning with real-world problem.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Reinforcement Learning
Reinforcement Learning
An agent observes the state of the environment, selects an action, and receives a reward; over time the algorithm produces a trained model (for example, one that can answer "Dog or Cat?" with "It is a dog").
Semi-Supervised Learning
The model is first trained on a small amount of labeled input; the trained model then pseudo-labels the remaining unlabeled data, and the combination of labeled and pseudo-labeled data is used for further model training.
Overview of MLlib
What Is MLlib?
1. ML Algorithms
2. Featurization
3. Pipelines
4. Persistence
5. Utilities
MLlib Algorithms
• Classification
• Clustering
• Recommendation
• Optimization
1. DataFrame
2. Transformer
3. Estimator
4. Pipeline
5. Parameter
Working of Pipeline
Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator.
A Pipeline (an Estimator) chains Tokenizer → HashingTF → Logistic Regression: raw text is tokenized into words, the words are hashed into feature vectors, and the vectors are fed to logistic regression. Calling Pipeline.fit() produces a Logistic Regression Model.
Working of Pipeline
The fitted PipelineModel (a Transformer) applies the same stages: Tokenizer → HashingTF → Logistic Regression Model. Calling PipelineModel.transform() converts raw text into words, then into feature vectors, and finally into predictions.
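A hedged sketch of this pipeline in code (the column names and parameters are assumptions, following the standard spark.ml pattern):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// training and test are assumed DataFrames with a "text" column (and, for training, a "label" column)
val model = pipeline.fit(training)        // returns a PipelineModel
val predictions = model.transform(test)   // adds a "prediction" column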
Key Takeaways
Identify the skills required to become a data scientist and data analyst
a. Knowledge of Python and R
b. Ability to work with unstructured data
c. Experience in SQL
d. All of the above

Answer: d. Knowledge of Python and R, the ability to work with unstructured data, and experience in SQL are all
required to become a data scientist.
a. Descriptive analytics
b. Predictive analytics
c. Prescriptive analytics
d. None of the above

Answer: a. Descriptive analytics describes the past and answers the question, "What has happened?".
a. Supervised learning
b. Unsupervised learning
c. Reinforcement learning
Knowledge Check 4
Which MLlib algorithm is a statistical process for estimating the relationships among variables?
a. Classification
b. Regression
c. Clustering
d. Optimization
Regression algorithm is a statistical process for estimating the relationships among variables.
Problem Statement: The Blue Nile is one of the largest diamond retail e-
commerce companies in the world. The company’s revenue numbers are
good this year. Recently, a distributor decided to shut down its business due to
financial instability and wants to sell all of its diamonds in an open auction.
This is a great opportunity for the company to expand its diamond inventory,
and it wants to participate in the bidding. To make sure your bid is accurate,
you must predict the correct price of each diamond.
To predict the price, you need a machine learning model, and to build one you
first need the correct data. You have collected the diamond data with all the
possible features that determine the price of a diamond.
Lesson-End Project
You have created a diamond.csv file which has the following details:
1. Index: counter
2. Carat: Carat weight of the diamond
3. Cut: Describe the cut quality of the diamond. Quality in increasing order:
Fair, Good, Very Good, Premium, Ideal
4. Color: Color of the diamond, with D being the best and J the worst
5. Clarity: How obvious the inclusions are in the diamond (in order
from best to worst, FL = flawless, I3 = level 3 inclusions)
6. Depth: depth %: The height of the diamond, measured from the culet to
the table, divided by its average girdle diameter
7. Table: table%: The width of the diamond’s table expressed as a
percentage of its average diameter
8. Price: the price of the diamond
9. X: length mm
10. Y: width mm
11. Z: depth mm
As a business analyst of the company, you are assigned the task of
recommending the bid amount for the diamond that the company should
bid for using the model built by the analytics team.
Thank You
Big Data Hadoop and Spark Developer
Stream Processing Frameworks and Spark Streaming
Learning Objectives
Big data streaming involves processing continuous streams of data in order to extract real-time insights.
Streaming applications receive data and queries from apps, services, and sensors, and return responses.
Need for Real-Time Processing
Certain tasks require big data processing as quickly as possible. For example:
Scalability
High Storage
Real-Time Processing of Big Data
Real-time Processing of Big Data
Real-time processing consists of continuous input, processing, and analysis of reporting data.
Data sources feed a processing pipeline that writes into an analytical data store, which is then used for analysis and reporting.
Data Processing Architectures
Data Processing Architecture
A good architecture for Real-time processing should have the following properties.
1. Fault-tolerant and scalable
2. Supportive of batch and incremental updates
3. Extensible
The Lambda Architecture
The Lambda Architecture is composed of three layers: Batch, Real-Time, and Serving
Batch Layer
• Stores the raw data as it arrives
• Computes the batch views for
consumption
• Manages historical data
• Re-computes results such as machine
learning models
• Operates on full data
• Produces most accurate results
• Has a high cost of high latency due to
high computation time
Real-Time Layer
• Receives the arriving data
• Performs incremental updates to the
batch layer results
• Has incremental algorithms
implemented at the speed layer
• Has a significantly reduced computation
cost
The Kappa Architecture
The Kappa Architecture only processes data as a stream.
Use Case
Case Scenario: Twitter wanted to improve mobile experience for its users.
Problem: A complex system that receives events, archives them, performs offline
and real-time computations, and merges the results of those computations into
coherent information
Goal: To reduce impact on battery and network usage; ensuring data reliability
and getting the data over as close to real time as possible
Solution: To reduce impact on the device, analytics events are compressed and
sent in batches.
Result: This has helped provide app developers with reliable, real-time and
actionable insights into their mobile applications.
Assisted Practice
Problem Statement: In this demonstration, you will learn the basics of real-time processing.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Spark Streaming
Introduction to Spark Streaming
Spark Streaming is an extension of the core Spark API.
Kafka
File Systems
Flume
Kinesis Databases
HDFS/S3 Dashboards
Twitter
Working of Spark Streaming
Spark Streaming receives a live input data stream and divides it into batches. The batches of input data
are represented as RDDs, and the streaming computations are expressed using DStreams (Discretized
Streams), which generate RDD transformations. Spark Streaming supports machine learning and graph
processing algorithms, provides state storage and leader election support, and is available in Scala, Java,
and Python.
Problem Statement: In this demonstration, you will learn to write a Spark streaming application.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Micro Batch
Micro-batching treats a stream as a sequence of small data chunks, each handled by a task or process.
A DStream is a series of RDDs: data from time 0 to 1, 1 to 2, 2 to 3, 3 to 4, and so on.
Introduction to DStreams
All operations applied on a DStream get translated to operations applicable on the underlying RDDs.
For example, applying a flatMap operation to a lines DStream (lines from time 0 to 1, 1 to 2, 2 to 3, 3 to 4)
produces a words DStream (words from time 0 to 1, 1 to 2, 2 to 3, 3 to 4).
Input DStreams and Receivers
Input DStreams represent the input data stream received from streaming sources.
Except file stream, each input DStream is linked with a receiver object that stores the data received from a source.
Basic sources include file systems and socket connections.
Transformations on DStreams
Transformations on DStreams
Transformations on DStreams are similar to those of RDDs.
A few of the common transformations on DStreams are given in the table below:
map(func), flatMap(func), filter(func), repartition(numPartitions), reduce(func), countByValue(), and
reduceByKey(func, [numTasks])
DStream.foreachRDD is a powerful primitive that lets the data to be sent to external systems.
Example:
DStream.foreachRDD { rdd => val connection = createNewConnection() //
executed at the driver
rdd.foreach { record => connection.send(record) // executed at the worker } }
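As the comments indicate, the connection in this snippet is created at the driver but used at the workers, which would require it to be serialized; a commonly used hedged alternative (createNewConnection() is assumed to be defined elsewhere) creates the connection once per partition:

DStream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection()   // executed at the worker, once per partition
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}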
DataFrame and SQL Operations
To use DataFrames and SQL operations, create an SQLContext using the SparkContext that the
StreamingContext uses.
Example:
val words: DStream[String] = ...
words.foreachRDD { rdd =>
// Get the singleton instance of SQLContext
val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
import sqlContext.implicits._
val wordsDataFrame = rdd.toDF("word")
// Register as table and Do word count on DataFrame using SQL and print it
wordsDataFrame.registerTempTable("words")
val wordCountsDataFrame = sqlContext.sql("select word, count(*) as total
from words group by word")
wordCountsDataFrame.show() }
Checkpointing
A streaming application must be:
Resilient to failures
Fault-tolerant storage system
Metadata Checkpointing
Metadata
Types of
Checkpointing
Data
Enabling Checkpointing
Requirements
Recovering from failures of the
driver running the applications
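A hedged sketch of enabling checkpointing with a context factory and getOrCreate (the directory, app name, and batch interval are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("CheckpointedApp")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("hdfs:///checkpoints/app")   // fault-tolerant storage for metadata and data checkpoints
  // define DStreams here
  ssc
}

// Recreate the context from checkpoint data if it exists; otherwise build a new one
val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/app", createContext _)
ssc.start()
ssc.awaitTermination()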
Socket Stream
A socket is created on the driver’s machine. The code residing outside the closure of the DStream is
implemented in the driver, while the rdd.foreach method is implemented on every distributed RDD
partition.
The socket and computation are performed in the same host, which makes it effective.
Example:
crowd.foreachRDD(rdd =>
{rdd.collect.foreach(record=>{
out.println(record)
})
})
File Stream
A DStream can be created to read data from files on any file system that is compatible with the HDFS API
such as HDFS, S3, and NFS.
Example:
streamingContext.fileStream[KeyClass,
ValueClass, InputFormatClass](dataDirectory)
streamingContext.fileStream<KeyClass,
ValueClass, InputFormatClass>(dataDirectory);
streamingContext.textFileStream(dataDirectory)
Unassisted Practice
Steps to Perform
• Word count program code
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import StreamingContext._
import org.apache.hadoop.conf._
import org.apache.hadoop.fs._
object Streaming {
  // create the FileInputDStream on the directory and use the stream to count words in newly created files
}
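The body of the program is not shown above; a hedged sketch of what it might contain (the directory, app name, and batch interval are assumptions, reusing the imports listed earlier) is:

object Streaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FileStreamWordCount")
    val ssc = new StreamingContext(conf, Seconds(10))
    // create the FileInputDStream on the directory and count words in newly created files
    val lines = ssc.textFileStream("hdfs:///stream/input")
    val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}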
Unassisted Practice
Steps to Perform
• Netcat input
$ nc -lk 1234
Hi there, this is John
I am learning Big data from Simplilearn
Hi there, this is Alice
I am learning Apache Spark from Simplilearn
State Operations
State Management
Stateful
Stateless: when a service is active but is not engaged in processing, it is said to be in a stateless condition.
Stateful: a service that is actively processing and retaining state data is in a stateful condition.
Stateful Operations
Stateful
01 Operate over various data batches
Window operations let you implement transformations over a sliding window of data.
A window-based operation groups the original DStream into a windowed DStream: window at time 1,
window at time 3, window at time 5, and so on.
Example:
val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
Types of Window Operations
Operations that take window length and slide interval as parameters are the following:
window(windowLength, slideInterval)
countByWindow(windowLength,slideInterval)
reduceByWindow(func, windowLength,slideInterval)
countByValueAndWindow(windowLength,slideInterval,[numTasks])
Join Operations: stream-stream Join
The first type, stream-stream joins, allows to join streams with other streams.
Example 1:
val stream1: DStream[String, String] = ...
val stream2: DStream[String, String] = ...
val joineDStream = stream1.join(stream2)
The second type, stream-dataset joins, allows to join a stream and a dataset.
Example:
val dataset: RDD[String, String] = ...
val windoweDStream = stream.window(Seconds(20))...
val joineDStream = windoweDStream.transform { rdd => rdd.join(dataset) }
Monitoring Spark Streaming Application
Spark Web UI displays a streaming tab that shows the statistics of the running receivers and details
of completed batches.
• Processing time: the time that it takes to process every data batch
• Scheduling delay: the time a batch waits in a queue for the earlier batches to complete processing
• If the batch processing time is continuously above the
batch interval or if the queue delay is increasing,
reduce the batch processing time.
• Monitor the progress of a Spark Streaming program
using the StreamingListener interface.
Assisted Practice
Problem Statement: In this demonstration, you will learn windowing of real-time data processing.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Spark Streaming Sources
Basic Sources
For basic sources, Spark streaming monitors the data directory and processes all files created in it.
Syntax:
streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)
Categories
Advanced Sources
Kinesis
Apache Flume
Advanced Sources: Twitter
To create a DStream using data from Twitter’s stream of tweets, follow the steps listed below:
01 The artifact spark-streaming-twitter_2.11, which is under org.apache.bahir, needs to be added to the SBT/Maven project dependencies.
02 The TwitterUtils class needs to be imported and a DStream needs to be created: import org.apache.spark.streaming.twitter._; TwitterUtils.createStream(ssc, None)
03 An uber JAR needs to be generated with all dependencies.
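A short sketch of the Twitter DStream in use. It assumes the Twitter credentials are already configured through twitter4j system properties, and the hashtag extraction is only an illustration.

import org.apache.spark.streaming.twitter._

// Create a DStream of tweets; None means the default twitter4j OAuth configuration is used
val tweets = TwitterUtils.createStream(ssc, None)

// Illustrative use: pull out hashtags and print them for each batch
val hashtags = tweets.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))
hashtags.print()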
Assisted Practice
Problem Statement: In this demonstration, you will learn how to process Twitter streaming data and perform sentiment analysis.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Structured Spark Streaming
Introduction to Spark Structured Streaming
Structured Streaming is a high-level streaming API that is built on the Spark SQL engine. It belongs to the structured (high-level) APIs.
Limitations of DStreams:
• The DStreams API is different from the RDD API
• Unreliable streaming
Advantages of Spark Structured Streaming
• Easy to use
• Better performance through SQL optimizations
• One unified API for both batch and streaming sources
The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended: a streaming unbounded table, in contrast to a static bounded table.
Batch vs. Streaming
The new Structured Streaming API enables you to easily adapt the batch jobs that you have already written to deal with a stream of data.
Scenario: Banking transaction records containing the account number and transaction amount are coming in a stream (a sample code sketch follows below).
Advantages:
• New data in the data stream = new rows appended to an unbounded table
• Allows developers to focus on the business logic of the application rather than the infrastructure-related aspects
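A minimal sketch of the banking scenario, assuming the records arrive as CSV files in a directory; the schema, column names, directory path, and the running-total aggregation are all illustrative assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("TxnStream").getOrCreate()

// Assumed schema for the incoming transaction records
val schema = new StructType()
  .add("account_number", StringType)
  .add("amount", DoubleType)

// Read files arriving in a directory as an unbounded table (path is an assumption)
val txns = spark.readStream.schema(schema).csv("/data/transactions")

// The same batch-style code: a running total per account
val totals = txns.groupBy("account_number").sum("amount")

totals.writeStream.outputMode("complete").format("console").start().awaitTermination()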
Spark Streaming vs. Spark Structured Streaming
• Spark Streaming: The API works only with batch.
• Spark Structured Streaming: The API is the same, so you can write to the same data destination and can also read it back.
Structured Streaming Architecture, Model, and Its Components
Structured Streaming Architecture
Structured Streaming treats all the arriving data as an unbounded input table.
Every time there is a new item in the stream, it gets appended as a row in the input table
at the bottom.
Figure: Data streaming into the system is appended to the target table: at times t0, t1, and t2, new stream data gets appended as rows of the unbounded table.
Structured Streaming Model
Figure: With a trigger of every 1 second, the input at each trigger contains the data up to t=1, t=2, and t=3 respectively, and the output is generated in complete mode.
Components of Structured Streaming Model
01 Input Table
02 Trigger
03 Incremental Queries
04 Result
05 Output Mode
Output Modes
• Complete: The entire updated result table is written to the sink after every trigger.
• Append: Only the new rows appended to the result table since the last trigger are written to the sink.
• Update: Only the rows that were updated in the result table since the last trigger are written to the sink.
Output Sinks
Foreach sink:
writeStream
.foreach(...)
.start()
Console sink:
writeStream
.format("console")
.start()
Memory sink:
writeStream
.format("memory")
.queryName("tableName")
.start()
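To make the foreach sink concrete, here is an illustrative sketch only; the streamingDF name and the println stand-in for a real external write are assumptions.

import org.apache.spark.sql.{ForeachWriter, Row}

val query = streamingDF.writeStream.foreach(new ForeachWriter[Row] {
  def open(partitionId: Long, epochId: Long): Boolean = true   // open a connection for this partition
  def process(row: Row): Unit = println(row)                   // send one row to the external system
  def close(errorOrNull: Throwable): Unit = {}                 // release the connection
}).start()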
Structured Streaming APIs
Features of Structured Streaming APIs
Example
val socketDF = spark
  .readStream
  .format("socket")   // reading data from a socket (socket data source)
  .option("host", "localhost")
  .option("port", 9999)
  .load()
Data Sources
File source
Example
Reading data from JSON file
val inputDF = spark.readStream.json("s3://logs")
Operations on Streaming DataFrames and Datasets
Example
// Select the persons who are older than 60
df.select("name").where("age > 60")   // using untyped APIs
ds.filter(_.age > 60).map(_.name)     // using typed APIs
// Running count of the number of occurrences of each value
df.groupBy("value").count()           // using untyped API
Parsing Data with Schema Inference
Example
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.functions._

case class Employee(
  name: String,
  city: String,
  country: String,
  age: Option[Int]
)

// Step 1: Create a schema for parsing the data
val caseSchema = ScalaReflection.schemaFor[Employee].dataType.asInstanceOf[StructType]

// Step 2: Pass the schema to the stream
val empStream = spark.readStream
  .schema(caseSchema)
  .option("header", true)
  .option("maxFilesPerTrigger", 1)
  .csv("data/people.*")
  .as[Employee]

// Step 3: Write the results to the screen
empStream.writeStream.outputMode("append").format("console").start
Constructing Columns in Structured Streaming
Example
(empStream.select($"country" === "France" as "in_France", $"age" < 35 as "under_35",
  'country startsWith "U" as "U_country")
  .writeStream.outputMode("append").format("console").start)
groupBy and Aggregation
Example
(empStream.groupBy('country).mean("age")
  .writeStream.outputMode("complete").format("console").start)
Example
(empStream.groupBy('country).agg(first("country") as "country", count("age"))
  .writeStream.outputMode("complete").format("console").start)
Joining Structured Stream with Datasets
Streaming DataFrames can be joined with static DataFrames to create a new streaming DataFrame.
Example
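Since the original example is not reproduced here, the following is only a small sketch under assumed names: empStream from the earlier examples and a hypothetical static countryDF keyed by a country column.

// Static reference data (path and columns are assumptions)
val countryDF = spark.read.option("header", true).csv("data/countries.csv")

// Join the streaming Dataset with the static DataFrame on the shared "country" column
val enriched = empStream.join(countryDF, "country")

(enriched.writeStream.outputMode("append").format("console").start)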
Windowed operations are running aggregations over data bucketed by time windows.
WordCount Example: windowed grouped aggregation with 10-minute windows
12:05-12:15   Big     1
12:05-12:15   Spark   1
12:05-12:15   Data    1
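An illustrative sketch of such a windowed count; it assumes a words streaming DataFrame with timestamp and word columns, spark.implicits._ in scope, and a 5-minute slide, none of which are stated in the original.

import org.apache.spark.sql.functions.window

// Group words into 10-minute windows that slide every 5 minutes and count each word per window
val windowedCounts = words.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count()

(windowedCounts.writeStream.outputMode("complete").format("console").start)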
Structured Streaming allows recovery from failures, which is achieved by using checkpointing and
WAL.
Configure a query with a checkpoint location, and the query will save all the progress information and the running aggregates to the checkpoint location.
The checkpoint location should be a path in an HDFS compatible file system, and can be set as an
option in the DataStreamWriter when starting a query.
Example
callsFromParis.writeStream
  .format("parquet")
  .option("checkpointLocation", "hdfs://nn:8020/mycheckloc")
  .start("/home/Spark/streaming/output")
Use Cases
Problem Statement: In this demonstration, you will learn how to create a streaming pipeline.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Key Takeaways
a. Lambda architecture
b. Kappa architecture
c. Both A and B
a. Twitter
b. Kafka
c. Flume
Problem Statement:
Alibaba is an e-commerce website that sells products online across different categories in different countries. Festivals are coming up, and the company wants to make sure that it hits its target revenue. To do so, it has decided to provide a dynamic trending banner where users can see the trending categories and brands, which helps users decide which brand's product to purchase and shows which categories people are purchasing from the most. This will also help the company keep enough inventory to fulfil all orders.
Currently, their system uses Hadoop MapReduce, which does not provide the trending status in real time. They have hired you as a big data engineer to modify the existing code and write optimized code that will work for any time duration.
For example, trending brands in the last 5 minutes.
You have been given transactions.csv file which contains the below fields:
1. Product Code
2. Description
3. Brand
4. Category
5. Sub Category
Lesson-End-Project
1. The top 5 trending categories in the last 5 minutes which have the maximum number of orders
2. The bottom 5 brands in the last 10 minutes which have the least number of orders
3. Product units sold in the last 10 minutes
Note:
A graph is a structure that consists of a set of objects that are related to each other. The relations between them are represented using vertices and edges.
Figure: Example graphs on vertices A and B: an undirected graph, an edge-labeled graph, a cyclic graph, and a disconnected graph.
GraphX in Spark
Spark GraphX
GraphX is a graph computation system that runs on top of the Spark data-parallel framework. It is a new component in Spark for graphs and graph-parallel computation.
Features of Spark GraphX
GraphX is built on Spark's RDD abstraction, which is fault-tolerant, distributed, and immutable.
GraphX: Example
Figure: A property graph with vertices 1 (Suzan, age 48), 2 (Alice, age 65), and 3 (Sansa, age 55), connected by weighted edges.
Implementation of GraphX
Importing classes:
Example:
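The original example is not reproduced here; the following is a minimal sketch that builds a graph like the one in the figure above. The names and ages come from the figure, the edge weights are assumptions, and sc is assumed to be an existing SparkContext.

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Vertices: (id, (name, age))
val vertices: RDD[(VertexId, (String, Int))] = sc.parallelize(Seq(
  (1L, ("Suzan", 48)), (2L, ("Alice", 65)), (3L, ("Sansa", 55))
))

// Edges carrying an integer weight (weights are assumptions)
val edges: RDD[Edge[Int]] = sc.parallelize(Seq(
  Edge(1L, 2L, 7), Edge(2L, 3L, 3)
))

val graph = Graph(vertices, edges)
println(s"Vertices: ${graph.vertices.count()}, Edges: ${graph.edges.count()}")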
Property Operators: mapVertices, mapEdges, mapTriplets
Structural Operators: reverse, subgraph, mask, groupEdges
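A small illustrative sketch of two of these operators applied to the graph built above; the age predicate is an assumption.

// reverse: flip the direction of every edge
val reversed = graph.reverse

// subgraph: keep only the vertices whose age is greater than 50 (and the edges between them)
val over50 = graph.subgraph(vpred = (id, attr) => attr._2 > 50)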
The join operators are used to join data from external collections (RDDs) with the graph.
joinVertices() outerJoinVertices()
joinVertices Operator
joinVertices is an operator that joins the vertices with the input RDD and returns a new graph with the vertex properties updated by the user-defined function.
In the outerJoinVertices operator, the user-defined map function is applied to all vertices and can change the vertex property type.
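For illustration only, a sketch that attaches each vertex's out-degree to its properties using outerJoinVertices, continuing with the graph from the example above.

// Out-degree of each vertex, as a VertexRDD[Int]
val degrees = graph.outDegrees

// Add the degree to the vertex attribute; vertices with no outgoing edges get 0
val graphWithDegrees = graph.outerJoinVertices(degrees) {
  (id, attr, degOpt) => (attr, degOpt.getOrElse(0))
}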
Neighborhood Aggregation
Neighborhood aggregation is a key task in graph analytics that involves aggregating information about the neighborhood of each vertex.
graph.mapReduceTriplets   graph.aggregateMessages
aggregateMessages is the core aggregation operation in GraphX. It applies a user-defined sendMsg function to each edge triplet in the graph.
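A minimal sketch of aggregateMessages that counts the in-neighbors of each vertex, using the graph built above; the choice of message and merge functions is only illustrative.

// Send the value 1 to the destination of every edge, then sum the messages per vertex
val inNeighborCounts: VertexRDD[Int] = graph.aggregateMessages[Int](
  triplet => triplet.sendToDst(1),   // sendMsg: runs on each edge triplet
  (a, b) => a + b                    // mergeMsg: combines messages arriving at the same vertex
)
inNeighborCounts.collect().foreach(println)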
However, these frameworks alone cannot address the data ETL and process analysis issues.
Algorithms in Spark
PageRank Algorithm
PageRank is an iterative algorithm. On each iteration, a page contributes to each of its neighbors its own rank divided by the number of its neighbors.
Figure: Pages 2, 3, and 4, each starting with a rank of 1.0.
PageRank with Social Media Network
GraphX includes a social network dataset on which we can run the PageRank algorithm.
import org.apache.spark.graphx.GraphLoader
// Load the edges as a graph
val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")
// Run PageRank
val ranks = graph.pageRank(0.0001).vertices
// Join the ranks with the usernames
val users = sc.textFile("data/graphx/users.txt").map { line =>
val fields = line.split(",")
(fields(0).toLong, fields(1))
}
val ranksByUsername = users.join(ranks).map {
case (id, (username, rank)) => (username, rank)
}
// Print the result
println(ranksByUsername.collect().mkString("\n"))
Connected Components
The connected components algorithm labels each connected component of the graph.
import org.apache.spark.graphx.GraphLoader
// Load the graph
val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")
// Find the connected components
val cc = graph.connectedComponents().vertices
// Join the connected components with the usernames
val users = sc.textFile("data/graphx/users.txt").map { line =>
val fields = line.split(",")
(fields(0).toLong, fields(1))
}
val ccByUsername = users.join(cc).map {
case (id, (username, cc)) => (username, cc)
}
// Print the result
println(ccByUsername.collect().mkString("\n"))
Triangle Counting
Triangle counting is an algorithm that determines the number of triangles passing through each vertex, providing a measure of clustering.
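In the same style as the PageRank and connected-components examples above, a short sketch of triangle counting; the input path simply follows those examples.

import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}
// Load the edges in canonical order and partition the graph for triangle counting
val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt", canonicalOrientation = true)
  .partitionBy(PartitionStrategy.RandomVertexCut)
// Find the triangle count for each vertex
val triCounts = graph.triangleCount().vertices
println(triCounts.collect().mkString("\n"))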
Assisted Practice 1
Problem Statement: In this demonstration, you will understand the working of the PageRank algorithm.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the username and password that is generated. Click on the Launch Lab button. On the page that appears, enter the username and password in the respective fields, and click Login.
Pregel API
The Pregel operator in GraphX takes the following as inputs:
• The initial message to start the computation
• The max number of supersteps for the Pregel API
• The edge direction
Figure: Each vertex runs a vertexProgram and exchanges messages (Msgc, Msgd, Msge, Msgf) with its neighboring vertices along the edges.
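A hedged sketch of the Pregel operator computing single-source shortest paths. The source vertex id is an assumption, and weightedGraph is a hypothetical graph whose edge attribute is a Double edge weight.

import org.apache.spark.graphx._

val sourceId: VertexId = 1L
// Initialize every vertex distance: 0 for the source, infinity elsewhere
val initialGraph = weightedGraph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),            // vertex program
  triplet => {                                               // send messages along shorter paths
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else Iterator.empty
  },
  (a, b) => math.min(a, b)                                   // merge incoming messages
)
println(sssp.vertices.collect().mkString("\n"))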
Use Case of GraphX
Assisted Practice 1
Problem Statement: In this demonstration, you will work on a social media real-world problem to understand GraphX.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the username and password that is generated. Click on the Launch Lab button. On the page that appears, enter the username and password in the respective fields, and click Login.
Key Takeaways
a. Edges
b. Vertices
c. Triplets
a. joinVertices()
b. outerJoinVertices()
c. Both a and b
joinVertices() joins the vertices with the input RDD and returns a new graph with the vertex
properties.
Knowledge Check 3
Which of the following structural operators constructs a subgraph by returning a graph that contains the vertices and edges that are also found in the input graph?
a. reverse
b. subgraph
c. groupEdges
d. mask
The collected data is in CSV format (flights_graph.csv) and contains the following fields:
a. Airline
b. Flight_Number
c. Origin_Airport
d. Destination_Airport
e. Distance
f. Arrival_Delay
g. Arrival_Time
h. Diverted
i. Cancelled
You have been hired as a big data consultant to provide important insights. You must write a Spark job using its graph component and use the above data to provide the following insights:
Projects for submission:
1. Employee Review Analysis
3. Car Insurance Analysis
Projects for practice:
2. NYSE Data Analysis
4. Transactional Data Analysis
Employee Review Analysis
Objective
Analyze employee review data and
provide actionable insights to the HR
team for taking corrective actions.
Problem Statement
To improve the employer-employee relationship, feedback and sentiments from current and former employees have been scraped from Glassdoor.
NYSE Data Analysis
Objective
Use Hive features for data engineering
and analysis to share actionable insights.
Problem Statement
New York stock exchange data comprises
intra-day stock prices and volume traded
for each listed company.
Car Insurance Analysis
Objective
Perform exploratory analysis to
understand the relationship between
multiple features and predict claims.
Problem Statement
A car insurance company wants to analyze its historical data to predict the probability of a customer making a claim.
Transactional Data Analysis
Objective
Use the Big Data stack for data
engineering and analysis of
transactional data logs.
Problem Statement
Amazon wants to increase their mobile
sales by a certain percentage.