
BDA

Unit – 4
HBase, Pig and ZooKeeper
By: Urvi Dhamecha

HBase
• HBase is a scalable, distributed, column-oriented database built on top of Hadoop and HDFS.
• It is an open-source implementation modeled on Google's Bigtable.
• It is a part of the Hadoop ecosystem that provides random, real-time read/write access to data in the Hadoop file system.

HDFS vs. HBase

HDFS: a Java-based distributed file system.
HBase: a Hadoop database that runs on top of HDFS.

HDFS: highly fault-tolerant and cost-effective.
HBase: partially tolerant and highly consistent.

HDFS: provides only sequential read/write operations.
HBase: random access is possible due to its hash-table-like lookup.

HDFS: based on write once, read many times.
HBase: supports random read and write operations into the file system.

HDFS: has a rigid architecture.
HBase: supports dynamic changes.

HDFS: preferable for offline batch processing.
HBase: preferable for real-time processing.

HDFS: provides high latency for access operations.
HBase: provides low-latency access to small amounts of data.

Row-oriented vs. Column-oriented
Row-oriented data stores
• Data is stored and retrieved one row at a time and
hence could read unnecessary data if only some of
the data in a row is required.
• Easy to read and write records
• Well suited for OLTP systems
• Not efficient in performing operations applicable to
the entire dataset and hence aggregation is an
expensive operation

Row-oriented vs. Column-oriented
Column-oriented data stores
• Data is stored and retrieved in columns and hence
can read only relevant data if only some data is
required
• Read and Write are typically slower operations
• Well suited for OLAP systems
• Can efficiently perform operations applicable to the
entire dataset and hence enables aggregation over
many rows and columns

HBase Data Model
• The data model in Hbase is designed to
accommodate semi-structured and
unstructured data that could vary in field size,
data type and columns.
• The design of the data model makes it easier
to partition the data and distributed it across
the cluster.

HBase Data Model

Rowkey | Column Family        | Column Family        | Column Family
       | Col1   Col2   Col3   | Col1   Col2   Col3   | Col1   Col2   Col3

HBase Data Model
• Tables: Data is stored in tables in HBase, but here the tables are in column-oriented format.
• Row Key: Row keys are used to search records, which makes searches fast.
• Column Families: Various columns are combined into a column family. These column families are stored together, which makes searching faster, because data belonging to the same column family can be accessed together in a single seek.
• Column Qualifiers: Each column's name is known as its column qualifier.

Urvi Dhamecha
HBase Data Model
• Cell: Data is stored in cells. The data is written into cells, each of which is uniquely identified by its row key and column qualifier.
• Timestamp: A timestamp is a combination of date and time. Whenever data is stored, it is stored along with its timestamp. This makes it easy to search for a particular version of the data.
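
The pieces above map directly onto the HBase Java client API. Below is a minimal sketch of writing and reading one cell; the table name "employee", column family "personal", and qualifier "name" are illustrative assumptions, not part of the slides.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DataModelSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("employee"))) {
                // A cell is addressed by (row key, column family, column qualifier);
                // HBase attaches a timestamp to the cell automatically.
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"),
                              Bytes.toBytes("Raja"));
                table.put(put);

                // Read the same cell back through its row key and qualifier.
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                byte[] value = result.getValue(Bytes.toBytes("personal"),
                                               Bytes.toBytes("name"));
                System.out.println(Bytes.toString(value));
            }
        }
    }

The same addressing scheme (row key, then column family, then column qualifier) is what allows data in one column family to be fetched in a single seek.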

HBase Architecture
HBase architecture has 4 main components:
1) HMaster
2) Region Server
3) Regions
4) Zookeeper

HBase Architecture
HBase Architecture: Region

• HBase tables are divided into a number of regions in such a way that all the columns of a column family are stored in one region.
• Each region contains the rows in sorted order.
• A table can be divided into a number of regions.
• A region has a default size of 256 MB, which can be configured according to need.
• A group of regions is served to the clients by a Region Server.
• A Region Server can serve approximately 1,000 regions to the client.

HBase Architecture
HBase Architecture: HMaster
• HMaster performs DDL operations (create and delete tables) and assigns regions to the Region Servers.
• It coordinates and manages the Region Servers (similar to how the NameNode manages DataNodes in HDFS).
• It assigns regions to the Region Servers on startup and re-assigns regions to Region Servers during recovery and load balancing.
• It monitors all the Region Server instances in the cluster (with the help of ZooKeeper) and performs recovery activities whenever any Region Server is down.
• It provides an interface for creating, deleting, and updating tables.

HBase Architecture
HBase Architecture: Zookeeper – The Coordinator
• ZooKeeper acts as a coordinator inside the HBase distributed environment. It helps maintain server state inside the cluster by communicating through sessions.
• Every Region Server, along with the HMaster server, sends a continuous heartbeat at regular intervals to ZooKeeper, which checks which servers are alive and available. ZooKeeper also provides server-failure notifications so that recovery measures can be executed.
• There is also an inactive HMaster server, which acts as a backup for the active one. If the active server fails, it comes to the rescue.

HBase Architecture
HBase Architecture: Zookeeper – The Coordinator
• The active HMaster sends heartbeats to ZooKeeper, while the inactive HMaster listens for notifications sent by the active HMaster. If the active HMaster fails to send a heartbeat, its session is deleted and the inactive HMaster becomes active.
• If a Region Server fails to send a heartbeat, its session expires and all listeners are notified about it. The HMaster then performs suitable recovery actions.

HBase Architecture
HBase Architecture: Region Server
Components of a Region Server are:
• WAL: The Write Ahead Log (WAL) is a file attached to every Region Server inside the distributed environment. The WAL stores new data that hasn't yet been persisted or committed to permanent storage. It is used to recover data sets in case of failure.

• Block Cache: The Block Cache resides at the top of the Region Server. It keeps frequently read data in memory. When data in the BlockCache is least recently used, it is evicted from the BlockCache.

HBase Architecture
HBase Architecture: Region Server

• MemStore: This is the write cache. It stores all incoming data before committing it to disk or permanent storage. There is one MemStore for each column family in a region, so a region has multiple MemStores when it contains multiple column families.

• HFile: HFiles are stored on HDFS and hold the actual cells on disk. The MemStore commits its data to an HFile when the size of the MemStore exceeds its threshold.

HBase Write Mechanism
Step 1: Whenever the client has a write request, the client writes the data to the WAL (Write Ahead Log).
Step 2: Once the data is written to the WAL, it is copied to the MemStore.
Step 3: Once the data is placed in the MemStore, the client receives the acknowledgment.
Step 4: When the MemStore reaches its threshold, it dumps or commits the data into an HFile.
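
From the client's point of view, all four steps happen inside a single put call. A small sketch, reusing the table handle and imports from the data model example; the explicit durability setting (org.apache.hadoop.hbase.client.Durability) is part of the standard client API:

    Put put = new Put(Bytes.toBytes("row1"));
    put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("city"),
                  Bytes.toBytes("Rajkot"));
    // SYNC_WAL asks the Region Server to sync the WAL before acknowledging,
    // which is what lets MemStore contents be recovered after a crash.
    put.setDurability(Durability.SYNC_WAL);
    table.put(put);   // returns once Steps 1-3 have completed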

HBase Read Mechanism
• For reading data, the scanner first looks for the row cell in the Block Cache, where all the recently read key-value pairs are stored.
• If the scanner fails to find the required result there, it moves to the MemStore, the write cache, and searches for the most recently written data that has not yet been flushed to an HFile.
• Finally, it uses Bloom filters and the Block Cache to load the data from the HFiles.
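
On the client side this whole lookup path hides behind a single get; whether the answer comes from the Block Cache, the MemStore, or an HFile is decided by the Region Server. A sketch, again reusing the table handle from the data model example:

    Get get = new Get(Bytes.toBytes("row1"));
    // Keep the blocks this read touches in the Block Cache so that
    // repeated reads of nearby cells become cache hits.
    get.setCacheBlocks(true);
    Result result = table.get(get);
    byte[] city = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("city"));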

Advanced Indexing (Self Study)
• Advanced indexing techniques in HBase involve leveraging strategies such as secondary indexes, coprocessors, and external systems like Apache Phoenix.

Description of the advanced indexing techniques in HBase:

Primary Index (Row Key):
• HBase natively uses the row key as the primary index. All data in HBase is stored in lexicographically sorted order based on this row key. This allows fast lookups when querying by row key but makes querying by other columns slow.

Advanced Indexing
Secondary Index Table (Manual Indexing):
• You can create a secondary index by manually maintaining an index table. In this table, the row key is the value of the column you want to index (e.g., a "name" or "email" column), and the value or reference points to the original table's row key.
• When a query is executed for a non-primary-key column, the index table is searched first to retrieve the corresponding row key of the main table, allowing for faster querying.
Create a secondary index table manually. For example, to index a "name" column (see the Java sketch below):
• put 'index_table', 'name_value', 'cf:row_key', 'original_table_row_key'
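
The same dual write can be done from application code with the HBase Java client. A sketch of the idea, assuming dataTable and indexTable are open Table handles and using illustrative family and qualifier names:

    // 1. Write the real row into the main table.
    Put data = new Put(Bytes.toBytes("user123"));
    data.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("Raja"));
    dataTable.put(data);

    // 2. Mirror the indexed value: the index row key is the column value,
    //    and the cell stores the main table's row key.
    Put index = new Put(Bytes.toBytes("Raja"));
    index.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("row_key"), Bytes.toBytes("user123"));
    indexTable.put(index);

    // 3. To query by name, hit the index first, then fetch the real row.
    Result hit = indexTable.get(new Get(Bytes.toBytes("Raja")));
    byte[] mainKey = hit.getValue(Bytes.toBytes("cf"), Bytes.toBytes("row_key"));
    Result row = dataTable.get(new Get(mainKey));

Note that the two puts are not atomic; keeping the index consistent across failures is exactly the complexity that systems like Apache Phoenix handle for you.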

PIG
• Apache Pig is an open-source platform for exploring large data sets.
• Pig provides an engine for executing data flows in parallel on Hadoop.
• Apache Pig provides a high-level language.
• Pig programs support a parallelization mechanism.
• Pig consists of two main parts:
• Pig Latin: the language for expressing data flows.
• Pig Engine: the execution environment to run Pig Latin programs.

Apache Pig vs. MapReduce

1. MapReduce is a data processing language; Pig is a data flow language.
2. In MapReduce, the job is written as map and reduce functions; Pig converts the query into MapReduce functions.
3. MapReduce is a low-level language; Pig is a high-level language.
4. In MapReduce, it is difficult for the user to perform join operations; Pig makes it easy for the user to perform join operations.
5. In MapReduce, the user has to write about 10 times more lines of code to perform a similar task than in Pig; Pig needs fewer lines of code because it supports the multi-query approach.
6. MapReduce has several jobs, so execution time is higher; Pig takes less compilation time, as the Pig operators are converted into MapReduce jobs.
7. MapReduce is supported by recent versions of Hadoop; Pig is supported with all versions of Hadoop.
Features of Pig
• Rich set of operators: Pig provides many operators to perform operations like join, sort, filter, etc.
• Ease of programming: Pig Latin is similar to SQL, and it is easy to write a Pig script if you are good at SQL.
• Optimization opportunities: Tasks in Apache Pig optimize their execution automatically, so the programmer needs to focus only on the semantics of the language.
• Extensibility: Using the existing operators, users can develop their own functions to read, process, and write data.
• UDFs: Pig provides the facility to create user-defined functions in other programming languages such as Java, and to invoke or embed them in Pig scripts.
• Handles all kinds of data: Apache Pig analyzes all kinds of data, both structured and unstructured.
Case Study of Twitter
Counting operations:
• How many requests does Twitter serve in a day?
• What is the average latency of the requests?
• How many searches happen each day on Twitter?
• How many unique queries are received?
• How many unique users come to visit?
• What is the geographic distribution of the users?

Correlating big data:
• How does usage differ for mobile users?
• What goes wrong when a site problem occurs?
• Which features do users use most often?
• Search corrections and search suggestions.

Case Study of Twitter
Research on big data to produce better outcomes, such as:
• What can Twitter analyze about users from their tweets?
• Who follows whom, and on what basis?
• What is the ratio of followers to following?
• What is the reputation of the user?

Case Study of Twitter
Case: We want to analyze how many tweets are stored per user.
By MapReduce:
• The MapReduce program first takes the key as rows and sends the tweet table information to the mapper function.
• The mapper function then selects the user id and associates a unit value (i.e., 1) with every user id.
• The shuffle function sorts the same user ids together. Finally, the reduce function adds up the number of tweets belonging to the same user.
• The output is the user id, combined with the user name and the number of tweets per user.

Case Study of Twitter
By Pig:
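
A sketch of the same computation in Pig Latin, embedded in Java through the PigServer API; the input paths and field names are illustrative assumptions:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class TweetsPerUser {
        public static void main(String[] args) throws Exception {
            PigServer pig = new PigServer(ExecType.MAPREDUCE);
            // Load the tweet table, group by user id, and count tweets per user.
            pig.registerQuery("tweets = LOAD '/data/tweets' AS (user_id:chararray, tweet:chararray);");
            pig.registerQuery("grouped = GROUP tweets BY user_id;");
            pig.registerQuery("counts = FOREACH grouped GENERATE group AS user_id, COUNT(tweets) AS n;");
            // Join with the user table so the output carries the user name too.
            pig.registerQuery("users = LOAD '/data/users' AS (user_id:chararray, name:chararray);");
            pig.registerQuery("result = JOIN counts BY user_id, users BY user_id;");
            pig.store("result", "/output/tweets_per_user");
        }
    }

Pig compiles these few statements into the same map, shuffle, and reduce phases described above, without the user writing them by hand.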

Pig architecture
Pig Latin Scripts
• First, we submit Pig scripts to the Apache Pig execution environment; these scripts can be written in Pig Latin using built-in operators.
• There are three ways to execute a Pig script:
• Grunt shell: Pig's interactive shell, provided to execute all Pig scripts.
• Script file: Write all the Pig commands in a script file and execute the Pig script file. This is executed by the Pig server.
• Embedded script: If some functions are unavailable in built-in operators, we can programmatically create user-defined functions in other languages like Java, Python, or Ruby to provide that functionality, embed them in the Pig Latin script file, and then execute that script file.

Apache Pig Components
Parser
• Initially the Pig Scripts are handled by the Parser. It checks
the syntax of the script, does type checking, and other
miscellaneous checks.
• The output of the parser will be a DAG (directed acyclic
graph), which represents the Pig Latin statements and
logical operators.
• In the DAG, the logical operators of the script are
represented as the nodes and the data flows are
represented as edges.

Apache Pig Components
Optimizer
• The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown.
Compiler
• The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine
• Finally, the MapReduce jobs are submitted to Hadoop in sorted order, where they are executed to produce the desired results.
PIG Data Types

[Table of Pig data types; the atomic and complex types are described in the Pig Latin Data Model slides that follow]
Pig Latin Data Model
• The data model of Pig Latin is fully nested and it
allows complex non-atomic datatypes such
as map and tuple.

Pig Latin Data Model
Atom
• Any single value in Pig Latin, irrespective of its datatype, is known as an atom.
• It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig.
• A piece of data or a simple atomic value is known as a field.
• Example − ‘raja’ or ‘30’

Pig Latin Data Model
Tuple
• A record that is formed by an ordered set of fields is
known as a tuple, the fields can be of any type. A
tuple is similar to a row in a table of RDBMS.
• Example − (Raja, 30)

Pig Latin Data Model
Bag
• A bag is an unordered set of tuples. In other words, a
collection of tuples (non-unique) is known as a bag. Each
tuple can have any number of fields (flexible schema). A bag is
represented by ‘{}’. It is similar to a table in RDBMS, but unlike
a table in RDBMS, it is not necessary that every tuple contain
the same number of fields or that the fields in the same
position (column) have the same type.
• Example − {(Raja, 30), (Mohammad, 45)}
• A bag can be a field in a relation; in that context, it is known as an inner bag.
• Example − {Raja, 30, {9848022338, [email protected]}}

Pig Latin Data Model
Map
• A map (or data map) is a set of key-value pairs. The key needs
to be of type chararray and should be unique. The value might
be of any type. It is represented by ‘[]’
• Example − [name#Raja, age#30]
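
All four types can appear together in a load schema. A small example, reusing the PigServer handle from the Twitter sketch; the file path and field names are illustrative:

    // Atoms (name, age), a bag of tuples (contacts), and a map (props).
    pig.registerQuery(
        "people = LOAD '/data/people' AS ("
        + "name:chararray, age:int, "
        + "contacts:bag{t:tuple(phone:chararray)}, "
        + "props:map[]);");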

PIG Run Modes
Apache Pig executes in two modes:
• Local Mode and
• Map Reduce Mode

PIG Run Modes
Local Mode
• It executes in a single JVM and is used for development, experimentation, and prototyping.
• Here, files are installed and run from the local host.
• Local mode works on the local file system; the input and output data are stored in the local file system.

PIG Run Modes
MapReduce Mode
• MapReduce mode is also known as Hadoop mode.
• It is the default mode.
• In this mode, Pig renders Pig Latin into MapReduce jobs and executes them on the cluster.
• It can be executed against a semi-distributed or fully distributed Hadoop installation.
• Here, the input and output data are present on HDFS.
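
With the PigServer API the mode is chosen when the server is created; on the command line the same choice is made with the -x flag (pig -x local or pig -x mapreduce). A sketch, with imports as in the earlier Twitter example:

    // Local mode: a single JVM working against the local file system.
    PigServer local = new PigServer(ExecType.LOCAL);
    // MapReduce mode (the default): Pig Latin is compiled into MapReduce
    // jobs that run on the cluster and read/write HDFS.
    PigServer cluster = new PigServer(ExecType.MAPREDUCE);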

Zookeeper
• Apache Zookeeper is a distributed, open-source coordination
service for distributed systems.
• It provides a central place for distributed applications to store
data, communicate with one another, and coordinate
activities.
• Zookeeper is used in distributed systems to coordinate
distributed processes and services.
• It provides a simple, tree-structured data model, a simple API,
and a distributed protocol to ensure data consistency and
availability.
• Zookeeper is designed to be highly reliable and fault-tolerant,
and it can handle high levels of read and write throughput.

Why do we need ZooKeeper?
• Coordination services: the integration/communication of services in a distributed environment.
• Coordination services are complex to get right; they are especially prone to errors such as race conditions and deadlock.
• Race condition: two or more systems trying to perform the same task at the same time.
• Deadlock: two or more operations waiting for each other.
• To make coordination between distributed environments easy, developers came up with ZooKeeper, so that distributed applications are relieved of the responsibility of implementing coordination services from scratch.
ZooKeeper Architecture
ZooKeeper Ensemble

[Diagram: an ensemble of ZooKeeper servers, with one elected leader and the remaining servers acting as followers, and multiple clients connected to each server]

• All servers store a copy of the data (in memory).
• A leader is elected at startup.
• Followers service clients; all updates go through the leader.
• Update responses are sent when a majority of servers have persisted the change.
ZooKeeper Architecture
• The ZooKeeper architecture consists of a hierarchy of
nodes called znodes, organized in a tree-like
structure.
• Each znode can store data and has a set of
permissions that control access to the znode.
• The znodes are organized in a hierarchical
namespace, similar to a file system. At the root of the
hierarchy is the root znode, and all other znodes are
children of the root znode.
• The hierarchy is similar to a file system hierarchy,
where each znode can have children and
grandchildren, and so on.
ZooKeeper Architecture
Important components in ZooKeeper:
Client:
• Clients, the nodes in our distributed application cluster, access information from the server. At regular intervals, every client sends a message to the server to let the server know that the client is alive.
• Similarly, the server sends an acknowledgement when a client connects. If there is no response from the connected server, the client automatically redirects the message to another server.

ZooKeeper Architecture
Server:
• A server, one of the nodes in our ZooKeeper ensemble, provides all the services to clients. It sends an acknowledgement to the client to inform it that the server is alive.
Ensemble:
• A group of ZooKeeper servers. The minimum number of nodes required to form an ensemble is 3.
Leader:
• The server node that performs automatic recovery if any of the connected nodes fails. Leaders are elected on service startup.
Follower:
• A server node that follows the leader's instructions.
ZooKeeper Data Model
• In Zookeeper, data is stored in a hierarchical namespace,
similar to a file system.
• Each node in the namespace is called a Znode, and it can
store data and have children.
• Znodes are similar to files and directories in a file system.
• Zookeeper provides a simple API for creating, reading,
writing, and deleting Znodes.
• It also provides mechanisms for detecting changes to the
data stored in Znodes, such as watches and triggers.
• Znodes maintain a stat structure that includes: Version
number, ACL, Timestamp, Data Length
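
A minimal sketch of that API using the standard ZooKeeper Java client; the connect string, znode path, and data values are illustrative assumptions:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ZnodeSketch {
        public static void main(String[] args) throws Exception {
            // Connect to the ensemble; the lambda is a watcher that ignores events.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});

            // Create a znode with data directly under the root of the hierarchy.
            zk.create("/app", "config-v1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Read it back; the Stat structure carries the version number,
            // timestamps, and data length mentioned above.
            Stat stat = new Stat();
            byte[] data = zk.getData("/app", false, stat);
            System.out.println(new String(data) + ", version=" + stat.getVersion());

            // Writes are versioned; passing -1 instead skips the version check.
            zk.setData("/app", "config-v2".getBytes(), stat.getVersion());
            zk.delete("/app", -1);
            zk.close();
        }
    }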

Node Types in Zookeeper
Persistence Znode
• By default, znodes are persistence znodes. These nodes stay alive even after the client that created them has disconnected.
Ephemeral Znode
• These nodes stay alive only as long as the client that created them is connected; when the client disconnects, they die. Ephemeral znodes are not allowed to have children.
Sequential Znode
• A sequential znode can be either a persistence znode or an ephemeral znode. When a node is created as a sequential znode, ZooKeeper sets its path by appending a monotonically increasing sequence number to the requested name (see the sketch below).
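
The three node types correspond to CreateMode values in the Java client. A fragment reusing the zk handle from the previous sketch; the paths are illustrative:

    // Persistent: survives the creating client's session.
    zk.create("/jobs", new byte[0],
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    // Ephemeral: deleted automatically when this session ends; cannot have children.
    zk.create("/jobs/worker", new byte[0],
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    // Sequential: ZooKeeper appends the counter and returns the actual path,
    // e.g. /jobs/task0000000007.
    String actual = zk.create("/jobs/task", new byte[0],
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);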

Sessions and Watches
Sessions
• A session is a time interval assigned to every client for receiving service. Every client is given a session ID, and requests within a session are serviced in sequential order. Every client sends heartbeats to the server to keep the session valid; if the server does not receive a heartbeat for longer than the session timeout, it considers the client dead.
Watches
• Watches are simply notifications to the client. Whenever there is a change in the ensemble, the client receives a notification from the ensemble about that change, in the form of a watch.
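
In the Java client, a watch is registered as part of a read call. A small fragment; note that ZooKeeper watches are one-shot, so after firing they must be re-registered:

    // Register a watch on /app: the callback fires once, when the znode's
    // data changes or the znode is deleted.
    zk.exists("/app", event ->
            System.out.println("Watch fired: " + event.getType()
                               + " on " + event.getPath()));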

Benefits and Challenges of Zookeeper
Benefits:
Manage configuration across nodes
• If you have dozens or hundreds of nodes, it becomes hard to
keep configuration in sync across nodes and quickly make
changes. ZooKeeper helps you quickly push configuration
changes.
Implement reliable messaging
• With ZooKeeper, you can easily implement a
producer/consumer queue that guarantees delivery, even if
some consumers or even one of the ZooKeeper servers fails.

Benefits and Challenges of Zookeeper
Benefits:
Implement redundant services
• With ZooKeeper, a group of identical nodes (e.g. database
servers) can elect a leader/master and let ZooKeeper refer all
clients to that master server. If the master fails, ZooKeeper
will assign a new leader and notify all clients.
Synchronize process execution
• With ZooKeeper, multiple nodes can coordinate the start and
end of a process or calculation. This ensures that any follow-
up processing is done only after all nodes have finished their
calculations.
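
The leader-election pattern above is typically built from ephemeral sequential znodes. A minimal sketch of the idea, with error handling and watch re-registration omitted; the /election path is illustrative:

    // Each candidate creates an ephemeral sequential znode under /election.
    String me = zk.create("/election/n_", new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

    // The candidate owning the lowest sequence number is the leader.
    java.util.List<String> children = zk.getChildren("/election", false);
    java.util.Collections.sort(children);
    boolean isLeader = me.endsWith(children.get(0));

    // If the leader's session dies, its ephemeral znode disappears,
    // the others are notified, and a new leader emerges.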

Benefits and Challenges of Zookeeper
Challenges:
• Why is coordination in a distributed system a hard problem?
• Coordination or configuration management for a distributed application that has many systems is difficult.
• A master node stores the cluster data, and worker (slave) nodes get the data from this master node.
• The master node is a single point of failure.
• Synchronization is not easy.
• Careful design and implementation are needed.

End of Unit - 4
