Unit 4
Frameworks and Applications: Frameworks: Applications on Big Data Using Pig and Hive, Data
processing operators in Pig, Hive services, HiveQL, Querying Data in Hive, Fundamentals of
HBase and ZooKeeper
Data Integration: Hive acts as a bridge between various data sources and formats within the big
data ecosystem. It seamlessly integrates data from relational databases, NoSQL stores, and flat files,
enabling a unified view for analysis.
Machine Learning: Hive plays a preparatory role in machine learning workflows. By cleansing,
transforming, and structuring data, Hive helps prepare it for feeding into machine learning
algorithms and models.
Data Governance: Hive's centralized metadata store provides a mechanism for data governance.
This metadata repository offers information about data lineage, ownership, and access control,
ensuring data quality and security within the big data architecture.
In summary, Apache Hive serves as a cornerstone for big data analytics by simplifying data
management, analysis, and integration at scale.
INTRODUCTION TO PIG
1. Pig represents Big Data as data flows.
2. Pig is a high-level platform or tool used to process large datasets.
3. It provides a high level of abstraction for processing over MapReduce.
4. It provides a high-level scripting language, known as Pig Latin, which is used to develop data analysis code. Internally, Pig Engine (a component of Apache Pig) converts these scripts into specific map and reduce tasks, but these are not visible to the programmer, in order to provide a high level of abstraction.
5. Pig Latin and Pig Engine are the two main components of the Apache Pig tool. The results of Pig are always stored in HDFS.
6. Pig Engine has two types of execution environments: a local execution environment in a single JVM (used when the dataset is small) and a distributed execution environment on a Hadoop cluster.
NEED OF PIG:
One limitation of MapReduce is that the development cycle is very long: writing the mapper and reducer, compiling and packaging the code, submitting the job, and retrieving the output is a time-consuming task. Apache Pig reduces development time by using a multi-query approach. Pig is also beneficial for programmers who do not come from a Java background.
HISTORY OF APACHE PIG:
Apache Pig was developed by Yahoo!'s researchers in 2006. At that time, the main idea behind Pig was to execute MapReduce jobs on extremely large datasets. In 2007, it moved to the Apache Software Foundation (ASF), which made it an open source project. The first version (0.1) of Pig came out in 2008, and the latest version, 0.17, was released in 2017.
FEATURES OF APACHE PIG:
1. Rich set of operators: for performing several operations, Apache Pig provides a rich set of operators such as filtering, joining, sorting, and aggregation.
2. Easy to learn, read, and write; especially for SQL programmers, Apache Pig is a boon.
3. Apache Pig is extensible, so you can create your own processing functions as user-defined functions (UDFs) written in Python, Java, or other programming languages.
4. Join operation is easy in Apache Pig.
5. Fewer lines of code.
6. Apache Pig allows splits in the pipeline.
7. By integrating with other components of the Apache Hadoop ecosystem, such as Apache
Hive, Apache Spark, and Apache ZooKeeper, Apache Pig enables users to take advantage of
these components’ capabilities while transforming data.
8. Pig's data model is richer: it supports multi-valued and nested data structures.
9. Pig can handle the analysis of both structured and unstructured data.
EXECUTION TYPES: Pig has two execution types or modes: local mode and MapReduce mode.
Local mode: In local mode, Pig runs in a single JVM and accesses the local file system. This mode is suitable only for small datasets and for trying out Pig. To run in local mode, set the execution type option to local:
$ pig -x local
grunt>
This starts Grunt, the Pig interactive shell
MapReduce mode: In MapReduce mode, Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster. The cluster may be a pseudo-distributed or fully distributed cluster. MapReduce mode is used to run Pig on large datasets and is the default mode, so no option is needed to select it:
$ pig
grunt>
Grunt:
Grunt is the interactive shell of Apache Pig.
The Grunt shell is mainly used to write and run Pig Latin scripts; it is the native shell provided by Apache Pig to execute Pig queries.
We can invoke shell and file system commands from Grunt using sh and fs.
Syntax of the sh command:
grunt> sh ls
Syntax of the fs command:
grunt> fs -ls
PIG LATIN
Pig Latin is the data flow language used by Apache Pig to analyze data in Hadoop.
It is a textual language that abstracts the programming from the Java MapReduce idiom into a higher-level notation.
Pig Latin statements are used to process the data.
Each statement is an operator that accepts a relation as input and generates another relation as output.
A statement can span multiple lines.
Each statement must end with a semicolon.
Statements may include expressions and schemas.
By default, these statements are processed using multi-query execution.
User-Defined Functions:
Apache Pig provides extensive support for user-defined functions (UDFs).
Using UDFs, we can define our own functions and use them in Pig Latin scripts.
UDF support is provided in six programming languages: Java, Jython, Python, JavaScript, Ruby, and Groovy.
Complete support is provided for writing UDFs in Java, and limited support is provided in the remaining languages.
Using Java, you can write UDFs involving all parts of the processing, such as data load/store, column transformation, and aggregation (a Java sketch follows the list of UDF types below).
Since Apache Pig itself is written in Java, UDFs written in Java work more efficiently than those written in other languages.
Types of UDFs in Java:
Filter Functions:
The filter functions are used as conditions in filter statements.
These functions accept a Pig value as input and return a Boolean value.
Eval Functions:
The Eval functions are used in FOREACH-GENERATE statements.
These functions accept a Pig value as input and return a Pig result.
Algebraic Functions:
The Algebraic functions act on inner bags in a FOREACH-GENERATE statement.
These functions are used to perform full MapReduce operations on an inner bag.
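For illustration, a minimal Eval function written in Java might look like the following sketch; the package name myudfs, the class name Upper, and the assumption that the input field is a chararray are illustrative and not part of the original notes:
package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A minimal Eval UDF that upper-cases a chararray field.
// It could be used from Pig Latin as, for example:
//   REGISTER myudfs.jar;
//   B = FOREACH A GENERATE myudfs.Upper($1);
public class Upper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;   // pass missing or null values through as null
        }
        return ((String) input.get(0)).toUpperCase();
    }
}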
DATA PROCESSING OPERATORS IN PIG
I. LOADING AND STORING DATA
a) Load function
A load function specifies how to bring data from external storage into a relation; Pig provides the LOAD statement for this purpose.
b) Store function
A store function specifies how to save the contents of a relation to external storage. After you have finished processing your data, you will want to write it out somewhere. Pig provides the store statement for this purpose.
Example:
grunt> store processed into '/data/examples/processed';
Pig will write the results of your processing into a directory processed in the directory
/data/examples
c) Dump
If you want to write the processed results to your screen, use the DUMP operator.
Example:
grunt> dump processed;
II. FILTERING DATA
Filtering data means removing the data that you are not interested in. By filtering early in the processing pipeline, you minimize the amount of data flowing through the system, which can improve efficiency.
a) FOREACH...GENERATE
The FOREACH...GENERATE operator is used to act on every row in a relation. It can be
used to remove fields or to generate new ones.
In this example, we do both:
grunt> DUMP A;
(Joe,cherry,2)
(Ali,apple,3)
(Joe,banana,2)
grunt> B = FOREACH A GENERATE $0, $2+1, 'Constant';
grunt> DUMP B;
(Joe,3,Constant)
(Ali,4,Constant)
(Joe,3,Constant)
Here the first field of A is retained, the third (numeric) field has 1 added to it, a new constant field is generated, and the second field is dropped.
III. GROUPING AND JOINING DATA
Because the large datasets that are suitable for analysis by Pig are generally not normalized, joins are used less frequently in Pig than they are in SQL. Nevertheless, Pig provides operators for joining and grouping data.
a) JOIN
Let’s look at an example of an inner join. Consider the relations A and B:
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)
grunt> DUMP B;
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)
We can join the two relations on the numerical (identity) field in each:
grunt> C = JOIN A BY $0, B BY $1;
grunt> DUMP C;
(2,Tie,Hank,2)
(2,Tie,Joe,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)
Pig also supports outer joins. For example:
grunt> C = JOIN A BY $0 LEFT OUTER, B BY $1;
grunt> DUMP C;
(1,Scarf,,)
(2,Tie,Hank,2)
(2,Tie,Joe,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)
b) COGROUP
JOIN always gives a flat structure: a set of tuples. The COGROUP statement creates a
nested set of output tuples.
grunt> D = COGROUP A BY $0, B BY $1;
grunt> DUMP D;
(0,{},{(Ali,0)})
(1,{(1,Scarf)},{})
(2,{(2,Tie)},{(Hank,2),(Joe,2)})
(3,{(3,Hat)},{(Eve,3)})
(4,{(4,Coat)},{(Hank,4)})
COGROUP generates a tuple for each unique grouping key.
The first field of each tuple is the key, and the remaining fields are bags of tuples from the
relations with a matching key.
The first bag contains the matching tuples from relation A
The second bag contains the matching tuples from relation B
If for a particular key a relation has no matching key, the bag for that relation is empty.
c) CROSS
Pig Latin includes the cross-product operator CROSS, which joins every tuple in a relation
with every tuple in a second relation (and with every tuple in further relations, if supplied). The size
of the output is the product of the size of the inputs.
grunt> I = CROSS A, B;
grunt> DUMP I;
(2,Tie,Joe,2)
(2,Tie,Hank,4)
(2,Tie,Ali,0)
(2,Tie,Eve,3)
(2,Tie,Hank,2)
(4,Coat,Joe,2)
(4,Coat,Hank,4)
(4,Coat,Ali,0)
(4,Coat,Eve,3)
(4,Coat,Hank,2)
(3,Hat,Joe,2)
(3,Hat,Hank,4)
(3,Hat,Ali,0)
(3,Hat,Eve,3)
(3,Hat,Hank,2)
(1,Scarf,Joe,2)
(1,Scarf,Hank,4)
(1,Scarf,Ali,0)
(1,Scarf,Eve,3)
(1,Scarf,Hank,2)
When dealing with large datasets, you should try to avoid this operation: computing the cross product of whole input datasets is rarely needed, and the result can be enormous.
d) GROUP
The GROUP statement groups the data in a single relation.
For example, consider the following relation A :
grunt> DUMP A;
(Joe,cherry)
(Ali,apple)
(Joe,banana)
(Eve,apple)
Let’s group by the number of characters in the second field:
grunt> B = GROUP A BY SIZE($1);
grunt> DUMP B;
(5,{(Eve,apple),(Ali,apple)})
(6,{(Joe,banana),(Joe,cherry)})
IV. SORTING DATA
Relations are unordered in Pig.
There is no guarantee which order the rows will be processed in.
If you want to have an order on the output, you can use the ORDER operator to sort a
relation by one or more fields.
Consider a relation A :
grunt> DUMP A;
(2,3)
(1,2)
(2,4)
The following example sorts A by the first field in ascending order and by the second field
in descending order:
grunt> B = ORDER A BY $0, $1 DESC;
grunt> DUMP B;
(1,2)
(2,4)
(2,3)
Any further processing on a sorted relation is not guaranteed to retain its order
V. COMBINING AND SPLITTING DATA
A) Sometimes you have several relations that you would like to combine into one. For this, the
UNION statement is used.
For example:
grunt> DUMP A;
(2,3)
(1,2)
(2,4)
grunt> DUMP B;
(z,x,8)
(w,y,1)
grunt> C = UNION A, B;
grunt> DUMP C;
(2,3)
(z,x,8)
(1,2)
(w,y,1)
(2,4)
C is the union of relations A and B , and because relations are unordered, the order of the
tuples in C is undefined.
Also, it’s possible to form the union of two relations with different schemas or with different
numbers of fields.
Here, Pig attempts to merge the schemas from the relations that UNION is operating on. In
this case, they are incompatible, so C has no schema.
B) The SPLIT operator is the opposite of UNION: it partitions a relation into two or more relations.
Consider a relation A:
grunt> DUMP A
(3,2)
(1,8)
(4,9)
(2,6)
(1,7)
(2,1)
grunt>SPLIT A INTO X IF $0<=2, Y IF $0>2;
grunt>DUMP X
(1,8)
(2,6)
(1,7)
(2,1)
grunt>DUMP Y
(3,2)
(4,9)
INTRODUCTION TO HIVE
Apache Hive is a data warehouse system that provides an SQL-like interface between the user and the Hadoop Distributed File System (HDFS). It is built on top of Hadoop as a software project that provides data query and analysis. It facilitates reading, writing, and managing large datasets that are stored in distributed storage and queried using SQL-like syntax. It is frequently used for data warehousing tasks.
Hive was initially developed by Facebook and is also used and developed by other companies such as Amazon and Netflix. It delivers standard SQL functionality for analytics.
Hive uses a language called HiveQL, which is similar to SQL, to allow users to express data
queries, transformations, and analyses in a familiar syntax. HiveQL statements are compiled into
MapReduce jobs, which are then executed on the Hadoop cluster to process the data.
Hive can be used for a variety of data processing tasks, such as data warehousing, ETL
(extract, transform, load) pipelines, and ad-hoc data analysis. It is widely used in the big data
industry, especially in companies that have adopted the Hadoop ecosystem as their primary data
processing platform.
HIVE ARCHITECTURE
The following architecture explains the flow of a query submitted to Hive:
Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++. It supports different types of clients, such as:
1. Thrift Server - a cross-language service provider platform that serves requests from all programming languages that support Thrift.
2. JDBC Driver - used to establish a connection between Hive and Java applications. The JDBC driver is provided by the class org.apache.hadoop.hive.jdbc.HiveDriver (a Java usage sketch follows this list).
3. ODBC Driver - allows applications that support the ODBC protocol to connect to Hive.
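As a minimal sketch of connecting to Hive from Java through the JDBC driver named above, the following code could be used; the connection URL, host, port, credentials, and the table name records are assumptions (newer deployments use the HiveServer2 driver org.apache.hive.jdbc.HiveDriver with a jdbc:hive2:// URL):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Load the Hive JDBC driver class mentioned in the notes above.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        // Connect to the Hive server; host, port, and database are assumptions.
        Connection con = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        // Run a simple HiveQL query; the table name 'records' is illustrative.
        ResultSet rs = stmt.executeQuery("SELECT year, temperature FROM records");
        while (rs.next()) {
            System.out.println(rs.getInt(1) + "\t" + rs.getInt(2));
        }
        rs.close();
        stmt.close();
        con.close();
    }
}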
Hive Services
The following are the services provided by Hive:-
1. Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive
queries and commands.
2. Hive Web User Interface - The Hive Web UI is just an alternative of Hive CLI. It provides a
web-based GUI for executing Hive queries and commands.
3. Hive Meta Store - It is a central repository that stores all the structural information of the various tables and partitions in the warehouse. It also includes metadata about columns and their types, the information needed to read and write data, and the corresponding HDFS files where the data is stored.
4. Hive Server - Also referred to as the Apache Thrift Server, it accepts requests from different clients and passes them to the Hive Driver.
5. Hive Driver - It receives queries from different sources such as the web UI, CLI, Thrift, and JDBC/ODBC drivers, and transfers the queries to the compiler.
6. Hive Compiler - The compiler parses the query and performs semantic analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs.
7. Hive Execution Engine - The optimizer generates the logical plan in the form of MapReduce tasks, and the execution engine then executes these tasks in the order of their dependencies.
HIVE QUERY LANGUAGE (HiveQL)
HiveQL is a mixture of SQL-92, MySQL, and Oracle’s SQL dialect. HiveQL also provides
features from new SQL standards. Some of Hive’s non-SQL standards such as multi-table inserts
and the TRANSFORM, MAP, and REDUCE clauses.
A comparison of SQL and HiveQL:
Data Types
Hive supports both primitive and complex data types. Primitives include numeric, Boolean, string, and timestamp types. The complex data types include arrays, maps, and structs. Hive's data types are described below.
PRIMITIVE DATATYPES:
BOOLEAN type for storing true and false values
There are four signed integral types: TINYINT, SMALLINT, INT, and BIGINT, which are 1-byte, 2-byte, 4-byte, and 8-byte signed integers, respectively.
Hive’s floating-point types, FLOAT and DOUBLE, correspond to Java’s float and double,
which are 32-bit and 64-bit floating-point numbers.
The DECIMAL data type is used to represent arbitrary-precision decimals. For example,
DECIMAL(5,2) stores numbers between −999.99 and 999.99
There are three Hive data types for storing text:
◦ STRING is a variable-length character string with no declared maximum length.
◦ VARCHAR types are similar to STRING except they are declared with a maximum
length
◦ CHAR types are fixed-length strings
The BINARY data type is for storing variable-length binary data.
The TIMESTAMP data type stores timestamps with nanosecond precision
The DATE data type stores a date with year, month, and day components
COMPLEX DATATYPES
The complex data types are ARRAY (an ordered collection of fields of the same type), MAP (a collection of key-value pairs), STRUCT (a set of named fields that may be of different types), and UNIONTYPE (a value that may be one of a number of defined types).
Example:
CREATE TABLE complex
(
c1 ARRAY<INT>,
c2 MAP<STRING, INT>,
c3 STRUCT<a:STRING, b:INT, c:DOUBLE>,
c4 UNIONTYPE<STRING, INT>
);
OPERATORS AND FUNCTIONS
The main types of Built-in Operators in HiveQL:
Relational Operators
Arithmetic Operators
Logical Operators
Operators on Complex types
Relational Operators: operators such as equals, not equals, less than, greater than, etc. The following table gives the relational operators and their usage in HiveQL:

Built-in Operator   Description                                                                      Operands
X = Y               TRUE if expression X is equivalent to expression Y, otherwise FALSE.             All primitive types
X != Y              TRUE if expression X is not equivalent to expression Y, otherwise FALSE.         All primitive types
X < Y               TRUE if expression X is less than expression Y, otherwise FALSE.                 All primitive types
X <= Y              TRUE if expression X is less than or equal to expression Y, otherwise FALSE.     All primitive types
X > Y               TRUE if expression X is greater than expression Y, otherwise FALSE.              All primitive types
X >= Y              TRUE if expression X is greater than or equal to expression Y, otherwise FALSE.  All primitive types
X IS NULL           TRUE if expression X evaluates to NULL, otherwise FALSE.                         All types
X IS NOT NULL       FALSE if expression X evaluates to NULL, otherwise TRUE.                         All types
X LIKE Y            TRUE if string pattern X matches Y, otherwise FALSE.                             Strings
Arithmetic Operators: the operand types are all number types for these operators; for example, ~X returns the result of the bitwise NOT of X.
Logical Operators: used for logical operations such as AND, OR, and NOT between operands; the operand types are all BOOLEAN for these operators.
Complex Type Operators: used to access elements inside complex types, for example A[n] for an array element, M[key] for a map value, and S.x for a struct field.
QUERYING DATA IN HIVE
I. SORTING AND AGGREGATING
A standard ORDER BY clause produces a totally sorted result, but it does so using a single reducer, which can be inefficient for large datasets. When a globally sorted result is not required, Hive's nonstandard SORT BY clause can be used instead: it produces a sorted file per reducer, and DISTRIBUTE BY controls which reducer each row goes to. For example, to sort the weather records by year and temperature:
hive> FROM records2
> SELECT year, temperature
> DISTRIBUTE BY year
> SORT BY year ASC, temperature DESC;
1949 111
1949 78
1950 22
1950 0
1950 -11
II. MapReduce Scripts
The TRANSFORM, MAP, and REDUCE clauses allow you to invoke an external script or program from Hive.
Example: Python script to filter out poor-quality weather records
#!/usr/bin/env python
import re
import sys
for line in sys.stdin:
    (year, temp, q) = line.strip().split()
    if (temp != "9999" and re.match("[01459]", q)):
        print "%s\t%s" % (year, temp)
If we use a nested form for the query, we can specify a map and a reduce function. Using MAP and
REDUCE keywords would have the same result
Example:
FROM (
FROM records2
MAP year, temperature, quality
USING 'is_good_quality.py'
AS year, temperature) map_output
REDUCE year, temperature
USING 'max_temperature_reduce.py'
AS year, temperature;
III. JOINS
One of the nice things about using Hive is that Hive makes performing commonly used
operations very simple rather than raw MapReduce model.
Inner joins: The simplest kind of join is the inner join, where each match in the input tables results
in a row in the output.
Consider two small demonstration tables, sales (which lists the names of people and the IDs
of the items they bought) and things (which lists the item IDs and their names):
hive> SELECT * FROM sales;
Joe 2
Hank 4
Ali 0
Eve 3
Hank 2
hive> SELECT * FROM things;
2 Tie
4 Coat
3 Hat
1 Scarf
We can perform an inner join on the two tables as follows:
hive> SELECT sales.*, things.*
> FROM sales JOIN things ON (sales.id = things.id);
Joe 2 2 Tie
Hank 4 4 Coat
Eve 3 3 Hat
Hank 2 2 Tie
Outer joins: Outer joins allow you to find non-matches in the tables being joined. There are
different types of Outer Joins:
1. Left Outer Join
2. Right Outer Join
3. Full Outer Join
Left Outer Join:
In this, the query will return a row for every row in the left table ( sales ), even if there is no
matching row in the table it is being joined to ( things ):
hive> SELECT sales.*, things.*
> FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);
Joe 2 2 Tie
Hank 4 4 Coat
Ali 0 NULL NULL
Eve 3 3 Hat
Hank 2 2 Tie
Right Outer Join
Hive also supports right outer joins, which reverses the roles of the tables relative to the left
join. In this case, all items from the things table are included, even those that weren’t purchased by
anyone (a scarf):
hive> SELECT sales.*, things.*
> FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id);
Joe 2 2 Tie
Hank 2 2 Tie
Hank 4 4 Coat
Eve 3 3 Hat
NULL NULL 1 Scarf
Full Outer Join
In full outer join, the query returns the unmatched rows from both the left table and right table:
hive> SELECT sales.*, things.*
> FROM sales FULL OUTER JOIN things ON (sales.id = things.id);
Ali 0 NULL NULL
NULL NULL 1 Scarf
Hank 2 2 Tie
Joe 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat
Semi joins: A semi join returns rows from only one of the tables being joined.
In a LEFT SEMI JOIN, only the rows of the left-hand table that have a match in the right-hand table are returned.
For example, to query the ids of the things table which are in the sales table:
hive> SELECT *
> FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);
2 Tie
4 Coat
3 Hat
Map Joins
Map join is a Hive feature that is used to speed up Hive queries. It lets a small table be loaded into memory so that the join can be performed within the mappers, without using a reduce step. If queries frequently depend on joins against small tables, using map joins speeds up query execution.
Consider the original inner join again:
SELECT sales.*, things.*
FROM sales JOIN things ON (sales.id = things.id);
If one table is small enough to fit in memory, as things is here, Hive can load it into memory to
perform the join in each of the mappers. This is called a map join.
IV. SUBQUERIES
A sub-query is a SELECT statement that is placed in another SQL statement. Hive has
limited support for sub-queries. The following example query finds the mean maximum
temperature for every year and weather station:
SELECT station, year, AVG(max_temperature)
FROM (
SELECT station, year, MAX(temperature) AS max_temperature
FROM records2
WHERE temperature != 9999 AND quality IN (0, 1, 4, 5, 9)
GROUP BY station, year
) mt
GROUP BY station, year;
The subquery is used to find the maximum temperature for each station/year combination, and the outer query then uses the AVG aggregate function to find the average of those maximum temperature readings for each station/year combination.
V. VIEWS
A view is a "virtual table" that is defined by a SELECT statement. Views can be used to present
data to users in a way that differs from the way it is actually stored on disk.
Views may also be used to restrict users' access to particular subsets of tables that they are authorized to see.
In Hive, a view is not materialized to disk when it is created; rather, the view’s SELECT
statement is executed when the statement that refers to the view is run
To create a view of maximum temperatures for each station and year:
CREATE VIEW max_temperatures (station, year, max_temperature)
AS
SELECT station, year, MAX(temperature)
FROM records
GROUP BY station, year;
With the views in place, we can now use them by running a query:
SELECT station, year, AVG(max_temperature)
FROM max_temperatures
GROUP BY station, year;
FUNDAMENTALS OF HBASE AND ZOOKEEPER
Introduction to HBase:
HBase is an open-source, sorted-map data store built on top of Hadoop.
It is column-oriented and horizontally scalable.
It is based on Google's Bigtable and consists of a set of tables that keep data in key-value format.
HBase provides APIs that enable development in practically any programming language; applications can be written in Java and accessed through Apache Avro, REST, and Thrift interfaces.
It is a part of the Hadoop ecosystem that provides random, real-time read/write access to data in the Hadoop File System.
Need of HBase:
An RDBMS becomes exponentially slower as the data grows very large.
An RDBMS expects data to be highly structured, i.e. able to fit into a well-defined schema.
Any change in the schema might require downtime.
For sparse datasets, there is too much overhead in maintaining NULL values.
Features of HBase
Horizontally scalable: capacity grows by adding nodes to the cluster, and any number of columns can be added at any time.
Automatic failover: automatic failover allows the system to switch data handling to a standby server automatically in the event of a failure.
Integration with the MapReduce framework: the commands and Java code internally use MapReduce to do their work, and HBase is built over the Hadoop Distributed File System.
It is a platform for storing and retrieving data with random access (a Java sketch follows this list).
It doesn't care about data types (you can store an integer in one row and a string in another for the same column).
It doesn't enforce relationships within your data.
It is designed to run on a cluster of computers built using commodity hardware.
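As a sketch of the random read/write access described above, the following Java code uses the HBase client API (1.x/2.x); the table name users, the column family info, and the row key are assumptions:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Write one cell: row key "user123", column family "info", column "name".
            Put put = new Put(Bytes.toBytes("user123"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ali"));
            table.put(put);

            // Random read of the same row and cell.
            Get get = new Get(Bytes.toBytes("user123"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}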
Applications of HBase:
Medical: HBase is used in the medical field for storing genome sequences and running MapReduce
on it, storing the disease history of people or an area, and many others.
Sports: HBase is used in the sports field for storing match histories for better analytics and
prediction.
Web: HBase is used to store user history and preferences for better customer targeting.
Oil and petroleum: HBase is used in the oil and petroleum industry to store exploration data for
analysis and predict probable places where oil can be found.
E-commerce: HBase is used for recording and storing logs about customer search history, and to
perform analytics and then target advertisement for better business.
HBASE VS RDBMS
S.No.  Parameter       RDBMS                                            HBase
5      Data retrieval  In RDBMS, retrieval of data is slower.           In HBase, retrieval of data is faster.
6      Rule            It follows the ACID (Atomicity, Consistency,     It follows the CAP (Consistency, Availability,
                       Isolation, and Durability) properties.           and Partition-tolerance) theorem.
7      Type of data    It can handle structured data.                   It can handle structured, unstructured, as well
                                                                        as semi-structured data.
8      Sparse data     It cannot handle sparse data.                    It can handle sparse data.
HBASE VS HDFS
HDFS: The Hadoop Distributed File System is a distributed file system designed to store data across multiple machines that are connected to each other as nodes, providing data reliability. It consists of clusters, each of which is accessed through a single NameNode installed on a separate machine that monitors and manages the cluster's file system and user access.
HBase: HBase is a top-level Apache project, written in Java, which fulfils the need to read and write data in real time. It provides a simple interface to the distributed data. It can be accessed by Apache Hive, Apache Pig, and MapReduce, and it stores its information in HDFS.
HDFS                                                   HBase
HDFS is a Java-based distributed file system.          HBase is a Hadoop database that runs on top of HDFS.
HDFS is highly fault-tolerant and cost-effective.      HBase is partially fault-tolerant and highly consistent.
HDFS is preferable for offline batch processing.       HBase is preferable for real-time processing.
Some general concepts that should be followed while designing a schema in HBase:
Row key: Each table in HBase is indexed on the row key; there are no secondary indexes on an HBase table.
Atomicity: Avoid designing a table that requires atomicity across rows; all operations on HBase rows are atomic at the row level.
Even distribution: Reads and writes should be uniformly distributed across all nodes in the cluster. Design the row key so that related entities are stored in adjacent rows, which increases read efficiency (see the sketch below).
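As a sketch of how adjacent row keys support efficient range reads, the following Java code scans one customer's rows; it assumes the HBase 2.x client API, a table named orders, and row keys of the form cust42#<orderId>:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyRangeScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("orders"))) {
            // Rows are sorted by row key, so all orders for customer "cust42"
            // (row keys "cust42#<orderId>") sit in adjacent rows and can be
            // read with a single range scan.
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("cust42#"))
                    .withStopRow(Bytes.toBytes("cust42$"));   // '$' sorts just after '#'
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}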
INTRODUCTION TO ZOOKEEPER
Writing distributed applications is hard. It’s hard primarily because of partial failure. When a
message is sent across the network between two nodes and the network fails, the sender does not know
whether the receiver got the message. It may have gotten through before the network failed, or the receiver’s
process died. The only way that the sender can find out what happened is to reconnect to the receiver and ask
it. This is partial failure: when we don’t even know if an operation failed.
ZooKeeper can't make partial failures go away, but it does give you a set of tools to build distributed applications that can safely handle partial failures.
APACHE ZOOKEEPER
Apache Zookeeper is a distributed, open-source coordination service for distributed systems.
It provides a central place for distributed applications to store data, communicate with one another,
and coordinate activities.
Zookeeper is used in distributed systems to coordinate distributed processes and services.
It provides a simple, tree-structured data model, a simple API, and a distributed protocol to ensure
data consistency and availability.
Zookeeper is designed to be highly reliable and fault-tolerant, and it can handle high levels of read
and write throughput.
SERVICES OF ZOOKEEPER
I. DATA MODEL
ZooKeeper maintains a hierarchical tree of nodes called znodes in which it stores data. A znode stores data and has an associated ACL (access control list).
ZooKeeper is designed for coordination, which typically involves small data files, so znodes hold small amounts of data.
Data access is atomic.
Znodes are referenced by paths, as in a UNIX file system; the references are not URIs.
Ephemeral znodes
A znode type is set at creation time and may not be changed later. An ephemeral znode is deleted by
ZooKeeper when the creating client’s session ends. By contrast, a persistent znode is not tied to the
client’s session and is deleted only when explicitly deleted by a client.
Watches
Watches allow clients to get a notification when a znode changes in some way.
Sequence numbers
A sequential znode is given a sequence number by ZooKeeper as a part of its name. If a znode is
created with the sequential flag set, then the value of a monotonically increasing counter is appended
to its name. If a client asks to create a sequential znode with the name /a/b-, for example, the znode
created may actually have the name /a/b-3. If, later on, another sequential znode with the name /a/b-
is created, it will be given a unique name with a larger value of the counter —for example, /a/b-5
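A minimal Java sketch of creating ephemeral and sequential znodes follows; the connection string and the paths /workers and /queue are assumptions:
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeTypesExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> { });

        // Parent znodes must already exist, so create them as persistent znodes first.
        zk.create("/workers", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        zk.create("/queue", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral znode: deleted automatically when this client's session ends.
        zk.create("/workers/worker-1", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Sequential znode: ZooKeeper appends a monotonically increasing counter,
        // so the returned path might be, for example, /queue/job-0000000003.
        String actualPath = zk.create("/queue/job-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println("Created: " + actualPath);

        // Sorting the children shows the order implied by the sequence numbers.
        List<String> children = zk.getChildren("/queue", false);
        Collections.sort(children);
        System.out.println(children);

        zk.close();
    }
}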
II. OPERATIONS
There are nine basic operations in ZooKeeper, listed in the table below (a Java sketch follows the table):
Operation Description
create Creates a znode (the parent znode must already exist)
delete Deletes a znode (the znode must not have any children)
exists Tests whether a znode exists and retrieves its metadata
getACL, setACL Gets/sets the ACL (Access Control List) for a znode
getChildren Gets a list of the children of a znode
getData, setData Gets/sets the data associated with a znode
sync Synchronizes a client’s view of a znode with ZooKeeper
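The following is a minimal Java sketch exercising several of these operations; the connection string, session timeout, the path /config, and the data values are assumptions:
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZooKeeperBasics {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble; the watcher receives session and znode notifications.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, new Watcher() {
            public void process(WatchedEvent event) {
                System.out.println("Event: " + event);
            }
        });

        // create: the parent znode "/" always exists; open ACL, persistent znode.
        zk.create("/config", "v1".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // exists and getData: read the znode back and set watches on it.
        Stat stat = zk.exists("/config", true);
        byte[] data = zk.getData("/config", true, stat);
        System.out.println(new String(data, StandardCharsets.UTF_8));

        // setData and delete: a version of -1 matches any version of the znode.
        zk.setData("/config", "v2".getBytes(StandardCharsets.UTF_8), -1);
        zk.delete("/config", -1);

        zk.close();
    }
}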
APIs
There are two core language bindings for ZooKeeper clients, one for Java and one for C; there are also
bindings for Perl, Python, and REST clients
Multi-update
There is another ZooKeeper operation, called multi that batches together multiple primitive operations into
a single unit that either succeeds or fails in its entirety. The situation where some of the primitive operations
succeed and some fail can never arise.
ACLs
A znode is created with a list of ACLs, which determine who can perform which operations on it.
ACLs depend on authentication schemes:
Digest: the client is authenticated by a username and password.
SASL: the client is authenticated using Kerberos.
IP: the client is authenticated by its IP address.
III. IMPLEMENTATION
ZooKeeper runs as a replicated ensemble and can provide service as long as a majority of the machines in the ensemble are up (e.g., 3 of 5, or 2 of 3).
Conceptually its job is simple: it ensures that every modification to the tree of znodes is replicated to a majority of the ensemble, using a protocol called Zab that runs in two phases:
1. Leader election: The machines in an ensemble go through a process of electing a distinguished
member, called the leader. The other machines are termed followers.
2. Atomic broadcast: All write requests are forwarded to the leader, which broadcasts the update
to the followers.
IV. CONSISTENCY
Every update made to the znode tree is given a globally unique id called zxid.
Sequential consistency: Updates from any particular client are applied in the order that they are sent.
Atomicity: Updates either succeed or fail.
Single system image: A client will see the same view of the system regardless of the server it connects to
Durability: Once an update has succeeded, it will persist and will not be undone
V. SESSIONS
A ZooKeeper client is configured with the list of servers in the ensemble. On start-up, it tries to
connect to one of the servers in the list. If the connection fails, it tries another server in the list, and so on,
until it either successfully connects to one of them or fails because all ZooKeeper servers are unavailable.
Once a connection has been made with a ZooKeeper server, the server creates a new session for the client. A
session has a timeout period that is decided on by the application that creates it.
States
A ZooKeeper object transitions through different states in its lifecycle; you can query its current state with getState().
States is an enum data type representing the different states.
A client using the ZooKeeper object can receive notifications of state changes via Watcher objects.
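A small Java sketch of observing these states is shown below; the connection string and timeout are assumptions:
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class SessionStateWatcher implements Watcher {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, new SessionStateWatcher());
        // getState() returns a ZooKeeper.States enum value, e.g. CONNECTING or CONNECTED.
        System.out.println("Current state: " + zk.getState());
        Thread.sleep(5000);   // give the connection events time to arrive
        System.out.println("Current state: " + zk.getState());
        zk.close();
    }

    @Override
    public void process(WatchedEvent event) {
        // Session notifications arrive here, e.g. SyncConnected, Disconnected, Expired.
        System.out.println("Keeper state: " + event.getState());
    }
}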