HBase - Overview
Since the 1970s, the RDBMS has been the standard solution for data storage and maintenance
problems. After the advent of big data, companies realized the benefit of processing it and started
opting for solutions like Hadoop.
Hadoop uses a distributed file system to store big data and MapReduce to process it. Hadoop
excels at storing and processing huge volumes of data in various formats, such as arbitrary,
semi-structured, or even unstructured data.
Limitations of Hadoop
Hadoop can perform only batch processing, and data is accessed only in a sequential
manner. That means one has to search the entire dataset even for the simplest of jobs.
Processing a huge dataset often produces another huge dataset, which must also be processed
sequentially. At this point, a new solution is needed to access any point of data in a single unit of
time (random access).
Hadoop Random Access Databases
HBase, Cassandra, CouchDB, Dynamo, and MongoDB are some of the databases that store
huge amounts of data and access the data in a random manner.
What is HBase?
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an
open-source project and is horizontally scalable.
HBase's data model is similar to Google's Bigtable and is designed to provide quick random
access to huge amounts of structured data. It leverages the fault tolerance provided by the
Hadoop Distributed File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in
the Hadoop File System.
One can store data in HDFS either directly or through HBase. Data consumers read and access
the data in HDFS randomly using HBase. HBase sits on top of the Hadoop File System and
provides read and write access.
HBase and HDFS
HDFS | HBase
HDFS is a distributed file system suitable for storing large files. | HBase is a database built on top of HDFS.
HDFS does not support fast individual record lookups. | HBase provides fast lookups for larger tables.
It provides high-latency batch processing. | It provides low-latency access to single rows from billions of records (random access).
It provides only sequential access to data. | HBase provides random access, and it stores the data in indexed HDFS files for faster lookups.
Storage Mechanism in HBase
HBase is a column-oriented database, and the tables in it are sorted by row key. The table schema
defines only column families, which hold key-value pairs. A table can have multiple column
families, and each column family can have any number of columns. Subsequent column values
are stored contiguously on disk. Each cell value of the table has a timestamp. In short, in
HBase:
Table is a collection of rows.
Row is a collection of column families.
Column family is a collection of columns.
Column is a collection of key value pairs.
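The hierarchy above can be pictured as nested maps. The following is a minimal sketch in plain Python, not the HBase API; the get_cell helper is a hypothetical name, and the row values are borrowed from the emp table used later in this tutorial.

```python
# Toy model of HBase's logical layout using plain dictionaries (not the HBase API).
# table -> row key -> column family -> column -> {timestamp: value}
table = {
    "1": {
        "personal data": {
            "name": {1417524185058: "raju"},
            "city": {1417524216501: "hyderabad"},
        },
        "professional data": {
            "salary": {1417524244109: "50000"},
        },
    },
}


def get_cell(table, row, family, column):
    """Return the value with the newest timestamp, or None if the cell is absent."""
    versions = table.get(row, {}).get(family, {}).get(column, {})
    if not versions:
        return None
    return versions[max(versions)]


print(get_cell(table, "1", "personal data", "city"))
```

Each cell is addressed by row key, column family, and column, and the timestamp selects a version of that cell.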
Given below is an example schema of a table in HBase.
Rowid | Column Family | Column Family | Column Family | Column Family
 | col1, col2, col3 | col1, col2, col3 | col1, col2, col3 | col1, col2, col3
Column Oriented and Row Oriented
Column-oriented databases store data tables as sections of columns of data, rather
than as rows of data. In short, they have column families.
Row-Oriented Database | Column-Oriented Database
It is suitable for Online Transaction Processing (OLTP). | It is suitable for Online Analytical Processing (OLAP).
Such databases are designed for a small number of rows and columns. | Column-oriented databases are designed for huge tables.
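To make the difference concrete, here is a small sketch that lays out the same records both ways. It is an in-memory illustration only, with hypothetical variable names; real databases apply the same idea to their on-disk layout.

```python
# Three sample records used only for illustration.
rows = [
    {"id": 1, "name": "raju", "salary": 50000},
    {"id": 2, "name": "ravi", "salary": 30000},
    {"id": 3, "name": "rajesh", "salary": 25000},
]

# Row-oriented layout: all fields of one record sit together.
row_store = [tuple(r.values()) for r in rows]

# Column-oriented layout: all values of one column sit together.
column_store = {key: [r[key] for r in rows] for key in rows[0]}

# An analytical query (total salary) touches a single column list.
total_salary = sum(column_store["salary"])

# Fetching one complete record is a single lookup in the row store.
second_record = row_store[1]
```

Aggregating one column reads only that column's values in the column layout, which is why it suits analytical workloads; reading or writing a whole record is cheaper in the row layout.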
The following image shows column families in a column-oriented database:
HBase and RDBMS
HBase | RDBMS
HBase is schema-less: it has no concept of a fixed column schema and defines only column families. | An RDBMS is governed by its schema, which describes the whole structure of its tables.
It is built for wide tables. HBase is horizontally scalable. | It is thin and built for small tables. It is hard to scale.
HBase has no transactions. | An RDBMS is transactional.
It has denormalized data. | It has normalized data.
It is good for semi-structured as well as structured data. | It is good for structured data.
Features of HBase
HBase is linearly scalable.
It has automatic failure support.
It provides consistent reads and writes.
It integrates with Hadoop, both as a source and a destination.
It has an easy-to-use Java API for clients.
It provides data replication across clusters.
Where to Use HBase
Apache HBase is used to have random, real-time read/write access to Big Data.
It hosts very large tables on top of clusters of commodity hardware.
Apache HBase is a non-relational database modeled after Google's Bigtable. Just as Bigtable
works on top of the Google File System, Apache HBase works on top of Hadoop and HDFS.
Applications of HBase
It is used whenever there is a need for write-heavy applications.
HBase is used whenever we need to provide fast random access to available data.
Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
HBase History
Year Event
Nov 2006 Google released the paper on BigTable.
Feb 2007 Initial HBase prototype was created as a Hadoop contribution.
Oct 2007 The first usable HBase along with Hadoop 0.15.0 was released.
Jan 2008 HBase became a subproject of Hadoop.
Oct 2008 HBase 0.18.1 was released.
Jan 2009 HBase 0.19.0 was released.
Sept 2009 HBase 0.20.0 was released.
May 2010 HBase became an Apache top-level project.
Architecture of HBase
HBase architecture has three main components: the HMaster (master server), region servers, and ZooKeeper.
Master Server
The master server -
Assigns regions to the region servers and takes the help of Apache ZooKeeper for this
task.
Handles load balancing of the regions across region servers. It unloads the busy servers
and shifts the regions to less occupied servers.
Maintains the state of the cluster by negotiating the load balancing.
Is responsible for schema changes and other metadata operations such as creation of
tables and column families.
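The load-balancing duty described above can be sketched as follows. This is a deliberately naive illustration, not HBase's actual balancer: the balance function, the server names, and the region names are all hypothetical, and at least one server is assumed.

```python
def balance(assignment):
    """Move regions from the busiest server to the least busy one until
    no two servers differ by more than one region.

    Returns the new assignment and the list of (region, source, target) moves.
    Assumes the assignment contains at least one server.
    """
    # Work on a copy so the caller's assignment is left untouched.
    servers = {name: list(regions) for name, regions in assignment.items()}
    moves = []
    while True:
        busiest = max(servers, key=lambda s: len(servers[s]))
        idlest = min(servers, key=lambda s: len(servers[s]))
        if len(servers[busiest]) - len(servers[idlest]) <= 1:
            return servers, moves
        region = servers[busiest].pop()
        servers[idlest].append(region)
        moves.append((region, busiest, idlest))


assignment = {"rs1": ["r1", "r2", "r3", "r4"], "rs2": ["r5"], "rs3": []}
balanced, moves = balance(assignment)
```

Here two regions leave the overloaded rs1, after which no server holds more than two regions.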
Regions
Regions are the pieces that a table is split into; they are spread across the region servers.
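Because tables are sorted by row key, each region can be thought of as a contiguous range of row keys. The sketch below shows how a row key could be mapped to its region with a binary search; the split points and server names are assumptions made up for the example.

```python
import bisect

# Hypothetical split points: region i covers row keys from region_start_keys[i]
# up to (but not including) region_start_keys[i + 1]. The first region starts
# at the empty key, so every row key falls into some region.
region_start_keys = ["", "g", "n", "t"]
region_servers = ["rs1", "rs2", "rs3", "rs1"]  # region i is hosted by region_servers[i]


def locate(row_key):
    """Return (region index, hosting server) for the given row key."""
    index = bisect.bisect_right(region_start_keys, row_key) - 1
    return index, region_servers[index]


print(locate("raju"))
```

A row key equal to a split point belongs to the region that starts at that key.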
Region server
The region servers host regions and -
Communicate with the client and handle data-related operations.
Handle read and write requests for all the regions under them.
Decide the size of the regions by following the region size thresholds.
When we take a deeper look into the region server, it contains regions and stores as shown below:
The store contains a MemStore and HFiles. The MemStore works like a cache: anything
written to HBase is stored there first. Later, the MemStore is flushed, and its data is saved in
HFiles as blocks.
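The write path just described can be sketched with a toy store. This is a simplification under stated assumptions: the Store class is hypothetical, the flush is triggered by an entry count (HBase uses size-based thresholds), and the "HFiles" are plain sorted lists kept in memory.

```python
class Store:
    """Toy store: writes go to an in-memory map and are flushed to immutable files."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # number of entries that triggers a flush
        self.memstore = {}
        self.hfiles = []  # each entry is a sorted list of (key, value); oldest first

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the memstore contents out in sorted order, then start a fresh memstore.
        if self.memstore:
            self.hfiles.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        # Newest data first: the memstore, then the files from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            for k, v in hfile:
                if k == key:
                    return v
        return None


store = Store(flush_threshold=2)
store.put("row1", "a")
store.put("row2", "b")  # reaches the threshold, so the memstore is flushed
store.put("row1", "c")  # newer value stays in the memstore for now
```

A read checks the memstore before the flushed files, so the most recent write wins even when an older value for the same key has already been flushed.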
ZooKeeper
ZooKeeper is an open-source project that provides services such as maintaining configuration
information, naming, and distributed synchronization.
ZooKeeper has ephemeral nodes representing the different region servers. Master servers use
these nodes to discover available servers.
In addition to availability, the nodes are also used to track server failures or network
partitions.
Clients use ZooKeeper to locate region servers before communicating with them.
In pseudo-distributed and standalone modes, HBase itself manages ZooKeeper.
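The role of ephemeral nodes can be illustrated with a toy coordinator. This sketch does not use the ZooKeeper client library; the ToyCoordinator class, the session names, and the node paths are assumptions chosen for the example.

```python
class ToyCoordinator:
    """Toy stand-in for ZooKeeper: an ephemeral node lives only as long as
    the session that created it."""

    def __init__(self):
        self.ephemeral = {}  # node path -> owning session id

    def register(self, session_id, path):
        self.ephemeral[path] = session_id

    def expire_session(self, session_id):
        # When a session ends (server crash, network partition), its nodes vanish.
        self.ephemeral = {p: s for p, s in self.ephemeral.items() if s != session_id}

    def children(self, prefix):
        return sorted(p for p in self.ephemeral if p.startswith(prefix + "/"))


zk = ToyCoordinator()
zk.register("session-rs1", "/hbase/rs/rs1")
zk.register("session-rs2", "/hbase/rs/rs2")
live_before = zk.children("/hbase/rs")

zk.expire_session("session-rs2")  # rs2 stops responding
live_after = zk.children("/hbase/rs")
```

A master that lists the children of the region server path sees a failed server disappear without that server having to report its own failure.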
HBase - General Commands
The general commands in HBase are status, version, table_help, and whoami. This chapter
explains these commands.
status
This command returns the status of the system including the details of the servers running on the
system. Its syntax is as follows:
hbase(main):009:0> status
If you execute this command, it returns the following output.
hbase(main):009:0> status
3 servers, 0 dead, 1.3333 average load
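The summary line can be read as follows, assuming that the average load is the number of regions divided by the number of live region servers. The sketch below rebuilds such a line from hypothetical per-server region counts; format_status is an illustrative helper, not part of HBase.

```python
def format_status(regions_per_server, dead_servers=0):
    """Build a status-style summary from a map of live server -> region count."""
    live = len(regions_per_server)
    total_regions = sum(regions_per_server.values())
    average_load = total_regions / live if live else 0.0
    return f"{live} servers, {dead_servers} dead, {average_load:.4f} average load"


# Hypothetical cluster: four regions spread over three live servers.
summary = format_status({"rs1": 2, "rs2": 1, "rs3": 1})
print(summary)
```

Under this assumption, an average load of 1.3333 on three servers corresponds to four regions in total.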
version
This command returns the version of HBase used in your system. Its syntax is as follows:
hbase(main):010:0> version
If you execute this command, it returns the following output.
hbase(main):009:0> version
0.98.8-hadoop2, r6cfc8d064754251365e070a10a82eb169956d5fe, Fri Nov 14
18:26:29 PST 2014
table_help
This command explains what the table-reference commands are and how to use them. Given below
is the syntax of this command.
hbase(main):002:0> table_help
When you use this command, it shows help topics for table-related commands. Given below is
the partial output of this command.
hbase(main):002:0> table_help
Help for table-reference commands.
You can either create a table via 'create' and then manipulate the table
via commands like 'put', 'get', etc.
See the standard help information for how to use each of these commands.
However, as of 0.96, you can also get a reference to a table, on which
you can invoke commands.
For instance, you can get create a table and keep around a reference to
it via:
hbase> t = create 't', 'cf'…...
whoami
This command returns the user details of HBase. If you execute this command, it returns the
current HBase user as shown below.
hbase(main):008:0> whoami
hadoop (auth:SIMPLE)
groups: hadoop
Some of the Commands:
Creating a Table using HBase Shell
You can create a table using the create command. Here you must specify the table name and the
column family name. The syntax to create a table in HBase shell is shown below.
create '<table name>','<column family>'
Example
Given below is a sample schema of a table named emp. It has two column families: “personal
data” and “professional data”.
Row key | personal data | professional data
You can create this table in HBase shell as shown below.
hbase(main):002:0> create 'emp', 'personal data', 'professional data'
And it will give you the following output.
0 row(s) in 1.1300 seconds
=> Hbase::Table - emp
Verification
You can verify whether the table is created using the list command as shown below. Here you
can observe the created emp table.
hbase(main):002:0> list
TABLE
emp
1 row(s) in 0.0340 seconds
Listing a Table using HBase Shell
list is the command that is used to list all the tables in HBase. Given below is the syntax of the
list command.
hbase(main):001:0> list
When you type this command and execute in HBase prompt, it will display the list of all the
tables in HBase as shown below.
hbase(main):001:0> list
TABLE
emp
Dropping a Table using HBase Shell
Using the drop command, you can delete a table. Before dropping a table, you have to disable it.
hbase(main):018:0> disable 'emp'
0 row(s) in 1.4580 seconds
hbase(main):019:0> drop 'emp'
0 row(s) in 0.3060 seconds
Verify whether the table is deleted using the exists command.
hbase(main):020:0> exists 'emp'
Table emp does not exist
0 row(s) in 0.0730 seconds
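The table lifecycle shown by the create, list, disable, drop, and exists commands can be summarized in a small model. This is a sketch in plain Python, not the HBase client API; the ToyCatalog class and its method names are hypothetical. It captures the rule that a table must be disabled before it can be dropped.

```python
class ToyCatalog:
    """Toy model of table creation, listing, disabling, and dropping."""

    def __init__(self):
        self.tables = {}  # table name -> {"families": [...], "enabled": bool}

    def create(self, name, *families):
        if not families:
            raise ValueError("at least one column family is required")
        if name in self.tables:
            raise ValueError(f"table {name} already exists")
        self.tables[name] = {"families": list(families), "enabled": True}

    def list_tables(self):
        return sorted(self.tables)

    def disable(self, name):
        self.tables[name]["enabled"] = False

    def drop(self, name):
        # Mirrors the rule in the text: an enabled table cannot be dropped.
        if self.tables[name]["enabled"]:
            raise RuntimeError(f"table {name} is enabled; disable it first")
        del self.tables[name]

    def exists(self, name):
        return name in self.tables


catalog = ToyCatalog()
catalog.create("emp", "personal data", "professional data")
catalog.disable("emp")
catalog.drop("emp")
```

Calling drop on a table that is still enabled raises an error in this model, which is why the shell session above runs disable first.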
drop_all
This command is used to drop the tables matching the regex given in the command. Its syntax
is as follows:
hbase> drop_all 't.*'
Note: Before dropping a table, you must disable it.
Example
Assume there are tables named raja, rajani, rajendra, rajesh, and raju.
hbase(main):017:0> list
TABLE
raja
rajani
rajendra
rajesh
raju
5 row(s) in 0.0270 seconds
All these tables start with the letters raj. First of all, let us disable all these tables using the
disable_all command as shown below.
hbase(main):002:0> disable_all 'raj.*'
raja
rajani
rajendra
rajesh
raju
Disable the above 5 tables (y/n)?
y
5 tables successfully disabled
Now you can delete all of them using the drop_all command as given below.
hbase(main):018:0> drop_all 'raj.*'
raja
rajani
rajendra
rajesh
raju
Drop the above 5 tables (y/n)?
y
5 tables successfully dropped
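To preview which tables a pattern such as 'raj.*' would select, the matching can be sketched with Python's re module. This sketch assumes the pattern must match the whole table name; matching_tables is a hypothetical helper, and the emp table is added to the list only to show a non-match.

```python
import re


def matching_tables(table_names, pattern):
    """Return the table names that the regular expression matches in full."""
    regex = re.compile(pattern)
    return [name for name in table_names if regex.fullmatch(name)]


tables = ["emp", "raja", "rajani", "rajendra", "rajesh", "raju"]
selected = matching_tables(tables, "raj.*")
print(selected)
```

Checking the selection this way before running disable_all or drop_all helps avoid dropping tables that match the pattern unintentionally.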
Inserting Data using HBase Shell
This chapter demonstrates how to create data in an HBase table. To create data in an HBase
table, the following commands and methods are used:
put command,
add() method of Put class, and
put() method of HTable class.
As an example, we are going to create the following table in HBase.
Using put command, you can insert rows into a table. Its syntax is as follows:
put '<table name>','row1','<colfamily:colname>','<value>'
Inserting the First Row
Let us insert the first row values into the emp table as shown below.
hbase(main):005:0> put 'emp','1','personal data:name','raju'
0 row(s) in 0.6600 seconds
hbase(main):006:0> put 'emp','1','personal data:city','hyderabad'
0 row(s) in 0.0410 seconds
hbase(main):007:0> put 'emp','1','professional data:designation','manager'
0 row(s) in 0.0240 seconds
hbase(main):008:0> put 'emp','1','professional data:salary','50000'
0 row(s) in 0.0240 seconds
Insert the remaining rows using the put command in the same way. After inserting all the rows,
scanning the table gives the following output.
hbase(main):022:0> scan 'emp'
ROW COLUMN+CELL
1 column=personal data:city, timestamp=1417524216501, value=hyderabad
1 column=personal data:name, timestamp=1417524185058, value=raju
1 column=professional data:designation, timestamp=1417524232601, value=manager
1 column=professional data:salary, timestamp=1417524244109, value=50000
2 column=personal data:city, timestamp=1417524574905, value=chennai
2 column=personal data:name, timestamp=1417524556125, value=ravi
2 column=professional data:designation, timestamp=1417524592204, value=sr:engg
2 column=professional data:salary, timestamp=1417524604221, value=30000
3 column=personal data:city, timestamp=1417524681780, value=delhi
3 column=personal data:name, timestamp=1417524672067, value=rajesh
3 column=professional data:designation, timestamp=1417524693187, value=jr:engg
3 column=professional data:salary, timestamp=1417524702514, value=25000
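The scan output above lists cells ordered by row key and then by column, regardless of the order in which they were inserted. The following sketch imitates that behaviour with a dictionary; the put and scan functions are toy stand-ins for the shell commands, not the HBase API, and timestamps are left out for brevity.

```python
cells = {}  # (row, column family, column) -> value


def put(row, column, value):
    """Store a value under 'family:column' for the given row."""
    family, qualifier = column.split(":", 1)
    cells[(row, family, qualifier)] = value


def scan():
    """Return all cells sorted by row key, then column family, then column."""
    return [
        (row, f"{family}:{qualifier}", cells[(row, family, qualifier)])
        for row, family, qualifier in sorted(cells)
    ]


# Insert the cells of row 1 in a deliberately shuffled order.
put("1", "professional data:salary", "50000")
put("1", "personal data:name", "raju")
put("1", "personal data:city", "hyderabad")
put("1", "professional data:designation", "manager")

for row, column, value in scan():
    print(row, f"column={column}, value={value}")
```

The city cell is printed first even though it was inserted third, because the scan follows the sorted key order.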
Updating Data using HBase Shell
You can update an existing cell value using the put command. To do so, just follow the same
syntax and mention your new value as shown below.
put '<table name>','<row>','<column family:column name>','<new value>'
The newly given value replaces the existing value, updating the row.
Example
Suppose there is a table in HBase called emp with the following data.
hbase(main):003:0> scan 'emp'
ROW COLUMN + CELL
row1 column = personal:name, timestamp = 1418051555, value = raju
row1 column = personal:city, timestamp = 1418275907, value = Hyderabad
row1 column = professional:designation, timestamp = 14180555,value = manager
row1 column = professional:salary, timestamp = 1418035791555,value = 50000
1 row(s) in 0.0100 seconds
The following command will update the city value of the employee named ‘Raju’ to Delhi.
hbase(main):002:0> put 'emp','row1','personal:city','Delhi'
0 row(s) in 0.0400 seconds
The updated table looks as follows where you can observe the city of Raju has been changed to
‘Delhi’.
hbase(main):003:0> scan 'emp'
ROW COLUMN + CELL
row1 column = personal:name, timestamp = 1418035791555, value = raju
row1 column = personal:city, timestamp = 1418274645907, value = Delhi
row1 column = professional:designation, timestamp = 141857555,value = manager
row1 column = professional:salary, timestamp = 1418039555, value = 50000
1 row(s) in 0.0100 seconds
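One way to picture why put also works as an update: each put writes a new version of the cell with a newer timestamp, and a read returns the newest version by default. The sketch below models this with hypothetical put and get functions and small made-up timestamps.

```python
versions = {}  # (row, column) -> list of (timestamp, value)


def put(row, column, value, timestamp):
    """Add a new version of the cell; earlier versions are kept in the history."""
    versions.setdefault((row, column), []).append((timestamp, value))


def get(row, column):
    """Return the value with the newest timestamp, or None if the cell is absent."""
    history = versions.get((row, column))
    if not history:
        return None
    newest = max(history, key=lambda entry: entry[0])
    return newest[1]


put("row1", "personal:city", "Hyderabad", timestamp=1)
put("row1", "personal:city", "Delhi", timestamp=2)  # the "update"
current_city = get("row1", "personal:city")
```

After the second put, a read of the city cell yields Delhi, which matches the changed timestamp on that cell in the scan output above.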
Deleting a Specific Cell in a Table
Using the delete command, you can delete a specific cell in a table. The syntax of delete
command is as follows:
delete '<table name>', '<row>', '<column name>', <time stamp>
Example
Here is an example to delete a specific cell. Here we are deleting the city.
hbase(main):006:0> delete 'emp', '1', 'personal data:city', 1417521848375
0 row(s) in 0.0060 seconds
Deleting All Cells in a Row
Using the deleteall command, you can delete all the cells in a row. Given below is the syntax
of the deleteall command.
deleteall '<table name>', '<row>'
Example
Here is an example of the deleteall command, where we are deleting all the cells of row 1 of the
emp table.
hbase(main):007:0> deleteall 'emp','1'
0 row(s) in 0.0240 seconds
Verify the table using the scan command. A snapshot of the table after deleting the row is given
below.
hbase(main):022:0> scan 'emp'
ROW COLUMN + CELL
2 column = personal data:city, timestamp = 1417524574905, value = chennai
2 column = personal data:name, timestamp = 1417524556125, value = ravi
2 column = professional data:designation, timestamp = 1417524592204, value = sr:engg
2 column = professional data:salary, timestamp = 1417524604221, value = 30000
3 column = personal data:city, timestamp = 1417524681780, value = delhi
3 column = personal data:name, timestamp = 1417524672067, value = rajesh
3 column = professional data:designation, timestamp = 1417524693187, value = jr:engg
3 column = professional data:salary, timestamp = 1417524702514, value = 25000
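The difference between delete (one cell) and deleteall (a whole row) can be sketched on a dictionary. This is a simplified model with hypothetical function names: it removes data immediately and ignores timestamps, whereas HBase records delete markers and cleans up the data later.

```python
def delete(table, row, column):
    """Remove a single cell; drop the row entry if no cells remain."""
    table.get(row, {}).pop(column, None)
    if row in table and not table[row]:
        del table[row]


def deleteall(table, row):
    """Remove every cell of the given row."""
    table.pop(row, None)


emp = {
    "1": {"personal data:city": "hyderabad", "personal data:name": "raju"},
    "2": {"personal data:city": "chennai", "personal data:name": "ravi"},
    "3": {"personal data:city": "delhi", "personal data:name": "rajesh"},
}

deleteall(emp, "1")                    # row 1 disappears entirely
delete(emp, "2", "personal data:city")  # only one cell of row 2 is removed
remaining_rows = sorted(emp)
```

After these calls, rows 2 and 3 remain, and row 2 keeps only its name cell.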