HBase
HBase: Part of Hadoop’s Ecosystem
HBase is built on top of HDFS
HBase files are internally stored in HDFS
HBase: Overview
• HBase is a distributed column-oriented data store built on top of HDFS
• HBase is an Apache open-source project whose goal is to provide storage for Hadoop distributed computing
• Data is logically organized into tables, rows and columns
Example Schema of Table in HBase
• Table is a collection of rows.
• Row is a collection of column families.
• Column family is a collection of columns.
• Column is a collection of key value pairs.
HBase vs. HDFS
• Both are distributed systems that scale to hundreds or thousands of nodes
• HDFS is good for batch processing (scans over big files)
  • Not good for record lookup
  • Not good for incremental addition of small batches
  • Not good for updates
  • It provides only sequential access to data
HBase vs. HDFS (Cont’d)
• HBase is designed to efficiently address the above points
  • Fast record lookup
  • Support for record-level insertion
  • Support for updates (not in place; an update is written as a new version)
• HBase keeps data sorted by row key in indexed files on HDFS (HFiles), which gives random access and faster lookups.
HBase vs. HDFS (Cont’d)
If an application needs neither random reads nor random writes, stick to HDFS.
HBase Data Model
• A column-oriented database stores data in cells grouped into columns, not rows
• HBase is based on Google’s Bigtable model
• Key-Value pairs
1. Table & 2. Row
• An HBase table consists of multiple rows. Each row has a row key and one or more columns with values assigned to them. HBase stores rows sorted alphabetically by row key.
• The main goal is to store data so that related rows are close together. A common row-key pattern is the domain of a website. For example, if our row keys are domains, we should store them in reverse, i.e. org.apache.www, org.apache.mail or org.apache.jira. This way, all Apache domains are close to each other in the HBase table.
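The reversed-domain pattern can be sketched in a few lines of plain Java. This is only an illustration: the class and method names are made up for this example, and a TreeSet stands in for HBase's sorted row-key order (HBase compares raw bytes, which matches string order for lowercase ASCII host names).

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

final class ReversedDomainKeys {
    // Reverse the dot-separated labels of a host name: www.apache.org -> org.apache.www
    static String toRowKey(String host) {
        String[] labels = host.split("\\.");
        StringBuilder key = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            key.append(labels[i]);
            if (i > 0) {
                key.append('.');
            }
        }
        return key.toString();
    }

    // Build the sorted set of row keys for a list of hosts.
    static TreeSet<String> sortedRowKeys(List<String> hosts) {
        TreeSet<String> keys = new TreeSet<>();
        for (String host : hosts) {
            keys.add(toRowKey(host));
        }
        return keys;
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList(
                "www.apache.org", "www.cnn.com", "mail.apache.org", "jira.apache.org");
        // prints [com.cnn.www, org.apache.jira, org.apache.mail, org.apache.www]
        System.out.println(sortedRowKeys(hosts));
    }
}
```

With the keys reversed, the three apache.org hosts end up next to each other in sort order, so a scan over the "org.apache." prefix would touch one contiguous range.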
3. Column
•An HBase column consists of a column family and a column qualifier separated by the : (colon) character.
• a. Column family: Column families physically house a set of columns and their values. Each column family has a set of storage properties, such as how its data is compressed, whether its values should be cached, how its row keys are encoded, and more. Each row in an HBase table has the same column families.
• b. Column qualifier: A column qualifier is added to a column family to provide an index for a given piece of data. Example: if the column family is content, the column qualifier can be content:html or content:pdf. Column families are fixed at table creation, but column qualifiers are mutable and may vary widely between rows.
4. The cell
• A cell is a combination of a row, a column family, and a column qualifier. It contains a value and a timestamp that represents the version of the value.
5. Timestamp
• A timestamp is an identifier for a given version of a value and is written alongside each value. By default, the timestamp represents the time on the RegionServer when the data was written. However, we can specify a different timestamp value when inserting data into a cell.
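One way to picture the five concepts above is as nested sorted maps: row key, then family:qualifier, then timestamp, then value. The sketch below is a plain-Java model of that picture, not HBase client code; the class name and the string-typed values are assumptions made for readability (HBase itself stores byte arrays).

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

final class LogicalTable {
    // row key -> "family:qualifier" -> timestamp (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    // Store one version of a cell under an explicit timestamp.
    void put(String row, String family, String qualifier, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<String, NavigableMap<Long, String>>())
            .computeIfAbsent(family + ":" + qualifier,
                             c -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    // Return the newest version of a cell, or null if the cell is empty.
    String get(String row, String family, String qualifier) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null) {
            return null;
        }
        NavigableMap<Long, String> versions = columns.get(family + ":" + qualifier);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        LogicalTable table = new LogicalTable();
        table.put("org.apache.www", "content", "html", 1L, "<html>v1</html>");
        table.put("org.apache.www", "content", "html", 2L, "<html>v2</html>");
        System.out.println(table.get("org.apache.www", "content", "html")); // newest version
        System.out.println(table.get("org.apache.www", "content", "pdf"));  // empty cell
    }
}
```

Because the innermost map is ordered newest first, a read returns the latest version without looking at older ones, and a cell that was never written simply has no entry.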
HBase: Keys and Column Families
Each record is divided into Column Families
Each row has a Key
Each column family consists of one or more Columns
Column family named “anchor”
Column family named “Contents”
Column named “apache.com”
• Key
  • Byte array
  • Serves as the primary key for the table
  • Indexed for fast lookup
• Column Family
  • Has a name (string)
  • Contains one or more related columns
• Column
  • Belongs to one column family
  • Included inside the row
  • familyName:columnName
Version number for each row
• Version Number
  • Unique within each key
  • By default, the system’s timestamp
  • Data type is Long
• Value (Cell)
  • Byte array
Notes on Data Model
• HBase schema consists of several Tables
• Each table consists of a set of Column Families
• Columns are not part of the schema
• HBase has Dynamic Columns
• Because column names are encoded inside the cells
• Different cells can have different columns
“Roles” column family has different columns in different cells
Notes on Data Model (Cont’d)
• The version number can be user-supplied
  • Versions do not even have to be inserted in increasing order
  • Version numbers are unique within each key
• Table can be very sparse
  • Many cells are empty
• Keys are indexed as the primary key
Has two columns [cnnsi.com & my.look.ca]
HBase Physical Model
• Each column family is stored in its own files (called HFiles)
• Key & version numbers are replicated with each column family
• Empty cells are not stored
HBase maintains a multi-level index on values: <key, column family, column name, timestamp>
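The <key, column family, column name, timestamp> index implies a sort order for stored cells. A minimal sketch of that order, assuming row, family and column name ascending and timestamp descending so that the newest version comes first; the CellKey class and the numeric timestamps are invented for this example and are not HBase types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class CellKeyOrder {
    // Hypothetical stand-in for the coordinates of one stored cell version.
    static final class CellKey {
        final String row;
        final String family;
        final String qualifier;
        final long timestamp;

        CellKey(String row, String family, String qualifier, long timestamp) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }

        @Override
        public String toString() {
            return row + "/" + family + ":" + qualifier + "/" + timestamp;
        }
    }

    // Row, family and qualifier ascending; timestamp descending (newest first).
    static final Comparator<CellKey> ORDER =
            Comparator.comparing((CellKey k) -> k.row)
                    .thenComparing((CellKey k) -> k.family)
                    .thenComparing((CellKey k) -> k.qualifier)
                    .thenComparing(Comparator.comparingLong((CellKey k) -> k.timestamp).reversed());

    // Return a sorted copy, leaving the input list untouched.
    static List<CellKey> sorted(List<CellKey> keys) {
        List<CellKey> copy = new ArrayList<>(keys);
        copy.sort(ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<CellKey> keys = Arrays.asList(
                new CellKey("com.cnn.www", "anchor", "my.look.ca", 8L),
                new CellKey("com.cnn.www", "anchor", "cnnsi.com", 5L),
                new CellKey("com.cnn.www", "anchor", "cnnsi.com", 9L),
                new CellKey("com.apache.www", "anchor", "apache.com", 10L));
        for (CellKey key : sorted(keys)) {
            System.out.println(key);
        }
    }
}
```

Only cells that exist appear in the sorted list, which is also why empty cells cost no storage.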
Example
Column Families
HBase Regions
• Each table is partitioned horizontally, by row-key range, into regions
• Regions are the counterpart of HDFS blocks
Each will be one region
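Horizontal partitioning by row-key range means a row is found by looking up which range its key falls into. A small sketch under stated assumptions: each region is identified by its start key, the first region starts at the empty key, and the class name, region names and split points are all hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

final class RegionLookupSketch {
    // start row key (inclusive) -> region name
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    void addRegion(String startKey, String regionName) {
        regionsByStartKey.put(startKey, regionName);
    }

    // The region holding a row is the one with the greatest start key <= the row key.
    String locate(String rowKey) {
        Map.Entry<String, String> entry = regionsByStartKey.floorEntry(rowKey);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        RegionLookupSketch regions = new RegionLookupSketch();
        regions.addRegion("", "region-1");      // keys before "com.m"
        regions.addRegion("com.m", "region-2"); // keys from "com.m" up to "org"
        regions.addRegion("org", "region-3");   // keys from "org" onwards

        System.out.println(regions.locate("com.cnn.www"));     // region-1
        System.out.println(regions.locate("com.nytimes.www")); // region-2
        System.out.println(regions.locate("org.apache.www"));  // region-3
    }
}
```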
HBase Architecture
Three Major Components
• The HBaseMaster
  • One master
• The HRegionServer
  • Many region servers
• The HBase client
HBase Components
• Region
  • A subset of a table’s rows, like horizontal range partitioning
  • Automatically done
• RegionServer (many slaves)
  • Manages data regions
  • Serves data for reads and writes (using a log)
• Master
  • Responsible for coordinating the slaves
  • Assigns regions, detects failures
  • Admin functions
https://www.analyticsvidhya.com/blog/2022/10/a-brief-introduction-to-apache-hbase-and-its-architecture/
Big Picture
ZooKeeper
• HBase depends on ZooKeeper
• By default HBase manages the ZooKeeper instance
  • E.g., starts and stops ZooKeeper
• HMaster and HRegionServers register themselves with ZooKeeper
Creating a Table
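import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
// Classic admin API (HBase 1.x and earlier); newer clients obtain an Admin from a Connection.
Configuration config = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(config);
// Only column families are declared up front; family names must not contain ':'.
HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("MyTable"));
desc.addFamily(new HColumnDescriptor("columnFamily1"));
desc.addFamily(new HColumnDescriptor("columnFamily2"));
admin.createTable(desc);
admin.close();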
Operations On Regions: Get()
• Given a key → return corresponding record
• For each value return the highest version
• Can control the number of versions you want
Operations On Regions: Scan()
Get(): Select value from table where key=‘com.apache.www’ AND label=‘anchor:apache.com’

Row key            Time Stamp   Column “anchor:”
“com.apache.www”   t12
                   t11
                   t10          “anchor:apache.com” = “APACHE”
“com.cnn.www”      t9           “anchor:cnnsi.com” = “CNN”
                   t8           “anchor:my.look.ca” = “CNN.com”
                   t6
                   t5
                   t3
Scan(): Select value from table where anchor=‘cnnsi.com’

Row key            Time Stamp   Column “anchor:”
“com.apache.www”   t12
                   t11
                   t10          “anchor:apache.com” = “APACHE”
“com.cnn.www”      t9           “anchor:cnnsi.com” = “CNN”
                   t8           “anchor:my.look.ca” = “CNN.com”
                   t6
                   t5
                   t3
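The difference between the two queries can be sketched over an in-memory copy of the example table. This is a plain-Java illustration, not the HBase client API: the class name is invented, versions are left out so each column holds only its latest value, and the rows are the ones shown in the table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class GetVersusScan {
    // row key -> column ("family:qualifier") -> latest value
    static TreeMap<String, TreeMap<String, String>> exampleTable() {
        TreeMap<String, TreeMap<String, String>> table = new TreeMap<>();
        table.computeIfAbsent("com.apache.www", r -> new TreeMap<String, String>())
             .put("anchor:apache.com", "APACHE");
        table.computeIfAbsent("com.cnn.www", r -> new TreeMap<String, String>())
             .put("anchor:cnnsi.com", "CNN");
        table.computeIfAbsent("com.cnn.www", r -> new TreeMap<String, String>())
             .put("anchor:my.look.ca", "CNN.com");
        return table;
    }

    // Get: go straight to one row by its key, then pick the column.
    static String get(TreeMap<String, TreeMap<String, String>> table, String rowKey, String column) {
        TreeMap<String, String> row = table.get(rowKey);
        return row == null ? null : row.get(column);
    }

    // Scan: walk the rows in key order and keep those that have a value in the column.
    static List<String> scan(TreeMap<String, TreeMap<String, String>> table, String column) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, TreeMap<String, String>> row : table.entrySet()) {
            String value = row.getValue().get(column);
            if (value != null) {
                hits.add(row.getKey() + " -> " + value);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeMap<String, TreeMap<String, String>> table = exampleTable();
        System.out.println(get(table, "com.apache.www", "anchor:apache.com")); // APACHE
        System.out.println(scan(table, "anchor:cnnsi.com"));                   // [com.cnn.www -> CNN]
    }
}
```

The get touches a single row because the key is known, while the scan has to visit every row because the condition is on a column rather than on the key.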
Operations On Regions: Put()
• Insert a new record (with a new key), Or
• Insert a record for an existing key
Implicit version number (timestamp)
Explicit version number
Operations On Regions: Delete()
• Marks table cells as deleted
• Multiple levels
  • Can mark an entire column family as deleted
  • Can mark all column families of a given row as deleted
• All operations are logged by the RegionServers
  • The log is flushed periodically
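"Marking as deleted" can be illustrated for a single cell: the delete writes a marker instead of erasing stored versions, and reads skip whatever the marker hides. The sketch below is a simplified plain-Java model; the class name is invented, and it assumes a marker written at timestamp T hides every version with a timestamp up to and including T.

```java
import java.util.Map;
import java.util.TreeMap;

final class TombstoneSketch {
    // timestamp -> value for one cell
    private final TreeMap<Long, String> versions = new TreeMap<>();
    // Delete marker: versions with a timestamp <= this are hidden from reads.
    private long deletedUpTo = Long.MIN_VALUE;

    void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    // Deleting only records a marker; the stored versions are not touched.
    void delete(long timestamp) {
        deletedUpTo = Math.max(deletedUpTo, timestamp);
    }

    // Return the newest version unless the marker hides it.
    String get() {
        Map.Entry<Long, String> newest = versions.lastEntry();
        if (newest == null || newest.getKey() <= deletedUpTo) {
            return null;
        }
        return newest.getValue();
    }

    int storedVersions() {
        return versions.size();
    }

    public static void main(String[] args) {
        TombstoneSketch cell = new TombstoneSketch();
        cell.put(1L, "old");
        cell.delete(2L);
        System.out.println(cell.get());            // null: the value is hidden
        System.out.println(cell.storedVersions()); // 1: but still stored
        cell.put(3L, "new");
        System.out.println(cell.get());            // new
    }
}
```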
HBase: Joins
• HBase does not support joins
• Can be done in the application layer
• Using scan() and get() operations
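An application-layer join can be sketched as a scan over one table with a get against the other for each row. The example below uses two in-memory sorted maps in place of HBase tables; the class name, the "orders" and "customers" tables and their contents are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ClientSideJoin {
    // orders: order id -> customer id; customers: customer id -> customer name
    static List<String> join(TreeMap<String, String> orders, TreeMap<String, String> customers) {
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, String> order : orders.entrySet()) { // plays the role of scan()
            String customerId = order.getValue();
            String customerName = customers.get(customerId);        // plays the role of get()
            if (customerName != null) {
                joined.add(order.getKey() + " " + customerName);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        TreeMap<String, String> orders = new TreeMap<>();
        orders.put("order-1", "cust-b");
        orders.put("order-2", "cust-a");
        orders.put("order-3", "cust-x"); // no matching customer, dropped by the join

        TreeMap<String, String> customers = new TreeMap<>();
        customers.put("cust-a", "Alice");
        customers.put("cust-b", "Bob");

        System.out.println(join(orders, customers)); // [order-1 Bob, order-2 Alice]
    }
}
```

Against a real cluster each lookup would be a network round trip, so this pattern suits small left-hand sides or batched gets.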
Altering a Table
Disable the table before changing the schema
Logging Operations
HBase Deployment
Master node
Slave nodes
HBase vs. HDFS
HBase vs. RDBMS
When to use HBase
References
• https://www.bmc.com/blogs/hadoop-hbase/
• https://towardsdatascience.com/hbase-working-principle-a-part-of-hadoop-architecture-fbe0453a031b
• https://medium.com/hands-on-apache-hbase/an-introduction-to-apache-hbase-2cdd1d9ff13
• https://builtin.com/data-science/hbase
• https://www.analyticsvidhya.com/blog/2022/10/a-brief-introduction-to-apache-hbase-and-its-architecture/
• https://www.tutorialspoint.com/hbase/hbase_overview.htm
• https://www.geeksforgeeks.org/apache-hbase/
• https://www.simplilearn.com/tutorials/hadoop-tutorial/hbase