
Apache HBase

Apache HBase is an open-source, distributed, column-oriented database that allows for real-time querying of large datasets. It stores data in tables that can contain billions of rows and millions of columns. HBase runs on top of HDFS and provides random, real-time read and write access to big data.

Uploaded by

iconoc

Apache HBase is an open-source, NoSQL, distributed big data store. It enables random,
strictly consistent, real-time access to petabytes of data. HBase is very effective for handling
large, sparse datasets.

HBase is a high-reliability, high-performance, column-oriented, scalable distributed storage
system that can be used to build large-scale structured storage clusters on inexpensive
commodity servers. The goal of HBase is to store and process very large amounts of data,
specifically tables with billions of rows and millions of columns, using only standard
hardware configurations.

Unlike MapReduce, which is an offline batch computing framework, HBase is a platform for
random-access storage and retrieval. It makes up for the fact that HDFS on its own cannot
access data randomly.

It is suitable for business scenarios whose real-time requirements are not extremely strict.
HBase stores values as byte arrays and does not interpret data types, which allows dynamic,
flexible data models.

HBase basic architecture

HBase consists of an HMaster and multiple HRegionServers, and follows a master-slave server
architecture. HBase divides each logical table into multiple data blocks called HRegions and
stores them on HRegionServers.

HMaster: Responsible for managing all HRegionServers. It does not store any table data itself,
only the mappings (metadata) from data to HRegionServers.

Client: Uses HBase's RPC mechanism to communicate with HMaster and HRegionServer, submit
requests, and get results. For management operations, the client performs RPC with HMaster.
For data read and write operations, the client performs RPC with HRegionServer.

ZooKeeper: Each node in the cluster registers its status information with ZooKeeper, so
HMaster can sense the health of each HRegionServer at any time. ZooKeeper also prevents
HMaster from becoming a single point of failure.

HMaster: Manages all HRegionServers, tells them which HRegions they need to maintain, and
monitors the health of all HRegionServers. When a new HRegionServer registers with HMaster,
HMaster tells it to wait for data to be allocated. When an HRegionServer dies, HMaster marks
all HRegions that server was responsible for as unallocated and then assigns them to other
HRegionServers.

HMaster is not a single point of failure. HBase can start multiple HMasters, and ZooKeeper's
election mechanism ensures that exactly one of them is active in the cluster at any time,
which improves the availability of the cluster.
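
The reassignment step described above can be illustrated with a small toy model. This is only
a sketch: the function name `reassign_regions`, the region and server names, and the
"least-loaded server first" policy are assumptions made for illustration, not HBase's actual
balancer or API.

```python
def reassign_regions(assignments, dead_server, live_servers):
    """Return a new region -> server map after dead_server has failed.

    assignments:  dict mapping region name -> server name (hypothetical names)
    live_servers: servers still registered as healthy
    """
    if not live_servers:
        raise ValueError("no live servers to take over the regions")

    # Count how many regions each live server already holds.
    load = {server: 0 for server in live_servers}
    new_assignments = {}
    for region, server in assignments.items():
        if server != dead_server:
            new_assignments[region] = server
            if server in load:
                load[server] += 1

    # Regions of the dead server are "unallocated"; hand each one to the
    # currently least-loaded live server (assumed policy, ties broken by name).
    orphaned = sorted(r for r, s in assignments.items() if s == dead_server)
    for region in orphaned:
        target = min(live_servers, key=lambda s: (load[s], s))
        new_assignments[region] = target
        load[target] += 1
    return new_assignments


before = {"r1": "rs1", "r2": "rs1", "r3": "rs2", "r4": "rs3"}
after = reassign_regions(before, dead_server="rs1", live_servers=["rs2", "rs3"])
print(after)
```

In this example the two regions of the failed server `rs1` are spread across `rs2` and `rs3`,
while regions that were already on healthy servers stay where they are.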

HRegion: When the size of a table exceeds a preset value, HBase automatically divides the
table into regions, each of which contains a subset of the rows in the table. For the user,
each table is a collection of data in which rows are distinguished by a primary key (RowKey).
Physically, a table is split into multiple blocks, each of which is an HRegion. An HRegion is
identified by the table name plus its start and end row keys. One HRegion holds a contiguous
range of rows from a table, and the complete table is stored across multiple HRegions.
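
Because each HRegion covers a contiguous range of row keys, finding the region for a row is a
search over sorted start keys. The sketch below shows the idea with Python's `bisect` module.
The class `RegionMap`, the table name, and the split keys are hypothetical, and the empty byte
string is used here to mean "start of table" or "end of table".

```python
import bisect


class RegionMap:
    """Toy lookup of (table, start_key, end_key) for a row key."""

    def __init__(self, table, split_keys):
        self.table = table
        # The first region starts at the empty key; each split key starts a new region.
        self.start_keys = [b""] + sorted(split_keys)

    def locate(self, row_key):
        # Find the last start key that is <= row_key.
        i = bisect.bisect_right(self.start_keys, row_key) - 1
        start = self.start_keys[i]
        # The end key is the next region's start key, or empty for the last region.
        end = self.start_keys[i + 1] if i + 1 < len(self.start_keys) else b""
        return (self.table, start, end)


regions = RegionMap("users", [b"g", b"n"])
print(regions.locate(b"alice"))  # falls in the first region
print(regions.locate(b"zed"))    # falls in the last region
```

The start key is inclusive and the end key exclusive in this sketch, so a row key equal to a
split key belongs to the region that starts at that key.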

HRegionServer: All data in HBase is ultimately stored in HDFS at the bottom layer. Users
obtain this data through a set of HRegionServers. Generally, only one HRegionServer runs on
each node of the cluster, and each HRegion is maintained by only one HRegionServer.
HRegionServer is mainly responsible for reading and writing data to the HDFS file system in
response to user I/O requests. It is the core module in HBase.

HRegionServer internally manages a series of HRegion objects, each HRegion corresponding to
a contiguous data segment in the logical table. An HRegion is composed of multiple HStores.
Each HStore corresponds to the storage of one column family in the logical table, so each
column family is a centralized storage unit. Therefore, to improve operational efficiency, it
is preferable to place columns with common I/O characteristics in one column family.

HStore: The core of HBase storage, consisting of a MemStore and StoreFiles. MemStore is a
memory buffer. Data written by the user is first put into MemStore. When MemStore is full, it
is flushed to a StoreFile (the underlying implementation is HFile). When the number of
StoreFiles grows to a certain threshold, a Compact (merge) operation is triggered, which
merges multiple StoreFiles into one StoreFile and performs version merging and data deletion
during the merge. It follows that HBase only ever appends data, and all update and delete
operations are actually carried out in the subsequent Compact process. A user's write
operation can therefore return as soon as it enters memory, which ensures the high
performance of HBase I/O.

As StoreFiles are compacted, they gradually form larger and larger StoreFiles. When the size
of a single StoreFile exceeds a certain threshold, a Split operation is triggered. The current
HRegion is split into two HRegions, and the parent HRegion goes offline. The two child
HRegions are assigned to HRegionServers by HMaster, so that the load of the original HRegion
is spread across the two new HRegions.
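
The write path described above (buffer in memory, flush to an immutable file, merge files
later) can be sketched as a toy model. Everything here is an assumption for illustration: the
class `ToyStore`, the count-based thresholds (a real MemStore flushes by size in bytes), and
the single compaction step that merges all files and drops deleted cells.

```python
TOMBSTONE = object()  # marker written in place of a value when a key is deleted


class ToyStore:
    """Toy model of one HStore: a MemStore plus a list of immutable StoreFiles."""

    def __init__(self, flush_size=2, compact_threshold=2):
        self.flush_size = flush_size
        self.compact_threshold = compact_threshold
        self.memstore = {}
        self.store_files = []  # oldest first; each file is a dict sorted by key

    def put(self, key, value):
        # Writes only touch memory, so they return quickly.
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_size:
            self.flush()

    def delete(self, key):
        # A delete is just another write: it appends a tombstone.
        self.put(key, TOMBSTONE)

    def flush(self):
        if not self.memstore:
            return
        self.store_files.append(dict(sorted(self.memstore.items())))
        self.memstore = {}
        if len(self.store_files) >= self.compact_threshold:
            self.compact()

    def compact(self):
        merged = {}
        for store_file in self.store_files:  # newer files overwrite older ones
            merged.update(store_file)
        # Deletes take physical effect only now: tombstoned keys are dropped.
        live = {k: v for k, v in sorted(merged.items()) if v is not TOMBSTONE}
        self.store_files = [live]

    def get(self, key):
        # Check memory first, then files from newest to oldest.
        for source in [self.memstore] + self.store_files[::-1]:
            if key in source:
                value = source[key]
                return None if value is TOMBSTONE else value
        return None


store = ToyStore()
store.put("a", 1)
store.put("b", 2)    # second write fills the MemStore and triggers a flush
store.put("a", 10)   # an update is appended, not applied in place
store.delete("b")    # fills the MemStore again: flush, then compaction
print(store.get("a"), store.get("b"), len(store.store_files))
```

After the second flush the two StoreFiles are merged into one, the newer value of `a` wins,
and the deleted key `b` disappears from disk.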

HLog: Each HRegionServer has an HLog object, a class that implements the Write Ahead Log
(WAL). Each time a user writes data to MemStore, a copy of the data is also written to the
HLog file. The HLog file is rolled periodically, and old files whose data has already been
persisted to StoreFiles are deleted. When HMaster detects through ZooKeeper that an
HRegionServer has terminated unexpectedly, it first processes the leftover HLog file: it
splits the HLog data by HRegion, puts each part into the corresponding HRegion directory, and
then reassigns the affected HRegions. While loading such an HRegion, the new HRegionServer
finds that there is a historical HLog to process, so it replays the HLog data into MemStore
and then flushes it to StoreFiles to complete data recovery.
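
The recovery sequence above (log every write, split the log by region after a crash, replay
it into memory) can be sketched as follows. This is a toy model under stated assumptions: the
names `ToyRegionServer`, `split_wal`, and `replay` are hypothetical, and a Python list stands
in for the durable HLog file.

```python
class ToyRegionServer:
    """Toy server that logs each write before applying it to memory."""

    def __init__(self):
        self.wal = []        # stand-in for the durable HLog file
        self.memstore = {}   # region -> {key: value}; lost if the server crashes

    def put(self, region, key, value):
        self.wal.append((region, key, value))              # 1. write-ahead log entry
        self.memstore.setdefault(region, {})[key] = value  # 2. in-memory write


def split_wal(wal):
    """Group log entries by region, preserving their original order."""
    per_region = {}
    for region, key, value in wal:
        per_region.setdefault(region, []).append((key, value))
    return per_region


def replay(edits):
    """Rebuild one region's MemStore by re-applying its edits in order."""
    memstore = {}
    for key, value in edits:
        memstore[key] = value
    return memstore


server = ToyRegionServer()
server.put("r1", "a", 1)
server.put("r2", "x", 9)
server.put("r1", "a", 2)

surviving_log = list(server.wal)  # the log survives the crash, the memstore does not
edits_by_region = split_wal(surviving_log)
recovered_r1 = replay(edits_by_region["r1"])
print(recovered_r1)
```

Replaying the edits in log order matters: the later write to key `a` must overwrite the
earlier one, so the recovered state matches what the crashed server held in memory.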


Features of HBase

o Horizontally scalable: capacity grows by adding more servers to the cluster, and columns
  can be added at any time.
o Automatic failover: when an HRegionServer fails, its HRegions are automatically reassigned
  to other servers, and a standby HMaster can take over if the active one fails.
o Integration with the MapReduce framework: HBase tables can be used as the input and output
  of MapReduce jobs, and HBase is built on top of the Hadoop Distributed File System.
o A sparse, distributed, persistent, multidimensional sorted map, indexed by row key,
  column key, and timestamp.
o Often referred to as a key-value store, a column-family-oriented database, or a store of
  versioned maps of maps.
o Fundamentally, it is a platform for storing and retrieving data with random access.
o It doesn't care about data types (you can store an integer in one row and a string in
  another for the same column).
o It doesn't enforce relationships within your data.
o It is designed to run on a cluster of computers built from commodity hardware.
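
The "multidimensional sorted map" in the list above can be made concrete with a small sketch.
It is a toy model only: the class `ToyTable` and the row and column names are hypothetical,
and values are plain bytes because HBase does not interpret data types.

```python
class ToyTable:
    """Toy model of the data model: (row key, column key, timestamp) -> bytes."""

    def __init__(self):
        self.cells = {}

    def put(self, row, column, timestamp, value):
        # Writing a new timestamp adds a version; older versions are kept.
        self.cells[(row, column, timestamp)] = value

    def get(self, row, column):
        # Return the newest version of the cell, or None if the cell is absent.
        versions = [(ts, v) for (r, c, ts), v in self.cells.items()
                    if r == row and c == column]
        return max(versions, key=lambda tv: tv[0])[1] if versions else None

    def scan(self):
        # Sorted by row key, then column key, then newest timestamp first.
        return sorted(self.cells.items(),
                      key=lambda kv: (kv[0][0], kv[0][1], -kv[0][2]))


table = ToyTable()
table.put(b"row1", b"cf:age", 1, b"30")
table.put(b"row1", b"cf:age", 2, b"31")         # a second version of the same cell
table.put(b"row2", b"cf:age", 1, b"unknown")    # same column, different kind of value
print(table.get(b"row1", b"cf:age"))
print(table.get(b"row2", b"cf:name"))           # sparse: missing cells cost nothing
```

Cells that were never written simply do not exist in the map, which is why the model handles
sparse data well.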

Advantages of HBase

HBase has a good number of benefits and is a good solution in many use cases. Some of the
advantages of HBase are:

o Random, consistent read and write access under high request volumes
o Automatic failover and reliability
o Flexible, column-based multidimensional map structure
o Variable schema: columns can be added and removed dynamically
o Java client API, plus Thrift and REST APIs
o MapReduce and Hive/Pig integration
o Automatic partitioning and sharding
o Low-latency access to data
o BlockCache and Bloom filters for query optimization
o Support for data compression, and a good fit for sparse data
