Apache HBase
Apache HBase is an open-source, NoSQL, distributed big data store. It enables random,
strictly consistent, real-time access to petabytes of data. HBase is very effective for handling
large, sparse datasets.
HBase is a storage system that can be used to build large-scale structured storage clusters on
inexpensive PC servers. The goal of HBase is to store and process large amounts of data,
specifically to handle large tables consisting of thousands of rows and columns.
Unlike MapReduce, which is an offline batch computing framework, HBase is a random-access
storage and retrieval platform, which makes up for HDFS's lack of random data access.
It is suitable for business scenarios where real-time requirements are not very high. HBase
stores byte arrays and is indifferent to data types, which allows dynamic, flexible data models.
HBase consists of HMaster and HRegionServer nodes and follows a master-slave server
architecture. HBase divides a logical table into multiple data blocks, called HRegions, and
stores them on HRegionServers.
HMaster: Responsible for managing all HRegionServers. It does not store any data itself.
Clients submit requests and get results. For management operations, the client performs RPC with
HMaster. For data read and write operations, the client performs RPC with HRegionServer.
ZooKeeper: Each node in the cluster registers its status information with ZooKeeper, so
HMaster can sense the health status of each HRegionServer at any time. This also helps avoid
HMaster becoming a single point of failure.
HMaster: Manages all HRegionServers, tells them which HRegions they need to maintain, and
monitors the health of all HRegionServers. When a new HRegionServer registers with HMaster,
HMaster tells it to wait for data to be allocated. When an HRegionServer dies, HMaster marks all
HRegions it was responsible for as unallocated and then assigns them to other HRegionServers.
HMaster is not a single point of failure, because HBase can start multiple HMasters. Through
ZooKeeper's election mechanism, there is always one active HMaster running in the cluster.
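The reassignment step described here can be illustrated with a small, self-contained sketch. This is a toy model, not HBase code: the function name reassign_regions, the server and region names, and the "least-loaded server first" policy are all assumptions made for illustration, and the real HMaster assignment logic is more involved.

```python
def reassign_regions(assignments, dead_server, live_servers):
    """Return a new region -> server map after `dead_server` is lost.

    Hypothetical policy (an assumption of this sketch): each orphaned
    region goes to the live server that currently holds the fewest regions.
    """
    if not live_servers:
        raise ValueError("no live region servers available")

    new_assignments = dict(assignments)

    # Count how many regions each live server already holds.
    load = {server: 0 for server in live_servers}
    for server in assignments.values():
        if server in load:
            load[server] += 1

    # Regions held by the dead server are treated as unallocated.
    orphaned = sorted(r for r, s in assignments.items() if s == dead_server)

    for region in orphaned:
        # Pick the least-loaded server; ties are broken by server name.
        target = min(live_servers, key=lambda s: (load[s], s))
        new_assignments[region] = target
        load[target] += 1

    return new_assignments


assignments = {"r1": "rs1", "r2": "rs1", "r3": "rs2", "r4": "rs3"}
result = reassign_regions(assignments, "rs1", ["rs2", "rs3"])
print(result)
```

In this run the two regions of the failed server rs1 end up on different survivors, so each remaining server holds two regions.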
HRegion: When the size of a table exceeds a preset value, HBase automatically
divides the table into different regions, each of which contains a subset of the rows in the
table. For the user, each table is a collection of data, distinguished by a primary key
(RowKey). Physically, a table is split into multiple blocks, each of which is an HRegion. Each
HRegion is identified by the table name plus its start and end row keys. One HRegion
stores a contiguous range of rows in a table, and the complete table is stored across multiple
HRegions.
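Because each HRegion covers a contiguous range of row keys, finding the region for a given row key amounts to a search over the sorted region boundaries. The sketch below is only an illustration of that idea. The class RegionMap, its method locate, the half-open [start, end) ranges, and the use of an empty byte string for an open-ended boundary are assumptions of this example, not the HBase client API.

```python
import bisect


class RegionMap:
    """Toy lookup of the region that owns a row key (illustrative only)."""

    def __init__(self, table, split_keys):
        self.table = table
        # Sorted boundaries between regions; n split keys give n + 1 regions.
        self.split_keys = sorted(split_keys)

    def locate(self, row_key):
        """Return (table, start_key, end_key) for the region holding row_key.

        Regions are modeled as half-open ranges [start_key, end_key).
        b"" stands for "no lower bound" or "no upper bound".
        """
        idx = bisect.bisect_right(self.split_keys, row_key)
        start = self.split_keys[idx - 1] if idx > 0 else b""
        end = self.split_keys[idx] if idx < len(self.split_keys) else b""
        return (self.table, start, end)


regions = RegionMap("users", [b"g", b"p"])
print(regions.locate(b"apple"))  # first region
print(regions.locate(b"g"))      # a split key belongs to the region it starts
print(regions.locate(b"zebra"))  # last region
```

Byte strings compare lexicographically in Python, which matches the byte-array row keys described earlier.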
HRegionServer: All data in HBase is ultimately stored in HDFS. Users
access this data through a series of HRegionServers. Generally, only one HRegionServer
runs on each node of the cluster, and each HRegion is maintained by only
one HRegionServer. HRegionServer is mainly responsible for reading and writing data to the
HDFS file system in response to user I/O requests. It is the core module in HBase.
Each HRegion corresponds to a continuous data segment in the logical table. An HRegion is
composed of multiple HStores. Each HStore corresponds to the storage of one column family in
the logical table, so each column family is a centralized storage unit. Therefore, to improve
operational efficiency, it is preferable to place columns with common I/O characteristics in
one column family.
HStore: The core of HBase storage, which consists of a MemStore and StoreFiles.
MemStore is a memory buffer. Data written by the user is first put into MemStore.
When MemStore is full, it is flushed to a StoreFile (the underlying implementation is HFile).
When the number of StoreFiles grows past a certain threshold, a Compact (merge)
operation is triggered, which merges multiple StoreFiles into one StoreFile and performs version
merging and data deletion during the merge. It follows that
HBase only ever appends data, and all update and delete operations are applied in the subsequent
Compact process, so that a user's write can return as soon as it enters
memory, ensuring the high performance of HBase I/O. As StoreFiles are compacted, they
gradually form larger and larger StoreFiles. When the size of a single StoreFile exceeds a
certain threshold, a Split operation is triggered: the current HRegion
is split into two HRegions, and the parent HRegion goes offline. The two child HRegions
are assigned to HRegionServers by HMaster, so that the load of the
original HRegion is spread across two HRegions.
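The write path described for HStore (buffer in memory, flush to an immutable file, merge files later) can be sketched in a few lines. The following is a simplified in-memory model, not HBase's implementation: the class ToyStore, its thresholds, and the use of None as a delete marker are assumptions of this example, and the "files" are plain Python lists rather than HFiles.

```python
class ToyStore:
    """Toy model of one HStore: a MemStore plus a list of StoreFiles."""

    TOMBSTONE = None  # hypothetical delete marker used by this sketch

    def __init__(self, flush_size=2, compact_threshold=3):
        self.flush_size = flush_size
        self.compact_threshold = compact_threshold
        self.memstore = {}
        self.store_files = []  # oldest first; each is a sorted list of pairs

    def put(self, key, value):
        # Writes only touch memory; older versions on "disk" are left alone.
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_size:
            self.flush()

    def delete(self, key):
        # A delete is just another write: it records a marker.
        self.put(key, self.TOMBSTONE)

    def flush(self):
        if not self.memstore:
            return
        self.store_files.append(sorted(self.memstore.items()))
        self.memstore = {}
        if len(self.store_files) >= self.compact_threshold:
            self.compact()

    def compact(self):
        # Merge all files, oldest to newest, so newer versions win.
        merged = {}
        for store_file in self.store_files:
            merged.update(store_file)
        # All files are merged here, so delete markers can be dropped safely.
        live = sorted(
            (k, v) for k, v in merged.items() if v is not self.TOMBSTONE
        )
        self.store_files = [live]

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for store_file in reversed(self.store_files):  # newest file first
            for k, v in store_file:
                if k == key:
                    return v
        return None


store = ToyStore(flush_size=2, compact_threshold=3)
store.put("a", "1")
store.put("b", "2")   # first flush
store.put("a", "3")
store.delete("b")     # second flush, contains the delete marker
store.put("c", "4")
store.put("d", "5")   # third flush, which triggers the compaction
print(store.store_files)
```

After the third flush the three files are merged into one, the older version of "a" is discarded, and the deleted key "b" disappears. Region splitting is left out of this sketch.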
HLog: Each HRegionServer has an HLog object, which is a class that
implements the Write Ahead Log (WAL). Each time a user writes data to MemStore, a
copy of the data is also written to the HLog file. The HLog file is rolled periodically, and
old files whose data has already been persisted to StoreFiles are deleted. When HMaster detects
that an HRegionServer has terminated unexpectedly, it first processes the
legacy HLog file: it splits the HLog data by HRegion, puts each part into the
corresponding HRegion directory, and then reassigns the affected HRegions. While
loading these HRegions, the receiving HRegionServer finds that there is a historical HLog
to process, so it replays the HLog data into MemStore and then flushes it to StoreFiles.
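The recovery flow for HLog (log every write, split the log by region after a crash, replay each part into a fresh MemStore) can be modeled as follows. This is a minimal sketch under stated assumptions: ToyRegionServer, split_log, and replay are hypothetical names, the log is a Python list instead of files on HDFS, and trimming the log entry by entry on flush stands in for the periodic rolling and deletion of old log files.

```python
class ToyRegionServer:
    """Toy region server with one shared write-ahead log (illustrative)."""

    def __init__(self):
        self.wal = []          # append-only list of (region, key, value)
        self.memstores = {}    # region -> {key: value}, lost on a crash
        self.store_files = {}  # region -> {key: value}, survives a crash

    def put(self, region, key, value):
        # Log first, then apply the write to the in-memory buffer.
        self.wal.append((region, key, value))
        self.memstores.setdefault(region, {})[key] = value

    def flush(self, region):
        persisted = self.store_files.setdefault(region, {})
        persisted.update(self.memstores.pop(region, {}))
        # Entries already persisted are no longer needed for recovery.
        self.wal = [entry for entry in self.wal if entry[0] != region]


def split_log(wal):
    """Group surviving log entries by region, preserving write order."""
    per_region = {}
    for region, key, value in wal:
        per_region.setdefault(region, []).append((key, value))
    return per_region


def replay(edits):
    """Rebuild a MemStore from one region's log entries."""
    memstore = {}
    for key, value in edits:
        memstore[key] = value  # later entries overwrite earlier ones
    return memstore


rs = ToyRegionServer()
rs.put("region-a", "row1", "v1")
rs.put("region-b", "row9", "v9")
rs.put("region-a", "row1", "v2")
rs.flush("region-b")

# Simulated crash: the MemStores are gone, only the log survives.
surviving_wal = list(rs.wal)
recovered = {
    region: replay(edits) for region, edits in split_log(surviving_wal).items()
}
print(recovered)
```

Only region-a needs recovery in this run, because region-b had already been flushed before the simulated crash.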
Advantages of HBase
HBase has a good number of benefits and is a good solution in many use cases. Let us look at
some of the advantages of HBase: