HD Mod09 HBase Phoenix
Module 09 – HBase
cd /usr/hdp/2.2.0.0-2041/hbase/bin
hbase shell
• Accumulo
– Key/Value store
– Security focus
• Allows controlling access both by row and column
• Rarely used outside of government
• Cassandra
– Key/Value store
– Independent of the Hadoop Ecosystem
– Key distinguishing feature is cross data-center replication.
• synchronous or asynchronous replication
• MongoDB (Non-Apache project)
– Document Store
• Document ID points to a Binary JSON (BSON) structure rather than a
values list
– Often used as an Object data store
– Sometimes described as a competitor to HBase. It’s not.
• Different storage mechanism. MongoDB returns the entire
document. HBase returns one or more values associated with
a row ID
• Problem - Hadoop (HDFS plus MapReduce) performs only batch processing, and data is
accessed only sequentially. That means the entire dataset must be scanned even for the
simplest of jobs. A new solution is needed to
access any point of data in a single unit of time (random access)
• Solution - Applications such as HBase, Cassandra, CouchDB, Dynamo, and
MongoDB are some of the databases that store huge amounts of data and
access the data in a random manner
• HBase is a data model similar to Google’s Bigtable, designed to
provide quick random access to huge amounts of structured data. It
leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS)
• Unlike HDFS, HBase is good for:
• Fast record lookup
• Support for record-level insertion
• Support for updates (not in place); rather, updates are done by
creating new versions of values
Note: if you want to do a Join, use Phoenix or load HBase tables into Hive and do the join there
1. Made for big data. By design, NoSQL is capable of storing, processing, and
managing huge amounts of data. This not only includes the structured data you collect
from your web form or at the point-of-sale, but text messages, word processing
documents, videos and other forms of unstructured data as well. While RDBMS
applications are growing in terms of what they can handle in capacity, they are largely
outclassed and outmatched by NoSQL.
3. Cost-effective data processing. Commercial RDBMS solutions like SQL Server
tend to perform best when paired with commercial servers, which means you could
end up shelling out a lot of cash depending on the number of machines in your cluster.
NoSQL, on the other hand, thrives on low-cost commodity hardware. As a result, it
offers what is often a significantly more cost effective way to store and process data in
comparison to its proprietary competitors.
Disadvantages
1. Lack of familiarity. Hitched on the shoulders of big data, NoSQL is slowly but surely
piggybacking its way to mainstream viability. However, RDBMS tools have been
around forever in IT years and form the only category of databases many businesses
know. Like any new technology, NoSQL can be a tough sell for senior-level executives
who are coddled in the comfort and familiarity of their existing systems.
2. Management challenges. Big data tools aim to make managing large amounts of
information as simple as possible. But ask any administrator who is responsible for
interacting with the databases behind these tools and most will tell you that we still have
a long way to go in the simplicity department. NoSQL, in particular, has a reputation for
being challenging to install, and even more hectic to manage on a day-to-day basis.
3. Limited expertise. Based on maturity alone, there are countless administrators that
know the fundamentals of MySQL and RDBMS software in general like the back of their
hands. With big data being a relatively new concept, it’s fair to say that even a specialist
only has limited knowledge of NoSQL. This is a critical factor that can make assembling
a staff of skilled data administrators and engineers quite the daunting task
• NoSQL databases are built to allow the insertion of data without a predefined
schema (i.e., they don't mind storing an INT in one row and a string in the next row of the
same column). That makes it easy to make significant application changes in
real-time, without worrying about service interruptions – which means
development is faster, code integration is more reliable, and less database
administrator time is needed
Here are a few questions you should ask when deciding which NoSQL to choose.
• "Which NoSQL should I choose?" is, by itself, the wrong question to ask. Instead, ask yourself about the
functionality you are implementing and the requirements on the data storage
solution
• Instead ask whether your Use Cases require the following support:
• Random reads or random writes
• Sequential reads or sequential writes
• High read throughput or high write throughput
• Whether the data changes or remains immutable once written
• Storage model most suitable for your access patterns
• Column/column family oriented
• Key-value
• Document-oriented
• Schema/Schemaless
• Whether consistency or availability is most desirable
Use Apache HBase™ when you need random, realtime read/write access to your Big
Data. This project's goal is the hosting of very large tables -- billions of rows X millions
of columns -- atop clusters of commodity hardware. Apache HBase is an open-source,
distributed, versioned, non-relational database modeled after Google's Bigtable: A
Distributed Storage System for Structured Data by Chang et al. Just as Bigtable
leverages the distributed data storage provided by the Google File System, Apache
HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
HBase provides a fault-tolerant way of storing large quantities of sparse data (small
amounts of information caught within a large collection of empty or unimportant data,
such as finding the 50 largest items in a group of 2 billion records, or finding the non-
zero items representing less than 0.1% of a huge collection).
Features
• No real Indexes – Rows are stored sequentially (as are columns within each
row). Insert performance is independent of table size
• No NULL records – HBase doesn't store anything to indicate absence of data
• Automatic partitioning – As your tables grow, they are automatically split
into Regions and distributed across all available nodes
• Linearly scalable – Regions automatically rebalance when new node is added
• Supports Versioning – So can look back in time for data
• HBase is a database built on top of Hadoop – It depends on Hadoop for both
data access and data reliability
• All data is stored in the form of a Byte array
• Commodity hardware, Fault tolerance, Batch processing
• Fast access to cell via Row Key, Column Family, Column, Version
– Row Keys are the critical design point in HBase
– Row Keys are sorted automatically
– Row Keys must allow spreading of access across Region Servers
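A minimal sketch of that last point, using a hypothetical 'events' table with column family 'd': pre-split the table and prefix (salt) the Row Key so writes spread across Region Servers instead of hammering one Region.
create 'events', 'd', SPLITS => ['1', '2', '3']
put 'events', '2|20150601120000|sensor42', 'd:temp', '71'   # the salt prefix '2' routes this row to the third region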
Just like in a Relational Database, data in HBase is stored in Tables and these Tables
are stored in Regions. When a Table becomes too big, the Table is partitioned into
multiple Regions. These Regions are assigned to Region Servers across the cluster.
Each Region Server hosts roughly the same number of Regions.
The HBase Master is responsible for:
• Performing Administration
• Managing and Monitoring the Cluster
• Assigning Regions to the Region Servers
• Controlling the Load Balancing and Failover
Each Region Server contains a Write-Ahead Log (called HLog) and multiple Regions.
Each Region in turn is made up of a MemStore and multiple StoreFiles (HFile). The
data lives in these StoreFiles in the form of Column Families (explained below). The
MemStore holds in-memory modifications to the Store (data).
The mapping of Regions to Region Server is kept in a system table called .META.
When trying to read or write data from HBase, the clients read the required Region
information from the .META table and directly communicate with the appropriate Region
Server. Each Region is identified by the start key (inclusive) and the end key
(exclusive)
By splitting a table across multiple regions you get faster performance due to parallelism.
The Data Model in HBase is designed to accommodate semi-structured data that could
vary in field size, data type and columns. Additionally, the layout of the data model
makes it easier to partition the data and distribute it across the cluster. The Data Model
in HBase is made of different logical components such as Tables, Rows, Column
Families, Columns, Cells and Versions.
[Figure: a sample HBase Table showing Row Keys, Column Families, Columns and Cell values]
Tables – The HBase Tables are more like logical collection of rows stored in separate
partitions called Regions. As shown above, every Region is then served by exactly one
Region Server. The figure above shows a representation of a Table.
Rows – A row is one instance of data in a table and is identified by a rowkey. Rowkeys
are unique in a Table and are always treated as a byte[].
Column Families – Data in a row are grouped together as Column Families. Each
Column Family has one or more Columns, and these Columns in a family are stored
together in a low level storage file known as HFile. Column Families form the basic unit
of physical storage to which certain HBase features like compression are applied.
Hence it’s important that proper care be taken when designing Column Families in a
table. The table above shows Customer and Sales Column Families. The Customer
Column Family is made up of 2 columns – Name and City, whereas the Sales Column
Family is made up of 2 columns – Product and Amount.
Cell – A Cell stores data and is essentially a unique combination of rowkey, Column
Family and the Column (Column Qualifier). The data stored in a Cell is called its value
and the data type is always treated as byte[].
Version – The data stored in a cell is versioned and versions of data are identified by
the timestamp. The number of versions of data retained in a column family is
configurable and this value by default is 3
• Column Family
– Has a name (string)
– Contains one or more related
columns
put 'cust','2','info:name','mark'
put 'cust','2','info:age','24'
put 'cust','2','info:birth','1958-04-20'
put 'cust','2','info:job','consultant'
Wait a minute. You mean I can define a new column 'on-the-fly' if I need to? Wow
When you put data into HBase, a timestamp is required. The timestamp can be
generated automatically by the RegionServer or can be supplied by you. The timestamp
must be unique per version of a given cell, because the timestamp identifies the
version. To modify a previous version of a cell, for instance, you would issue a Put with
a different value for the data itself, but the same timestamp.
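For instance, a sketch of supplying your own timestamp so the Put targets a specific version (the timestamp value here is purely illustrative; the table and row come from the earlier 'cust' example):
put 'cust', '2', 'info:name', 'marcus', 1469912339000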
You can also configure the maximum and minimum number of versions to keep for a
given column, or specify a default time-to-live (TTL), which is the number of seconds
before a version is deleted. The following examples all use alter statements in HBase
Shell to create new column families with the given characteristics, but you can use the
same syntax when creating a new table or to alter an existing column family. This is
only a fraction of the options you can specify for a given column family.
hbase> alter 't1', NAME => 'f1', VERSIONS => 5
hbase> alter 't1', NAME => 'f1', MIN_VERSIONS => 2
hbase> alter 't1', NAME => 'f1', TTL => 15
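As noted above, the same attributes can be supplied at table-creation time; a small sketch using a hypothetical table 't2':
hbase> create 't2', {NAME => 'f1', VERSIONS => 5, MIN_VERSIONS => 2, TTL => 15}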
HBase sorts the versions of a cell from newest to oldest, by sorting the timestamps
lexicographically. When a version needs to be deleted because a threshold has been
reached, HBase always chooses the "oldest" version, even if it is in fact the most recent
version to be inserted. Keep this in mind when designing your timestamps. Consider
using the default generated timestamps and storing other version-specific data
elsewhere in the row, such as in the row key. If MIN_VERSIONS and TTL
conflict, MIN_VERSIONS takes precedence.
hbase> list
hbase> list 'abc.*'
To break out, type 'q' followed by 'Enter'. To exit, type exit or CTRL+C
Example
Given below is a sample schema of a table named emp. It has two column families:
“personal” and “professional”.
1. Here we create a table named EMP with 2 Column Families. One of the CF
will have 2 versions while the other will have 1 version
Type: create 'emp', {NAME =>'personal', VERSIONS => 2}, {NAME => 'professional'}
3. To drop a table or change its settings (ALTER), you need to first disable the
table using the disable command. You can re-enable it using the enable
command
Note you can use JAVA code for HBase commands too
describe
This command returns the description of the table. Its syntax is as follows:
hbase> describe '<table name>'
alter
Alter is the command used to make changes to an existing table. Using this command,
you can change the maximum number of cells of a column family, set and delete table
scope operators, and delete a column family from a table.
Given below is the syntax to change the maximum number of cells (versions) of a column family:
hbase> alter '<table name>', NAME => '<column family>', VERSIONS => <n>
1. describe
describe 'emp'
2. alter allows you to change Column Family schema. For example, can add or
delete a Column Family from a table. Of course, first you must DISABLE the
table as mentioned in the previous slide
For example, the below would remove the Column Family = 'professional'
from the EMPLOYEE table
Example: disable 'employee'
alter 'employee', 'delete' => 'professional'
enable 'employee'
put
scan
Scan a table; pass table name and optionally a dictionary of scanner specifications.
Scanner specifications may include one or more of the following: LIMIT, STARTROW,
STOPROW, TIMESTAMP, or COLUMNS. If no columns are specified, all columns will
be scanned. To scan all members of a column family, leave the qualifier empty as in
'col_family:'. Examples:
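For instance (the 't1' table and the 'c1'/'c2' columns below are placeholders, not tables created in this lab):
hbase> scan 't1'
hbase> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}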
CACHE_BLOCKS -- which switches block caching for the scanner on (true) or off
(false). By default it is enabled. Examples:
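For instance, assuming the same placeholder table 't1':
hbase> scan 't1', {COLUMNS => ['c1'], CACHE_BLOCKS => false}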
1. Using PUT command, you can insert cell value(s) into a table
Generic syntax:
put '<table name>', 'row<Int>', '<colfamily:colname>', '<value>', timestamp
Type following:
put 'emp','1','personal:name','juli'
put 'emp','1','personal:city','Portsmouth'
put 'emp','1','professional:job','sales'
put 'emp','1','professional:salary','99000'
2. Using scan, you can now read from the table. Type: scan 'emp'
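Before scanning for versions, overwrite the city so the cell has two versions to show (this step is implied by the note below; it assumes row 1 from the puts above):
put 'emp','1','personal:city','Cincinnati'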
If you want to see both versions (the original 'Portsmouth' and the new 'Cincinnati') type:
scan 'emp', {VERSIONS => 2}     (must be typed exactly like this)
Repeat, but select just 'name' column from Column Family = PERSONAL
• Type following: scan 'emp', {COLUMNS => 'personal:name'}
2. Using GET command, select specific row. Type : get 'emp', '2'
Get row or cell contents; pass table name, rowkey , and optionally a
dictionary of column(s), timestamp and versions. Examples:
hbase> get 't1', 'r1'
hbase> get 't1', 'r1', {COLUMN => 'c1'}
hbase> get 't1', 'r1', {COLUMN => ['c1', 'c2', 'c3']}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1, VERSIONS => 4}
deleteall
Delete all cells in a given row; pass a table name, row, and optionally a column and
timestamp
To delete all cells in a row: deleteall '<table name>', '<row>'. You cannot UNDELETE
DELETE doesn't physically delete. It just creates a 'tombstone' so GET and SCAN will no
longer see these cells. Because HFiles are immutable, it's not until a major
compaction is run that these tombstones are reconciled and the space is recovered
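For instance, to remove row 1 from the emp table and then force the tombstones to be reconciled right away (rather than waiting for an automatic major compaction):
deleteall 'emp', '1'
major_compact 'emp'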
count
Count the number of rows in a table. This operation may take a LONG time (Run
'$HADOOP_HOME/bin/hadoop jar hbase.jar rowcount' to run a counting mapreduce
job). Current count is shown every 1000 rows by default. Count interval may be
optionally specified. Examples:
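For instance (again, 't1' is a placeholder table name):
hbase> count 'emp'
hbase> count 't1', INTERVAL => 100000, CACHE => 1000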
truncate
Disables, drops, and recreates the specified table, removing all of its data: truncate '<table name>'
• An HBase table is made of column families which are the logical and
physical grouping of columns. The columns in one family are stored
separately from the columns in another family. If you have data that is not
often queried, assign that data to a separate column family
• The column family and column qualifier names are repeated for each row.
Therefore, keep the names as short as possible to reduce the amount of
data that HBase stores and reads. For example, use f:q instead of
mycolumnfamily:mycolumnqualifier
• Because column families are stored in separate HFiles, keep the number
of column families as small as possible. You also want to reduce the
number of column families to reduce the frequency of MemStore flushes,
and the frequency of compactions. And, by using the smallest number of
column families possible, you can improve the LOAD time and reduce
disk consumption
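A small sketch of that guidance, using a hypothetical table: one short column family name and short column qualifiers keep the per-cell overhead small.
create 'weblogs', {NAME => 'f'}
put 'weblogs', 'row1', 'f:q', 'some value'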
Consider you wish to design a Table of Twitter followers and who they follow.
Here's some alternatives designs to answer queries such as:
• Who does Jarrod follow?
• Does Jarrod follow Jeffrey?
• Who follows Jarrod?
• How many people does a User follow?
CF = follows
RowKey   Column:Cell value
Jarrod   1:Jeffrey  2:Larry  3:Curly  4:Moe  Count:4
Jeffrey  1:Rogers   2:Juli   Count:2
Alternative: the followed user's name is now part of the Column name, with 1 as the Cell Value (Cells
must have a value). Counting the columns in a row tells you how many users a User follows.
CF = follows
RowKey   Column:Cell value
Jarrod   Jeffrey:1  Larry:1  Curly:1  Moe:1
Jeffrey  Rogers:1   Juli:1
Instead of a wide table, how about a tall table? You can also keep the CF name short ('f') to
reduce the data transferred across the network wire.
CF = f
RowKey          Column:Cell value
Jarrod+Jeffrey  Jeffrey:1
Jarrod+Larry    Larry:1
Jarrod+Curly    Curly:1
Jarrod+Moe      Moe:1
Jeffrey+Rogers  Rogers:1
Jeffrey+Juli    Juli:1
Now the question 'Does Jarrod follow Jeffrey?' can be answered by just
searching the Row Key
Answering the questions, 'Who does Jarrod follow?' and 'How many users
does Jarrod follow?' can also be accomplished by searching the Row Key only
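A minimal sketch of the tall-table design in the HBase shell (the 'follows' table name is an assumption; the prefix scan uses '~' as a stop key simply because it sorts after '+'):
create 'follows', 'f'
put 'follows', 'Jarrod+Jeffrey', 'f:Jeffrey', '1'
put 'follows', 'Jarrod+Larry', 'f:Larry', '1'
# Who does Jarrod follow? Scan only the Row Keys with the 'Jarrod+' prefix
scan 'follows', {STARTROW => 'Jarrod+', STOPROW => 'Jarrod~'}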
!exit
Note: you won't see the tables you manually created in HBase (e.g. emp)
Copy and Paste the 2 statements below into Phoenix and execute …
CREATE TABLE department
(dept_num SMALLINT NOT NULL
,dept_name CHAR(30)
,budget DECIMAL(10,2)
,mgr_emp_num INTEGER
,CONSTRAINT pk PRIMARY KEY (dept_num));

CREATE TABLE employee
(emp_num INTEGER NOT NULL
,mgr_emp_num INTEGER
,dept_num INTEGER
,job INTEGER
,last_name CHAR(20)
,first_name VARCHAR(30)
,hire VARCHAR(10)
,birth VARCHAR(10)
,salary DECIMAL(10,2)
,CONSTRAINT pk PRIMARY KEY (emp_num));
The psql command is invoked via psql.py in the Phoenix bin directory. In order to use it
to load CSV data, it is invoked by providing the connection information for your HBase
cluster, the name of the table to load data into, and the path to the CSV file or files. Note
that all CSV files to be loaded must have the ‘.csv’ file extension (this is because
arbitrary SQL scripts with the ‘.sql’ file extension can also be supplied on the PSQL
command line).
To load the example data outlined above into HBase running on the local machine, run
the following command:
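A sketch of what that command looks like (the table name and CSV path are assumptions for this lab; 'localhost' is the ZooKeeper quorum on the sandbox):
bin/psql.py -t EMPLOYEE localhost /path/to/employee.csv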
The following parameters can be used for loading data with PSQL:
Parameter  Description
-t         Provide the name of the table in which to load data. By default, the name of the table is taken from the name of the CSV file. This parameter is case-sensitive.
-h         Overrides the column names to which the CSV data maps; case-sensitive. A special value of in-line indicates that the first line of the CSV file determines the columns to which the data maps.
-s         Run in strict mode, throwing an error on CSV parsing errors.
-d         Supply a custom delimiter or delimiters for CSV parsing.
-q         Supply a custom phrase delimiter; defaults to the double quote character.
-e         Supply a custom escape character; default is a backslash.
-a         Supply an array delimiter (explained in more detail below).
For higher-throughput loading distributed over the cluster, the MapReduce loader can
be used. This loader first converts all data into HFiles, and then provides the created
HFiles to HBase after the HFile creation is complete.
The MapReduce loader is launched using the hadoop command with the Phoenix client
jar, as follows:
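A sketch of the bulk-load invocation from the Phoenix documentation (the jar version, table name and HDFS input path here are assumptions):
hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EMPLOYEE --input /data/employee.csv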
Or use the simpler psql.py loader utility (for simplicity, open a new PuTTY prompt;
you must be at the [root@sandbox bin]# prompt and not the ../hbase-unsecure> prompt)
Ensure you are back at the Phoenix shell and enter SELECT commands for
the 2 tables
SELECT * from employee;
SELECT * from department;
Execute a JOIN
SELECT e.last_name, d.dept_name
FROM employee e JOIN department d
ON e.dept_num = d.dept_num;
Here's something HBase can't do, but Phoenix can
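The output below looks like the result of an aggregation over the same join; a sketch of a query that would produce it, using the column names from the DDL above (the aliases are assumptions):
SELECT d.dept_name AS department_name, SUM(e.salary) AS SumSal
FROM employee e JOIN department d ON e.dept_num = d.dept_num
GROUP BY d.dept_name;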
department_name SumSal
------------------------------ ------------
education 233000.00
research and development 116400.00
customer support 245575.00
marketing sales 200125.00