
Module 09 – HBase and Phoenix

After completing this module, the student should be able to
execute both HBase and Phoenix code and be able to describe:
• HBase overview
• Data Model
• Architecture
• Hands-on labs
• Design considerations
• Running Phoenix over HBase
• Hands-on labs

cd /usr/hdp/2.2.0.0-2041/hbase/bin
hbase shell

Table Of Contents

Before we Begin – Ensure HBase started
What are NoSQL databases?
HDFS vs HBase
HBase vs RDBMS
SQL vs NoSQL
Why use NoSQL?
Which NoSQL should I choose?
HBase characteristics
HBase architecture
Splitting a table across multiple regions
Storage mechanism in HBase
Keys and Column Families
Comparing RDBMS tables to HBase tables
Physical Model - Timestamps
In-line lab: HBase GUI - <IP>:60010
Lab01: Run HBase shell and list tables
Lab02: HBase commands
Lab03: CREATE TABLE command
Lab04: DESCRIBE and ALTER TABLE command
Lab05: PUT and SCAN commands (Insert)
Lab06: PUT and SCAN commands (Update)
Lab07: SCAN command
Lab08: GET command
Lab08: Rows sorted automatically
Lab09: DELETE command
Compacting: HBase housekeeping
Lab10: Other HBase commands
Table Design
Table Design – Column Family
Table Design examples
Table Design examples (cont'd)
Phoenix over HBase
Phoenix characteristics
Lab11: Run Phoenix shell
!tables command
Lab12: CREATE TABLE command
Lab13: INSERT into tables
Lab13: Confirm tables populated
Lab14: Joins
Lab14: Joins with Aggregates
In Review – HBase and Phoenix



Before we Begin – Ensure HBase started

From Ambari, ensure HBase is started. If not, do so now.


• From a web browser, navigate to URL: http://192.168.100.140:8080


Login as: admin / admin
• If needed, go to 'Service Actions' and start HBase. Note it may take a few
minutes for it to show no errors



What are NoSQL databases?
Some other examples of NoSQL databases include:

• Accumulo
– Key/Value store
– Security focus
• Allows controlling access both by row and column
• Rarely used outside of government
• Cassandra
– Key/Value store
– Independent of the Hadoop Ecosystem
– Key distinguishing feature is cross data-center replication.
• synchronous or asynchronous replication

• MongoDB (Non-Apache project)
– Document Store
• Document ID points to Binary JSON structure rather than
values list
– Often used as an Object data store
– Sometimes described as a competitor to HBase. It’s not.
• Different storage mechanism. MongoDB returns the entire
document. HBase returns one or more values associated with
a row ID



What are NoSQL databases?

• Problem - Hadoop can perform only batch processing and data will be
accessed only in a sequential manner. That means one has to search the
entire dataset even for the simplest of jobs. A new solution is needed to
access any point of data in a single unit of time (random access)
• Solution - Applications such as HBase, Cassandra, CouchDB, Dynamo, and
MongoDB are some of the databases that store huge amounts of data and
access the data in a random manner
• HBase is a data model similar to Google's Bigtable, designed to
provide quick random access to huge amounts of structured data. It
leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS)
• Unlike HDFS, HBase is good for:
  • Fast record lookup
  • Support for record-level insertion
  • Support for updates (not in place; updates are done by creating new
    versions of values)
  (If you want to do a Join, use Phoenix, or load the HBase tables into Hive and join there)



HDFS vs HBase


HBase vs RDBMS

HBase cannot do Joins; an RDBMS of course can.


SQL vs NoSQL
Types
  SQL: One type (SQL database) with minor variations
  NoSQL: Many different types, including key-value stores, document databases,
  wide-column stores, and graph databases

Development History
  SQL: Developed in 1970s to deal with the first wave of data storage applications
  NoSQL: Developed in 2000s to deal with limitations of SQL databases, particularly
  concerning scale, replication and unstructured data storage

Examples
  SQL: MySQL, Postgres, Oracle, Teradata
  NoSQL: MongoDB, Cassandra, HBase, Neo4j

Data Storage Model
  SQL: Individual records (e.g., "employees") are stored as rows in tables, with each
  column storing a specific piece of data about that record (e.g., "manager," "date
  hired," etc.), much like a spreadsheet.
  NoSQL: Varies based on database type. For example, key-value stores function
  similarly to SQL databases, but have only two columns ("key" and "value"), with
  more complex information sometimes stored within the "value" columns. Document
  databases do away with the table-and-row model altogether, storing all relevant
  data together in a single "document" in JSON, XML, or another format, which can
  nest values hierarchically.

Schemas
  SQL: Structure and data types are fixed in advance. To store information about a
  new data item, the entire database must be altered, during which time the database
  must be taken offline.
  NoSQL: Typically dynamic. Records can add new information on the fly, and unlike
  SQL table rows, dissimilar data can be stored together as necessary.



Why use NoSQL?
Advantages

1. Made for big data. By design, NoSQL is capable of storing, processing, and
managing huge amounts of data. This not only includes the structured data you collect
from your web form or at the point-of-sale, but text messages, word processing
documents, videos and other forms of unstructured data as well. While RDBMS
applications are growing in terms of what they can handle in capacity, they are largely
outclassed and outmatched by NoSQL.

2. Seamless scalability. Traditionally, many organizations addressed the need for
scaling up by throwing money at the problem. When you needed to accommodate more
data, you simply bought a bigger server with bigger capacity. Big data databases are
designed with scalability in mind, offering a convenient way for companies to transition
to new nodes, both on-premise and in the cloud – all while maintaining the high
level of performance and availability such mission-critical applications require.

3. Cost effective data processing. Commercial RDBMS solutions like SQL Server
tend to perform best when paired with commercial servers, which means you could
end up shelling out a lot of cash depending on the number of machines in your cluster.
NoSQL, on the other hand, thrives on low-cost commodity hardware. As a result, it
offers what is often a significantly more cost effective way to store and process data in
comparison to its proprietary competitors.

Disadvantages

1. Lack of familiarity. Hitched on the shoulders of big data, NoSQL is slowly but surely
making its way to mainstream viability. However, RDBMS tools have been
around forever in IT years and form the only category of databases many businesses
know. Like any new technology, NoSQL can be a tough sell for senior-level executives
who are coddled in the comfort and familiarity of their existing systems.

2. Management challenges. Big data tools aim to make managing large amounts of
information as simple as possible. But ask any administrator who is responsible for
interacting with the databases behind these tools and most will tell you that we still have
a long way to go in the simplicity department. NoSQL, in particular, has a reputation for
being challenging to install, and even trickier to manage on a day-to-day basis.

3. Limited expertise. Based on maturity alone, there are countless administrators who
know the fundamentals of MySQL and RDBMS software in general like the back of their
hands. With big data being a relatively new concept, it's fair to say that even a specialist
only has limited knowledge of NoSQL. This is a critical factor that can make assembling
a staff of skilled data administrators and engineers quite a daunting task.



Why use NoSQL?

• NoSQL databases are built to allow the insertion of data without a predefined
schema (i.e., they don't mind storing an INT in one row and a string in the next row
of the same column). That makes it easy to make significant application changes in
real-time, without worrying about service interruptions – which means
development is faster, code integration is more reliable, and less database
administrator time is needed

• NoSQL database types include (list not exhaustive):


• Document databases pair each key with a complex data structure known as
a document. Documents can contain many different key-value pairs, key-
array pairs, or even nested documents. MongoDB is an example
• Graph stores are used to store information about networks, such as social
connections. Graph stores include Neo4J and HyperGraphDB
• Key-value stores are the simplest NoSQL databases. Every single item in the
database is stored as an attribute name (or "key"), together with its value.
Examples of key-value stores are Riak and Voldemort. Some key-value
stores, such as Redis, allow each value to have a type, such as 'integer',
which adds functionality
• Wide-column stores such as Cassandra and HBase are optimized for queries
over large datasets, and store columns of data together, instead of rows
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis



Which NoSQL should I choose?

Here are a few questions you should ask when deciding which NoSQL to choose.



Which NoSQL should I choose?

• Above is the wrong question to ask. Instead, ask yourself about the
functionality you are implementing and the requirements on the data storage
solution
• Ask whether your Use Cases require the following support:
• Random reads or random writes
• Sequential reads or sequential writes
• High read throughput or high write throughput
• Whether the data changes or remains immutable once written
• Storage model most suitable for your access patterns
• Column/column family oriented
• Key-value
• Document-oriented
• Schema/Schemaless
• Whether consistency or availability is most desirable



HBase characteristics

Use Apache HBase™ when you need random, realtime read/write access to your Big
Data. This project's goal is the hosting of very large tables -- billions of rows X millions
of columns -- atop clusters of commodity hardware. Apache HBase is an open-source,
distributed, versioned, non-relational database modeled after Google's Bigtable: A
Distributed Storage System for Structured Data by Chang et al. Just as Bigtable
leverages the distributed data storage provided by the Google File System, Apache
HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

HBase provides a fault-tolerant way of storing large quantities of sparse data (small
amounts of information caught within a large collection of empty or unimportant data,
such as finding the 50 largest items in a group of 2 billion records, or finding the non-
zero items representing less than 0.1% of a huge collection).

Features

• Linear and modular scalability.


• Strictly consistent reads and writes.
• Automatic and configurable sharding of tables
• Automatic failover support between RegionServers.
• Convenient base classes for backing Hadoop MapReduce jobs with Apache
HBase tables.
• Easy to use Java API for client access.
• Block cache and Bloom Filters for real-time queries.
• Query predicate push down via server side Filters
• Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and
binary data encoding options
• Extensible JRuby-based (JIRB) shell
• Support for exporting metrics via the Hadoop metrics subsystem to files or
Ganglia; or via JMX



HBase characteristics

• No real Indexes – Rows are stored sequentially (as are columns within each
row). Insert performance is independent of table size
• No NULL records – HBase doesn't store anything to indicate absence of data
• Automatic partitioning – As your tables grow, they are automatically split
into Regions and distributed across all available nodes
• Linearly scalable – Regions automatically rebalance when new node is added
• Supports Versioning – So can look back in time for data
• HBase is a database built on top of Hadoop – It depends on Hadoop for both
data access and data reliability
• All data is stored in the form of a Byte array
• Commodity hardware, Fault tolerance, Batch processing
• Fast access to cell via Row Key, Column Family, Column, Version
– Row Keys are the critical design point in HBase
– Row Keys are sorted automatically
– Row Keys must allow spreading of access across Region Servers



HBase architecture

The HBase Physical Architecture consists of servers in a Master-Slave relationship as


shown below. Typically, the HBase cluster has one Master node, called HMaster and
multiple Region Servers called HRegionServer. Each Region Server contains multiple
Regions – HRegions.

Just like in a Relational Database, data in HBase is stored in Tables and these Tables
are stored in Regions. When a Table becomes too big, the Table is partitioned into
multiple Regions. These Regions are assigned to Region Servers across the cluster.
Each Region Server hosts roughly the same number of Regions.

The HBase Master is responsible for

• Performing Administration
• Managing and Monitoring the Cluster
• Assigning Regions to the Region Servers
• Controlling the Load Balancing and Failover

On the other hand, the Region Servers perform the following work

• Hosting and managing Regions


• Splitting the Regions automatically
• Handling the read/write requests
• Communicating with the Clients directly

Each Region Server contains a Write-Ahead Log (called HLog) and multiple Regions.
Each Region in turn is made up of a MemStore and multiple StoreFiles (HFile). The
data lives in these StoreFiles in the form of Column Families (explained below). The
MemStore holds in-memory modifications to the Store (data).

The mapping of Regions to Region Server is kept in a system table called .META.
When trying to read or write data from HBase, the clients read the required Region
information from the .META table and directly communicate with the appropriate Region
Server. Each Region is identified by its start key (inclusive) and its end key
(exclusive); for example, a Region with start key 'E' and end key 'M' holds all rows
from 'E' up to, but not including, 'M'.



HBase architecture
1. HBase client
2. HBase Master – Assigns Regions to Region Servers
3. Region Server – Stores and retrieves clients' data in HDFS,
manages splits when Regions become too large
4. ZooKeeper – Keeps a copy of the hbase:meta catalog for client requests, and cluster state



Splitting a table across multiple regions

By splitting a table across multiple regions you get faster performance due to parallelism.
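As an aside, you don't have to wait for HBase to split a growing table: a table can be
pre-split at creation time. A minimal sketch in the HBase shell (the table name and
split points here are hypothetical, not part of the labs):

   hbase> create 'sales', 'info', SPLITS => ['g', 'm', 't']

This creates the table with 4 regions (row keys below 'g', 'g' to 'm', 'm' to 't', and
't' upward), so reads and writes can spread across Region Servers from the start.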



Storage mechanism in HBase

The Data Model in HBase is designed to accommodate semi-structured data that could
vary in field size, data type and columns. Additionally, the layout of the data model
makes it easier to partition the data and distribute it across the cluster. The Data Model
in HBase is made of different logical components such as Tables, Rows, Column
Families, Columns, Cells and Versions.



Storage mechanism in HBase

• HBase is a column-oriented database and the tables in it are sorted by


rowkey value. The table schema defines only column families, which are the
key value pairs. A table can have multiple column families and each column
family can have any number of columns. Subsequent column values are
stored contiguously on the disk. Each cell value of the table has a timestamp
• In short, in HBase:
• Table consists of a set of Column Families (columns defined during load)
• Column Family is a collection of columns
• Column Names are encoded inside the cells

Column key-value pair example: personal data:name-raju



Keys and Column Families

Tables – HBase Tables are more like logical collections of rows stored in separate
partitions called Regions. As shown above, every Region is then served by exactly one
Region Server. The figure above shows a representation of a Table.

Rows – A row is one instance of data in a table and is identified by a rowkey. Rowkeys
are unique in a Table and are always treated as a byte[].

Column Families – Data in a row are grouped together as Column Families. Each
Column Family has one or more Columns, and these Columns in a family are stored
together in a low level storage file known as an HFile. Column Families form the basic
unit of physical storage to which certain HBase features like compression are applied.
Hence it's important that proper care be taken when designing Column Families in a
table. The table above shows Customer and Sales Column Families. The Customer
Column Family is made up of 2 columns – Name and City, whereas the Sales
Column Family is made up of 2 columns – Product and Amount.

Columns – A Column Family is made of one or more columns. A Column is identified


by a Column Qualifier that consists of the Column Family name concatenated with the
Column name using a colon – example: columnfamily:columnname. There can be
multiple Columns within a Column Family and Rows within a table can have varied
number of Columns.

Cell – A Cell stores data and is essentially a unique combination of rowkey, Column
Family and the Column (Column Qualifier). The data stored in a Cell is called its value
and the data type is always treated as byte[].

Version – The data stored in a cell is versioned and versions of data are identified by
the timestamp. The number of versions of data retained in a column family is
configurable and this value by default is 3
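Because features like compression and versioning apply per Column Family, they are
specified when the family is defined. A minimal sketch in the HBase shell (the table
name is hypothetical; GZ compression is used since it requires no extra native
libraries):

   hbase> create 'demo', {NAME => 'info', COMPRESSION => 'GZ', VERSIONS => 3}, {NAME => 'sales'}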



Keys and Column Families

CREATE TABLE statement: create 'person', 'personal_data', 'demographic'

• Row Key (defined automatically during load)
  – Byte array
  – Serves as the primary key for the table
  – Indexed for fast lookup
• Column Family
  – Has a name (string)
  – Contains one or more related columns
• Column (defined during loading)
  – Belongs to one column family
  – Included inside the row
  – familyName:columnName

Each row has a Row Key that serves as its primary key; each record is divided into
Column Families, and each Column Family consists of one or more Columns (entered
during loading).

Column keys consist of Column Family:Column. For example, in the above, for Row key1,
Column key = personal_data:Address and the Cell value = Budapest, Hungary



Comparing RDBMS tables to HBase tables

RDBMS-like table CUST:

  table: CUST
  (cust_id INT
  ,name STRING
  ,age INT
  ,birth DATE)
  PRIMARY INDEX(cust_id)

An HBase equivalent:

  table         : CUST
  column family : INFO
  columns       : NAME, AGE, BIRTH, ???
  rowkey        : CUST_ID

  create 'cust', 'info'

  put 'cust','1','info:name','juli'
  put 'cust','1','info:age','25'
  put 'cust','1','info:birth','1957-10-07'

  put 'cust','2','info:name','mark'
  put 'cust','2','info:age','24'
  put 'cust','2','info:birth','1958-04-20'
  put 'cust','2','info:job','consultant'

Wait a minute. You mean I can define a new column 'on-the-fly' if I need to? Wow


Physical Model - Timestamps

When you put data into HBase, a timestamp is required. The timestamp can be
generated automatically by the RegionServer or can be supplied by you. The timestamp
must be unique per version of a given cell, because the timestamp identifies the
version. To modify a previous version of a cell, for instance, you would issue a Put with
a different value for the data itself, but the same timestamp.

HBase's behavior regarding versions is highly configurable. The maximum number of


versions defaults to 1 in CDH 5, and 3 in previous versions. You can change the default
value for HBase by configuring hbase.column.max.version in hbase-site.xml, either via
an advanced configuration snippet if you use Cloudera Manager, or by editing the file
directly otherwise.

You can also configure the maximum and minimum number of versions to keep for a
given column, or specify a default time-to-live (TTL), which is the number of seconds
before a version is deleted. The following examples all use alter statements in HBase
Shell to create new column families with the given characteristics, but you can use the
same syntax when creating a new table or to alter an existing column family. This is
only a fraction of the options you can specify for a given column family.
hbase> alter 't1', NAME => 'f1', VERSIONS => 5
hbase> alter 't1', NAME => 'f1', MIN_VERSIONS => 2
hbase> alter 't1', NAME => 'f1', TTL => 15

HBase sorts the versions of a cell from newest to oldest, by sorting the timestamps
lexicographically. When a version needs to be deleted because a threshold has been
reached, HBase always chooses the "oldest" version, even if it is in fact the most recent
version to be inserted. Keep this in mind when designing your timestamps. Consider
using the default generated timestamps and storing other version-specific data
elsewhere in the row, such as in the row key. If MIN_VERSIONS and TTL
conflict, MIN_VERSIONS takes precedence.
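A minimal sketch of explicit timestamps in the HBase shell, reusing the t1/f1 names
from the alter examples above (the row, column, and values are made up):

   hbase> put 't1', 'r1', 'f1:c1', 'v1', 1000
   hbase> put 't1', 'r1', 'f1:c1', 'v2', 2000
   hbase> get 't1', 'r1', {COLUMN => 'f1:c1', VERSIONS => 4}

The get returns both versions, newest first. Re-issuing a put with a new value but the
same timestamp overwrites that version instead of creating a new one:

   hbase> put 't1', 'r1', 'f1:c1', 'v1-fixed', 1000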



Physical Model – Timestamp stored in CF

• Column Families are stored in separate files called HFiles
• Each HFile is partitioned horizontally into Regions (like LP tables in SQL)
• Timestamp values are auto-assigned and used for Versioning

(Figure: the Info Column Family and the roles Column Family are stored as separate HFiles)


In-line lab: HBase GUI - <IP>:60010

HBase has a GUI.



Lab01: Run HBase shell and list tables

List all tables in HBase. An optional regular expression parameter can
be used to filter the output:

hbase> list
hbase> list 'abc.*'



Lab01: Run HBase shell and list tables

1. From the Hadoop PuTTY prompt, start the HBase shell


cd /usr/hdp/2.2.0.0-2041/hbase/bin
hbase shell

To break out, type 'q' followed by 'Enter'. To exit, type exit or CTRL+C

2. Type list to get a list of all the tables



Lab02: HBase commands

1. Type: status and version

   status
   version

2. Just type: help. Or, to get help on tables, use: table_help

   help
   table_help



Lab03: CREATE TABLE command

Creating a Table using HBase Shell


You can create a table using the create command, here you must specify the table
name and the Column Family name. The syntax to create a table in HBase shell is
shown below.

create '<table name>','<column family>'

Example

Given below is a sample schema of a table named emp. It has two column families:
“personal data” and “professional data”.

Row key | personal data | professional data

You can create this table in HBase shell as shown below.

hbase(main):002:0> create 'emp', 'personal data', 'professional data'

And it will give you the following output.

0 row(s) in 1.1300 seconds


=> Hbase::Table - emp



Lab03: CREATE TABLE command

1. Here we create a table named EMP with 2 Column Families. One of the CFs
will keep 2 versions while the other will use the default number of versions
Type: create 'emp', {NAME =>'personal', VERSIONS => 2}, {NAME => 'professional'}

2. Confirm it is created: list

3. To drop a table or change its settings (ALTER), you need to first disable the
table using the disable command. You can re-enable it using the enable
command

Note you can use JAVA code for HBase commands too
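As a sketch of that Java route (written against the HBase 0.98 client API that ships
with HDP 2.2; the class name is made up, and the table/values are the ones used in
these labs):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class EmpPut {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        HTable table = new HTable(conf, "emp");           // the 'emp' table created in this lab
        Put p = new Put(Bytes.toBytes("3"));              // row key '3' (hypothetical new row)
        p.add(Bytes.toBytes("personal"),                  // column family
              Bytes.toBytes("name"),                      // column qualifier
              Bytes.toBytes("ana"));                      // cell value
        table.put(p); // equivalent to: put 'emp','3','personal:name','ana'
        table.close();
    }
}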



Lab04: DESCRIBE and ALTER TABLE command

describe

This command returns the description of the table. Its syntax is as follows:

hbase> describe 'table name'

alter

Alter is the command used to make changes to an existing table. Using this command,
you can change the maximum number of cells of a column family, set and delete table
scope operators, and delete a column family from a table.

Given below is the syntax to change the maximum number of cells of a column family.

hbase> alter 't1', NAME => 'f1', VERSIONS => 5

In the following example, the maximum number of cells is set to 5.

hbase(main):003:0> alter 'emp', NAME => 'personal data', VERSIONS => 5


Updating all regions with the new schema...
0/1 regions updated.
1/1 regions updated.
Done.
0 row(s) in 2.3050 seconds



Lab04: DESCRIBE and ALTER command

1. describe
describe 'emp'

2. alter allows you to change a Column Family's schema. For example, you can add
or delete a Column Family from a table. Of course, first you must DISABLE the
table as mentioned in the previous slide
For example, the below would remove the Column Family 'professional'
from the EMPLOYEE table
Example: disable 'employee'
         alter 'employee', 'delete' => 'professional'
         enable 'employee'



Lab05: PUT and SCAN commands (Insert)

put

Put a cell 'value' at specified table/row/column and optionally timestamp coordinates.


To put a cell value into table 't1' at row 'r1' under column 'c1' marked with the time 'ts1',
do:

hbase> put 't1', 'r1', 'c1', 'value', ts1

scan

Scan a table; pass table name and optionally a dictionary of scanner specifications.
Scanner specifications may include one or more of the following: LIMIT, STARTROW,
STOPROW, TIMESTAMP, or COLUMNS. If no columns are specified, all columns will
be scanned. To scan all members of a column family, leave the qualifier empty as in
'col_family:'. Examples:

hbase> scan '.META.'
hbase> scan '.META.', {COLUMNS => 'info:regioninfo'}
hbase> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}

For experts, there is an additional option – CACHE_BLOCKS – which switches block
caching for the scanner on (true) or off (false). By default it is enabled. Example:

hbase> scan 't1', {COLUMNS => ['c1', 'c2'], CACHE_BLOCKS => false}



Lab05: PUT and SCAN command (Insert)

1. Using PUT command, you can insert cell value(s) into a table
Generic syntax:
put '<table name>', 'row<Int>', '<colfamily:colname>', '<value>', timestamp
Type following:
put 'emp','1','personal:name','juli'
put 'emp','1','personal:city','Portsmouth'
put 'emp','1','professional:job','sales'
put 'emp','1','professional:salary','99000'

2. Using scan, you can now read from the table. Type: scan 'emp'

Can optionally assign Timestamp



Lab06: PUT and SCAN commands (Update)

1. Using PUT command, you can update rows into a table


Generic syntax: put '<table name>', 'row<Int>', '<colfamily:colname>', '<value>'
Type following:
put 'emp', '1', 'personal:city', 'Cincinnati'

2. Using scan, you can now read from the table. Type: scan 'emp'

If you want to see both versions (original 'Portsmouth' and new 'Cincinnati') type:
scan 'emp', {VERSIONS => 2}     (must type exactly like this)



Lab07: SCAN command
scan

Scan a table; pass table name and optionally a dictionary of scanner specifications.
Scanner specifications may include one or more of the following: LIMIT, STARTROW,
STOPROW, TIMESTAMP, or COLUMNS. If no columns are specified, all columns will
be scanned. To scan all members of a column family, leave the qualifier empty as in
'col_family:'. Examples:

hbase> scan '.META.'
hbase> scan '.META.', {COLUMNS => 'info:regioninfo'}
hbase> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}

For experts, there is an additional option – CACHE_BLOCKS – which switches block
caching for the scanner on (true) or off (false). By default it is enabled. Example:

hbase> scan 't1', {COLUMNS => ['c1', 'c2'], CACHE_BLOCKS => false}



Lab07: SCAN command

1. Using SCAN command, pluck out just Column Family = PERSONAL


• Type following: scan 'emp', {COLUMNS => 'personal'}

Repeat, but select just 'name' column from Column Family = PERSONAL
• Type following: scan 'emp', {COLUMNS => 'personal:name'}



Lab08: GET command
get

Get row or cell contents; pass table name, row, and optionally a dictionary of column(s),
timestamp and versions. Examples:

hbase> get 't1', 'r1'
hbase> get 't1', 'r1', {COLUMN => 'c1'}
hbase> get 't1', 'r1', {COLUMN => ['c1', 'c2', 'c3']}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1, VERSIONS => 4}



Lab08: GET command (used to read back a single row)

1. Add another row to the 'emp' table using PUT (you must provide the Row Key).
   Notice 2 new Column names are added compared to the earlier PUT:

   put 'emp','2','personal:name','mark'
   put 'emp','2','personal:city','Sandusky'
   put 'emp','2','personal:gender','male'
   put 'emp','2','professional:job','IT'

2. Using GET command, select specific row. Type : get 'emp', '2'

Get row or cell contents; pass table name, rowkey, and optionally a
dictionary of column(s), timestamp and versions. Examples:
hbase> get 't1', 'r1'
hbase> get 't1', 'r1', {COLUMN => 'c1'}
hbase> get 't1', 'r1', {COLUMN => ['c1', 'c2', 'c3']}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1, VERSIONS => 4}



Lab08: Rows sorted automatically

1. Add another row to the 'emp' table using PUT


put 'emp','0','personal:city','Cleveland'

2. Using the SCAN command, note the Row Keys are automatically sorted: scan 'emp'



Lab09: DELETE command
delete

Put a delete cell value at specified table/row/column and optionally timestamp
coordinates. Deletes must match the deleted cell's coordinates exactly. When
scanning, a delete cell suppresses older versions. Takes arguments like the 'put'
command described earlier

deleteall

Delete all cells in a given row; pass a table name, row, and optionally a column and
timestamp



Lab09: DELETE command

1. Using DELETE command


Generic syntax: delete '<table>', '<row>', '<cf:column >', '<time (optional) >'
Type following: delete 'emp', '1', 'personal:city'
2. Confirm the cell is deleted using: scan 'emp'

To delete all cells in a row: deleteall '<table name>', '<row>'. You cannot UNDELETE



Compacting: HBase housekeeping

DELETE doesn't delete. It just creates a 'tombstone' so GET and SCAN will no
longer see these cells. Because HFiles are immutable, it's not until a major
compaction runs that these tombstones are reconciled and the space recovered

Compact all regions in a table, or pass a region name to compact an individual
region. You can also compact a single column family within a region.
Compact all regions in a table: hbase> compact 't1'
Compact an entire region: hbase> compact 'r1'
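To force the tombstone cleanup described above, run a major compaction (a sketch
using the same t1/r1 placeholder names as the compact examples above):
   Major compact all regions in a table:                hbase> major_compact 't1'
   Major compact an entire region:                      hbase> major_compact 'r1'
   Major compact a single column family in a region:    hbase> major_compact 'r1', 'c1'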



Lab10: Other HBase commands

count

Count the number of rows in a table. This operation may take a LONG time (Run
'$HADOOP_HOME/bin/hadoop jar hbase.jar rowcount' to run a counting mapreduce
job). Current count is shown every 1000 rows by default. Count interval may be
optionally specified. Examples:

hbase> count 't1'


hbase> count 't1', 100000

truncate

Disables, drops and recreates the specified table.



Lab10: Other HBase commands

1. count 'table' – Count the number of rows:                  count 'emp'

2. truncate 'table' – Disables, drops and recreates the table:   truncate 'emp1'

3. You can now exit out of HBase:                             exit



Table Design
The next few slides talk about table design. It is important to know your access
patterns and the data you wish to insert.



Table Design

Consider the following about the Table schema:

1. How many Column Families?


2. What data goes in what Column Family?
3. How many Columns should each Column Family have?
4. What should the Column Names be? (these are defined during loading)
5. What info goes in the Cells?
6. How many Versions should be stored for each Cell?
7. What should the Row Key contain?

Things to keep in mind:


• Define your Access Patterns (in other words, which queries are you
typically going to ask) to assist your Table design
• Remember Column Names can be treated as data, just like Cell values
• The Row Key is the single most important thing. You should model Keys
based on the expected access pattern



Table Design – Column Family

• An HBase table is made of column families which are the logical and
physical grouping of columns. The columns in one family are stored
separately from the columns in another family. If you have data that is not
often queried, assign that data to a separate column family

• The column family and column qualifier names are repeated for each row.
Therefore, keep the names as short as possible to reduce the amount of
data that HBase stores and reads. For example, use f:q instead of
mycolumnfamily:mycolumnqualifier

• Because column families are stored in separate HFiles, keep the number
of column families as small as possible. You also want to reduce the
number of column families to reduce the frequency of MemStore flushes,
and the frequency of compactions. And, by using the smallest number of
column families possible, you can improve the LOAD time and reduce
disk consumption



Table Design examples

Consider you wish to design a Table of Twitter followers and who they follow.
Here are some alternative designs to answer queries such as:
• Who does Jarrod follow?
• Does Jarrod follow Jeffrey?
• Who follows Jarrod?
• How many people does a User follow?

Design 1 – a wide table with Column Family 'follows', numbered Columns, and a
Count column:

RowKey    Columns (CF = follows)
Jarrod    1:Jeffrey   2:Larry   3:Curly   4:Moe   Count:4
Jeffrey   1:Rogers    2:Juli    Count:2

Design 2 – the user name becomes part of the Column name, with 1 as the Cell value
(Cells must have a value), so counting the Columns tells you how many users someone
follows:

RowKey    Columns (CF = follows)
Jarrod    Jeffrey:1   Larry:1   Curly:1   Moe:1
Jeffrey   Rogers:1    Juli:1



Table Design examples (cont'd)

Instead of a wide table, how about a tall table? You can also keep the CF name short
('f') to reduce the data transferred across the network:

RowKey           Columns (CF = f)
Jarrod+Jeffrey   Jeffrey:1
Jarrod+Larry     Larry:1
Jarrod+Curly     Curly:1
Jarrod+Moe       Moe:1
Jeffrey+Rogers   Rogers:1
Jeffrey+Juli     Juli:1

Now the question 'Does Jarrod follow Jeffrey?' can be answered by just
searching the Row Key
Answering the questions, 'Who does Jarrod follow?' and 'How many users
does Jarrod follow?' can also be accomplished by searching the Row Key only
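A minimal sketch of this tall design in the HBase shell (the 'follows' table name and
the users are hypothetical):

   create 'follows', 'f'
   put 'follows', 'Jarrod+Jeffrey', 'f:Jeffrey', '1'
   put 'follows', 'Jarrod+Larry', 'f:Larry', '1'
   put 'follows', 'Jeffrey+Rogers', 'f:Rogers', '1'

'Does Jarrod follow Jeffrey?' becomes a single-row get:

   get 'follows', 'Jarrod+Jeffrey'

'Who does Jarrod follow?' becomes a Row Key prefix scan; ',' is the next byte after '+',
so STARTROW/STOPROW bound exactly the 'Jarrod+' prefix:

   scan 'follows', {STARTROW => 'Jarrod+', STOPROW => 'Jarrod,'}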



Phoenix over HBase
Apache Phoenix is a relational database layer over HBase delivered as a client-
embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix
takes your SQL query, compiles it into a series of HBase scans, and orchestrates the
running of those scans to produce regular JDBC result sets. The table metadata is
stored in an HBase table and versioned, such that snapshot queries over prior versions
will automatically use the correct schema. Direct use of the HBase API, along with
coprocessors and custom filters, results in performance on the order of milliseconds for
small queries, or seconds for tens of millions of rows.



Phoenix over HBase

• What Hive is to MapReduce, Phoenix is to HBase. Simply stated, it is an
SQL-like interface for HBase. Its motto: "We put the SQL back in NoSQL"
• Phoenix turns HBase into a SQL database
• Query engine
• Metadata repository
• Embedded JDBC driver
• Compile queries into native HBase calls (No MapReduce)
• Phoenix commands include:
  • CREATE TABLE
  • CREATE VIEW
  • CREATE INDEX
  • SELECT
  • UPSERT VALUES
  • UPSERT SELECT
  • DELETE
  • JOIN
  • !quit

Example:

CREATE TABLE t1 (host VARCHAR, ts DATE,
  response_time INTEGER, gc_time INTEGER,
  cpu_time INTEGER, io_time INTEGER,
  CONSTRAINT pk PRIMARY KEY (host, ts))
  COMPRESSION='GZ', BLOCKSIZE='4096';

SELECT host, avg(response_time)
FROM t1 t JOIN v1
ON t1.host = v1.host
WHERE ts > CURRENT_DATE() - 7 AND v1.loc LIKE 'sf%'
ORDER BY t1.gc_time DESC
LIMIT 5;



Phoenix characteristics

• Apache Phoenix is a relational database layer over HBase delivered as a


client-embedded JDBC driver targeting low latency queries over HBase data.
Apache Phoenix takes your SQL query, compiles it into a series of HBase
scans, and orchestrates the running of those scans to produce regular JDBC
result sets
• A Phoenix table is created through the CREATE TABLE DDL command and
can either be:
1. Built from scratch, in which case the HBase table and column families
will be created automatically
2. Mapped to an existing HBase table, by creating either a read-write
TABLE or a read-only VIEW, with the caveat that the binary
representation of the row key and key values must match that of the
Phoenix data types (see the sketch below)
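A minimal sketch of case 2, mapping the 'emp' table from the HBase labs into Phoenix
as a read-only view (the quoted lower-case names must match the HBase table, column
family, and qualifier names exactly; the column list is an assumption based on the
columns inserted in Lab05):

CREATE VIEW "emp"
( pk VARCHAR PRIMARY KEY            -- maps to the HBase row key
, "personal"."name" VARCHAR
, "personal"."city" VARCHAR
, "professional"."job" VARCHAR );

SELECT "name", "city" FROM "emp";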



Lab11: Run Phoenix shell

1. Start Phoenix (open a new PuTTY prompt)


cd /usr/hdp/2.2.0.0-2041/phoenix/bin
./sqlline.py localhost:2181:/hbase-unsecure

To exit the Phoenix shell later, type: !exit



!tables command

!tables – View the tables known to Phoenix: !tables

Note you won't see the tables you manually created in HBase (i.e., emp)



Lab12: CREATE TABLE command

Copy and Paste the 2 statements below into Phoenix and execute …
CREATE TABLE department
 (dept_num      SMALLINT NOT NULL
 ,dept_name     CHAR(30)
 ,budget        DECIMAL(10,2)
 ,mgr_emp_num   INTEGER
 ,CONSTRAINT pk PRIMARY KEY (dept_num));

CREATE TABLE employee
 (emp_num       INTEGER NOT NULL
 ,mgr_emp_num   INTEGER
 ,dept_num      INTEGER
 ,job           INTEGER
 ,last_name     CHAR(20)
 ,first_name    VARCHAR(30)
 ,hire          VARCHAR(10)
 ,birth         VARCHAR(10)
 ,salary        DECIMAL(10,2)
 ,CONSTRAINT pk PRIMARY KEY (emp_num));

… then confirm they exist using the !tables command: !tables



Lab13: INSERT into tables

The psql command is invoked via psql.py in the Phoenix bin directory. In order to use it
to load CSV data, it is invoked by providing the connection information for your HBase
cluster, the name of the table to load data into, and the path to the CSV file or files. Note
that all CSV files to be loaded must have the '.csv' file extension (this is because
arbitrary SQL scripts with the '.sql' file extension can also be supplied on the PSQL
command line).

To load the example data outlined above into HBase running on the local machine, run
the following command:

bin/psql.py -t EXAMPLE localhost data.csv

The following parameters can be used for loading data with PSQL:

Parameter   Description
-t          Provide the name of the table in which to load data. By default, the name
            of the table is taken from the name of the CSV file. This parameter is
            case-sensitive
-h          Overrides the column names to which the CSV data maps; case-sensitive.
            A special value of in-line indicates that the first line of the CSV file
            determines the columns to which the data maps
-s          Run in strict mode, throwing an error on CSV parsing errors
-d          Supply a custom delimiter or delimiters for CSV parsing
-q          Supply a custom phrase delimiter; defaults to the double quote character
-e          Supply a custom escape character; default is a backslash
-a          Supply an array delimiter (explained in more detail below)

For higher-throughput loading distributed over the cluster, the MapReduce loader can
be used. This loader first converts all data into HFiles, and then provides the created
HFiles to HBase after the HFile creation is complete.

The MapReduce loader is launched using the hadoop command with the Phoenix client
jar, as follows:

hadoop jar phoenix-3.0.0-incubating-client.jar \
  org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv



Lab13: INSERT into tables

UPSERT INTO department VALUES (401,'customer support',982300.00,1003);


UPSERT INTO department VALUES (201,'technical operations',293800.00,1025);
UPSERT INTO department VALUES (301,'research and development',465600.00,1019);
UPSERT INTO department VALUES (302,'product planning',226000.00,1016);
UPSERT INTO department VALUES (403,'education',932000.00,1005);
UPSERT INTO department VALUES (402,'software support',308000.00,1011);
UPSERT INTO department VALUES (501,'marketing sales',308000.0,1017);
UPSERT INTO department VALUES (100,'president',400000.00,0801);
UPSERT INTO department VALUES (600,'None',NULL,1099);

Or use the BULK loader utility psql.py. (For simplicity, open a new PuTTY prompt;
you must be at the [root@sandbox bin]# prompt and not the ../hbase-unsecure> prompt.)
Note the table name EMPLOYEE must be in upper case.

cd /usr/hdp/2.2.0.0-2041/phoenix/bin
./psql.py -t EMPLOYEE localhost:2181:/hbase-unsecure /usr/hdp/2.2.0.0-2041/phoenix/doc/examples/employee.csv



Lab13: Confirm tables populated

Ensure you are back at the Phoenix shell and enter SELECT commands for
the 2 tables
SELECT * from employee;
SELECT * from department;



Lab14: Joins

Execute a JOIN (here's something HBase can't do, but Phoenix can):

SELECT e.last_name, d.dept_name
FROM employee e JOIN department d
ON e.dept_num = d.dept_num;



Lab14: Joins with Aggregates

Execute a JOIN with Aggregation


SELECT d.dept_name, sum(e.salary) as sumsal
FROM employee e JOIN department d
ON e.dept_num=d.dept_num
GROUP BY d.dept_name;

department_name SumSal
------------------------------ ------------
education 233000.00
research and development 116400.00
customer support 245575.00
marketing sales 200125.00



In Review – HBase and Phoenix

After completing this module, the student should be able to
execute both HBase and Phoenix code and be able to describe:
• HBase overview
• Data Model
• Architecture
• Hands-on labs
• Design considerations
• Running Phoenix over HBase
• Hands-on labs

• If you need random, real-time read/write access, HBase is for you

• If you need OLTP capability, NoSQL databases like HBase are a better
choice than MapReduce, which typically can only do full table scans
• If you have structured data, then use Phoenix on top of HBase rather
than the raw HBase shell

