Course Topics
Week 1 – Introduction to HDFS
Week 2 – Setting Up a Hadoop Cluster
Week 3 – MapReduce Basics, Types and Formats
Week 4 – PIG
Week 5 – HIVE
Week 6 – HBASE
Week 7 – ZOOKEEPER
Week 8 – SQOOP
What are we going to learn today?
• Problems in the real world
• Traditional RDBMS fallacies
• The advent of HBase
• HBase Architecture
• Hands-on creation and updating of an HBase table on the shell
• Multiple ways of loading data into HBase (shell, Java client, MapReduce, Avro, Thrift, REST API)
Problems in the Real World
LinkedIn
Revolutionizing education
Ad targeting
So, what is common?
• Huge Data
• Fast Random access
• Structured Data
• Variable Schema
• Need for compression
• Need for distribution (sharding)
How a Traditional RDBMS Would Solve It

Users table: Id, Name, Sex, Age
Followers table: User_id, Follower_id, Type

Contd.

Users table: Id, Name, Sex, Age
Connections table: User_id, Connection_id, Type
Characteristics of a Probable Solution
• Distributed database
• Sorted data
• Sparse data store
• Automatic sharding
History of HBase
2006 – Google publishes the BigTable paper
2006 – HBase development starts
2008 – Microsoft buys Powerset
2010 – Facebook adopts HBase for its messaging system
Facebook Messaging System
• Facebook monitored their usage and figured out what they really needed.
• What they needed was a system that could handle two types of data patterns:
  – A short set of temporal data that tends to be volatile
  – An ever-growing set of data that rarely gets accessed
real-time, distributed, linearly scalable, robust, BigData, open-source, key-value, column-oriented
HBase Definition
HBase is a key/value store. Specifically, it is a sparse, consistent, distributed, multidimensional, sorted map.
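As a rough illustration (plain Java, not the HBase API), that definition can be modeled as nested sorted maps: row key, column family, column qualifier, and timestamp together address a value, and missing cells simply do not exist.

  import java.util.Comparator;
  import java.util.NavigableMap;
  import java.util.TreeMap;

  // Illustrative model only: HBase as a sorted, multidimensional map.
  // row key -> column family -> column qualifier -> timestamp -> value
  public class SortedMultiMapSketch {
      NavigableMap<String, NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>>> table =
          new TreeMap<>(); // every level is kept sorted

      void put(String row, String family, String qualifier, long ts, String value) {
          table.computeIfAbsent(row, r -> new TreeMap<>())
               .computeIfAbsent(family, f -> new TreeMap<>())
               // timestamps sorted descending, so the newest version comes first
               .computeIfAbsent(qualifier, q -> new TreeMap<>(Comparator.reverseOrder()))
               .put(ts, value);
          // Cells that are never written take no space at all: "sparse".
      }
  }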
More HBase Implementations
• Facebook uses HBase to power their Messages (http://sites.computer.org/debull/A12june/facebook.pdf).
• A number of applications, including infrastructure and people search, rely on HBase internally for data generation.
• "We use HBase as a real-time data storage and analytics platform."
• One adopter uses HBase to store document fingerprints for detecting near-duplicates, on a cluster of a few nodes running HDFS, MapReduce, and HBase.
• Another uses an HBase cluster containing over a billion anonymized clinical records.
• Another uses HBase as a foundation for cloud-scale storage for a variety of applications.
Referred – http://wiki.apache.org/hadoop/Hbase/PoweredBy
Data Model

Versions of Data

Row key       | Personal_data          | Demographic
(Person's ID) | Name    | Address      | Birth Date | Gender
1             | Harry   | BTM Layout   | 1988-10-31 | M
2             | Dhawan  |              | 1956-09-16 | M
3             | Sana    | Whitefield   | 1989-12-03 | F
…             | …       | …            | …          | …
500,000,000   | Vineet  | Delhi        | 1964-01-07 | M
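As a hedged sketch of how this table could be declared (names taken from the slide; the classic HBaseAdmin API of the HBase 0.9x era is assumed):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class CreatePersonsTable {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HBaseAdmin admin = new HBaseAdmin(conf);

          // One table, two column families, as in the slide's data model.
          HTableDescriptor desc = new HTableDescriptor("Persons");
          HColumnDescriptor personal = new HColumnDescriptor("Personal_data");
          personal.setMaxVersions(3); // keep several versions of each cell
          desc.addFamily(personal);
          desc.addFamily(new HColumnDescriptor("Demographic"));

          admin.createTable(desc);
          admin.close();
      }
  }

Note that only the column families are fixed up front; individual columns (qualifiers) are created on the fly by each write.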
Physical Storage

[Figure: the logical table is stored physically as sorted key-value pairs, grouped by column family.
Family1 (Personal data) – Row 1: Col1 (Name) -> "H. Houdini", Col1 (Address) -> "Budapest"; Row 2: Col5 (Address) -> "D. Copper"
Family2 (Demographic) – Row 1: Col3 (Birth date) -> "1926-10-31", Col3 (Gender) -> "M"; Row 2: Col3 (Birth date) -> val3; Row 3: Col4 (Gender) -> val4
Cells with no value are simply not stored, which is what makes the store sparse.]
What Does It Look Like?

What It Means

Row Key
• Unique for each row
• Identifies each row

Column Family / Column Qualifier
• Fewer families give faster access
• Families are fixed; column qualifiers are not

Values
• Various versions of values are maintained
• A scan shows only the most recent version
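A hedged sketch of how versions surface in reads (classic HTable client assumed, against the illustrative 'Persons' table above): a plain get returns only the newest version unless more are requested.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class VersionedGet {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HTable table = new HTable(conf, "Persons"); // illustrative table name

          Get get = new Get(Bytes.toBytes("1"));
          get.addColumn(Bytes.toBytes("Personal_data"), Bytes.toBytes("Address"));
          get.setMaxVersions(3); // by default only the latest version is returned

          Result result = table.get(get);
          // getValue() yields the newest version; older ones are in result.getMap()
          System.out.println(Bytes.toString(
              result.getValue(Bytes.toBytes("Personal_data"), Bytes.toBytes("Address"))));
          table.close();
      }
  }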
Three Major Components
Data Distribution

[Figure: logical view – all rows of the table (A1, A2, A22, A3, …, K4, …, 090, …, Z30, Z55) are kept sorted by row key and partitioned into regions by key range:
Region: Null -> A3
Region: A3 -> F34
Region: F34 -> K80
Region: K80 -> 095
Region: 095 -> Null
The regions are spread across the region servers.]
HBase Components

[Figure: the Master coordinates the cluster through ZooKeeper, which tracks region state (/hbase/region1, /hbase/region2, …). RegionServers buffer recent writes in an in-memory MemStore and persist data to HDFS as HFiles, protected by a write-ahead log (WAL).]
HBase Components
• A table is made of regions
• Region – a range of rows stored together
  - a single shard, the unit used for scaling
  - dynamically split if too big
  - merged if too small
• Region servers – each serves one or more regions
  - a region is served by only one region server
• Master server – the daemon responsible for managing the HBase cluster
• HBase stores its data in HDFS
  - relies on HDFS's high availability and fault tolerance
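Because region locations are resolved through ZooKeeper, a client only needs the quorum address to find everything else. A minimal sketch (the quorum hostnames are made up; classic HTable client assumed):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ConnectToCluster {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          // Hypothetical ZooKeeper quorum; the client discovers the master
          // and region servers from here, never from a hard-coded server list.
          conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");

          HTable table = new HTable(conf, "test");
          System.out.println("Connected to table: " + Bytes.toString(table.getTableName()));
          table.close();
      }
  }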
HBase Storage Architecture
HBase Storage, Simplified

[Figure: three management nodes, each running ZooKeeper – one also hosting the HBase Master, one the HDFS NameNode, and one the HDFS Secondary NameNode. Worker nodes each run an HBase RegionServer and an HDFS DataNode, and scale horizontally to N machines.]
Different Types of Regions

Root/Meta Table
• The ROOT table tracks the regions of the META table; META tracks the regions of all user tables.
• Each row in the ROOT and META tables is approximately 1 KB in size. At the default region size of 256 MB, one META region can therefore address about 2^18 (roughly 262,000) user regions.
Compactions

Row key            Time Stamp   Column "contents:"   Column "anchor:"
"com.apache.www"   t12          "<html>…"
                   t11          "<html>…"
                   t10                               "anchor:apache.com" -> "APACHE"
"com.cnn.www"      t9                                "anchor:cnnsi.com" -> "CNN"
                   t8                                "anchor:my.look.ca" -> "CNN.com"
                   t6           "<html>…"
                   t5           "<html>…"
                   t3           "<html>…"
(HStore1 – the store files for one column family; compactions merge many small HFiles into fewer, larger ones.)
Region Splits

(The same webtable example as above: when the store grows too large, the region is split at a row-key boundary – here between "com.apache.www" and "com.cnn.www" – into two daughter regions.)
HBase Client API
Scanner and Filters
Search: Get

Get value from table where key='com.apache.www' AND label='anchor:apache.com'

Row key            Time Stamp   Column "anchor:"
"com.apache.www"   t12
                   t11
                   t10          "anchor:apache.com" -> "APACHE"
"com.cnn.www"      t9           "anchor:cnnsi.com" -> "CNN"
                   t8           "anchor:my.look.ca" -> "CNN.com"
                   t6
                   t5
                   t3
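A hedged sketch of this lookup with the classic Java client (the table name 'webtable' is an assumption; the deck does not name it):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class GetExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HTable table = new HTable(conf, "webtable"); // assumed table name

          // key = 'com.apache.www' AND label = 'anchor:apache.com'
          Get get = new Get(Bytes.toBytes("com.apache.www"));
          get.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("apache.com"));

          Result result = table.get(get);
          System.out.println(Bytes.toString(
              result.getValue(Bytes.toBytes("anchor"), Bytes.toBytes("apache.com")))); // "APACHE"
          table.close();
      }
  }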
Search: Scanner

Select value from table where anchor='cnnsi.com'

Row key            Time Stamp   Column "anchor:"
"com.apache.www"   t12
                   t11
                   t10          "anchor:apache.com" -> "APACHE"
"com.cnn.www"      t9           "anchor:cnnsi.com" -> "CNN"
                   t8           "anchor:my.look.ca" -> "CNN.com"
                   t6
                   t5
                   t3
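A hedged sketch of the same query as a scan (again assuming a table named 'webtable'): restricting the scan to the anchor:cnnsi.com column returns only rows that have that cell.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ScanExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HTable table = new HTable(conf, "webtable"); // assumed table name

          // Scan all rows, returning only the anchor:cnnsi.com column.
          Scan scan = new Scan();
          scan.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"));

          ResultScanner scanner = table.getScanner(scan);
          for (Result row : scanner) {
              System.out.println(Bytes.toString(row.getRow()) + " -> " + Bytes.toString(
                  row.getValue(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"))));
          }
          scanner.close();
          table.close();
      }
  }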
HBase API
• get(row)
• put(row, Map<column, value>)
• scan(key range, filter)
• increment(row, columns)
• checkAndPut, delete, etc.
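A hedged sketch exercising these operations with the classic HTable client (row, family, and qualifier names are made up to match the shell example on the next slide):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Delete;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ApiExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HTable table = new HTable(conf, "test");
          byte[] row = Bytes.toBytes("row1");
          byte[] cf = Bytes.toBytes("cf");

          // put(row, column -> value)
          Put put = new Put(row);
          put.add(cf, Bytes.toBytes("a"), Bytes.toBytes("value1"));
          table.put(put);

          // Atomic counter increment on one column.
          table.incrementColumnValue(row, cf, Bytes.toBytes("hits"), 1L);

          // checkAndPut: write only if the current value still matches.
          Put update = new Put(row);
          update.add(cf, Bytes.toBytes("a"), Bytes.toBytes("value2"));
          table.checkAndPut(row, cf, Bytes.toBytes("a"), Bytes.toBytes("value1"), update);

          // delete the whole row
          table.delete(new Delete(row));
          table.close();
      }
  }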
HBase Shell
• hbase(main):003:0> create 'test', 'cf'
0 row(s) in 1.2200 seconds
• hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.0560 seconds
• hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0370 seconds
• hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0450 seconds
HBase Shell Contd.
• hbase(main):007:0> scan 'test'
ROW     COLUMN+CELL
 row1   column=cf:a, timestamp=1288380727188, value=value1
 row2   column=cf:b, timestamp=1288380738440, value=value2
 row3   column=cf:c, timestamp=1288380747365, value=value3
3 row(s) in 0.0590 seconds
Thank You
See You in Class Next Week