ADO Lecture II 2024-26

Uploaded by

thehorizon2026
Advance Data Organization

Lecture II
MBA(DSDA) 2024-26, SCIT
ADO

Journey From
RDBMS to NoSQL (I)
ADO
• BigTable (Google)
• Amazon DynamoDB
• HBase (Apache)
• Cassandra (Facebook)
BigTable
Column Store
A column store database can also be referred to as :
• Column database
• Column family database
• Column oriented database
• Wide column store database
• Wide column store
• Columnar database
• Columnar store
Column Store
A Schema in RDBMS
Column Store
• Column store databases use a concept called
a keyspace. A keyspace is roughly equivalent to a
schema in the relational model. The keyspace
contains all the column families (roughly equivalent
to tables in the relational model), which contain
rows, which in turn contain columns.
• For example, a keyspace can have column
families AuthorProfile, MemberProfile, Article,
Blog, and Question.
Column Store Keyspace
Column Store
• A column family consists of multiple rows.
• Each row can contain a different number of columns
from the other rows, and its columns do not have to
match the columns in the other rows (i.e. they can
have different column names, data types, etc.).
• Each column is confined to its row. It does not span
all rows as in a relational database. Each column
contains a name/value pair, along with a timestamp.
Note that the examples here use Unix/Epoch time for
the timestamp.
Column Store
Each Row
Column Store
• Row Key. Each row has a unique key, which is a
unique identifier for that row.
• Column. Each column contains a name, a value,
and timestamp.
• Name. This is the name of the name/value pair.
• Value. This is the value of the name/value pair.
• Timestamp. This provides the date and time that
the data was inserted. This can be used to
determine the most recent version of data.
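The row structure described above can be sketched in a few lines of Python. This is only an illustrative model, not the API of any real column store: the class names `Column` and `Row`, the `put`/`get` methods, and the sample values are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Column:
    """One name/value pair plus the time it was written."""
    name: str
    value: str
    timestamp: int  # Unix/Epoch time, as in the slides


@dataclass
class Row:
    """A row key plus whatever columns this particular row happens to have."""
    key: str
    columns: Dict[str, Column] = field(default_factory=dict)

    def put(self, name: str, value: str, timestamp: int) -> None:
        current = self.columns.get(name)
        # The timestamp decides which version is the most recent one.
        if current is None or timestamp >= current.timestamp:
            self.columns[name] = Column(name, value, timestamp)

    def get(self, name: str) -> Optional[str]:
        column = self.columns.get(name)
        return column.value if column else None


# A column family is a collection of rows; rows need not share column names.
contact: Dict[str, Row] = {key: Row(key) for key in ("1234A", "1234B")}
contact["1234A"].put("Cell-Phone", "9867", 123456789)
contact["1234B"].put("Twitter", "#bigtable", 123456710)
```

Row 1234A has no Twitter column at all, so `contact["1234A"].get("Twitter")` returns `None` rather than an empty placeholder value.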
Column Store
Each Row
Column Store
Keyspace
Column Store
KeySpace
Column Store
Super Column
StudentDetails = {
  Student1: {
    username: {firstname: "Abc", lastname: "wxy"},
    address: {city: "Hinjewadi", postcode: "411057"}
  },
  Student2: {
    username: {firstname: "Def", lastname: "xyz"},
    account: {bank: "SBI", accounted: "2212340005"}
  },

  Studentk: {
    username: {firstname: "Def", lastname: "xyz"},
    account: {bank: "SBI", accounted: "2212340005"},
    marksheet: {sub1: 40, sub2: 98, ………, subN: 75}
  }
}
Column Store
How to write the following for Column Store
Database?
Contact
Rowkey: 1234A | ContactID: 1X2B (ts: 123456780) | Cell-Phone: 9867 (ts: 123456789) | Email1: x@abc (ts: 123456790)
Rowkey: 1234B | ContactID: 2X2B (ts: 123456690) | Home-Phone: 1234 (ts: 123456700) | Twitter: #bigtable (ts: 123456710)
Rowkey: 1234C | ContactID: 3X3Y (ts: 123456890) | Home-Phone: 3456 (ts: 123456900) | Cell-Phone: 9845 (ts: 123456910) | Email2: y@wqa (ts: 123456920) | Facebook: [email protected] (ts: 123456310) | Twitter: #hadoop (ts: 123456940)
Column Store
Column Store format
Contact
Rowkey: 1234A
  ContactID: 1X2B (ts: 123456780)
  Cell-Phone: 9867 (ts: 123456789)
  Email1: x@abc (ts: 123456790)
Rowkey: 1234B
  ContactID: 2X2B (ts: 123456690)
  Home-Phone: 1234 (ts: 123456700)
  Twitter: #bigtable (ts: 123456710)
Rowkey: 1234C
  ContactID: 3X3Y (ts: 123456890)
  Home-Phone: 3456 (ts: 123456900)
  Cell-Phone: 9845 (ts: 123456910)
  Email2: y@wqa (ts: 123456920)
  Facebook: [email protected] (ts: 123456310)
  Twitter: #hadoop (ts: 123456940)
BigTable
Bigtable
• Bigtable development began around late 2003 to
early 2004.
• Bigtable was an in-house development designed to run
on commodity hardware.
• Bigtable allows Google to have a very small incremental
cost for new services and expanded computing power.
• Bigtable is built on top of Google's other services,
specifically GFS, Scheduler, Lock Service, and MapReduce.
BigTable
Bigtable
• Bigtable is a compressed, high-performance,
proprietary data storage system built on Google
File System, Chubby Lock Service, SSTable
(log-structured storage like LevelDB), and a few
other Google technologies.
• On May 6, 2015, a public version of Bigtable was
made available as a service (Cloud Bigtable).
BigTable
Bigtable
• Sparse
• Distributed
• Persistent multidimensional
• Sorted Map
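The four properties above can be made concrete with a toy model. The sketch below covers only the "sparse", "multidimensional", and "sorted map" parts; distribution and persistence are not modeled. The class `TinyTable` and its methods are hypothetical names chosen for this example and are not part of any Bigtable client library.

```python
from typing import Dict, List, Optional, Tuple

CellKey = Tuple[str, str, int]  # (row key, column key, timestamp)


class TinyTable:
    """Toy map from (row, column, timestamp) to a value."""

    def __init__(self) -> None:
        # Sparse: only cells that were actually written take up space.
        self._cells: Dict[CellKey, str] = {}

    def put(self, row: str, column: str, timestamp: int, value: str) -> None:
        self._cells[(row, column, timestamp)] = value

    def get(self, row: str, column: str) -> Optional[str]:
        """Return the newest version of a cell, or None if it was never written."""
        versions = [(ts, value) for (r, c, ts), value in self._cells.items()
                    if r == row and c == column]
        return max(versions)[1] if versions else None

    def scan(self) -> List[Tuple[CellKey, str]]:
        """Cells ordered by row key, then column key, newest timestamp first."""
        return sorted(self._cells.items(),
                      key=lambda item: (item[0][0], item[0][1], -item[0][2]))


table = TinyTable()
table.put("com.google.maps/index.html", "contents:", 1, "<html>v1")
table.put("com.google.maps/index.html", "contents:", 2, "<html>v2")
table.put("com.google.maps/index.html", "language:", 1, "en")
```

Each of the three dimensions (row, column, time) is part of the key, which is why the same row and column can hold several versions at once.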
BigTable
Bigtable
• Distributed storage system for managing
structured data
• Designed to scale to a very large size:
petabytes of data across thousands of servers
BigTable
Bigtable has achieved several goals:
• Wide applicability
• Scalability
• High performance, and
• High availability
BigTable
Bigtable is used by more than 60 Google products and projects, including:
• Google Analytics
• Google Finance
• Personalized Search
• My Search History
• Google Maps
• Google Earth
• Blogger.com
• YouTube
• Gmail
• Orkut, etc.
BigTable
Bigtable
• Offers scalability and better control of performance
characteristics
• Caters to a variety of demanding workloads
• Google's Spanner RDBMS is layered on an
implementation of Bigtable
BigTable
Bigtable
• Bigtable is a distributed storage system for
managing structured data that is designed to scale
to a very large size: petabytes of data across
thousands of commodity servers.
• Bigtable does not support a full relational data
model; instead, it gives clients a simple data model
with dynamic control over data layout and format.
• Each row is indexed by a single row key.
• Data is indexed using row and column names that
can be arbitrary strings.
BigTable
Bigtable
Indexed by three dimensions:

Row name, column (grouped into column families), and timestamp
BigTable
Bigtable

• Bigtable tables are sparse: if a column is not used in a particular
row, it does not take up any space.
• Each version of a cell in a given row and column is identified by a
unique timestamp (t).
BigTable
Bigtable : Rows
• The row keys in a table are arbitrary strings
(currently up to 64 KB in size, although 10-100
bytes is a typical size for most users).
• Every read or write of data under a single row
key is atomic (regardless of the number of
different columns being read or written in the
row).
BigTable
Bigtable : Rows
• Bigtable maintains data in lexicographic order by row key.
• The row range for a table is dynamically partitioned into
blocks of contiguous rows, called tablets.
• The tablet is the unit of distribution and load balancing,
which helps balance the workload of queries.
• Tablets are around 100-200 MB each, and each machine
stores about 100 of them.
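A rough sketch of how sorted row keys can be cut into contiguous ranges and how a lookup finds the right range is given below. It is a simplification under stated assumptions: real tablets are split by size (around 100-200 MB), whereas this example splits by a fixed number of rows. The function names and the sample row keys are made up for illustration.

```python
from bisect import bisect_right
from typing import List


def split_into_tablets(row_keys: List[str], rows_per_tablet: int) -> List[List[str]]:
    """Sort the row keys and cut them into blocks of contiguous rows."""
    ordered = sorted(row_keys)  # string comparison gives lexicographic order
    return [ordered[i:i + rows_per_tablet]
            for i in range(0, len(ordered), rows_per_tablet)]


def find_tablet(tablets: List[List[str]], row_key: str) -> int:
    """Return the index of the tablet whose row range covers row_key."""
    start_keys = [tablet[0] for tablet in tablets]
    # The covering tablet is the last one whose first key is <= row_key.
    return max(bisect_right(start_keys, row_key) - 1, 0)


keys = ["com.google.www", "org.apache.hbase", "com.cnn.www",
        "com.google.maps/index.html"]
tablets = split_into_tablets(keys, rows_per_tablet=2)
```

Because each tablet holds a contiguous key range, a read of a short row range usually touches only one or two tablets.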
BigTable
Bigtable : Rows
• Reads of short row ranges are efficient and
typically require communication with only a
small number of machines.
BigTable
Bigtable : Column Families
• Column keys are grouped into sets called column
families.
• All data stored in a column family is usually of the
same type (data in the same column family is
compressed together).
• A column family must be created before data can be
stored under any column key in that family.
• After a family has been created, any column key
within the family can be used.
BigTable
Bigtable : Column Families
• The number of distinct column families in a table is intended to
be small (in the hundreds at most), and families should rarely
change during operation.
• In contrast, a table may have an unbounded number of columns.
• Each column is identified by a combination of the column family
and a column qualifier, which is a unique name within the column
family.
• A column key is named using the following syntax:
family:qualifier
Column family names must be printable, but qualifiers may be
arbitrary strings.
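The family:qualifier syntax can be illustrated with a small parser. This is a sketch only; the function name `parse_column_key` is hypothetical, and the check for "printable" uses Python's `str.isprintable`, which is an assumption about what printable means here.

```python
from typing import Tuple


def parse_column_key(column_key: str) -> Tuple[str, str]:
    """Split 'family:qualifier' at the first colon."""
    family, separator, qualifier = column_key.partition(":")
    if not separator:
        raise ValueError("column key must contain ':' after the family name")
    if not family or not family.isprintable():
        raise ValueError("column family name must be non-empty and printable")
    # The qualifier may be any string, including an empty one.
    return family, qualifier
```

For example, `parse_column_key("anchor:cnnsi.com")` gives the family `anchor` and the qualifier `cnnsi.com`, while `parse_column_key("language:")` gives the family `language` with an empty qualifier.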
BigTable
Bigtable : Column Families
• Access control and both disk and memory
accounting are performed at the column-
family level.
BigTable
Bigtable : Timestamps
• Each cell in a Bigtable can contain multiple
versions of the same data.
• These versions are indexed by timestamp.
• Bigtable timestamps are 64-bit integers
BigTable
Bigtable : Timestamps
• Bigtable supports two per-column-family settings
that tell it to garbage-collect cell versions
automatically.
• The client can specify either that only the last
n versions of a cell be kept, or that only
new-enough versions be kept (e.g., only keep
values that were written in the last seven
days).
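The two garbage-collection policies can be sketched as two small functions over the versions of a single cell. The function names are hypothetical, and the timestamps are assumed to be in seconds purely to keep the numbers readable.

```python
from typing import Dict


def keep_last_n(versions: Dict[int, str], n: int) -> Dict[int, str]:
    """Keep only the n most recent versions (keys are timestamps)."""
    newest_first = sorted(versions, reverse=True)
    return {ts: versions[ts] for ts in newest_first[:n]}


def keep_new_enough(versions: Dict[int, str], now: int, max_age: int) -> Dict[int, str]:
    """Keep only versions written within max_age of now."""
    return {ts: value for ts, value in versions.items() if now - ts <= max_age}


SEVEN_DAYS = 7 * 24 * 60 * 60  # assumption: timestamps are in seconds
versions = {1_000_000: "old", 1_600_000: "newer", 1_700_000: "newest"}
```

With these sample values, `keep_last_n(versions, 1)` retains only the version written at 1_700_000, and `keep_new_enough(versions, now=1_700_100, max_age=SEVEN_DAYS)` drops the version written at 1_000_000 because it is more than seven days old.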
BigTable
Bigtable – Example
• Suppose we need to keep a copy of a large
collection of web pages and related information
that could be used by many different projects.
Let us call this particular table the Webtable.
BigTable
Bigtable – Webtable
BigTable
Bigtable : Rows
• Because rows are kept in lexicographic order, clients can
select their row keys so that they get good locality for their
data accesses.
• For example, in Webtable, pages in the same domain are
grouped together into contiguous rows by reversing the
hostname components of the URLs.
• Data for maps.google.com/index.html is stored under the
key com.google.maps/index.html.
• Storing pages from the same domain near each other
makes some host and domain analyses more efficient.
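The hostname reversal can be sketched as follows. The function name `webtable_row_key` is made up for this example, and it assumes the URL is given without a scheme (no `http://` prefix), as in the slide.

```python
def webtable_row_key(url: str) -> str:
    """Reverse the hostname components so pages of one domain sort together."""
    host, slash, path = url.partition("/")
    reversed_host = ".".join(reversed(host.split(".")))
    return reversed_host + slash + path


urls = ["maps.google.com/index.html",
        "www.cnn.com/index.html",
        "www.google.com/index.html"]
row_keys = sorted(webtable_row_key(url) for url in urls)
```

After sorting, the two google.com pages sit next to each other, which would not be the case if the original URLs were used as row keys.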
BigTable
Bigtable : Column Families
• The language family stores the language in which a web page was
written. Only one column key, with an empty qualifier, is used in
the language family to store each web page's language ID:
language: : en1
• The crawled pages are stored in the contents: column, and their
timestamps are set to the times at which these page versions were
actually crawled:
contents: : maps.google.com/index.html
• The anchor column family contains the text of any anchors that
reference the page:
anchor:cnnsi.com: "CNN"
anchor:my.look.ca: "CNN.com"
BigTable
How to write the following for BigTable?
(A timestamp is included with each column value.)

Rowkey: 1234 | ContactID: :1X2B | Phone:Cell:9867 | Email:1:x@abc
Rowkey: 12345 | ContactID: :2X2B | Phone:Home:1234 | Social:Twitter:#bigtable
Rowkey: 23451 | ContactID: :3X3Y | Phone:Home:3456 | Phone:Cell:9845 | Email:2:y@wqa | Social:Facebook:[email protected] | Social:Twitter:#hadoop
BigTable
How to write the following for BigTable?
(A timestamp is included with each column value.)

Row:"1234" | ContactID: :"1X2B" | Phone:Cell:"9867" | Email:1:"x@abc"
Row: 12345 | ContactID: :2X2B | Phone:Home:1234 | Social:Twitter:#bigtable
Row: 23451 | ContactID: :3X3Y | Phone:Home:3456 | Phone:Cell:9845 | Email:2:y@wqa | Social:Facebook:[email protected] | Social:Twitter:#hadoop

The first row, Row:"1234", is written as one cell per column, each with its own timestamp:

(Row:"1234", ContactID: :"1X2B", time:100001)
(Row:"1234", Phone:Cell:"9867", time:100002)
(Row:"1234", Email:1:"x@abc", time:100003)
BigTable
Bigtable : Building Blocks
• Bigtable uses the distributed Google File
System (GFS) to store log and data files.
• The Google SSTable (Sorted String Table) file
format, a form of log-structured storage, is used
internally to store Bigtable data.
• An SSTable provides a persistent, ordered,
immutable map from keys to values, where
both keys and values are arbitrary byte strings.
BigTable
Bigtable : Building Blocks
• A "Sorted String Table" is a file that
contains a set of arbitrary key-value pairs,
sorted by key.
• Reading in the entire file sequentially gives
you a sorted index.
BigTable
Bigtable : Building Blocks
• Optionally, if the file is very large, we can also
prepend a key:offset index, or create one as a
standalone file, for fast access.

An SSTable is very simple, but it is also a very useful way to exchange large, sorted
data segments.
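A minimal sketch of a sorted key-value file with a key:offset index follows. The on-disk layout used here (one tab-separated pair per line, held in a bytes buffer) is an assumption chosen for readability and is not the real SSTable format; keys and values are assumed to contain no tab or newline characters.

```python
import io
from typing import Dict, Optional, Tuple


def write_sstable(pairs: Dict[str, str]) -> Tuple[bytes, Dict[str, int]]:
    """Write pairs in key order and record the byte offset of each key."""
    buffer = io.BytesIO()
    index: Dict[str, int] = {}
    for key in sorted(pairs):
        index[key] = buffer.tell()
        buffer.write(f"{key}\t{pairs[key]}\n".encode("utf-8"))
    return buffer.getvalue(), index


def read_value(data: bytes, index: Dict[str, int], key: str) -> Optional[str]:
    """Jump straight to the key's offset instead of scanning the whole file."""
    offset = index.get(key)
    if offset is None:
        return None
    end = data.index(b"\n", offset)
    _stored_key, value = data[offset:end].decode("utf-8").split("\t", 1)
    return value


data, index = write_sstable({"b": "2", "a": "1"})
```

The index is small compared with the data, so it can stay in memory while the pairs themselves remain in the file.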
BigTable
Bigtable : Building Blocks
• Random writes are fast when the table is held in
memory (as a MemTable).
• If the table is immutable, then an on-disk
SSTable is also fast to read from.
BigTable
Bigtable : Building Blocks
1. On-disk SSTable indexes are always loaded into
memory
2. All writes go directly to the MemTable index
3. Reads check the MemTable first and then the
SSTable indexes
4. Periodically, the MemTable is flushed to disk as an
SSTable
5. Periodically, on-disk SSTables are "collapsed
together"
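The five steps above can be sketched with plain dictionaries standing in for the MemTable and the on-disk SSTables. The class `TinyLSM`, its method names, and the flush threshold are assumptions for illustration; disk I/O is not modeled.

```python
from typing import Dict, List, Optional


class TinyLSM:
    """Toy MemTable plus a list of immutable, sorted SSTables."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.memtable: Dict[str, str] = {}
        self.sstables: List[Dict[str, str]] = []  # oldest first, newest last
        self.memtable_limit = memtable_limit

    def put(self, key: str, value: str) -> None:
        # Step 2: all writes go to the MemTable.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        # Step 4: the MemTable is written out as a sorted SSTable.
        if self.memtable:
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key: str) -> Optional[str]:
        # Step 3: check the MemTable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None

    def compact(self) -> None:
        # Step 5: collapse the SSTables together; newer values win.
        merged: Dict[str, str] = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [dict(sorted(merged.items()))] if merged else []


db = TinyLSM(memtable_limit=2)
db.put("a", "1")
db.put("b", "2")   # reaches the limit, so the MemTable is flushed
db.put("a", "3")   # newer value for "a" lives in the MemTable
```

Reads see the newest value because the MemTable and the most recent SSTables are consulted before older ones.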
BigTable
Bigtable : Building Blocks
• Bigtable relies on a highly-available and
persistent distributed lock service called
Chubby.
• A Chubby service consists of five active
replicas, one of which is elected to be the
master and actively serve requests.
BigTable
Bigtable : Building Blocks
• Bigtable uses Chubby for a variety of tasks:
– to ensure that there is at most one active master at
any time;
– to store the bootstrap location of Bigtable data;
– to discover tablet servers and finalize tablet server
deaths;
– to store Bigtable schema information (the column
family information for each table);
– and to store access control lists.
BigTable
Bigtable : Building Blocks
• If Chubby becomes unavailable for an
extended period of time, Bigtable becomes
unavailable.
BigTable
Bigtable : Building Blocks

[Figure: META0, META1, META2]
BigTable
Bigtable : Building Blocks

• Updates are committed to a commit log that stores redo
records.
• The recently committed updates are stored in memory in a
sorted buffer called a memtable.
• A memtable maintains the updates on a row-by-row basis,
where each row is copy-on-write to maintain row-level
consistency.
• Older updates are stored in a sequence of SSTables (which are
immutable).
The name redo log indicates its purpose: if the database crashes, it can redo
(re-process) the logged changes to the data files, which takes the data back to
the state it was in when the last redo record was written.
BigTable
Tablets
• When system memory fills up, Bigtable compacts some
tablets.
• There are minor and major compactions. Minor
compactions involve only a few tablets, while major
ones involve the whole system.
• Major compactions can reclaim hard disk space.
• All the tablets on one machine share a log; otherwise,
one million tablets in a cluster would result in far too
many files opened for writing.
BigTable
Tablets
• There is a lot of redundant data in Google's system
(especially across time), so Bigtable makes heavy
use of compression.
• The compression looks for similar values along
the rows, columns, and times.
• Google uses variations of BMDiff and Zippy. BMDiff
gives high write speeds (~100 MB/s) and
even faster read speeds (~1000 MB/s).
BigTable
Column Family
• Column families can be split into locality
groups
• Locality groups cause the columns to be split
into different SSTables
BigTable
You can use Bigtable to store and query all of the following types of data:

• Time-series data, such as CPU and memory usage over time for
multiple servers.
• Marketing data, such as purchase histories and customer preferences.
• Financial data, such as transaction histories, stock prices, and
currency exchange rates.
• Internet of Things data, such as usage reports from energy meters
and home appliances.
• Graph data, such as information about how users are connected to
one another.
Cloud BigTable

https://cloud.google.com/bigtable/
Practice Question
