NoSQL Database Model
Objectives
By the end of this session we will be acquainted with the
following topics:
- Introduction to NoSQL databases
- Internals of different NoSQL databases
Introduction
Data Facts
Amount of data in circulation over the internet by the year 2020:
- Google: 25 petabytes (PB), i.e. 26,214,400 GB
- Facebook: 60 million photos (1.5 PB)
The movie Avatar took 1 PB of storage space to render its 3D CGI effects.
Data volumes of a few TB and more came to be known as Big Data.
Big Data: Main Problem Areas
- Storing and accessing large amounts of data efficiently is
difficult, and the data must be backed up as well.
- Manipulating large data sets involves running
massively parallel processes.
- Managing semi-structured and unstructured data
generated by diverse sources adds to the problem.
Big Data: Hardware Challenge
Storage
- A 1 TB hard disk spinning at 7200 RPM reads data at a rate of
300 MB/s.
- At this rate it takes at least 55 minutes to 1 hour to
read the full 1 TB.
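The estimate above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a constant sequential read rate with no seek overhead; the function name and the two definitions of a terabyte are illustrative choices, not part of the original slides.

```python
def full_scan_minutes(capacity_bytes: float, throughput_bytes_per_s: float) -> float:
    """Minutes needed to read the whole disk sequentially at a constant rate."""
    return capacity_bytes / throughput_bytes_per_s / 60


MB = 10**6            # decimal megabyte
TB_DECIMAL = 10**12   # 1 TB as drive vendors count it
TB_BINARY = 2**40     # 1 TiB, as many operating systems report it

rate = 300 * MB       # 300 MB/s, the read rate quoted in the slide

decimal_minutes = full_scan_minutes(TB_DECIMAL, rate)
binary_minutes = full_scan_minutes(TB_BINARY, rate)

print(f"1 TB  at 300 MB/s: {decimal_minutes:.1f} minutes")
print(f"1 TiB at 300 MB/s: {binary_minutes:.1f} minutes")
```

Depending on which definition of a terabyte is used, the result lands between roughly 55 minutes and a little over an hour, which matches the range given on the slide.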
Data Processing Units / Servers
- Either use mainframe servers to process and store the
data,
- or use a cluster or grid of machines that can scale
horizontally.
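One common way to spread data over a horizontally scaled cluster is to hash each key to a machine. The sketch below is only an illustration under that assumption: `node_for_key` and `partition` are hypothetical helper names, and simple modulo hashing is used for brevity (real systems often prefer consistent hashing so that adding a node moves fewer keys).

```python
import hashlib


def node_for_key(key: str, num_nodes: int) -> int:
    """Map a key to a node index using a stable hash (md5 only spreads keys here)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def partition(keys, num_nodes):
    """Group keys by the node that would store them."""
    nodes = {i: [] for i in range(num_nodes)}
    for key in keys:
        nodes[node_for_key(key, num_nodes)].append(key)
    return nodes


keys = [f"user:{i}" for i in range(1000)]
cluster = partition(keys, num_nodes=4)
for node, owned in cluster.items():
    print(f"node {node} holds {len(owned)} keys")
```

Each machine then stores and scans only its own share of the keys, which is what lets capacity grow by adding machines.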
Challenges for Big Data on RDBMS
- An RDBMS assumes a well-defined structure in the data:
  - Data is distributed among multiple tables.
  - Tables must be indexed to optimize
    operations.
- It assumes that the data is dense and largely
uniform:
  - Properties of the data can be defined up front,
    and its interrelationships are well
    established and systematically referenced.
NoSQL Database for Big Data
- NoSQL is an umbrella term for all databases that:
  - don't follow the RDBMS principles
  - are associated with large data sets accessed and
    manipulated on a Web scale
- NoSQL is not a single product or even a single
technology.
History / Advent of NoSQL DB
- Google runs a set of massively scalable applications and
infrastructure that operate on large amounts of data:
  - Google Maps
  - Google Apps
  - Google Mail
  - Google Earth
- To support them, Google built:
  - A distributed file system
  - A distributed coordination system
  - A MapReduce-based parallel execution algorithm
  - A column-family-oriented data store / database
- Following the same approach, the developers of Lucene (an
open-source search engine framework) were among the first to
build an open-source counterpart to parts of Google's infrastructure.
- These developers later joined Yahoo, where they mirrored
Google's development model in a new open-source
framework, Hadoop (Apache Hadoop).
- In 2007, Amazon, another web giant, published the design
of its own NoSQL data store, known as Dynamo.
Examples of NoSQL Databases
NoSQL Database Models
SORTED ORDERED COLUMN-ORIENTED STORES
HBase
History: Donated to the Apache foundation.
Technologies and Language: Implemented in Java.
Access Methods: A JRuby shell allows command-line access to
the store. Thrift and Avro clients are also available.
Query Language: No native query language. Hive provides a
SQL-like interface for HBase.
Who Uses It: Facebook, Yahoo!
Hypertable
History: Created at Zvents in 2007. Now an independent open-source project.
Technologies and Language: Implemented in C++.
Access Methods: A command-line shell is available, plus a Thrift interface.
Query Language: HQL (Hypertable Query Language).
Who Uses It: Baidu (China's biggest search engine), Rediff
(India's biggest portal).
KEY/VALUE STORES
Cassandra
History: Developed at Facebook and open-sourced in 2008,
then donated to the Apache foundation as Apache Cassandra.
Technologies and Language: Implemented in Java.
Access Methods: Command-line access to the store and a Thrift
interface.
Query Language: A query language specification is in the
making.
Who Uses It: Facebook, Digg, Reddit, Twitter, and others.
Voldemort
History: Created by the data and analytics team at
LinkedIn in 2008.
Technologies and Language: Implemented in Java.
Provides pluggable storage using either Berkeley DB or
MySQL.
Access Methods: Integrates with Thrift, Avro, and
protobuf.
Who Uses It: LinkedIn.
Document Based
MongoDB
History: Created at 10gen.
Technologies and Language: Implemented in C++.
Access Methods: A JavaScript command-line interface. Drivers exist for a
number of languages, including C, C#, C++, Erlang, Haskell, Java,
JavaScript, Perl, PHP, Python, Ruby, and Scala.
Query Language: Queries are expressed as JSON-like documents, with broadly SQL-like capabilities.
Who Uses It: Foursquare, Shutterfly, Intuit, GitHub, and more.
CouchDB
History: Work started in 2005, and the project entered Apache incubation in 2008.
Technologies and Language: Implemented in Erlang, with some C and
a JavaScript execution environment.
Access Methods: Upholds REST above every other mechanism. Use
standard web tools and clients to access the database, the same way
you access web resources.
Who Uses It: Apple, BBC, Canonical, CERN, and more at
http://wiki.apache
Graph Database
FlockDB
History: Created at Twitter and open-sourced in 2010.
Designed to store the adjacency lists for followers on Twitter.
Technologies and Language: Implemented in Scala.
Access Methods: Thrift and Ruby clients.
Open-Source License: Apache License version 2.
Who Uses It: Twitter.
Internals of Different NoSQL Database Models
SORTED ORDERED COLUMN-ORIENTED STORES
Relational Database Table Design
- Adding new attributes introduces NULL values in the existing rows.
- If a value is updated several times, we may need to maintain every version of it.
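The NULL problem can be seen with a tiny fixed-schema table. This is a minimal sketch: the column names and the `add_column` helper are hypothetical and only mimic what adding a column to a relational table does to the rows already stored.

```python
# Fixed-schema rows: every row carries a slot for every column.
columns = ["id", "last_name", "first_name", "salary"]
rows = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
]


def add_column(columns, rows, name):
    """Return a new schema and rows with `name` appended and filled with None (NULL)."""
    return columns + [name], [row + (None,) for row in rows]


columns2, rows2 = add_column(columns, rows, "zip_code")
null_count = sum(value is None for row in rows2 for value in row)

print(columns2)
print(rows2)
print("NULL cells introduced:", null_count)

# A sparse alternative stores only the attributes a record actually has.
sparse_rows = [
    {"id": 1, "last_name": "Smith", "first_name": "Joe", "salary": 40000},
    {"id": 2, "last_name": "Jones", "first_name": "Mary", "salary": 50000, "zip_code": "10001"},
]
```

In the sparse form no placeholder is stored for a missing attribute, which is the property the NoSQL models discussed next rely on.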
Record Oriented Stores (RDBMS)
001:10,Smith,Joe,40000;
002:12,Jones,Mary,50000;
003:11,Johnson,Cathy,44000;
004:22,Jones,Bob,55000;
COLUMN-ORIENTED STORES
10:001, 12:002, 11:003, 22:004;
Smith:001, Jones:002, Johnson:003, Jones:004;
Joe:001, Mary:002, Cathy:003, Bob:004;
40000:001, 50000:002, 44000:003, 55000:004;
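The two layouts above hold the same data, pivoted. The following sketch rebuilds the column-oriented form from the record-oriented one; `to_columns` is an illustrative helper, and the meaning of each field (for example, that the last field is a salary) is an assumption read off the sample values.

```python
# Record-oriented layout: one tuple of values per row id.
rows = {
    "001": (10, "Smith", "Joe", 40000),
    "002": (12, "Jones", "Mary", 50000),
    "003": (11, "Johnson", "Cathy", 44000),
    "004": (22, "Jones", "Bob", 55000),
}


def to_columns(rows):
    """Pivot {row_id: (v1, v2, ...)} into one list of (value, row_id) pairs per column."""
    width = len(next(iter(rows.values())))
    columns = [[] for _ in range(width)]
    for row_id, values in rows.items():
        for i, value in enumerate(values):
            columns[i].append((value, row_id))
    return columns


columns = to_columns(rows)
for column in columns:
    print(", ".join(f"{value}:{row_id}" for value, row_id in column) + ";")

# An aggregate over one attribute now reads a single column only.
last_column = [value for value, _ in columns[3]]
average = sum(last_column) / len(last_column)
print(average)
```

Because all values of one attribute sit together, a query that needs only that attribute does not have to read the other three.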
Column-Family:
- A set of columns grouped together into a bundle
- Column-family members are physically stored together
Column databases also store multiple versions of a value.
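A toy model can make column families and versioned values concrete. This is only a sketch under simplifying assumptions: `VersionedTable` is a hypothetical in-memory class, one Python dictionary per family stands in for "stored together", and versions are ordered by a caller-supplied timestamp.

```python
import time
from collections import defaultdict


class VersionedTable:
    """Toy store: family -> row key -> column -> [(timestamp, value), ...]."""

    def __init__(self, families):
        self.families = set(families)
        # One container per family, mimicking "family members are stored together".
        self.store = {f: defaultdict(lambda: defaultdict(list)) for f in families}

    def put(self, row_key, family, column, value, timestamp=None):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        ts = time.time() if timestamp is None else timestamp
        cells = self.store[family][row_key][column]
        cells.append((ts, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)  # newest version first

    def get(self, row_key, family, column, versions=1):
        """Return up to `versions` most recent (timestamp, value) pairs."""
        cells = self.store[family].get(row_key, {}).get(column, [])
        return cells[:versions]


table = VersionedTable(["name", "pay"])
table.put("001", "name", "last", "Smith", timestamp=1)
table.put("001", "pay", "salary", 40000, timestamp=1)
table.put("001", "pay", "salary", 42000, timestamp=2)

latest = table.get("001", "pay", "salary")
history = table.get("001", "pay", "salary", versions=2)
print(latest)
print(history)
```

An update appends a new version instead of overwriting the old one, so a reader can ask for the latest value or for a short history.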
Basic Architecture
Document Store Internals
Document Based Model (MongoDB)
Start the MongoDB server
C:\applications\mongodb-win32-x86_64-1.8.1> .\bin\mongod.exe
Connect to the MongoDB server
C:\applications\mongodb-win32-x86_64-1.8.1> .\bin\mongo.exe
MongoDB shell version: 1.8.1
connecting to: test
>
1. Switch to the prefs database.
2. Define the data sets that need to be stored.
3. Save the defined data sets in a collection, named location.
use prefs
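w = {name: "John Doe", zip: 10001};
x = {name: "Lee Chang", zip: 94129};
y = {name: "Jenny Gonzalez", zip: 33101};
// Note: the leading zero makes the shell read 02101 as an octal literal, so it is stored as 1089.
// Quote the value ("02101") if the ZIP code must be kept intact.
z = {name: "Srinivas Shastri", zip: 02101};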
db.location.save(w);
db.location.save(x);
db.location.save(y);
db.location.save(z);
Get all records stored in the collection named location
> db.location.find()
{ "_id" : ObjectId("4c97053abe67000000003857"), "name" : "John Doe", "zip" : 10001 }
{ "_id" : ObjectId("4c970541be67000000003858"), "name" : "Lee Chang", "zip" : 94129 }
{ "_id" : ObjectId("4c970548be67000000003859"), "name" : "Jenny Gonzalez", "zip" : 33101 }
{ "_id" : ObjectId("4c970555be6700000000385a"), "name" : "Srinivas Shastri", "zip" : 1089 }
Get only the records whose zip is 10001
> db.location.find({zip: 10001});
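The save and find calls above can be mimicked in a few lines to show what a document collection does conceptually. This is a minimal in-memory sketch, not MongoDB's implementation or driver API: the `Collection` class is hypothetical, ids are plain counters rather than ObjectIds, and only equality matching is supported.

```python
import itertools


class Collection:
    """Toy document collection supporting save() and equality-based find()."""

    def __init__(self):
        self._docs = []
        self._ids = itertools.count(1)

    def save(self, doc):
        stored = dict(doc)  # copy so later changes by the caller do not leak in
        if "_id" not in stored:
            stored["_id"] = next(self._ids)
        self._docs.append(stored)
        return stored["_id"]

    def find(self, query=None):
        """Return every document whose fields equal all key/value pairs in `query`."""
        query = query or {}
        return [d for d in self._docs
                if all(d.get(k) == v for k, v in query.items())]


location = Collection()
for doc in [
    {"name": "John Doe", "zip": 10001},
    {"name": "Lee Chang", "zip": 94129},
    {"name": "Jenny Gonzalez", "zip": 33101},
    # Stored as a string here; Python 3 rejects 02101 as a number outright.
    {"name": "Srinivas Shastri", "zip": "02101"},
]:
    location.save(doc)

everything = location.find()
matches = location.find({"zip": 10001})
print(len(everything))
print(matches)
```

Documents in the same collection need not share a schema: the last record keeps its ZIP code as a string while the others use numbers, and both kinds are stored and queried side by side.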
MongoDB manages data as follows:
- Data files are mapped as file segments into virtual memory, because accessing and
manipulating memory is much faster than making system calls.
- There is no separation between the operating system cache and the
database cache.
- MongoDB can expand its database cache to use all available
memory without any additional configuration.
- Hence MongoDB performance can be improved by adding more RAM
and allocating more virtual memory.
- More recent versions of MongoDB support auto-sharding for
easy horizontal scaling.
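Mapping a file segment into virtual memory can be tried with Python's standard `mmap` module. This sketch only illustrates the general technique; it is not MongoDB's storage code, and the file name, segment size, and `write_through_mapping` helper are assumptions made for the example.

```python
import mmap
import os
import tempfile

SEGMENT_BYTES = 4096  # assumed segment size for the demo


def write_through_mapping(path):
    """Create a file, modify it through a memory mapping, and read it back."""
    with open(path, "wb") as f:
        f.write(b"\x00" * SEGMENT_BYTES)      # pre-allocate the segment

    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), SEGMENT_BYTES) as mem:
            mem[0:5] = b"hello"               # plain memory write, no write() call
            mem.flush()                       # ask the OS to persist dirty pages

    with open(path, "rb") as f:
        return f.read(5)


with tempfile.TemporaryDirectory() as tmp:
    result = write_through_mapping(os.path.join(tmp, "segment.dat"))

print(result)
```

The write happens as an ordinary memory assignment, and the operating system's page cache decides when the bytes reach disk, which is why the database cache and the OS cache are effectively the same thing in this design.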
Key/Value Store Data Model
Key/Value Model
- Memcached is a key/value store used by
Facebook, Twitter, and Wikipedia.
- It is extremely simple, with no add-on features such as:
  - Failover
  - Backup
  - Recovery
- Memcached stores its values in slabs:
  - A slab is made of pages.
  - Pages are made of chunks (or buckets).
- Memcached can store values up to a maximum of 1 MB in
size.
- Values are stored and referenced by a key, which can be up to 250
bytes in size.
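The limits and the slab/page/chunk hierarchy above can be sketched as a small in-memory cache. This is a simplified illustration, not Memcached's actual allocator: the `TinyCache` class, the 1 MB page size, and the doubling chunk sizes are assumptions, and real chunks also hold the key and item metadata.

```python
import bisect

MAX_KEY_BYTES = 250              # key limit from the slide
MAX_VALUE_BYTES = 1024 * 1024    # 1 MB value limit from the slide
PAGE_BYTES = 1024 * 1024         # assumed page size
# Hypothetical chunk sizes, one slab class each: 64 B, 128 B, ..., 1 MB.
CHUNK_SIZES = [64 * 2**i for i in range(15)]


def slab_class_for(value: bytes) -> int:
    """Index of the smallest slab class whose chunk can hold the value."""
    if len(value) > MAX_VALUE_BYTES:
        raise ValueError("value exceeds 1 MB")
    return bisect.bisect_left(CHUNK_SIZES, len(value))


def chunks_per_page(slab_class: int) -> int:
    """How many chunks of this class fit into one page."""
    return PAGE_BYTES // CHUNK_SIZES[slab_class]


class TinyCache:
    """Key/value cache that files each value under the slab class of its size."""

    def __init__(self):
        self._slabs = {i: {} for i in range(len(CHUNK_SIZES))}
        self._index = {}  # key -> slab class currently holding it

    def set(self, key: str, value: bytes) -> None:
        if len(key.encode("utf-8")) > MAX_KEY_BYTES:
            raise ValueError("key exceeds 250 bytes")
        slab_class = slab_class_for(value)
        old_class = self._index.get(key)
        if old_class is not None and old_class != slab_class:
            del self._slabs[old_class][key]  # value moved to a different class
        self._slabs[slab_class][key] = value
        self._index[key] = slab_class

    def get(self, key: str):
        slab_class = self._index.get(key)
        return None if slab_class is None else self._slabs[slab_class][key]


cache = TinyCache()
cache.set("user:1", b"x" * 100)
print(slab_class_for(b"x" * 100), chunks_per_page(1))
print(cache.get("user:1") == b"x" * 100)
```

Fixed chunk sizes trade some wasted space inside each chunk for allocation that never fragments memory, which suits a cache that stores and evicts values constantly.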