Lecture NoSqlIntro
Background
• Relational databases are the mainstream of business
• Web-based applications caused spikes in data volume and traffic
• Explosion of social media sites (Facebook, Twitter) with large data needs
• Rise of cloud-based solutions such as Amazon S3 (Simple Storage Service)
• Hooking an RDBMS to a web-based application becomes troublesome
Problem for Relational Database to Scale
• The Relational Database is built on the principle of ACID (Atomicity,
Consistency, Isolation, Durability)
• This implies that a truly distributed relational database should have
availability, consistency and partition tolerance.
• Unfortunately, guaranteeing all three at once is impossible (CAP theorem) …
Scalability is the key for processing huge data
Scaling Up
• Best way to provide ACID and rich query model is to have the dataset
on a single machine
• Limits to scaling up (vertical scaling: making a “single” machine
more powerful): the dataset is just too big!
Scaling Out
• Scaling out (or horizontal scaling: adding more smaller/cheaper
servers) is a better choice
• Different approaches for horizontal scaling (multi-node database):
• Master/Slave
• Sharding (partitioning)
Scaling out: Sharding
• Sharding (Partitioning)
• Scales well for both reads and writes
• Not transparent, application needs to be partition-aware
• Can no longer have relationships/joins across partitions
• Loss of referential integrity across shards
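The sharding trade-offs above can be illustrated with a minimal sketch of hash-based sharding. It assumes three in-memory dictionaries standing in for separate database servers; the names `NUM_SHARDS`, `shard_for`, `put` and `get` are made up for this example and do not come from any particular product. It shows why the application must be partition-aware: it has to compute the shard itself before every read or write.

```python
import hashlib

NUM_SHARDS = 3  # assumption: a fixed, known number of shards
shards = [{} for _ in range(NUM_SHARDS)]  # each dict stands in for one server


def shard_for(key: str) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


def put(key: str, value: dict) -> None:
    """Write the value to the one shard that owns the key."""
    shards[shard_for(key)][key] = value


def get(key: str):
    """Read the value from the one shard that owns the key."""
    return shards[shard_for(key)].get(key)


for user_id in ("alice", "bob", "carol", "dave"):
    put(user_id, {"name": user_id.title()})

# A query over all users cannot be answered by a single shard: the
# application has to visit every partition and merge the results itself.
all_names = sorted(row["name"] for shard in shards for row in shard.values())
```

Single-key reads and writes touch one shard only, which is why both scale well. Anything that spans keys (a join, a uniqueness check, a foreign key) has to be handled in application code.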
NoSQL – the history
• The Name:
• Stands for Not Only SQL
• The term NoSQL was introduced by Carlo Strozzi in 1998 to name his file-based
database
• It was picked up again as a Twitter hashtag in 2009 for a NoSQL meetup in San
Francisco organized by Johan Oskarsson.
• A Rackspace employee, Eric Evans, made it popular by describing the NoSQL
movement: “the whole point of seeking alternatives is that you need to solve
a problem that relational databases are a bad fit for …”
3 major papers for NoSQL
• Three major papers were the “seeds” of the NoSQL movement:
• BigTable (Google)
• Dynamo (Amazon)
o Ring partition and replication
o Gossip protocol (discovery and error detection)
o Distributed key-value data stores
o Eventual consistency
• CAP Theorem
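The ring partitioning and replication idea listed above can be sketched as follows. This is a simplified illustration, not Amazon's implementation: the class name `HashRing`, the MD5-based hash, the node names and the replication factor of 2 are all assumptions made for this example, and virtual nodes are left out.

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    """Place a string at a stable position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, replicas=2):
        self.replicas = replicas  # how many nodes hold a copy of each key
        # Sorted (position, node) pairs form the ring.
        self.ring = sorted((ring_hash(n), n) for n in nodes)
        self.positions = [pos for pos, _ in self.ring]

    def preference_list(self, key: str):
        """Return the nodes storing `key`: the first node clockwise from
        the key's position, followed by its successors on the ring."""
        start = bisect.bisect_right(self.positions, ring_hash(key))
        count = min(self.replicas, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]


NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names
ring = HashRing(NODES)
owners = ring.preference_list("user:42")  # two distinct nodes hold this key
```

The point of the ring is that adding or removing one node only moves the keys next to it on the ring; all other keys keep their owners.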
NoSQL Characteristics
• Non-relational
• Schema-less
• Data are replicated to multiple nodes (so identical and fault-tolerant) and can be partitioned:
o Down nodes are easily replaced
o No single point of failure
• Horizontally scalable
• Cheap, easy to implement (open-source)
• Massive write performance
• Fast key-value access
NoSQL Characteristics
• Don’t fully support relational features
• no join, group by, order by operations (except within partitions)
• no referential integrity constraints across partitions
• No declarative query language (e.g., SQL), so more programming
• Relaxed ACID (see CAP theorem), so fewer guarantees
• No easy integration with other applications that support SQL
Who is using them?
NoSQL Categories
• Key-value
o Examples: Dynamo, Voldemort, Scalaris
• Document-based
o Examples: MongoDB, CouchDB
• Column-based
o Examples: BigTable, Cassandra, HBase
• Graph-based
o Examples: Neo4j, InfoGrid
Key-value
• Focus on scaling to huge amounts of data
• Designed to handle massive load
• Key-value stores have a simple data model in common: a map/dictionary,
allowing clients to put and request values per key
• Based on Amazon’s Dynamo paper
• Data model: (global) collection of key-value pairs
• Modern key-value stores favor high scalability over consistency
• The length of keys to be stored is limited to a certain number of
bytes, while there is less limitation on values.
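The map/dictionary model above can be sketched in a few lines. The class `KeyValueStore` and the 64-byte key limit are assumptions chosen for illustration; real stores define their own limits and add networking, persistence and replication on top.

```python
MAX_KEY_BYTES = 64  # assumption: an arbitrary limit for this sketch


class KeyValueStore:
    """A single-node, in-memory stand-in for a key-value store."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        # Keys are size-limited; values are opaque blobs with no such check.
        if len(key.encode("utf-8")) > MAX_KEY_BYTES:
            raise ValueError(f"key exceeds {MAX_KEY_BYTES} bytes")
        self._data[key] = value

    def get(self, key: str, default=None):
        return self._data.get(key, default)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


store = KeyValueStore()
store.put("session:123", b'{"user": "alice", "cart": [1, 2, 3]}')
session = store.get("session:123")
```

The store never looks inside the value, so the only way to find data is by its key. That restriction is what keeps the interface easy to distribute.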
Document-based
• Can model more complex objects
• Inspired by Lotus Notes
• Data model: collection of documents
• JSON (JavaScript Object Notation): a data model of key-value pairs, which
supports objects (records, structs, maps), arrays (lists), strings, numbers, Booleans, etc.
• MongoDB data type: BSON (Binary JSON), a binary serialisation of JSON-like
documents with extra types such as dates
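A small sketch of the document model, using Python's standard `json` module. The two documents, their field names and the `posts` list are invented for this example; the `_id` field follows MongoDB's naming convention but no database is involved here.

```python
import json

# Two documents in the same collection with different fields (schema-less).
posts = [
    {
        "_id": 1,
        "title": "Intro to NoSQL",
        "author": {"name": "Alice", "email": "alice@example.com"},  # nested object
        "tags": ["nosql", "databases"],                             # array
        "published": True,
        # JSON has no date type, so the date is kept as an ISO 8601 string.
        "created": "2015-03-01T10:00:00Z",
    },
    {
        "_id": 2,
        "title": "Graph basics",
        "tags": ["graphs"],
    },
]

# A document serialises to JSON text and back without loss.
encoded = json.dumps(posts[0])
decoded = json.loads(encoded)

# Queries can reach inside documents, unlike in a plain key-value store.
nosql_ids = [doc["_id"] for doc in posts if "nosql" in doc.get("tags", [])]
author_names = [doc["author"]["name"] for doc in posts if "author" in doc]
```

One document holds what a relational design would spread over several joined tables (post, author, tags), which is why documents can model more complex objects.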
Column-based
• Based on Google’s BigTable paper
• Like column-oriented relational databases (store data in column order) but with a twist
• Tables similar to an RDBMS, but handle semi-structured data
• Data model:
o Collection of Column Families
o Column family = (key, value) where value = set of related columns (standard, super)
o Indexed by the triple (row key, column key and timestamp)
• Allows key-value pairs to be stored (and retrieved by key) in a massively parallel system
• Storing principle: big hashed distributed tables
• Properties: partitioning (horizontal and/or vertical), high availability, etc., completely transparent to the application
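The (row key, column key, timestamp) indexing can be sketched with nested dictionaries. The class `ColumnFamily` and its methods are hypothetical names for this illustration; this is not the API of BigTable, Cassandra or HBase, and everything lives in one process.

```python
from collections import defaultdict


class ColumnFamily:
    """Cells addressed by (row key, column key, timestamp)."""

    def __init__(self):
        # row key -> column key -> list of (timestamp, value), newest first
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column_key, value, timestamp):
        cells = self._rows[row_key][column_key]
        cells.append((timestamp, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)

    def get(self, row_key, column_key, timestamp=None):
        """Return the newest value, or the newest at or before `timestamp`."""
        for ts, value in self._rows.get(row_key, {}).get(column_key, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def columns(self, row_key):
        """Each row may have its own set of columns (semi-structured)."""
        return sorted(self._rows.get(row_key, {}))


profile = ColumnFamily()
profile.put("user:1", "name", "Ann", timestamp=1)
profile.put("user:1", "name", "Anna", timestamp=2)   # newer version of the cell
profile.put("user:1", "email", "ann@example.com", timestamp=1)
profile.put("user:2", "phone", "555-0100", timestamp=1)

latest_name = profile.get("user:1", "name")
old_name = profile.get("user:1", "name", timestamp=1)
```

Rows `user:1` and `user:2` carry different columns, which is the sense in which the model handles semi-structured data. The timestamp keeps older versions of a cell readable.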
Graph Database
• A graph has nodes and edges/relationships (directed or undirected)
• A graph database stores data in a graph (nodes and relationships)
• Both nodes and relationships can have properties; this is sometimes referred to as the “Property Graph Model”.
Building Blocks of Graph Database
• Nodes
• Relationships
• Attributes
• Labels
Nodes
• Nodes are often used to represent entities, but depending on the
domain, relationships may be used for that purpose as well.
• The following are some example nodes:
Relationships
• Relationships organise the nodes by connecting them.
• A relationship connects two nodes – a start node and an end node.
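The building blocks above (nodes, relationships with a start and an end node, properties, labels) can be sketched with plain data classes. All names here (`Node`, `Relationship`, `Graph`, `relate`, `outgoing`) are assumptions for this example and do not reflect the API of Neo4j or any other graph database.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    labels: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node            # every relationship has a start node...
    end: Node              # ...and an end node
    rel_type: str
    properties: dict = field(default_factory=dict)


class Graph:
    def __init__(self):
        self.nodes = {}
        self.relationships = []

    def add_node(self, node_id, labels=(), **properties):
        node = Node(node_id, set(labels), dict(properties))
        self.nodes[node_id] = node
        return node

    def relate(self, start, end, rel_type, **properties):
        rel = Relationship(start, end, rel_type, dict(properties))
        self.relationships.append(rel)
        return rel

    def outgoing(self, node, rel_type):
        """Follow relationships of one type from their start node."""
        return [r.end for r in self.relationships
                if r.start is node and r.rel_type == rel_type]


g = Graph()
alice = g.add_node(1, labels=["Person"], name="Alice")
bob = g.add_node(2, labels=["Person"], name="Bob")
knows = g.relate(alice, bob, "KNOWS", since=2010)  # relationship with a property

friends_of_alice = [n.properties["name"] for n in g.outgoing(alice, "KNOWS")]
```

The relationship is directed: following `KNOWS` from `alice` reaches `bob`, but following it from `bob` reaches nothing unless a second relationship is added.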
Relationships…