Module_1
INTRODUCTION TO
NOSQL DATABASES
Prepared By:
Madhuri J
Assistant Professor
Department of Computer Science and Engineering
Bangalore Institute of Technology
Outline
• Background
• What is NOSQL?
• Who is using it?
• 3 major papers for NOSQL
• CAP theorem
• NOSQL categories
• Conclusion
• References
Background
• Relational databases have been the mainstay of business
• Web-based applications caused spikes in data volume and traffic
• explosion of social media sites (Facebook, Twitter) with large data
needs
• rise of cloud-based solutions such as Amazon S3 (Simple Storage
Service)
• Hooking RDBMS to web-based application becomes
troublesome
What is NOSQL?
• The Name:
• Stands for Not Only SQL
• The term NOSQL was introduced by Carlo Strozzi in 1998 to name
his file-based database
• It was reintroduced in 2009 by Eric Evans when an event was
organized to discuss open-source distributed databases
• Eric states that “… but the whole point of seeking alternatives is
that you need to solve a problem that relational databases are a
bad fit for. …”
What is NOSQL?
• Key features (advantages):
• non-relational
• do not require a fixed schema
• data are replicated to multiple
nodes (so copies are identical and fault-tolerant)
and can be partitioned:
• down nodes are easily replaced
• no single point of failure
• horizontally scalable
• cheap, easy to implement
(open-source)
• massive write performance
• fast key-value access
What is NOSQL?
• Disadvantages:
• Don’t fully support relational features
• no join, group by, order by operations (except within partitions)
• no referential integrity constraints across partitions
• No declarative query language (e.g., SQL) → more
programming
• Relaxed ACID (see CAP theorem) → fewer guarantees
• No easy integration with other applications that support
SQL
CAP Theorem
• ACID
• A DBMS is expected to support “ACID transactions,” processes
that are:
• Atomicity: either the whole process is done or none of it is
• Consistency: only valid data are written
• Isolation: concurrent transactions do not interfere, as if one ran at a time
• Durability: once committed, it stays that way
• CAP
• Consistency: all nodes in the cluster see the same copy of the data
• Availability: the cluster always accepts reads and writes
• Partition tolerance: guaranteed properties are maintained even
when network failures prevent some machines from
communicating with others
CAP Theorem
• Brewer’s CAP Theorem:
• For any system sharing data, it is “impossible” to guarantee
simultaneously all of these three properties
• You can have at most two of these three properties for any shared-
data system
• Very large systems will “partition” at some point:
• That leaves either C or A to choose from (a traditional DBMS prefers
C over A and P)
• In almost all cases, you would choose A over C (except in specific
applications such as order processing)
CAP Theorem
• Consistency
• 2 types of consistency:
1. Strong consistency – ACID (Atomicity, Consistency,
Isolation, Durability)
2. Weak consistency – BASE (Basically Available,
Soft state, Eventual consistency)
CAP Theorem
• A consistency model determines rules for visibility and
apparent order of updates
• Example:
• Row X is replicated on nodes M and N
• Client A writes row X to node N
• Some period of time t elapses
• Client B reads row X from node M
• Does client B see the write from client A?
• Consistency is a continuum with tradeoffs
• For NOSQL, the answer would be: “maybe”
• CAP theorem states: “strong consistency can't be achieved at the
same time as availability and partition-tolerance”
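The read-after-write example above can be sketched in Python. This is a minimal single-process simulation under stated assumptions: the `Replica` class and the `replicate` helper are hypothetical stand-ins for a real replicated store, and replication is triggered by hand rather than by a network.

```python
class Replica:
    """A hypothetical in-memory node holding a copy of each row."""

    def __init__(self, name):
        self.name = name
        self.rows = {}

    def write(self, key, value):
        self.rows[key] = value

    def read(self, key):
        return self.rows.get(key)


def replicate(source, target):
    """Copy every row from source to target (stands in for async replication)."""
    target.rows.update(source.rows)


node_m = Replica("M")
node_n = Replica("N")

# Client A writes row X to node N.
node_n.write("X", "new value")

# Client B reads row X from node M before the update has propagated.
stale_read = node_m.read("X")

# After some time t, the update propagates and both nodes agree.
replicate(node_n, node_m)
fresh_read = node_m.read("X")

print(stale_read)  # None: the write is not visible on M yet
print(fresh_read)  # new value
```

Whether client B sees the write depends only on whether replication ran before the read, which is the "maybe" in the slide.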
CAP Theorem
• Eventual consistency
• When no updates occur for a long period of time, eventually all
updates will propagate through the system and all the nodes will
be consistent
• Cloud computing
• ACID is hard to achieve; moreover, it is not always required, e.g.,
for blogs, status updates, product listings, etc.
Impedance mismatch
• Difference between relational model and in-memory data
structures.
• The relational data model organizes data into a structure of tables
and rows (more formally, relations and tuples)
• Tuple: Set of name-value pairs. (single record)
• Relation: Set of tuples.
• Values of relational tuple have to be simple and cannot contain
structures, such as nested record.
• In-memory data structures can take rich structures
• Data structure has to be translated into relational
representation to store it on disk.
• The need to translate between the two representations is the
IMPEDANCE MISMATCH
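The translation step can be illustrated with a small sketch. The `order` structure and the helper names `to_relational` and `from_relational` are hypothetical; the point is only that a nested in-memory object must be flattened into simple tuples and reassembled on the way back.

```python
# An in-memory order: a rich structure with a nested list of line items.
order = {
    "id": 99,
    "customer": "Ann",
    "items": [
        {"product": "Book", "qty": 1},
        {"product": "Keyboard", "qty": 2},
    ],
}


def to_relational(order):
    """Flatten the nested order into simple tuples (rows) for two tables."""
    order_row = (order["id"], order["customer"])
    item_rows = [
        (order["id"], item["product"], item["qty"]) for item in order["items"]
    ]
    return order_row, item_rows


def from_relational(order_row, item_rows):
    """Rebuild the nested structure from rows (the equivalent of a join)."""
    order_id, customer = order_row
    return {
        "id": order_id,
        "customer": customer,
        "items": [
            {"product": product, "qty": qty}
            for (oid, product, qty) in item_rows
            if oid == order_id
        ],
    }


order_row, item_rows = to_relational(order)
print(order_row)   # (99, 'Ann')
print(item_rows)   # [(99, 'Book', 1), (99, 'Keyboard', 2)]
```

Every read and write of the object pays for one of these two conversions, which is the cost the slide calls the impedance mismatch.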
• Integration Database
• Disadvantages
• When one application changes the data storage, it has
to coordinate with all the others
• A structure that integrates many applications becomes
complex.
• An update by one application may cause problems for
another application.
• Application Database
• Accessed by a single application codebase that is looked
after by a single team.
• Only the team using the application needs to know
about the database structure.
Aggregates
• Data as atomic units that have a complex structure
• more structure than just a set of tuples
• example:
• complex record with: simple fields, arrays, records nested
inside
• Aggregate in Domain-Driven Design
• a collection of related objects that we treat as a unit
• a unit for data manipulation and management of consistency
• Advantages of aggregates:
• easier for application programmers to work with
• easier for database systems to handle operating on a cluster
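As a rough illustration of an aggregate, the sketch below builds one order as a single nested record and stores it as one unit. All field names and the `order:99` key are assumptions made for the example, not taken from the slides.

```python
import json

# One order aggregate: simple fields, an array, and nested records.
order_aggregate = {
    "id": 99,
    "customerId": 1,
    "orderItems": [
        {"productId": 27, "price": 32.45, "productName": "Book"},
    ],
    "shippingAddress": {"city": "Chicago"},
}

# The aggregate is written and read as one unit under one key,
# so all of its parts live together on whichever node holds that key.
store = {}
store["order:99"] = json.dumps(order_aggregate)

loaded = json.loads(store["order:99"])
print(loaded["orderItems"][0]["productName"])  # Book
```

Because the whole record travels together, the application gets back the structure it works with, and the database has a natural unit to place on one node of a cluster.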
[Figure: Relational implementation]
[Figure: A possible aggregation]
[Figure: Aggregate representation]
[Figure: Aggregate implementation]
NOSQL categories
1. Key-value
• Example: DynamoDB, Voldemort, Scalaris
2. Document-based
• Example: MongoDB, CouchDB
3. Column-based
• Example: BigTable, Cassandra, HBase
4. Graph-based
• Example: Neo4J, InfoGrid
• “No-schema” is a common characteristic of most
NOSQL storage systems
• Provide “flexible” data types
Key-Value Database
• Strongly aggregate-oriented
• Lots of aggregates
• Each aggregate has a key
• Data model
• A set of <key, value> pairs
• Value: an aggregate instance
• The aggregate is opaque to the database
• just a big blob of mostly meaningless bits
Key-value
• Focus on scaling to huge amounts of data
• Designed to handle massive load
• Based on Amazon’s Dynamo paper
• Data model: (global) collection of Key-value pairs
• Dynamo ring partitioning and replication
• Example: (DynamoDB)
• items having one or more attributes (name, value)
• An attribute can be single-valued or multi-valued like set.
• items are combined into a table
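The slide mentions Dynamo ring partitioning and replication without detail. The sketch below shows one common way to realize the idea, consistent hashing, in a simplified form: no virtual nodes, a fixed node list, and MD5 used only as a stable hash. The `HashRing` class and node names are hypothetical.

```python
import bisect
import hashlib


def ring_position(key):
    """Map a string to a position on the ring using a stable hash."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        # Place each node on the ring and keep the ring sorted by position.
        self.ring = sorted((ring_position(n), n) for n in nodes)
        self.positions = [pos for pos, _ in self.ring]

    def nodes_for(self, key):
        """Return the nodes storing the key: the first node clockwise
        from the key's position, followed by its successors as replicas."""
        start = bisect.bisect_right(self.positions, ring_position(key))
        count = min(self.replicas, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]


ring = HashRing(["node-a", "node-b", "node-c"], replicas=2)
owners = ring.nodes_for("user:42")
print(owners)
```

The same key always maps to the same nodes, and adding or removing a node only moves the keys adjacent to it on the ring.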
Key-value
• Basic API access:
• get(key): extract the value given a key
• put(key, value): create or update the value given its key
• delete(key): remove the key and its associated value
• execute(key, operation, parameters): invoke an operation on the
value (given its key), which is a special data structure (e.g., List, Set,
Map, etc.)
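The four operations can be sketched with an in-memory dictionary. The `KeyValueStore` class is a hypothetical illustration of the API shape, not the interface of any particular product.

```python
class KeyValueStore:
    """Hypothetical in-memory store exposing the four operations above."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        """Extract the value for a key (None if the key is absent)."""
        return self._data.get(key)

    def put(self, key, value):
        """Create or update the value for a key."""
        self._data[key] = value

    def delete(self, key):
        """Remove the key and its associated value, if present."""
        self._data.pop(key, None)

    def execute(self, key, operation, *parameters):
        """Invoke a named operation on the stored value, e.g. list.append."""
        value = self._data[key]
        return getattr(value, operation)(*parameters)


store = KeyValueStore()
store.put("cart:1", ["apple"])
store.execute("cart:1", "append", "pear")
print(store.get("cart:1"))  # ['apple', 'pear']
store.delete("cart:1")
print(store.get("cart:1"))  # None
```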
Key-value
Pros:
• very fast
• very scalable (horizontally distributed to nodes based on key)
• simple data model
• eventual consistency
• fault-tolerance
Cons:
• Can’t model more complex data structures such as objects
Key-value
Name | Producer | Data model | Querying
SimpleDB | Amazon | set of couples (key, {attribute}), where attribute is a couple (name, value) | restricted SQL; select, delete, GetAttributes, and PutAttributes operations
Redis | Salvatore Sanfilippo | set of couples (key, value), where value is a simple typed value, list, ordered (according to ranking) or unordered set, or hash value | primitive operations for each value type
Dynamo | Amazon | like SimpleDB | simple get operation and put in a context
Voldemort | LinkedIn | like SimpleDB | similar to Dynamo
Document databases
• Strongly aggregate-oriented
• Lots of aggregates
• Each aggregate has a key
• Data model
• A set of <key, document > pairs
• Document: an aggregate instance
• Structure of the aggregate visible
• limits on what we can place in it
• Access to an aggregate:
• queries based on the fields in the aggregate
Document-based
• Can model more complex objects
• Data model: collection of documents
• Document: JSON (JavaScript Object Notation, a
data model of key-value pairs that supports objects,
records, structs, lists, arrays, maps, dates, and Booleans,
with nesting), XML, or other semi-structured formats.
Document-based
• Example: (MongoDB) document
• {Name: "Jaroslav",
Address: "Malostranske nám. 25, 118 00 Praha 1",
Grandchildren: {Claire: "7", Barbara: "6", Magda: "3", Kirsten: "1",
Otis: "3", Richard: "1"},
Phones: ["123-456-7890", "234-567-8963"]
}
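Because the structure of a document is visible to the database, queries can address fields inside it. The sketch below imitates this with plain Python dictionaries; it is not the MongoDB API, and the keys `p1` and `p2`, the second document, and the helper functions are hypothetical.

```python
documents = {
    "p1": {
        "Name": "Jaroslav",
        "Grandchildren": {"Claire": "7", "Barbara": "6"},
        "Phones": ["123-456-7890", "234-567-8963"],
    },
    "p2": {"Name": "Eva", "Phones": []},
}


def find(collection, field, value):
    """Return the keys of documents whose top-level field equals value."""
    return [key for key, doc in collection.items() if doc.get(field) == value]


def find_with_phone(collection, phone):
    """Query inside a nested array field."""
    return [key for key, doc in collection.items() if phone in doc.get("Phones", [])]


print(find(documents, "Name", "Jaroslav"))         # ['p1']
print(find_with_phone(documents, "234-567-8963"))  # ['p1']
```

A key-value store could only return the whole value for `p1`; a document store can evaluate conditions like these on the server side.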
Column(-Family) Store
Cassandra
Column-based
• Based on Google’s BigTable paper
• Like column-oriented relational databases (store data in column order) but
with a twist
• Tables are similar to those in an RDBMS, but handle semi-structured data
• Data model:
• Collection of Column Families
• Column family = (key, value) where value = set of related columns (standard, super)
• indexed by row key, column key and timestamp
Column-based
• One column family can have variable
numbers of columns
• Cells within a column family are sorted “physically”
• Very sparse, most cells have null values
• Comparison: RDBMS vs column-based NOSQL
• Query on multiple tables
• RDBMS: must fetch data from several places on disk and glue together
• Column-based NOSQL: only fetch the column families of those columns
that are required by a query (all columns in a column family are stored
together on disk, so multiple rows can be retrieved in one read
operation → data locality)
Column-based
• Example: (Cassandra column family--timestamps
removed for simplicity)
UserProfile = {
Cassandra = { emailAddress:"[email protected]", age:"20" }
TerryCho = { emailAddress:"[email protected]", gender:"male" }
Cath = { emailAddress:"[email protected]",
age:"20", gender:"female", address:"Seoul" }
}
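The column family above can be modeled as a map from row key to a map of columns, where each row stores only the columns it actually has. This is a simplified sketch: timestamps are left out as in the slide, the email columns are omitted, and the helper functions are hypothetical.

```python
# Row key -> {column name -> value}. Rows need not share the same columns.
user_profile = {
    "Cassandra": {"age": "20"},
    "TerryCho": {"gender": "male"},
    "Cath": {"age": "20", "gender": "female", "address": "Seoul"},
}


def get_column(column_family, row_key, column, default=None):
    """Look up one cell by row key and column name."""
    return column_family.get(row_key, {}).get(column, default)


def rows_with_column(column_family, column):
    """Return the row keys that actually store the given column."""
    return sorted(key for key, columns in column_family.items() if column in columns)


print(get_column(user_profile, "Cath", "address"))      # Seoul
print(get_column(user_profile, "TerryCho", "address"))  # None
print(rows_with_column(user_profile, "age"))            # ['Cassandra', 'Cath']
```

Missing cells take no space here, which is how a sparse table with a variable number of columns per row stays cheap to store.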
Graph-based
• Focus on modeling the structure of data (interconnectivity)
• Scales to the complexity of data
• Graph databases are motivated by small records with
complex interconnections
• we have a web of information whose nodes are very small
(nothing more than a name) but there is a rich structure of
interconnections between them.
• Example:
• Neo4j, FlockDB, Pregel, InfoGrid …
Graph Database
• A graph database is a database that uses graph structures with
nodes, edges, and properties to represent and store data
• A graph database management system offers Create,
Read, Update, and Delete (CRUD) methods to access and
manipulate data
• Graph databases can be used for both OLAP (since they are
naturally multidimensional structures) and OLTP
• Systems tailored to OLTP (e.g., Neo4j) are generally optimized
for transactional performance, and tend to guarantee ACID
properties
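A property graph of nodes, edges, and properties can be sketched with ordinary Python collections. The data and the `neighbors` and `reachable` helpers are hypothetical and stand in for what a graph query language would express.

```python
from collections import deque

# Nodes and edges both carry properties.
nodes = {
    "anna": {"label": "Person"},
    "bob": {"label": "Person"},
    "carol": {"label": "Person"},
}
edges = [
    ("anna", "bob", {"type": "FRIEND", "since": 2015}),
    ("bob", "carol", {"type": "FRIEND", "since": 2019}),
]


def neighbors(node, edge_type):
    """Follow outgoing edges of the given type from a node."""
    return [dst for src, dst, props in edges if src == node and props["type"] == edge_type]


def reachable(start, edge_type):
    """Breadth-first traversal: every node reachable from start."""
    seen, queue = {start}, deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbors(current, edge_type):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}


print(sorted(reachable("anna", "FRIEND")))  # ['bob', 'carol']
```

The traversal follows connections directly, with no join between tables, which is the kind of query graph databases are built for.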
Schemaless Databases
• A schemaless store makes it easier to deal with nonuniform
data: data where each record has a different set of fields.
• NoSQL databases are schemaless:
• A key-value store allows you to store any data you like under a
key
• A document database effectively does the same thing, since it
makes no restrictions on the structure of the documents you
store
• Column-family databases allow you to store any data under any
column you like
• Graph databases allow you to freely add new edges and freely
add properties to nodes and edges as you wish
Schemaless Databases
• This has various advantages:
• Without a schema binding you, you can easily store whatever
you need, and change your data storage as you learn more
about your project
• You can easily add new things as you discover them
• A schemaless store also makes it easier to deal with nonuniform
data: data where each record has a different set of fields
(limiting sparse data storage)
Schemaless Databases
• It also has some problems:
• Indeed, whenever we write a program that accesses data, that program
almost always relies on some form of implicit schema: it will assume
that certain field names are present and carry data with a certain
meaning, and assume something about the type of data stored within
that field
• Having the implicit schema in the application means that in order to
understand what data is present you have to dig into the application
code
• Furthermore, the database remains ignorant of the schema: it cannot
use the schema to support the decision on how to store and retrieve
data efficiently.
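The implicit schema can be seen in a few lines of application code. In this sketch the records, field names, and the `validate` helper are all hypothetical; the second record uses a different field name to show what happens when writers disagree.

```python
records = [
    {"name": "Ann", "qty": 2, "price": 10.0},
    {"name": "Bob", "quantity": 1, "price": 5.0},  # field named differently by another writer
]


def order_total(record):
    """Implicit schema: assumes numeric 'qty' and 'price' fields exist."""
    return record["qty"] * record["price"]


def validate(record, required=("name", "qty", "price")):
    """Make the implicit schema explicit by listing the missing fields."""
    return [field for field in required if field not in record]


for record in records:
    missing = validate(record)
    if missing:
        print(record["name"], "is missing", missing)
    else:
        print(record["name"], order_total(record))
```

The database accepts both records without complaint; only the application code knows that `qty` is expected, so the schema has to be read out of functions like `order_total`.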
Materialized views
• This is the model where all the data for the customer is embedded
using a key-value store.
• If the requirements are to read the orders or the products sold in each
order, the whole object has to be read and then parsed on the client
side to build the results.
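A materialized view addresses this by computing the result ahead of time. The sketch below is a minimal illustration under stated assumptions: the customer data, keys, and the `build_product_view` helper are hypothetical, and the view is rebuilt in one pass rather than maintained incrementally.

```python
from collections import defaultdict

# Customer aggregates with embedded orders, as stored under one key each.
customers = {
    "cust:1": {"name": "Ann", "orders": [
        {"id": 99, "products": ["book", "pen"]},
        {"id": 100, "products": ["book"]},
    ]},
    "cust:2": {"name": "Bob", "orders": [
        {"id": 101, "products": ["pen"]},
    ]},
}


def build_product_view(store):
    """Precompute product -> order ids so reads need not parse every customer."""
    view = defaultdict(list)
    for customer in store.values():
        for order in customer["orders"]:
            for product in order["products"]:
                view[product].append(order["id"])
    return dict(view)


product_view = build_product_view(customers)
print(product_view["book"])  # [99, 100]
print(product_view["pen"])   # [99, 101]
```

Reads against `product_view` are a single lookup; the cost moves to keeping the view up to date when the underlying aggregates change.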