NO SQL Data Management
Unit: 2
• Course Objective
• Course Outcome
• CO and PO Mapping
• Introduction to No SQL
• aggregate data models
• Aggregates
• key-value and document data models
• relationships
• graph databases
• Schemaless databases
• materialized views
• distribution models
• sharding
• master-slave replication
• peer-peer replication
• sharding and replication
• consistency
• relaxing consistency
• version stamps
• map-reduce
• partitioning and combining
• composing map-reduce calculations
• Summary
• Big data is a popular term used to describe the exponential growth and
availability of data, both structured and unstructured.
• 3 dimensions / characteristics of Big data: 3Vs (volume, variety and
velocity)
• Web analytics is the measurement, collection, analysis and reporting of
web data for purposes of understanding and optimizing web usage.
• Fraud is intentional deception made for personal gain or to damage another
individual.
• Credit risk management is a critical function that spans a diversity of
businesses across a wide range of industries.
• HDFS is the storage system for a Hadoop cluster.
• MapReduce jobs are designed to continue working in the face of system failures.
• Introduction to NoSQL
• NoSQL Databases
• Use the right data model for the right problem: Different data
models are used to solve different problems.
• Distributed systems and cloud computing support: NoSQL systems are built to run across distributed and cloud environments, although not every application needs scale or performance beyond what non-NoSQL systems can achieve.
• For the type of data to be stored: SQL databases are not the best fit for
hierarchical data storage, but a NoSQL database fits better because it
follows a key-value pair way of storing data, similar to JSON. NoSQL
databases are highly preferred for large data sets (i.e., for big data).
There are four general types of NoSQL databases, each with its own
specific attributes:
1. Key-Value storage
2. Document Databases
3. Column Storage
4. Graph Storage
Advantages (of relational databases)
– Data persistence
– Concurrency – ACID, transactions, etc.
– Integration across multiple applications
– Standard Model – tables and SQL
Disadvantages (of relational databases)
– Impedance mismatch
– Integration databases vs. application databases
– Not designed for clustering
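The impedance mismatch above can be illustrated with a short sketch: a single in-memory order aggregate must be flattened into several relational rows before it can be stored. The table and field names here are hypothetical, chosen only for illustration.

```python
# An in-memory order is one nested structure...
order = {
    "orderId": 99,
    "customer": "Ann",
    "items": [
        {"productId": 27, "price": 32.45},
        {"productId": 51, "price": 12.00},
    ],
}

# ...but a relational schema splits it across separate tables, so the
# application must flatten the aggregate into rows (the mismatch):
orders_table = [(order["orderId"], order["customer"])]
order_items_table = [
    (order["orderId"], item["productId"], item["price"])
    for item in order["items"]
]

print(orders_table)       # one row in the hypothetical "orders" table
print(order_items_table)  # two rows in the hypothetical "order_items" table
```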
• The following figure presents some sample data for this model.
{
  "orderItems": [
    {
      "productId": 27,
      "price": 32.45,
      "productName": "NoSQL Distilled"
    }
  ],
  "shippingAddress": [{"city": "Chicago"}],
  "orderPayment": [
    {
      "ccinfo": "1000-1000-1000-1000",
      "txnId": "abelif879rft",
      "billingAddress": {"city": "Chicago"}
    }
  ]
}
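A minimal sketch of how a key-value store treats such an order as one aggregate: the database only sees an opaque value under a key, and the whole aggregate is written and read in a single lookup. The key name `order:1001` is a hypothetical convention.

```python
import json

store = {}  # a plain dict stands in for the key-value database

order = {
    "productId": 27,
    "price": 32.45,
    "productName": "NoSQL Distilled",
    "shippingAddress": [{"city": "Chicago"}],
}

store["order:1001"] = json.dumps(order)     # write the whole aggregate at once
loaded = json.loads(store["order:1001"])    # read it back in one lookup
print(loaded["shippingAddress"][0]["city"]) # → Chicago
```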
08/11/2021 Hirdesh Sharma RCA E45 Big Data Unit: 2 48
Aggregate Data Model in NoSQL (CO2)
Why schemaless?
– A schemaless store also makes it easier to deal with nonuniform
data
– When starting a new development project you don't need to
spend the same amount of time on up-front design of the
schema.
– No need to learn SQL or database-specific tools.
– The rigid schema of a relational database (RDBMS) can make it
harder to push data into the DB, since the data has to fit the
schema perfectly.
Pros:
– More freedom and flexibility
– you can easily change your data organization
– you can deal with nonuniform data
Cons:
– A program that accesses data almost always relies on some form
of implicit schema: it assumes that certain fields are present and
carry data with a certain meaning
– The implicit schema is shifted into the application code that
accesses the data
Multiple servers:
– In NoSQL systems, data is distributed over large clusters.
Single server:
– simplest model, everything on one machine. Run the database on
a single machine that handles all the reads and writes to the data
store.
Sharding:
• DB sharding is horizontal partitioning of data: different people
access different parts of the dataset.
• In these circumstances we can support horizontal scalability by
putting different parts of the data onto different servers—a technique
that’s called sharding.
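The idea of putting different parts of the data onto different servers can be sketched with simple hash-based routing: each key is always sent to the same shard, so each server holds roughly 1/N of the data. The server names are hypothetical.

```python
import hashlib

SERVERS = ["shard-0", "shard-1", "shard-2"]

def shard_for(key: str) -> str:
    # A stable hash ensures the same key always lands on the same shard.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SERVERS[digest % len(SERVERS)]

# Different users are routed to (possibly) different servers.
for user in ["ann", "pathin", "carol"]:
    print(user, "->", shard_for(user))
```

Real systems usually refine this with range-based or consistent hashing so that adding a server does not reshuffle every key, but the routing principle is the same.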
Improving performance:
Master
– is the authoritative source for the data
– is responsible for processing any updates to that data
– can be appointed manually or automatically
Slaves
– A replication process synchronizes the slaves with the master
– After a failure of the master, a slave can be appointed as new
master very quickly
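The read/write routing this model implies can be sketched as follows: all writes go to the single authoritative master, while reads are spread across the slaves. Node names are hypothetical.

```python
import itertools

MASTER = "db-master"
SLAVES = ["db-slave-1", "db-slave-2"]
_read_cycle = itertools.cycle(SLAVES)  # round-robin over the slaves

def route(operation: str) -> str:
    if operation == "write":
        return MASTER            # the master processes all updates
    return next(_read_cycle)     # reads scale by adding more slaves

print(route("write"))            # always the master
print(route("read"), route("read"))  # alternates between the slaves
```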
Pros
– More read requests:
– Add more slave nodes
– Ensure that all read requests are routed to the slaves
Cons
– The master is a bottleneck
– Limited by its ability to process updates and to pass those
updates on
– Its failure eliminates the ability to handle writes until either
the master is restored or a new master is appointed
Peer-to-peer replication:
• All the replicas have equal weight; they can all accept writes
• The loss of any of them doesn’t prevent access to the data store.
Pros and cons of peer-to-peer replication
Pros:
– you can ride over node failures without losing access to data
– you can easily add nodes to improve your performance
Cons:
– Inconsistency
– Slow propagation of changes to copies on different nodes
Two schemes:
– A node can be a master for some data and slaves for others
– Nodes are dedicated for master or slave duties
Key Points:
– Sharding distributes different data across multiple servers, so
each server acts as the single source for a subset of data.
– Replication copies data across multiple servers, so each bit of
data can be found in multiple places.
A system may use either or both techniques. Replication comes in
two forms:
– Master-slave replication makes one node the authoritative copy
that handles writes while slaves synchronize with the master and
may handle reads.
– Peer-to-peer replication allows writes to any node; the nodes
coordinate to synchronize their copies of the data.
• Version Stamp
• Partitioning and Combining
• The consistency property ensures that any transaction will bring the
database from one valid state to another.
• Relational databases have strong consistency, whereas NoSQL
systems mostly have eventual consistency.
• ACID: A DBMS is expected to support “ACID transactions,”
processes that are:
– Atomicity: either the whole process is done or none is
– Consistency: only valid data are written
– Isolation: one operation at a time
– Durability: once committed, it stays that way
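Atomicity, the first of these properties, can be demonstrated with SQLite from Python's standard library: both updates inside a transaction commit together, or neither does. The account table is a hypothetical example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
con.execute("INSERT INTO accounts VALUES ('ann', 100), ('bob', 0)")
con.commit()

try:
    with con:  # opens a transaction; rolls back on any exception
        con.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'ann'")
        con.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
        raise RuntimeError("simulated crash before commit")
except RuntimeError:
    pass

# Atomicity: the half-finished transfer was rolled back entirely.
print(con.execute("SELECT balance FROM accounts WHERE name = 'ann'").fetchone())  # (100,)
```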
Replication consistency
• Let’s imagine there’s one last hotel room for a desirable event. The
hotel reservation system runs on many nodes.
• This is another inconsistent read—but it’s a breach of a different
form of consistency we call replication consistency: ensuring that
the same data item has the same value when read from different
replicas.
Replication consistency
Eventual consistency:
• At any time, nodes may have replication inconsistencies but, if there
are no further updates, eventually all nodes will be updated to the
same value.
• In other words, Eventual consistency is a consistency model used in
nosql database to achieve high availability that informally
guarantees that, if no new updates are made to a given data item,
eventually all accesses to that item will return the last updated value.
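A toy simulation makes the definition concrete: a write lands on one replica first and propagates to the others later, so reads can disagree until an anti-entropy pass converges them. The hotel-room value echoes the booking example above.

```python
# Three replicas of the same data item.
replicas = [{"room": "free"}, {"room": "free"}, {"room": "free"}]

# A write is accepted by replica 0 only; propagation is delayed.
replicas[0]["room"] = "booked"
print([r["room"] for r in replicas])  # replicas disagree: inconsistent reads

# With no further updates, a synchronization pass converges all replicas.
latest = replicas[0]["room"]
for r in replicas:
    r["room"] = latest
print([r["room"] for r in replicas])  # ['booked', 'booked', 'booked']
```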
• A field that changes every time the underlying data in the record
changes.
• When you read the data you keep a note of the version stamp, so
that when you write data you can check to see if the version has
changed.
• You may have come across this technique with updating resources
with HTTP.
• In short, version stamps help you detect concurrency conflicts:
when you read data and then update it, you can check the version
stamp to ensure nobody updated the data between your read and
write.
• Version stamps can be implemented using counters, GUIDs (a
large random number that’s guaranteed to be unique), content
hashes, timestamps, or a combination of these.
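The counter variant can be sketched as a conditional (compare-and-set) update: the write succeeds only if the version noted at read time is still current. The record shape is hypothetical.

```python
record = {"value": "room 101 free", "version": 1}

def conditional_update(record, new_value, expected_version):
    if record["version"] != expected_version:
        return False  # someone wrote between our read and our write
    record["value"] = new_value
    record["version"] += 1
    return True

v = record["version"]  # read the data, noting the version stamp

print(conditional_update(record, "booked by Ann", v))     # True: first writer wins
print(conditional_update(record, "booked by Pathin", v))  # False: conflict detected
print(record)  # {'value': 'booked by Ann', 'version': 2}
```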
• The CAP Theorem: The basic statement of the CAP theorem is that,
given the three properties of Consistency, Availability, and Partition
tolerance, you can only get two.
– Consistency: all people see the same data at the same time
– Availability: if you can talk to a node in the cluster, it can read
and write data
– Partition tolerance: the cluster can survive communication
breakages that separate the cluster into partitions unable to
communicate with each other
An example
– Ann is trying to book a room at the Ace Hotel in New York on a
London node of a booking system
– Pathin is trying to do the same on a node located in Mumbai
Possible solutions
– CP: Neither user can book any hotel room, sacrificing
availability
– CA: Designate the Mumbai node as the master for the Ace Hotel,
so only it accepts bookings for that hotel
• It is a way to take a big task and divide it into discrete tasks that can
be done in parallel.
• A common use case for MapReduce is in document databases.
• A MapReduce program is composed of a Map() procedure that
performs filtering and sorting and a Reduce() procedure that
performs a summary operation.
• "Map" step
• "Reduce" step
Logical view
• The Map function is applied in parallel to every pair in the input
dataset.
• Map(k1,v1) → list(k2,v2)
• The Reduce function is then applied in parallel to each group, which in
turn produces a collection of values in the same domain:
• Reduce(k2, list (v2)) → list(v3)
• Each Reduce call typically produces either one value v3 or an empty
return
• Let us see how this works: we start by applying the map function
to the set of documents that we have.
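The logical view above can be sketched end to end over order documents: map emits (productId, quantity) pairs, the pairs are grouped by key (the shuffle), and reduce sums each group. The document shapes are hypothetical.

```python
from collections import defaultdict

orders = [
    {"items": [{"productId": 27, "quantity": 1}, {"productId": 51, "quantity": 2}]},
    {"items": [{"productId": 27, "quantity": 3}]},
]

def map_fn(order):                 # Map(k1, v1) -> list(k2, v2)
    return [(i["productId"], i["quantity"]) for i in order["items"]]

def reduce_fn(key, values):        # Reduce(k2, list(v2)) -> list(v3)
    return sum(values)

groups = defaultdict(list)         # shuffle step: group emitted pairs by key
for order in orders:
    for k, v in map_fn(order):
        groups[k].append(v)

result = {k: reduce_fn(k, vs) for k, vs in groups.items()}
print(result)  # {27: 4, 51: 2} — total quantity per product
```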
• Combinable Reducer:
A combiner function is, in essence, a reducer function—indeed, in
many cases the same function can be used for combining as the final
reduction. The reduce function needs a special shape for this to
work: Its output must match its input. We call such a function a
combinable reducer.
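Because summing is associative and its output matches its input shape, the same function can pre-combine partial results on each node and then reduce the combined outputs, giving the same answer as a single reduce, which is exactly what makes it a combinable reducer:

```python
def reduce_fn(values):
    # Output (a number) has the same shape as each input value,
    # so this reducer can also serve as a combiner.
    return sum(values)

# Each partition combines its own values locally with the reducer...
partial_1 = reduce_fn([1, 3])   # 4
partial_2 = reduce_fn([2, 2])   # 4

# ...and the final reduce runs over the combiner outputs.
combined = reduce_fn([partial_1, partial_2])
direct = reduce_fn([1, 3, 2, 2])
print(combined, direct)  # 8 8 — same result either way
```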
• The consistency property ensures that any transaction will bring the
database from one valid state to another.
• Relational databases have strong consistency, whereas NoSQL
systems mostly have eventual consistency.
• https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
• https://www.tutorialspoint.com/hadoop/hadoop_mapreduce
.htm
• https://www.sanfoundry.com/mapreduce-questions-answers/
1. Michael Minelli, Michelle Chambers, and Ambiga Dhiraj, "Big Data, Big
Analytics: Emerging Business Intelligence and Analytic Trends for Today's
Businesses", Wiley, 2013.
2. P. J. Sadalage and M. Fowler, "NoSQL Distilled: A Brief Guide to the Emerging
World of Polyglot Persistence", Addison-Wesley Professional, 2012.
3. Tom White, "Hadoop: The Definitive Guide", Third Edition, O'Reilly, 2012.
4. Eric Sammer, "Hadoop Operations", O'Reilly, 2012.
5. E. Capriolo, D. Wampler, and J. Rutherglen, "Programming Hive", O'Reilly,
2012.
6. Lars George, "HBase: The Definitive Guide", O'Reilly, 2011.
7. Eben Hewitt, "Cassandra: The Definitive Guide", O'Reilly, 2010.
8. Alan Gates, "Programming Pig", O'Reilly, 2011.
Thank You