ADBMS-Module 2
Concurrency
• Enterprise applications can have lots of users and other systems all working
concurrently.
• Relational databases help handle this by controlling all access to their data
through transactions.
• Transactions also play a role in error handling. With transactions, you can
make a change, and if an error occurs during the processing of that change,
you can roll back the transaction to clean things up.
Integration
• Several applications can collaborate by using a single database as a shared
integration point, allowing them all to read and modify the same data.
Standard Model
• Developers and database professionals can learn the basic relational model
and apply it in many projects.
Drawbacks of RDBMS
Impedance Mismatch
• The impedance mismatch is the difference between the relational model and
the in-memory data structures used by the application.
• An order, which looks like a single aggregate structure in the UI, is split into
many rows from many tables in a relational database (see the sketch below).
• The impedance mismatch has been made much easier to deal with by the wide
availability of object-relational mapping frameworks.
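The following Python sketch is purely illustrative (the table layout and field names are assumptions, not a fixed schema): it shows the same order once as the single in-memory structure the UI works with, and once as the rows it would be split into relationally, which is the gap an ORM has to bridge.

```python
# In memory / in the UI, an order is a single nested structure:
order = {
    "order_id": 1001,
    "customer_id": 42,
    "line_items": [
        {"product_id": "P1", "qty": 2, "price": 9.50},
        {"product_id": "P7", "qty": 1, "price": 24.00},
    ],
    "shipping_address": {"city": "Bengaluru", "pin": "560064"},
}

# In a relational database the same data is split across several tables,
# which is the impedance mismatch an ORM has to bridge:
orders_row = (1001, 42)                       # orders(order_id, customer_id)
line_item_rows = [(1001, "P1", 2, 9.50),      # line_items(order_id, product_id, qty, price)
                  (1001, "P7", 1, 24.00)]
address_row = (1001, "Bengaluru", "560064")   # shipping_addresses(order_id, city, pin)
```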
Interoperability concerns
• Once you have made the decision to use an application database, you get
more freedom in choosing a database. Since there is a decoupling between
your internal database and the services with which you talk to the outside
world, the outside world doesn’t have to care how you store your data,
allowing you to consider nonrelational options.
• Coping with the increase in data and traffic requires more computing
resources. To handle this kind of increase, you have two choices: scale up or
scale out.
In this model, we have two main aggregates: customer and order. We’ve used the
black-diamond composition marker in UML to show how data fits into the
aggregation structure. The customer contains a list of billing addresses; the order
contains a list of order items, a shipping address, and payments. The payment itself
contains a billing address for that payment.
A single logical address record appears three times in the example data, but instead
of using IDs it’s treated as a value and copied each time. This fits the domain
where we would not want the shipping address, nor the payment’s billing address,
to change. In a relational database, we would ensure that the address rows aren’t
updated for this case, making a new row instead. With aggregates, we can copy the
whole address structure into the aggregate as we need to.
In an alternative model, we could instead put all the orders for a customer inside
the customer aggregate; the sketch below shows the form with separate customer
and order aggregates.
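A rough JSON-style sketch of these two aggregates, written as Python dictionaries with illustrative field names (the exact fields are assumptions, not a fixed schema):

```python
customer = {
    "id": 1,
    "name": "Martin",
    "billing_addresses": [{"city": "Chicago"}],
}

order = {
    "id": 99,
    "customer_id": 1,
    "order_items": [
        {"product_id": 27, "price": 32.45, "name": "NoSQL Distilled"},
    ],
    # The address is copied as a value rather than referenced by ID,
    # so later changes to the customer's address don't alter this order.
    "shipping_address": {"city": "Chicago"},
    "order_payment": [
        {"card_number": "1000-1000-1000-1000",
         "txn_id": "abelif879rft",
         "billing_address": {"city": "Chicago"}},
    ],
}
```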
• Aggregation is not a logical data property; it is all about how the data is
being used by applications.
• An aggregate structure may help some data interactions while being an
obstacle for others.
Advantages
• Easy replication and sharding, since the aggregate is a natural unit of data
distribution.
Disadvantages
• No standard rules: there is no common model or query language shared across
aggregate-oriented stores.
• Key-value databases store data as pairs, where each key maps to a single
value.
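A minimal sketch of this model, using a plain Python dict as a stand-in for a real key-value store (keys and values here are purely illustrative):

```python
store = {}                                                   # stand-in for a key-value store

store["customer:42"] = {"name": "Ananya", "city": "Bengaluru"}   # put a value under a key
customer = store.get("customer:42")                               # get the value by its key
store.pop("customer:42", None)                                    # delete the key
```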
Key-Value vs. Document Databases
The two models are commonly compared on their data model, querying, performance,
schema, and use cases.
• Performance: Key-value databases are very fast for single key-based operations
due to simple indexing. Document databases may be slower than key-value
databases for simple key lookups, but they are optimized for complex queries
and indexing.
Column-Family Stores
Key Features:
Use Cases:
Graph Databases
• This type of database is chosen when the data consists of small records with
complex interconnections and many interactions between them need to be
handled.
• Graph databases are designed to handle and query data that is best
represented as a network of interconnected entities. They use graph
structures with nodes, edges, and properties to model and query relationships.
• Once you have built up a graph of nodes and edges, a graph database allows
you to query that network with query operations designed with this kind of
graph in mind.
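A minimal sketch of the graph model using plain Python structures; node names, edge labels, and the traversal query are illustrative assumptions, and a real graph database would index and optimize such traversals:

```python
# Nodes with properties, and labelled edges between them.
nodes = {
    "alice": {"type": "person"},
    "bob": {"type": "person"},
    "nosql": {"type": "book"},
}
edges = [
    ("alice", "friend_of", "bob"),
    ("bob", "likes", "nosql"),
]

def neighbours(node, relation):
    """Follow outgoing edges of a given relationship type."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]

# Query the network: which books do Alice's friends like?
books = [b for friend in neighbours("alice", "friend_of")
         for b in neighbours(friend, "likes")]
print(books)  # ['nosql']
```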
Relationships
1. Key-Value Stores:
Data Model: Store data as simple key-value pairs, with relationships managed
manually.
2. Document Stores:
Data Model: Store data as structured documents (for example, JSON), where
relationships can be expressed by embedding related data inside a document or by
referencing other documents’ IDs.
Schemaless Databases
• A common theme across all the forms of NoSQL databases is that they are
schemaless. When you want to store data in a relational database, you first
have to define a schema: a defined structure for the database which says
what tables exist, which columns exist, and what data types each column can
hold. Before you store some data, you have to have the schema defined for
it.
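A small sketch of what schemalessness means in practice, using an in-memory list as a stand-in for a document collection (field names are illustrative):

```python
# Records in the same collection need not share the same fields,
# and nothing has to be declared up front.
customers = []

customers.append({"id": 1, "name": "Ravi", "phone": "98450xxxxx"})
customers.append({"id": 2, "name": "Meera", "email": "meera@example.com",
                  "loyalty_points": 120})   # extra fields, no schema change needed
```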
Advantages of a schemaless DB
• Without a schema binding you, you can easily store whatever you need.
• A schemaless store also makes it easier to deal with nonuniform data.
Drawback
• Lack of Structure: May be less suitable for applications requiring strong data
normalization and relationships
Materialized views
Although NoSQL databases don’t have views, they may have precomputed and cached
queries, and they reuse the term “materialized view” to describe them.
In an aggregate data model, materialized views are used to pre-compute and store
aggregated data, which helps in improving query performance for complex
analytical queries. By storing aggregated results like sums, averages, or counts,
you reduce the computational overhead during query execution.
There are two rough strategies for building a materialized view:
1. The eager approach, where you update the materialized view at the same time
you update the base data for it. This approach is good when you have more
frequent reads of the materialized view than you have writes, and you want the
materialized views to be as fresh as possible.
2. Running batch jobs that recompute the materialized views at regular intervals,
which suits views that can tolerate being somewhat stale.
Materialized views can also be built inside the database itself: you provide the
computation that needs to be done, and the database executes the computation when
needed, according to parameters that you configure.
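A minimal sketch of the first (eager) strategy, with an in-memory structure standing in for the database and illustrative field names:

```python
orders = []                       # base data
revenue_by_product = {}           # materialized view: product -> total revenue

def add_order(order):
    """Update the base data and the materialized view in the same operation."""
    orders.append(order)
    for item in order["line_items"]:
        revenue_by_product[item["product_id"]] = (
            revenue_by_product.get(item["product_id"], 0)
            + item["qty"] * item["price"]
        )

add_order({"id": 1, "line_items": [{"product_id": "P1", "qty": 2, "price": 10.0}]})
print(revenue_by_product)   # {'P1': 20.0} -- reads of the view are always fresh
```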
Distribution Models
1. Single Server
Run the database on a single machine that handles all the reads and writes to the
data store. It’s easy for operations people to manage and easy for application
developers to reason about.
Even on a single server it can make sense to use a NoSQL store, simply because its
data model may be more suited to the application.
Advantage: It eliminates all the complexities that the other options introduce; it’s
easy for operations people to manage and easy for application developers to reason
about.
2. Sharding
Putting different parts of the data onto different servers is a technique called
sharding.
With sharding, we have different users all talking to different server nodes. Each
user only has to talk to one server, so gets rapid responses from that server, and
the load is balanced out nicely across the servers. To get this benefit, we have to
ensure that data that’s accessed together is clumped together on the same node and
that these clumps are arranged on the nodes to provide the best data access.
Aggregate orientation gives a natural way to clump data: combine data that’s
commonly accessed together into the same aggregate.
Another factor is trying to keep the load even. This means that you should try to
arrange aggregates so they are evenly distributed across the nodes, which all get
equal amounts of the load.
Auto-sharding
The database takes on the responsibility of allocating data to shards and ensuring
that data access goes to the right shard. This can make it much easier to use
sharding in an application.
Advantages
Disadvantages
Some databases are intended from the beginning to use sharding; others use sharding
as a deliberate step up from a single-server configuration, and in that case the
step from a single node to sharding can be tricky.
Example of sharding
One shard might handle transactions for accounts starting with 'A' to 'E', while
another manages accounts from 'F' to 'J'. This division reduces each shard's
workload, leading to faster processing and better network performance. It's a
practical example of the sharding idea in action; a small routing sketch follows.
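A small routing sketch of this range-based scheme (shard names and letter boundaries are illustrative assumptions):

```python
# Route an account to a shard by the first letter of its identifier.
SHARD_RANGES = {
    "shard-1": ("A", "E"),
    "shard-2": ("F", "J"),
    # further shards would cover the remaining letter ranges
}

def shard_for(account_name):
    first = account_name[0].upper()
    for shard, (lo, hi) in SHARD_RANGES.items():
        if lo <= first <= hi:
            return shard
    return "shard-other"

print(shard_for("Bhavana"))   # shard-1
print(shard_for("Ganesh"))    # shard-2
```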
Master-Slave Replication
Data is replicated across multiple nodes. One node is designated as the master, or
primary. This master is the authoritative source for the data and is usually
responsible for processing any updates to that data. The other nodes are slaves, or
secondaries. A replication process synchronizes the slaves with the master
Advantages
1.Master-slave replication is most helpful for scaling when you have a read-
intensive dataset. You can scale horizontally to handle more read requests by
adding more slave nodes and ensuring that all read requests are routed to the
slaves.
Limitations
1. Limited by the ability of the master to process updates and its ability to pass
those updates on.
2. It’s not a good scheme for datasets with heavy write traffic.
Peer-to-Peer Replication
Peer-to-peer replication has no master: all the replicas have equal weight, each can
accept writes, and the loss of any one of them doesn’t prevent access to the data
store.
Version Stamps
A version stamp is a field that changes every time the underlying data in a record
changes. By checking it, conflicting updates can be detected automatically, without
human intervention and without holding long transactions open.
There are various ways you can construct your version stamps.
1. Counter
2. GUID
3. Hash
4. Timestamp
GUID
Example: 6B29FC40-CA47-1067-B31D-00DD010662DA.
Version stamps help you detect concurrency conflicts. When you read data,
then update it, you can check the version stamp to ensure nobody updated the
data between your read and write.
Version stamps can be implemented using counters, GUIDs, content hashes,
timestamps, or a combination of these.
The simplest form of version stamp is a counter. Each time a node updates the
data, it increments the counter and puts the value of the counter into the
version stamp
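A minimal sketch of a counter-based version stamp used for an optimistic concurrency check (the record structure and function names are illustrative):

```python
record = {"value": {"balance": 100}, "version": 1}

def update(rec, new_value, expected_version):
    """Apply the write only if the stamp read earlier is still current."""
    if rec["version"] != expected_version:
        raise RuntimeError("conflict: data changed between your read and write")
    rec["value"] = new_value
    rec["version"] += 1          # each successful write bumps the stamp

v = record["version"]            # read the data and remember its stamp
update(record, {"balance": 80}, v)      # succeeds
# update(record, {"balance": 60}, v)    # would fail: the stamp is now stale
```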
Basic Map-Reduce
Let’s assume we have chosen orders as our aggregate, with each order having
line items. Each line item has a product ID, quantity, and the price charged.
This aggregate makes a lot of sense as usually people want to see the whole
order in one access. We have lots of orders, so we’ve sharded the dataset over
many machines. Sales analysis people want to see a product and its total
revenue for the last seven days. This report doesn’t fit the aggregate structure
that we have.
In order to get the product revenue report, you’ll have to visit every machine
in the cluster and examine many records on each machine.
The first stage in a map-reduce job is the map. A map is a function whose
input is a single aggregate and whose output is a bunch of key-value pairs. In
this case, the input would be an order. The output would be key-value pairs
corresponding to the line items. Each one would have the product ID as the
key and an embedded map with the quantity and price as the values
A map operation only operates on a single record; the reduce function takes
multiple map outputs with the same key and combines their values. So, a map
function might yield 1000 line items from orders for “Database Refactoring”;
the reduce function would reduce down to one, with the totals for the quantity
and revenue. While the map function is limited to working only on data from
a single aggregate, the reduce function can use all values emitted for a single
key
The map-reduce framework arranges for map tasks to be run on the correct
nodes to process all the documents and for data to be moved to the reduce
function. To make it easier to write the reduce function, the framework
collects all the values for a single key and calls the reduce function once
with the key and the collection of all those values.
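A minimal in-memory sketch of this map and reduce pair for the product-revenue report; a real framework would shard the orders and distribute these functions across nodes, and the field names here are illustrative:

```python
from collections import defaultdict

orders = [
    {"id": 1, "line_items": [{"product_id": "P1", "qty": 2, "price": 10.0},
                             {"product_id": "P2", "qty": 1, "price": 5.0}]},
    {"id": 2, "line_items": [{"product_id": "P1", "qty": 1, "price": 10.0}]},
]

def map_order(order):
    """Map: one aggregate (an order) in, key-value pairs per line item out."""
    for item in order["line_items"]:
        yield item["product_id"], {"qty": item["qty"],
                                   "revenue": item["qty"] * item["price"]}

def reduce_product(product_id, values):
    """Reduce: all values emitted for one key combined into a single total."""
    return {"product_id": product_id,
            "qty": sum(v["qty"] for v in values),
            "revenue": sum(v["revenue"] for v in values)}

grouped = defaultdict(list)           # the framework's shuffle/group step
for order in orders:
    for key, value in map_order(order):
        grouped[key].append(value)

print([reduce_product(k, vs) for k, vs in grouped.items()])
```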
The first thing we can do is increase parallelism by partitioning the output of the
mappers. The framework then takes the data from all the nodes for one partition,
combines it into a single group for that partition, and sends it off to a reducer
The next problem we can deal with is the amount of data being moved from node to
node between the map and reduce stages. Much of this data is repetitive, consisting
of multiple key-value pairs for the same key. A combiner function cuts this data
down by combining all the data for the same key into a single value.
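A sketch of such a combiner, reusing the shape of the reducer from the earlier sketch (values and names are illustrative):

```python
def combine(product_id, values):
    """Collapse all local pairs for one key into a single value before shipping."""
    return [{"qty": sum(v["qty"] for v in values),
             "revenue": sum(v["revenue"] for v in values)}]

# On one node, three pairs for "P1" become one pair before the data is moved:
local_values = [{"qty": 1, "revenue": 10.0},
                {"qty": 2, "revenue": 20.0},
                {"qty": 1, "revenue": 10.0}]
print(combine("P1", local_values))   # [{'qty': 4, 'revenue': 40.0}]
```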
The outputs from all the map tasks running on the various nodes are concatenated
together and sent into the reduce
When you have combining reducers, the map-reduce framework can safely run not
only in parallel (to reduce different partitions), but also in series to reduce the same
partition at different times and places.
Figure: This reduce function, which counts how many unique customers order
a particular tea, is not combinable.
Not all reduce functions are combinable. Consider a function that counts the
number of unique customers for a particular product. The map function for such an
operation would need to emit the product and the customer. The reducer can then
combine them and count how many times each customer appears for a particular
product, emitting the product and the count
A map operation can only work on a single aggregate, and a reduce operation can
only operate on a single key. This means you have to think differently about
structuring your programs so they work well within these constraints.
Consider the kind of orders we’ve been looking at so far; suppose we want
to know the average ordered quantity of each product. An important property of
averages is that they are not composable; that is, if I take two groups of orders, I
can’t combine their averages alone. Instead, I need to take the total amount and the
count of orders from each group, combine those, and then calculate the average
from the combined sum and count.
Figure: Calculating averages, the sum and count can be combined in the
reduce calculation, but the average must be calculated from the combined
sum and count.
This notion of looking for calculations that reduce neatly also affects how
we do counts. To make a count, the mapping function will emit count fields with a
value of 1, which can be summed to get a total count
Figure: When making a count, each map emits 1, which can be summed
to get a total.
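A small sketch combining both ideas: each map emits a count of 1 together with the quantity, and the average is derived only at the end from the combined sum and count (item shapes are assumed for illustration):

```python
def map_item(item):
    # assumed item shape: {"product_id": ..., "qty": ...}
    return item["product_id"], {"sum": item["qty"], "count": 1}

def reduce_quantities(product_id, values):
    total = sum(v["sum"] for v in values)     # sums are combinable
    count = sum(v["count"] for v in values)   # counts are combinable
    # the average is not combinable, so it is computed only at the very end
    return {"product_id": product_id, "average": total / count}

pairs = [map_item({"product_id": "P1", "qty": 3}),
         map_item({"product_id": "P1", "qty": 5})]
print(reduce_quantities("P1", [v for _, v in pairs]))   # average 4.0
```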
When map-reduce calculations get more complex, it’s useful to break them down
into stages with the output of one stage serving as input to the next.
Consider an example where we want to compare the sales of products for each
month in 2011 to the prior year. To do this, we’ll break the calculations down into
two stages. The first stage will produce records showing the aggregate figures for a
single product in a single month of the year. The second stage then uses these as
inputs and produces the result for a single product by comparing one month’s
results with the same month in the prior year
A first stage would read the original order records and output a series of key-value
pairs for the sales of each product per month. The only new feature is using a
composite key so that we can reduce records based on the values of multiple fields
Figure: The second stage mapper creates base records for year-on-year
comparisons.
Second-stage mappers
The second-stage mappers process this output depending on the year. A 2011
record populates the current year quantity while a 2010 record populates a prior
year quantity. Records for earlier years (such as 2009) don’t result in any mapping
output being emitted
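A compact sketch of the second stage, assuming the first stage has already produced per-product, per-month quantities keyed by a composite key (the data and field names are illustrative):

```python
from collections import defaultdict

stage1_output = {                       # composite key: (product, year, month)
    ("ayurvedic tea", 2011, 5): 12,
    ("ayurvedic tea", 2010, 5): 9,
    ("ayurvedic tea", 2009, 5): 7,      # earlier years are dropped by stage two
}

def stage2_map(key, qty):
    product, year, month = key
    if year == 2011:
        yield (product, month), {"current": qty, "prior": 0}
    elif year == 2010:
        yield (product, month), {"current": 0, "prior": qty}
    # records for earlier years emit no mapping output

def stage2_reduce(key, values):
    current = sum(v["current"] for v in values)
    prior = sum(v["prior"] for v in values)
    return {"key": key, "current": current, "prior": prior,
            "change_pct": 100.0 * (current - prior) / prior if prior else None}

grouped = defaultdict(list)
for k, qty in stage1_output.items():
    for key2, value in stage2_map(k, qty):
        grouped[key2].append(value)

print([stage2_reduce(k, vs) for k, vs in grouped.items()])
```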
Incremental Map-Reduce
The more complex case is the reduce step, since it pulls together the outputs from
many maps and any change in the map outputs could trigger a new reduction. This
recomputation can be lessened depending on how parallel the reduce step is. If we
are partitioning the data for reduction, then any partition that’s unchanged does not
need to be re-reduced. Similarly, if there’s a combiner step, it doesn’t need to be
rerun if its source data hasn’t changed
If the changes are additive—that is, if we are only adding new records but are not
changing or deleting any old records—then we can just run the reduce with the
existing result and the new additions.
If there are destructive changes, that is updates and deletes, then we can avoid
some recomputation by breaking up the reduce operation into steps and only
recalculating those steps whose inputs have changed.