ADBMS - Module 2

ADBMS & BIGDATA NOTES -22MCS23

The Value of Relational Databases


Getting at Persistent Data

The primary value of a database is keeping large amounts of persistent data. Most computer architectures have the notion of two areas of memory: a fast, volatile "main memory" and a larger but slower "backing store." Main memory is limited in space and loses all its data when power is lost or the operating system fails. Therefore, to keep data around, we write it to a backing store, commonly seen as a disk. For most enterprise applications, however, the backing store is a database. The database allows more flexibility than a file system in storing large amounts of data in a way that allows an application program to get at small bits of that information quickly and easily.

Concurrency

• Enterprise applications can have lots of users and other systems all working concurrently.

• Relational databases help handle this by controlling all access to their data through transactions.

• Transactions also play a role in error handling. With transactions, you can make a change, and if an error occurs during the processing of the change, you can roll back the transaction to clean things up.

Integration

• Enterprise applications live in a rich ecosystem that requires multiple applications, written by different teams, to collaborate in order to get things done.

• A common way to do this is shared database integration, where multiple applications store their data in a single database.

Standard Model

• Developers and database professionals can learn the basic relational model
and apply it in many projects.


Drawbacks of RDBMS

Impedance Mismatch

• The impedance mismatch is the difference between the relational model and the in-memory data structures.

• Relations provide a certain elegance and simplicity, but they also introduce limitations. In particular, the values in a relational tuple have to be simple; they cannot contain any structure, such as a nested record or a list. If you want to use a richer in-memory data structure, you have to translate it to a relational representation to store it on disk. Hence the impedance mismatch: two different representations that require translation.

• An order, which looks like a single aggregate structure in the UI, is split into many rows from many tables in a relational database (refer to the following figure and the sketch below).
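To make the mismatch concrete, here is a minimal Python sketch (the field names and tables are illustrative, not taken from any specific schema) of how a nested in-memory order has to be flattened into rows for several relational tables and rebuilt on the way back:

# A nested in-memory order: one aggregate structure.
order = {
    "order_id": 99,
    "customer_id": 1,
    "line_items": [
        {"product_id": 27, "qty": 1, "price": 32.45},
        {"product_id": 13, "qty": 3, "price": 18.00},
    ],
    "payment": {"card": "Amex", "billing_address_id": 55},
}

# To persist it relationally, the single aggregate is split across tables.
orders_row     = (99, 1)                                   # orders table
line_item_rows = [(99, 27, 1, 32.45), (99, 13, 3, 18.00)]  # order_lines table
payment_row    = (99, "Amex", 55)                          # payments table

# Reading the order back means joining these tables and rebuilding the nested
# structure -- the translation step the text calls the impedance mismatch.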

Advantage of SQL over OO languages

• Object-oriented programming languages rose in popularity, and with them came object-oriented databases.

• The impedance mismatch has been made much easier to deal with by the wide availability of object-relational mapping (ORM) frameworks.

Drawback of shared data integration

• If an application wants to make changes to its data storage, it needs to coordinate with all the other applications using the database. Different applications have different structural and performance needs, so an index required by one application may cause a problematic hit on inserts for another. As a result, the responsibility for data integrity has to be taken within the database itself.

Resolving the data integration issue

• A different approach is to treat your database as an application database, which is only directly accessed by a single application codebase that's looked after by a single team. With an application database, only the team using the application needs to know about the database structure, which makes it much easier to maintain and evolve the schema. Since the application team controls both the database and the application code, the responsibility for database integrity can be put in the application code.

Interoperability concerns

• A consequence of the shift to web services as an integration mechanism was more flexibility in the structure of the data being exchanged. If you communicate with SQL, the data must be structured as relations. However, with a service, you are able to use richer data structures with nested records and lists. These are usually represented as documents in XML or, more recently, JSON.

• Once you have made the decision to use an application database, you get more freedom in choosing a database. Since there is a decoupling between your internal database and the services with which you talk to the outside world, the outside world doesn't have to care how you store your data, allowing you to consider nonrelational options.

Attack of the Clusters

• Coping with the increase in data and traffic required more computing
resources. To handle this kind of increase, you have two choices: up or out.


• Scaling up implies bigger machines, more processors, disk storage, and memory. But bigger machines get more and more expensive, not to mention that there are real limits as your size increases.

• The alternative is to use lots of small machines in a cluster. A cluster of small machines can use commodity hardware and ends up being cheaper at these kinds of scales. It can also be more resilient: while individual machine failures are common, the overall cluster can be built to keep going despite such failures, providing high reliability.

• Relational databases are not designed to be run on clusters. Clustered relational databases, such as Oracle RAC or Microsoft SQL Server, work on the concept of a shared disk subsystem.

Aggregate Data Models:

• The term aggregate means a collection of objects that we wish to treat as a unit; an aggregate is a collection of data that we interact with as a unit. These units of data, or aggregates, form the boundaries for ACID operations. This definition matches really well with how key-value, document, and column-family databases work. Dealing in aggregates makes it much easier for these databases to handle operating on a cluster.

Example of Relations and Aggregates

• Let's assume we have to build an e-commerce website; we are going to be selling items directly to customers over the web, and we will have to store information about users, our product catalog, orders, shipping addresses, billing addresses, and payment data. We can use this scenario to model the data using a relational data store as well as NoSQL data stores and talk about their pros and cons. For a relational database, we might start with the data model shown in the following figure.

Figure: Data model oriented around a relational database (using UML)


Typical data using RDBMS data model

Everything is properly normalized, so that no data is repeated in multiple tables. We also have referential integrity.


Figure: Typical data using RDBMS data model

Figure: An aggregate data model


In this model, we have two main aggregates: customer and order. We’ve used the
black-diamond composition marker in UML to show how data fits into the
aggregation structure. The customer contains a list of billing addresses; the order
contains a list of order items, a shipping address, and payments. The payment itself
contains a billing address for that payment.

A single logical address record appears three times in the example data, but instead
of using IDs it’s treated as a value and copied each time. This fits the domain
where we would not want the shipping address, nor the payment’s billing address,
to change. In a relational database, we would ensure that the address rows aren’t
updated for this case, making a new row instead. With aggregates, we can copy the
whole address structure into the aggregate as we need to.
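A minimal sketch of the two aggregates described above as Python dictionaries (the field names are illustrative); note how the address is copied into each place it is needed, as a value, rather than referenced by ID:

# Customer aggregate: holds the customer's own data and billing addresses.
customer = {
    "id": 1,
    "name": "Martin",
    "billing_addresses": [{"city": "Chicago"}],
}

# Order aggregate: items, shipping address, and payments all live inside it.
order = {
    "id": 99,
    "customer_id": 1,
    "order_items": [{"product_id": 27, "price": 32.45, "quantity": 1}],
    "shipping_address": {"city": "Chicago"},          # copied as a value
    "order_payments": [{
        "card_number": "1000-1000-1000-1000",
        "billing_address": {"city": "Chicago"},       # copied again, not referenced
    }],
}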

An alternative arrangement is to put all the orders for a customer into the customer aggregate itself.

Consequences of Aggregate Orientation:



• Aggregation is not a logical data property. It is all about how the data is being used by applications.

• An aggregate structure may help with some data interactions but be an obstacle for others.

• It has an important consequence for transactions.

• NoSQL databases generally don't support ACID transactions that span multiple aggregates, thus sacrificing consistency across aggregates.

Advantages

• It can be used as a primary data source for online applications.

• Easy replication.

• No single point of failure.

• It provides fast performance and horizontal scalability.

• It can handle structured, semi-structured, and unstructured data with equal effort.

• Aggregate-oriented databases support the atomic manipulation of a single aggregate at a time.

Disadvantages

• No standard rules.

• Limited query capabilities.

• Doesn’t work well with relational data.

• Not so popular in the enterprise.

• When the volume of data increases, it is difficult to maintain unique values.

Data Models in NoSQL


Key-Value and Document Data Models

• Key-value databases store data as pairs, where each key maps to a single
value.

• Document databases, on the other hand, store data in documents (usually JSON or BSON) that can include nested structures.

Comparison between the key-value and document data models

Structure:
  Key-value: Data is stored as a collection of key-value pairs. Each key is unique and maps to a single value.
  Document: Data is stored in documents, which are typically JSON, BSON, or XML formatted.

Value type:
  Key-value: Values can be simple data types (strings, numbers) or more complex types (binary data).
  Document: Documents can contain complex data structures with nested arrays and objects.

Use case:
  Key-value: Best for scenarios where you need to retrieve or store values based on a unique key, such as caching, session storage, or storing user preferences.
  Document: Best suited to complex, hierarchical data that benefits from flexible schemas and rich querying.


Querying:
  Key-value: Generally limited to operations like GET, PUT, DELETE, and sometimes basic range queries or prefix matching.
  Document: Supports rich querying capabilities, including filtering, sorting, and aggregations.

Performance:
  Key-value: Very fast for single key-based operations due to simple indexing.
  Document: May be slower than key-value databases for simple key lookups, but optimized for complex queries and indexing.


Schema:
  Key-value: Schema-less; data is stored in a straightforward key-value format without any constraints or requirements on the structure of values.
  Document: Schema-less as well, but allows for complex and nested document structures. Different documents in the same collection can have different fields.

Use cases:
  Key-value: Caching (e.g., Redis), session storage, real-time analytics (e.g., Amazon DynamoDB), user preferences, and simple lookups.
  Document: Content management systems (e.g., MongoDB), user profiles with variable attributes, e-commerce catalogs, and similar applications.

• Key-Value Databases: Best for simple, high-performance use cases where data retrieval is based on a unique key. They offer simplicity and speed but are limited in querying and data complexity.

• Document Databases: Better suited for applications needing a flexible schema and rich querying capabilities. They handle complex, hierarchical data well and offer powerful querying features (the two models are contrasted in the sketch below).
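The contrast can be sketched with two common client libraries. This assumes local Redis and MongoDB instances and uses the redis and pymongo packages purely for illustration; the keys, collections, and fields are made up:

import json

import redis
from pymongo import MongoClient

# Key-value: the store sees only an opaque value under a unique key.
kv = redis.Redis()
kv.set("session:42", json.dumps({"user": "anna", "cart": [27, 13]}))
session = json.loads(kv.get("session:42"))            # lookup is by key only

# Document: the store understands the structure, so fields can be queried.
mongo = MongoClient()
users = mongo.shop.users
users.insert_one({"name": "anna", "address": {"city": "Chicago"}})
chicago_users = users.find({"address.city": "Chicago"})   # query inside documents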

Column-Family Stores

A column-store database organizes data by columns rather than rows, storing values for each column separately. This approach is highly efficient for read-heavy operations, particularly in analytical workloads where queries often involve aggregations over large datasets.

Key Features:

• Data Storage: Data is stored column-wise, making it faster to access and aggregate data from specific columns without scanning irrelevant rows.

• Compression: Columnar storage allows for better compression, as similar data types are stored together, reducing storage space.

• Query Performance: Optimized for read-heavy queries and analytical workloads, where operations involve scanning and aggregating data across multiple rows.

Use Cases:

• Analytical Processing: Ideal for data warehousing and business intelligence applications where complex queries and large-scale aggregations are common (e.g., Apache Cassandra, Google Bigtable).

• Data Warehousing: Often used in data warehouses where query performance is crucial for large-scale data analysis (e.g., Amazon Redshift, Snowflake).

• Examples of column-family stores include Apache Cassandra, DataStax, Microsoft Azure Cosmos DB, and ScyllaDB (a conceptual sketch of column-wise storage follows).
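A conceptual Python sketch of the column-wise layout (not tied to any particular product): the same orders stored row-wise and column-wise, showing why a per-column aggregation is cheap:

# Row-wise layout: each record keeps all its fields together.
rows = [
    {"order_id": 1, "product": "tea",    "qty": 2, "revenue": 8.00},
    {"order_id": 2, "product": "coffee", "qty": 1, "revenue": 4.50},
    {"order_id": 3, "product": "tea",    "qty": 5, "revenue": 20.00},
]

# Column-wise layout: each column's values are stored together, so an
# aggregation over one column touches only that column's data, and
# similar values sitting together compress well.
columns = {
    "order_id": [1, 2, 3],
    "product":  ["tea", "coffee", "tea"],
    "qty":      [2, 1, 5],
    "revenue":  [8.00, 4.50, 20.00],
}

total_revenue = sum(columns["revenue"])    # scans one column, not whole rows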


Graph Databases
• This type of database is chosen when the data consists of small records with complex interconnections that need to be traversed.

• Graph databases are designed to handle and query data that is best represented as a network of interconnected entities. They use graph structures with nodes, edges, and properties to model and query relationships.

• Once you have built up a graph of nodes and edges, a graph database allows you to query that network with graph-oriented query operations (a small sketch follows the figure below).

Figure: An example graph structure
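A minimal sketch of a graph held as nodes and edges, with a query that follows relationships; the node names and edge labels are made up for illustration:

# Each edge is (source node, relationship label, destination node).
edges = [
    ("Anna",    "friend", "Barbara"),
    ("Barbara", "friend", "Carol"),
    ("Anna",    "likes",  "NoSQL Distilled"),
    ("Carol",   "likes",  "NoSQL Distilled"),
]

def neighbours(node, label):
    """Nodes reachable from `node` over edges carrying the given label."""
    return [dst for src, lbl, dst in edges if src == node and lbl == label]

# Query: which books do Anna's friends (including friends of friends) like?
friends = set(neighbours("Anna", "friend"))
friends |= {f2 for f in friends for f2 in neighbours(f, "friend")}
liked = {book for f in friends for book in neighbours(f, "likes")}
print(liked)    # {'NoSQL Distilled'}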

Relationships

1. Key-Value Stores:


Data Model: Store data as simple key-value pairs, with relationships managed
manually.

Relationships: Typically, there is no inherent support for relationships. You need to implement relationship logic at the application level, such as using keys to reference related values or managing joins manually.

2. Document Stores:

Relationships:

Embedded documents: Relationships can be represented by embedding related data within a document. This is efficient for related data that is frequently accessed together.

References: For larger or more complex relationships, documents can reference other documents using IDs or links. This approach is useful for handling one-to-many relationships (both options are sketched below).
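A small sketch of the two options as Python dictionaries (the collection and field names are hypothetical):

# 1. Embedding: the related data lives inside the parent document.
order_embedded = {
    "_id": 99,
    "customer": {"name": "Anna", "city": "Chicago"},    # embedded sub-document
    "items": [{"product_id": 27, "qty": 1}],
}

# 2. Referencing: the document stores the other document's id instead;
#    the application (or the database's lookup features) resolves it.
customer_doc = {"_id": 1, "name": "Anna", "city": "Chicago"}
order_referenced = {
    "_id": 99,
    "customer_id": 1,                                   # reference by id
    "items": [{"product_id": 27, "qty": 1}],
}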

Schemaless Databases

• A common theme across all the forms of NoSQL databases is that they are schemaless. When you want to store data in a relational database, you first have to define a schema—a defined structure for the database which says what tables exist, which columns exist, and what data types each column can hold. Before you store some data, you have to have the schema defined for it.

• With NoSQL databases, storing data is much more casual. A key-value store allows you to store any data you like under a key. A document database effectively does the same thing, since it makes no restrictions on the structure of the documents you store. Column-family databases allow you to store any data under any column you like. Graph databases allow you to freely add new edges and freely add properties to nodes and edges as you wish (see the short sketch below).
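A short sketch of what this looks like in practice: records of different shapes stored side by side, with the application code left to cope with missing fields (the fields here are illustrative):

# Two "documents" in the same collection with different fields.
people = [
    {"name": "Anna", "email": "anna@example.com"},
    {"name": "Raj", "phones": ["+91-98765", "+91-91234"], "vip": True},
]

# No schema stops this, so the reading code must handle absent fields --
# this is the implicit schema discussed under the drawbacks below.
for p in people:
    print(p["name"], p.get("email", "no email on record"))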

Advantages of a schemaless DB

• Schemalessness provides freedom and flexibility

• Without a schema binding you, you can easily store whatever you need.

• This allows you to easily change your data storage

DEPT.OF.CSE,BMSIT&M
ADBMS & BIGDATA NOTES -22MCS23

• Handling new columns is also easy.

• A schemaless store also makes it easier to deal with nonuniform data.

Drawbacks

• An implicit schema embedded in the application code can cause problems of its own.

• Data Integrity: Without a defined schema, ensuring data consistency and validity can be more challenging.

• Complexity in Querying: Queries may become complex if data structures vary widely or if there is a need to handle missing or optional fields.

• Lack of Structure: May be less suitable for applications requiring strong data normalization and relationships.

Materialized views

In aggregate-oriented data models, if you want to access orders, it's useful to have all the data for an order contained in a single aggregate that can be stored and accessed as a unit. But if a product manager wants to know how much a particular item has sold over the last couple of weeks, this organization forces you to potentially read every order in the database to answer the question. Relational databases support accessing data in different ways: they provide a convenient mechanism that allows you to look at data differently from the way it's stored, called views.

NoSQL databases don't have views; they may have precomputed and cached queries, and they reuse the term "materialized view" to describe them.

In an aggregate data model, materialized views are used to pre-compute and store
aggregated data, which helps in improving query performance for complex
analytical queries. By storing aggregated results like sums, averages, or counts,
you reduce the computational overhead during query execution.

There are two rough strategies to building a materialized view.

1. Update the materialized view at the same time you update the base data for it. This approach is good when you have more frequent reads of the materialized view than you have writes, and you want the materialized views to be as fresh as possible.

2. Update the materialized views at regular intervals. In this case, you provide the computation that needs to be done, and the database executes the computation when needed, according to some parameters that you configure (both strategies are sketched below).
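A rough sketch of the two strategies for a per-product sales view; the names and structures are illustrative rather than any particular database's API:

orders = []              # base data
sales_by_product = {}    # the "materialized view"

# Strategy 1: update the view in the same step as the base data
# (keeps the view fresh; best when reads far outnumber writes).
def record_order(product, qty):
    orders.append({"product": product, "qty": qty})
    sales_by_product[product] = sales_by_product.get(product, 0) + qty

# Strategy 2: rebuild the view at regular intervals from the base data
# (a batch job or the database would run this on the schedule you configure).
def rebuild_view():
    totals = {}
    for o in orders:
        totals[o["product"]] = totals.get(o["product"], 0) + o["qty"]
    return totals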

Distribution Models

NoSQL databases have the ability to run on a large cluster. Depending on your distribution model, you can get a data store that gives you the ability to handle larger quantities of data, to process greater read or write traffic, or to provide more availability in the face of network slowdowns or breakages. Broadly, there are two paths to distribution: sharding and replication. Replication comes in two forms: master-slave and peer-to-peer.

1. Single Server

Run the database on a single machine that handles all the reads and writes to the data store. It's easy for operations people to manage and easy for application developers to reason about. It can make sense to use NoSQL with a single-server distribution model if the data model of the NoSQL store is more suited to the application.

Advantage: It eliminates all the complexities that the other options introduce; it's easy for operations people to manage and easy for application developers to reason about.

2. Sharding

Putting different parts of the data onto different servers is a technique called sharding.


With sharding, different users all talk to different server nodes. Each user only has to talk to one server, so gets rapid responses from that server, and the load is balanced across the servers. To get this benefit, we have to ensure that data that's accessed together is clumped together on the same node and that these clumps are arranged on the nodes to provide the best data access. Aggregate orientation gives us a natural unit for clumping: combine data that's commonly accessed together.

Arranging the data on the nodes

If most accesses of certain aggregates are based on a physical location, you can place the data close to where it's being accessed.

Another factor is trying to keep the load even. This means that you should try to arrange aggregates so they are evenly distributed across the nodes, which all get equal amounts of the load.

Auto-sharding

The database takes on the responsibility of allocating data to shards and ensuring that data access goes to the right shard. This can make it much easier to use sharding in an application.

Advantages


1. It can improve both read and write performance.

Disadvantages

1. Sharding alone is likely to decrease resilience. Since the data is on different nodes, a node failure makes that shard's data unavailable.

When to use sharding?

Some databases are intended from the beginning to use sharding; others treat sharding as a deliberate step up from a single-server configuration. In the latter case, the step from a single node to sharding is going to be tricky.

Example of sharding

One shard might handle transactions for accounts starting with 'A' to 'E', while another manages accounts from 'F' to 'J'. This division reduces each shard's workload, leading to faster processing and better network performance. It's a practical example of sharding in action (a small sketch follows).
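A minimal sketch of such range-based sharding; the shard names and letter ranges are illustrative:

# Each shard owns a range of leading letters of the account name.
SHARDS = {"shard-1": ("A", "E"), "shard-2": ("F", "J"), "shard-3": ("K", "Z")}

def shard_for(account_name):
    """Route an account to the shard whose letter range covers it."""
    first = account_name[0].upper()
    for shard, (lo, hi) in SHARDS.items():
        if lo <= first <= hi:
            return shard
    raise ValueError("no shard covers " + account_name)

print(shard_for("Edwards"))   # shard-1
print(shard_for("Garcia"))    # shard-2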

Master-Slave Replication

Data is replicated across multiple nodes. One node is designated as the master, or
primary. This master is the authoritative source for the data and is usually
responsible for processing any updates to that data. The other nodes are slaves, or
secondaries. A replication process synchronizes the slaves with the master.


Master-slave replication is most helpful for scaling when you have a read-intensive dataset. You can scale horizontally to handle more read requests by adding more slave nodes and ensuring that all read requests are routed to the slaves.

Advantages

1. Master-slave replication is most helpful for scaling when you have a read-intensive dataset. You can scale horizontally to handle more read requests by adding more slave nodes and ensuring that all read requests are routed to the slaves.

2. A second advantage of master-slave replication is read resilience: should the master fail, the slaves can still handle read requests.

Limitations

1. Scaling is limited by the ability of the master to process updates and by its ability to pass those updates on.

2. It's not a good scheme for datasets with heavy write traffic.

Peer-to-Peer Replication

It overcomes the drawbacks of master-slave replication. Master-slave replication helps with read scalability but doesn't help with scalability of writes; it provides resilience against failure of a slave, but not of a master. Peer-to-peer replication attacks these problems by not having a master: all the replicas have equal weight, they can all accept writes, and the loss of any of them doesn't prevent access to the data store.

Version Stamps

A version stamp is a field that changes every time the underlying data in a record changes; by checking it before an update, conflicting changes can be detected without human intervention.

There are various ways you can construct your version stamps.

1. Counter

2. GUID

3. Hash

4. Timestamp

GUID

A GUID is a large random number that's guaranteed to be unique. GUIDs use some combination of dates, hardware information, and whatever other sources of randomness they can pick up. The nice thing about GUIDs is that they can be generated by anyone and you'll never get a duplicate.

A GUID (globally unique identifier) is a 128-bit text string that represents an identification (ID). Organizations generate GUIDs when a unique reference number is needed to identify information on a computer or network. A GUID can be used to ID hardware, software, accounts, documents, and other items.

A GUID is a 128-bit value consisting of one group of 8 hexadecimal digits, followed by three groups of 4 hexadecimal digits each, followed by one group of 12 hexadecimal digits. The following example GUID shows the groupings of hexadecimal digits in a GUID:

Example: 6B29FC40-CA47-1067-B31D-00DD010662DA.
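For illustration, Python's standard uuid module generates identifiers of this kind:

import uuid

# uuid4() produces a random 128-bit identifier rendered in the
# 8-4-4-4-12 hexadecimal grouping shown above.
stamp = uuid.uuid4()
print(stamp)    # e.g. 16fd2706-8baf-433b-82eb-8c7fada847da; differs on every call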

Version Stamps on Multiple Nodes

 Version stamps help you detect concurrency conflicts. When you read data, then update it, you can check the version stamp to ensure nobody updated the data between your read and write.
 Version stamps can be implemented using counters, GUIDs, content hashes,
timestamps, or a combination of these.

 With distributed systems, a vector of version stamps allows you to detect when different nodes have conflicting updates.
 The basic version stamp works well when you have a single authoritative
source for data, such as a single server or master-slave replication. In that case
the version stamp is controlled by the master. Any slaves follow the master’s
stamps. But this system has to be enhanced in a peer-to-peer distribution
model because there’s no longer a single place to set the version stamps


 The simplest form of version stamp is a counter. Each time a node updates the data, it increments the counter and puts the value of the counter into the version stamp (a small sketch follows).
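A minimal sketch of a counter-based version stamp and of a vector of per-node counters used to detect conflicting updates between peers; all names are illustrative:

# Single authoritative source: one counter per record.
record = {"value": "draft", "version": 3}

def update(record, new_value, expected_version):
    """Apply an update only if nobody changed the record since it was read."""
    if record["version"] != expected_version:
        raise RuntimeError("conflict: the data changed between read and write")
    record["value"] = new_value
    record["version"] += 1

# Peer-to-peer: a vector of counters, one entry per node.
v1 = {"nodeA": 2, "nodeB": 1}
v2 = {"nodeA": 1, "nodeB": 2}

def conflicting(a, b):
    """If neither vector dominates the other, the updates conflict."""
    nodes = set(a) | set(b)
    a_ge = all(a.get(n, 0) >= b.get(n, 0) for n in nodes)
    b_ge = all(b.get(n, 0) >= a.get(n, 0) for n in nodes)
    return not (a_ge or b_ge)

print(conflicting(v1, v2))    # True: concurrent updates on different nodes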

Map-Reduce Function

 The map-reduce pattern (a form of Scatter-Gather [Hohpe and Woolf]) is a way to organize processing so as to take advantage of multiple machines on a cluster while keeping as much of the processing and the data it needs together on the same machine as possible.

Basic Map-Reduce
 Let’s assume we have chosen orders as our aggregate, with each order having
line items. Each line item has a product ID, quantity, and the price charged.
This aggregate makes a lot of sense as usually people want to see the whole
order in one access. We have lots of orders, so we’ve sharded the dataset over
many machines. Sales analysis people want to see a product and its total
revenue for the last seven days. This report doesn’t fit the aggregate structure
that we have.

 In order to get the product revenue report, you’ll have to visit every machine
in the cluster and examine many records on each machine.

 The first stage in a map-reduce job is the map. A map is a function whose
input is a single aggregate and whose output is a bunch of key- value pairs. In
this case, the input would be an order. The output would be key-value pairs
corresponding to the line items. Each one would have the product ID as the
key and an embedded map with the quantity and price as the values
 A map operation only operates on a single record; the reduce function takes
multiple map outputs with the same key and combines their values. So, a map
function might yield 1000 line items from orders for “Database Refactoring”;
the reduce function would reduce down to one, with the totals for the quantity
and revenue. While the map function is limited to working only on data from
a single aggregate, the reduce function can use all values emitted for a single
key

 The map-reduce framework arranges for map tasks to be run on the correct nodes to process all the documents and for data to be moved to the reduce function. To make it easier to write the reduce function, the framework collects all the values for a single key and calls the reduce function once with that key and its collection of values (a small sketch of the map and reduce functions follows).
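A small Python sketch of the map and reduce functions for this product-revenue report, with a tiny in-memory driver standing in for the framework's shuffle step (the order structure and field names are assumptions):

from collections import defaultdict

def map_order(order):
    """Map: one order aggregate in, one key-value pair per line item out."""
    for item in order["line_items"]:
        yield item["product_id"], {"qty": item["qty"],
                                   "revenue": item["qty"] * item["price"]}

def reduce_product(product_id, values):
    """Reduce: all values emitted for one product key combined into totals."""
    total = {"qty": 0, "revenue": 0.0}
    for v in values:
        total["qty"] += v["qty"]
        total["revenue"] += v["revenue"]
    return product_id, total

orders = [
    {"line_items": [{"product_id": "database-refactoring", "qty": 2, "price": 40.0}]},
    {"line_items": [{"product_id": "database-refactoring", "qty": 1, "price": 40.0}]},
]

grouped = defaultdict(list)          # the framework's "collect by key" step
for order in orders:
    for key, value in map_order(order):
        grouped[key].append(value)

print([reduce_product(k, vs) for k, vs in grouped.items()])
# [('database-refactoring', {'qty': 3, 'revenue': 120.0})]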


Partitioning and Combining


In the simplest form, we think of a map-reduce job as having a single reduce function. The outputs from all the map tasks running on the various nodes are concatenated together and sent into the reduce. This works, but it has drawbacks: less parallelism and a lot of data transfer.

The first thing we can do is increase parallelism by partitioning the output of the mappers. The framework then takes the data from all the nodes for one partition, combines it into a single group for that partition, and sends it off to a reducer.

The next problem we can deal with is the amount of data being moved from node to node between the map and reduce stages. A combiner function cuts this data down by combining all the data for the same key into a single value.

The outputs from all the map tasks running on the various nodes are concatenated
together and sent into the reduce

When you have combining reducers, the map-reduce framework can safely run not
only in parallel (to reduce different partitions), but also in series to reduce the same
partition at different times and places.

Figure: Partitioning allows reduce functions to run in parallel on different keys.

It allows you to run multiple reducers in parallel

Multiple keys are grouped together into partitions.

The next thing to deal with is the amount of data being moved from node to node between the map and reduce stages. Much of this data is repetitive, consisting of multiple key-value pairs for the same key. A combiner function cuts this data down by combining all the data for the same key into a single value.
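A sketch of a combiner for the same product-revenue pairs: on each node, the map outputs for a given key are collapsed into a single value before anything is sent across the network (names are illustrative):

def combine(pairs):
    """pairs: iterable of (product_id, {"qty": ..., "revenue": ...}) from one node."""
    combined = {}
    for key, value in pairs:
        slot = combined.setdefault(key, {"qty": 0, "revenue": 0.0})
        slot["qty"] += value["qty"]
        slot["revenue"] += value["revenue"]
    return list(combined.items())    # at most one pair per key leaves the node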


Figure: Combining reduces data before sending it across the network.

Figure: This reduce function, which counts how many unique customers order
a particular tea, is not combinable.

Not all reduce functions are combinable. Consider a function that counts the
number of unique customers for a particular product. The map function for such an
operation would need to emit the product and the customer. The reducer can then
combine them and count how many times each customer appears for a particular
product, emitting the product and the count.

Composing Map-Reduce Calculations


There are constraints on what you can do in your calculations. Within a map task,
you can only operate on a single aggregate. Within a reduce task, you can only

DEPT.OF.CSE,BMSIT&M
ADBMS & BIGDATA NOTES -22MCS23

operate on a single key. This means you have to think differently about structuring
your programs so they work well within these constraints.

Consider the kind of orders we've been looking at so far; suppose we want to know the average ordered quantity of each product. An important property of averages is that they are not composable; that is, if I take two groups of orders, I can't combine their averages alone. Instead, I need to take the total amount and the count of orders from each group, combine those, and then calculate the average from the combined sum and count.

Figure: Calculating averages, the sum and count can be combined in the
reduce calculation, but the average must be calculated from the combined
sum and count.

This notion of looking for calculations that reduce neatly also affects how
we do counts. To make a count, the mapping function will emit count fields with a
value of 1, which can be summed to get a total count.


Figure: When making a count, each map emits 1, which can be summed
to get a total.
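A small sketch of the composable form: the map side emits a sum and a count, the reduce adds them, and the average is derived only at the end; counting works the same way by summing emitted 1s (names are illustrative):

def reduce_quantities(values):
    """values: per-order dicts like {"sum": 5, "count": 1}."""
    total = {"sum": 0, "count": 0}
    for v in values:
        total["sum"] += v["sum"]
        total["count"] += v["count"]
    total["average"] = total["sum"] / total["count"]    # derived last, never combined
    return total

print(reduce_quantities([{"sum": 3, "count": 1}, {"sum": 5, "count": 1}]))
# {'sum': 8, 'count': 2, 'average': 4.0}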

A Two-Stage Map-Reduce Example

When map-reduce calculations get more complex, it’s useful to break them down
into stages with the output of one stage serving as input to the next.

Consider an example where we want to compare the sales of products for each
month in 2011 to the prior year. To do this, we’ll break the calculations down into
two stages. The first stage will produce records showing the aggregate figures for a
single product in a single month of the year. The second stage then uses these as
inputs and produces the result for a single product by comparing one month’s
results with the same month in the prior year


The first stage would read the original order records and output a series of key-value pairs for the sales of each product per month. The only new feature is using a composite key so that we can reduce records based on the values of multiple fields.

Figure: Creating records for monthly sales of a product

Figure: The second stage mapper creates base records for year-on-year
comparisons.


Second-stage mappers

The second-stage mappers process this output depending on the year. A 2011
record populates the current year quantity while a 2010 record populates a prior
year quantity. Records for earlier years (such as 2009) don’t result in any mapping
output being emitted.

The reduce in the following figure is a merge of records, where combining the values by summing allows two different year outputs to be reduced to a single value.

Figure: The reduction step is a merge of incomplete records.



Incremental Map-Reduce

It's useful to structure a map-reduce computation to allow incremental updates, so that only the minimum computation needs to be done. The map stages of a map-reduce are easy to handle incrementally: only if the input data changes does the mapper need to be rerun. Since maps are isolated from each other, incremental updates are straightforward.

The more complex case is the reduce step, since it pulls together the outputs from
many maps and any change in the map outputs could trigger a new reduction. This
recomputation can be lessened depending on how parallel the reduce step is. If we
are partitioning the data for reduction, then any partition that’s unchanged does not
need to be re-reduced. Similarly, if there’s a combiner step, it doesn’t need to be
rerun if its source data hasn't changed.

If our reducer is combinable, there are some more opportunities for computation avoidance.


If the changes are additive—that is, if we are only adding new records but are not
changing or deleting any old records—then we can just run the reduce with the
existing result and the new additions.

If there are destructive changes, that is updates and deletes, then we can avoid
some recomputation by breaking up the reduce operation into steps and only
recalculating those steps whose inputs have changed.

