MODULE 3: NoSQL
In a basic setup, the outputs of all mappers are concatenated and sent to a single
reduce function. This becomes inefficient as the size of the data grows. To add
parallelism, the key-value pairs are instead grouped into partitions based on the key.
These partitions are then shuffled and distributed to the corresponding reducers, so
multiple reducers work on different partitions in parallel, and the results are merged
at the end. Partitioning parallelizes the reduce side, but every mapper output still
crosses the network. The remedy is a combiner function, which processes the data on
the map side before it is transferred to the reducers. The combiner aggregates values
for the same key, reducing the amount of data transferred and cutting down on
network overhead.
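As a minimal sketch, not tied to any particular framework, the following Python
fragment shows how a combiner pre-aggregates a mapper's output before it crosses the
network; the function names and record shape are illustrative assumptions.

```python
from collections import defaultdict

def map_order(order):
    """Emit one (product, quantity) pair per line item in an order."""
    for item in order["items"]:
        yield item["product"], item["quantity"]

def combine(pairs):
    """Map-side combiner: pre-aggregate quantities per product so the
    node ships one pair per product instead of one per line item."""
    totals = defaultdict(int)
    for product, qty in pairs:
        totals[product] += qty
    return list(totals.items())

def reduce_totals(product, quantities):
    """Reducer: sum the (already partially combined) quantities."""
    return product, sum(quantities)
```

Because the combiner's output has the same shape as its input (key-value pairs in,
key-value pairs out), the framework can apply it any number of times before the final
reduce.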
Non-Combinable Reducers:
Some reduce functions cannot be used as combiners because their output has a
different form from their input. For instance, a reduce function that counts unique
customers for a product is not combinable: its output (a total count) differs from its
input (individual product-customer pairs), so it cannot be applied again to its own
results.
In such cases a different approach is used, such as eliminating duplicate pairs before
they reach the reducer, but this does not combine the data in the same way a true
combiner would.
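To make the distinction concrete, here is a hypothetical Python sketch: the
unique-customer reducer consumes product-customer pairs but emits a count, so it cannot
be fed its own output, whereas map-side de-duplication merely trims the data.

```python
def reduce_unique_customers(product, customer_ids):
    """Not combinable: the input is customer IDs but the output is a
    single count, so this cannot be re-applied to its own output."""
    return product, len(set(customer_ids))

def dedupe_before_shuffle(pairs):
    """Partial remedy: drop duplicate (product, customer) pairs on the
    map side. This shrinks the data but does not combine it."""
    return set(pairs)
```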
When using combinable reducers, not only can the map-reduce job run in parallel on
different partitions, but combining can occur across nodes as well. This flexibility
allows for earlier combining before all the mappers have completed, and even allows
some data combining to happen on the map side before it’s sent over the network.
Framework Considerations:
In the first stage, the goal is to summarize sales by product and month for each year. This
stage involves:
1. Mapping: Each input record (a single sale) is mapped to a key-value pair where the
key combines the year, month, and product, and the value is the quantity sold.
2. Reducing: All records with the same key (i.e., the same product in the same month of
the same year) are aggregated, summing up quantities. This gives the total sales for
each product in each month.
The reducer then aggregates these records to produce one record per product per
month.
In the second stage, the output from Stage 1 is processed to compare the sales of each product
in a given month with the previous year. This is achieved by:
1. Mapping: Each record is mapped, and the mapper identifies whether it belongs to the
current year (2011) or the previous year (2010).
2. Reducing: The reducer merges records for the same product and month from both
years, calculates the percentage increase or decrease, and produces a final record
showing the comparison.
Example: For the same product "puerh" in December 2011 and 2010, the reducer merges
the two monthly totals into a single record and computes the percentage change
between the years.
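A compact Python sketch of the two stages, under an assumed record shape (the field
order and sample figures are illustrative, not from the source):

```python
from collections import defaultdict

def stage1(sales):
    """Stage 1: total quantity per (year, month, product) key."""
    totals = defaultdict(int)
    for year, month, product, qty in sales:
        totals[(year, month, product)] += qty
    return totals

def stage2(totals, current_year=2011):
    """Stage 2: compare each month's total with the previous year."""
    results = {}
    for (year, month, product), qty in totals.items():
        prev = totals.get((year - 1, month, product))
        if year == current_year and prev:
            pct = 100.0 * (qty - prev) / prev
            results[(month, product)] = (prev, qty, pct)
    return results

sales = [(2011, 12, "puerh", 8), (2010, 12, "puerh", 10)]
print(stage2(stage1(sales)))  # {(12, 'puerh'): (10, 8, -20.0)}
```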
Parallelism: Each map and reduce task can be executed in parallel, making it efficient
for large datasets.
Cluster-Suitability: The final outputs are ideal for distributed storage, which enables
quick data access for downstream processing.
Language Support: Tools like Apache Pig and Hive simplify this model by providing
high-level abstractions over MapReduce, with user-friendly scripting and SQL-like
syntax on Hadoop. This is particularly helpful as data scales and demands for
high-volume processing increase.
Beyond NoSQL: MapReduce is useful in many data environments, not just NoSQL, and is
ideal for distributed processing on large datasets.
1. Map Function: The first phase of MapReduce is the map function, which processes
each data record independently. Each record, or "aggregate" in database terms, is
converted into a series of key-value pairs. For example, when processing orders that
contain line items (product IDs, quantities, and prices), the map function extracts each
product and associates it with its details (product ID as the key, quantity, and price as
values). This setup enables efficient data processing by focusing only on relevant
details for each record.
2. Parallelism and Independence: The map function processes each aggregate (order)
independently, making it highly parallelizable. Since each map operation works
without reference to others, the framework can assign these tasks across multiple
nodes in a cluster. This parallelism enables faster data processing by distributing tasks
across the system.
3. Reduce Function: The second phase, known as the reduce function, aggregates data
by combining all values associated with each unique key. The reduce function
processes collections of values with the same key—such as all orders containing a
specific product—and consolidates them into a single output. For example, if the map
phase produced several entries for a product (each detailing quantity and revenue
from different orders), the reduce function sums these values to yield total sales for
that product.
4. Framework Coordination: Between the two phases, the MapReduce framework handles
the shuffling and sorting of the key-value pairs, ensuring the appropriate data
reaches each reduce function. This coordination allows developers to focus on writing
the map and reduce functions without needing to handle data shuffling or parallel
task management directly.
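The following minimal Python sketch mirrors this description; the record layout,
sample data, and function names are assumptions for illustration:

```python
from collections import defaultdict

def map_order(order):
    """Map: emit (product_id, (quantity, revenue)) per line item."""
    for item in order["line_items"]:
        yield item["product_id"], (item["quantity"],
                                   item["quantity"] * item["price"])

def shuffle(pairs):
    """Group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_product(product_id, values):
    """Reduce: total quantity and revenue for one product."""
    return product_id, sum(q for q, _ in values), sum(r for _, r in values)

orders = [{"line_items": [{"product_id": "puerh", "quantity": 2, "price": 9.0}]},
          {"line_items": [{"product_id": "puerh", "quantity": 1, "price": 9.0}]}]
grouped = shuffle(p for o in orders for p in map_order(o))
print([reduce_product(k, v) for k, v in grouped.items()])  # [('puerh', 3, 27.0)]
```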
4. How are calculations composed in MapReduce? Explain with a neat diagram.
The MapReduce approach is a model designed for concurrent data processing, prioritizing
ease of parallelization over flexibility. Here’s an overview of its core principles and
limitations:
Constraints in MapReduce
Single Aggregate per Map Task: Each map task can only work with individual
records or aggregates (e.g., single orders), meaning that processing must be designed
to operate independently on each data entry without reference to others.
Single Key per Reduce Task: Each reduce task operates on values associated with a
specific key (e.g., one product ID), so computations must be structured around
aggregating values that share the same key.
Structuring Calculations
To use MapReduce effectively, calculations must fit within the model’s constraints. Here’s
how different calculations are handled:
1. Averaging Operations:
o An average cannot be computed by averaging partial averages, because the
mean of per-partition means is not the overall mean.
o Instead, each map task must output the total sum and count of quantities,
allowing the reduce function to combine these values. The final average is
computed from the combined sum and count, not from intermediate averages.
2. Counting Operations:
o To count records, each map task emits a count of 1 for every matching record,
and the reduce function sums these counts into the total.
Example Workflows:
In a product order analysis, each map function could output entries with a product ID
key, a count of 1, and a quantity. The reduce function then combines all entries with
the same key to produce total counts and quantities, enabling further calculations like
averages based on the combined data.
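A minimal Python sketch of the sum-and-count pattern (the partitions and quantities
are illustrative assumptions); note the wrong answer an average of partial averages
would give:

```python
# Each map task emits (sum, count) instead of a partial average.
partition_a = [4, 4, 4]   # quantities seen by one map task
partition_b = [10]        # quantities seen by another map task

emit_a = (sum(partition_a), len(partition_a))  # (12, 3)
emit_b = (sum(partition_b), len(partition_b))  # (10, 1)

# Reduce: combine sums and counts, then divide exactly once.
total_sum = emit_a[0] + emit_b[0]
total_count = emit_a[1] + emit_b[1]
print(total_sum / total_count)   # 5.5 (correct overall average)

# Averaging the per-partition averages would be wrong:
print((4.0 + 10.0) / 2)          # 7.0 (incorrect)
```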
5. What are key-value stores? List out some popular key-value databases. Explain how
all data is stored in a single bucket of a key-value data store.
Key-value stores are among the simplest and most high-performing types of NoSQL
databases, using a straightforward API model focused on basic operations for managing data.
Core Characteristics:
1. Basic Operations:
o The API is limited to three simple operations: get the value for a key, put
(store) a value for a key, and delete a key from the store.
2. Data Structure:
o Responsibility for understanding and managing the structure of stored data lies
entirely with the application.
3. Primary-Key Access:
o Data is stored and retrieved only by its primary key, which keeps lookups very
fast and makes these stores easy to scale.
Popular key-value databases include Riak, Redis, and Memcached.
Redis: Often referred to as a data structure server, Redis supports complex
structures like lists, sets, and hashes, enabling more versatile use and operations
such as unions and intersections.
Single Bucket Approach: All data (e.g., session data, shopping carts) can be stored
within a single bucket under one key-value pair, creating a unified object. However,
this can risk key conflicts due to different data types being stored under the same
bucket.
Separate Buckets for Data Types: By appending object names to keys or creating
specific buckets for each data type (e.g., sessionID_userProfile), it’s possible to avoid
key conflicts and access only the necessary object types without needing extensive
key design changes.
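As an illustrative sketch with a toy in-memory store (the key names and data are
assumptions), the two approaches look like this:

```python
class KeyValueStore:
    """Toy in-memory stand-in for a key-value database."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)

store = KeyValueStore()
session_id = "sess42"

# Single-bucket approach: one aggregate object under one key;
# different data types risk colliding under the same key.
store.put(session_id, {"userProfile": {"name": "Ann"},
                       "shoppingCart": ["puerh"]})

# Separate objects per type: append the object name to the key
# (e.g. sessionID_userProfile) so each type can be fetched alone.
store.put(session_id + "_userProfile", {"name": "Ann"})
store.put(session_id + "_shoppingCart", ["puerh"])
print(store.get(session_id + "_shoppingCart"))  # ['puerh']
```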
Redis supports structures such as lists and sets, allowing it to store more
structured information like states, visit logs, or address types, making it ideal for
data that requires order or grouping.
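For instance, with the redis-py client (a sketch; the keys and values are made up,
and a local Redis server is assumed):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# A list preserves order, e.g. a user's visit log.
r.rpush("user:42:visits", "/home", "/products", "/cart")
print(r.lrange("user:42:visits", 0, -1))

# A set groups unique members, e.g. the address types on file.
r.sadd("user:42:address_types", "billing", "shipping")
print(r.smembers("user:42:address_types"))
```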
The key-value store model provides a simple and efficient approach to data management,
offering features that differ significantly from those of traditional relational databases.
1. Consistency
Consistency applies only to operations on a single key. In distributed stores such as
Riak, replicas are eventually consistent, and conflicting writes can be resolved
either by keeping the newest write or by returning siblings for the client to
reconcile. This flexibility in consistency settings can be defined at the bucket
level, where options such as allowSiblings, nVal (replication factor), and w (write
quorum) enable control over the balance between data consistency and performance.
2. Transactions
Transactions in key-value stores are limited or non-existent due to the lack of support
for multi-key or multi-document transactions. To manage transactional requirements,
some key-value stores, like Riak, employ a quorum model for writes and reads. By
configuring values like N (total replicas), W (write quorum), and R (read quorum),
users can achieve a level of reliability in write success and data availability.
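The quorum arithmetic can be sketched in a few lines (the values are illustrative):
requiring W + R > N guarantees that every read quorum overlaps the latest successful
write.

```python
def quorum_overlaps(n, w, r):
    """True if every read quorum intersects every write quorum, so a
    read is guaranteed to see the latest successfully written value."""
    return w + r > n

# A common configuration: 3 replicas with majority writes and reads.
print(quorum_overlaps(3, 2, 2))  # True: 2 + 2 > 3
print(quorum_overlaps(3, 1, 1))  # False: faster, but reads may be stale
```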
3. Query Features
Key-value stores primarily support direct key-based lookups, without the complex
query capabilities found in SQL databases. This design is fast but limits flexibility, as
querying by fields within the value requires either application-level filtering or special
indexing capabilities (like Riak Search, which enables Lucene-based querying).
Key design becomes crucial, as the application must generate or derive meaningful
keys for efficient data retrieval. This constraint makes key-value stores ideal for
applications where queries are predictable, such as session storage or shopping carts.
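A hypothetical sketch of the contrast, using a plain dict as a stand-in store: a
lookup by a well-designed key is one operation, while querying by a field inside the
value forces the application to fetch and filter everything.

```python
# Toy store: key -> blob (a dict standing in for a JSON value).
store = {
    "cart:alice": {"items": ["puerh"], "total": 9.0},
    "cart:bob":   {"items": ["oolong"], "total": 12.0},
}

# Fast path: the application derives the key from a predictable
# pattern, so retrieval is a single lookup.
print(store["cart:alice"])

# Slow path: querying by a field inside the value means fetching
# and filtering every blob at the application level.
print([k for k, v in store.items() if v["total"] > 10])  # ['cart:bob']
```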
4. Structure of Data
The value part of key-value pairs is typically stored as a blob, leaving the content and
structure to the application. This flexibility allows for storing various data types (e.g.,
JSON, XML, text), but it also shifts the responsibility of data interpretation to the
client application.
For instance, Riak allows users to specify data types in requests via the Content-Type
header, which can simplify deserialization but does not affect how the database stores
the blob.
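As a hedged sketch against Riak's HTTP interface (the host, bucket, and key are
assumptions), storing and fetching a JSON blob might look like this; the Content-Type
travels with the object, but Riak still treats the value as a blob:

```python
import requests

# Store a JSON value; the Content-Type header records how clients
# should deserialize it, not how Riak stores it.
requests.put(
    "http://localhost:8098/buckets/session/keys/sess42",
    data='{"cart": ["puerh"]}',
    headers={"Content-Type": "application/json"},
)

resp = requests.get("http://localhost:8098/buckets/session/keys/sess42")
print(resp.headers.get("Content-Type"), resp.text)
```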
5. Scaling
Sharding, or partitioning data across multiple nodes based on keys, enables key-value
stores to scale horizontally. Each node handles a subset of keys, based on a
deterministic function, allowing seamless expansion by adding more nodes to the
cluster.
However, this approach also introduces risks; if a node responsible for certain keys
fails, data with those keys becomes unavailable until the node is restored. Key-value
stores address these issues with replication and settings for the CAP theorem (e.g., N,
R, and W values in Riak), offering a trade-off between consistency, availability, and
partition tolerance.
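A minimal sketch of key-based sharding (node names and hash choice are illustrative):
a deterministic function of the key selects the node, so the same key always lands on
the same node.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def node_for(key: str) -> str:
    """Deterministically map a key to a node."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

for key in ["sess42", "cart:alice", "user:42"]:
    print(key, "->", node_for(key))
```

Production stores typically refine this with consistent hashing, so that adding a
node relocates only a fraction of the keys rather than reshuffling them all.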
Suitable Use Cases:
Use Case: Storing session information, with each user session assigned a unique ID.
Advantage: Fast retrieval and storage in a single PUT or GET request, ideal for
storing session data.
Example Solution: Memcached or Riak can be used, with Riak offering enhanced
availability for session consistency across requests.
Use Case: User profiles and preferences tied to a single user.
Advantage: All user profile data can be stored in a single object, allowing quick
retrieval of preferences.
Example Solution: The profile can be stored with a unique user ID as the key,
making it simple to access user settings with a single GET.
Use Case: Shopping carts tied to individual users across sessions, browsers, and
devices.
Advantage: All cart information is stored under a unique userid key, ensuring high
availability.
Example Solution: A Riak cluster, which maintains availability and fault tolerance,
making it suitable for this application.
While key-value stores are effective for certain types of data storage, they are not ideal for
every scenario:
1. Data Relationships:
Limitation: Key-value stores lack the querying capability and relational structure that
relational databases provide.