
NOSQL Database 21CS745 Question Bank & Answers

MODULE 3

Question Bank with Answers

1 Explain, with a neat diagram, partitioning and combining in MapReduce


Parallelism with Partitioning:

•  In a basic setup, the outputs of all mappers are concatenated and sent into a single reduce function. This can become inefficient, especially as the size of the data grows.

•  To increase parallelism and minimize bottlenecks, we partition the output of the mappers. Each reducer operates on a subset of data associated with a specific key. This allows multiple reduce tasks to run in parallel, speeding up the process.

•  In this setup, the key-value pairs are grouped into partitions based on the key. These partitions are then shuffled and distributed to the corresponding reducers. Multiple reducers work on different partitions in parallel, and the results are merged at the end.

Data Transfer Reduction with Combining:

•  A significant issue in map-reduce jobs is the amount of data being transferred between the map and reduce phases. Much of the data consists of repeated key-value pairs for the same key.

•  The solution to this is a combiner function, which processes the data on the map side before it is transferred to the reducers. The combiner aggregates values for the same key, reducing the amount of data transferred. This helps cut down on network overhead.

•  A combiner function is essentially a mini-reduce function. In many cases, the combiner function can be the same as the reducer function, but with a constraint: the output of the combiner must match the input of the reduce function. These are called combinable reducers.
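
The idea can be sketched in plain Python (illustrative only, not any particular MapReduce framework; names such as sum_reducer are made up for this example). Because the summing function's output has the same shape as its input, the very same function can serve as the combiner on the map side and as the final reducer:

    # Minimal sketch of a combinable reducer: summing quantities per product key.
    from collections import defaultdict

    def sum_reducer(key, values):
        # Works both as a combiner (on each mapper's local output) and as the
        # final reducer, since partial sums can themselves be summed again.
        return key, sum(values)

    def run_combiner(mapper_output):
        # Group one mapper's key-value pairs by key and pre-aggregate them
        # before anything is sent across the network to the reducers.
        grouped = defaultdict(list)
        for key, value in mapper_output:
            grouped[key].append(value)
        return [sum_reducer(key, values) for key, values in grouped.items()]

    # One mapper emitted three pairs for "puerh"; the combiner collapses them
    # into a single pair, cutting the data shipped to the reduce phase.
    print(run_combiner([("puerh", 5), ("genmaicha", 2), ("puerh", 3), ("puerh", 1)]))
    # [('puerh', 9), ('genmaicha', 2)]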

Non-Combinable Reducers:

•  Some reduce functions cannot be used as combiners. For instance, a reduce function that counts unique customers for a product might not be combinable. This is because the output of such a reduce function (the total count) differs from the input (individual product-customer pairs).

•  In such cases, a different approach is used, such as eliminating duplicates before they reach the reducer, but this doesn’t combine the data in the same way as a combiner would.


Combining Across Mappers:

•  When using combinable reducers, not only can the map-reduce job run in parallel on different partitions, but combining can occur across nodes as well. This flexibility allows for earlier combining before all the mappers have completed, and even allows some data combining to happen on the map side before it’s sent over the network.

Framework Considerations:

•  Some map-reduce frameworks require all reducers to be combinable, which maximizes flexibility by allowing parallel and serial reductions. If a non-combinable reducer is necessary, it’s typically handled by breaking the processing into pipelined map-reduce steps.


2 Explain the two-stage MapReduce example, with a neat diagram


This "pipes-and-filters" model is beneficial when processing tasks involve multiple phases,
each of which can build upon the output of the previous stage.

Stage 1: Aggregate Monthly Sales

In the first stage, the goal is to summarize sales by product and month for each year. This
stage involves:

1. Mapping: Each input record (a single sale) is mapped to a key-value pair where the
key combines the year, month, and product, and the value is the quantity sold.

2. Reducing: All records with the same key (i.e., the same product in the same month of
the same year) are aggregated, summing up quantities. This gives the total sales for
each product in each month.

Example: For each sales record, the mapper might output:

•  Key: 2011:12:puerh

•  Value: quantity

The reducer then aggregates these records to produce one record per product per month, such
as:

•  {year: 2011, month: 12, product: puerh, quantity: 1200}.
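
A minimal Stage 1 sketch in plain Python (illustrative only; the record field names are assumptions, not any framework's API):

    # Stage 1: map each sale to a year:month:product key, then sum quantities.
    from collections import defaultdict

    def stage1_map(sale):
        # `sale` is assumed to look like:
        # {"year": 2011, "month": 12, "product": "puerh", "quantity": 700}
        key = f'{sale["year"]}:{sale["month"]}:{sale["product"]}'
        return key, sale["quantity"]

    def stage1_reduce(pairs):
        # Aggregate all quantities emitted for the same (year, month, product) key.
        totals = defaultdict(int)
        for key, quantity in pairs:
            totals[key] += quantity
        records = []
        for key, total in totals.items():
            year, month, product = key.split(":")
            records.append({"year": int(year), "month": int(month),
                            "product": product, "quantity": total})
        return records

    sales = [{"year": 2011, "month": 12, "product": "puerh", "quantity": 700},
             {"year": 2011, "month": 12, "product": "puerh", "quantity": 500}]
    print(stage1_reduce(stage1_map(s) for s in sales))
    # [{'year': 2011, 'month': 12, 'product': 'puerh', 'quantity': 1200}]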

Stage 2: Year-on-Year Comparison

In the second stage, the output from Stage 1 is processed to compare the sales of each product
in a given month with the previous year. This is achieved by:

1. Mapping: Each record is mapped, and the mapper identifies whether it belongs to the
current year (2011) or the previous year (2010).

2. Reducing: The reducer merges records for the same product and month from both
years, calculates the percentage increase or decrease, and produces a final record
showing the comparison.

Example: For the same product "puerh" in December 2011 and 2010, the reducer might
produce:

•  {product: puerh, month: 12, current_quantity: 1200, prior_quantity: 1000, increase: 20%}.
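
A matching Stage 2 sketch, again in plain Python with assumed field names, showing how the prior-year and current-year totals for one product/month key are merged into the comparison record:

    # Stage 2: compare each product/month total against the prior year.
    def stage2_map(record, current_year=2011):
        # Key on product and month; tag the value with the year it came from.
        tag = "current" if record["year"] == current_year else "prior"
        return (record["product"], record["month"]), (tag, record["quantity"])

    def stage2_reduce(key, tagged_values):
        product, month = key
        values = dict(tagged_values)          # e.g. {"current": 1200, "prior": 1000}
        current, prior = values.get("current", 0), values.get("prior", 0)
        increase = round(100.0 * (current - prior) / prior, 1) if prior else None
        return {"product": product, "month": month,
                "current_quantity": current, "prior_quantity": prior,
                "increase_pct": increase}

    print(stage2_reduce(("puerh", 12), [("current", 1200), ("prior", 1000)]))
    # {'product': 'puerh', 'month': 12, 'current_quantity': 1200,
    #  'prior_quantity': 1000, 'increase_pct': 20.0}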

Benefits of the Two-Stage Approach


•  Parallelism: Each map and reduce task can be executed in parallel, making it efficient for large datasets.

•  Reusability: The intermediate data can be stored, reused, or analyzed separately.

•  Cluster-Suitability: The final outputs are ideal for distributed storage, which enables quick data access for downstream processing.

Using tools like Apache Pig or Hive on Hadoop further simplifies this model by providing
high-level abstractions for MapReduce operations. This is particularly helpful as data scales
and demands for high-volume processing increase.

Reusable Intermediate Outputs: Intermediate results from MapReduce can be stored as materialized views, saving time and resources for future calculations.

Optimizing Query Patterns: Build materialized views based on actual queries, as speculative reuse can be inefficient.

Language Support: Tools like Apache Pig and Hive simplify MapReduce with user-friendly scripting and SQL-like syntax, making it easier to use with Hadoop.

Beyond NoSQL: MapReduce is useful in many data environments, not just NoSQL, and is ideal for distributed processing on large datasets.

Cluster-Friendly: MapReduce is well-suited for handling large volumes of data across clusters, making it a crucial tool as data processing demands grow.


3 Explain basic MapReduce, with a neat diagram


The MapReduce framework is a programming model designed to handle large-scale data
processing across distributed systems. It allows complex computations on large datasets by
breaking down tasks into parallelizable units, making it especially effective for handling tasks
like data aggregation and analysis.

Core Components of MapReduce

1. Map Function: The first phase of MapReduce is the map function, which processes
each data record independently. Each record, or "aggregate" in database terms, is
converted into a series of key-value pairs. For example, when processing orders that
contain line items (product IDs, quantities, and prices), the map function extracts each
product and associates it with its details (product ID as the key, quantity, and price as
values). This setup enables efficient data processing by focusing only on relevant
details for each record.

2. Parallelism and Independence: The map function processes each aggregate (order)
independently, making it highly parallelizable. Since each map operation works
without reference to others, the framework can assign these tasks across multiple
nodes in a cluster. This parallelism enables faster data processing by distributing tasks
across the system.

3. Reduce Function: The second phase, known as the reduce function, aggregates data
by combining all values associated with each unique key. The reduce function
processes collections of values with the same key—such as all orders containing a
specific product—and consolidates them into a single output. For example, if the map
phase produced several entries for a product (each detailing quantity and revenue
from different orders), the reduce function sums these values to yield total sales for
that product.

4. Framework Coordination: The MapReduce framework automatically manages data flow between the map and reduce phases, including moving and sorting key-value pairs and ensuring the appropriate data reaches the reduce function. This coordination allows developers to focus on writing the map and reduce functions without needing to handle data shuffling or parallel task management directly.
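
A compact sketch of the map and reduce functions described above, written in plain Python rather than a real framework (the order and line-item field names are assumptions; run_job stands in for the framework's shuffle step):

    # Basic MapReduce sketch for order data (plain Python, illustrative only).
    from collections import defaultdict

    def map_order(order):
        # Emit one (product_id, {quantity, revenue}) pair per line item; each
        # order is processed independently of all others.
        for item in order["line_items"]:
            yield item["product_id"], {"quantity": item["quantity"],
                                       "revenue": item["quantity"] * item["price"]}

    def reduce_product(product_id, values):
        # Combine every value emitted for one product into a single summary.
        total_qty = sum(v["quantity"] for v in values)
        total_rev = sum(v["revenue"] for v in values)
        return {"product_id": product_id, "quantity": total_qty, "revenue": total_rev}

    def run_job(orders):
        # Stand-in for the framework: shuffle pairs by key, then reduce each group.
        groups = defaultdict(list)
        for order in orders:
            for key, value in map_order(order):
                groups[key].append(value)
        return [reduce_product(key, values) for key, values in groups.items()]

    orders = [{"line_items": [{"product_id": "puerh", "quantity": 2, "price": 10.0}]},
              {"line_items": [{"product_id": "puerh", "quantity": 1, "price": 10.0}]}]
    print(run_job(orders))
    # [{'product_id': 'puerh', 'quantity': 3, 'revenue': 30.0}]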

4 How are calculations composed in MapReduce? Explain with a neat diagram

The MapReduce approach is a model designed for concurrent data processing, prioritizing
ease of parallelization over flexibility. Here’s an overview of its core principles and
limitations:

Constraints in MapReduce

•  Single Aggregate per Map Task: Each map task can only work with individual records or aggregates (e.g., single orders), meaning that processing must be designed to operate independently on each data entry without reference to others.

•  Single Key per Reduce Task: Each reduce task operates on values associated with a specific key (e.g., one product ID), so computations must be structured around aggregating values that share the same key.

Structuring Calculations

To use MapReduce effectively, calculations must fit within the model’s constraints. Here’s
how different calculations are handled:

1. Non-Composable Calculations (e.g., Averages):

o Calculating averages illustrates a limitation in MapReduce because averages are not composable: you can’t merge two average values directly.

o Instead, each map task must output the total sum and count of quantities, allowing the reduce function to combine these values. The final average is computed from the combined sum and count, not from intermediate averages.

2. Counting Operations:

o Counts are straightforward in MapReduce. Each map task emits a count of 1 for each occurrence, and the reduce function simply sums these to get the total count.


Example Workflows:

•  In a product order analysis, each map function could output entries with a product ID key, a count of 1, and a quantity. The reduce function then combines all entries with the same key to produce total counts and quantities, enabling further calculations like averages based on the combined data.
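
A small sketch of this sum-and-count approach (plain Python, illustrative names): partial (sum, count) pairs can be merged any number of times, and the division happens only once at the end.

    def map_quantity(order, product_id="puerh"):
        # Each map task emits a (sum, count) pair for one order, never an average.
        qty = sum(item["quantity"] for item in order["line_items"]
                  if item["product_id"] == product_id)
        return product_id, (qty, 1)

    def reduce_average(pairs):
        # (sum, count) pairs compose: merging partial pairs is always safe,
        # whereas merging two pre-computed averages is not.
        total = sum(s for s, _ in pairs)
        count = sum(c for _, c in pairs)
        return total / count if count else 0.0

    orders = [{"line_items": [{"product_id": "puerh", "quantity": 10}]},
              {"line_items": [{"product_id": "puerh", "quantity": 4}]},
              {"line_items": [{"product_id": "puerh", "quantity": 6}]}]
    pairs = [map_quantity(o)[1] for o in orders]
    print(reduce_average(pairs))   # 6.67: the average is computed once, at the end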

5 What are key-value stores? List some popular key-value databases. Explain how all data is stored in a single bucket of a key-value store

Key-value stores are among the simplest and most high-performing types of NoSQL
databases, using a straightforward API model focused on basic operations for managing data.

Core Characteristics:

1. Basic Operations:

o Get: Retrieve the value associated with a key.

o Put: Insert or update a value for a key.

o Delete: Remove a key and its associated value.

2. Data Structure:

o The value in a key-value store is an opaque blob (binary large object), meaning the database stores it without needing to interpret its content.

o Responsibility for understanding and managing the structure of stored data lies
entirely with the application.


3. Primary-Key Access:

o Key-value stores operate solely on primary keys, allowing efficient, direct access to data and making these databases highly performant and scalable.

Popular Key-Value Databases:

•  Riak: Uses a "bucket" structure for segmenting keys, aiding organization.

•  Redis: Often referred to as a data structure server, supports complex structures like lists, sets, and hashes, enabling more versatile use.

•  Memcached, Berkeley DB, HamsterDB, Amazon DynamoDB, Project Voldemort.

Advanced Features in Key-Value Databases:

•  Some stores, such as Redis, offer data structure support for lists, sets, and hashes, allowing for a range of operations like unions and intersections.

Bucket Organization in Key-Value Stores:

•  Single Bucket Approach: All data (e.g., session data, shopping carts) can be stored within a single bucket under one key-value pair, creating a unified object. However, this can risk key conflicts due to different data types being stored under the same bucket.

•  Separate Buckets for Data Types: By appending object names to keys or creating specific buckets for each data type (e.g., sessionID_userProfile), it’s possible to avoid key conflicts and access only the necessary object types without needing extensive key design changes.


Example of Redis Use:

•  Redis supports richer structures such as lists, sets, and hashes, allowing it to store more structured information like states, visit logs, or address types, making it ideal for data that requires order or grouping.
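
As an illustration of the basic get/put/delete operations and of Redis's richer structures, here is a short sketch using the redis-py client (it assumes a Redis server running on localhost:6379; the key names and stored values are made up for this example):

    import json
    import redis

    r = redis.Redis(host="localhost", port=6379, db=0)

    # Put: store an opaque value (here, a JSON blob the database never inspects).
    r.set("sessionid_1234_userProfile", json.dumps({"lang": "en", "tz": "IST"}))

    # Get: retrieve the blob by its key; the application interprets the content.
    profile = json.loads(r.get("sessionid_1234_userProfile"))

    # Delete: remove the key and its value.
    r.delete("sessionid_1234_userProfile")

    # Redis also exposes richer structures, e.g. a list of visited pages.
    r.rpush("sessionid_1234_visits", "/home", "/catalog", "/cart")
    print(r.lrange("sessionid_1234_visits", 0, -1))
    # [b'/home', b'/catalog', b'/cart']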

6 What are the features of key-value stores? Explain in detail

The key-value store model provides a simple and efficient approach to data management,
offering features that differ significantly from those of traditional relational databases.

1. Consistency

•  Key-value stores are typically optimized for high performance, particularly in distributed settings, using an eventually consistent model. This means that changes made to the data may take time to propagate across all nodes, which can lead to temporary inconsistencies. For instance, in Riak, users can choose either "last write wins" or "multiple values returned" for handling conflicting writes, allowing client-side resolution.

•  This flexibility in consistency settings can be defined at the bucket level, where options such as allowSiblings, nVal (replication factor), and w (write quorum) enable control over the balance between data consistency and performance.

2. Transactions

•  Transactions in key-value stores are limited or non-existent due to the lack of support for multi-key or multi-document transactions. To manage transactional requirements, some key-value stores, like Riak, employ a quorum model for writes and reads. By configuring values like N (total replicas), W (write quorum), and R (read quorum), users can achieve a level of reliability in write success and data availability.
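
The quorum arithmetic behind these settings can be shown with a tiny sketch (the values below are illustrative): a read quorum and a write quorum are guaranteed to overlap on at least one replica whenever W + R > N.

    def read_overlaps_write(n, w, r):
        # With N replicas, a write acknowledged by W nodes and a read served by
        # R nodes must share at least one node when W + R > N, so the read sees
        # the latest acknowledged write.
        return w + r > n

    print(read_overlaps_write(n=3, w=2, r=2))   # True: reads see the latest write
    print(read_overlaps_write(n=3, w=1, r=1))   # False: a read may miss the write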

3. Query Features

•  Key-value stores primarily support direct key-based lookups, without the complex query capabilities found in SQL databases. This design is fast but limits flexibility, as querying by fields within the value requires either application-level filtering or special indexing capabilities (like Riak Search, which enables Lucene-based querying).

•  Key design becomes crucial, as the application must generate or derive meaningful keys for efficient data retrieval. This constraint makes key-value stores ideal for applications where queries are predictable, such as session storage or shopping carts.

4. Structure of Data

•  The value part of key-value pairs is typically stored as a blob, leaving the content and structure to the application. This flexibility allows for storing various data types (e.g., JSON, XML, text), but it also shifts the responsibility of data interpretation to the client application.

•  For instance, Riak allows users to specify data types in requests via the Content-Type header, which can simplify deserialization but does not affect how the database stores the blob.

5. Scaling

•  Sharding, or partitioning data across multiple nodes based on keys, enables key-value stores to scale horizontally. Each node handles a subset of keys, based on a deterministic function, allowing seamless expansion by adding more nodes to the cluster.

•  However, this approach also introduces risks; if a node responsible for certain keys fails, data with those keys becomes unavailable until the node is restored. Key-value stores address these issues with replication and settings for the CAP theorem (e.g., N, R, and W values in Riak), offering a trade-off between consistency, availability, and partition tolerance.
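
A sketch of the deterministic key-to-node mapping used for sharding (plain Python, illustrative only; production stores generally use consistent hashing so that adding a node does not remap most keys):

    import hashlib

    NODES = ["node-a", "node-b", "node-c"]

    def node_for(key: str) -> str:
        # Hash the key deterministically, then map it onto one of the nodes.
        digest = hashlib.md5(key.encode()).hexdigest()
        return NODES[int(digest, 16) % len(NODES)]

    print(node_for("sessionid_1234"))   # the same key always routes to the same node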

7 Explain suitable use cases of key-value stores


Key-value stores offer a simple and efficient storage model suitable for applications where data can be represented as individual items with unique keys:

1. Storing Session Information:

•  Use Case: Each web session is assigned a unique sessionid.

•  Advantage: Fast retrieval and storage in a single PUT or GET request, ideal for storing session data.

•  Example Solution: Memcached or Riak can be used, with Riak offering enhanced availability for session consistency across requests.

2. User Profiles and Preferences:

•  Use Case: User-specific settings such as language, timezone, or access permissions.

•  Advantage: All user profile data can be stored in a single object, allowing quick retrieval of preferences.

•  Example Solution: The profile can be stored with a unique user ID as the key, making it simple to access user settings with a single GET.

3. Shopping Cart Data:

•  Use Case: Shopping carts tied to individual users across sessions, browsers, and devices.

•  Advantage: All cart information is stored under a unique userid key, ensuring high availability.

•  Example Solution: A Riak cluster, which maintains availability and fault tolerance, making it suitable for this application.
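
For illustration only (the example solution above is a Riak cluster, where the whole cart would be stored as an opaque value under the userid key; this sketch instead uses the redis-py client and made-up key and product names), a cart stored under a single user key might look like:

    import redis

    r = redis.Redis()   # assumes a local Redis server

    cart_key = "cart:user_42"
    r.hset(cart_key, "puerh-250g", 2)        # product id -> quantity
    r.hset(cart_key, "genmaicha-100g", 1)

    # One call returns the whole cart regardless of browser or device,
    # since everything hangs off the single user key.
    print(r.hgetall(cart_key))
    # {b'puerh-250g': b'2', b'genmaicha-100g': b'1'}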

When Not to Use Key-Value Stores

While key-value stores are effective for certain types of data storage, they are not ideal for
every scenario:

1. Data Relationships:

•  Challenge: Complex relationships or associations between data items are difficult to model in a key-value store.

•  Limitation: Key-value stores lack the querying capability and relational structure that relational databases provide.

•  Alternative: Consider a relational database or a graph database where relationships among entities are critical.

----------------------------------------END OF MODULE 3----------------------------------------------

Koustav Biswas. Dept. Of CSE, DSATM
