Module 3 and 4

Module-4

1. Explain Hive integration and the workflow steps involved, with a diagram.
2. Describe the Map tasks, Reduce tasks and the MapReduce execution process.
3. Describe the Hive architecture and its characteristics.
4. Describe the Pig architecture, the features of Pig, and its applications.
5. Differentiate between Pig and MapReduce.

Module-3

1. Explain the NoSQL data store and its characteristics.
2. Describe the principle of working of the CAP theorem.
3. Demonstrate the working of a key-value store with an example.
4. Explain NoSQL Data Architecture Patterns.
5. What are the different ways of handling Big Data problems?
6. Explain Shared Nothing Architecture for Big Data tasks.

2) Describe the Map tasks, Reduce tasks and the MapReduce execution process.

A) MapReduce is the data processing layer of Hadoop. It processes the huge volumes of
structured and unstructured data stored in HDFS.

MapReduce Execution Steps:


MapReduce processes the data in several phases with the help of different components.
The steps of job execution in Hadoop are as follows:
1. Input Files: The data for a MapReduce job is stored in input files, which reside in
HDFS. The input file format is arbitrary; line-based log files and binary formats can
both be used.
2. InputSplits: An InputSplit represents the portion of data that will be processed by an
individual Mapper. One map task is created for each split.
3. RecordReader: It communicates with the InputSplit and converts the data into
key-value pairs suitable for reading by the Mapper. By default, RecordReader uses
TextInputFormat to convert the data into key-value pairs.
4. Mapper: It processes each input record produced by the RecordReader and generates
intermediate key-value pairs. The intermediate output can be completely different from
the input pair; the output of the mapper is the full collection of key-value pairs.
5. Combiner: The combiner is a mini-reducer that performs local aggregation on the
mapper's output. It minimizes the data transfer between mapper and reducer.
6. Shuffling and Sorting: After partitioning, the output is shuffled to the reduce nodes.
Shuffling is the physical movement of the data over the network.
7. Reducer: The reducer takes the set of intermediate key-value pairs produced by the
mappers as its input, then runs a reduce function on each group of values to generate
the output.
8. Output: OutputFormat defines how the RecordWriter writes the output key-value
pairs to output files. The OutputFormat instances provided by Hadoop write files in
HDFS; thus they write the final output of the reducer to HDFS.
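
A minimal word-count sketch of the Mapper and Reducer phases, written as Hadoop
Streaming scripts in Python (the file names are placeholders; this is an
illustration, not a complete production job):

    #!/usr/bin/env python3
    # mapper.py -- Hadoop Streaming feeds input lines on stdin; the mapper
    # emits intermediate "key<TAB>value" pairs, here (word, 1).
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py -- the shuffle/sort phase delivers the pairs grouped and
    # sorted by key, so equal words arrive on consecutive lines.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        line = line.rstrip("\n")
        if not line:
            continue
        word, count = line.rsplit("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

A hypothetical invocation (all paths are placeholders): hadoop jar
hadoop-streaming.jar -input /data/in -output /data/out -mapper mapper.py
-reducer reducer.py -files mapper.py,reducer.py. Because summing counts is
associative, passing reducer.py with the -combiner option as well would make it
act as the combiner described in step 5.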

3) Describe the Hive architecture and its characteristics.


A)
1) Explain Hive integration and the workflow steps involved, with a diagram.
A)
4) Describe the Pig architecture, the features of Pig, and its applications.

A)

Execution Flow in Pig:


1. Parser: The parser checks the script for syntax errors and type errors and outputs a
DAG (directed acyclic graph) of the Pig Latin statements, called the logical plan.
2. Optimizer: The DAG is sent to the logical optimizer, where optimization
activities are performed automatically to reduce data flow and improve
efficiency.
3. Compiler: After optimization, the compiler generates a series of MapReduce
jobs that correspond to the logical plan.
4. Execution Engine:
a. The MapReduce jobs are submitted for execution by the execution
engine.
b. The jobs run on the cluster, performing the computations, and output the
final result.
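
For comparison (see also question 5, Pig versus MapReduce), the dataflow a short
Pig script declares can be mimicked in a few lines of plain Python. The file
name and field layout below are assumptions; the sketch only illustrates the
LOAD -> FILTER -> GROUP -> COUNT pipeline that the parser turns into a DAG:

    from collections import Counter

    with open("access_log.txt") as f:                            # LOAD
        records = (line.split() for line in f)
        errors = (r for r in records if r and r[-1] == "500")    # FILTER
        counts = Counter(r[0] for r in errors)                   # GROUP + COUNT

    for host, n in counts.items():                               # DUMP / STORE
        print(host, n)

In Pig, each of these steps would be one Pig Latin statement, and the compiler
would translate the optimized plan into MapReduce jobs as described above.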
Module-3
1) What is NoSQL? Explain the CAP Theorem.
A) NoSQL refers to a class of database management systems that don’t rely on the
traditional relational database model (tables, rows, and SQL for querying).
● NoSQL databases are designed for flexibility, scalability, and
high-performance data storage.

CAP Theorem
In distributed systems, the CAP Theorem states that among Consistency (C),
Availability (A), and Partition Tolerance (P), only two can be fully achieved
simultaneously. Here’s a breakdown of these principles:

1. Consistency (C)
Consistency ensures that all copies of the data reflect the same value at any given
time, similar to traditional databases. In distributed databases, consistency means
that:
● All nodes observe the same data simultaneously.
● Changes made in one partition should immediately reflect in other related
partitions and tables using that data.
2. Availability (A):
Availability ensures that the system provides a response to every request, even in the
event of a failure. This means:
● If one partition becomes inactive, other copies of the data in active partitions
remain accessible.
● Distributed systems use replication to maintain availability, ensuring that if one
node fails, another can handle requests.

3. Partition Tolerance (P):


Partition tolerance ensures the system continues functioning even when there is a
network failure or communication breakdown between parts of the system. It involves:
● Dividing a large database into smaller partitions while ensuring they operate
independently without disrupting overall functionality.
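
A toy sketch of the trade-off during a partition (the two-replica setup and all
names are illustrative assumptions, not any real database's API): a CP-style
system rejects writes it cannot replicate, while an AP-style system accepts
them and lets the replicas diverge until the partition heals.

    class Replica:
        def __init__(self):
            self.value = None

    def write(primary, secondary, value, partitioned, mode):
        if not partitioned:
            primary.value = secondary.value = value   # normal replication
            return "ok"
        if mode == "CP":
            # Consistency over availability: refuse the write rather than
            # let the two copies disagree.
            return "error: replica unreachable"
        # mode == "AP": availability over consistency: accept the write
        # locally; the secondary is now stale until the partition heals.
        primary.value = value
        return "ok (replicas may disagree)"

    a, b = Replica(), Replica()
    print(write(a, b, "x=1", partitioned=True, mode="CP"))  # rejected
    print(write(a, b, "x=1", partitioned=True, mode="AP"))  # accepted, b stale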

4) Explain NoSQL Data Architecture Patterns.


A)
1. Key-Value Store:
● The simplest way to implement a schema-less data store is to use
key-value pairs (a minimal sketch follows this list).
● The characteristics of such a data store are high performance,
scalability and flexibility.


● Advantages:
1. Can handle large amounts of data and a heavy load.
2. Easy retrieval of data by key.
● Examples:
1. DynamoDB
2. Column Store Database:
● Rather than storing data in relational tuples, the data is stored in
individual cells which are further grouped into columns.
● Column-oriented databases operate on columns, so a query reads only
the columns it needs rather than entire rows.
● Advantages:
1. Data is readily available for column-wise reads and aggregation.
● Examples:
1. HBase
2. Bigtable (by Google)
3. Document Database:
● The document database fetches and accumulates data in the form of
key-value pairs, but here the values are called documents.
● A document can be stated as a complex data structure.
● Advantages:
1. This type of format is very useful and apt for semi-structured data.
2. Storage, retrieval, and management of documents are easy.
● Examples:
1. MongoDB
2. CouchDB
4. Graph Databases:
● This architecture pattern deals with the storage and management
of data as graphs.
● Graphs are structures that depict connections between two or
more objects in the data.
● Advantages:
1. Fast traversal, because the connections between objects are stored directly.
2. Spatial data can be easily handled.
● Examples:
1. Neo4J
2. FlockDB
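
A minimal in-memory sketch of how a key-value store works (this also answers
question 3 of Module-3). The class and method names are illustrative, not the
API of DynamoDB or any real store; the point is that values are opaque to the
store and are fetched by a single key lookup:

    class KeyValueStore:
        def __init__(self):
            self._data = {}

        def put(self, key, value):
            self._data[key] = value               # the value is an opaque blob

        def get(self, key, default=None):
            return self._data.get(key, default)   # single lookup by key

        def delete(self, key):
            self._data.pop(key, None)

    store = KeyValueStore()
    store.put("user:42", {"name": "Asha", "cart": ["book", "pen"]})
    print(store.get("user:42"))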
6) Explain Shared Nothing Architecture for Big Data tasks.
A) In a shared-nothing architecture, each node is independent and self-sufficient:
nodes share neither memory nor disk, so any node can work on its portion of the
data without contending with the others. Big Data stores distribute data across
such nodes using the following models:

1) Single Server Model
● A single server processes the data sequentially; this suits workloads that fit
on one machine.

2) Sharding Model
● The data set is split into parts (shards) and each shard is placed on its own
node, so every node processes only the data it holds (a sketch follows this
answer).

3) Master-Slave Distribution Model
● One master node handles writes and directs slave nodes, which hold replicas
and serve reads.

4) Peer-to-Peer Distribution Model
● All nodes are equal: any node can accept reads and writes, and the peers
replicate data among themselves.
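
A minimal sketch of hash-based sharding in a shared-nothing setup (the shard
count, key format, and use of Python dicts as stand-in nodes are all
assumptions):

    import hashlib

    NUM_SHARDS = 4
    shards = [{} for _ in range(NUM_SHARDS)]   # one dict per independent node

    def shard_for(key: str) -> int:
        # Hash the key so records spread evenly across the shards.
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

    def put(key, value):
        shards[shard_for(key)][key] = value    # only one node owns this key

    def get(key):
        return shards[shard_for(key)].get(key)

    put("user:42", "Asha")
    print(get("user:42"), "is on shard", shard_for("user:42"))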
