Module 3 and 4
1. Explain Hive integration and the workflow steps involved, with a diagram.
2. Describe Map tasks, Reduce tasks, and the MapReduce execution process.
3. Describe the Hive architecture and its characteristics.
4. Describe the Pig architecture, and the features and applications of Pig.
5. Differentiate between Pig and MapReduce.
Module-3
2) Describe Map tasks, Reduce tasks, and the MapReduce execution process.
MapReduce is the data processing layer of Hadoop. It processes large volumes of
structured and unstructured data stored in HDFS.
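The map, shuffle, and reduce phases can be sketched in plain Python. This is a minimal single-process illustration, assuming a word-count job; it does not use the Hadoop API, and the function names (map_task, shuffle, reduce_task, run_job) are hypothetical.

```python
from collections import defaultdict

def map_task(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(mapped_pairs):
    """Shuffle and sort: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_task(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

def run_job(lines):
    """Run map on every line, then shuffle, then reduce each key."""
    mapped = [pair for line in lines for pair in map_task(line)]
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())

word_counts = run_job(["big data big cluster", "data node"])
print(word_counts)  # {'big': 2, 'cluster': 1, 'data': 2, 'node': 1}
```

In a real cluster the map tasks run in parallel on different input splits and the shuffle moves data between nodes; here everything runs in one process to keep the flow visible.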
A)
CAP Theorem
In distributed systems, the CAP Theorem states that among Consistency (C),
Availability (A), and Partition Tolerance (P), at most two can be fully guaranteed
at the same time. The principles are:
1. Consistency (C):
Consistency ensures that all copies of the data reflect the same value at any given
time, similar to traditional databases. In distributed databases, consistency means
that:
● All nodes observe the same data simultaneously.
● Changes made in one partition are immediately reflected in other related
partitions and tables that use that data.
2. Availability (A):
Availability ensures that the system provides a response to every request, even in the
event of a failure. This means:
● If one partition becomes inactive, other copies of the data in active partitions
remain accessible.
● Distributed systems use replication to maintain availability, ensuring that if one
node fails, another can handle requests.
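The replication point above can be sketched as follows. This is a minimal illustration under simplifying assumptions (two in-memory replicas holding identical data), not how any particular database implements failover; the Replica class and read_with_failover function are hypothetical names.

```python
class Replica:
    """One copy of the data, held by a single node (hypothetical model)."""

    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)
        self.alive = True

    def read(self, key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]

def read_with_failover(replicas, key):
    """Try each replica in turn and return the first successful answer."""
    for replica in replicas:
        try:
            return replica.read(key), replica.name
        except ConnectionError:
            continue  # this node failed, so try the next copy
    raise RuntimeError("no replica available")

replicas = [
    Replica("node-1", {"user:1": "Asha"}),
    Replica("node-2", {"user:1": "Asha"}),
]
replicas[0].alive = False  # simulate a node failure

value, served_by = read_with_failover(replicas, "user:1")
print(value, served_by)  # Asha node-2
```

The request still gets a response because a second copy exists; with a single copy, the same failure would make the data unavailable.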
● Advantages:
1. Can handle large amounts of data and heavy load.
2. Easy retrieval of data by keys.
● Examples:
1. DynamoDB
2. Column Store Database:
● Rather than storing data as relational tuples (rows), the data is stored in
individual cells that are grouped into columns.
● Column-oriented databases operate on columns rather than on whole rows.
● Advantages:
1. Data in a column is readily available, since a query reads only the columns it needs.
● Examples:
1. HBase
2. Bigtable (Google)
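The difference between row and column layout can be sketched in Python. This is only a toy model of the idea, assuming a small in-memory table; the sample data and the to_columns helper are hypothetical and do not reflect the storage format of HBase or Bigtable.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "city": "Pune", "sales": 120},
    {"id": 2, "city": "Delhi", "sales": 80},
    {"id": 3, "city": "Pune", "sales": 50},
]

def to_columns(rows):
    """Regroup the same cells so that each column is stored together."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# An aggregate over one column touches only that column's values.
total_sales = sum(columns["sales"])
print(columns["sales"], total_sales)  # [120, 80, 50] 250
```

Summing sales in the row layout would require visiting every record; in the column layout only the sales list is read.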
3. Document Database:
● The document database stores and retrieves data as key-value pairs, but
here the values are called documents.
● A document can be described as a complex data structure.
● Advantages:
1. This format is well suited to semi-structured data.
2. Storing, retrieving, and managing documents is easy.
● Examples:
1. MongoDB
2. CouchDB
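The document model can be sketched with nested Python dictionaries. This is a minimal in-memory illustration, assuming JSON-like documents; it does not use the MongoDB or CouchDB client APIs, and the insert and find helpers are hypothetical.

```python
import json

documents = {}  # a "collection": document id -> document

def insert(doc_id, document):
    """Store a JSON-compatible copy of the document under its id."""
    documents[doc_id] = json.loads(json.dumps(document))

def find(field, value):
    """Return every document whose top-level field equals the value."""
    return [doc for doc in documents.values() if doc.get(field) == value]

# Documents in one collection need not share the same fields.
insert("s1", {"name": "Ravi", "city": "Mysuru", "marks": {"bda": 78}})
insert("s2", {"name": "Meera", "city": "Hubli", "phones": ["98450", "99000"]})

matches = find("city", "Mysuru")
print(matches[0]["name"], matches[0]["marks"]["bda"])  # Ravi 78
```

The two documents carry different fields and nested values, which is what makes the format a good fit for semi-structured data.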
4. Graph Databases:
● This architecture pattern deals with the storage and management
of data in graphs.
● Graphs are structures that depict connections between two or
more objects in some data.
● Advantages:
1. Fast traversal, because connections between objects are stored directly.
2. Spatial data can be easily handled.
● Examples:
1. Neo4j
2. FlockDB
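Traversal over stored connections can be sketched with an adjacency list. This is a minimal illustration, assuming a small directed "follows" graph held in memory; it is not the query interface of Neo4j or FlockDB, and the graph data and reachable function are hypothetical.

```python
from collections import deque

# Each object maps directly to the objects it is connected to.
graph = {
    "Alice": ["Bob", "Carol"],
    "Bob": ["Dave"],
    "Carol": ["Dave"],
    "Dave": [],
}

def reachable(graph, start):
    """Breadth-first traversal: list every node reachable from start."""
    seen = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.append(neighbour)
                queue.append(neighbour)
    return seen

order = reachable(graph, "Alice")
print(order)  # ['Alice', 'Bob', 'Carol', 'Dave']
```

Each step follows a stored connection instead of joining tables, which is the reason traversal is cheap in this model.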
6) Explain Shared Nothing Architecture for Big Data tasks.
A) 1) Single Server Model
A single server processes data sequentially.
2) Sharding Model