Big Data Unit 2 Notes
Hadoop was created by Doug Cutting and Mike Cafarella in 2005, inspired by Google's MapReduce and
Google File System (GFS) papers.
Named after the toy elephant of Cutting's son, Hadoop was initially developed to support distributed
processing of large datasets across clusters of commodity hardware.
The project was later contributed to the Apache Software Foundation and became an open-source
platform for big data processing and analytics.
Apache Hadoop:
Apache Hadoop is an open-source framework for distributed storage and processing of big data.
It provides a scalable and fault-tolerant ecosystem for storing, processing, and analyzing large datasets
across clusters of computers.
Key components of Apache Hadoop include Hadoop Distributed File System (HDFS) for storage and
MapReduce for processing.
HDFS (Hadoop Distributed File System):
HDFS is a distributed file system designed to store very large files, accessed in a streaming (write-once,
read-many) fashion, across multiple nodes in a Hadoop cluster.
It provides high throughput access to application data and ensures fault tolerance by replicating data
across multiple nodes.
HDFS follows a master-slave architecture with NameNode as the master and DataNodes as slaves.
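As a hedged sketch of how an application talks to HDFS through the NameNode and DataNodes, the snippet below writes and reads a small file with the Java FileSystem API; the file path is a made-up example, and the cluster address is assumed to come from the usual core-site.xml/hdfs-site.xml configuration on the classpath.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml if present
        FileSystem fs = FileSystem.get(conf);          // handle to the default file system (HDFS here)

        Path file = new Path("/user/demo/notes.txt");  // hypothetical path

        // Write a small file; HDFS splits it into blocks and replicates each block across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back; the NameNode supplies block locations, the DataNodes stream the bytes.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```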
Components of Hadoop:
HDFS (Hadoop Distributed File System): Stores data across a distributed cluster of machines.
MapReduce: Provides a programming model and processing framework for distributed processing of
large datasets.
YARN (Yet Another Resource Negotiator): Manages resources and schedules tasks across the Hadoop
cluster.
Hadoop Common: Contains libraries and utilities used by other Hadoop modules.
Hadoop Ecosystem: Includes various tools and frameworks built on top of Hadoop for specific tasks like
data ingestion, processing, and analysis.
Data Format:
Hadoop supports a wide range of data formats: optimized binary formats such as Avro (row-oriented) and
Parquet (columnar), semi-structured formats such as JSON and XML, and unstructured plain text.
It can handle data in any format, but optimized formats like Avro and Parquet improve performance and
storage efficiency through compact encoding, compression, and embedded schemas.
Analyzing Data with Hadoop:
Hadoop enables analyzing large datasets using the MapReduce programming model, which breaks
processing into map and reduce phases.
Map tasks process input data and emit intermediate key-value pairs, which are then shuffled, sorted,
and aggregated by reduce tasks to produce the final output.
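A minimal sketch of the two phases, using the classic word-count example rather than anything specific to these notes: the mapper emits a (word, 1) pair per token, and after the shuffle and sort the reducer sums the counts for each word. Class names are illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every token in a line of input.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // intermediate key-value pair
            }
        }
    }
}

// Reduce phase: values for the same word arrive grouped after the shuffle and sort.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));  // final output: (word, total)
    }
}
```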
Scaling Out:
Hadoop allows scaling out by adding more nodes to the cluster to accommodate increasing data
volumes and processing demands.
It redistributes data and reschedules work away from failed or overloaded nodes, so operation stays
smooth as the cluster grows.
Hadoop Streaming:
Hadoop Streaming enables running non-Java programs (e.g., Python or Perl scripts) as MapReduce jobs
by passing records to and from the mapper and reducer processes through their standard input and
output streams.
Hadoop Ecosystem:
The Hadoop ecosystem consists of various tools and frameworks that extend the functionality of Hadoop
for specific use cases, such as Hive (SQL-like querying), Pig (data-flow scripting), HBase (NoSQL storage),
Sqoop and Flume (data ingestion), and Oozie (workflow scheduling).
MapReduce:
MapReduce is a programming model and processing framework for parallel processing of large datasets
across distributed clusters.
Each job runs a map phase followed by a reduce phase, with a shuffle-and-sort step in between that
moves intermediate key-value pairs from the mappers to the reducers and groups them by key.
Developing a MapReduce Application:
Developing a MapReduce application involves implementing map and reduce functions to process input
data and produce output.
Hadoop provides APIs for writing MapReduce applications in Java, Python (using Hadoop Streaming),
and other programming languages.
Anatomy of a MapReduce Job Run:
A MapReduce job consists of multiple map and reduce tasks, which are executed across nodes in the
Hadoop cluster.
Input data is divided into splits, processed by map tasks in parallel, and then shuffled, sorted, and
aggregated by reduce tasks.
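A minimal driver sketch showing how such a job is assembled and submitted; it assumes the hypothetical WordCountMapper and WordCountReducer from the earlier sketch and takes the input and output paths from the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");   // job name is arbitrary
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);       // map phase
        job.setReducerClass(WordCountReducer.class);     // reduce phase

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input is divided into splits
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist

        System.exit(job.waitForCompletion(true) ? 0 : 1); // submit and wait for the cluster to finish
    }
}
```

It would be packaged as a jar and launched with something like hadoop jar wordcount.jar WordCountDriver /input/path /output/path (jar name and paths are illustrative).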
Failures:
Hadoop handles failures gracefully: a failed map or reduce task is automatically re-executed on another
healthy node, up to a configurable number of attempts, before the job as a whole is marked as failed.
Data replication in HDFS ensures fault tolerance by storing multiple copies of data across the cluster.
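As a hedged illustration, the snippet below shows configuration properties that govern this behavior; the property names (dfs.replication, mapreduce.map.maxattempts, mapreduce.reduce.maxattempts) are standard Hadoop 2+ settings, and the values shown are simply their defaults.

```java
import org.apache.hadoop.conf.Configuration;

public class FaultToleranceSettings {
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();

        // Number of copies HDFS keeps of each block (default is 3).
        conf.setInt("dfs.replication", 3);

        // How many times a failed map or reduce task is retried on another node
        // before the whole job is declared failed (default is 4).
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);

        return conf;
    }
}
```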
Job Scheduling:
YARN (Yet Another Resource Negotiator) is responsible for scheduling applications and managing cluster
resources, which it hands out as containers of memory and CPU.
Its pluggable schedulers (FIFO, Capacity, and Fair) allocate resources based on application requirements
and cluster availability to optimize resource utilization.
Shuffle and Sort:
The shuffle and sort phase transfers intermediate key-value pairs from map tasks to reduce tasks, sorts
them by key, and groups values with the same key for aggregation.
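Which reduce task receives a given key is decided by a partitioner (HashPartitioner by default). Below is a minimal sketch of a custom partitioner, reusing the Text/IntWritable types from the word-count example; the class name and partitioning rule are made up for illustration, and it would be registered with job.setPartitionerClass(FirstLetterPartitioner.class).

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which reduce task receives a given intermediate key. All pairs with the
// same key land in the same partition, so the shuffle delivers them to a single
// reducer, where they are sorted by key and grouped for aggregation.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String word = key.toString();
        if (numReduceTasks == 0 || word.isEmpty()) {
            return 0;
        }
        // Route words by their (lower-cased) first character.
        return Character.toLowerCase(word.charAt(0)) % numReduceTasks;
    }
}
```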
Task Execution:
Map and reduce tasks are executed on individual nodes in the Hadoop cluster, with each task processing
a portion of the input data in parallel.
MapReduce Types:
The map and reduce functions have general type signatures: map takes a (K1, V1) pair and emits a list of
(K2, V2) pairs, and reduce takes (K2, list(V2)) and emits a list of (K3, V3) pairs; the map output types must
match the reduce input types.
MapReduce also supports different input and output formats, including text, sequence files, and custom
formats.
Input and output formats define how data is read from input sources and written to output destinations.
Input Formats:
Input formats determine how input data is split and processed by map tasks.
Hadoop provides built-in input formats for various data sources, for example TextInputFormat for plain
text files, SequenceFileInputFormat for sequence files, and DBInputFormat for database tables.
Output Formats:
Hadoop provides built-in output formats for writing results to different destinations, for example
TextOutputFormat and SequenceFileOutputFormat for files in HDFS and DBOutputFormat for databases.
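A hedged sketch of selecting formats on a job: TextInputFormat and SequenceFileOutputFormat are built-in classes, while the helper class and the paths are illustrative assumptions.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class FormatConfig {
    public static void configure(Job job) throws IOException {
        // Input format: how input files are split and turned into (key, value) records.
        // TextInputFormat produces one record per line: (byte offset, line text).
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // hypothetical path

        // Output format: how the job's (key, value) pairs are written out.
        // SequenceFileOutputFormat writes a compact binary sequence file to HDFS.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // hypothetical path

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
    }
}
```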
MapReduce Features:
Fault tolerance: Hadoop provides fault tolerance mechanisms to handle node failures and ensure job
completion.
Scalability: Hadoop scales horizontally by adding more nodes to the cluster to accommodate increasing
data volumes and processing demands.
Data locality: Hadoop schedules tasks on the nodes where their input data is stored to minimize data
transfer over the network.
Task parallelism: MapReduce executes map and reduce tasks in parallel across multiple nodes in the
cluster; a combiner (sketched after this list) can further reduce the data moved during the shuffle.
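One common way to exploit data locality and parallelism further is a combiner, which pre-aggregates map output on the map side so that less data crosses the network during the shuffle. The sketch below reuses the hypothetical WordCountReducer from earlier, which is safe for word count because addition is associative and commutative; combiners that do not satisfy this can change results.

```java
import org.apache.hadoop.mapreduce.Job;

public class CombinerConfig {
    public static void addCombiner(Job job) {
        // Run the reduce logic on each map task's local output before the shuffle,
        // so the network carries one pre-summed count per word per map task.
        job.setCombinerClass(WordCountReducer.class);
    }
}
```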
Real-world MapReduce Notes:
In real-world scenarios, MapReduce is used for processing large datasets in various domains, including
web search, social media analytics, log processing, and recommendation systems.
MapReduce jobs can be tuned for performance through choices such as input/output formats, data
partitioning, map-output compression, and the number of reduce tasks (a tuning sketch appears at the
end of this section).
Monitoring and debugging relied on the JobTracker and TaskTracker in Hadoop 1; in Hadoop 2 and later,
the YARN ResourceManager, NodeManagers, and the MapReduce Job History Server expose job execution
details and performance metrics through their web UIs.
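As a hedged sketch of such tuning, the snippet below sets a few common knobs; the property names are standard Hadoop 2+ settings, but the values are purely illustrative, not recommendations.

```java
import org.apache.hadoop.mapreduce.Job;

public class TuningConfig {
    public static void tune(Job job) {
        // Reduce-side parallelism: more reducers means more (smaller) output files.
        job.setNumReduceTasks(8);                                          // illustrative value

        // Compress map output to shrink the data moved during the shuffle.
        job.getConfiguration().setBoolean("mapreduce.map.output.compress", true);

        // Per-map-task sort buffer for the shuffle, in MB (default is 100).
        job.getConfiguration().setInt("mapreduce.task.io.sort.mb", 256);   // illustrative value
    }
}
```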