Introduction to Hadoop

Hadoop is an open-source, distributed processing framework designed to handle large datasets across clusters of commodity hardware. It provides a reliable, scalable, and cost-effective solution for storing and processing massive amounts of data.

Why Hadoop?

Hadoop addresses the challenges posed by Big Data, which is characterized by its
volume, velocity, and variety. Traditional systems struggle to handle such data
efficiently. Hadoop offers:

 Scalability: Easily scales to handle petabytes and exabytes of data by adding more nodes to the cluster.

 Fault Tolerance: Data is replicated across multiple nodes, ensuring data availability even if some nodes fail.

 Cost-Effectiveness: Utilizes commodity hardware, reducing infrastructure costs.

 Flexibility: Supports various data types and processing paradigms.

Why Not RDBMS?

Relational Database Management Systems (RDBMS) are designed for structured data and transactional processing. They face limitations when dealing with Big Data:

 Scalability Limitations: Scaling an RDBMS vertically (adding more resources to a single server) becomes expensive and eventually reaches a limit.

 Schema Dependency: RDBMS require a predefined schema, making it difficult to handle unstructured or semi-structured data.

 Cost: Enterprise-grade RDBMS can be expensive to license and maintain.

 Processing Speed: Complex queries on large datasets can be slow in an RDBMS.

RDBMS vs. Hadoop

Feature         | RDBMS                              | Hadoop
Data Structure  | Structured                         | Structured, Semi-structured, Unstructured
Scalability     | Vertical (limited)                 | Horizontal (highly scalable)
Data Volume     | Limited                            | Massive (petabytes, exabytes)
Processing      | Transactional, Complex Queries     | Batch Processing, Data Analytics
Schema          | Predefined                         | Schema-on-Read
Hardware        | Specialized, Expensive             | Commodity Hardware
Use Cases       | Transaction Processing, Reporting  | Big Data Analytics, Data Warehousing

History of Hadoop

 2002: Doug Cutting and Mike Cafarella started the Nutch project, an open-
source web search engine.

 2003: Google published the Google File System (GFS) paper, which
inspired the development of HDFS.

 2004: Google published the MapReduce paper, which inspired the development of the MapReduce framework.

 2006: Hadoop was split out of Nutch and became an Apache subproject.

 2008: Hadoop became a top-level Apache project.

 Present: The Hadoop ecosystem continues to evolve with new components and improvements.

Hadoop Overview

Hadoop consists of several core components:


 HDFS (Hadoop Distributed File System): A distributed file system that
stores large datasets across a cluster of machines.

 YARN (Yet Another Resource Negotiator): A resource management and job scheduling framework.

 MapReduce: A programming model for processing large datasets in parallel.

 Hadoop Common: Provides common utilities and libraries used by other Hadoop modules.

Use Cases of Hadoop

 Log Processing: Analyzing web server logs, application logs, and security
logs.

 Data Warehousing: Building large-scale data warehouses for business intelligence.

 Recommendation Systems: Building recommendation engines for e-commerce and content platforms.

 Fraud Detection: Detecting fraudulent transactions in financial institutions.

 Social Media Analytics: Analyzing social media data to understand trends and sentiment.

HDFS (Hadoop Distributed File System)

HDFS is a distributed file system designed to store large datasets reliably and
efficiently. Key features include:

 Data Replication: Data is replicated across multiple nodes to ensure fault tolerance.

 Block Size: Data is divided into blocks (typically 128MB) and stored across
the cluster.

 Namenode: Manages the file system namespace and metadata.

 Datanode: Stores the actual data blocks.

 High Throughput: Optimized for sequential data access.
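
To make the ideas above concrete, here is a minimal sketch of a client writing and reading an HDFS file through Hadoop's Java FileSystem API. The path /user/demo/hello.txt is a placeholder invented for the example, and the NameNode address is assumed to come from core-site.xml on the classpath.

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            // Reads fs.defaultFS (the NameNode address) from core-site.xml.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Write a small file; HDFS splits large files into blocks and replicates them.
            Path file = new Path("/user/demo/hello.txt"); // placeholder path
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back and copy its contents to stdout.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }

            fs.close();
        }
    }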

Processing Data with Hadoop

Hadoop processes data using the MapReduce programming model. This involves
dividing the data into smaller chunks, processing them in parallel using mappers,
and then aggregating the results using reducers.
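
As a concrete illustration, the sketch below shows a typical job driver that wires the two phases together: it names the mapper and reducer classes and points the job at input and output locations. The class names TokenizerMapper and IntSumReducer refer to the word-count sketches under the Mapper and Reducer headings later in this guide, and the HDFS paths are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);

            // Mapper and reducer classes (sketched in the MapReduce section below).
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);

            // Types of the final (key, value) pairs written to HDFS.
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Placeholder HDFS paths for the input data and the job output.
            FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
            FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }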

Managing Resources and Applications with Hadoop YARN (Yet Another Resource Negotiator)

YARN is the resource management layer in Hadoop. It allows multiple applications to run on the same cluster, sharing resources dynamically.

 ResourceManager: Manages the cluster resources and allocates them to applications.

 NodeManager: Manages the resources on individual nodes and executes tasks.

 ApplicationMaster: Manages the lifecycle of an application and coordinates its tasks.

Introduction to MapReduce Programming

Introduction

MapReduce is a programming model for processing large datasets in parallel. It involves two main phases: Map and Reduce.

Mapper

The Mapper phase transforms input data into key-value pairs. It processes each
input record and emits one or more key-value pairs.

 Input: Input data is read from HDFS.

 Processing: The mapper function is applied to each input record.

 Output: Key-value pairs are emitted.
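
As an illustration of these three steps, here is a minimal word-count mapper using the org.apache.hadoop.mapreduce API. The class name and the word-count task itself are example choices, not something prescribed by the notes above.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Input: (byte offset, line of text). Output: (word, 1) for every token in the line.
    public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE); // emit one key-value pair per word
            }
        }
    }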

Reducer

The Reducer phase aggregates and summarizes the data based on the keys emitted
by the mappers. It receives the output of the mappers, sorts it by key, and then
applies a reduce function to each key and its associated values.

 Input: Key-value pairs from the mappers.

 Processing: The reducer function is applied to each key and its associated
values.

 Output: Final results are written to HDFS.
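
Continuing the word-count illustration from the Mapper section, a matching reducer sketch simply sums the counts it receives for each word; the framework has already grouped and sorted the pairs by key before reduce() is called.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Input: (word, [1, 1, 1, ...]). Output: (word, total count).
    public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result); // final (word, count) pair written to HDFS
        }
    }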

Combiner

A Combiner is an optional optimization that performs local aggregation of the mapper output before it is sent to the reducers. This reduces the amount of data that needs to be transferred across the network.

 Functionality: Performs local aggregation on the mapper output.

 Benefits: Reduces network traffic and improves performance.

 Placement: Runs on the same node as the mapper.
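
In the word-count sketches above, the reduce logic (summing integers) is associative and commutative, so the same IntSumReducer class can double as the combiner. The fragment below would go in the driver sketch shown earlier; reusing a reducer this way is only safe when local aggregation gives the same result as global aggregation.

    // Run IntSumReducer on each mapper's local output before the shuffle,
    // so fewer (word, count) pairs have to cross the network.
    job.setCombinerClass(IntSumReducer.class);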

Partitioner

The Partitioner determines which reducer will receive a given key-value pair from
the mapper. It ensures that all key-value pairs with the same key are sent to the
same reducer.

 Functionality: Distributes the mapper output to the reducers.

 Default Partitioner: Uses a hash function to distribute the keys evenly.

 Custom Partitioner: Can be implemented to control the distribution of keys.
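
As an illustrative sketch (the class name and the routing rule are invented for the example, and it assumes the job runs with two reducers), a custom partitioner could route words by their first letter instead of by hash:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Sends words starting with a-m to reducer 0 and everything else to reducer 1,
    // while still keeping all occurrences of the same word on the same reducer.
    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String word = key.toString().toLowerCase();
            char first = word.isEmpty() ? 'z' : word.charAt(0);
            int bucket = (first >= 'a' && first <= 'm') ? 0 : 1;
            // The modulo keeps the result in range even if fewer reducers are configured.
            return bucket % numPartitions;
        }
    }

It would be enabled in the driver with job.setPartitionerClass(FirstLetterPartitioner.class) together with job.setNumReduceTasks(2).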

Searching

MapReduce can be used for searching large datasets. The mapper can filter the data
based on a search query, and the reducer can aggregate the results.
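
As a rough sketch of that idea (the class name and the configuration key search.query are invented for the example), a grep-style mapper can emit only the records that contain a search term; if no aggregation is needed, the job can even run with zero reducers.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits only the input lines that contain the search term; all other lines are dropped.
    public class SearchMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

        private String query;

        @Override
        protected void setup(Context context) {
            // The search term is passed in through the job configuration (hypothetical key).
            query = context.getConfiguration().get("search.query", "");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.toString().contains(query)) {
                context.write(value, NullWritable.get());
            }
        }
    }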

Sorting

Hadoop automatically sorts the mapper output by key before it is sent to the
reducers. This makes it easy to perform sorting operations on large datasets.

Compression

Compression can be used to reduce the storage space and network bandwidth
required for Hadoop jobs. Hadoop supports various compression codecs, such as
Gzip, LZO, and Snappy.

 Benefits: Reduces storage costs, improves network performance.

 Considerations: Compression and decompression can add overhead to the processing time.
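
As a sketch of how compression is typically switched on (these are standard Hadoop configuration keys and codec classes, though codec availability such as Snappy depends on how the cluster was built), the following fragment would go in a job driver like the one sketched earlier:

    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Compress the intermediate map output to cut shuffle traffic
    // (Snappy trades compression ratio for speed).
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
                  SnappyCodec.class, CompressionCodec.class);

    // Compress the final reducer output written to HDFS with Gzip.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);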

This study guide provides a foundational understanding of Hadoop and MapReduce. Remember to review your lecture notes, assignments, and practice problems to further solidify your knowledge. Good luck with your final exams!
