
Unit 2: Apache Hadoop

(10-mark answers for each topic)

1. Introduction to Apache Hadoop

• Definition: Hadoop is an open-source framework for distributed storage and
processing of Big Data using simple programming models.

• Key Features:

1. Handles large-scale data storage (HDFS).

2. Processes data in parallel using MapReduce.

3. Scalable and fault-tolerant.

• Applications:

o Social media analysis (Facebook, Twitter).

o Fraud detection in banking.

o Search indexing (e.g., Yahoo!; Hadoop itself was inspired by Google's
MapReduce and GFS papers).

2. System Principle

• Core Concept: Hadoop distributes data across multiple nodes and processes it
in parallel, ensuring high efficiency.

• Key Components:

1. HDFS: Hadoop Distributed File System for storage.

2. MapReduce: Programming model for processing.

3. YARN: Resource management and job scheduling.

3. Hadoop Architecture

• Layers:

1. Storage Layer (HDFS): Manages large data storage across clusters.

2. Processing Layer (MapReduce): Executes parallel computations on data.

3. Resource Layer (YARN): Allocates system resources efficiently.


• Diagram:
(Diagram of Hadoop architecture with HDFS, MapReduce, and YARN interactions
will be included.)

4. Hadoop Distributed File System (HDFS)

• Overview: HDFS is designed for storing large datasets across multiple nodes.

• Features:

1. Data Blocks: Files are split into blocks (default size: 128 MB).

2. Replication: Data is replicated across nodes for fault tolerance.

3. Write Once, Read Many: Files are written once and not modified in place;
HDFS is optimized for high-throughput sequential reads.
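The block and replication mechanics above can be shown with a small calculation (a sketch using the standard defaults of 128 MB blocks and a replication factor of 3; the function name is illustrative):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # default HDFS block size: 128 MB
REPLICATION = 3                  # default replication factor

def hdfs_storage(file_size_bytes):
    """Return (number of blocks, total raw bytes stored across all replicas)."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    # HDFS stores each block REPLICATION times; the last block may be partial,
    # so the raw footprint is the file size times the replication factor.
    return blocks, file_size_bytes * REPLICATION

# A 300 MB file splits into 3 blocks (128 + 128 + 44 MB) and
# occupies 900 MB of raw cluster storage.
print(hdfs_storage(300 * 1024 * 1024))
```

This is why HDFS suits large files: per-block metadata overhead on the NameNode is amortized over 128 MB chunks rather than small files.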

5. Hadoop MapReduce

• Definition: A programming model for parallel data processing.

• How it Works:

1. Input Split: Data is divided into chunks.

2. Map Phase: Processes data in key-value pairs.

3. Reduce Phase: Aggregates and produces the final output.

• Advantages:

o High scalability and efficiency.

o Works on commodity hardware.
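The phases above can be walked through with a toy in-memory simulation (plain Python, not the Hadoop API; the sample data and the max-temperature-per-year task are illustrative):

```python
from collections import defaultdict

# Sample records: "year,temperature". Task: maximum temperature per year.
records = ["1950,22", "1950,30", "1951,18", "1951,25"]

# 1. Input split: data is divided into chunks for parallel mappers.
chunks = [records[:2], records[2:]]

# 2. Map phase: each chunk is turned into (key, value) pairs.
mapped = []
for chunk in chunks:
    for line in chunk:
        year, temp = line.split(",")
        mapped.append((year, int(temp)))

# Shuffle: Hadoop groups values by key between the map and reduce phases.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# 3. Reduce phase: aggregate each key's values into the final output.
result = {year: max(temps) for year, temps in grouped.items()}
print(result)  # {'1950': 30, '1951': 25}
```

In a real job the chunks live on different nodes and the mappers run in parallel; the shuffle step, implicit here, is what Hadoop performs across the network.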

6. YARN (Yet Another Resource Negotiator)

• Definition: A resource management layer in Hadoop for job scheduling.

• Components:

1. Resource Manager: Allocates resources for applications.

2. Node Manager: Monitors individual nodes and reports to the Resource
Manager.

• Advantages:

o Increases cluster utilization.


o Supports multiple workloads (MapReduce, Spark).

7. Hadoop Installation and Modes

• Installation:

1. Download and install Hadoop.

2. Configure HDFS and MapReduce settings.

3. Start Hadoop services.

• Modes:

1. Standalone Mode: Single node for testing.

2. Pseudo-Distributed Mode: Simulates a cluster on one machine.

3. Fully Distributed Mode: Real cluster with multiple nodes.

8. Hadoop Commands

• HDFS Commands:

1. hdfs dfs -ls: List files in HDFS.

2. hdfs dfs -put: Upload files to HDFS.

3. hdfs dfs -get: Download files from HDFS.

• YARN Commands:

1. yarn application -list: View running applications.

2. yarn logs -applicationId <app-id>: View the logs of an application.

9. Moving Data In and Out of Hadoop

• Using HDFS:

o Upload data using commands like hdfs dfs -put.

o Retrieve processed data using hdfs dfs -get.

• Integration Tools: Sqoop for transferring data between Hadoop and relational
databases.
10. Hadoop Programming

• Overview: Writing applications in Java (the native API) or Python (via
Hadoop Streaming) to process data using MapReduce.

• Example Program: Word Count application in Hadoop.
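A minimal Word Count can be sketched as a pair of Python functions (a simplified sketch: `mapper` and `reducer` here are ordinary functions operating in memory, whereas a real Hadoop Streaming job would read lines from stdin and be submitted with the hadoop-streaming jar):

```python
from collections import defaultdict

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: sum the counts emitted for each word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

text = ["Hadoop stores data", "Hadoop processes data"]
print(sorted(reducer(mapper(text)).items()))
# [('data', 2), ('hadoop', 2), ('processes', 1), ('stores', 1)]
```

The same split of logic carries over to the Java API, where `Mapper.map()` emits intermediate key-value pairs and `Reducer.reduce()` receives all values grouped by key.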
