Unit 3

Hadoop Cluster
A Hadoop cluster is a powerful and scalable system used to manage and process large
amounts of data across multiple computers, known as nodes. This setup allows
organizations to handle "big data" tasks efficiently by distributing the data and
processing work across these nodes. Here’s a detailed breakdown of what a Hadoop
cluster is and how it works:

1. Key Components of a Hadoop Cluster


A Hadoop cluster consists of several nodes, which are divided into two main types:
• Master Node: The master node is the brain of the cluster, responsible for managing the storage and processing of data.
– NameNode: This component manages the Hadoop Distributed File System (HDFS). It keeps track of where the data is stored across the cluster. The NameNode knows the location of all the data blocks and how they are replicated.
– ResourceManager: Part of YARN (Yet Another Resource Negotiator), this component is responsible for allocating system resources (like memory and CPU) to the tasks running on the cluster.
• Worker Nodes: These are the nodes that do the actual work of storing data and
performing computations.
– DataNode: Each worker node has a DataNode, which stores the data in HDFS. The
DataNode manages the actual storage on its node and regularly reports back to the
NameNode.
– NodeManager: Also part of YARN, the NodeManager monitors the resources on the
worker node and handles the execution of tasks. It communicates with the
ResourceManager to ensure tasks are completed efficiently.

2. How Data is Stored: HDFS (Hadoop Distributed File System)


• HDFS is the storage system used in a Hadoop cluster. It is designed to store very large files
by splitting them into smaller blocks and distributing these blocks across multiple
DataNodes.
• Data Blocks: A large file is broken down into smaller chunks called data blocks. These
blocks are typically 128MB or 256MB in size. Each block is stored on a different DataNode
in the cluster.
• Replication: To ensure data safety and reliability, each block is replicated across multiple
DataNodes. The default replication factor is three, meaning each block of data is stored
on three different nodes. This replication ensures that if one node fails, the data is still
available from another node.
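
To make this concrete, here is a minimal sketch of writing a file to HDFS with the Hadoop Java client. The NameNode address is hypothetical, and the block size and replication factor would normally be set cluster-wide in hdfs-site.xml rather than per client; they are overridden here only for illustration.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // hypothetical NameNode address
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);     // 128 MB blocks
        conf.setInt("dfs.replication", 3);                     // default replication factor

        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), conf);

        // The NameNode decides which DataNodes receive each block and its replicas;
        // the client simply streams bytes into the write pipeline it is given.
        try (FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
            out.writeUTF("hello hdfs");
        }
        fs.close();
    }
}
```
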
3. How Data is Processed: MapReduce and YARN
• Hadoop provides a framework for processing large datasets through distributed
computing. The original processing engine in Hadoop is called MapReduce.
• MapReduce: This processing model breaks down a big task into smaller sub-tasks
(Map and Reduce tasks). The master node (using the ResourceManager or
JobTracker in older versions) assigns these tasks to the worker nodes. The worker
nodes process the data in parallel, and the results are combined to produce the
final output.
– Map Phase: The data is processed and transformed into key-value pairs.
– Reduce Phase: These key-value pairs are then aggregated to produce the final
result.
• YARN (Yet Another Resource Negotiator): YARN is an advanced resource
management layer in Hadoop that allows different processing engines to run on
the same Hadoop cluster. This makes Hadoop more flexible and capable of running
non-MapReduce jobs like Apache Spark, Tez, and others.
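
The classic WordCount program illustrates the two phases. The sketch below uses Hadoop's standard MapReduce Java API: each mapper emits a (word, 1) pair per token, YARN schedules the map and reduce tasks on worker nodes, and the reducers sum the counts for each word. Input and output paths are passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: turn each line of input into (word, 1) key-value pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: aggregate the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
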
4. Scalability and Fault Tolerance
• A Hadoop cluster is designed to scale easily and provide fault tolerance:
• Scalability: You can increase the capacity of a Hadoop cluster by adding more
nodes. As the data grows, more nodes can be added to the cluster without
requiring major changes to the existing setup.
• Fault Tolerance: If a node fails, Hadoop automatically redirects tasks to other
available nodes and uses replicated data from other nodes to continue processing.
This ensures that the system continues to function smoothly even in the case of
hardware failure.
5. Advantages of a Hadoop Cluster
• Cost-Effective: Hadoop clusters use commodity hardware (affordable, standard
computers), making it cheaper than using specialized, high-end machines.
• High Scalability: It’s easy to add more nodes to the cluster to handle more data or
to increase processing power.
• Fault Tolerance: Data is replicated, and tasks are redistributed in case of node
failures, ensuring reliability and continuous operation.
• Flexibility: Hadoop can handle all types of data (structured, semi-structured,
unstructured) and run different types of data processing jobs.
Types Of Hadoop Cluster
1. Single-Node Cluster
• Definition: A single-node Hadoop cluster is a setup where all the Hadoop
components—both master and worker services—run on a single
machine.
• Components: On this single machine, the NameNode, DataNode,
ResourceManager, and NodeManager are all configured to run. This
means the machine handles both the storage and processing tasks.
• Use Case:
– Learning and Development: Ideal for individuals learning Hadoop,
testing applications, or developing new features.
– Testing: Used for testing small-scale jobs or configurations before
deploying them on a larger, multi-node cluster.
• Limitations: It doesn’t offer the scalability or fault tolerance of a multi-
node cluster. Performance is limited to the capacity of a single machine.
2. Multi-Node Cluster
• Definition: A multi-node Hadoop cluster consists of multiple machines, where different
machines are designated as either master nodes or worker nodes. This is the standard setup
used in production environments.
• Components:
– Master Nodes: These nodes are dedicated to running master services like the NameNode
and ResourceManager. They manage the cluster and coordinate data storage and task
execution.
– Worker Nodes: These nodes are responsible for running the DataNode and NodeManager
services. They handle data storage and perform the computational tasks assigned by the
master nodes.
• Use Case:
– Production: Used in large-scale data processing tasks in real-world applications, where the
dataset is too large to be handled by a single machine.
– Big Data Analytics: Ideal for processing and analyzing vast amounts of data in industries
like finance, healthcare, retail, and more.
• Advantages:
– Scalability: Easily add more nodes to increase storage and processing power.
– Fault Tolerance: The system remains operational even if some nodes fail, thanks to data
replication and task redistribution.
– High Performance: Multiple machines work together, allowing the cluster to handle large-
scale data processing efficiently.
3. Pseudo-Distributed Cluster
• Definition: A pseudo-distributed Hadoop cluster is a middle ground between a
single-node and a multi-node cluster. It simulates a multi-node environment on a
single machine by running each Hadoop service (NameNode, DataNode,
ResourceManager, NodeManager) in separate processes.
• Components: Although all services run on the same machine, they behave as if
they are on different nodes, communicating over the network.
• Use Case:
– Development and Testing: Developers use pseudo-distributed clusters to
simulate a multi-node environment on their local machines without needing
access to a full cluster.
– Configuration Testing: Useful for testing how different configurations would
perform in a multi-node cluster.
• Advantages:
– Ease of Setup: Provides a way to test and develop in a distributed-like
environment without needing multiple machines.
– Learning Tool: Helps users understand how Hadoop components interact in a
distributed setting.
• In a pseudo-distributed Hadoop cluster, the key to making it behave like a multi-node
system is how the processes (services) are set up to communicate with each other.
Here’s how this works:
• Separate Daemons (Processes): Hadoop runs multiple daemons (background
processes), like the NameNode, DataNode, ResourceManager, and NodeManager. In a
pseudo-distributed setup, each of these services runs as a separate process on the
same machine. This mimics the setup of a real multi-node cluster where these services
would run on different machines.
• Localhost Communication: Instead of communicating over a network between
different physical nodes, in a pseudo-distributed mode, the services communicate over
localhost (which is the local loopback IP address 127.0.0.1). This makes it feel like the
services are talking to each other as if they were on different machines.
• HDFS and MapReduce Simulation:
– HDFS (Hadoop Distributed File System) is designed to distribute data across many nodes in a
cluster. In pseudo-distributed mode, Hadoop still uses HDFS, but all the "blocks" of data are
stored on the single machine, spread across different directories (pretending they are different
nodes).
– MapReduce/YARN will still simulate distributing tasks across different "nodes" (which are
really just different processes on the same machine), pretending to manage resources as if it
were on multiple physical computers.
• In short, while everything is running on the same machine, the services are isolated
from each other and communicate in the same way they would in a real distributed
cluster, creating the illusion of a multi-node setup.
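
As a rough illustration, the client-side settings for such a setup might look like the sketch below. The property names are standard Hadoop keys, but the port number is only a common default and may differ per installation; these values would normally live in core-site.xml, hdfs-site.xml, and yarn-site.xml.

```java
import org.apache.hadoop.conf.Configuration;

// Minimal view of a pseudo-distributed setup: every daemon is reached
// through localhost, and replication is 1 because there is only one DataNode.
public class PseudoDistributedConfig {
    public static Configuration build() {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");      // NameNode on the loopback address
        conf.set("yarn.resourcemanager.hostname", "localhost"); // ResourceManager on the same machine
        conf.setInt("dfs.replication", 1);                      // no real replicas on a single node
        return conf;
    }
}
```
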
4. Fully-Distributed Cluster
• Definition: A fully-distributed Hadoop cluster is a multi-node cluster where each
Hadoop service runs on dedicated nodes. It’s the most robust and scalable setup,
used for handling extensive production workloads.
• Components:
– Dedicated Master Nodes: NameNode and ResourceManager run on separate,
dedicated machines.
– Dedicated Worker Nodes: Multiple machines are used solely for running
DataNode and NodeManager services.
• Use Case:
– Enterprise-Scale Applications: Suitable for enterprises needing to process
petabytes of data across hundreds or thousands of nodes.
– High Availability: Critical for applications where uptime and fault tolerance are
paramount.
• Advantages:
– Maximum Scalability: Supports large-scale deployments, with the ability to
add hundreds or thousands of nodes.
– High Fault Tolerance: Redundant setups for master and worker nodes ensure
the cluster remains operational even during failures.
• In Hadoop, both multi-node clusters and fully distributed clusters involve multiple
machines working together, but there are key differences:
• 1. Multi-Node Cluster:
– Multiple Machines: A multi-node cluster involves more than one machine (or node), but the roles are not necessarily fully separated.
– Not Necessarily Fully Distributed: In smaller setups, a single machine may host several services (for example, a master daemon alongside a worker daemon), and only a handful of nodes may be used. The nodes still communicate over a network, but the deployment is modest in scale.
– Example: Imagine a few computers in the same office working together to store data and run programs.
• 2. Fully-Distributed Cluster:
– Fully Distributed Setup: Every Hadoop service runs on its own dedicated machine, with separate master nodes and many worker nodes, typically spread across multiple racks in a data center.
– High Fault Tolerance: Because data is replicated across nodes and racks, and master services can be made highly available, the cluster keeps running even if individual machines or a whole rack fail.
– Example: Imagine hundreds or thousands of machines working together in sync to store and process very large datasets.
5. Edge or Cloud-Based Cluster
• Definition: These clusters can be deployed on cloud platforms (like AWS,
Azure, or Google Cloud) instead of on-premises hardware.
• Components:
– Cloud Master and Worker Nodes: The master and worker nodes are
hosted on cloud instances, allowing for flexible scaling and easy
management.
• Use Case:
– On-Demand Scaling: Ideal for organizations that need to scale up or
down quickly based on demand without investing in physical
hardware.
– Cost-Efficiency: Pay-as-you-go pricing models make it cost-effective for
short-term or variable workloads.
• Advantages:
– Flexibility: Easily scale resources up or down as needed.
– Cost Savings: Reduce costs by only paying for the resources used.
– Accessibility: Manage the cluster from anywhere with internet access.
Modes in which Hadoop Works
1. Standalone Mode
• What It Is: Hadoop runs on a single computer.
• Use: Good for testing or learning. It uses the local
file system, not HDFS, and doesn't involve any
network communication.

2. Pseudo-Distributed Mode
• What It Is: Simulates a multi-node setup on one
computer.
• Use: Ideal for development and testing. Hadoop
services (like NameNode, DataNode) run as
separate processes on the same machine,
communicating like they would in a real cluster.

3. Fully-Distributed Mode
• What It Is: Hadoop runs on multiple computers
(nodes).
• Use: Used in real-world production. Data and
processing are distributed across many machines,
providing high performance, scalability, and fault
tolerance.
Hadoop Ecosystem
The Hadoop ecosystem is a collection of tools and frameworks that work together to help
you store, process, analyze, and manage big data. Here’s a brief overview:
1. HDFS (Hadoop Distributed File System)
• Purpose: Stores large amounts of data across multiple machines, with redundancy to
ensure data safety.

2. YARN (Yet Another Resource Negotiator)


• Purpose: Manages and schedules resources (like CPU and memory) for processing
tasks in the Hadoop cluster.

3. MapReduce
• Purpose: A programming model for processing large datasets by breaking them into
smaller tasks that run in parallel.

4. Hive
• Purpose: A data warehouse system that allows you to query and manage large datasets using an SQL-like language (see the query sketch after this list).

5. HBase
• Purpose: A NoSQL database that provides real-time read/write access to big data
stored in HDFS.
6. Pig
• Purpose: A scripting platform for analyzing large datasets, offering a simpler way
to write data processing tasks than MapReduce.
7. Sqoop
• Purpose: A tool for transferring data between Hadoop and relational databases.
8. Flume
• Purpose: A tool for collecting and moving large amounts of log data into
Hadoop.
9. Oozie
• Purpose: A workflow scheduler for managing Hadoop jobs and coordinating
tasks.
10. Zookeeper
• Purpose: Manages and coordinates distributed applications in the Hadoop
ecosystem, ensuring they work together smoothly.
11. Spark
• Purpose: A fast, in-memory data processing engine that can be used as an
alternative to MapReduce, supporting a variety of workloads like batch
processing, streaming, and machine learning.
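
To give a flavour of the SQL-like access Hive provides, here is a hedged sketch of querying HiveServer2 over JDBC from Java. The host, table, and credentials are hypothetical, and the hive-jdbc driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Explicit driver load; not required on JDBC 4+, but harmless.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 is assumed to listen on its default port 10000.
        String url = "jdbc:hive2://hive-host:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT product, SUM(amount) AS total FROM sales GROUP BY product")) {
            while (rs.next()) {
                System.out.println(rs.getString("product") + "\t" + rs.getLong("total"));
            }
        }
    }
}
```
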
Pros and Cons of the Hadoop Ecosystem
Pros
• Scalability: Can handle increasing amounts of data by adding more nodes to the cluster. It scales out horizontally, meaning you grow the system by adding more machines rather than upgrading existing ones.
• Cost-Effective: Uses commodity hardware (affordable, standard servers), which reduces overall infrastructure costs compared to expensive, specialized hardware.
• Fault Tolerance: Data is replicated across multiple nodes, so if one node fails, data is still available from other nodes. The system can continue operating smoothly even with hardware failures.
• Flexibility: Can store and process various types of data, including structured, semi-structured, and unstructured data. Supports different processing models and tools within its ecosystem.
• Large-Scale Data Processing: Efficiently processes large datasets through parallel processing. Frameworks like MapReduce and Spark allow for handling and analyzing massive amounts of data.
• Integration with Other Tools: Easily integrates with a variety of data processing and management tools (e.g., Hive, Pig, HBase) and supports data transfer tools (e.g., Sqoop, Flume).
Cons
• Complexity: Setting up and managing a Hadoop cluster can be complex and requires specialized knowledge. The ecosystem involves many tools, each with its own configuration and management requirements.
• Performance Overhead: Traditional MapReduce jobs can be slow due to their batch-processing nature. Although tools like Spark offer faster processing, some workloads may still experience performance issues.
• Resource Management: YARN handles resource allocation, but managing resources effectively across many nodes can be challenging. Overhead and inefficiencies can occur if resources are not managed properly.
• Data Security: Out-of-the-box security features may be limited. Implementing robust security measures often requires additional configuration and tools to ensure data protection.
• Learning Curve: Learning to use Hadoop and its ecosystem effectively can be time-consuming. Understanding the various components and how they interact requires significant training and experience.
• Maintenance and Monitoring: Requires ongoing maintenance and monitoring to ensure the cluster runs smoothly. Managing a large number of nodes and handling failures can be labor-intensive.
Hadoop Security
Hadoop security is crucial for protecting sensitive data and ensuring that
only authorized users can access and process it. Here’s a summary of key
aspects of Hadoop security:

• Authentication: This is about making sure the right people are accessing the
system. Hadoop uses something called Kerberos, a system that checks the
identity of users and services to confirm they are who they say they are. Think of
it like showing an ID card before entering a secure building.

• Authorization: Once someone is authenticated, they are granted (or denied) permission to access specific data or perform certain tasks. Hadoop uses a permissions system to check whether a user has the right to read, write, or execute certain data or jobs.

• Encryption: This ensures that the data stored in Hadoop, and the data moving
across the network, is protected from being read by unauthorized people. It’s like
locking up sensitive information so that only people with the right key can see it.

• Audit Logs: Hadoop keeps track of who accessed the data and what they did. This
log is like a security camera that records everything that happens, so you can
review it if needed.
1. Authentication
• Kerberos: Hadoop uses Kerberos for strong authentication. Kerberos is a network authentication
protocol that ensures users and services are who they claim to be. It requires users to prove their
identity using tickets.

How Kerberos works (in simple terms):


• When you try to access Hadoop, you don't just get in right away.
• You first need to prove your identity. Kerberos gives you a ticket if you’re legit, kind of like a
special key.
• You can then use this ticket to access different parts of the Hadoop system, like data or services.
• The ticket expires after a certain time, so you need to re-authenticate regularly, which increases security.
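
As a rough idea of how a client presents its identity, here is a minimal sketch using Hadoop's UserGroupInformation API. The principal name and keytab path are made up for illustration, and the cluster is assumed to already be configured for Kerberos.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLoginExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell the Hadoop client that the cluster expects Kerberos authentication.
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Obtain a ticket non-interactively from a keytab file; the principal
        // and keytab path here are hypothetical.
        UserGroupInformation.loginUserFromKeytab(
                "analyst@EXAMPLE.COM", "/etc/security/keytabs/analyst.keytab");

        // Subsequent HDFS/YARN calls made by this process present the ticket
        // automatically until it expires and must be renewed or re-acquired.
        System.out.println("Logged in as: " + UserGroupInformation.getCurrentUser());
    }
}
```
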

2. Authorization
• HDFS Permissions: Hadoop Distributed File System (HDFS) uses a permissions model similar to Unix, where files and directories have read, write, and execute permissions. Access is controlled by file ownership and group memberships (see the sketch after this list).
• Apache Ranger: A tool for managing and enforcing security policies across Hadoop components.
It provides centralized access control, auditing, and data masking.
Apache Ranger is a security tool for Hadoop that helps manage and enforce who can do what
with the data stored in Hadoop. It’s like a police officer in the Hadoop world, making sure that
only authorized people can access specific data or perform certain actions.
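
To illustrate the Unix-style permissions model, here is a small sketch using the HDFS Java API. The directory name, user, and group are hypothetical, and changing ownership normally requires superuser privileges; the same effect can be achieved from the command line with hdfs dfs -chmod and hdfs dfs -chown.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class HdfsPermissionsExample {
    public static void main(String[] args) throws Exception {
        // Assumes the client is already configured (core-site.xml) to reach the cluster.
        FileSystem fs = FileSystem.get(new Configuration());
        Path reports = new Path("/data/reports"); // hypothetical directory

        // Unix-style mode 750: owner rwx, group r-x, others no access.
        fs.mkdirs(reports, new FsPermission(FsAction.ALL, FsAction.READ_EXECUTE, FsAction.NONE));

        // Ownership and group membership decide which rule applies to a given user.
        fs.setOwner(reports, "analyst", "analytics");

        System.out.println(fs.getFileStatus(reports).getPermission()); // e.g. rwxr-x---
    }
}
```
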
3. Data Encryption
• Data at Rest: Data stored in HDFS can be encrypted using technologies like Hadoop’s
Transparent Data Encryption (TDE) or third-party tools. This ensures that data is
protected while stored on disk.
• Data in Transit: Data transmitted between Hadoop nodes and between clients and the
cluster can be encrypted using SSL/TLS to prevent unauthorized access during
transmission.
4. Audit and Monitoring
• Apache Ranger Audit: Provides detailed auditing capabilities to track access to data and
monitor for unauthorized or suspicious activities.
• Apache Sentry: An older tool for policy-based authorization and auditing, often used
alongside or instead of Ranger.
5. Data Masking
• Apache Ranger Data Masking: Allows sensitive data to be masked or redacted in
queries, so users can access only the data they are authorized to see.
How does data masking work?
• Masking the data means that sensitive information gets replaced with fake or scrambled
characters.
• For example, if the data contains a credit card number like 1234-5678-9012-3456, after
masking, someone might only see XXXX-XXXX-XXXX-3456. The last part is still visible, but
the critical parts are hidden.
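
A toy sketch of that idea in Java is shown below. Note that Apache Ranger enforces masking centrally inside query engines such as Hive; this is just a plain string transformation to illustrate the concept.

```java
public class MaskingExample {
    // Replace every digit that is followed by at least four more digits with 'X',
    // so only the last four digits remain visible.
    static String maskCardNumber(String card) {
        return card.replaceAll("\\d(?=(?:\\D*\\d){4})", "X");
    }

    public static void main(String[] args) {
        System.out.println(maskCardNumber("1234-5678-9012-3456"));
        // prints XXXX-XXXX-XXXX-3456
    }
}
```
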
6. Secure Data Access
• NameNode and ResourceManager Security: Access to the NameNode (which
manages the metadata of HDFS) and ResourceManager (which allocates resources
and schedules jobs) is controlled through Kerberos authentication and access
controls.
• Service-Level Security: Individual Hadoop services (e.g., Hive, HBase) can be
secured through service-specific configurations and integrations with Kerberos.
7. Configuration Management
• Secure Configuration: Ensuring that Hadoop components are configured securely,
including setting appropriate permissions, disabling unused services, and applying
security patches.
8. User and Group Management
• Hadoop Groups: Users are organized into groups, and access to resources can be
controlled at the group level, simplifying permissions management.
Hadoop security involves several layers of protection, including authentication
(Kerberos), authorization (HDFS permissions, Apache Ranger), data encryption (both
at rest and in transit), auditing, and secure configuration. Implementing these security
measures helps protect sensitive data, control access, and ensure compliance with
security policies.