Unit 3
Hadoop Cluster
A Hadoop cluster is a powerful and scalable system used to manage and process large
amounts of data across multiple computers, known as nodes. This setup allows
organizations to handle "big data" tasks efficiently by distributing the data and
processing work across these nodes. Here’s a detailed breakdown of what a Hadoop
cluster is and how it works:
Hadoop Installation Modes
Hadoop can be set up to run in three modes:
1. Standalone (Local) Mode
• What It Is: Hadoop runs as a single Java process on one
computer, using the local file system instead of HDFS.
• Use: Suited to quick debugging and testing of MapReduce
logic; no separate Hadoop services are started.
2. Pseudo-Distributed Mode
• What It Is: Simulates a multi-node setup on one
computer.
• Use: Ideal for development and testing. Hadoop
services (like NameNode, DataNode) run as
separate processes on the same machine,
communicating like they would in a real cluster.
3. Fully-Distributed Mode
• What It Is: Hadoop runs on multiple computers
(nodes).
• Use: Used in real-world production. Data and
processing are distributed across many machines,
providing high performance, scalability, and fault
tolerance.
Hadoop Ecosystem
The Hadoop ecosystem is a collection of tools and frameworks that work together to help
you store, process, analyze, and manage big data. Here’s a brief overview:
1. HDFS (Hadoop Distributed File System)
• Purpose: Stores large amounts of data across multiple machines, with redundancy to
ensure data safety.
2. YARN (Yet Another Resource Negotiator)
• Purpose: Manages cluster resources and schedules the jobs that run across the
nodes of the cluster.
3. MapReduce
• Purpose: A programming model for processing large datasets by breaking them into
smaller tasks that run in parallel (a word-count sketch follows after this list).
4. Hive
• Purpose: A data warehouse system that allows you to query and manage large
datasets using an SQL-like language (HiveQL).
5. HBase
• Purpose: A NoSQL database that provides real-time read/write access to big data
stored in HDFS.
6. Pig
• Purpose: A scripting platform for analyzing large datasets, offering a simpler way
to write data processing tasks than MapReduce.
7. Sqoop
• Purpose: A tool for transferring data between Hadoop and relational databases.
8. Flume
• Purpose: A tool for collecting and moving large amounts of log data into
Hadoop.
9. Oozie
• Purpose: A workflow scheduler for managing Hadoop jobs and coordinating
tasks.
10. ZooKeeper
• Purpose: Manages and coordinates distributed applications in the Hadoop
ecosystem, ensuring they work together smoothly.
11. Spark
• Purpose: A fast, in-memory data processing engine that can be used as an
alternative to MapReduce, supporting a variety of workloads like batch
processing, streaming, and machine learning (see the PySpark sketch after this list).
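To make the MapReduce model above concrete, here is a minimal word-count example written in the Hadoop Streaming style, where the mapper and reducer are ordinary Python scripts that read standard input and write standard output. The script names and the input/output paths are illustrative assumptions, not something fixed by this unit.

# mapper.py - emits "word<TAB>1" for every word read from standard input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - sums the counts per word; Hadoop delivers the mapper output
# sorted by key, so all lines for the same word arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

Such a job is normally submitted with the Hadoop Streaming jar (its exact path depends on the installation), and the same two scripts can be tested locally with a shell pipeline such as: cat input.txt | python3 mapper.py | sort | python3 reducer.py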
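For comparison with MapReduce, the same word count can be expressed in a few lines with Spark's Python API (PySpark). This is only a sketch, under the assumption that PySpark is available and that the placeholder HDFS paths below exist.

# word_count_spark.py - intermediate results stay in memory between steps.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

lines = spark.read.text("hdfs:///data/in").rdd.map(lambda row: row[0])  # placeholder input path
counts = (lines.flatMap(lambda line: line.split())   # split each line into words
               .map(lambda word: (word, 1))          # pair every word with a count of 1
               .reduceByKey(lambda a, b: a + b))     # sum the counts per word in parallel
counts.saveAsTextFile("hdfs:///data/out")            # placeholder output path

spark.stop()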
Pros and Cons of the Hadoop Ecosystem
Pros
• Scalability
– Pros: Can handle increasing amounts of data by adding more nodes to the cluster. It scales out
horizontally, meaning you can grow your system by adding more machines rather than
upgrading existing ones.
• Cost-Effective
– Pros: Uses commodity hardware (affordable, standard servers), which reduces overall
infrastructure costs compared to expensive, specialized hardware.
• Fault Tolerance
– Pros: Data is replicated across multiple nodes, so if one node fails, data is still available from
other nodes. The system can continue operating smoothly even with hardware failures.
• Flexibility
– Pros: Can store and process various types of data, including structured, semi-structured, and
unstructured data. Supports different processing models and tools within its ecosystem.
• Large-Scale Data Processing
– Pros: Efficiently processes large datasets through parallel processing. Frameworks like
MapReduce and Spark allow for handling and analyzing massive amounts of data.
• Integration with Other Tools
– Pros: Easily integrates with a variety of data processing and management tools (e.g., Hive, Pig,
HBase) and supports data transfer tools (e.g., Sqoop, Flume).
Cons
• Complexity
– Cons: Setting up and managing a Hadoop cluster can be complex and requires
specialized knowledge. The ecosystem involves many tools, each with its own
configuration and management requirements.
• Performance Overhead
– Cons: Traditional MapReduce jobs can be slow due to their batch processing nature.
Although tools like Spark offer faster processing, some workloads may still experience
performance issues.
• Resource Management
– Cons: YARN handles resource allocation, but managing resources effectively across
many nodes can be challenging. Overhead and inefficiencies can occur if not managed
properly.
• Data Security
– Cons: Out-of-the-box security features may be limited. Implementing robust security
measures often requires additional configuration and tools to ensure data protection.
• Learning Curve
– Cons: Learning to use Hadoop and its ecosystem effectively can be time-consuming.
Understanding various components and how they interact requires significant training
and experience.
• Maintenance and Monitoring
– Cons: Requires ongoing maintenance and monitoring to ensure the cluster runs
smoothly. Managing a large number of nodes and handling failures can be labor-intensive.
Hadoop Security
Hadoop security is crucial for protecting sensitive data and ensuring that
only authorized users can access and process it. Here’s a summary of key
aspects of Hadoop security:
• Authentication: This is about making sure the right people are accessing the
system. Hadoop uses something called Kerberos, a system that checks the
identity of users and services to confirm they are who they say they are. Think of
it like showing an ID card before entering a secure building.
• Encryption: This ensures that the data stored in Hadoop, and the data moving
across the network, is protected from being read by unauthorized people. It’s like
locking up sensitive information so that only people with the right key can see it.
• Audit Logs: Hadoop keeps track of who accessed the data and what they did. This
log is like a security camera that records everything that happens, so you can
review it if needed.
1. Authentication
• Kerberos: Hadoop uses Kerberos for strong authentication. Kerberos is a network authentication
protocol that ensures users and services are who they claim to be. It requires users to prove their
identity using tickets.
2. Authorization
• HDFS Permissions: Hadoop Distributed File System (HDFS) uses a permissions model similar to
Unix, where files and directories have read, write, and execute permissions. Access is controlled
by file ownership and group memberships (a short sketch follows at the end of this section).
• Apache Ranger: A tool for managing and enforcing security policies across Hadoop components.
It provides centralized access control, auditing, and data masking.
Apache Ranger is a security tool for Hadoop that helps manage and enforce who can do what
with the data stored in Hadoop. It’s like a police officer in the Hadoop world, making sure that
only authorized people can access specific data or perform certain actions.
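As a small illustration of the Unix-style permission model described above, the sketch below drives the standard hdfs dfs command from Python to set an owner, a group, and permission bits on a directory. The /sales path, the alice user, and the analysts group are made-up examples, and changing ownership normally requires HDFS superuser rights.

# hdfs_permissions.py - assumes the "hdfs" command-line client is on the PATH.
import subprocess

def hdfs(*args):
    subprocess.run(["hdfs", "dfs", *args], check=True)

hdfs("-mkdir", "-p", "/sales/reports")               # create the directory
hdfs("-chown", "alice:analysts", "/sales/reports")   # set owner and group
hdfs("-chmod", "750", "/sales/reports")              # rwx for owner, r-x for group, none for others
hdfs("-ls", "/sales")                                # verify the resulting permissions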
3. Data Encryption
• Data at Rest: Data stored in HDFS can be encrypted using technologies like Hadoop’s
Transparent Data Encryption (TDE) or third-party tools. This ensures that data is
protected while stored on disk.
• Data in Transit: Data transmitted between Hadoop nodes and between clients and the
cluster can be encrypted using SSL/TLS to prevent unauthorized access during
transmission.
4. Audit and Monitoring
• Apache Ranger Audit: Provides detailed auditing capabilities to track access to data and
monitor for unauthorized or suspicious activities.
• Apache Sentry: An older tool for policy-based authorization and auditing, often used
alongside or instead of Ranger.
5. Data Masking
• Apache Ranger Data Masking: Allows sensitive data to be masked or redacted in
queries, so users can access only the data they are authorized to see.
How does data masking work?
• Masking the data means that sensitive information gets replaced with fake or scrambled
characters.
• For example, if the data contains a credit card number like 1234-5678-9012-3456, after
masking, someone might only see XXXX-XXXX-XXXX-3456. The last part is still visible, but
the critical parts are hidden.
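In Hadoop this replacement is enforced by the query engine under a Ranger policy rather than by application code, but the toy Python function below illustrates the transformation just described: every digit except the last four is replaced with X, while separators are kept.

# mask_card_number.py - a toy illustration only; real masking happens server-side
# under a Ranger policy, not in client code.
def mask_card_number(card: str) -> str:
    total_digits = sum(ch.isdigit() for ch in card)
    seen = 0
    masked = []
    for ch in card:
        if ch.isdigit():
            seen += 1
            masked.append(ch if seen > total_digits - 4 else "X")
        else:
            masked.append(ch)   # keep separators such as "-"
    return "".join(masked)

print(mask_card_number("1234-5678-9012-3456"))   # prints XXXX-XXXX-XXXX-3456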
6. Secure Data Access
• NameNode and ResourceManager Security: Access to the NameNode (which
manages the metadata of HDFS) and ResourceManager (which allocates resources
and schedules jobs) is controlled through Kerberos authentication and access
controls.
• Service-Level Security: Individual Hadoop services (e.g., Hive, HBase) can be
secured through service-specific configurations and integrations with Kerberos.
7. Configuration Management
• Secure Configuration: Ensuring that Hadoop components are configured securely,
including setting appropriate permissions, disabling unused services, and applying
security patches.
8. User and Group Management
• Hadoop Groups: Users are organized into groups, and access to resources can be
controlled at the group level, simplifying permissions management.
Hadoop security involves several layers of protection, including authentication
(Kerberos), authorization (HDFS permissions, Apache Ranger), data encryption (both
at rest and in transit), auditing, and secure configuration. Implementing these security
measures helps protect sensitive data, control access, and ensure compliance with
security policies.