BDA_HADOOP_UNIT-2

The document provides an introduction to Hadoop, an open-source framework for storing and processing large datasets. It covers key characteristics of Hadoop, its history, differences from RDBMS, and its ecosystem components including HDFS, YARN, and projects such as Hive and Pig. It also discusses Hadoop distributions and deployment architectures.

Welcome

Chapter -2
Introduction to HADOOP
_________________________________________
© 2024 FTVT Institute. All Rights Reserved.

PPT 1
Topics
§ What is Hadoop
§ A Brief History of Hadoop
§ Difference between RDBMS and Hadoop
§ Hadoop cluster
§ Hadoop Ecosystem projects
§ Hadoop distributions
§ Hadoop deployment architecture

What is Hadoop

Hadoop Key Characteristics
Hadoop is an open-source framework for storing and processing large datasets across clusters of computers. Here
are its key characteristics:
1. Distributed Storage
- Hadoop stores data across multiple nodes in a cluster, enabling the handling of very large datasets.
- Uses Hadoop Distributed File System (HDFS), which splits files into blocks and distributes them across the
cluster.
2. Scalability
- Easily scales horizontally by adding more nodes to the cluster.
- Supports thousands of nodes and petabytes of data.
3. Fault Tolerance
- Automatically replicates data blocks across multiple nodes.
- Ensures high availability and reliability, even if a node fails.
- Nodes can recover quickly without impacting the overall system.



Hadoop Key Characteristics
4. High Throughput
- Designed for high-speed data processing across distributed nodes.
- Ensures efficient use of resources by distributing computation to the data's location (data locality).
5. Cost-Effectiveness
- Built on commodity hardware, reducing costs compared to specialized systems.
- Open-source nature avoids expensive software licensing fees.
6. Batch Processing
- Designed primarily for batch-oriented processing of large-scale data.
- Processes data in parallel across nodes using frameworks like MapReduce.
7. Data Variety
- Handles structured, semi-structured, and unstructured data.
- Works with diverse data formats such as text, images, videos, and logs.
8. Open-Source Framework
- Maintained by the Apache Software Foundation, with community contributions and updates.
- Compatible with a broad ecosystem of tools like Hive, Pig, Spark, and more.
Hadoop Key Characteristics
9. Data Locality
- Processes data where it resides to minimize network traffic and improve speed.
- Reduces the need to move data across the network.
10. Security
- Provides mechanisms like Kerberos for authentication.
- Supports data encryption for secure storage and transfer.
- Integrates with external security tools for enhanced protection.
11. Flexibility
- Adaptable to a wide range of use cases, including:
- Data warehousing
- Machine learning
- Log analysis
- ETL (Extract, Transform, Load) processes
12. Ecosystem Integration
- Integrates seamlessly with other big data tools and technologies:
- Hive: SQL-like querying
- Pig: Data transformation
- HBase: NoSQL database
- Spark: Real-time processing
Hadoop Key Characteristics
13. Write Once, Read Many (WORM) Model
- HDFS follows a write-once-read-many model, making it suitable for archival and analysis workloads.
14. Parallel Processing
- Leverages the power of parallelism by executing multiple tasks across nodes simultaneously.
15. Extensibility
- New features and components can be easily integrated into the Hadoop ecosystem due to its modular design.

By combining these characteristics, Hadoop has become a foundational technology for big data storage and
analysis.
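
To make the storage characteristics above concrete, here is a minimal sketch that writes a file to HDFS through Hadoop's Java FileSystem API. The NameNode address and file path are hypothetical, and the replication setting simply makes the default behaviour explicit.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the NameNode (hypothetical host/port).
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        // Ask for 3 replicas per block -- the default that provides fault tolerance.
        conf.set("dfs.replication", "3");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
            // HDFS splits the file into blocks (128 MB by default) and
            // distributes the replicas across DataNodes in the cluster.
            out.writeUTF("Hello, HDFS!");
        }
    }
}
```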



A Brief History of Hadoop
• Hadoop was created by Doug Cutting, the creator of Apache Lucene, the
widely used text search library.
• Hadoop has its origins in Apache Nutch, an open source web search engine,
itself a part of the Lucene project.
• 2004: Initial versions of what are now the Hadoop Distributed Filesystem and
MapReduce were implemented by Doug Cutting and Mike Cafarella.
• Hadoop 1.x: Simple and effective for batch processing with MapReduce, but
limited in scalability and flexibility.
• Hadoop 2.x: Introduced YARN, supporting multiple data processing
frameworks and improving scalability and fault tolerance.
• Hadoop 3.x: Further enhancements with erasure coding, resource efficiency,
and modern features like Docker support.



RDBMS vs HADOOP DIFFERENCES

| Feature | RDBMS | Hadoop |
|---|---|---|
| Data Structure | Stores data in structured tabular format with predefined schema. | Can handle structured, semi-structured, and unstructured data (e.g., text, video, logs). |
| Data Volume | Designed for smaller datasets (GB to TB). | Designed for large-scale datasets (TB to PB and beyond). |
| Scalability | Vertical scaling (adding resources to a single server). | Horizontal scaling (adding more nodes to the cluster). |
| Processing Model | Processes real-time, transactional data with ACID compliance. | Processes batch data using distributed computing (e.g., MapReduce). |
| Fault Tolerance | Limited fault tolerance; relies on backups for data recovery. | Built-in fault tolerance through data replication across nodes (e.g., in HDFS). |
| Schema | Uses a fixed schema; schema must be defined before loading data. | Schema-on-read; data can be loaded without a predefined schema and interpreted later. |
| Cost | Often involves high licensing and hardware costs (e.g., Oracle, SQL Server). | Open-source and cost-effective, designed to run on commodity hardware. |
| Data Query Language | Uses SQL (Structured Query Language) for querying data. | Supports tools like Hive (SQL-like), Pig (data flow), or direct programming (e.g., Java, Python). |
| Data Access Speed | Optimized for quick read/write operations on small datasets. | Optimized for batch processing rather than real-time data access. |
| Hardware Requirements | Requires high-end, specialized hardware. | Runs on low-cost, commodity hardware. |



RDBMS vs HADOOP DIFFERENCES

| Feature | RDBMS | Hadoop |
|---|---|---|
| Data Integrity | Ensures high integrity with ACID properties (Atomicity, Consistency, Isolation, Durability). | Weak in ensuring ACID properties; relies on external tools for transactional processing. |
| Use Cases | Transactional systems (e.g., banking, retail); applications requiring frequent updates. | Large-scale data analysis (e.g., clickstream, social media logs); batch processing and distributed storage. |
| Parallel Processing | Limited parallelism within the system. | Strong parallelism; processes data across multiple nodes in a cluster. |
| Data Growth Management | Struggles with scaling efficiently for exponential data growth. | Easily handles rapid data growth with distributed storage and processing. |
| Security | Strong built-in security mechanisms. | Relies on external tools like Kerberos for enhanced security. |
| Real-Time Processing | Ideal for real-time processing of transactions and queries. | Not designed for real-time processing; better suited for batch-oriented tasks. |
| Learning Curve | Easier for developers familiar with SQL and relational databases. | Requires knowledge of distributed systems, MapReduce, and Hadoop ecosystem tools. |
| Example Technologies | Oracle, MySQL, SQL Server, PostgreSQL, IBM DB2 | Apache Hadoop (HDFS, MapReduce), Hive, Pig, Spark |

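
The Schema row above deserves a concrete contrast. An RDBMS enforces schema-on-write: the table definition exists before any data is loaded. Hadoop tools typically apply schema-on-read: raw records sit in HDFS as-is, and structure is imposed only when an application reads them. A minimal sketch, assuming a hypothetical space-separated log format:

```java
// Schema-on-read: the raw line is stored as-is in HDFS; structure is
// imposed only at read time, and can differ per application.
public final class LogRecord {
    public final String ip;
    public final int statusCode;

    private LogRecord(String ip, int statusCode) {
        this.ip = ip;
        this.statusCode = statusCode;
    }

    // Interpret one raw text line, e.g. "10.0.0.7 GET /index.html 200".
    public static LogRecord parse(String rawLine) {
        String[] fields = rawLine.split(" ");
        return new LogRecord(fields[0], Integer.parseInt(fields[3]));
    }
}
```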


Hadoop Cluster



Hadoop 2.X Cluster Architecture



Hadoop cluster modes



HADOOP Ecosystem / Architecture
§ HDFS, the Hadoop Distributed File System, is the storage layer, organized in terms of files.

§ On top of HDFS, the second core part of a Hadoop implementation is MapReduce, the data
processing framework (see the word-count sketch after this list).

§ YARN stands for Yet Another Resource Negotiator. MapReduce 2, which runs on YARN, is more
commonly used, but it builds on MapReduce 1, so MapReduce 1 remains a good way to learn the
processing framework.

§ HBase is very commonly used to query data through a column-store abstraction on top of
the file system.

§ Hive provides HQL, an SQL-like query language, used to query data stored in Hadoop.

§ Pig is a scripting language used for ETL-like processes: extracting, transforming, and loading.

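
As an illustration of the MapReduce processing model described above, here is the classic word-count job written against Hadoop's Java MapReduce API; the input and output paths are hypothetical:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word across all mappers.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/in"));
        FileOutputFormat.setOutputPath(job, new Path("/data/out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The map tasks run where the input blocks live (data locality), and the reduce tasks aggregate the per-word counts in parallel across the cluster.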


HADOOP Ecosystem / Architecture
§ The Mahout library is for machine learning and predictive analytics.

§ Oozie is for workflow coordination of jobs, and it works in combination with Zookeeper.

§ Zookeeper provides coordination services for distributed applications and groups of jobs.

§ Sqoop is for data exchange between Hadoop and other systems, particularly relational systems
such as SQL Server.

§ Flume is a log collector; it gathers the large volumes of log information that Hadoop jobs
produce, since the jobs are run in batches.

§ Ambari is for provisioning, managing, and monitoring Hadoop clusters.



HADOOP projects



APACHE FLUME

§ Flume is a log collector; it gathers the large volumes of log information that Hadoop jobs and
applications produce, since the jobs are run in batches.

SQOOP [SQL + HADOOP]

§ Sqoop is for data exchange between Hadoop and other systems, particularly relational
systems such as SQL Server.

APACHE PIG
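
As a hedged sketch of the ETL-style scripting described above, Pig Latin statements can also be driven from Java through Pig's PigServer API; the input file, field layout, and output directory here are hypothetical:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEtlExample {
    public static void main(String[] args) throws Exception {
        // Local mode for experimentation; ExecType.MAPREDUCE targets a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Pig Latin: load raw log lines like "10.0.0.7 500", filter, and
        // store the result -- a tiny extract-transform-load pipeline.
        pig.registerQuery("logs = LOAD 'access.log' USING PigStorage(' ') "
                + "AS (ip:chararray, code:int);");
        pig.registerQuery("errors = FILTER logs BY code >= 500;");
        pig.store("errors", "error_logs");
    }
}
```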


APACHE HIVE
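
A minimal sketch of querying Hive from Java over JDBC, assuming a reachable HiveServer2 endpoint and the hive-jdbc driver on the classpath; the hostname, table, and query are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint (hypothetical host; 10000 is the usual port).
        String url = "jdbc:hive2://hiveserver.example.com:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // HiveQL looks like SQL but is compiled into distributed jobs.
             ResultSet rs = stmt.executeQuery(
                     "SELECT level, COUNT(*) AS cnt FROM logs GROUP BY level")) {
            while (rs.next()) {
                System.out.println(rs.getString("level") + ": " + rs.getLong("cnt"));
            }
        }
    }
}
```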


APACHE HBASE



HBASE AND RDBMS DIFFERENCES

| Feature | HBase | RDBMS |
|---|---|---|
| Data Model | Schema-less (NoSQL): data is stored as key-value pairs in column families. | Schema-based (SQL): data is stored in structured tables with predefined schemas. |
| Query Language | Does not use SQL; requires API-based querying (Java, Python, etc.). | Uses SQL (Structured Query Language) for querying data. |
| Structure | Designed for sparse, wide tables with millions of rows and columns. | Designed for structured, normalized tables with relations between them. |
| Scalability | Horizontally scalable: add more nodes to the cluster to handle growth. | Vertically scalable: limited by the capacity of a single server. |
| Data Volume | Handles large-scale datasets (petabytes). | Handles smaller datasets (gigabytes to terabytes). |
| Fault Tolerance | High fault tolerance via HDFS replication and distributed architecture. | Fault tolerance depends on backups and replication, often manual. |
| Transactions | Limited ACID compliance (supports atomic operations at row level). | Fully ACID-compliant (Atomicity, Consistency, Isolation, Durability). |
| Performance | Optimized for random, real-time read/write access to large datasets. | Optimized for complex transactional operations and joins. |
| Consistency Model | Eventual consistency (writes may not appear immediately on all nodes). | Strict consistency (ensures all writes are immediately visible). |


HBASE AND RDBMS DIFFERENCES

| Feature | HBase | RDBMS |
|---|---|---|
| Data Type | Suitable for semi-structured and unstructured data. | Suitable for structured data. |
| Indexing | Does not use traditional indexing; relies on row keys for quick access. | Uses traditional indexing (e.g., primary, secondary keys). |
| Primary Use Case | Real-time access for big data applications like IoT, log data, and analytics. | Transactional systems like banking, ERP, and CRM. |
| Storage System | Built on HDFS, offering distributed storage. | Typically uses local file systems or SAN/NAS storage. |
| Joins and Relations | Does not support complex joins or foreign keys. | Supports joins, foreign keys, and relational constraints. |
| Write Model | Designed for high write throughput. | Designed for write consistency and low write latency. |
| Latency | Low latency for reads/writes of small chunks of data. | Low latency for transactional queries; higher latency for large datasets. |
| Tool Integration | Integrates with Hadoop ecosystem tools (Hive, Spark, Pig). | Integrates with BI tools, reporting software, and legacy applications. |
| Cost | Open-source; runs on commodity hardware. | Can involve licensing costs (e.g., Oracle, Microsoft SQL Server). |
| Examples | Apache HBase, Google Bigtable, Cassandra. | MySQL, PostgreSQL, Oracle, SQL Server. |
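
To illustrate the column-family, key-value model in the tables above, here is a minimal sketch using the standard HBase Java client; the table name, column family, and row-key layout are hypothetical, and a running cluster reachable through hbase-site.xml is assumed:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads connection details (ZooKeeper quorum, etc.) from hbase-site.xml.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("sensor_readings"))) {
            // Write: row key + column family "data" + qualifier "temp".
            Put put = new Put(Bytes.toBytes("device42#2024-11-21"));
            put.addColumn(Bytes.toBytes("data"), Bytes.toBytes("temp"),
                    Bytes.toBytes("21.5"));
            table.put(put);

            // Read it back by row key -- the primary access path in HBase.
            Result result = table.get(new Get(Bytes.toBytes("device42#2024-11-21")));
            byte[] temp = result.getValue(Bytes.toBytes("data"), Bytes.toBytes("temp"));
            System.out.println("temp = " + Bytes.toString(temp));
        }
    }
}
```

Note how access goes through the row key rather than an SQL query, matching the Indexing row in the table above.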


NoSQL Databases



APACHE SPARK



SPARK Ecosystem



ZOOKEEPER



OOZIE



APACHE DRILL



APACHE KAFKA



APACHE MAHOUT



SPARK MLlib



Apache Ambari



Hadoop Ecosystem Summary

| Category | Components |
|---|---|
| Core | HDFS, YARN, MapReduce |
| Data Storage | HBase, Hive, Kudu, HCatalog |
| Processing | Spark, Pig, Tez |
| Data Ingestion | Sqoop, Flume |
| Streaming | Kafka, Storm |
| Workflow | Oozie, Zookeeper |
| Monitoring | Ambari, Ganglia |
| Security | Ranger, Atlas |
| Machine Learning | Mahout, TensorFlow |
| Data Formats | Parquet, ORC, Avro |



HADOOP DISTRIBUTIONS

| Feature | Apache Hadoop | Cloudera | MapR | Amazon EMR | Azure HDInsight | Google Dataproc |
|---|---|---|---|---|---|---|
| Deployment | On-premises | Hybrid | On-prem/cloud | Fully managed | Fully managed | Fully managed |
| Cost | Free | Paid | Paid | Pay-as-you-go | Pay-as-you-go | Pay-as-you-go |
| Ease of Use | Moderate | Advanced | Advanced | Easy | Easy | Easy |
| Key Strength | Open-source | Enterprise-ready | High performance | Cloud integration | Azure integration | Google integration |


HADOOP 1.0 Deployment Architecture



NEED for YARN



HADOOP 2.0 YARN



HDFS Vs YARN



Hadoop 2.X- High Availability



Hadoop 2.X- Resource Management



Hadoop 2.X- Resource Management(Contd)



HADOOP 2.0 Deployment Architecture



HADOOP 2.0 Deployment Components



HADOOP 2.x Workflow

HADOOP 2.x Application Workflow


HADOOP 3.x



Differences between Hadoop 1.x, 2.x and 3.x
1. Architecture and Components

| Feature | Hadoop 1.x | Hadoop 2.x | Hadoop 3.x |
|---|---|---|---|
| Resource Management | JobTracker for resource management and job scheduling | Introduction of YARN (Yet Another Resource Negotiator) | YARN with enhanced features (e.g., resource-efficient containers) |
| HDFS | Basic HDFS with a single NameNode | HDFS Federation and High Availability (HA) support | HDFS supports erasure coding for efficient storage |
| Data Processing | Only MapReduce | Supports MapReduce, Spark, Tez, etc. | Same as Hadoop 2.x |


Differences between Hadoop 1.x, 2.x and 3.x
2. Performance and Scalability

| Feature | Hadoop 1.x | Hadoop 2.x | Hadoop 3.x |
|---|---|---|---|
| Cluster Scalability | Limited scalability | Increased scalability due to YARN | Better scalability with resource-efficient features |
| Fault Tolerance | Limited (no NameNode HA) | High Availability for NameNode | High Availability plus data recovery improvements (erasure coding) |
| Storage Efficiency | Data replication (default: 3x replication) | Same as Hadoop 1.x | Erasure coding reduces storage overhead significantly |


Differences between Hadoop 1.x, 2.x and 3.x
3. New Features in Hadoop 3.x

| Feature | Hadoop 1.x | Hadoop 2.x | Hadoop 3.x |
|---|---|---|---|
| Erasure Coding | Not available | Not available | Reduces storage overhead (replacing 3x replication in some cases) |
| Docker Support | Not available | Limited | Supports Docker containers for isolation |
| Default File Storage Ratio | 3x replication | 3x replication | Up to 1.4x efficiency with erasure coding |
| Multiple Standby NameNodes | Not available | Limited | Supported, improving high availability |
| Resource Management | Basic JobTracker | YARN | YARN with better resource utilization |
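
As a worked example of the storage-ratio row above, assume the Reed-Solomon RS(10,4) erasure-coding policy (one of the built-in policies in Hadoop 3.x) applied to 10 data blocks:

```latex
\text{3x replication: } \frac{3 \times 10 \text{ blocks}}{10 \text{ blocks}} = 3.0\times \text{ raw storage}
\qquad
\text{RS}(10,4)\text{: } \frac{(10 + 4) \text{ blocks}}{10 \text{ blocks}} = 1.4\times \text{ raw storage}
```

The erasure-coded layout can still tolerate the loss of any 4 of the 14 blocks, which is where the table's "up to 1.4x efficiency with erasure coding" figure comes from.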
Differences between Hadoop 1.x, 2.x and 3.x
4. Backward Compatibility

| Aspect | Hadoop 1.x | Hadoop 2.x | Hadoop 3.x |
|---|---|---|---|
| Application Support | Limited | Supports Hadoop 1.x apps | Fully backward compatible |
| Ecosystem Tools | Basic tools | Broader ecosystem | Modernized ecosystem support |


ANY QUESTIONS / DOUBTS

???
