Unit 2: Hadoop
Hadoop
Hadoop is an open-source framework that allows for the distributed storage and
processing of large data sets across clusters of computers using simple
programming models. It is designed to scale up from single servers to thousands
of machines, each offering local computation and storage. Rather than relying on
expensive, high-end hardware, Hadoop is designed to scale out using clusters of
commodity hardware, making it a very cost-effective solution for big data
problems.
Imagine you have a massive library filled with countless books, each representing
a piece of data. Instead of relying on a single librarian to manage this vast
collection, Hadoop acts like a team of librarians, each responsible for a specific
section of the library. This team works together to organize, store, and retrieve
books (data) efficiently, making it easier to find the information you need.
Hadoop distributes the data across multiple computers, allowing for parallel processing. Think of it as having multiple librarians working simultaneously on different sections of the library, processing requests much faster than a single librarian could.
Uses of Hadoop:
Log analysis: Hadoop can be used to analyze large volumes of log data to identify
patterns and trends.
Web analytics: Hadoop can be used to analyze web traffic data to understand
user behavior and improve website performance.
Advantages of Hadoop :
Scalability: Hadoop is a highly scalable platform that can be easily expanded to
handle large amounts of data. This is because Hadoop is designed to distribute
data and processing across multiple nodes, or computers, in a cluster. As the
amount of data grows, more nodes can be added to the cluster to handle the
additional processing load.
Cost-effectiveness: Hadoop is a cost-effective solution for big data problems
because it can run on clusters of commodity hardware. Commodity hardware is
simply the type of hardware that is widely available and relatively inexpensive.
This means that you can build a Hadoop cluster using off-the-shelf hardware,
rather than having to purchase expensive, high-end hardware.
Fault tolerance: Hadoop is a highly fault-tolerant platform that can handle the
failure of any node in the cluster without losing data. This is because Hadoop
replicates data across multiple nodes. If a node fails, the data on that node can
be recovered from another node in the cluster. This ensures that your data is
always available, even if there are hardware failures.
Flexibility: Hadoop can be used to process a wide variety of data types, including
structured, semi-structured, and unstructured data. This makes Hadoop a very
versatile platform that can be used for a wide range of applications.
Large community: Hadoop has a large and active community of users and
developers. This means that there is a wealth of resources available online, such
as documentation, tutorials, and forums. This makes it easy to get help when you
need it.
Maturity: Hadoop is a mature platform that has been around for many years. This
means that it is a stable and reliable platform that you can trust to handle your
data processing needs.
RDBMS vs. Hadoop :
Hadoop Architecture :
Hadoop works on the MapReduce programming model that was introduced by Google. Today many big-brand companies, such as Facebook, Yahoo, Netflix, and eBay, use Hadoop in their organizations to deal with big data. The Hadoop architecture consists of four main components:
MapReduce
HDFS(Hadoop Distributed File System)
YARN(Yet Another Resource Negotiator)
Common Utilities or Hadoop Common
1. MapReduce:
MapReduce is essentially a programming model, an algorithmic pattern rather than a storage system. Its major feature is performing distributed processing in parallel across a Hadoop cluster, which is what makes Hadoop so fast; when you are dealing with Big Data, serial processing is no longer practical. MapReduce has two main tasks, divided phase-wise: Map and Reduce.
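As an illustration, here is a minimal sketch of the classic WordCount job written against the Hadoop MapReduce Java API; the class name and the input/output paths passed on the command line are illustrative, not part of the material above.
Java
// Minimal WordCount sketch using the Hadoop MapReduce Java API.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map task: runs in parallel on each input split and emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce task: receives all counts for one word after the shuffle and sums them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
The Map phase runs in parallel over the input splits, and the Reduce phase aggregates the shuffled key-value pairs, which is exactly the two-task split described above.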
2. HDFS:
HDFS is the storage layer of Hadoop. It is mainly designed to work on commodity (inexpensive) hardware devices and follows a distributed file system design. HDFS is designed in such a way that it prefers storing data in large blocks rather than in many small blocks. HDFS provides fault tolerance and high availability to the storage layer and to the other devices present in the Hadoop cluster.
Data storage nodes in HDFS:
NameNode(Master)
DataNode(Slave)
NameNode: The NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves). It is mainly used for storing metadata, i.e. data about the data, such as the transaction logs that keep track of user activity in the Hadoop cluster.
DataNode: DataNodes work as slaves and are mainly used for storing the data in a Hadoop cluster; the number of DataNodes can range from one to 500 or even more. The more DataNodes there are, the more data the Hadoop cluster can store, so it is advised that DataNodes have high storage capacity to hold a large number of file blocks.
File Block in HDFS: Data in HDFS is always stored in terms of blocks: a single file is divided into multiple blocks of 128 MB each by default, a size you can also change manually.
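As a hedged sketch, the block size of an existing file can be inspected, and a non-default block size requested for new files, through the HDFS Java API; the file paths and the 256 MB value below are illustrative assumptions, and the client is assumed to be configured to talk to the cluster.
Java
// Sketch: inspect the block size of an HDFS file and request 256 MB blocks for new files.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Ask for 256 MB blocks instead of the 128 MB default for files created by this client.
    conf.set("dfs.blocksize", "268435456");

    FileSystem fs = FileSystem.get(conf);
    Path existing = new Path("/user/demo/big.log");   // hypothetical path
    long blockSize = fs.getFileStatus(existing).getBlockSize();
    System.out.println("Block size in bytes: " + blockSize);

    // New files written through this client now use the 256 MB block size.
    try (FSDataOutputStream out = fs.create(new Path("/user/demo/new.log"))) {
      out.writeUTF("hello hdfs");
    }
    fs.close();
  }
}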
Replication in HDFS: Replication ensures the availability of the data. Replication means making copies of something, and the number of copies kept of a particular block is expressed as its Replication Factor. Just as HDFS stores data in the form of file blocks, Hadoop is also configured to make copies of those file blocks.
By default, the Replication Factor for Hadoop is set to 3, and it can be changed manually as per your requirements. For example, if a file is split into 4 blocks and the replication factor is 3, then 3 copies of each block are kept, i.e. a total of 4 × 3 = 12 blocks for backup purposes.
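A small sketch of how the replication factor can be read and changed for a single file through the FileSystem Java API follows; the file path and the new factor of 4 are illustrative assumptions.
Java
// Sketch: check and change the replication factor of one HDFS file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Default replication for files created by this client (3 is the HDFS default).
    conf.set("dfs.replication", "3");

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/sample.txt");   // hypothetical path

    FileStatus status = fs.getFileStatus(file);
    System.out.println("Current replication: " + status.getReplication());

    // Raise the replication factor of this one file to 4.
    fs.setReplication(file, (short) 4);
    fs.close();
  }
}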
Rack Awareness: A rack is simply a physical collection of nodes in the Hadoop cluster (maybe 30 to 40). A large Hadoop cluster consists of many racks, and with the help of this rack information the NameNode chooses the closest DataNode to achieve maximum performance.
3. YARN(Yet Another Resource Negotiator):
YARN is the framework on which MapReduce works. YARN performs two operations: job scheduling and resource management. The purpose of the job scheduler is to divide a big task into small jobs so that each job can be assigned to various slaves in the Hadoop cluster and processing can be maximized. The job scheduler also keeps track of which job is important, which job has higher priority, dependencies between jobs, and other information such as job timing. The resource manager manages all the resources that are made available for running the Hadoop cluster.
Features of YARN
Multi-Tenancy
Scalability
Cluster-Utilization
Compatibility
4. Hadoop Common or Common Utilities :
Hadoop Common (the common utilities) is the set of Java libraries, Java files, and scripts needed by all the other components present in a Hadoop cluster. Hadoop Common assumes that hardware failure in a Hadoop cluster is common, so it needs to be handled automatically in software by the Hadoop framework.
Goals of HDFS :
Store and manage large datasets: HDFS is designed to store and manage
large datasets, typically ranging from gigabytes to petabytes in size.
Provide high-throughput data access: HDFS is designed to provide
high-throughput data access, which means that it can transfer large
amounts of data quickly.
Be highly available: HDFS is designed to be highly available, meaning that it
is always accessible to users.
Be scalable: HDFS is designed to be scalable, meaning that it can be easily
expanded to accommodate growing data volumes.
Be cost-effective: HDFS is designed to be cost-effective, meaning that it can
be deployed and maintained on commodity hardware.
Fault tolerance: HDFS is highly fault-tolerant, meaning that it can continue
to operate even if some of its nodes fail. This is achieved by replicating data
across multiple nodes in the cluster.
Data streaming: HDFS is designed for streaming data access, which means
that data can be read and written to HDFS without having to load the entire
file into memory at once. This makes it well-suited for processing large
datasets.
Commodity hardware: HDFS is designed to run on commodity hardware,
which makes it inexpensive to deploy and maintain.
Basic HDFS Shell Commands :
--- copyFromLocal (or) put: To copy files/folders from the local file system to the HDFS store. This is the most important command. Local file system means the files present on the OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
--- copyToLocal (or) get: To copy files/folders from the HDFS store to the local file system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
--- rmr: This command deletes a file or directory from HDFS recursively (equivalent to -rm -r in newer releases). It is a very useful command when you want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
--- dus: This command gives the total size of a directory/file.
Syntax:
bin/hdfs dfs -dus <dirName>
Ways to Access HDFS (Library Analogy) :
Librarian: The librarian is like the Java API. They have a deep understanding of the library's organization and can perform various tasks, such as finding specific books, checking them out, and adding new books (see the Java sketch after this list).
Shelf Map: The shelf map is like the native C API. It provides a detailed layout of
the library, allowing you to locate specific books directly.
Website: The library's website is like WebHDFS. It allows you to access and
manage your books remotely using a web browser.
Card Catalog: The card catalog is like FS Shell. It's a simple way to find books by
title or author.
Special Requests Desk: The special requests desk is like third-party APIs. They can
provide additional services or access to rare books that may not be available
through regular channels.
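To connect the analogy to real code, here is a hedged sketch of the Java API performing the same copyFromLocal / copyToLocal operations as the shell commands above; all local and HDFS paths are illustrative, and the client is assumed to be configured to reach the cluster.
Java
// Sketch: HDFS copy operations through the FileSystem Java API (the "librarian").
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Equivalent of: bin/hdfs dfs -copyFromLocal /tmp/report.csv /user/demo/report.csv
    fs.copyFromLocalFile(new Path("/tmp/report.csv"),
                         new Path("/user/demo/report.csv"));

    // Equivalent of: bin/hdfs dfs -copyToLocal /user/demo/report.csv /tmp/out/report.csv
    fs.copyToLocalFile(new Path("/user/demo/report.csv"),
                       new Path("/tmp/out/report.csv"));

    fs.close();
  }
}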
Hadoop 1
Master-slave architecture: A single master node controls all the slave nodes in
the cluster. This can create a bottleneck if the master node becomes overloaded.
MapReduce: Hadoop 1 uses MapReduce for both resource management and data
processing. This can make it difficult to scale and manage complex jobs.
Limited scalability: Hadoop 1 can only scale up to a few thousand nodes.
Hadoop 2
Resource Manager (RM) and Node Manager (NM): YARN splits resource management between a cluster-wide ResourceManager and per-node NodeManagers, so resource management is decoupled from job execution and the single master is no longer a bottleneck.
Yet Another Resource Negotiator (YARN): YARN allows multiple processing
frameworks, such as Spark and MapReduce, to run on the same cluster.
In simpler terms, Hadoop 1 is like a small town with a single mayor, while
Hadoop 2 is like a large city with a mayor and a city council.
Here is a table that summarizes the key differences between Hadoop 1 and Hadoop 2:
Aspect              | Hadoop 1                                   | Hadoop 2
Resource management | Handled by MapReduce itself                | Handled by YARN (ResourceManager and NodeManagers)
Processing models   | MapReduce only                             | Multiple frameworks (e.g. MapReduce, Spark) on one cluster
Scalability         | Limited to a few thousand nodes            | Scales to much larger clusters
Master architecture | Single master node, a potential bottleneck | Resource management separated from job scheduling
Hadoop Ecosystem :
Just as a team of librarians and assistants can manage and analyze even the largest library, the Hadoop ecosystem is a team of tools built around core Hadoop, providing the framework needed to store, process, and understand big data.
Data Streaming -
Data streaming in Hadoop refers to the process of continuously ingesting and
processing data as it is generated, rather than batch processing it after it has
been stored. This allows for near real-time analysis and decision-making.
Hadoop Streaming is a tool that enables you to write MapReduce programs in any
programming language that can read from standard input (STDIN) and write to
standard output (STDOUT). This makes it possible to use Hadoop to process data
from a variety of sources, including databases, web services, and sensors.
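For illustration, here is a hedged sketch of what a Hadoop Streaming "mapper" can look like: any executable that reads lines from STDIN and writes tab-separated key-value pairs to STDOUT will do. It is written in Java here only to keep all examples in one language (scripting languages are more common for Streaming in practice), and the class name is an assumption.
Java
// Hypothetical StreamingWordMapper: reads lines from STDIN and writes
// tab-separated (word, 1) pairs to STDOUT, which is all Hadoop Streaming requires.
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class StreamingWordMapper {
  public static void main(String[] args) throws Exception {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    String line;
    while ((line = in.readLine()) != null) {
      for (String word : line.trim().split("\\s+")) {
        if (!word.isEmpty()) {
          System.out.println(word + "\t1"); // Streaming convention: key<TAB>value
        }
      }
    }
  }
}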
Micro-batching: Besides true record-at-a-time stream processing, another approach is micro-batching, which breaks a data stream into small batches that are then processed using a traditional batch processing framework such as Hadoop MapReduce. This approach has higher latency than true stream processing, but it can be more efficient for certain types of workloads.
Data streaming is a powerful tool for analyzing real-time data. It can be used for a
variety of applications, such as fraud detection, anomaly detection, and real-time
recommendations.
Here is how data typically flows through a Hadoop Streaming job:
Data arrives from various sources, such as sensors, web services, or social
media feeds.
Data is divided into smaller streams.
Each stream is processed by a separate "mapper" program.
The mapper program breaks down the data into key-value pairs.
The key-value pairs are shuffled and distributed to "reducer" programs.
The reducer programs aggregate the key-value pairs and produce final
results.
The final results are stored and can be analyzed or used for further
processing.
Real-life example -
Imagine you're making a sandwich. You start by gathering all the ingredients –
bread, cheese, lettuce, tomato, and mayonnaise. This is like data ingestion, where
you collect data from various sources.
Next, you assemble the sandwich by placing the ingredients on the bread in the
desired order. This is like data transformation, where you prepare the data for
processing.
Then, you take a bite of the sandwich and enjoy the combination of flavors. This is
like data analysis, where you extract insights from the processed data.
Finally, you wrap up the leftovers and put them in the fridge for later. This is like data storage, where you save the analyzed data for future reference or further processing.
Data flow in data streaming is similar to making a sandwich. It involves a series of
steps that transform raw data into useful insights, just like turning simple
ingredients into a delicious meal.
The general data flow in data streaming in Hadoop involves the following steps:
Data Ingestion: Data is collected from various sources, such as sensors, web
services, or social media feeds.
Data Processing: The ingested data is transformed, cleaned, and enriched to
prepare it for analysis.
Data Analysis: Real-time insights are extracted from the processed data using
stream processing techniques.
Data Storage: The analyzed data is stored in a database or data lake for future
reference or further processing.
Data Models in Data Streaming
Data models define the structure and organization of data within a stream
processing system. They provide a framework for representing and manipulating
data in a way that is consistent and efficient.
Real-life examples -
Tuple-based model:
Imagine a list of grocery items, where each item is represented by a tuple of
(name, quantity, price). For example, ("apple", 2, 1.29). This model is well-suited
for simple data structures, where each data element has a clear name and value.
Event-based model:
Think of a social media feed, where each post is represented as an event with a
timestamp, sender, and content. For instance, { "timestamp":
"2023-11-13T16:42:21Z", "sender": "Bard", "content": "Explaining data models in
simpler terms." }. This model is flexible for complex data with timestamps and
multiple attributes.
Stream-based model:
Consider a temperature sensor continuously sending readings. The data is
represented as a stream of temperature values, each accompanied by a
timestamp. For example, [23.4, 23.5, 23.6, 23.7]. This model is efficient for
handling large volumes of continuously generated data.
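As a hedged sketch (assuming a recent JDK with record support), the three models above can be written down as simple Java types; all class and field names are illustrative.
Java
// Sketch: the three stream data models expressed as simple Java types.
import java.time.Instant;
import java.util.List;

public class StreamDataModels {
  // Tuple-based: a fixed, ordered set of fields.
  record GroceryItem(String name, int quantity, double price) {}

  // Event-based: a timestamped record with named attributes.
  record SocialPost(Instant timestamp, String sender, String content) {}

  public static void main(String[] args) {
    GroceryItem item = new GroceryItem("apple", 2, 1.29);

    SocialPost post = new SocialPost(
        Instant.parse("2023-11-13T16:42:21Z"), "Bard",
        "Explaining data models in simpler terms.");

    // Stream-based: an unbounded sequence of readings; a list stands in here.
    List<Double> temperatures = List.of(23.4, 23.5, 23.6, 23.7);

    System.out.println(item + "\n" + post + "\n" + temperatures);
  }
}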
Apache Flume
Apache Flume is a distributed, reliable, and highly available service for efficiently
collecting, aggregating, and moving large amounts of streaming event data. It is a
flexible, scalable, and fault-tolerant platform that can handle a wide variety of
data sources and sinks.
Features of Apache Flume
Distributed and Scalable: Flume's architecture is designed for distributed
operation, allowing it to handle large volumes of data with ease. It can be easily
scaled up or down by adding or removing nodes.
Reliable and Fault-tolerant: Flume uses a variety of techniques to ensure data
reliability, including replication, checkpoints, and failover mechanisms. This
ensures that data is not lost even in the event of node failures.
Flexible and Extensible: Flume supports a wide variety of data sources and sinks,
including Kafka, HDFS, and Amazon S3. It also has a rich ecosystem of plugins that
extend its functionality.
Customizability: Flume's components are pluggable, allowing you to customize
data processing and routing.
Ease of Use: Flume is relatively easy to configure and manage, making it accessible to a broad range of users.
Architecture of Apache Flume:
Agents: Flume agents are the building blocks of the system. Each agent
represents a data flow and consists of three main components:
Source: The source ingests data from external sources like Twitter feeds,
log files, or sensors.
Channel: The channel temporarily buffers the ingested data before it is
consumed by sinks.
Sink: The sink delivers the data to a destination system like HDFS, Kafka, or
Elasticsearch.
Channels: Channels act as message queues, providing temporary storage for data
in transit between sources and sinks. They ensure that data is not lost if a sink is
unavailable or overloaded.
Collectors: Collectors are optional components that aggregate data from multiple
agents and send it to a central sink. They are useful when multiple agents need to
deliver data to a single destination.
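A minimal single-agent configuration sketch follows, showing one source, one channel, and one sink wired together; the agent name a1, the component names, the port, and the HDFS path are all illustrative assumptions.
# Sketch of a Flume agent configuration: a netcat source feeding an HDFS sink
# through a memory channel.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for lines of text on a local TCP port.
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory between source and sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: write the events into HDFS.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events

# Wire the pieces together.
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Such a configuration file is typically passed to the flume-ng agent command, together with the agent name, to start the data flow.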
Installation of Hadoop
1. Download Hadoop: Download the latest stable release of Hadoop from the Apache Hadoop website.
2. Extract Hadoop: Extract the downloaded Hadoop archive to a directory of your choice. For example, you can extract it to /opt/hadoop.
3. Configure Hadoop: Edit the /opt/hadoop/etc/hadoop/core-site.xml file (the conf/ directory in older releases) and set the fs.defaultFS property to file:///. This tells Hadoop to use the local file system for storage.
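The edit described in step 3 would look roughly like this in core-site.xml (a sketch; any other properties already in the file are left as they are):
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>file:///</value>
  </property>
</configuration>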
4. Set Hadoop environment variables: Edit your ~/.bashrc or ~/.zshrc file and add the following lines to set the Hadoop environment variables:
Bash
export HADOOP_HOME=/opt/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
# Hadoop also expects JAVA_HOME to point at your installed JDK.
5. Verify the installation: Start the Hadoop daemons and then list the running Java processes:
Bash
$ $HADOOP_HOME/sbin/start-all.sh
$ jps
(jps ships with the JDK and lists the running Java processes, so you can confirm that the Hadoop daemons have started.)