Unit 2: Hadoop
Hadoop
Hadoop is an open-source framework that allows for the distributed storage and
processing of large data sets across clusters of computers using simple
programming models. It is designed to scale up from single servers to thousands
of machines, each offering local computation and storage. Rather than relying on
expensive, high-end hardware, Hadoop is designed to scale out using clusters of
commodity hardware, making it a very cost-effective solution for big data
problems.
Imagine you have a massive library filled with countless books, each representing
a piece of data. Instead of relying on a single librarian to manage this vast
collection, Hadoop acts like a team of librarians, each responsible for a specific
section of the library. This team works together to organize, store, and retrieve
books (data) efficiently, making it easier to find the information you need.
Hadoop distributes the data across multiple computers, allowing for parallel processing. Think of it as having multiple librarians working simultaneously on different sections of the library, processing requests much faster than a single librarian could.
Uses of Hadoop:
Log analysis: Hadoop can be used to analyze large volumes of log data to identify
patterns and trends.
Web analytics: Hadoop can be used to analyze web traffic data to understand
user behavior and improve website performance.
Advantages of Hadoop :
Scalability: Hadoop is a highly scalable platform that can be easily expanded to
handle large amounts of data. This is because Hadoop is designed to distribute
data and processing across multiple nodes, or computers, in a cluster. As the
amount of data grows, more nodes can be added to the cluster to handle the
additional processing load.
Cost-effectiveness: Hadoop is a cost-effective solution for big data problems
because it can run on clusters of commodity hardware. Commodity hardware is
simply the type of hardware that is widely available and relatively inexpensive.
This means that you can build a Hadoop cluster using off-the-shelf hardware,
rather than having to purchase expensive, high-end hardware.
Fault tolerance: Hadoop is a highly fault-tolerant platform that can handle the
failure of any node in the cluster without losing data. This is because Hadoop
replicates data across multiple nodes. If a node fails, the data on that node can
be recovered from another node in the cluster. This ensures that your data is
always available, even if there are hardware failures.
Flexibility: Hadoop can be used to process a wide variety of data types, including
structured, semi-structured, and unstructured data. This makes Hadoop a very
versatile platform that can be used for a wide range of applications.
Large community: Hadoop has a large and active community of users and
developers. This means that there is a wealth of resources available online, such
as documentation, tutorials, and forums. This makes it easy to get help when you
need it.
Maturity: Hadoop is a mature platform that has been around for many years. This
means that it is a stable and reliable platform that you can trust to handle your
data processing needs.
RDBMS vs. Hadoop :
Hadoop Architecture :
Hadoop works on the MapReduce programming model that was introduced by Google. Today many big-brand companies, such as Facebook, Yahoo, Netflix, and eBay, use Hadoop in their organizations to deal with big data. The Hadoop architecture consists of four main components:
MapReduce
HDFS(Hadoop Distributed File System)
YARN(Yet Another Resource Negotiator)
Common Utilities or Hadoop Common
1. MapReduce:
MapReduce is essentially a programming model, an algorithmic pattern rather than a storage system. Its major feature is performing distributed processing in parallel across a Hadoop cluster, which is what makes Hadoop so fast; when you are dealing with Big Data, serial processing is no longer practical. MapReduce has two main tasks, divided phase-wise: Map and Reduce.
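As an illustration, here is a minimal sketch of the classic WordCount job written against the Hadoop MapReduce Java API; the class name and the input/output paths passed on the command line are illustrative, not part of the material above.
Java
// Minimal WordCount sketch using the Hadoop MapReduce Java API.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map task: runs in parallel on each input split and emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce task: receives all counts for one word after the shuffle and sums them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
The Map phase runs in parallel over the input splits, and the Reduce phase aggregates the shuffled key-value pairs, which is exactly the two-task split described above.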
2. HDFS:
HDFS is the storage layer of Hadoop. It is mainly designed to work on commodity (inexpensive) hardware devices and follows a distributed file system design. HDFS is designed in such a way that it prefers storing data in large blocks rather than in many small blocks. HDFS provides fault tolerance and high availability to the storage layer and to the other devices present in the Hadoop cluster.
Data storage nodes in HDFS:
NameNode(Master)
DataNode(Slave)
NameNode: The NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves). It is mainly used for storing metadata, i.e. data about the data, such as the transaction logs that keep track of user activity in the Hadoop cluster.
DataNode: DataNodes work as slaves and are mainly used for storing the data in a Hadoop cluster; the number of DataNodes can range from one to 500 or even more. The more DataNodes there are, the more data the Hadoop cluster can store, so it is advised that DataNodes have high storage capacity to hold a large number of file blocks.
File Block in HDFS: Data in HDFS is always stored in terms of blocks: a single file is divided into multiple blocks of 128 MB each by default, a size you can also change manually.
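As a hedged sketch, the block size of an existing file can be inspected, and a non-default block size requested for new files, through the HDFS Java API; the file paths and the 256 MB value below are illustrative assumptions, and the client is assumed to be configured to talk to the cluster.
Java
// Sketch: inspect the block size of an HDFS file and request 256 MB blocks for new files.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Ask for 256 MB blocks instead of the 128 MB default for files created by this client.
    conf.set("dfs.blocksize", "268435456");

    FileSystem fs = FileSystem.get(conf);
    Path existing = new Path("/user/demo/big.log");   // hypothetical path
    long blockSize = fs.getFileStatus(existing).getBlockSize();
    System.out.println("Block size in bytes: " + blockSize);

    // New files written through this client now use the 256 MB block size.
    try (FSDataOutputStream out = fs.create(new Path("/user/demo/new.log"))) {
      out.writeUTF("hello hdfs");
    }
    fs.close();
  }
}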
Replication in HDFS: Replication ensures the availability of the data. Replication means making copies of something, and the number of copies kept of a particular block is expressed as its Replication Factor. Just as HDFS stores data in the form of file blocks, Hadoop is also configured to make copies of those file blocks.
By default, the Replication Factor for Hadoop is set to 3, and it can be changed manually as per your requirements. For example, if a file is split into 4 blocks and the replication factor is 3, then 3 copies of each block are kept, i.e. a total of 4 × 3 = 12 blocks for backup purposes.
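A small sketch of how the replication factor can be read and changed for a single file through the FileSystem Java API follows; the file path and the new factor of 4 are illustrative assumptions.
Java
// Sketch: check and change the replication factor of one HDFS file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Default replication for files created by this client (3 is the HDFS default).
    conf.set("dfs.replication", "3");

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/sample.txt");   // hypothetical path

    FileStatus status = fs.getFileStatus(file);
    System.out.println("Current replication: " + status.getReplication());

    // Raise the replication factor of this one file to 4.
    fs.setReplication(file, (short) 4);
    fs.close();
  }
}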
Rack Awareness: A rack is simply a physical collection of nodes in the Hadoop cluster (maybe 30 to 40). A large Hadoop cluster consists of many racks, and with the help of this rack information the NameNode chooses the closest DataNode to achieve maximum performance.
3. YARN(Yet Another Resource Negotiator):
YARN is the framework on which MapReduce works. YARN performs two operations: job scheduling and resource management. The purpose of the job scheduler is to divide a big task into small jobs so that each job can be assigned to various slaves in the Hadoop cluster and processing can be maximized. The job scheduler also keeps track of which job is important, which job has higher priority, dependencies between jobs, and other information such as job timing. The resource manager manages all the resources that are made available for running the Hadoop cluster.
Features of YARN
Multi-Tenancy
Scalability
Cluster-Utilization
Compatibility
4. Hadoop Common or Common Utilities :
Hadoop Common (the common utilities) is the set of Java libraries, Java files, and scripts needed by all the other components present in a Hadoop cluster. Hadoop Common assumes that hardware failure in a Hadoop cluster is common, so it needs to be handled automatically in software by the Hadoop framework.
Goals of HDFS :
Store and manage large datasets: HDFS is designed to store and manage
large datasets, typically ranging from gigabytes to petabytes in size.
Provide high-throughput data access: HDFS is designed to provide
high-throughput data access, which means that it can transfer large
amounts of data quickly.
Be highly available: HDFS is designed to be highly available, meaning that it
is always accessible to users.
Be scalable: HDFS is designed to be scalable, meaning that it can be easily
expanded to accommodate growing data volumes.
Be cost-effective: HDFS is designed to be cost-effective, meaning that it can
be deployed and maintained on commodity hardware.
Fault tolerance: HDFS is highly fault-tolerant, meaning that it can continue
to operate even if some of its nodes fail. This is achieved by replicating data
across multiple nodes in the cluster.
Data streaming: HDFS is designed for streaming data access, which means
that data can be read and written to HDFS without having to load the entire
file into memory at once. This makes it well-suited for processing large
datasets.
Commodity hardware: HDFS is designed to run on commodity hardware,
which makes it inexpensive to deploy and maintain.
Basic HDFS Shell Commands :
--- copyFromLocal (or) put: To copy files/folders from the local file system to the HDFS store. This is the most important command. Local file system means the files present on the OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
--- copyToLocal (or) get: To copy files/folders from the HDFS store to the local file system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
--- rmr: This command deletes a file or directory from HDFS recursively (equivalent to -rm -r in newer releases). It is a very useful command when you want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
--- dus: This command gives the total size of a directory/file.
Syntax:
bin/hdfs dfs -dus <dirName>
Ways to Access HDFS (Library Analogy) :
Librarian: The librarian is like the Java API. They have a deep understanding of the library's organization and can perform various tasks, such as finding specific books, checking them out, and adding new books (see the Java sketch after this list).
Shelf Map: The shelf map is like the native C API. It provides a detailed layout of
the library, allowing you to locate specific books directly.
Website: The library's website is like WebHDFS. It allows you to access and
manage your books remotely using a web browser.
Card Catalog: The card catalog is like FS Shell. It's a simple way to find books by
title or author.
Special Requests Desk: The special requests desk is like third-party APIs. They can
provide additional services or access to rare books that may not be available
through regular channels.
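To connect the analogy to real code, here is a hedged sketch of the Java API performing the same copyFromLocal / copyToLocal operations as the shell commands above; all local and HDFS paths are illustrative, and the client is assumed to be configured to reach the cluster.
Java
// Sketch: HDFS copy operations through the FileSystem Java API (the "librarian").
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Equivalent of: bin/hdfs dfs -copyFromLocal /tmp/report.csv /user/demo/report.csv
    fs.copyFromLocalFile(new Path("/tmp/report.csv"),
                         new Path("/user/demo/report.csv"));

    // Equivalent of: bin/hdfs dfs -copyToLocal /user/demo/report.csv /tmp/out/report.csv
    fs.copyToLocalFile(new Path("/user/demo/report.csv"),
                       new Path("/tmp/out/report.csv"));

    fs.close();
  }
}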
Hadoop 1
Master-slave architecture: A single master node controls all the slave nodes in
the cluster. This can create a bottleneck if the master node becomes overloaded.
MapReduce: Hadoop 1 uses MapReduce for both resource management and data
processing. This can make it difficult to scale and manage complex jobs.
Limited scalability: Hadoop 1 can only scale up to a few thousand nodes.
Hadoop 2
Resource Manager (RM) and Node Manager (NM): YARN splits resource management between a cluster-wide ResourceManager and per-node NodeManagers, so resource management is decoupled from job execution and the single master is no longer a bottleneck.
Yet Another Resource Negotiator (YARN): YARN allows multiple processing
frameworks, such as Spark and MapReduce, to run on the same cluster.
In simpler terms, Hadoop 1 is like a small town with a single mayor, while
Hadoop 2 is like a large city with a mayor and a city council.
Here is a table that summarizes the key differences between Hadoop 1 and Hadoop 2:
Aspect              | Hadoop 1                                   | Hadoop 2
Resource management | Handled by MapReduce itself                | Handled by YARN (ResourceManager and NodeManagers)
Processing models   | MapReduce only                             | Multiple frameworks (e.g. MapReduce, Spark) on one cluster
Scalability         | Limited to a few thousand nodes            | Scales to much larger clusters
Master architecture | Single master node, a potential bottleneck | Resource management separated from job scheduling
Hadoop Ecosystem :
Just as a team of librarians and assistants can manage and analyze even the largest library, the Hadoop ecosystem is a team of tools built around core Hadoop, providing the framework needed to store, process, and understand big data.
Data Streaming -
Data streaming in Hadoop refers to the process of continuously ingesting and
processing data as it is generated, rather than batch processing it after it has
been stored. This allows for near real-time analysis and decision-making.
Hadoop Streaming is a tool that enables you to write MapReduce programs in any
programming language that can read from standard input (STDIN) and write to
standard output (STDOUT). This makes it possible to use Hadoop to process data
from a variety of sources, including databases, web services, and sensors.
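For illustration, here is a hedged sketch of what a Hadoop Streaming "mapper" can look like: any executable that reads lines from STDIN and writes tab-separated key-value pairs to STDOUT will do. It is written in Java here only to keep all examples in one language (scripting languages are more common for Streaming in practice), and the class name is an assumption.
Java
// Hypothetical StreamingWordMapper: reads lines from STDIN and writes
// tab-separated (word, 1) pairs to STDOUT, which is all Hadoop Streaming requires.
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class StreamingWordMapper {
  public static void main(String[] args) throws Exception {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    String line;
    while ((line = in.readLine()) != null) {
      for (String word : line.trim().split("\\s+")) {
        if (!word.isEmpty()) {
          System.out.println(word + "\t1"); // Streaming convention: key<TAB>value
        }
      }
    }
  }
}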
Micro-batching: Besides true record-at-a-time stream processing, another approach is micro-batching, which breaks a data stream into small batches that are then processed using a traditional batch processing framework such as Hadoop MapReduce. This approach has higher latency than true stream processing, but it can be more efficient for certain types of workloads.
Data streaming is a powerful tool for analyzing real-time data. It can be used for a
variety of applications, such as fraud detection, anomaly detection, and real-time
recommendations.
Here is how data typically flows through a Hadoop Streaming job:
Data arrives from various sources, such as sensors, web services, or social
media feeds.
Data is divided into smaller streams.
Each stream is processed by a separate "mapper" program.
The mapper program breaks down the data into key-value pairs.
The key-value pairs are shuffled and distributed to "reducer" programs.
The reducer programs aggregate the key-value pairs and produce final
results.
The final results are stored and can be analyzed or used for further
processing.
Real-life example -
Imagine you're making a sandwich. You start by gathering all the ingredients –
bread, cheese, lettuce, tomato, and mayonnaise. This is like data ingestion, where
you collect data from various sources.
Next, you assemble the sandwich by placing the ingredients on the bread in the
desired order. This is like data transformation, where you prepare the data for
processing.
Then, you take a bite of the sandwich and enjoy the combination of flavors. This is
like data analysis, where you extract insights from the processed data.
Finally, you wrap up the leftovers and put them in the fridge for later. This is like data storage, where you save the analyzed data for future reference or further processing.
Data flow in data streaming is similar to making a sandwich. It involves a series of
steps that transform raw data into useful insights, just like turning simple
ingredients into a delicious meal.
The general data flow in data streaming in Hadoop involves the following steps:
Data Ingestion: Data is collected from various sources, such as sensors, web
services, or social media feeds.
Data Processing: The ingested data is transformed, cleaned, and enriched to
prepare it for analysis.
Data Analysis: Real-time insights are extracted from the processed data using
stream processing techniques.
Data Storage: The analyzed data is stored in a database or data lake for future
reference or further processing.
Data Models in Data Streaming
Data models define the structure and organization of data within a stream
processing system. They provide a framework for representing and manipulating
data in a way that is consistent and efficient.
Real-life examples -
Tuple-based model:
Imagine a list of grocery items, where each item is represented by a tuple of
(name, quantity, price). For example, ("apple", 2, 1.29). This model is well-suited
for simple data structures, where each data element has a clear name and value.
Event-based model:
Think of a social media feed, where each post is represented as an event with a
timestamp, sender, and content. For instance, { "timestamp":
"2023-11-13T16:42:21Z", "sender": "Bard", "content": "Explaining data models in
simpler terms." }. This model is flexible for complex data with timestamps and
multiple attributes.
Stream-based model:
Consider a temperature sensor continuously sending readings. The data is
represented as a stream of temperature values, each accompanied by a
timestamp. For example, [23.4, 23.5, 23.6, 23.7]. This model is efficient for
handling large volumes of continuously generated data.
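As a hedged sketch (assuming a recent JDK with record support), the three models above can be written down as simple Java types; all class and field names are illustrative.
Java
// Sketch: the three stream data models expressed as simple Java types.
import java.time.Instant;
import java.util.List;

public class StreamDataModels {
  // Tuple-based: a fixed, ordered set of fields.
  record GroceryItem(String name, int quantity, double price) {}

  // Event-based: a timestamped record with named attributes.
  record SocialPost(Instant timestamp, String sender, String content) {}

  public static void main(String[] args) {
    GroceryItem item = new GroceryItem("apple", 2, 1.29);

    SocialPost post = new SocialPost(
        Instant.parse("2023-11-13T16:42:21Z"), "Bard",
        "Explaining data models in simpler terms.");

    // Stream-based: an unbounded sequence of readings; a list stands in here.
    List<Double> temperatures = List.of(23.4, 23.5, 23.6, 23.7);

    System.out.println(item + "\n" + post + "\n" + temperatures);
  }
}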
Apache Flume
Apache Flume is a distributed, reliable, and highly available service for efficiently
collecting, aggregating, and moving large amounts of streaming event data. It is a
flexible, scalable, and fault-tolerant platform that can handle a wide variety of
data sources and sinks.
Features of Apache Flume
Distributed and Scalable: Flume's architecture is designed for distributed
operation, allowing it to handle large volumes of data with ease. It can be easily
scaled up or down by adding or removing nodes.
Reliable and Fault-tolerant: Flume uses a variety of techniques to ensure data
reliability, including replication, checkpoints, and failover mechanisms. This
ensures that data is not lost even in the event of node failures.
Flexible and Extensible: Flume supports a wide variety of data sources and sinks,
including Kafka, HDFS, and Amazon S3. It also has a rich ecosystem of plugins that
extend its functionality.
Customizability: Flume's components are pluggable, allowing you to customize
data processing and routing.
Ease of Use: Flume is relatively easy to configure and manage, making it accessible to a broad range of users.
Architecture of Apache Flume:
Agents: Flume agents are the building blocks of the system. Each agent
represents a data flow and consists of three main components:
Source: The source ingests data from external sources like Twitter feeds,
log files, or sensors.
Channel: The channel temporarily buffers the ingested data before it is
consumed by sinks.
Sink: The sink delivers the data to a destination system like HDFS, Kafka, or
Elasticsearch.
Channels: Channels act as message queues, providing temporary storage for data
in transit between sources and sinks. They ensure that data is not lost if a sink is
unavailable or overloaded.
Collectors: Collectors are optional components that aggregate data from multiple
agents and send it to a central sink. They are useful when multiple agents need to
deliver data to a single destination.
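A minimal single-agent configuration sketch follows, showing one source, one channel, and one sink wired together; the agent name a1, the component names, the port, and the HDFS path are all illustrative assumptions.
# Sketch of a Flume agent configuration: a netcat source feeding an HDFS sink
# through a memory channel.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for lines of text on a local TCP port.
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory between source and sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: write the events into HDFS.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events

# Wire the pieces together.
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Such a configuration file is typically passed to the flume-ng agent command, together with the agent name, to start the data flow.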
Installation of Hadoop
1. Download Hadoop: Download the latest stable release of Hadoop from the Apache Hadoop website.
2. Extract Hadoop: Extract the downloaded Hadoop archive to a directory of your choice. For example, you can extract it to /opt/hadoop.
3. Configure Hadoop: Edit the /opt/hadoop/etc/hadoop/core-site.xml file (the conf/ directory in older releases) and set the fs.defaultFS property to file:///. This tells Hadoop to use the local file system for storage.
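The edit described in step 3 would look roughly like this in core-site.xml (a sketch; any other properties already in the file are left as they are):
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>file:///</value>
  </property>
</configuration>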
4. Set Hadoop environment variables: Edit your ~/.bashrc or ~/.zshrc file and add the following lines to set the Hadoop environment variables:
Bash
export HADOOP_HOME=/opt/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
# Hadoop also expects JAVA_HOME to point at your installed JDK.
5. Verify the installation: Start the Hadoop daemons and then list the running Java processes:
Bash
$ $HADOOP_HOME/sbin/start-all.sh
$ jps
(jps ships with the JDK and lists the running Java processes, so you can confirm that the Hadoop daemons have started.)