
Unit - 2

Hadoop
Hadoop is an open-source framework that allows for the distributed storage and
processing of large data sets across clusters of computers using simple
programming models. It is designed to scale up from single servers to thousands
of machines, each offering local computation and storage. Rather than relying on
expensive, high-end hardware, Hadoop is designed to scale out using clusters of
commodity hardware, making it a very cost-effective solution for big data
problems.

Real-life example:

Imagine you have a massive library filled with countless books, each representing
a piece of data. Instead of relying on a single librarian to manage this vast
collection, Hadoop acts like a team of librarians, each responsible for a specific
section of the library. This team works together to organize, store, and retrieve
books (data) efficiently, making it easier to find the information you need.

Hadoop
distributes the data across multiple computers, allowing for parallel
processing. Think of it as having multiple librarians working simultaneously on
different sections of the library, processing requests much faster than a single
librarian could.

Core Components of Hadoop:

Hadoop is a collection of several open-source software modules that work together to provide a distributed storage and processing platform. The core components of Hadoop are:

Hadoop Distributed File System (HDFS): HDFS is a distributed file system that stores data across multiple nodes in a cluster. It is designed to be highly fault-tolerant and can handle the failure of any node without losing data.

MapReduce: MapReduce is a programming model for processing large data sets in parallel. It breaks down a computation into smaller tasks that can be run on multiple nodes simultaneously.

YARN: YARN is a resource-management platform that schedules jobs and allocates cluster resources. It also provides a framework for running distributed applications other than MapReduce.

Uses of Hadoop:

Log analysis: Hadoop can be used to analyze large volumes of log data to identify
patterns and trends.

Web analytics: Hadoop can be used to analyze web traffic data to understand
user behavior and improve website performance.

Fraud detection: Hadoop can be used to analyze financial transactions to identify fraudulent activity.

Scientific research: Hadoop can be used to process large volumes of scientific data to make new discoveries.

Advantages of Hadoop:
Scalability: Hadoop is a highly scalable platform that can be easily expanded to
handle large amounts of data. This is because Hadoop is designed to distribute
data and processing across multiple nodes, or computers, in a cluster. As the
amount of data grows, more nodes can be added to the cluster to handle the
additional processing load.
Cost-effectiveness: Hadoop is a cost-effective solution for big data problems
because it can run on clusters of commodity hardware. Commodity hardware is
simply the type of hardware that is widely available and relatively inexpensive.
This means that you can build a Hadoop cluster using off-the-shelf hardware,
rather than having to purchase expensive, high-end hardware.

Fault tolerance: Hadoop is a highly fault-tolerant platform that can handle the
failure of any node in the cluster without losing data. This is because Hadoop
replicates data across multiple nodes. If a node fails, the data on that node can
be recovered from another node in the cluster. This ensures that your data is
always available, even if there are hardware failures.

Flexibility: Hadoop can be used to process a wide variety of data types, including
structured, semi-structured, and unstructured data. This makes Hadoop a very
versatile platform that can be used for a wide range of applications.

Open-source: Hadoop is open-source software maintained by the Apache Software Foundation. This means that the source code for Hadoop is freely available for anyone to use, modify, and distribute. This also makes Hadoop a very cost-effective solution, as you do not have to pay licensing fees.

Large community: Hadoop has a large and active community of users and
developers. This means that there is a wealth of resources available online, such
as documentation, tutorials, and forums. This makes it easy to get help when you
need it.

Support for multiple programming languages: Hadoop supports multiple programming languages, including Java, Python, and Scala. This makes it easy to develop applications using the programming language that you are most comfortable with.

Maturity: Hadoop is a mature platform that has been around for many years. This
means that it is a stable and reliable platform that you can trust to handle your
data processing needs.
RDBMS vs. Hadoop:
An RDBMS stores structured data with a fixed schema, typically scales up on a single powerful server, and suits transactional workloads on relatively modest data volumes. Hadoop handles structured, semi-structured, and unstructured data, scales out across clusters of commodity machines, and suits batch analytics on very large datasets.

Hadoop Architecture:
Hadoop's processing layer is based on the MapReduce programming model, which was introduced by Google. Today, many big-brand companies such as Facebook, Yahoo, Netflix, and eBay use Hadoop in their organizations to deal with big data.

The Hadoop architecture mainly consists of 4 components:

 MapReduce
 HDFS(Hadoop Distributed File System)
 YARN(Yet Another Resource Negotiator)
 Common Utilities or Hadoop Common

1. MapReduce:
MapReduce is a programming model rather than a single algorithm or data structure. Its major feature is that it performs distributed processing in parallel across a Hadoop cluster, which is what makes Hadoop so fast; when you are dealing with big data, serial processing is no longer practical. MapReduce has two main tasks, which are divided phase-wise: in the first phase the Map function is applied, and in the next phase the Reduce function is applied.

Here, the input is provided to the Map() function, its output is used as the input to the Reduce() function, and after that we receive our final output.
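
To make the two phases concrete, here is a minimal word-count sketch written against Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce). It is illustrative only: the class names are arbitrary, and a complete job would also need a small driver that configures and submits the Job with input and output paths.

Java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: for every line of input, emit (word, 1) for each word it contains.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: sum all the 1s emitted for each word to get its total count.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}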

2. HDFS:
HDFS is the storage layer of Hadoop. It is designed to run on commodity hardware (inexpensive devices) and follows a distributed file system design. HDFS favors storing data in a small number of large blocks rather than in many small blocks, and it provides fault tolerance and high availability to the storage layer and the other devices present in the Hadoop cluster.
Data storage nodes in HDFS:
 NameNode (Master)
 DataNode (Slave)
NameNode: The NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves). The NameNode mainly stores the metadata, i.e. the data about the data: the file-system namespace (file names, permissions, and block locations) and the edit/transaction logs that keep track of activity in the Hadoop cluster.
DataNode: DataNodes work as slaves and are mainly used for storing the actual data in a Hadoop cluster; the number of DataNodes can range from one to 500 or even more. The more DataNodes there are, the more data the cluster can store, so it is advisable that DataNodes have high storage capacity to hold a large number of file blocks.

File Block in HDFS: Data in HDFS is always stored in terms of blocks. A single file is divided into blocks of 128 MB each by default, and this block size can be changed manually through configuration.
Replication in HDFS: Replication ensures the availability of the data. Replication means making copies of something, and the number of copies made of a particular thing is its replication factor. As we have seen with file blocks, HDFS stores data in the form of various blocks, and at the same time Hadoop is configured to make copies of those file blocks.

By default, the replication factor for Hadoop is set to 3, and it can be changed manually as per your requirements. For example, if a file is split into 4 blocks and each block has 3 replicas (copies), a total of 4 × 3 = 12 block copies are stored for backup purposes.

This is because Hadoop runs on commodity hardware (inexpensive system hardware) that can crash at any time; we are not using supercomputers for our Hadoop setup. That is why HDFS needs a feature that makes copies of file blocks for backup purposes. This is known as fault tolerance.
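
As a concrete illustration of blocks and replication, the sketch below uses Hadoop's Java FileSystem API to read a file's block size and replication factor and then raise the replication factor for that single file. The NameNode URI and file path are placeholders; adjust them for your own cluster.

Java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockAndReplicaInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; use your cluster's fs.defaultFS value.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);
        Path file = new Path("/user/demo/input.txt");

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size (bytes): " + status.getBlockSize());   // 128 MB by default
        System.out.println("Replication factor: " + status.getReplication()); // 3 by default

        // Raise the replication factor for this one file to 4 copies per block.
        fs.setReplication(file, (short) 4);
    }
}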

Rack Awareness: A rack is simply a physical collection of nodes in our Hadoop cluster (perhaps 30 to 40). A large Hadoop cluster consists of many racks. With the help of this rack information, the NameNode chooses the closest DataNode to achieve maximum performance.
3. YARN (Yet Another Resource Negotiator):
YARN is the framework on which MapReduce runs. YARN performs two operations: job scheduling and resource management. The purpose of the job scheduler is to divide a big task into small jobs so that each job can be assigned to various slaves in the Hadoop cluster and processing can be maximized. The job scheduler also keeps track of which jobs are important, which jobs have higher priority, dependencies between jobs, and other information such as job timing. The resource manager is used to manage all the resources that are made available for running the Hadoop cluster.
Features of YARN
 Multi-Tenancy
 Scalability
 Cluster-Utilization
 Compatibility
4. Hadoop Common (Common Utilities):
Hadoop Common (Common Utilities) is the set of Java libraries and files that all the other components in a Hadoop cluster need. Hadoop Common assumes that hardware failure in a Hadoop cluster is common, so such failures need to be handled automatically, in software, by the Hadoop framework.

Goals of HDFS:
 Store and manage large datasets: HDFS is designed to store and manage
large datasets, typically ranging from gigabytes to petabytes in size.
 Provide high-throughput data access: HDFS is designed to provide
high-throughput data access, which means that it can transfer large
amounts of data quickly.
 Be highly available: HDFS is designed to be highly available, meaning that it
is always accessible to users.
 Be scalable: HDFS is designed to be scalable, meaning that it can be easily
expanded to accommodate growing data volumes.
 Be cost-effective: HDFS is designed to be cost-effective, meaning that it can
be deployed and maintained on commodity hardware.
 Fault tolerance: HDFS is highly fault-tolerant, meaning that it can continue
to operate even if some of its nodes fail. This is achieved by replicating data
across multiple nodes in the cluster.
 Data streaming: HDFS is designed for streaming data access, which means
that data can be read and written to HDFS without having to load the entire
file into memory at once. This makes it well-suited for processing large
datasets.
 Commodity hardware: HDFS is designed to run on commodity hardware,
which makes it inexpensive to deploy and maintain.

Anatomy of File Read in HDFS:

Step 1: The client opens the file it wishes to read by calling open() on the DistributedFileSystem.
Step 2: The DistributedFileSystem (DFS) calls the NameNode, using remote procedure calls (RPCs), to determine the locations of the first few blocks in the file. For each block, the NameNode returns the addresses of the DataNodes that have a copy of that block. The DFS returns an FSDataInputStream to the client for it to read data from.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the DataNode addresses for the first few blocks in the file, connects to the first (closest) DataNode for the first block in the file.
Step 4: Data is streamed from the DataNode back to the client, which calls read() repeatedly on the stream.
Step 5: When the end of a block is reached, DFSInputStream closes the connection to that DataNode and then finds the best DataNode for the next block.
Step 6: When the client has finished reading the file, it calls close() on the FSDataInputStream.
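
Seen from application code, the same sequence looks like the sketch below, which uses Hadoop's Java FileSystem API: open() triggers the NameNode lookup for block locations (Steps 1 and 2), and the copy loop streams the bytes back from the DataNodes (Steps 3 to 5). The NameNode URI and file path are placeholders.

Java
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);

        // open() returns an FSDataInputStream backed by DFSInputStream.
        try (InputStream in = fs.open(new Path("/user/demo/input.txt"))) {
            // Each read pulls data from the closest DataNode holding the current block.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } // close() releases the connection to the DataNode (Step 6).
    }
}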

Anatomy of File Write in HDFS:

Note: HDFS follows the write-once, read-many-times model. In HDFS we cannot edit files that are already stored, but we can append data by reopening them.

Step 1: The client creates the file by calling create() on the DistributedFileSystem (DFS).
Step 2: The DFS makes an RPC call to the NameNode to create a new file in the file system's namespace, with no blocks associated with it. The NameNode performs various checks to make sure the file doesn't already exist and that the client has the right permissions to create it. If these checks pass, the NameNode makes a record of the new file; otherwise, the file can't be created and the client is thrown an error, i.e. an IOException. The DFS returns an FSDataOutputStream for the client to start writing data to.
Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the NameNode to allocate new blocks by picking a list of suitable DataNodes to store the replicas. The DataStreamer streams the packets to the first DataNode in the pipeline, which stores each packet and forwards it to the second DataNode in the pipeline.
Step 4: Similarly, the second DataNode stores the packet and forwards it to the third (and last) DataNode in the pipeline.
Step 5: The DFSOutputStream maintains an internal queue of packets that are waiting to be acknowledged by DataNodes, called the "ack queue".
Step 6: When the client has finished writing data, it calls close() on the stream. This flushes all the remaining packets to the DataNode pipeline and waits for acknowledgments before contacting the NameNode to signal that the file is complete.
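
A minimal write from application code looks like the following sketch, again with a placeholder URI and path: create() corresponds to Steps 1 and 2, the write() call feeds the data queue and DataNode pipeline (Steps 3 to 5), and close() performs Step 6.

Java
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);

        // create() asks the NameNode to add the file to the namespace and
        // returns an FSDataOutputStream backed by DFSOutputStream.
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/output.txt"))) {
            // Written bytes are split into packets and pushed down the DataNode pipeline.
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        } // close() flushes remaining packets and waits for acknowledgments.
    }
}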

Replica placement strategy:

The replica placement strategy in HDFS determines how replicas of data blocks are distributed across the cluster. This strategy is crucial for ensuring data availability, fault tolerance, and performance in HDFS. The default replica placement strategy in HDFS is as follows (a sketch that prints where the replicas of a file ended up follows this list):
1. Local Placement: The first replica of a block is placed on the DataNode where the writing client is running (or on a random DataNode if the client is not running inside the cluster). This is done to minimize write latency and network traffic.
2. Remote-rack Placement: The second replica of a block is placed on a DataNode in a different rack from the first replica. This protects the data against the failure of an entire rack.
3. Same-remote-rack Placement: The third replica of a block is placed on a different DataNode in the same rack as the second replica. This further improves data distribution and fault tolerance while limiting cross-rack traffic.
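
The sketch below prints, for each block of a file, the DataNodes that hold its replicas and their rack (network topology) paths, which is one way to observe the placement policy in action. The NameNode URI and file path are placeholders.

Java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicaLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);
        Path file = new Path("/user/demo/input.txt");

        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            // getHosts() lists the DataNodes holding a replica of this block;
            // getTopologyPaths() includes the rack each replica sits on.
            System.out.println("Hosts: " + String.join(", ", block.getHosts()));
            System.out.println("Racks: " + String.join(", ", block.getTopologyPaths()));
        }
    }
}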

Working with HDFS Commands:


--- ls: This command is used to list all the files. Use lsr for recursive approach. It
is useful when we want a hierarchy of a folder.
Syntax:
bin/hdfs dfs -ls <path>
Example:
bin/hdfs dfs -ls /
It will print all the directories present in HDFS. The bin directory contains the executables, so bin/hdfs means we want the hdfs executables, particularly the dfs (Distributed File System) commands.

--- mkdir: To create a directory. In Hadoop dfs there is no home directory by default, so let's first create it.
Syntax:
bin/hdfs dfs -mkdir <folder name>
Creating the home directory:
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/username -> write the username of your computer.

--- touchz: It creates an empty file.


Syntax:
bin/hdfs dfs -touchz <file_path>
Example:
bin/hdfs dfs -touchz /geeks/myfile.txt

--- copyFromLocal (or) put: To copy files/folders from local file system to hdfs
store. This is the most important command. Local filesystem means the files
present on the OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>

--- cat: To print file contents.


Syntax:
bin/hdfs dfs -cat <path>

--- copyToLocal (or) get: To copy files/folders from the hdfs store to the local file system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>

--- cp: This command is used to copy files within hdfs.


Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
--- mv: This command is used to move files within hdfs.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>

--- rmr: This command deletes a file or directory from HDFS recursively. It is a very useful command when you want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>

--- du: It will give the size of each file in the directory.
Syntax:
bin/hdfs dfs -du <dirName>

--- dus: This command will give the total size of the directory/file.
Syntax:
bin/hdfs dfs -dus <dirName>

Hadoop File System Interfaces:

The Hadoop Distributed File System (HDFS) provides various interfaces for interacting with its data. These interfaces allow applications to access, manage, and process data stored in HDFS.
1. Java API
The primary interface for accessing HDFS is the Java API. It provides a
comprehensive set of classes and methods for performing various operations.
2. Native C API
HDFS also provides a native C API, which allows C programs to interact with HDFS.
The C API is similar to the Java API, but it offers lower-level access to HDFS
operations. The C API is particularly useful for applications that require high
performance or direct memory access to HDFS data.
3. WebHDFS
WebHDFS is a RESTful API that allows applications to interact with HDFS using HTTP requests. WebHDFS is a convenient way to access HDFS from non-Java environments, such as web applications or scripts. It is also useful for applications that require a lightweight interface to HDFS; a short sketch of a WebHDFS request follows this list.
4. FS Shell
FS Shell is a command-line interface for interacting with HDFS. It provides a
simple and intuitive way to perform basic file operations, such as creating,
reading, writing, and deleting files and directories. FS Shell is a useful tool for
administrators and users who are familiar with command-line interfaces.
5. Third-party APIs
In addition to the official APIs, there are also a number of third-party APIs that
provide access to HDFS. These APIs often offer additional features or functionality
that are not available in the official APIs.
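
As an illustration of the WebHDFS interface (item 3 above), the following sketch issues a plain HTTP GET for a directory listing (op=LISTSTATUS) using only the JDK. The host name, port (9870 is the usual NameNode HTTP port on Hadoop 3.x; older releases commonly use 50070), and path are assumptions to adapt to your cluster, and a secured cluster would additionally require authentication parameters.

Java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsListing {
    public static void main(String[] args) throws Exception {
        // Placeholder host, port, and path.
        URL url = new URL("http://namenode:9870/webhdfs/v1/user/demo?op=LISTSTATUS");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        // The NameNode answers with a JSON document describing the directory contents.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
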
Real-life example:
Imagine you have a large library with many books stored in different rooms. To
manage and access these books, you have different ways to interact with the
library:

Librarian: The librarian is like the Java API. They have a deep understanding of the
library's organization and can perform various tasks, such as finding specific
books, checking them out, and adding new books.
Shelf Map: The shelf map is like the native C API. It provides a detailed layout of
the library, allowing you to locate specific books directly.
Website: The library's website is like WebHDFS. It allows you to access and
manage your books remotely using a web browser.
Card Catalog: The card catalog is like FS Shell. It's a simple way to find books by
title or author.
Special Requests Desk: The special requests desk is like third-party APIs. They can
provide additional services or access to rare books that may not be available
through regular channels.

Hadoop 1.0 V/S Hadoop 2.0

Hadoop 1
Master-slave architecture: A single master node controls all the slave nodes in
the cluster. This can create a bottleneck if the master node becomes overloaded.
MapReduce: Hadoop 1 uses MapReduce for both resource management and data
processing. This can make it difficult to scale and manage complex jobs.
Limited scalability: Hadoop 1 scales to only about 4,000 nodes per cluster.

Hadoop 2
Resource Manager (RM) and Node Manager (NM): These two YARN services manage the resources in the cluster: the ResourceManager arbitrates cluster resources and schedules applications, while a NodeManager runs on each worker node and manages its containers.
Yet Another Resource Negotiator (YARN): YARN allows multiple processing
frameworks, such as Spark and MapReduce, to run on the same cluster.

Improved scalability: Hadoop 2 can scale to tens of thousands of nodes per cluster.

In simpler terms, Hadoop 1 is like a small town with a single mayor, while
Hadoop 2 is like a large city with a mayor and a city council.

Here is a summary of the key differences between Hadoop 1 and Hadoop 2:

Aspect              | Hadoop 1                      | Hadoop 2
Resource management | Handled by MapReduce itself   | Handled by YARN (ResourceManager/NodeManager)
Processing models   | MapReduce only                | MapReduce, Spark, and other frameworks
Scalability         | About 4,000 nodes per cluster | Tens of thousands of nodes per cluster

Hadoop Ecosystem:

The Hadoop ecosystem is a collection of open-source software tools and frameworks that provide a platform for storing, processing, and analyzing large datasets. It is a popular choice for big data projects because it is scalable, cost-effective, and fault-tolerant.
The core components of the Hadoop ecosystem are:
Hadoop Distributed File System (HDFS): HDFS is a distributed file system that
stores data across a cluster of commodity hardware. It is designed to handle
large datasets and to tolerate hardware failures.
Yet Another Resource Negotiator (YARN): YARN is a resource management
framework that coordinates the allocation of resources across the cluster. It
allows multiple processing frameworks, such as Spark and MapReduce, to run on
the same cluster.
MapReduce: MapReduce is a programming model for processing large datasets
in parallel. It divides a large job into smaller tasks that can be executed on
multiple nodes in the cluster.
In addition to these core components, there are many other tools and
frameworks in the Hadoop ecosystem that provide additional functionality, such
as:
Hive: Hive provides a SQL-like interface for querying data stored in HDFS.
Pig: Pig provides a scripting language for processing data stored in HDFS.
HBase: HBase is a NoSQL database that is designed for storing and retrieving
large amounts of structured data.
ZooKeeper: ZooKeeper is a coordination service that helps to maintain the state
of a distributed system.

Real-life example:
Imagine you have a massive library filled with books, each representing a piece of
data. The Hadoop ecosystem is like a group of librarians and assistants who work
together to organize, manage, and analyze these books.
Hadoop Distributed File System (HDFS):
The HDFS librarians are responsible for storing the books in a way that makes
them easy to find. They divide the books into smaller sections and spread them
across different shelves, like distributing data across multiple computers.
Yet Another Resource Negotiator (YARN):
The YARN assistants coordinate the librarians' work, making sure they are all
working together efficiently. They assign tasks to different librarians, such as
retrieving specific books or helping readers find the information they need.
MapReduce:
The MapReduce librarians are experts at analyzing large amounts of information.
They break down complex questions into smaller, more manageable tasks and
work simultaneously to find the answers.
Hive:
The Hive librarians speak the language of business users, translating their
questions into instructions that the MapReduce librarians can understand. They
make it easier for people without programming experience to access and analyze
data.
Pig:
The Pig librarians provide a different way to analyze data, using a simpler
language that is easier to learn. They allow users to manipulate and transform
data without having to understand the underlying technology.
HBase:
The HBase librarians manage a specialized section of the library, one that stores
information in a way that makes it easy to access individual pieces of data quickly.
They are like the librarians in charge of the reference section.
ZooKeeper:
The ZooKeeper librarians keep track of everything happening in the library,
ensuring that all the librarians and assistants are working together smoothly. They
keep track of which books are where, who is working on what, and how to handle
any problems that arise.

Together, these librarians and assistants form a powerful team that can manage
and analyze even the largest libraries. The Hadoop ecosystem is like this team,
providing the tools and framework needed to store, process, and understand big
data.

Data Streaming -
Data streaming in Hadoop refers to the process of continuously ingesting and
processing data as it is generated, rather than batch processing it after it has
been stored. This allows for near real-time analysis and decision-making.

Hadoop Streaming is a tool that enables you to write MapReduce programs in any
programming language that can read from standard input (STDIN) and write to
standard output (STDOUT). This makes it possible to use Hadoop to process data
from a variety of sources, including databases, web services, and sensors.

There are two main approaches to data streaming in Hadoop:

Stream processing: This approach uses a dedicated stream processing framework, such as Apache Kafka or Apache Flink, to ingest and process data streams. These frameworks are designed to handle the high volume and velocity of data streams, and they can provide low-latency processing.

Micro-batching: This approach breaks a data stream down into small batches of data, which are then processed using a traditional batch processing framework, such as Hadoop MapReduce. This approach has higher latency than true stream processing, but it can be simpler and more efficient for certain types of workloads.

Data streaming is a powerful tool for analyzing real-time data. It can be used for a
variety of applications, such as fraud detection, anomaly detection, and real-time
recommendations.
Here are some of the benefits of data streaming in Hadoop:

Near real-time analysis: Data streaming allows you to analyze data as it is generated, rather than waiting for it to be stored. This can provide you with insights into your data much sooner, which can help you make better decisions.
Scalability: Data streaming frameworks are designed to handle the high volume and velocity of data streams. This means that you can use them to analyze data from a variety of sources, including sensors, social media, and web traffic.
Flexibility: Data streaming frameworks are designed to be flexible and extensible. This means that you can use them to build a variety of applications, from simple data pipelines to complex analytics applications.

Real-life example:
Imagine you're running a restaurant and want to track how many orders you
receive during the day. Instead of counting orders manually every hour, you could
use a data streaming tool to continuously count orders as they come in. This
would give you a real-time view of your order volume, allowing you to make
adjustments if needed.
Data streaming in Hadoop works similarly. Instead of processing large batches of
data all at once, it breaks the data down into smaller streams and processes them
as they arrive. This allows for faster analysis and decision-making.

Here's a simplified explanation of the process:

 Data arrives from various sources, such as sensors, web services, or social
media feeds.
 Data is divided into smaller streams.
 Each stream is processed by a separate "mapper" program.
 The mapper program breaks down the data into key-value pairs.
 The key-value pairs are shuffled and distributed to "reducer" programs.
 The reducer programs aggregate the key-value pairs and produce final
results.
 The final results are stored and can be analyzed or used for further
processing.

Data streaming in Hadoop is particularly useful for real-time applications where you need to react quickly to changes in the data. For example, it can be used for fraud detection, anomaly detection, or real-time recommendations.

Data Flow and Models -

Data flow refers to the path that data takes through a system, from ingestion to processing and analysis. In data streaming, the data flow is continuous, as data is ingested, processed, and analyzed as it arrives. This contrasts with batch processing, where data is stored and processed in large batches.

Real-life example:
Imagine you're making a sandwich. You start by gathering all the ingredients –
bread, cheese, lettuce, tomato, and mayonnaise. This is like data ingestion, where
you collect data from various sources.
Next, you assemble the sandwich by placing the ingredients on the bread in the
desired order. This is like data transformation, where you prepare the data for
processing.
Then, you take a bite of the sandwich and enjoy the combination of flavors. This is
like data analysis, where you extract insights from the processed data.
Finally, you wrap up any leftovers and put them away for later. This is like data storage, where you save the analyzed data for future reference or further use.
Data flow in data streaming is similar to making a sandwich. It involves a series of
steps that transform raw data into useful insights, just like turning simple
ingredients into a delicious meal.

The general data flow in data streaming in Hadoop involves the following steps:
Data Ingestion: Data is collected from various sources, such as sensors, web
services, or social media feeds.
Data Processing: The ingested data is transformed, cleaned, and enriched to
prepare it for analysis.
Data Analysis: Real-time insights are extracted from the processed data using
stream processing techniques.
Data Storage: The analyzed data is stored in a database or data lake for future
reference or further processing.
Data Models in Data Streaming
Data models define the structure and organization of data within a stream
processing system. They provide a framework for representing and manipulating
data in a way that is consistent and efficient.

Common data models used in data streaming include:


Tuple-based model: Data is represented as a collection of key-value pairs, where
the key identifies the data element and the value represents its actual content.
Event-based model: Data is represented as events, which are self-contained units
of information with timestamps and attributes.
Stream-based model: Data is represented as a continuous stream of data
elements, where each element may contain multiple attributes.

Real-life example:
Tuple-based model:
Imagine a list of grocery items, where each item is represented by a tuple of
(name, quantity, price). For example, ("apple", 2, 1.29). This model is well-suited
for simple data structures, where each data element has a clear name and value.
Event-based model:
Think of a social media feed, where each post is represented as an event with a
timestamp, sender, and content. For instance, { "timestamp":
"2023-11-13T16:42:21Z", "sender": "Bard", "content": "Explaining data models in
simpler terms." }. This model is flexible for complex data with timestamps and
multiple attributes.
Stream-based model:
Consider a temperature sensor continuously sending readings. The data is
represented as a stream of temperature values, each accompanied by a
timestamp. For example, [23.4, 23.5, 23.6, 23.7]. This model is efficient for
handling large volumes of continuously generated data.
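
The three models can also be written down as simple Java types that mirror the examples above. This is only a sketch using Java 16+ records; the type and field names are illustrative and not part of any Hadoop API.

Java
import java.time.Instant;
import java.util.List;

public class DataModelExamples {
    // Tuple-based model: a fixed set of named values, like ("apple", 2, 1.29).
    record GroceryItem(String name, int quantity, double price) {}

    // Event-based model: a self-contained unit with a timestamp and attributes.
    record Event(Instant timestamp, String sender, String content) {}

    // Stream-based model: a continuous sequence of readings, shown here as a list of samples.
    record TemperatureReading(Instant timestamp, double celsius) {}

    public static void main(String[] args) {
        GroceryItem item = new GroceryItem("apple", 2, 1.29);
        Event post = new Event(Instant.parse("2023-11-13T16:42:21Z"), "Bard",
                "Explaining data models in simpler terms.");
        List<TemperatureReading> stream = List.of(
                new TemperatureReading(Instant.now(), 23.4),
                new TemperatureReading(Instant.now(), 23.5));
        System.out.println(item + "\n" + post + "\n" + stream);
    }
}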

Apache Flume
Apache Flume is a distributed, reliable, and highly available service for efficiently
collecting, aggregating, and moving large amounts of streaming event data. It is a
flexible, scalable, and fault-tolerant platform that can handle a wide variety of
data sources and sinks.
Features of Apache Flume
Distributed and Scalable: Flume's architecture is designed for distributed
operation, allowing it to handle large volumes of data with ease. It can be easily
scaled up or down by adding or removing nodes.
Reliable and Fault-tolerant: Flume uses a variety of techniques to ensure data
reliability, including replication, checkpoints, and failover mechanisms. This
ensures that data is not lost even in the event of node failures.
Flexible and Extensible: Flume supports a wide variety of data sources and sinks,
including Kafka, HDFS, and Amazon S3. It also has a rich ecosystem of plugins that
extend its functionality.
Customizability: Flume's components are pluggable, allowing you to customize
data processing and routing.
Ease of Use: Flume is relatively easy to configure and manage, making it accessible to a broad range of users.
Architecture of Apache Flume:

Agents: Flume agents are the building blocks of the system. Each agent represents a data flow and consists of three main components (a sample agent configuration follows this list):
 Source: The source ingests data from external sources like Twitter feeds,
log files, or sensors.
 Channel: The channel temporarily buffers the ingested data before it is
consumed by sinks.
 Sink: The sink delivers the data to a destination system like HDFS, Kafka, or
Elasticsearch.
Channels: Channels act as message queues, providing temporary storage for data
in transit between sources and sinks. They ensure that data is not lost if a sink is
unavailable or overloaded.
Collectors: Collectors are optional components that aggregate data from multiple
agents and send it to a central sink. They are useful when multiple agents need to
deliver data to a single destination.
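
To tie the pieces together, here is a minimal single-agent configuration sketch in Flume's properties format, using stock components: a netcat source that listens on a TCP port, a memory channel, and a logger sink. The agent name a1 and the component names r1, c1, and k1 are arbitrary labels chosen for this example.

# a1 is the agent name; r1, c1, k1 name its source, channel, and sink.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: read lines of text arriving on a TCP port.
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory between the source and the sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: write events to Flume's log (useful for testing).
a1.sinks.k1.type = logger

# Wire the source and the sink to the channel.
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

A file like this is passed to the flume-ng agent command together with the agent name (a1) to start the agent.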

Installation of Hadoop
1. Download Hadoop: Download the latest stable release of Hadoop from the Apache Hadoop website.
2. Extract Hadoop: Extract the downloaded Hadoop archive to a directory of your choice. For example, you can extract it to /opt/hadoop.
3. Configure Hadoop: Edit the /opt/hadoop/etc/hadoop/core-site.xml file and set the fs.defaultFS property to file:///. This tells Hadoop to use the local file system for storage.
4. Set Hadoop environment variables: Edit your ~/.bashrc or ~/.zshrc file and add the following lines to set the Hadoop environment variables:
Bash
export HADOOP_HOME=/opt/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
5. Verify the installation: Start the Hadoop daemons and check that they are running:
Bash
$ $HADOOP_HOME/sbin/start-all.sh
$ jps
