Hadoop PPT
MODULE 1. Introduction
MODULE 2.
MODULE 3.
MODULE 4.
MODULE 5. Introduction to Pig
MODULE 6. Advanced
MODULE 7. Advanced Hive
MODULE 8.
MODULE 9. Advanced
MODULE 10. Project set-up Discussion
Big data is a popular term used to describe the exponential growth and availability of data, both structured and
unstructured
More data may lead to more accurate analyses. More accurate analyses may lead to more confident decision
making. And better decisions can mean greater operational efficiencies, cost reductions and reduced risk.
The hopeful vision is that organizations will be able to take data from any source, harness relevant data and
analyze it to find answers that enable 1) cost reductions, 2) time reductions, 3) new product development and
optimized offerings, and 4) smarter business decision making.
Big data analytics is often associated with cloud computing because the analysis of large data sets in real-time
requires a platform like Hadoop to store large data sets across a distributed cluster and Map Reduce to
coordinate, combine and process data from multiple sources.
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a
distributed computing environment.
Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of
terabytes. Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to
continue operating uninterrupted in case of a node failure.
This approach lowers the risk of system failure, even if a significant number of nodes become inoperative.
Companies that need to process large and varied data sets frequently look to Apache Hadoop as a potential
tool, because it offers the ability to process, store and manage huge amounts of both structured and
unstructured data.
The open source Hadoop framework is built on top of a distributed file system and a cluster architecture that
enable it to transfer data rapidly and continue operating even if one or more compute nodes fail.
But Hadoop isn't a cure-all system for big data application needs as a whole. And while big-name Internet
companies like Yahoo, Facebook, Twitter, eBay and Google are prominent users of the technology, Hadoop
projects are new undertakings for many other types of organizations.
WHAT IS BIG DATA
Volume : big in volume
Velocity : it is moving data (data in motion)
Variety : unstructured data, web logs, audio, video, image, structured data
Big Data is a collection of large datasets that cannot be processed using traditional computing techniques. It is not a
single technique or a tool; rather, it involves many areas of business and technology.
The data in it will be of three types: structured, semi-structured and unstructured.
Using information from social media, such as the preferences and product perception of their consumers, product
companies and retail organizations are planning their production.
Using data on the previous medical history of patients, hospitals are providing better and quicker
service.
Big data technologies are important in providing more accurate analysis, which may lead to more concrete
decision-making resulting in greater operational efficiencies, cost reductions, and reduced risks for the business.
There are various technologies in the market from different vendors including Amazon, IBM, Microsoft, etc., to
handle big data.
1. Storage
2. Capture
3. Sharing
4. Visualization
5. Curation
Storage : Some vendors are using increased memory and powerful parallel processing to crunch large
volumes of data extremely quickly. Another method is putting data in-memory but using a grid computing
approach, where many machines are used to solve a problem. Both approaches allow organizations to explore
huge data volumes and gain business insights in near-real time.
Capture : Even if you can capture and analyze data quickly and put it in the proper context for the audience
that will consume the information, the data loses its value for decision-making if it is not accurate or timely.
This is a challenge with any data analysis, but with the volumes of information involved in big data projects it
becomes even more pronounced.
Sharing :
REAL TIME PROCESSING
Real time data processing involves a continual input, process and output of data
Examples
Complex event processing (CEP) platform, which combines data from multiple sources to detect patterns and
attempts to identify either opportunities or threats
Operational intelligence (OI) platform, which uses real time data processing and CEP to gain insight into
operations by running query analysis against live feeds and event data
OI is near real time analytics over operational data and provides visibility across many data sources. The goal
is to obtain near real time insights using continuous analytics so that the organization can take immediate
action
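As a rough illustration of that continual input, process and output loop, here is a minimal Java sketch (not a real CEP or OI engine): the event feed, the threshold of 30.0 and the alert output are all made up for illustration.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal sketch of continual input -> process -> output (not a real CEP engine).
// The event source, threshold and alert sink are hypothetical illustrations.
public class SimpleEventMonitor {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Double> events = new LinkedBlockingQueue<>();

        // Input: a producer thread stands in for a live feed.
        new Thread(() -> {
            double[] feed = {21.5, 22.0, 35.2, 23.1, 36.7};
            for (double reading : feed) {
                events.offer(reading);
            }
            events.offer(Double.NaN); // sentinel to stop the loop
        }).start();

        // Process + output: continuously evaluate each event as it arrives.
        while (true) {
            double reading = events.take();
            if (Double.isNaN(reading)) break;
            if (reading > 30.0) { // pattern of interest: threshold breach
                System.out.println("ALERT: reading " + reading + " exceeds threshold");
            }
        }
    }
}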
BATCH PROCESSING
Batch jobs can be stored up during working hours and then executed during the evening or whenever the
computer is idle.
Batch processing is an efficient and preferred way of processing high volumes of data
Data processing programs are run over a group of transactions collected over a business-agreed time period
Data is collected, entered and processed, and then the batch results are produced for every batch window. Batch
processing requires separate programs for input, process and output
Examples
An example of batch processing is the way that credit card companies process billing. The customer does not
receive a bill for each separate credit card purchase but one monthly bill for all of that month's purchases. The
bill is created through batch processing, where all of the data is collected and held until the bill is processed
as a batch at the end of the billing cycle.
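A minimal Java sketch of that billing batch, assuming purchases are held as simple "customer,amount" records during the batch window; the record format and values are made up for illustration.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a batch job: all purchases collected over the billing cycle are
// processed together and one bill per customer is produced.
public class MonthlyBillingBatch {
    public static void main(String[] args) {
        // Input: purchases accumulated during the batch window ("customer,amount").
        List<String> purchases = List.of(
                "alice,19.99", "bob,5.50", "alice,42.00", "bob,12.25");

        // Process: aggregate all purchases per customer in one pass.
        Map<String, Double> bills = new HashMap<>();
        for (String record : purchases) {
            String[] parts = record.split(",");
            bills.merge(parts[0], Double.parseDouble(parts[1]), Double::sum);
        }

        // Output: one monthly bill per customer.
        bills.forEach((customer, total) ->
                System.out.printf("Bill for %s: %.2f%n", customer, total));
    }
}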
[Diagram: data sources (social data, historic data, service data) feed an Extract, Transform, Load pipeline whose output supports Forecasting and BI.]
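A minimal extract, transform, load sketch in Java under the same idea; the source records, the transformation and the print-out sink are all stand-ins chosen for illustration.

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Minimal extract -> transform -> load sketch; sources and the sink are stand-ins.
public class TinyEtl {
    public static void main(String[] args) {
        // Extract: pull raw records from several sources (hard-coded here).
        List<String> socialData = List.of("user1 liked productA");
        List<String> serviceData = List.of("ticket 7 closed");
        List<String> raw = new ArrayList<>();
        raw.addAll(socialData);
        raw.addAll(serviceData);

        // Transform: normalise the records (here, just upper-case them).
        List<String> transformed = raw.stream()
                .map(String::toUpperCase)
                .collect(Collectors.toList());

        // Load: write into the target store (printing stands in for the BI/forecasting target).
        transformed.forEach(System.out::println);
    }
}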
Hadoop and Big Data have become synonymous, but they are two different things. Hadoop is a parallel
programming model implemented on a bunch of low-cost clustered processors, and it is intended to
support data-intensive distributed applications. That's what Hadoop is all about.
Due to the advent of new technologies, devices, and communication means like social networking sites, the
amount of data produced by mankind is growing rapidly every year
An enterprise will have a computer to store and process big data. For storage purpose, the programmers will
take the help of their choice of database vendors such as Oracle, IBM, etc. In this approach, the user interacts
with the application, which in turn handles the part of data storage and analysis.
This approach works fine with applications that process less voluminous data that can be accommodated
by standard database servers, or up to the limit of the processor that is processing the data. But when it comes to
dealing with huge amounts of scalable data, pushing it all through a single database server becomes a
bottleneck
Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small
parts, assigns them to many computers, and collects the results from them, which, when integrated, form the
result dataset. Using the solution provided by Google, Doug Cutting and his team developed an Open Source
Project called HADOOP.
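A minimal Java sketch of that divide-and-collect idea, using threads on one machine to stand in for the many computers; the word-count task and the input lines are made up for illustration (a full Hadoop MapReduce job appears at the end of this deck).

import java.util.*;
import java.util.concurrent.*;

// Divide-and-collect sketch: split the input into parts, process each part
// in parallel (threads stand in for machines), then merge the partial results.
public class DivideAndCollect {
    public static void main(String[] args) throws Exception {
        List<String> lines = List.of("big data big", "hadoop stores big data");

        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<Future<Map<String, Integer>>> partials = new ArrayList<>();

        // Divide: each line becomes a small counting task assigned to a worker.
        for (String line : lines) {
            partials.add(pool.submit(() -> {
                Map<String, Integer> counts = new HashMap<>();
                for (String word : line.split(" ")) {
                    counts.merge(word, 1, Integer::sum);
                }
                return counts;
            }));
        }

        // Collect: integrate the partial results into the final dataset.
        Map<String, Integer> result = new HashMap<>();
        for (Future<Map<String, Integer>> partial : partials) {
            partial.get().forEach((word, count) -> result.merge(word, count, Integer::sum));
        }
        pool.shutdown();
        System.out.println(result); // e.g. {big=3, data=2, hadoop=1, stores=1}, order may vary
    }
}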
WHAT IS HADOOP
Hadoop is an Apache open source framework written in java that allows distributed processing of large
datasets across clusters of computers using simple programming models.
The Hadoop framework application works in an environment that provides distributed storage and
computation across clusters of computers.
Hadoop is designed to scale up from single server to thousands of machines, each offering local computation
and storage.
As mentioned above, Hadoop provides a place to store data and is known for its distributed file system
Concepts that come under Hadoop include HDFS, MapReduce, Pig, Hive, Sqoop, Flume and HBase
HDFS
MapReduce
Characteristics : Scalable, Flexible, Economical, Accessible
Differentiating Factors : Scalable, Simple
Hadoop is a system for large scale data processing. It has two main components:
HDFS (clustered storage)
MapReduce
The most common file system used by Hadoop is the Hadoop Distributed File System (HDFS).
It is designed to run on large clusters (thousands of computers) of small computer machines in a reliable,
fault-tolerant manner.
HDFS uses a master/slave architecture where master consists of a single Name Node that manages the file
system metadata and one or more slave Data Nodes that store the actual data.
A file in an HDFS namespace is split into several blocks and those blocks are stored in a set of Data Nodes.
The Name Node determines the mapping of blocks to the Data Nodes.
The Data Nodes take care of read and write operations within the file system. They also take care of block
creation, deletion and replication based on instructions given by the Name Node.
HDFS provides a shell like any other file system, and a list of commands is available to interact with the file
system.
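The shell commands themselves are things like hdfs dfs -ls, hdfs dfs -put and hdfs dfs -cat. The same file system can also be reached programmatically; below is a minimal sketch using Hadoop's Java FileSystem API, where the cluster address and the /user/demo path are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Listing a directory through the HDFS Java API, roughly what "hdfs dfs -ls" does.
// The fs.defaultFS address and the path are placeholders for a real cluster.
public class ListHdfsDirectory {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder address

        try (FileSystem fs = FileSystem.get(conf)) {
            for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
        }
    }
}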
HDFS COMPONENTS
1. Name Node
On the storage side it acts as the master of the system; HDFS has only one Name Node
It maintains, manages and administers the data blocks present on the Data Nodes
The Name Node determines the mapping of blocks to the Data Nodes
With a larger block size, seek time becomes small relative to streaming time, so R/W operations are faster
Because a failure of the single Name Node makes the file system unavailable, a backup of the Name Node is a must
2. Secondary Name Node
The Secondary Name Node gets data from the Name Node every hour
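The block size mentioned above is a tunable trade-off; here is a minimal sketch of setting it from the client side through the standard dfs.blocksize property (the 256 MB value is only an example).

import org.apache.hadoop.conf.Configuration;

// The HDFS block size is a configurable trade-off: larger blocks mean fewer
// seeks relative to streaming time. dfs.blocksize is the standard property;
// the 256 MB value here is only an example.
public class BlockSizeConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024); // applies to files created by this client
        System.out.println("dfs.blocksize = " + conf.getLong("dfs.blocksize", 0));
    }
}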
HDFS ARCHITECTURE
[Diagram: clients send read and write requests; the Name Node handles metadata and block operations, while Data Nodes on Rack 1 and Rack 2 store the blocks and replicate them across racks.]
RACK AWARENESS
[Diagram: block replicas of File 1, File 2 and File 3 are spread across Rack 1, Rack 2 and Rack 3, so the copies of a file do not all sit on a single rack.]
[Diagram: anatomy of an HDFS write: the client creates the file through the Distributed File System (2. create), the Name Node shows the block locations (3), the data is written to a pipeline of Data Nodes (4. write data), and ack packets flow back (5. ack packet).]
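From the client's point of view, that write path is what Hadoop's FileSystem API triggers; a minimal sketch, with the cluster address and file path as placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

// Client-side view of the write path: create() goes to the Name Node for block
// placement, write() streams bytes to the Data Node pipeline, close() finishes.
// Cluster address and path are placeholders.
public class WriteHdfsFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"))) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        } // closing the stream completes the write with the Name Node
    }
}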
[Diagram: anatomy of an HDFS read: the Name Node supplies the block locations, the FS Data Input Stream reads each block from the Data Nodes (3. read), and the client finishes reading and closes the stream (4. read, 5. close).]
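And the matching client-side read, again a sketch with placeholder address and path.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Client-side view of the read path: open() asks the Name Node for block
// locations, then the FSDataInputStream reads directly from the Data Nodes.
// Cluster address and path are placeholders.
public class ReadHdfsFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder

        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/user/demo/hello.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}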
[Diagram: MapReduce data flow: persistent input data is read as input key/value pairs by several Map tasks, the transient intermediate data is shuffled to the Reduce tasks, and the output key/value pairs are written back as persistent data.]
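That key/value flow is exactly what a Hadoop MapReduce job implements. Below is the classic word-count sketch against the org.apache.hadoop.mapreduce API; the input and output paths are taken from the command line, and the job would be packaged into a jar and submitted to the cluster.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic word count: the map phase emits (word, 1) pairs, the transient
// intermediate data is shuffled by key, and the reduce phase sums the counts.
public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE); // intermediate key/value pair
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum)); // output key/value pair
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}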