Unit 3 - Hadoop

Apache Hadoop is an open-source framework for processing, storing, and analyzing massive amounts of distributed, unstructured data. It uses the Hadoop Distributed File System to store data across clusters of commodity servers and MapReduce to distribute analytical jobs in parallel for scalability and fault tolerance.


Big Data Technology - Hadoop

Hadoop
• There are many Big Data technologies; Apache Hadoop is one of them.
• Hadoop is an open-source framework for processing, storing, and
analyzing massive amounts of distributed, unstructured data.
• Hadoop was inspired by MapReduce, a programming model developed by
Google in the early 2000s for indexing the Web.
• The original creators of Hadoop are Doug Cutting and Mike Cafarella.
Hadoop
• Hadoop is an open-source project administered by the Apache
Software Foundation.
• Unlike traditional, structured platforms, Hadoop is able to store
any kind of data in its native format and to perform a wide
variety of analyses and transformations on that data.
• Hadoop stores terabytes, and even petabytes, of data
inexpensively.
• It is robust and reliable and handles hardware and system failures
automatically, without losing data or interrupting data analyses.
• Hadoop runs on clusters of commodity servers and each of those
servers has local CPUs and disk storage that can be leveraged by
the system.
Importance
1. Gives organizations the flexibility to ask questions across
their structured and unstructured data that were previously
impossible to ask or solve.
2. The scalability and elasticity of free, open-source Hadoop
running on standard hardware allow organizations to hold onto
more data than ever before, at a lower cost than proprietary
solutions, and thereby to take advantage of all their data to
increase operational efficiency and gain a competitive edge.
• At one-tenth the cost of traditional solutions, Hadoop excels at
supporting complex analyses, including detailed, special-purpose
computation, across large collections of data.
Tasks Handled
Hadoop handles a variety of workloads, including
• Search
• Recommendation systems
• Data warehousing, and
• Video/image analysis.

Today's explosion of data types and volumes means that Big Data
equals big opportunity, and Apache Hadoop empowers organizations to
work on the most modern scale-out architectures, using a data
framework with a clean-sheet design, without vendor lock-in.
Components of Hadoop
1. The Hadoop Distributed File System (HDFS).
• It is the storage system for a Hadoop cluster.
• When data lands in the cluster, HDFS breaks it into pieces and
distributes those pieces among the different servers participating
in the cluster.
• Each server stores just a small fragment of the complete data set,
and each piece of data is replicated on more than one server
(see the sketch after this list).
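For illustration only, here is a minimal Java sketch of a client writing a file into HDFS through Hadoop's FileSystem API; the NameNode address, file path, and replication factor are assumptions made for this example, not details from these slides.

```java
// Minimal sketch: writing a file into HDFS with Hadoop's FileSystem API.
// The cluster URI, file path, and replication factor are illustrative assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/events/sample.txt");

        // HDFS splits the file into blocks and distributes them across DataNodes;
        // each block is replicated on more than one server.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("example record\n");
        }

        // Request a specific replication factor for this file (illustrative value).
        fs.setReplication(file, (short) 3);
        fs.close();
    }
}
```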
Components of Hadoop
2. MapReduce.
• Because Hadoop stores the entire dataset in small pieces across a
collection of servers, analytical jobs can be distributed, in
parallel, to each of the servers storing part of the data.
• Each server evaluates the question against its local fragment
simultaneously and reports its results back for collation into a
comprehensive answer.
• MapReduce is the agent that distributes the work and collects the
results (see the word-count sketch after this list).
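To make the distribute-and-collate pattern concrete, below is a compact word-count sketch against the standard Hadoop MapReduce Java API; the class names and the input/output paths are illustrative assumptions, not part of the original slides.

```java
// Compact word-count sketch using the Hadoop MapReduce Java API.
// Mappers process their local fragments of the data; reducers collate the results.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs in parallel on each server's local fragment of the data.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit (word, 1)
                }
            }
        }
    }

    // Reduce phase: collates the partial counts into a comprehensive answer.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output HDFS paths are illustrative.
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Using the reducer as a combiner is a common optimization for commutative, associative aggregations such as summing counts; it reduces the volume of intermediate data shuffled between servers.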
Working of Components
• HDFS continually monitors the data stored on the cluster. If a
server becomes unavailable, a disk drive fails, or data is damaged,
whether due to hardware or software problems, HDFS
automatically restores the data from one of the known good
replicas stored elsewhere on the cluster.
• MapReduce monitors progress of each of the servers participating
in the job. If one of them is slow in returning an answer or fails
before completing its work, MapReduce automatically starts
another instance of that task on another server that has a copy of
the data.
• Because of the way that HDFS and MapReduce work, Hadoop
provides scalable, reliable, and fault-tolerant services for data
storage and analysis at very low cost.
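As a rough sketch of the configuration behind this behaviour (not something the slides prescribe), the snippet below sets the HDFS replication factor and enables speculative execution through Hadoop's Configuration API; the values shown are assumptions for the example, with 3 replicas being the usual HDFS default.

```java
// Illustrative sketch: fault-tolerance settings via Hadoop's Configuration API.
// Values are assumptions; real deployments usually set these in
// hdfs-site.xml and mapred-site.xml.
import org.apache.hadoop.conf.Configuration;

public class FaultToleranceConfig {
    public static Configuration build() {
        Configuration conf = new Configuration();

        // Number of copies HDFS keeps of each block; lost replicas are
        // re-created automatically from the remaining good copies.
        conf.setInt("dfs.replication", 3);

        // Speculative execution: launch a duplicate attempt of a slow
        // ("straggler") task on another node that holds a copy of the data.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);

        return conf;
    }
}
```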
Hadoop – Some Limitations
• Hadoop's components are immature and still developing.
• Deploying and managing Hadoop clusters, and performing advanced
analytics on large volumes of unstructured data, require significant
expertise, skill, and training. Unfortunately, there is currently a
dearth of Hadoop developers and data scientists available.
• Hadoop is a batch-oriented framework, meaning it does not support
real-time data processing and analysis.
