Hadoop HDFS
• Hadoop
• Hadoop Distributed File System (HDFS)
• MapReduce
• VirtualBox
• Google App Engine
• Programming Environment for Google App Engine
• OpenStack
• Federation in the Cloud
• Four Levels of Federation
Hadoop
• With the evolution of the internet and related technologies, high computational power, large-volume data storage, and fast data processing have become basic needs for most organizations, and these needs have grown significantly over time.
• Organizations today produce huge amounts of data at a fast rate.
• There is a need to acquire, analyze, process, handle, and store such huge amounts of data, which is called big data.
Volume: Volume relates to the size of big data. The amount of data is growing day by day and is already very large. According to IBM, in the year 2000, 8 lakh (800,000) petabytes of data were stored in the world.
Variety: Variety relates to the different formats of big data. Nowadays, most of the data stored by organizations has no proper structure and is called unstructured data.
Velocity: Velocity relates to the speed of data generation, which is very fast. It is the rate at which data is captured, generated, and shared. The challenge is to react to the massive information generated within the time required by the application.
Veracity: Veracity refers to the uncertainty of data. The data stored in a database is sometimes inaccurate or inconsistent, which leads to poor data quality. Such inconsistent data requires a lot of effort to process.
Hadoop
Apache Hadoop is an open-source software project that enables distributed processing of large data sets across clusters of commodity servers using simple programming models.
It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.
It is a software framework for running applications on clusters of commodity hardware, offering massive storage and enormous processing power and supporting a virtually limitless number of concurrent tasks or jobs.
Hadoop Ecosystem Components
The Hadoop core is divided into two fundamental components: HDFS and the MapReduce engine.
HDFS
HDFS is the Hadoop Distributed File System, which splits data into blocks and stores them across distributed servers for processing. It keeps several copies of each data block across the cluster so that a copy can be used when a failure occurs.
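As a concrete illustration, the sketch below writes a local file into HDFS through the Java FileSystem API. The paths and class name are placeholders; the cluster settings are assumed to come from the standard core-site.xml/hdfs-site.xml files on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsCopyExample {
        public static void main(String[] args) throws Exception {
            // Load cluster settings from the configuration files on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Copy a local file into HDFS; HDFS splits it into blocks and
            // replicates the blocks across DataNodes automatically.
            fs.copyFromLocalFile(new Path("/tmp/input.txt"),
                                 new Path("/user/demo/input.txt"));

            // Confirm the file now exists in the distributed file system.
            System.out.println("Exists: " + fs.exists(new Path("/user/demo/input.txt")));
            fs.close();
        }
    }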
MapReduce
MapReduce is a programming model for processing big data. It comprises two programs, typically written in Java: the mapper and the reducer. The mapper extracts data from HDFS and organizes it into key-value maps, while the reducer aggregates the results generated by the mappers.
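The classic word-count job illustrates both roles. This is a minimal sketch against the standard Hadoop MapReduce Java API; the class names are illustrative, and the driver class that configures and submits the job is omitted.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: reads lines from HDFS and emits a (word, 1) pair per word.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: aggregates the per-word counts produced by the mappers.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }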
ZooKeeper
ZooKeeper is a centralized service for maintaining configuration information and providing distributed synchronization and coordination.
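For illustration, the sketch below uses the ZooKeeper Java client to store a piece of shared configuration in a znode and read it back. The connection string, znode path, and value are placeholders, and the znode is assumed not to exist yet.

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkConfigExample {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            // Connect to a ZooKeeper ensemble (address is a placeholder).
            ZooKeeper zk = new ZooKeeper("localhost:2181", 5000,
                    event -> connected.countDown());
            connected.await();

            // Store a piece of shared configuration in a persistent znode.
            zk.create("/app-config", "db.host=10.0.0.5".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Any coordinated process in the cluster can now read the same value.
            byte[] data = zk.getData("/app-config", false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }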
Hadoop Ecosystem Components
HBase
HBase is a column-oriented database used as a NoSQL solution for big data.
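A minimal sketch of the HBase Java client API follows; the table name, column family, and values are hypothetical, and the table is assumed to already exist with an "info" column family.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {

                // Write one cell: row key "user1", column family "info", qualifier "city".
                Put put = new Put(Bytes.toBytes("user1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"),
                              Bytes.toBytes("Pune"));
                table.put(put);

                // Read the cell back by row key.
                Result result = table.get(new Get(Bytes.toBytes("user1")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"))));
            }
        }
    }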
Pig
Pig is a platform for analyzing large data sets using a high-level language. It uses a dataflow language (Pig Latin) and provides a parallel execution framework.
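Pig Latin statements can also be run from Java through the PigServer class. The sketch below is a rough word-count dataflow in local mode; the input file, alias names, and output directory are placeholders.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigExample {
        public static void main(String[] args) throws Exception {
            // Run Pig Latin statements from Java in local mode
            // (ExecType.MAPREDUCE would run the same dataflow on a Hadoop cluster).
            PigServer pig = new PigServer(ExecType.LOCAL);

            // Dataflow: load lines, group by word, and count each group;
            // each step can be executed in parallel by the framework.
            pig.registerQuery("words = LOAD 'input.txt' AS (word:chararray);");
            pig.registerQuery("grouped = GROUP words BY word;");
            pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
            pig.store("counts", "word_counts");
        }
    }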
Flume
Flume provides a distributed and reliable service for efficiently collecting, aggregating, and moving large amounts of log data.
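One way for an application to hand events to a running Flume agent is the Flume Java RPC client, sketched below. It assumes an agent with an Avro source is listening on the given host and port, both of which are placeholders.

    import java.nio.charset.StandardCharsets;
    import org.apache.flume.Event;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class FlumeSendExample {
        public static void main(String[] args) throws Exception {
            // Connect to a Flume agent's Avro source (host and port are placeholders).
            RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
            try {
                // Wrap a log line as a Flume event and hand it to the agent,
                // which aggregates and forwards it (e.g., into HDFS).
                Event event = EventBuilder.withBody("app started",
                                                    StandardCharsets.UTF_8);
                client.append(event);
            } finally {
                client.close();
            }
        }
    }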
Sqoop
Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured data stores such as relational databases.
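Sqoop is usually invoked from the command line, but the same import can be launched from Java via Sqoop.runTool, as in the sketch below; the JDBC URL, credentials, table name, and target directory are all placeholders.

    import org.apache.sqoop.Sqoop;

    public class SqoopImportExample {
        public static void main(String[] args) {
            // Arguments mirror the usual CLI invocation:
            // sqoop import --connect ... --table ... --target-dir ...
            String[] sqoopArgs = {
                "import",
                "--connect", "jdbc:mysql://dbhost:3306/sales",  // placeholder JDBC URL
                "--username", "reporter",                       // placeholder credentials
                "--password", "secret",
                "--table", "orders",                            // source table to import
                "--target-dir", "/user/demo/orders",            // HDFS output directory
                "--num-mappers", "4"                            // parallel import tasks
            };
            // Runs the import as parallel map tasks; returns 0 on success.
            int exitCode = Sqoop.runTool(sqoopArgs);
            System.exit(exitCode);
        }
    }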