Hadoop
What is Hadoop?
Hadoop is a software framework for the distributed processing of large datasets across large clusters of computers. "Large datasets" here means terabytes or petabytes of data.
Innovation
The underlying technology was invented by Google back in its early days so that it could usefully index all the textual and structural information it was collecting and then present meaningful results to its users. Hadoop is based on a simple data model: any data will fit.
Why Hadoop?
We need to process big data, and to do so we need to parallelize computation across thousands of nodes. Hadoop targets commodity hardware: a large number of low-end, cheap machines working in parallel to solve a computing problem, rather than a small number of high-end, expensive machines. It also provides fault tolerance and automatic recovery: nodes and tasks will fail, and the framework recovers from those failures automatically.
Users of Hadoop
Google: inventor of the MapReduce computing paradigm. Yahoo: index calculation for the Yahoo search engine. IBM, Microsoft, Oracle, Apple, HP, Twitter, Facebook, Amazon, AOL, Netflix, and many other companies, universities, and research labs.
Hadoop Architecture
The Hadoop framework consists of two main layers:
1. Hadoop Distributed File System (HDFS)
2. MapReduce execution engine
A small Hadoop cluster will include a single master and multiple slave nodes. The master node consists of a JobTracker, TaskTracker, NameNode, and DataNode. A slave or worker node acts as both a DataNode and a TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes. The JobTracker is the master process: it receives the user's jobs. Hadoop requires Java Runtime Environment (JRE) 1.6 or higher.
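From a client's point of view, this division of roles means a job only needs to know where the master's NameNode (for HDFS) and JobTracker (for MapReduce) live. The following is a minimal sketch using Hadoop 1.x property names; the host name "master", the ports 9000/9001, and the class name are assumed example values, not taken from the text.

import org.apache.hadoop.conf.Configuration;

// Minimal client-side configuration sketch (Hadoop 1.x property names).
public class ClusterConfig {
  public static Configuration create() {
    Configuration conf = new Configuration();
    // HDFS master: the NameNode that tracks where file blocks are stored.
    conf.set("fs.default.name", "hdfs://master:9000");   // assumed example host/port
    // MapReduce master: the JobTracker that receives the user's jobs
    // and hands tasks to the TaskTrackers on the slave nodes.
    conf.set("mapred.job.tracker", "master:9001");        // assumed example host/port
    return conf;
  }
}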
HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework. HDFS keeps multiple copies of data in different locations. The goal of HDFS is to reduce the impact of a power failure or switch failure, so that even if one occurs, the data remains available.
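Applications use HDFS through an ordinary file-system API, and the block placement and copying happen behind that interface. Below is a minimal sketch using Hadoop's standard FileSystem API; the path /user/demo/hello.txt and the class name are assumed examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: write a small file into HDFS through the FileSystem API.
// The cluster address comes from the Configuration (see the earlier sketch).
public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt");   // assumed example path
    try (FSDataOutputStream out = fs.create(file)) {
      // HDFS splits the written data into blocks and stores copies of each
      // block on different machines.
      out.writeUTF("Hello HDFS");
    }
    fs.close();
  }
}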
Properties of HDFS
Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system's data. Replication: each data block is replicated several times (the default is 3). Fault tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
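The replication factor can also be inspected or changed per file through the same FileSystem API. The sketch below reuses the example file from the previous sketch; the class name and the factor of 5 are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: read and change the replication factor of one HDFS file.
// The cluster-wide default (3) is controlled by the dfs.replication property.
public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/hello.txt");   // example file from the previous sketch

    FileStatus status = fs.getFileStatus(file);
    System.out.println("current replication: " + status.getReplication());

    // Ask HDFS to keep 5 copies of this file's blocks instead of the default 3.
    fs.setReplication(file, (short) 5);
  }
}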
Hadoop is a framework that provides both distributed storage and computational capabilities, and it is extremely scalable. HDFS uses a large block size, which works best when manipulating large datasets, and it maintains multiple replicas of each file, making it fault tolerant. For computation, Hadoop uses MapReduce, a batch-based, distributed computing framework.
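A MapReduce job is written as a map function and a reduce function; the framework splits the input, schedules tasks on the TaskTrackers, and re-runs tasks that fail. The classic word-count job looks roughly like this in the standard Hadoop Java API (input and output paths are taken from the command line).

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Word count: the map phase emits (word, 1) pairs, the reduce phase sums
// the counts for each word.
public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);        // emit (word, 1)
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();                // add up all counts for this word
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Using the reducer as a combiner lets each node pre-sum its own (word, 1) pairs before they are shuffled across the network, which is a common optimization for batch jobs like this.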
Limitations of Hadoop
Security: Hadoop does not offer storage-level or network-level encryption. It is inefficient at handling large numbers of small files. The single-master model can result in a single point of failure.