Unit_IV_Hadoop
Map Reduce
Flume
Sqoop
Distributed
Scalable
Fault-tolerant
Open source
Hadoop’s History
Based on work done by Google in the early 2000s
The Google File System (2003)
MapReduce (2004)
Fault tolerant
Supports efficient processing with MapReduce
How files are stored (1)
How files are stored (2)
How files are stored (3)
How files are stored (4)
Getting Data in and out of HDFS
Example: Storing and Retrieving Files
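The commands on the original example slides did not survive extraction, so the following is only a minimal sketch of how storing and retrieving a file usually looks with the standard `hadoop fs` shell (`-put`, `-ls`, `-get`). The file names are hypothetical, and the `/user/dave` directory is borrowed from the Pig example later in this unit. The code only builds and prints the commands, so it runs without a Hadoop installation.

```python
def hdfs_put(local_path, hdfs_path):
    """Command that copies a local file into HDFS."""
    return ["hadoop", "fs", "-put", local_path, hdfs_path]


def hdfs_ls(hdfs_path):
    """Command that lists the contents of an HDFS directory."""
    return ["hadoop", "fs", "-ls", hdfs_path]


def hdfs_get(hdfs_path, local_path):
    """Command that copies a file from HDFS back to the local disk."""
    return ["hadoop", "fs", "-get", hdfs_path, local_path]


# Hypothetical file names, chosen only for illustration.
commands = [
    hdfs_put("purchases.txt", "/user/dave/purchases.txt"),
    hdfs_ls("/user/dave"),
    hdfs_get("/user/dave/purchases.txt", "purchases_copy.txt"),
]

for cmd in commands:
    print(" ".join(cmd))
```

On a machine with a configured Hadoop client, each list could be passed to `subprocess.run(cmd, check=True)` to actually execute it.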
Example : Storing and Retrieving
of Files
Example : Storing and Retrieving
of Files
Example : Storing and Retrieving
of Files
Example : Storing and Retrieving
of Files
Example : Storing and Retrieving
of Files
Example : Storing and Retrieving
of Files
Example : Storing and Retrieving
of Files
Example : Storing and Retrieving
of Files
Example : Storing and Retrieving
of Files
Important points about HDFS
HDFS performs best with a modest number of large files
Millions, rather than billions, of files
Each file typically 100 MB or more
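A rough way to see why a modest number of large files is preferred: the NameNode keeps metadata for every file and every block in memory, so the same volume of data costs far more metadata when split into small files. The sketch below assumes a 128 MB block size (a common default, but configurable and not stated in these slides) and compares two hypothetical layouts of the same 1 TB of data.

```python
import math

BLOCK_SIZE_MB = 128  # assumption: a common HDFS default, configurable per cluster


def blocks_needed(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of HDFS blocks one file occupies (at least one)."""
    return max(1, math.ceil(file_size_mb / block_size_mb))


def metadata_objects(num_files, file_size_mb):
    """Files plus blocks that the NameNode has to track."""
    return num_files + num_files * blocks_needed(file_size_mb)


# The same 1 TB (1,048,576 MB) stored two ways:
few_large = metadata_objects(num_files=1024, file_size_mb=1024)    # 1 GB files
many_small = metadata_objects(num_files=1_048_576, file_size_mb=1)  # 1 MB files

print("1 GB files:", few_large, "objects")
print("1 MB files:", many_small, "objects")
print("ratio:", many_small / few_large)
```

The exact memory cost per object depends on the Hadoop version, so the sketch counts objects rather than bytes.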
MapReduce
Basic Terminology
Job: a "full program", that is, an execution of a Mapper and Reducer across a data set
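To make the idea of a job concrete, here is a minimal single-process sketch of a word-count job in plain Python. It is not the Hadoop API: `mapper`, `reducer`, and `run_job` are hypothetical names, and the grouping step stands in for the shuffle and sort that Hadoop performs between the map and reduce phases across many machines.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reduce phase: combine all values emitted for one key."""
    return word, sum(counts)


def run_job(lines):
    """Run the Mapper and Reducer across the whole data set."""
    # Shuffle: group every mapped value under its key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Reduce each key's values, visiting keys in sorted order.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


result = run_job(["the cat sat", "the dog sat"])
print(result)  # {'cat': 1, 'dog': 1, 'sat': 2, 'the': 2}
```

In a real cluster the input lines would be split across many map tasks, and each reducer would receive only a subset of the keys.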
MapReduce job submitted by a client computer to the JobTracker
Postgres
Netezza
Teradata
Example:
SELECT * FROM purchases WHERE price > 10000 ORDER BY storeid
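To see what this query returns without a cluster, the sketch below runs the same statement against an in-memory SQLite table. SQLite only stands in for the query engine here; the column list follows the Pig example in this unit, and the three sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases ("
    "itemid INTEGER, price REAL, storeid INTEGER, purchaserid INTEGER)"
)

# Invented sample rows: (itemid, price, storeid, purchaserid)
rows = [
    (1, 15000.0, 20, 7),
    (2, 500.0, 10, 8),
    (3, 12000.0, 10, 9),
]
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?, ?)", rows)

# The query from the example, unchanged.
result = conn.execute(
    "SELECT * FROM purchases WHERE price > 10000 ORDER BY storeid"
).fetchall()

for row in result:
    print(row)
conn.close()
```

Item 2 is dropped by the price filter, and the remaining rows come back ordered by store (item 3 from store 10, then item 1 from store 20).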
PIG
What is Pig?
Pig Latin: a dataflow language for transforming large data sets
Example:
purchases = LOAD '/user/dave/purchases' AS (itemID, price, storeID, purchaserID);
bigticket = FILTER purchases BY price > 10000;
…..
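The dataflow style of the LOAD and FILTER steps can be mimicked in plain Python, where each statement names an intermediate relation that feeds the next one. This is only an analogy, not how Pig executes: `load` and `filter_by` are hypothetical helpers, and the tab-separated sample data is invented.

```python
FIELDS = ("itemID", "price", "storeID", "purchaserID")

# Invented tab-separated sample standing in for /user/dave/purchases.
SAMPLE = "1\t15000\t20\t7\n2\t500\t10\t8\n3\t12000\t10\t9\n"


def load(text):
    """LOAD ... AS (...): turn each input line into a record with named fields."""
    for line in text.splitlines():
        values = (int(v) for v in line.split("\t"))
        yield dict(zip(FIELDS, values))


def filter_by(relation, predicate):
    """FILTER ... BY ...: keep only the records that satisfy the predicate."""
    return (record for record in relation if predicate(record))


# Each step names a relation; nothing is computed until results are requested.
purchases = load(SAMPLE)
bigticket = filter_by(purchases, lambda record: record["price"] > 10000)

bigticket_rows = list(bigticket)  # forces evaluation of the pipeline
print([record["itemID"] for record in bigticket_rows])  # [1, 3]
```

The generators are lazy in a way that loosely resembles Pig, which builds up a plan from the statements and runs it only when output is requested.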
Hive Vs Pig
Benefits of Hadoop
Handles less structured or unstructured data
Significantly lower time and cost
Can retain data indefinitely for much lower cost than traditional solutions
Conclusion
Hadoop has become a valuable business intelligence tool and will become an increasingly important part of a BI infrastructure