Hadoop

Hadoop is a framework for distributed processing of large datasets across clusters of computers. It uses MapReduce as a programming model where users define map and reduce functions. The MapReduce framework automatically parallelizes the job and manages task execution and hardware failures. The Hadoop Distributed File System (HDFS) stores very large files reliably and provides high throughput access to application data. Major companies use Hadoop to analyze petabytes of data.

Hadoop MapReduce

Felipe Meneses Besson

IME-USP, Brazil

July 7, 2010

Agenda

- What is Hadoop?
- Hadoop Subprojects
- MapReduce
- HDFS
- Development and tools

What is Hadoop?
A framework for large-scale data processing (Tom White, 2009):

- A project of the Apache Software Foundation
- Mostly written in Java
- Inspired by Google's MapReduce and GFS (Google File System)

A brief history

- 2004: Google published a paper introducing MapReduce which, together with GFS, offered an alternative way to handle the volume of data to be processed
- 2005: Doug Cutting integrated MapReduce into what became Hadoop
- 2006: Doug Cutting joined Yahoo!
- 2008: Cloudera was founded
- 2009: A Hadoop cluster sorted 100 terabytes in 173 minutes (on 3,400 nodes) [2]

Today, Cloudera is an active contributor to the Hadoop project and provides Hadoop consulting and commercial products [1].
[1] Cloudera: http://www.cloudera.com
[2] Sort Benchmark: http://sortbenchmark.org/

Hadoop Characteristics

- A scalable and reliable system for shared storage and analysis
- It automatically handles data replication and node failure
- It does the hard work, so developers can focus on the data-processing logic
- It enables applications to work on petabytes of data in parallel

Who's using Hadoop

Source: Hadoop wiki, September 2009

Hadoop Subprojects
Apache Hadoop is a collection of related subprojects that fall under the umbrella of infrastructure for distributed computing.

All projects are hosted by the Apache Software Foundation.

MapReduce
MapReduce is a programming model and an associated implementation for processing and generating large data sets (Jeffrey Dean and Sanjay Ghemawat, 2004)

- Based on a functional programming model
- A batch data processing system
- A clean abstraction for programmers
- Automatic parallelization and distribution
- Fault tolerance

MapReduce
Programming model
Users implement an interface of two functions:
    map (in_key, in_value) -> (out_key, intermediate_value) list
    reduce (out_key, intermediate_value list) -> out_value list

MapReduce
Map Function
Input:

Records from some data source (e.g., lines of files, rows of a database) are presented as (key, value) pairs
Example: (filename, content)

Output:

Zero or more intermediate values in (key, value) format
Example: (word, number_of_occurrences)

MapReduce
Map Function
map (in_key, in_value) -> (out_key, intermediate_value) list

Source: (Cloudera, 2010)

MapReduce
Map Function

Example:
    map (k, v): if (isPrime(v)) then emit (k, v)

    (foo, 7) -> (foo, 7)
    (test, 10) -> (nothing)
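The filter example above can be sketched in Python. This is only an illustration, not Hadoop's API: the names `is_prime` and `prime_filter_map` are hypothetical, and the map function is modeled as a generator so that it can emit zero or more pairs.

```python
from typing import Iterator, Tuple


def is_prime(n: int) -> bool:
    """Trial-division primality check; adequate for small illustrative inputs."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def prime_filter_map(key: str, value: int) -> Iterator[Tuple[str, int]]:
    """Emit (key, value) only when value is prime; otherwise emit nothing."""
    if is_prime(value):
        yield (key, value)


print(list(prime_filter_map("foo", 7)))    # [('foo', 7)]
print(list(prime_filter_map("test", 10)))  # []
```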

MapReduce
Reduce function
After the map phase is over, all the intermediate values for a given output key are combined into a list.
Input:

Intermediate values
Example: (A, [42, 100, 312])

Output:

Usually only one final value per key
Example: (A, 454)

MapReduce
Reduce Function
reduce (out_key, intermediate_value list) -> out_value list

Source: (Cloudera, 2010)

MapReduce
Reduce Function
Example:
    reduce (k, vals):
        sum = 0
        foreach int v in vals:
            sum += v
        emit (k, sum)

    (A, [42, 100, 312]) -> (A, 454)
    (B, [12, 6, -2]) -> (B, 16)
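A minimal Python sketch of the same idea, assuming the intermediate pairs arrive as a flat list. The helper names `group_by_key` and `sum_reduce` are hypothetical; in Hadoop the framework itself does the grouping between the map and reduce phases.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def group_by_key(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Collect all intermediate values that share an output key into one list."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def sum_reduce(key: str, values: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """Emit a single (key, sum) pair for the list of values of one key."""
    yield (key, sum(values))


intermediate = [("A", 42), ("B", 12), ("A", 100), ("B", 6), ("A", 312), ("B", -2)]
grouped = group_by_key(intermediate)
results = [out for key in sorted(grouped) for out in sum_reduce(key, grouped[key])]
print(results)  # [('A', 454), ('B', 16)]
```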

MapReduce
Terminology
Job: unit of work that the client wants to be performed

Input data + MapReduce program + configuration information

Task: part of the job

map and reduce tasks

Jobtracker: node that coordinates all the jobs in the system by scheduling tasks to run on tasktrackers

MapReduce
Terminology
Tasktracker: node that runs tasks and sends progress reports to the jobtracker

Split: fixed-size piece of the input data

MapReduce
DataFlow

Source: (Cloudera, 2010)

MapReduce
Real Example

map (String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

MapReduce
Real Example

reduce (String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
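The word-count pseudocode can be run end to end with a small single-process Python sketch. It is an illustration only: `run_mapreduce` is a hypothetical stand-in for the framework, counts are plain integers rather than the strings used in the pseudocode, and the reducer returns the word together with its count.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Pair = Tuple[str, int]


def word_count_map(doc_name: str, contents: str) -> Iterator[Pair]:
    """Emit (word, 1) for every word in the document contents."""
    for word in contents.split():
        yield (word, 1)


def word_count_reduce(word: str, counts: Iterable[int]) -> Iterator[Pair]:
    """Sum all the counts emitted for one word."""
    yield (word, sum(counts))


def run_mapreduce(
    documents: Dict[str, str],
    mapper: Callable[[str, str], Iterator[Pair]],
    reducer: Callable[[str, Iterable[int]], Iterator[Pair]],
) -> Dict[str, int]:
    """Sequential stand-in for the framework: map, group by key, then reduce."""
    intermediate: Dict[str, List[int]] = defaultdict(list)
    for name, contents in documents.items():
        for key, value in mapper(name, contents):
            intermediate[key].append(value)
    output: Dict[str, int] = {}
    for key in sorted(intermediate):
        for out_key, out_value in reducer(key, intermediate[key]):
            output[out_key] = out_value
    return output


docs = {"a.txt": "the quick brown fox", "b.txt": "the lazy dog the end"}
counts = run_mapreduce(docs, word_count_map, word_count_reduce)
print(counts["the"])  # 3
```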

MapReduce
Combiner function

- Condenses (pre-aggregates) the intermediate values
- Runs locally on mapper nodes after the map phase
- Acts like a mini-reduce
- Saves bandwidth before data is sent to the reducers
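A rough Python sketch of what a combiner buys for word count, assuming one mapper's output is available as a list of pairs. The function names are hypothetical. Pre-summing is safe here because addition is associative and commutative, so the reducer gets the same totals from fewer pairs.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Pair = Tuple[str, int]


def map_words(text: str) -> List[Pair]:
    """Mapper output for one input split: one (word, 1) pair per word."""
    return [(word, 1) for word in text.split()]


def combine(pairs: Iterable[Pair]) -> List[Pair]:
    """Mini-reduce on the mapper node: sum the counts per word locally."""
    local: Counter = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())


mapper_output = map_words("to be or not to be")
combined = combine(mapper_output)
print(len(mapper_output), len(combined))  # 6 4
print(combined)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

In this toy input, six pairs shrink to four before anything crosses the network.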

MapReduce
Combiner Function

Applied on a mapper machine

Source: (Cloudera, 2010)

HDFS
Hadoop Distributed Filesystem

- Inspired by GFS
- Designed to work with very large files
- Runs on commodity hardware
- Streaming data access
- Replication and locality

HDFS
Nodes

- A namenode (the master)
  - Manages the filesystem namespace
  - Knows the location of every block
- Datanodes (workers)
  - Store blocks of data
  - Periodically report their lists of blocks back to the namenode

HDFS
Duplication
Input data copied into HDFS is split into blocks

Each data block is replicated to multiple machines
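The splitting and replication can be pictured with a short Python sketch. Everything specific in it is an assumption for illustration: the 64 MB block size, the replication factor of 3, the datanode names, and above all the round-robin placement, which is not HDFS's actual placement policy.

```python
from typing import Dict, List

BLOCK_SIZE = 64 * 1024 * 1024  # assumed block size in bytes (64 MB)
REPLICATION = 3                # assumed number of copies per block


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> List[int]:
    """Return the size of each block in bytes; the last block may be smaller."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(
    num_blocks: int, datanodes: List[str], replication: int = REPLICATION
) -> Dict[int, List[str]]:
    """Hypothetical round-robin placement of each block on distinct datanodes."""
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the requested replication")
    return {
        block_id: [datanodes[(block_id + r) % len(datanodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks))   # 4
print(placement[0])  # ['dn1', 'dn2', 'dn3']
```

With these assumed values, losing any single datanode still leaves at least two copies of every block.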

HDFS
MapReduce Data flow

Source: (Tom White, 2009)

Hadoop filesystems

Source: (Tom White, 2009)

Development and Tools

Hadoop operation modes

Hadoop supports three modes of operation:

- Standalone
- Pseudo-distributed
- Fully-distributed

More details:
http://oreilly.com/other-programming/excerpts/hadooptdg/installing-apache-hadoop.html

Development and Tools

Java example

Development and Tools

Guidelines to get started
The basic steps for running a Hadoop job are:

1. Compile your job into a JAR file
2. Copy the input data into HDFS
3. Execute hadoop, passing the JAR and the relevant arguments
4. Monitor tasks via the web interface (optional)
5. Examine the output when the job is complete

Development and Tools

API, tools and training

Do you want to use a scripting language?

http://wiki.apache.org/hadoop/HadoopStreaming
http://hadoop.apache.org/core/docs/current/streaming.html
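With Hadoop Streaming, the map and reduce steps are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output. A minimal word-count sketch in Python follows, written as functions over lines so that it can be tried locally. The function names are hypothetical, and the local `sorted` call only imitates the sorting the framework performs between the two steps.

```python
from itertools import groupby
from typing import Iterable, Iterator


def streaming_map(lines: Iterable[str]) -> Iterator[str]:
    """Mapper: for every word on every input line, emit 'word<TAB>1'."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def streaming_reduce(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Reducer: input lines are sorted by key, so equal words are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    # Local simulation of map -> sort -> reduce on two input lines.
    mapped = sorted(streaming_map(["hello hadoop", "hello streaming"]))
    for out_line in streaming_reduce(mapped):
        print(out_line)
```

In a real streaming job, each function would live in its own script that iterates over `sys.stdin` and prints its results.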

Eclipse plugin for MapReduce development

http://wiki.apache.org/hadoop/EclipsePlugIn

Hadoop training (videos, exercises)

http://www.cloudera.com/developers/learn-hadoop/training/

Bibliography
- Tom White (2009). Hadoop: The Definitive Guide. O'Reilly, San Francisco, 1st Edition.
- Jeffrey Dean and Sanjay Ghemawat (2004). MapReduce: Simplified Data Processing on Large Clusters (Google article). Available at: http://labs.google.com/papers/mapreduce-osdi04.pdf
- Tom Wheeler. Hadoop in 45 Minutes or Less: Large-Scale Data Processing for Everyone. Available at: http://www.tomwheeler.com/publications/2009/lambda_lounge_hadoop_200910/twheelerhadoop-20091001-handouts.pdf
- Cloudera. Videos and Training. http://www.cloudera.com/resources/?type=Training
