Another Intro To Hadoop
Fridays@5
Context Optional
April 2, 2010
By Adeel Ahmad
About Me
Follow me on Twitter @_adeel
How to Run Hadoop
Download core Hadoop
Can do everything we mentioned
Still requires you to edit config files and
write your own scripts
How to Run Hadoop
Cloudera Inc. provides its own Hadoop distribution, plus
enterprise support and training
Core Hadoop plus patches
Bundled with command-line scripts, Hive, Pig
Publishes AMIs and scripts for EC2
Best option for your own cluster
How to Run Hadoop
Amazon Elastic MapReduce (EMR)
GUI or command-line cluster management
Supports Streaming, Hive, Pig
Grabs data and MapReduce code from S3 buckets and
puts it into HDFS
Automatically shuts down EC2 instances
Cloudera now has scripts for EMR
Easiest option
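Streaming lets any program that reads lines on stdin and writes tab-separated key/value lines on stdout act as a mapper or reducer. Below is a minimal word-count sketch in Python. The names `mapper`, `reducer`, and `run_locally` are illustrative, not part of Hadoop, and `run_locally` only imitates the sort that Hadoop performs between the two phases.

```python
import itertools


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a Streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Imitate map -> sort -> reduce on one machine (hypothetical helper)."""
    return list(reducer(sorted(mapper(lines))))
```

In a real job, each function would presumably live in its own script that loops over `sys.stdin` and prints its results; the cluster handles the sorting and the data movement.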
Pig
High-level scripting language developed by Yahoo
Describes multi-step jobs
Translated into MapReduce tasks
Grunt command-line interface
Ex: Find top 5 most visited pages by users aged 18 to 25
Users = LOAD 'users' AS (name, age);
Filtered = FILTER Users BY age >= 18 AND age <= 25;
Pages = LOAD 'pages' AS (user, url);
Joined = JOIN Filtered BY name, Pages BY user;
Grouped = GROUP Joined BY url;
Summed = FOREACH Grouped GENERATE group, COUNT(Joined) AS clicks;
Sorted = ORDER Summed BY clicks DESC;
-- keep only the five most-visited pages
Top5 = LIMIT Sorted 5;
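To make the steps concrete, here is a rough single-machine sketch in Python of what the script computes. The function name `top_pages` and its parameters are hypothetical, and it assumes user names are unique, so the join reduces to a membership check. Pig would run the same logic as distributed MapReduce tasks.

```python
from collections import Counter


def top_pages(users, pages, n=5, min_age=18, max_age=25):
    """users: iterable of (name, age); pages: iterable of (user, url)."""
    # FILTER: keep the names of users in the age range
    wanted = {name for name, age in users if min_age <= age <= max_age}
    # JOIN + GROUP + COUNT: count clicks per url for the filtered users
    clicks = Counter(url for user, url in pages if user in wanted)
    # ORDER ... DESC, then keep the first n (url, clicks) pairs
    return clicks.most_common(n)
```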
Hive
High-level interface created by Facebook
Gives db-like structure to data
HiveQL declarative language for querying
Queries get turned into MapReduce jobs
Command-line interface
Ex:
CREATE TABLE raw_daily_stats_table (dates STRING, ..., pageviews STRING);
LOAD DATA INPATH 'finaloutput' INTO TABLE raw_daily_stats_table;
SELECT … FROM … JOIN ...
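As an illustration of how a declarative query can become a MapReduce job, the sketch below evaluates a hypothetical aggregate (summing `pageviews` per `dates` value, which is not the elided query above) as explicit map, shuffle, and reduce phases. The function names are made up for this example and this is not how Hive's planner is implemented. `pageviews` is cast to `int` because the table stores it as STRING.

```python
from collections import defaultdict


def map_phase(rows):
    """Emit (dates, pageviews) pairs; pageviews arrives as a string."""
    for row in rows:
        yield row["dates"], int(row["pageviews"])


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}
```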
Mahout
Machine-learning libraries for Hadoop
– Collaborative filtering
– Clustering
– Frequent pattern mining
– Genetic algorithms
Applications
– Product/friend recommendation
– Classify content into defined groups
– Find associations, patterns, behaviors
– Identify important topics in conversations
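For a feel of what collaborative filtering does, here is a toy item co-occurrence recommender in plain Python. It does not use Mahout's API. The `recommend` function and the `baskets` structure are assumptions made for this sketch, and a real Mahout job would distribute this counting across the cluster.

```python
from collections import Counter
from itertools import permutations


def recommend(baskets, user, n=3):
    """baskets: dict mapping user -> set of items that user liked."""
    # Count how often each ordered pair of items appears together.
    cooccur = Counter()
    for items in baskets.values():
        cooccur.update(permutations(sorted(items), 2))
    # Score items the user has not seen by co-occurrence with their items.
    seen = baskets[user]
    scores = Counter()
    for (a, b), count in cooccur.items():
        if a in seen and b not in seen:
            scores[b] += count
    return [item for item, _ in scores.most_common(n)]
```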
More stuff
HBase – distributed database modeled on Google's Bigtable
Sqoop – tool for importing data from relational databases into Hadoop
ZooKeeper – coordination service that lets distributed
apps keep track of servers, using a filesystem-like namespace
Avro – data serialization system
Scribe – logging system developed by Facebook