Getting Started With Hadoop Planning Guide
FEBRUARY 2013
A Mountain of Data
1 Kilobyte (KB) = 1,000 Bytes
1 Megabyte (MB) = 1,000,000 Bytes
1 Gigabyte (GB) = 1,000,000,000 Bytes
1 Terabyte (TB) = 1,000,000,000,000 Bytes
1 Petabyte (PB) = 1,000,000,000,000,000 Bytes
1 Exabyte (EB) = 1,000,000,000,000,000,000 Bytes
1 Zettabyte (ZB) = 1,000,000,000,000,000,000,000 Bytes
1 Yottabyte (YB) = 1,000,000,000,000,000,000,000,000 Bytes
Big data is measured in terabytes, petabytes, and even exabytes. Put it all in perspective
with this handy conversion chart.
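In code, these decimal prefixes are just successive powers of 1,000; a quick sketch in Python:

```python
# Decimal (SI) data-size units, matching the conversion chart above.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bytes(value, unit):
    """Convert a value in the given unit to bytes (1 KB = 1,000 bytes)."""
    return value * 1000 ** UNITS.index(unit)

print(to_bytes(1, "PB"))  # 1000000000000000
```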
Big data is:
- Working with data sets whose size and variety are beyond the ability of typical database software to capture, store, manage, and analyze.
- Processing a steady stream of real-time data in order to make time-sensitive decisions faster than ever before.
- Distributed in nature. Analytics processing goes to where the data is for greater speed and efficiency.
- A new paradigm in which IT collaborates with business users and data scientists to identify and implement analytics that will increase operational efficiency and solve new business problems.
- Moving decision making down in the organization and empowering people to make better, faster decisions in real time.
Big data is not:
- Only about volume. It's also about variety and velocity. But perhaps most important, it's about value derived from the data.
- Generated or used only by huge online companies like Google or Amazon anymore. While Internet companies may have pioneered the use of big data at web scale, its applications now touch every industry.
- About one-size-fits-all traditional relational databases built on shared disk and memory architecture. Big data analytics uses a grid of computing resources for massively parallel processing (MPP).
- Meant to replace relational databases or the data warehouse. Structured data continues to be critically important to companies. However, traditional systems may not be suitable for the new sources and contexts of big data.
For organizations to realize the full potential of big data, they must
find a new approach to capturing, storing, and analyzing data.
Traditional tools and infrastructure aren't as efficient working with
larger and more varied data sets coming in at high velocity.
The new shared nothing architecture can scale with the huge
volumes, variety, and speed requirements of big data by distributing
the work across dozens, hundreds, or even thousands of commodity
servers that process the data in parallel. This approach was first
implemented by large community research projects such as SETI@home.
[Figure: Shared nothing architecture emerges from three converging technology trends. Hardware architecture: single processor, to multicore computing, to massively parallel processing (MPP). Data architecture: distributed grid and complex-flexible/nonrelational stores. Application architecture: sequential, to multitasking/multithreaded, to parallel algorithms, to distributed frameworks, to parallel layers.]
Shared nothing architecture is possible because of the convergence of advances in hardware, data management, and analytic
applications technologies.
Source: Data rEvolution. CSC Leading Edge Forum (2011).
Processing is pushed out to the nodes where the data resides. This
is completely different from a traditional approach, which retrieves
data for processing at a central point.
Ultimately, the data must be reintegrated to deliver meaningful
results. Distributed processing software frameworks make the
computing grid work by managing and pushing the data across
machines, sending instructions to the networked servers to work in
parallel, collecting individual results, and then reassembling them for
the payoff.
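As an illustration of that flow, here is a minimal in-process sketch in Python of the classic word-count job. It mimics the map, shuffle, and reduce phases on local data; it is not the actual Hadoop API:

```python
from collections import defaultdict
from itertools import chain

# Map phase: each "node" turns its local chunk of text into (word, 1) pairs.
def map_chunk(chunk):
    return [(word, 1) for word in chunk.split()]

# Shuffle phase: pairs from all nodes are grouped by key.
def shuffle(mapped_pairs):
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped_pairs):
        groups[key].append(value)
    return groups

# Reduce phase: each key's values are combined into a final result.
def reduce_counts(groups):
    return {key: sum(values) for key, values in groups.items()}

chunks = ["big data big insight", "big value"]  # data spread across two "nodes"
result = reduce_counts(shuffle(map_chunk(c) for c in chunks))
print(result)  # {'big': 3, 'data': 1, 'insight': 1, 'value': 1}
```

In a real cluster, each map runs on the node holding its data block, and the framework handles the shuffle across the network before the reducers combine the partial results.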
[Figure: The Apache Hadoop* software stack. HDFS* and Hadoop MapReduce form the core, surrounded by Pig* (data flow), Hive* (data warehouse), HBase* (distributed table store), Mahout* (machine learning), Sqoop (relational database import/export), Flume*/Chukwa* (data collectors), Oozie (workflow), and ZooKeeper* (coordination).]
HDFS* (Hadoop Distributed File System)
The primary storage system; it uses multiple replicas of data blocks, distributes them on nodes throughout a cluster, and provides high-throughput access to application data.

Apache Hadoop MapReduce
A programming model and software framework for applications that perform distributed processing of large data sets on compute clusters.

Apache Hadoop Common
Utilities that support the Hadoop framework, including FileSystem (an abstract base class for a generic file system), remote-procedure call (RPC), and serialization libraries.

Apache Cassandra*
A scalable multi-master database with no single points of failure.

Apache Chukwa*
A data collection system for monitoring large distributed systems built on HDFS and MapReduce; includes a toolkit for displaying, monitoring, and analyzing results.

Apache HBase*
A scalable, distributed database that supports structured data storage for large tables; used for random, real-time read/write access to big data.

Apache Hive*
A data warehouse infrastructure that provides data summarization, ad hoc querying, and analysis of large data sets in Hadoop-compatible file systems.

Apache Mahout*
A scalable machine learning and data mining library with implementations of a wide range of algorithms, including clustering, classification, collaborative filtering, and frequent-pattern mining.

Apache Pig*
A high-level data-flow language and execution framework for expressing parallel data analytics.

Apache ZooKeeper*
A high-performance coordination service that maintains configuration information, naming, and distributed synchronization for distributed applications.
Hadoop* Adoption
As more and more enterprises recognize the value and advantages
associated with big data insights, adoption of Hadoop software is
growing. The Hadoop open-source technology stack includes an
open-source implementation of MapReduce, HDFS, and the Apache
HBase* distributed database that supports large, structured data tables.
After six years of refinements, Apache released the first full
production version of Apache Hadoop 1.0 software in January 2012.
Among the certified features supported in this version are HBase*,
Kerberos security enhancements, and a representational state
transfer (RESTful) API to access HDFS.7
Hadoop software can be downloaded from one of the Apache
download sites. Because Hadoop software is an open-source,
volunteer project, the Hadoop wiki provides information about
getting help from the community as well as links to tutorials and
user documentation for implementing, troubleshooting, and setting
up a cluster.
Commercial big data offerings include integrated solutions such as EMC* Greenplum*, HP* Big Data Solutions, IBM* InfoSphere*, Microsoft* Big Data Solution, and Oracle* Big Data Appliance, along with Hadoop distributions and cloud-based solutions.
See Big Data Vendor Spotlights for some of the Intel partners who offer big data solutions.
Note: The Hadoop ecosystem is emerging rapidly. This list is adapted from two sources: Dumbill, Edd. "Big Data Market Survey: Hadoop Solutions." O'Reilly Radar (January 19, 2012). http://radar.oreilly.com/2012/01/big-data-ecosystem.html; and Data rEvolution. CSC Leading Edge Forum (2011). http://assets1.csc.com/lef/downloads/LEF_2011Data_rEvolution.pdf
[Figure: Hadoop* cluster topology. The master node runs the NameNode and the JobTracker. Each slave node runs a TaskTracker and a DataNode and executes map and reduce tasks.]
Source: Intel Cloud Builders Guide to Cloud Design and Deployment on Intel Platforms: Apache Hadoop*. Intel (February 2012).
intelcloudbuilders.com/docs/Intel_Cloud_Builders_Hadoop.pdf
[Figure: Hadoop* job and data flow. A client issues metadata operations to the NameNode to get block information and performs data reads and writes directly against the DataNodes. The NameNode makes data assignments to the DataNodes, while the JobTracker on the master node assigns tasks to the TaskTrackers on the slave nodes, where map and reduce tasks run alongside each DataNode.]
Jobs are orchestrated by the master node and processed on the slave nodes.
Server, networking, and storage choices scale across three performance tiers: a good configuration uses Gigabit Ethernet (GbE) or 10 GbE networking with hard drives; a better configuration standardizes on 10 GbE networking; and the best configuration pairs 10 GbE networking with SSDs.
Benchmarking
Benchmarking is the quantitative foundation for measuring the
efficiency of any computer system. Intel developed the HiBench
suite as a comprehensive set of benchmark tests for Hadoop
environments.8 The individual measures represent important Hadoop
workloads with a mix of hardware usage characteristics. HiBench
includes microbenchmarks as well as real-world Hadoop applications
representative of a wider range of data analytics such as search
indexing and machine learning. HiBench 2.1 is now available as open
source under Apache License 2.0 at https://github.com/hibench/HiBench-2.1.
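In the spirit of the Sort microbenchmark, throughput measurement boils down to timing a workload and dividing work done by elapsed time. A single-machine sketch in plain Python (illustrative only, not HiBench itself):

```python
import random
import time

# Generate some random records, then measure how fast we can sort them.
records = [random.random() for _ in range(1_000_000)]

start = time.perf_counter()
records.sort()
elapsed = time.perf_counter() - start

throughput = len(records) / elapsed  # records sorted per second
print(f"sorted {len(records):,} records in {elapsed:.3f}s "
      f"({throughput:,.0f} records/s)")
```

HiBench applies the same idea at cluster scale, reporting job throughput and resource utilization across the distributed workload.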
Category: Microbenchmarks
- Sort: This workload sorts its binary input data, which is generated using the Apache Hadoop* RandomTextWriter example. Representative of real-world MapReduce jobs that transform data from one format to another.
- WordCount: This workload counts the occurrence of each word in the input data, which is generated using Hadoop* RandomTextWriter. Representative of real-world MapReduce jobs that extract a small amount of interesting data from a large data set.
- TeraSort: A standard benchmark for large-size data sorting; its input is generated by the TeraGen program.
- Enhanced DFSIO: Computes the aggregated bandwidth by sampling the number of bytes read or written at fixed time intervals in each map task.

Category: Web search
- Apache Nutch* Indexing: This workload tests the indexing subsystem in Nutch*, a popular Apache open-source search engine. The crawler subsystem in the Nutch engine is used to crawl an in-house Wikipedia* mirror and generates 8.4 GB of compressed data (about 2.4 million web pages) total as workload input. Large-scale indexing is one of the most significant uses of MapReduce (for example, in Google* and Facebook* platforms).
- Page Rank: A workload based on the well-known PageRank algorithm for ranking web pages by their link structure.

Category: Machine learning
- K-Means Clustering: K-Means is a well-known clustering algorithm. Typical application area of MapReduce for large-scale data mining and machine learning (for example, in Google and Facebook platforms).
- Bayesian Classification: This workload tests the naive Bayesian (a well-known classification algorithm for knowledge discovery and data mining) trainer in the Apache Mahout* open-source machine learning library. Typical application area of MapReduce for large-scale data mining and machine learning (for example, in Google and Facebook platforms).

Category: Analytical query
- Apache Hive* Join: This workload models complex analytic queries of structured (relational) tables by computing both the average and sum for each group by joining two different tables.
- Hive* Aggregation: This workload models complex analytic queries of structured (relational) tables by computing the sum of each group over a single read-only table.
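To make the K-Means Clustering workload concrete, here is a minimal single-machine sketch of the algorithm in Python. It is not the Mahout implementation, and the one-dimensional data and cluster count are illustrative:

```python
import random

def kmeans_1d(points, k, iterations=10, seed=0):
    """Minimal 1-D K-Means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two well-separated groups of points: centroids converge near 1.0 and 10.0.
data = [0.9, 1.0, 1.1, 9.9, 10.0, 10.1]
print(kmeans_1d(data, k=2))
```

Mahout distributes the same assign-and-recompute loop as MapReduce jobs, so each iteration scans the full data set in parallel across the cluster.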
Step 1: Work with your business users to articulate the big opportunities.
Identify and collaborate with business users (analysts, data scientists, marketing professionals, and so on) to find the best business
opportunities for big data analytics in your organization. For example, consider an existing business problem, especially one that is
difficult, expensive, or impossible to address with your current data sources and analytics systems. Or consider a problem that has
never been addressed before because the data sources are new and unstructured.
Prioritize your opportunity list and select a project with a discernible return on investment.
Determine the skills you need to successfully accomplish your initiative.
About Hadoop
Software
Intel Cloud Builders Guide to Cloud Design and Deployment on Intel Platforms:
Apache* Hadoop*
This reference architecture is for organizations that want to build their own cloud computing infrastructure,
including Apache Hadoop clusters to manage big data. It includes steps for setting up the deployment at
your data center lab environment and contains details on Hadoop topology, hardware, software, installation
and configuration, and testing. Implementing this reference architecture will help you get started building
and operating your own Hadoop infrastructure.
intelcloudbuilders.com/docs/Intel_Cloud_Builders_Hadoop.pdf
Additional
Resources
Endnotes
1. Gens, Frank. IDC Predictions 2012: Competing for 2020. IDC (December
2011). http://cdn.idc.com/research/Predictions12/Main/downloads/
IDCTOP10Predictions2012.pdf
2. Big Data Infographic and Gartner 2012 Top 10 Strategic Tech
Trends. Business Analytics 3.0 (blog) (November 11, 2011).
http://practicalanalytics.wordpress.com/2011/11/11/big-datainfographic-and-gartner-2012-top-10-strategic-tech-trends/
3. Global Internet Traffic Projected to Quadruple by 2015. The Network
(press release) (June 1, 2011). http://newsroom.cisco.com/press-releasecontent?type=webcontent&articleId=324003
4. Big Data: The Next Frontier for Innovation, Competition, and
Productivity. McKinsey Global Institute (May 2011). mckinsey.com/
Insights/MGI/Research/Technology_and_Innovation/Big_data_The_
next_frontier_for_innovation.pdf
5. Peer Research on Big Data Analytics: Intel's IT Manager Survey on
How Organizations Are Using Big Data. Intel (August 2012). intel.com/
content/www/us/en/big-data/data-insights-peer-research-report.html