Big Data and Hadoop
[Figure: Evolution of the web (scale vs. time): Publish → Inform → Interact → Integrate → Transact → Discover (intelligence) → Automate (discovery); the trend points toward the semantic web, HPC/cloud, and data-intensive computing.]
Ref: http://www.focus.com/fyi/operations/10-largest-databases-in-the-world/
Challenges
Alignment with the needs of the business / users / non-computer specialists / community and society.
Need to address the scalability issue: large-scale data, high-performance computing, automation, response time, rapid prototyping, and rapid time to production.
Need to effectively address (i) the ever-shortening cycle of obsolescence, (ii) heterogeneity, and (iii) rapid changes in requirements.
Transform data from diverse sources into intelligence, and deliver that intelligence to the right people, users, and systems.
And what about providing all of this in a cost-effective manner?
Grid computing, as presented to industry (2005)
Emerging enabling technology.
Natural evolution of distributed systems and the Internet.
Middleware supporting a network of systems to facilitate sharing, standardization, and openness.
Infrastructure and application model dealing with the sharing of compute cycles, data, storage, and other resources.
Publicized by prominent industries as on-demand computing, utility computing, etc.
A move towards delivering computing to the masses, similar to other utilities (electricity and voice communication).
Now, hmmm... this sounds like the definition of cloud computing!
Platform (PaaS)
Software (SaaS)
Infrastructure (IaaS)
Services-based application programming interface (API)
Enabling Technologies
[Figure: Cloud-enabling technologies. Cloud applications (data-intensive, compute-intensive, storage-intensive) reach the production environment over bandwidth through a services interface built on web services, SOA, and WS standards; behind it sit virtual machines (VM0, VM1, ..., VMn) and storage models (S3, BigTable, BlobStore, ...) offering simple storage, a <key, value> table store, and drives, all accessible through web services.]
Windows Azure
Enterprise-level on-demand capacity builder.
A fabric of cycles and storage, available on request for a cost.
You have to use the Azure API to work with the infrastructure offered by Microsoft (a minimal storage sketch follows below).
Significant features: web role, worker role, blob storage, table storage, and drive storage.
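
To make the Azure API concrete, here is a minimal sketch of working with blob storage from Python, assuming the azure-storage-blob SDK; the connection string, container name, and blob name are placeholders, not values from this deck.

```python
# A minimal sketch of Azure blob storage via the azure-storage-blob SDK.
# The connection string and all names below are placeholder assumptions.
from azure.storage.blob import BlobServiceClient

# Assumed: a storage-account connection string copied from the Azure portal.
conn_str = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=..."

service = BlobServiceClient.from_connection_string(conn_str)

# Create a container and upload a blob into it.
container = service.get_container_client("demo-container")
container.create_container()
container.upload_blob(name="hello.txt", data=b"Hello from Azure blob storage")

# Read the blob back.
text = container.download_blob("hello.txt").readall()
print(text)
```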
Amazon EC2
Amazon EC2 is one large, complex web service.
EC2 provides an API for instantiating computing instances with any of the supported operating systems (see the sketch after this list).
It can facilitate computations through Amazon Machine Images (AMIs) for various other models.
Signature features: S3, Cloud Management Console, MapReduce Cloud, Amazon Machine Image (AMI).
Excellent distribution, load balancing, and cloud monitoring tools.
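
As a concrete illustration of the EC2 API, here is a minimal sketch of instantiating a compute instance with the boto3 Python SDK; the region, AMI ID, and key pair name are hypothetical placeholders.

```python
# A minimal sketch of launching an EC2 instance through boto3.
# The AMI ID and key pair name are hypothetical, not real values.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

# Launch one small instance from an Amazon Machine Image (AMI).
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical AMI ID
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
    KeyName="my-keypair",             # hypothetical key pair
)

instance = instances[0]
instance.wait_until_running()
instance.reload()
print(instance.id, instance.public_ip_address)
```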
Demos
Amazon AWS: EC2 & S3 (among the many infrastructure services)
o Linux machine
o Windows machine
o A three-tier enterprise application
Windows Azure
o Storage: blob store/container
o MS Visual Studio Azure development and production environment
What is Hadoop?
At Google, MapReduce operations are run on a special file system called the Google File System (GFS) that is highly optimized for this purpose.
GFS is not open source.
Doug Cutting and others at Yahoo! reverse-engineered GFS and called the result the Hadoop Distributed File System (HDFS).
The software framework that supports HDFS, MapReduce, and other related entities is called the Hadoop project, or simply Hadoop.
It is open source and distributed by Apache.
Fault tolerance
Failure is the norm rather than the exception.
An HDFS instance may consist of thousands of server machines, each storing part of the file system's data.
Since there is a huge number of components, and each component has a non-trivial probability of failure, some component is always non-functional.
Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
HDFS Architecture
[Figure: HDFS architecture. A client issues metadata ops to the Namenode, which keeps the metadata (name, replicas, ..., e.g. /home/foo/data, 6, ...); clients read and write blocks directly against the Datanodes, which store the blocks and replicate them across racks (Rack 1, Rack 2) under the Namenode's block ops.]
[Figure: HDFS client view. On the master node, an application talks to the HDFS client alongside the local file system (block size ~2 KB); the name nodes manage HDFS files split into much larger blocks (128 MB) that are replicated across the cluster. A small usage sketch follows.]
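
To make these numbers concrete, here is a minimal sketch of everyday HDFS interactions driven from Python through the standard hdfs dfs shell commands; it assumes a running HDFS instance and a local file data.txt, and the path and replication factor mirror the /home/foo/data example in the figure.

```python
# A minimal sketch of basic HDFS operations via the "hdfs dfs" shell,
# assuming a running HDFS instance; paths and file names are examples.
import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' subcommand, raising if it fails."""
    subprocess.run(["hdfs", "dfs", *args], check=True)

hdfs("-mkdir", "-p", "/home/foo")           # create a directory
hdfs("-put", "data.txt", "/home/foo/data")  # upload; HDFS splits it into 128 MB blocks
hdfs("-setrep", "6", "/home/foo/data")      # set the replication factor
hdfs("-cat", "/home/foo/data")              # stream the file back
```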
What is MapReduce?
MapReduce is a programming model Google has used successfully in processing its big data sets (~20 petabytes per day).
A map function extracts some intelligence from raw data.
A reduce function aggregates the data output by the map according to some guides.
Users specify the computation in terms of a map and a reduce function (sketched below).
The underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, and
it also handles machine failures, efficient communications, and performance issues.
-- Reference: Dean, J. and Ghemawat, S. 2008. MapReduce: Simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113.
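
To ground the model, here is a minimal word-count sketch in this map/reduce style: map_fn extracts (word, 1) pairs from raw text and reduce_fn aggregates them per word. It runs locally, with a dictionary standing in for the shuffle that a real cluster runtime performs; the sample lines are made up.

```python
# A minimal word-count sketch in the MapReduce style: the map function
# extracts (word, 1) pairs from raw text, the reduce function aggregates
# the counts per word. On a cluster, the runtime would shard the input
# and shuffle the pairs automatically; a dict stands in for that here.
from collections import defaultdict

def map_fn(line):
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    return word, sum(counts)

# Local stand-in for the shuffle phase: group map output by key.
lines = ["the quick brown fox", "the lazy dog", "the fox"]
groups = defaultdict(list)
for line in lines:
    for word, one in map_fn(line):
        groups[word].append(one)

for word in sorted(groups):
    print(reduce_fn(word, groups[word]))
```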
[Figure: MapReduce data flow for counting. Map tasks parse the input and hash each key ("parse-hash"), count tasks aggregate, and the results land in partitions P-0000 (<key, count1>), P-0001 (<key, count2>), and P-0002 (<key, count3>). A sketch of this hash partitioning follows.]
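
The "parse-hash" boxes in the figure correspond to hash partitioning: each key is hashed to one of a fixed number of partitions so that all values for a key reach the same reducer. A small sketch, with made-up pairs, approximating Hadoop's hashCode-modulo scheme with CRC32:

```python
# A sketch of hash partitioning: each key is hashed to one of a fixed
# number of partitions (P-0000, P-0001, P-0002), so all counts for the
# same key land in the same reduce partition. Pairs here are made up.
import zlib

NUM_PARTITIONS = 3

def partition(key, num_partitions=NUM_PARTITIONS):
    # Stable hash of the key, modulo the number of reduce partitions
    # (Hadoop's default partitioner does hashCode % numReduceTasks).
    return zlib.crc32(key.encode()) % num_partitions

pairs = [("apple", 3), ("banana", 2), ("cherry", 5)]
partitions = {f"P-{i:04d}": [] for i in range(NUM_PARTITIONS)}
for key, count in pairs:
    partitions[f"P-{partition(key):04d}"].append((key, count))
print(partitions)
```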
MapReduce Engine
MapReduce requires a distributed file system and an engine that can distribute, coordinate, monitor, and gather the results.
Hadoop provides that engine through HDFS (the file system we discussed earlier) and the JobTracker + TaskTracker system.
The JobTracker is simply a scheduler.
A TaskTracker is assigned Map or Reduce (or other) operations; Map and Reduce tasks run on the same node as their TaskTracker, and each task runs in its own JVM on that node (a submission sketch follows).
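
As one way to hand a job to this engine, here is a hedged sketch of submitting the earlier word-count mapper and reducer (rewritten as stdin/stdout scripts mapper.py and reducer.py) through Hadoop Streaming; the jar location and HDFS paths are assumptions about the installation.

```python
# A sketch of submitting a MapReduce job through Hadoop Streaming, which
# lets the JobTracker/TaskTracker engine run Python map and reduce
# scripts; the jar path and HDFS paths below are assumptions.
import subprocess

subprocess.run([
    "hadoop", "jar", "/usr/lib/hadoop/hadoop-streaming.jar",
    "-input", "/home/foo/data",        # input in HDFS
    "-output", "/home/foo/wordcount",  # output directory in HDFS
    "-mapper", "mapper.py",            # map task run by each TaskTracker
    "-reducer", "reducer.py",          # reduce task
    "-file", "mapper.py",              # ship the scripts to the nodes
    "-file", "reducer.py",
], check=True)
```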
Demos
Word count application: a simple foundation for text mining, using a small text corpus of inaugural speeches by US presidents.
Graph analytics is at the core of analytics involving linked structures (about 110 nodes): shortest path (a sketch of one MapReduce iteration follows).
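
For a sense of how shortest path maps onto this model, here is a sketch of one iteration of MapReduce-style parallel BFS over a tiny made-up adjacency list; repeating the iteration until distances stop changing yields single-source shortest paths.

```python
# A sketch of one iteration of MapReduce-style single-source shortest
# path (parallel BFS); the graph and node names are made up.
INF = float("inf")

# node -> (current distance from the source, neighbor list)
graph = {"A": (0, ["B", "C"]), "B": (INF, ["D"]),
         "C": (INF, ["D"]), "D": (INF, [])}

def map_fn(node, dist, neighbors):
    yield node, ("NODE", dist, neighbors)      # pass the structure through
    if dist < INF:
        for n in neighbors:
            yield n, ("DIST", dist + 1, None)  # tentative distance

def reduce_fn(node, values):
    best, neighbors = INF, []
    for kind, dist, nbrs in values:
        if kind == "NODE":
            neighbors = nbrs
        best = min(best, dist)                 # keep the shortest distance
    return node, (best, neighbors)

# One iteration; repeating until distances stop changing completes BFS.
groups = {}
for node, (dist, neighbors) in graph.items():
    for key, value in map_fn(node, dist, neighbors):
        groups.setdefault(key, []).append(value)
graph = dict(reduce_fn(n, vals) for n, vals in groups.items())
print(graph)  # after one hop: B and C are now at distance 1
```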
A Case Study in Business: Cloud Strategies
Problem / Motivation:
Identify special causes that relate to bad outcomes for the quality-
Solution:
Use semantic technologies to provide key insights into how outcomes and causes are related.
Develop a rich internet application that allows the user to evaluate process outcomes and conditions at a high level, and to drill down to specific areas of interest to address performance issues.
Summary