
Lecture8 -Big Data (Hadoop)

Big Data using Hadoop

Uploaded by

amirosama2121

CSIS22H

Advanced Database Systems

Lecture 8
Big Data
“I have been surprised and delighted over the years about how
many people are interested in working with data. There’s
definitely a new geek in town. And in 2015, this geek is a data
geek.”
Christian Chabot, founder and CEO - Tableau

“We have for the first time an economy based on a key
resource [information] that is not only renewable, but
self-generating. Running out of it is not a problem, but
drowning in it is.”
John Naisbitt, American author and public speaker

“Big data = Crude oil …
But you need to refine the crude oil.
Enter Data Science”
Carlos Somohano, Data Scientist - London

“It’s a great time to be a data geek.”
Roger Barga, Microsoft Research

“There is a big data revolution”
Prof. Gary King, Director for the IQSS - Harvard Univ.
Lecture Contents:
• Why Big Data?
• Definition – 3 & 4 Vs
• Tools for Big Data
• IBM’s Big Data Platform
• What is Hadoop
• Hadoop vs. Other Systems
• Some Hadoop Related Names to Know
Why Big Data?
• 2.5 quintillion (2.5 × 10^18) bytes of data are generated every day!
• Social media sites
• Sensors
• Digital photos
• Business transactions
• Location-based data
Figure (data sources): Website, Social Media, Billing, ERP, CRM, RFID, Network Switches
Source: IBM http://www-01.ibm.com/software/data/bigdata/
Why Big Data?
• Big data itself isn’t new: it has been here for a while and has been growing
exponentially. What is new is the technology to process and analyze it.
• Increase of storage capacities
• Increase of processing power
• Availability of data
Available technology can cost-effectively manage and analyze all
available data in its native form: unstructured, structured, streaming.

It is all about deriving new insight for the business


Why Big Data?
• Big data is about deriving new insight from previously untouched data &
integrating that insight into your business operation.

• It’s about applying new tools to do more analytics on more data for more
people.
Glen Mules – Big Data University
Big Data - Definition

“Big Data is any data that is expensive to manage and hard
to extract value from.”
Michael Franklin
Thomas M. Siebel Professor of Computer Science
Director of the Algorithms, Machines and People Lab
University of California, Berkeley

Key idea: “Big” is relative! “Difficult Data” is perhaps more apt!

Bill Howe, UW
Big Data Scenario: Netflix
Big Data Scenario: Amazon
Big Data Characteristics: 3 V’s
• Volume
The size of the data
Terabyte = 10^12 bytes
Exabyte = 10^18 bytes
Zettabyte = 10^21 bytes
Brontobyte = 10^27 bytes

• Velocity
The speed at which new data is generated

• Variety
The diversity of sources, formats, quality, structures
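
To give a feel for the Volume scale, the sketch below treats the units listed above as plain powers of ten and combines them with the "2.5 quintillion bytes per day" figure from the "Why Big Data?" slide. The names `UNITS`, `BYTES_PER_DAY` and `days_to_accumulate` are illustrative, and the constant daily rate is an assumption made only for the arithmetic.

```python
# Decimal byte units from the slide, expressed as powers of ten.
UNITS = {
    "terabyte": 10**12,
    "exabyte": 10**18,
    "zettabyte": 10**21,
    "brontobyte": 10**27,
}

# Assumption: a constant rate of 2.5 quintillion (2.5 x 10^18) bytes per day,
# the figure quoted on the "Why Big Data?" slide.
BYTES_PER_DAY = 25 * 10**17


def days_to_accumulate(unit: str, bytes_per_day: int = BYTES_PER_DAY) -> float:
    """Days needed to generate one `unit` of data at the given daily rate."""
    return UNITS[unit] / bytes_per_day


# The daily figure expressed in exabytes.
daily_exabytes = BYTES_PER_DAY / UNITS["exabyte"]

# How long it takes to reach one zettabyte at that rate.
days_per_zettabyte = days_to_accumulate("zettabyte")

print(f"{daily_exabytes} exabytes per day")
print(f"{days_per_zettabyte} days per zettabyte")
```

At that assumed rate the daily volume is 2.5 exabytes, so one zettabyte accumulates in 400 days.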
They could also be 4 V’s

© 2014 IBM Corporation


OR 6 V’s

10 Vs
Traditional Data Warehouse Solution
Problem with Traditional DWH Solution
Tools for Big Data
• NoSQL Systems
MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Hypertable,
Voldemort, Riak, ZooKeeper, Neo4j
• MapReduce
Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR,
Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum
• Storage
S3 (Simple Storage Service), Hadoop Distributed File System
Big Data is Not JUST Hadoop → Big Data is a platform
• Understand and navigate federated big data sources → Federated Discovery and Navigation
• Manage & store huge volume of any data → Hadoop File System, MapReduce
• Structure and control data → Data Warehousing
• Manage streaming data → Stream Computing
• Analyze unstructured data → Text Analytics Engine
• Integrate and govern all data sources → Integration, Data Quality, Security, Lifecycle
Management, MDM

Source: IBM http://www-01.ibm.com/software/data/bigdata/


IBM’s Big Data Platform
The key aspects of the platform are:
• Integration
• Analytics
• Visualization
• Development
• Workload Optimization
• Security and Governance

Source: IBM http://www-01.ibm.com/software/data/bigdata/


What is Hadoop
• Hadoop is a distributed file system and data processing engine that is designed to
handle extremely high volumes of data in any structure across large clusters of
computers.
• Hadoop has two components:
1. The Hadoop distributed file system (HDFS), which supports data in structured relational
form, in unstructured form, and in any form in between
2. The MapReduce programming paradigm for managing applications on multiple distributed
servers
• The focus is on supporting redundancy, distributed architectures, and parallel
processing
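
The MapReduce paradigm named above can be sketched in a few lines. The example below is a single-process, in-memory word count written in plain Python. It does not use the Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative only. On a real cluster the map and reduce calls would run in parallel on different servers, with the framework performing the shuffle.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(document: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key (done by the framework in Hadoop)."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle and reduce sequentially over the input documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data is big", "hadoop handles big data"])
print(counts)
```

Because each map call sees only one record and each reduce call sees only one key, the work can be split across machines without the functions needing to know about each other.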
Scalability in Hadoop
What is Hadoop
Hadoop vs RDBMS
Bigger Picture: Hadoop vs. Other Systems
                     Distributed Databases              Hadoop
Computing Model      - Notion of transactions           - Notion of jobs
                     - Transaction is the unit of work  - Job is the unit of work
                     - ACID properties, concurrency     - No concurrency control
                       control
Data Model           - Structured data with known       - Any data will fit in any format
                       schema                           - (un)(semi)structured
                     - Read/Write mode                  - ReadOnly mode
Cost Model           - Expensive servers                - Cheap commodity machines
Fault Tolerance      - Failures are rare                - Failures are common over
                     - Recovery mechanisms                thousands of machines
                                                        - Simple yet efficient fault
                                                          tolerance
Key Characteristics  - Efficiency, optimizations,       - Scalability, flexibility, fault
                       fine-tuning                        tolerance
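
A short calculation shows why failures that are rare on one server become common across thousands of machines. This is only a sketch: it assumes machines fail independently, and the per-machine daily failure probability `P_FAIL_PER_DAY` is a hypothetical value chosen for illustration, not a measured figure.

```python
def prob_any_failure(machines: int, p_fail: float) -> float:
    """Probability that at least one of `machines` fails, assuming independence."""
    return 1.0 - (1.0 - p_fail) ** machines


# Hypothetical value: each machine has a 0.1% chance of failing on a given day.
P_FAIL_PER_DAY = 0.001

for n in (1, 10, 100, 1000, 5000):
    p = prob_any_failure(n, P_FAIL_PER_DAY)
    print(f"{n:>5} machines: P(at least one failure per day) = {p:.3f}")
```

Under these assumptions the chance of at least one failure per day grows from negligible on a single server to more likely than not on a cluster of a thousand machines, which is why Hadoop treats fault tolerance as part of normal operation.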
Some Hadoop Related Names to Know
• Apache Avro: designed for communication between Hadoop nodes through data
serialization
• Cassandra and HBase: non-relational databases designed for use with Hadoop
• Hive: provides a query language similar to SQL (HiveQL) that is compatible with Hadoop
• Mahout: an AI tool designed for machine learning; that is, to assist with filtering
data for analysis and exploration
• Pig: a data-flow language (Pig Latin) and execution framework for parallel computation
• ZooKeeper: keeps all the parts coordinated and working together
