New World Hadoop Architectures (& What Problems They Really Solve) For DBAs

1) Hadoop and NoSQL technologies solved the problems of scaling data affordably and making data warehousing more agile.
2) Traditional data warehousing struggled with the cost and complexity of scaling to store vast amounts of new data sources; Hadoop provided a low-cost way to store and process data at massive scale.
3) Hadoop also enabled data to be stored in its original format and structure, with schemas applied flexibly at query time, addressing the need to analyse new data types and sources without defining their structure up front.

NEW WORLD HADOOP ARCHITECTURES (& WHAT

PROBLEMS THEY REALLY SOLVE) FOR DBAS


UKOUG DATABASE SIG MEETING
Mark Rittman, Oracle ACE Director
London, February 2017
About The Presenter
Oracle ACE Director, Independent Analyst
Past ODTUG Exec Board Member + Oracle Scene Editor
Author of two books on Oracle BI
Co-founder & CTO of Rittman Mead
15+ Years in Oracle BI, DW, ETL + now Big Data
Host of the Drill to Detail Podcast (www.drilltodetail.com)
Based in Brighton & work in London, UK

BACK IN FEBRUARY

Hi Mark, in things I have seen and read, quite often people start with a high-level overview of a product (e.g. Hadoop, Kafka), then describe the technical concepts (using all the appropriate terminology), but I am usually left missing something. I think it's around the area of what problems these technologies are solving, and how they are doing it? Without that context I'm finding it all very academic.

Many people say traditional systems will still be needed. Are these new technologies solving completely different problems to those handled by traditional IT? Is there an overlap?
20 Years in Old-school BI & Data Warehousing
Started back in 1996 on an Oracle DW project for a bank
Our tools were Oracle 7.3.4, SQL*Plus, PL/SQL and shell scripts
Data warehouses provided a unified view of the business
Single place to store key data and metrics
Joined-up view of the business
Aggregates and conformed dimensions
ETL routines to load, cleanse and conform data
BI tools for simple, guided access to information
Tabular data access using SQL-generating tools
Drill paths, hierarchies, facts, attributes
Fast access to pre-computed aggregates
Packaged BI for fast-start ERP analytics

Data Warehousing and BI at Peak Oracle

Oracle Data Management Platform as of Today

What Happened?
Let's Go Back to 2003

Google File System and MapReduce
Google needed to store and query their vast amounts of server log files
And wanted to do so using cheap, commodity hardware
Google File System and MapReduce were designed together for this use case

Google File System + MapReduce Key Innovations
GFS optimised for the particular task at hand - computing PageRank for sites
Streaming reads for PageRank calcs, block writes for crawler whole-site dumps
Master node only holds metadata
Stops client/master I/O becoming the bottleneck; master also acts as traffic controller for clients
Simple design, optimised for a specific Google need
MapReduce focused on simple computations within an abstraction framework
Select & filter (map) and aggregate (reduce) functions, easy to distribute across a cluster (see the sketch below)
MapReduce abstracted cluster compute, GFS abstracted cluster storage
Projects that inspired Apache Hadoop + HDFS
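To make the map and reduce phases concrete, here is a minimal, self-contained Python sketch of the canonical word-count job. It is illustrative only - real Hadoop jobs of this era were Java programs written against the MapReduce API - and every name in it is made up:

```python
from collections import defaultdict

# MAP phase: each mapper takes a chunk of input and emits (key, value)
# pairs - here, (word, 1) for every word it sees. Mappers run in
# parallel across the cluster, one per input split.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# SHUFFLE: the framework groups emitted pairs by key, so all values
# for a given key arrive at the same reducer.
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# REDUCE phase: each reducer aggregates the values for its keys -
# here, summing the 1s to get a count per word.
def reduce_phase(grouped):
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

The point of the design is that only map_phase and reduce_phase contain job-specific logic; the framework owns partitioning, scheduling, the shuffle and recovery from node failure.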
How Traditional RDBMS Data Warehousing Scaled Up

Shared-Everything Architectures (e.g. Oracle RAC, Exadata)

Shared-Nothing Architectures (e.g. Teradata, Netezza)

Problem #1 That Hadoop / NoSQL Solved :

Scaling Affordably
Oracle scales infinitely and is free. Period
Cost and Complexity around Scaling DW Clusters
Enterprise High-End RDBMSs such as Oracle can scale
Clustering for single-instance DBs can scale to >PB
Exadata scales further by offloading queries to storage
Sharded databases (e.g. Netezza) can scale further
But cost (and complexity) become limiting factors
$1m/node is not uncommon

Hadoop's Original Appeal to Data Warehouse Owners
A way of storing (non-relational) data cheaply, and easily expandable
Gave us a way of scaling beyond TB-size without paying $$$
First use-cases were offline storage and an active archive of data


Hadoop Ecosystem Expanded Beyond MapReduce
Core Hadoop, MapReduce and HDFS
HBase and other NoSQL Databases
Apache Hive and SQL-on-Hadoop
Storm, Spark and Stream Processing
Apache YARN and Hadoop 2.0

Google BigTable, HBase and NoSQL Databases
Solution to the problem of storing semi-structured data at scale
Built on Google File System
Scale for capacity, e.g. the webtable:
100,000,000,000 pages × 10 versions per page × 20 KB/version = 20 PB of data
Scale for throughput:
Hundreds of millions of users
Tens of thousands to millions of queries/sec
At low latency with high reliability

How BigTable Scaled Beyond Traditional RDBMSs
Optimised for a particular task - fast lookups of timestamp-versioned web data
Data stored in a multidimensional map keyed on row, column + timestamp
Master + data tablets stored on GFS cluster nodes
Simple key/value lookup, with the client doing the interpretation
Innovation - focus on a single job with different needs to OLTP
Formed the inspiration for Apache HBase (data model sketched below)
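As a rough picture of that data model - not BigTable's actual implementation, and with hypothetical keys and values - the store behaves like a sparse, sorted map keyed on (row, column, timestamp):

```python
import time

# Illustrative sketch of the BigTable/HBase data model: a sparse
# multidimensional map keyed on (row key, column, timestamp).
# Real BigTable distributes this structure across tablets on GFS.
webtable = {}

def put(row, column, value, ts=None):
    webtable[(row, column, ts or time.time())] = value

def get_latest(row, column):
    # Simple key/value lookup; the client interprets versions itself,
    # here by scanning for the newest timestamp.
    versions = [(ts, v) for (r, c, ts), v in webtable.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

put("com.example.www", "contents:html", "<html>v1</html>", ts=1)
put("com.example.www", "contents:html", "<html>v2</html>", ts=2)
print(get_latest("com.example.www", "contents:html"))  # <html>v2</html>
```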

Hive - Hadoop Discovers Set-Based Processing
Originally developed at Facebook, now foundational within Hadoop
SQL-like language that compiles to MapReduce (and later Tez and Spark) jobs
Solved the problem of enabling non-programmers to access big data
And made writing Hadoop data transformation and aggregation code more productive
JDBC and ODBC drivers for tool integration (see the example below)
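As a hedged sketch of what that looks like from a client - the host, port and table are hypothetical, and it assumes the third-party PyHive package and a running HiveServer2:

```python
from pyhive import hive  # third-party package: PyHive

# Connect over the Thrift service that the JDBC/ODBC drivers also use.
conn = hive.connect(host="hadoop-edge.example.com", port=10000)
cursor = conn.cursor()

# A set-based aggregation in HiveQL; Hive compiles this to MapReduce
# (or Tez/Spark) jobs - no Java required from the analyst.
cursor.execute("""
    SELECT page, COUNT(*) AS hits
    FROM   web_logs
    GROUP  BY page
    ORDER  BY hits DESC
    LIMIT  10
""")
for page, hits in cursor.fetchall():
    print(page, hits)
```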

Apache Hive as SQL Access Engine For Everything
Hive is extensible to help with accessing and integrating new data sets:
SerDes : Serializer-Deserializers that interpret semi-structured sources (example below)
UDFs + Hive Streaming : user-defined functions and streaming input
File Formats : make use of compressed and/or optimised file storage
Storage Handlers : use storage other than HDFS (e.g. MongoDB)
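For instance, a SerDe lets Hive project a table over raw JSON files where they already sit. A sketch, continuing the hypothetical PyHive session above (the path and columns are made up; org.apache.hive.hcatalog.data.JsonSerDe is the stock Hive JSON SerDe):

```python
# Declare a table over raw JSON files already sitting in HDFS.
# The SerDe deserializes each line at query time - no load step,
# no up-front conversion of the source data.
cursor.execute("""
    CREATE EXTERNAL TABLE clickstream (
        user_id   STRING,
        url       STRING,
        event_ts  STRING
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    LOCATION '/data/raw/clickstream/'
""")
```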

Common Hadoop/NoSQL Use-Case
Hadoop as low-cost ETL pre-processing engine - ETL-offload
NoSQL database for landing real-time data at high speed/low latency
Incoming data then aggregated and stored in the RDBMS DW
[Diagram: an online, scalable, flexible and cost-effective Hadoop tier feeding the data warehouse and marts, with business intelligence tools on top]
Jump Ahead to 2012

Data Warehousing and ETL Needed Some Agility
Driven by pace of business, and user demands for more agility and control
Traditional IT-governed data loading not always appropriate
Not all data needed to be modelled right away
Not all data was suited to storage in tabular form
New ways of analyzing data beyond SQL
Graph analysis
Machine learning

Problem #2 That Hadoop / NoSQL Solved :

Making Data Warehousing Agile


Advent of Schema-on-Read, and Data Lakes
Storing data in the format it arrived in, then applying a schema at query time
Suits data that may be analysed in different ways by different tools
In addition, some datatypes may have a schema embedded in the file format
Key benefit - fast-arriving data of unknown value can get to users earlier
Made possible by tools such as Apache Hive + SerDes, Apache Drill and self-describing file formats, and HDFS storage (idea sketched below)
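The idea in miniature - a pure-Python sketch with made-up records, just to show the schema being applied at read time rather than at load time:

```python
import json

# Schema-on-write (RDBMS): structure is enforced when data lands.
# Schema-on-read (data lake): raw lines land as-is...
raw_events = [
    '{"user": "alice", "action": "login", "ms": 42}',
    '{"user": "bob", "action": "purchase", "amount": 9.99}',
]

# ...and each consumer applies the schema it needs at query time.
def query(raw, fields):
    for line in raw:
        record = json.loads(line)
        yield tuple(record.get(f) for f in fields)

# Two tools, two schemas, one copy of the data.
print(list(query(raw_events, ["user", "action"])))   # behavioural view
print(list(query(raw_events, ["user", "amount"])))   # revenue view
```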

Meet the New Data Warehouse : The Data Lake
Data now landed in Hadoop clusters, NoSQL databases and Cloud Storage
Flexible data storage platform with cheap storage, flexible schema support + compute
Solves the problem of how to store new types of data + choose best time/way to process it
Hadoop/NoSQL increasingly used for all store/transform/query tasks

[Diagram: Hadoop Platform data lake. Operational data (segments, transactions, customer data, unstructured data, voice + chat transcripts) arrives via file-based, ETL-based and stream-based integration into a Data Reservoir, where data streams and data sets are stored raw in their original format (usually files) such as SS7, ASN.1, JSON etc. A Data Factory maps and transforms the raw data into business models and customer master data, consumed by intelligence tools, machine learning and marketing/sales applications. Discovery & Development Labs provide a safe and secure discovery and development environment for data sets, samples, models and programs, spanning data transfer and data access.]
Hadoop 2.0 and YARN (Yet Another Resource Negotiator)

Key Innovation : separating how data is stored from how it is processed
Hadoop 2.0 - Enabling Multiple Query Engines
Hadoop started out synonymous with MapReduce, and with Java coding
But YARN broke this dependency
YARN now handles resource management for the cluster
Multiple different query engines can run against the data in-place (see the sketch below):
General-purpose (e.g. MapReduce)
Graph processing
Machine Learning
Real-Time Processing
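A hedged illustration of engines sharing data in place - it assumes PySpark with Hive support configured on the cluster, and reuses the hypothetical web_logs table from the earlier Hive example:

```python
from pyspark.sql import SparkSession

# With YARN owning resource management, a Spark application can run on
# the same cluster and query the same Hive-catalogued data in place -
# no export from one engine's silo into another's.
spark = (SparkSession.builder
         .appName("in-place-query")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page") \
     .show(10)
```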

Technologies Emerged to Bridge Old/New World

FAST FORWARD TO NOW

Elastically-Scalable Data Warehouse-as-a-Service
New generation of big data platform services from Google, Amazon, Oracle
Combines three key innovations from earlier technologies:
Organisation of data into tables and columns (from RDBMS DWs)
Massively-scalable and distributed storage and query (from Big Data)
Elastically-scalable Platform-as-a-Service (from Cloud)

Which Is What I'm Working On Right Now

Example Architecture : Google BigQuery

What Problem Did Analytics-as-a-Service Solve?
On-premise Hadoop, even with simple resilient clustering, will hit limits
Clusters can reach 5,000+ nodes, and need to scale up for demand peaks etc.
Those scale limits are encountered way beyond the limits for DWs
but the future is elastically-scaled query- and compute-as-a-service

Oracle Big Data Cloud Compute Edition


Free $300 developer credit at:
https://cloud.oracle.com/en_US/tryit

BigQuery : Big Data Meets Data Warehousing
And things come full circle - analytics typically requires tabular data
Google BigQuery is based on Dremel, Google's massively-parallel query engine
But it stores data in columnar format and provides a SQL interface
Solves the problem of providing DW-like functionality at scale, as-a-service
This is the future ;-)
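A minimal sketch using the google-cloud-bigquery Python client (it assumes a GCP project with credentials already configured; the Shakespeare sample is one of Google's public datasets):

```python
from google.cloud import bigquery

# No cluster to provision or size: BigQuery is query-as-a-service,
# running standard SQL over columnar storage and billing by bytes scanned.
client = bigquery.Client()

sql = """
    SELECT corpus, SUM(word_count) AS total_words
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY corpus
    ORDER BY total_words DESC
    LIMIT 5
"""
for row in client.query(sql).result():
    print(row.corpus, row.total_words)
```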

NEW WORLD HADOOP ARCHITECTURES (& WHAT
PROBLEMS THEY REALLY SOLVE) FOR DBAS
UKOUG DATABASE SIG MEETING
Mark Rittman, Oracle ACE Director
London, February 2017
