New World Hadoop Architectures (& What Problems They Really Solve) For DBAs

1) Hadoop and NoSQL technologies solved the problems of scaling data affordably and making data warehousing more agile.
2) Traditional data warehousing struggled with the cost and complexity of scaling to store vast amounts of new data sources; Hadoop provided a low-cost way to store and process data at massive scale.
3) Hadoop also enabled data to be stored in its original format and structure, with schemas applied flexibly at query time, addressing the need to analyse new data types and sources without defining their structure up front.

NEW WORLD HADOOP ARCHITECTURES (& WHAT

PROBLEMS THEY REALLY SOLVE) FOR DBAS


UKOUG DATABASE SIG MEETING
Mark Rittman, Oracle ACE Director
London, February 2017
About The Presenter
Oracle ACE Director, Independent Analyst
Past ODTUG Exec Board Member + Oracle Scene Editor
Author of two books on Oracle BI
Co-founder & CTO of Rittman Mead
15+ Years in Oracle BI, DW, ETL + now Big Data
Host of the Drill to Detail Podcast (www.drilltodetail.com)
Based in Brighton & work in London, UK

BACK IN FEBRUARY

Hi Mark, in things I have seen and read, quite often people start with a high-level overview of a product (e.g. Hadoop, Kafka), then describe the technical concepts (using all the appropriate terminology), but I am usually left missing something. I think it's around the area of what problems these technologies are solving, and how they are doing it? Without that context I'm finding it all very academic.

Many people say traditional systems will still be needed. Are these new technologies solving completely different problems to those handled by traditional IT? Is there an overlap?
20 Years in Old-school BI & Data Warehousing
Started back in 1996 on an Oracle DW project for a bank
Our tools were Oracle 7.3.4, SQL*Plus, PL/SQL and shell scripts
Data warehouses provided a unified view of the business
Single place to store key data and metrics
Joined-up view of the business
Aggregates and conformed dimensions
ETL routines to load, cleanse and conform data
BI tools for simple, guided access to information
Tabular data access using SQL-generating tools
Drill paths, hierarchies, facts, attributes
Fast access to pre-computed aggregates
Packaged BI for fast-start ERP analytics

Data Warehousing and BI at Peak Oracle

Oracle Data Management Platform as of Today

What Happened?
Let's Go Back to 2003

Google File System and MapReduce
Google needed to store and query their vast amounts of server log files
And wanted to do so using cheap, commodity hardware
Google File System and MapReduce were designed together for this use case

Google File System + MapReduce Key Innovations
GFS optimised for the particular task at hand - computing PageRank for sites
Streaming reads for PageRank calcs, block writes for crawler whole-site dumps
Master node only holds metadata
Stops client/master I/O becoming the bottleneck; master also acts as traffic controller for clients
Simple design, optimised for a specific Google need
MapReduce focused on simple computations within an abstraction framework
Select & filter (map) and aggregate (reduce) functions, easy to distribute across a cluster (see the sketch below)
MapReduce abstracted cluster compute, GFS abstracted cluster storage
Projects that inspired Apache Hadoop + HDFS
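To make the map and reduce phases concrete, here is a minimal, self-contained Python sketch of the canonical word-count job. It is illustrative only - real Hadoop jobs of this era were Java programs written against the MapReduce API - and every name in it is made up:

```python
from collections import defaultdict

# MAP phase: each mapper takes a chunk of input and emits (key, value)
# pairs - here, (word, 1) for every word it sees. Mappers run in
# parallel across the cluster, one per input split.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# SHUFFLE: the framework groups emitted pairs by key, so all values
# for a given key arrive at the same reducer.
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# REDUCE phase: each reducer aggregates the values for its keys -
# here, summing the 1s to get a count per word.
def reduce_phase(grouped):
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

The point of the design is that only map_phase and reduce_phase contain job-specific logic; the framework owns partitioning, scheduling, the shuffle and recovery from node failure.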
How Traditional RDBMS Data Warehousing Scaled Up

Shared-Everything Architectures (e.g. Oracle RAC, Exadata)

Shared-Nothing Architectures (e.g. Teradata, Netezza)

Problem #1 That Hadoop / NoSQL Solved :

Scaling Affordably
Oracle scales infinitely and is free. Period
Cost and Complexity around Scaling DW Clusters
Enterprise High-End RDBMSs such as Oracle can scale
Clustering for single-instance DBs can scale to >PB
Exadata scales further by offloading queries to storage
Sharded databases (e.g. Netezza) can scale further
But cost (and complexity) become limiting factors
$1m/node is not uncommon

Hadoop's Original Appeal to Data Warehouse Owners
A way of storing (non-relational) data cheaply, and easily expandable
Gave us a way of scaling beyond TB-size without paying $$$
First use-cases were offline storage and an active archive of data


Hadoop Ecosystem Expanded Beyond MapReduce
Core Hadoop, MapReduce and HDFS
HBase and other NoSQL Databases
Apache Hive and SQL-on-Hadoop
Storm, Spark and Stream Processing
Apache YARN and Hadoop 2.0

Google BigTable, HBase and NoSQL Databases
Solution to the problem of storing semi-structured data at scale
Built on Google File System
Scale for capacity, e.g. the webtable:
100,000,000,000 pages × 10 versions per page × 20 KB/version = 20 PB of data
Scale for throughput:
Hundreds of millions of users
Tens of thousands to millions of queries/sec
At low latency with high reliability

How BigTable Scaled Beyond Traditional RDBMSs
Optimised for a particular task - fast lookups of timestamp-versioned web data
Data stored in a multidimensional map keyed on row, column + timestamp
Master + data tablets stored on GFS cluster nodes
Simple key/value lookup, with the client doing the interpretation
Innovation - focus on a single job with different needs to OLTP
Formed the inspiration for Apache HBase (data model sketched below)
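As a rough picture of that data model - not BigTable's actual implementation, and with hypothetical keys and values - the store behaves like a sparse, sorted map keyed on (row, column, timestamp):

```python
import time

# Illustrative sketch of the BigTable/HBase data model: a sparse
# multidimensional map keyed on (row key, column, timestamp).
# Real BigTable distributes this structure across tablets on GFS.
webtable = {}

def put(row, column, value, ts=None):
    webtable[(row, column, ts or time.time())] = value

def get_latest(row, column):
    # Simple key/value lookup; the client interprets versions itself,
    # here by scanning for the newest timestamp.
    versions = [(ts, v) for (r, c, ts), v in webtable.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

put("com.example.www", "contents:html", "<html>v1</html>", ts=1)
put("com.example.www", "contents:html", "<html>v2</html>", ts=2)
print(get_latest("com.example.www", "contents:html"))  # <html>v2</html>
```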

Hive - Hadoop Discovers Set-Based Processing
Originally developed at Facebook, now foundational within Hadoop
SQL-like language that compiles to MapReduce (and later Tez and Spark) jobs
Solved the problem of enabling non-programmers to access big data
And made writing Hadoop data transformation and aggregation code more productive
JDBC and ODBC drivers for tool integration (see the example below)
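As a hedged sketch of what that looks like from a client - the host, port and table are hypothetical, and it assumes the third-party PyHive package and a running HiveServer2:

```python
from pyhive import hive  # third-party package: PyHive

# Connect over the Thrift service that the JDBC/ODBC drivers also use.
conn = hive.connect(host="hadoop-edge.example.com", port=10000)
cursor = conn.cursor()

# A set-based aggregation in HiveQL; Hive compiles this to MapReduce
# (or Tez/Spark) jobs - no Java required from the analyst.
cursor.execute("""
    SELECT page, COUNT(*) AS hits
    FROM   web_logs
    GROUP  BY page
    ORDER  BY hits DESC
    LIMIT  10
""")
for page, hits in cursor.fetchall():
    print(page, hits)
```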

Apache Hive as SQL Access Engine For Everything
Hive is extensible to help with accessing and integrating new data sets:
SerDes : Serializer-Deserializers that interpret semi-structured sources (example below)
UDFs + Hive Streaming : user-defined functions and streaming input
File Formats : make use of compressed and/or optimised file storage
Storage Handlers : use storage other than HDFS (e.g. MongoDB)
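For instance, a SerDe lets Hive project a table over raw JSON files where they already sit. A sketch, continuing the hypothetical PyHive session above (the path and columns are made up; org.apache.hive.hcatalog.data.JsonSerDe is the stock Hive JSON SerDe):

```python
# Declare a table over raw JSON files already sitting in HDFS.
# The SerDe deserializes each line at query time - no load step,
# no up-front conversion of the source data.
cursor.execute("""
    CREATE EXTERNAL TABLE clickstream (
        user_id   STRING,
        url       STRING,
        event_ts  STRING
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    LOCATION '/data/raw/clickstream/'
""")
```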

Common Hadoop/NoSQL Use-Case
Hadoop as low-cost ETL pre-processing engine - ETL-offload
NoSQL database for landing real-time data at high speed/low latency
Incoming data then aggregated and stored in the RDBMS DW
[Diagram: an online, scalable, flexible and cost-effective Hadoop tier feeding the data warehouse and marts, with business intelligence tools on top]
Jump Ahead to 2012

Data Warehousing and ETL Needed Some Agility
Driven by pace of business, and user demands for more agility and control
Traditional IT-governed data loading not always appropriate
Not all data needed to be modelled right away
Not all data was suited to storage in tabular form
New ways of analyzing data beyond SQL
Graph analysis
Machine learning

Problem #2 That Hadoop / NoSQL Solved :

Making Data Warehousing Agile


Advent of Schema-on-Read, and Data Lakes
Storing data in the format it arrived in, then applying a schema at query time
Suits data that may be analysed in different ways by different tools
In addition, some datatypes may have a schema embedded in the file format
Key benefit - fast-arriving data of unknown value can get to users earlier
Made possible by tools such as Apache Hive + SerDes, Apache Drill and self-describing file formats, and HDFS storage (idea sketched below)
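The idea in miniature - a pure-Python sketch with made-up records, just to show the schema being applied at read time rather than at load time:

```python
import json

# Schema-on-write (RDBMS): structure is enforced when data lands.
# Schema-on-read (data lake): raw lines land as-is...
raw_events = [
    '{"user": "alice", "action": "login", "ms": 42}',
    '{"user": "bob", "action": "purchase", "amount": 9.99}',
]

# ...and each consumer applies the schema it needs at query time.
def query(raw, fields):
    for line in raw:
        record = json.loads(line)
        yield tuple(record.get(f) for f in fields)

# Two tools, two schemas, one copy of the data.
print(list(query(raw_events, ["user", "action"])))   # behavioural view
print(list(query(raw_events, ["user", "amount"])))   # revenue view
```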

Meet the New Data Warehouse : The Data Lake
Data now landed in Hadoop clusters, NoSQL databases and Cloud Storage
Flexible data storage platform with cheap storage, flexible schema support + compute
Solves the problem of how to store new types of data + choose best time/way to process it
Hadoop/NoSQL increasingly used for all store/transform/query tasks

[Diagram: Hadoop Platform data lake. Operational data (segments, transactions, customer data, unstructured data, voice + chat transcripts) arrives via file-based, ETL-based and stream-based integration into a Data Reservoir, where data streams and data sets are stored raw in their original format (usually files) such as SS7, ASN.1, JSON etc. A Data Factory maps and transforms the raw data into business models and customer master data, consumed by intelligence tools, machine learning and marketing/sales applications. Discovery & Development Labs provide a safe and secure discovery and development environment for data sets, samples, models and programs, spanning data transfer and data access.]
Hadoop 2.0 and YARN (Yet Another Resource Negotiator)

Key Innovation : separating how data is stored from how it is processed
Hadoop 2.0 - Enabling Multiple Query Engines
Hadoop started out synonymous with MapReduce, and with Java coding
But YARN broke this dependency
YARN now handles resource management for the cluster
Multiple different query engines can run against the data in-place (see the sketch below):
General-purpose (e.g. MapReduce)
Graph processing
Machine Learning
Real-Time Processing
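A hedged illustration of engines sharing data in place - it assumes PySpark with Hive support configured on the cluster, and reuses the hypothetical web_logs table from the earlier Hive example:

```python
from pyspark.sql import SparkSession

# With YARN owning resource management, a Spark application can run on
# the same cluster and query the same Hive-catalogued data in place -
# no export from one engine's silo into another's.
spark = (SparkSession.builder
         .appName("in-place-query")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page") \
     .show(10)
```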

Technologies Emerged to Bridge Old/New World

FAST FORWARD TO NOW

Elastically-Scalable Data Warehouse-as-a-Service
New generation of big data platform services from Google, Amazon, Oracle
Combines three key innovations from earlier technologies:
Organisation of data into tables and columns (from RDBMS DWs)
Massively-scalable and distributed storage and query (from Big Data)
Elastically-scalable Platform-as-a-Service (from Cloud)

Which Is What I'm Working On Right Now

Example Architecture : Google BigQuery

What Problem Did Analytics-as-a-Service Solve?
On-premise Hadoop, even with simple resilient clustering, will hit limits
Clusters can reach 5,000+ nodes, and need to scale up for demand peaks etc.
Those scale limits are encountered way beyond the limits for DWs
but the future is elastically-scaled query- and compute-as-a-service

Oracle Big Data Cloud Compute Edition


Free $300 developer credit at:
https://cloud.oracle.com/en_US/tryit

BigQuery : Big Data Meets Data Warehousing
And things come full circle - analytics typically requires tabular data
Google BigQuery is based on Dremel, Google's massively-parallel query engine
But it stores data in columnar format and provides a SQL interface
Solves the problem of providing DW-like functionality at scale, as-a-service
This is the future ;-)
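A minimal sketch using the google-cloud-bigquery Python client (it assumes a GCP project with credentials already configured; the Shakespeare sample is one of Google's public datasets):

```python
from google.cloud import bigquery

# No cluster to provision or size: BigQuery is query-as-a-service,
# running standard SQL over columnar storage and billing by bytes scanned.
client = bigquery.Client()

sql = """
    SELECT corpus, SUM(word_count) AS total_words
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY corpus
    ORDER BY total_words DESC
    LIMIT 5
"""
for row in client.query(sql).result():
    print(row.corpus, row.total_words)
```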

NEW WORLD HADOOP ARCHITECTURES (& WHAT
PROBLEMS THEY REALLY SOLVE) FOR DBAS
UKOUG DATABASE SIG MEETING
Mark Rittman, Oracle ACE Director
London, February 2017
