Oracle 10g DW Guide
December 2003
Oracle Database Data Warehousing Guide, 10g Release 1 (10.1)
Contributors: Patrick Amor, Hermann Baer, Mark Bauer, Subhransu Basu, Srikanth Bellamkonda,
Randy Bello, Tolga Bozkaya, Lucy Burgess, Rushan Chen, Benoit Dageville, John Haydu, Lilian Hobbs,
Hakan Jakobsson, George Lumpkin, Alex Melidis, Valarie Moore, Cetin Ozbutun, Ananth Raghavan,
Jack Raitto, Ray Roccaforte, Sankar Subramanian, Gregory Smith, Murali Thiyagarajan, Ashish Thusoo,
Thomas Tong, Jean-Francois Verrier, Gary Vincent, Andreas Walter, Andy Witkowski, Min Xiao,
Tsae-Feng Yu
The Programs (which include both the software and documentation) contain proprietary information of
Oracle Corporation; they are provided under a license agreement containing restrictions on use and
disclosure and are also protected by copyright, patent and other intellectual and industrial property
laws. Reverse engineering, disassembly or decompilation of the Programs, except to the extent required
to obtain interoperability with other independently created software or as specified by law, is prohibited.
The information contained in this document is subject to change without notice. If you find any problems
in the documentation, please report them to us in writing. Oracle Corporation does not warrant that this
document is error-free. Except as may be expressly permitted in your license agreement for these
Programs, no part of these Programs may be reproduced or transmitted in any form or by any means,
electronic or mechanical, for any purpose, without the express written permission of Oracle Corporation.
If the Programs are delivered to the U.S. Government or anyone licensing or using the programs on
behalf of the U.S. Government, the following notice is applicable:
Restricted Rights Notice: Programs delivered subject to the DOD FAR Supplement are "commercial
computer software" and use, duplication, and disclosure of the Programs, including documentation,
shall be subject to the licensing restrictions set forth in the applicable Oracle license agreement.
Otherwise, Programs delivered subject to the Federal Acquisition Regulations are "restricted computer
software" and use, duplication, and disclosure of the Programs shall be subject to the restrictions in FAR
52.227-19, Commercial Computer Software - Restricted Rights (June 1987). Oracle Corporation, 500
Oracle Parkway, Redwood City, CA 94065.
The Programs are not intended for use in any nuclear, aviation, mass transit, medical, or other inherently
dangerous applications. It shall be the licensee's responsibility to take all appropriate fail-safe, backup,
redundancy, and other measures to ensure the safe use of such applications if the Programs are used for
such purposes, and Oracle Corporation disclaims liability for any damages caused by such use of the
Programs.
Oracle is a registered trademark, and Express, Oracle8i, Oracle9i, Oracle Store, PL/SQL, Pro*C, and
SQL*Plus are trademarks or registered trademarks of Oracle Corporation. Other names may be
trademarks of their respective owners.
Contents
Preface........................................................................................................................................................ xxxi
Audience ............................................................................................................................................ xxxii
Organization...................................................................................................................................... xxxii
Related Documentation .................................................................................................................. xxxiv
Conventions....................................................................................................................................... xxxv
Documentation Accessibility ....................................................................................................... xxxviii
Volume 1
Part I Concepts
Data Warehouse Architecture (Basic) ........................................................................................ 1-5
Data Warehouse Architecture (with a Staging Area) .............................................................. 1-6
Data Warehouse Architecture (with a Staging Area and Data Marts) ................................. 1-6
Materialized Views....................................................................................................................... 3-6
Dimensions .................................................................................................................................... 3-6
Partition Bounds for Range Partitioning ................................................................................. 5-31
Comparing Partitioning Keys with Partition Bounds.................................................... 5-31
MAXVALUE......................................................................................................................... 5-31
Nulls ...................................................................................................................................... 5-32
DATE Datatypes ................................................................................................................. 5-32
Multicolumn Partitioning Keys ......................................................................................... 5-33
Implicit Constraints Imposed by Partition Bounds ........................................................ 5-33
Index Partitioning ....................................................................................................................... 5-33
Local Partitioned Indexes ................................................................................................... 5-34
Global Partitioned Indexes................................................................................................. 5-37
Summary of Partitioned Index Types............................................................................... 5-39
The Importance of Nonprefixed Indexes ......................................................................... 5-40
Performance Implications of Prefixed and Nonprefixed Indexes ................................ 5-40
Guidelines for Partitioning Indexes.................................................................................. 5-41
Physical Attributes of Index Partitions............................................................................. 5-42
6 Indexes
Using Bitmap Indexes in Data Warehouses................................................................................... 6-2
Benefits for Data Warehousing Applications .......................................................................... 6-2
Cardinality ..................................................................................................................................... 6-3
Bitmap Indexes and Nulls ........................................................................................................... 6-5
Bitmap Indexes on Partitioned Tables ....................................................................................... 6-6
Using Bitmap Join Indexes in Data Warehouses...................................................................... 6-6
Four Join Models for Bitmap Join Indexes......................................................................... 6-6
Bitmap Join Index Restrictions and Requirements ........................................................... 6-9
Using B-Tree Indexes in Data Warehouses .................................................................................. 6-10
Using Index Compression ............................................................................................................... 6-10
Choosing Between Local Indexes and Global Indexes ............................................................. 6-11
7 Integrity Constraints
Why Integrity Constraints are Useful in a Data Warehouse ...................................................... 7-2
Overview of Constraint States.......................................................................................................... 7-3
Typical Data Warehouse Integrity Constraints ............................................................................. 7-3
UNIQUE Constraints in a Data Warehouse ............................................................................. 7-4
FOREIGN KEY Constraints in a Data Warehouse................................................................... 7-5
RELY Constraints ......................................................................................................................... 7-6
Integrity Constraints and Parallelism........................................................................................ 7-6
Integrity Constraints and Partitioning ...................................................................................... 7-7
View Constraints .......................................................................................................................... 7-7
Materialized View Restrictions.......................................................................................... 8-24
General Query Rewrite Restrictions ................................................................................. 8-25
Refresh Options........................................................................................................................... 8-25
General Restrictions on Fast Refresh ................................................................................ 8-27
Restrictions on Fast Refresh on Materialized Views with Joins Only ......................... 8-27
Restrictions on Fast Refresh on Materialized Views with Aggregates........................ 8-27
Restrictions on Fast Refresh on Materialized Views with UNION ALL..................... 8-29
Achieving Refresh Goals .................................................................................................... 8-30
Refreshing Nested Materialized Views ............................................................................ 8-30
ORDER BY Clause ...................................................................................................................... 8-31
Materialized View Logs ............................................................................................................. 8-31
Using the FORCE Option with Materialized View Logs............................................... 8-33
Using Oracle Enterprise Manager ............................................................................................ 8-33
Using Materialized Views with NLS Parameters .................................................................. 8-33
Adding Comments to Materialized Views ............................................................................. 8-33
Registering Existing Materialized Views..................................................................................... 8-34
Choosing Indexes for Materialized Views................................................................................... 8-36
Dropping Materialized Views........................................................................................................ 8-37
Analyzing Materialized View Capabilities ................................................................................. 8-37
Using the DBMS_MVIEW.EXPLAIN_MVIEW Procedure ................................................... 8-37
DBMS_MVIEW.EXPLAIN_MVIEW Declarations.......................................................... 8-38
Using MV_CAPABILITIES_TABLE.................................................................................. 8-38
MV_CAPABILITIES_TABLE.CAPABILITY_NAME Details ............................................... 8-40
MV_CAPABILITIES_TABLE Column Details........................................................................ 8-42
Rolling Materialized Views......................................................................................................... 9-9
Materialized Views in OLAP Environments................................................................................. 9-9
OLAP Cubes .................................................................................................................................. 9-9
Partitioning Materialized Views for OLAP ............................................................................ 9-10
Compressing Materialized Views for OLAP.......................................................................... 9-11
Materialized Views with Set Operators .................................................................................. 9-11
Examples of Materialized Views Using UNION ALL ................................................... 9-11
Materialized Views and Models .................................................................................................... 9-13
Invalidating Materialized Views ................................................................................................... 9-14
Security Issues with Materialized Views..................................................................................... 9-14
Querying Materialized Views with Virtual Private Database ............................................. 9-15
Using Query Rewrite with Virtual Private Database..................................................... 9-16
Restrictions with Materialized Views and Virtual Private Database .......................... 9-16
Altering Materialized Views .......................................................................................................... 9-17
10 Dimensions
What are Dimensions? ..................................................................................................................... 10-2
Creating Dimensions ....................................................................................................................... 10-4
Dropping and Creating Attributes with Columns ................................................................ 10-8
Multiple Hierarchies .................................................................................................................. 10-9
Using Normalized Dimension Tables ................................................................................... 10-10
Viewing Dimensions...................................................................................................................... 10-11
Using Oracle Enterprise Manager.......................................................................................... 10-11
Using the DESCRIBE_DIMENSION Procedure................................................................... 10-11
Using Dimensions with Constraints........................................................................................... 10-12
Validating Dimensions .................................................................................................................. 10-12
Altering Dimensions...................................................................................................................... 10-14
Deleting Dimensions ..................................................................................................................... 10-14
Daily Operations in Data Warehouses .................................................................................... 11-3
Evolution of the Data Warehouse ............................................................................................ 11-4
14 Loading and Transformation
Overview of Loading and Transformation in Data Warehouses ............................................. 14-2
Transformation Flow.................................................................................................................. 14-2
Multistage Data Transformation....................................................................................... 14-2
Pipelined Data Transformation ......................................................................................... 14-3
Loading Mechanisms ....................................................................................................................... 14-4
Loading a Data Warehouse with SQL*Loader ....................................................................... 14-5
Loading a Data Warehouse with External Tables.................................................................. 14-5
Loading a Data Warehouse with OCI and Direct-Path APIs............................................... 14-7
Loading a Data Warehouse with Export/Import .................................................................. 14-7
Transformation Mechanisms .......................................................................................................... 14-8
Transformation Using SQL ....................................................................................................... 14-8
CREATE TABLE ... AS SELECT And INSERT /*+APPEND*/ AS SELECT.............. 14-8
Transformation Using UPDATE ....................................................................................... 14-9
Transformation Using MERGE ......................................................................................... 14-9
Transformation Using Multitable INSERT .................................................................... 14-10
Transformation Using PL/SQL .............................................................................................. 14-12
Transformation Using Table Functions................................................................................. 14-13
What is a Table Function? ................................................................................................ 14-13
Loading and Transformation Scenarios...................................................................................... 14-21
Key Lookup Scenario ............................................................................................................... 14-21
Exception Handling Scenario ................................................................................................. 14-22
Pivoting Scenarios .................................................................................................................... 14-23
Complete Refresh...................................................................................................................... 15-15
Fast Refresh................................................................................................................................ 15-15
Partition Change Tracking (PCT) Refresh............................................................................. 15-15
ON COMMIT Refresh .............................................................................................................. 15-16
Manual Refresh Using the DBMS_MVIEW Package........................................................... 15-16
Refresh Specific Materialized Views with REFRESH.......................................................... 15-17
Refresh All Materialized Views with REFRESH_ALL_MVIEWS ..................................... 15-18
Refresh Dependent Materialized Views with REFRESH_DEPENDENT......................... 15-19
Using Job Queues for Refresh ................................................................................................. 15-20
When Fast Refresh is Possible................................................................................................. 15-21
Recommended Initialization Parameters for Parallelism ................................................... 15-21
Monitoring a Refresh................................................................................................................ 15-21
Checking the Status of a Materialized View......................................................................... 15-22
Scheduling Refresh ................................................................................................................... 15-22
Tips for Refreshing Materialized Views with Aggregates.................................................. 15-23
Tips for Refreshing Materialized Views Without Aggregates ........................................... 15-26
Tips for Refreshing Nested Materialized Views .................................................................. 15-27
Tips for Fast Refresh with UNION ALL ............................................................................... 15-28
Tips After Refreshing Materialized Views............................................................................ 15-28
Using Materialized Views with Partitioned Tables ................................................................. 15-29
Fast Refresh with Partition Change Tracking....................................................................... 15-29
PCT Fast Refresh Scenario 1............................................................................................. 15-29
PCT Fast Refresh Scenario 2............................................................................................. 15-31
PCT Fast Refresh Scenario 3............................................................................................. 15-32
Fast Refresh with CONSIDER FRESH................................................................................... 15-33
Asynchronous ........................................................................................................................... 16-11
HotLog ................................................................................................................................ 16-12
AutoLog .............................................................................................................................. 16-13
Change Sets...................................................................................................................................... 16-14
Valid Combinations of Change Sources and Change Sets ................................................. 16-15
Change Tables.................................................................................................................................. 16-16
Getting Information About the Change Data Capture Environment .................................. 16-16
Preparing to Publish Change Data.............................................................................................. 16-18
Creating a User to Serve As a Publisher ............................................................................... 16-18
Granting Privileges and Roles to the Publisher ............................................................ 16-19
Creating a Default Tablespace for the Publisher .......................................................... 16-19
Password Files and Setting the REMOTE_LOGIN_PASSWORDFILE Parameter . 16-19
Determining the Mode in Which to Capture Data .............................................................. 16-20
Setting Initialization Parameters for Change Data Capture Publishing........................... 16-21
Initialization Parameters for Synchronous Publishing ................................................ 16-21
Initialization Parameters for Asynchronous HotLog Publishing............................... 16-21
Initialization Parameters for Asynchronous AutoLog Publishing ............................ 16-22
Determining the Current Setting of an Initialization Parameter................................ 16-25
Retaining Initialization Parameter Values When a Database Is Restarted ............... 16-25
Adjusting Initialization Parameter Values When Oracle Streams Values Change . 16-25
Publishing Change Data ............................................................................................................... 16-27
Performing Synchronous Publishing..................................................................................... 16-27
Performing Asynchronous HotLog Publishing ................................................................... 16-30
Performing Asynchronous AutoLog Publishing ................................................................. 16-35
Subscribing to Change Data......................................................................................................... 16-42
Considerations for Asynchronous Change Data Capture ...................................................... 16-47
Asynchronous Change Data Capture and Redo Log Files................................................. 16-48
Asynchronous Change Data Capture and Supplemental Logging................................... 16-50
Datatypes and Table Structures Supported for Asynchronous Change Data Capture . 16-51
Managing Published Data ............................................................................................................ 16-52
Managing Asynchronous Change Sets.................................................................................. 16-52
Creating Asynchronous Change Sets with Starting and Ending Dates .................... 16-52
Enabling and Disabling Asynchronous Change Sets................................................... 16-53
Stopping Capture on DDL for Asynchronous Change Sets........................................ 16-54
Recovering from Errors Returned on Asynchronous Change Sets............................ 16-55
Managing Change Tables ........................................................................................................ 16-58
Creating Change Tables.................................................................................................... 16-59
Understanding Change Table Control Columns .......................................................... 16-60
Understanding TARGET_COLMAP$ and SOURCE_COLMAP$ Values ................ 16-62
Controlling Subscriber Access to Change Tables ......................................................... 16-64
Purging Change Tables of Unneeded Data ................................................................... 16-65
Dropping Change Tables.................................................................................................. 16-67
Considerations for Exporting and Importing Change Data Capture Objects ................. 16-67
Impact on Subscriptions When the Publisher Makes Changes ......................................... 16-70
Implementation and System Configuration .............................................................................. 16-71
Synchronous Change Data Capture Restriction on Direct-Path INSERT......................... 16-72
17 SQLAccess Advisor
Overview of the SQLAccess Advisor in the DBMS_ADVISOR Package ............................. 17-2
Overview of Using the SQLAccess Advisor ........................................................................... 17-4
SQLAccess Advisor Repository......................................................................................... 17-7
Using the SQLAccess Advisor........................................................................................................ 17-7
SQLAccess Advisor Flowchart ................................................................................................. 17-8
SQLAccess Advisor Privileges.................................................................................................. 17-9
Creating Tasks........................................................................................................................... 17-10
SQLAccess Advisor Templates............................................................................................... 17-10
Creating Templates................................................................................................................... 17-11
Workload Objects...................................................................................................................... 17-12
Managing Workloads............................................................................................................... 17-12
Linking a Task and a Workload.............................................................................................. 17-13
Defining the Contents of a Workload .................................................................................... 17-14
SQL Tuning Set .................................................................................................................. 17-14
Loading a User-Defined Workload................................................................................. 17-15
Loading a SQL Cache Workload ..................................................................................... 17-16
Using a Hypothetical Workload...................................................................................... 17-17
Using a Summary Advisor 9i Workload........................................................................ 17-18
SQLAccess Advisor Workload Parameters ................................................................... 17-19
SQL Workload Journal............................................................................................................. 17-20
Adding SQL Statements to a Workload ................................................................................ 17-20
Deleting SQL Statements from a Workload.......................................................................... 17-21
Changing SQL Statements in a Workload ............................................................................ 17-22
Maintaining Workloads........................................................................................................... 17-22
Setting Workload Attributes............................................................................................ 17-23
Resetting Workloads ......................................................................................................... 17-23
Removing a Link Between a Workload and a Task ..................................................... 17-23
Removing Workloads .............................................................................................................. 17-24
Recommendation Options....................................................................................................... 17-24
Generating Recommendations ............................................................................................... 17-25
EXECUTE_TASK Procedure............................................................................................ 17-26
Viewing the Recommendations.............................................................................................. 17-26
Access Advisor Journal............................................................................................................ 17-32
Stopping the Recommendation Process................................................................................ 17-32
Canceling Tasks ................................................................................................................. 17-32
Marking Recommendations.................................................................................................... 17-33
Modifying Recommendations ................................................................................................ 17-33
Generating SQL Scripts............................................................................................................ 17-34
When Recommendations are No Longer Required............................................................. 17-36
Performing a Quick Tune ........................................................................................................ 17-36
Managing Tasks ........................................................................................................................ 17-37
Updating Task Attributes................................................................................................. 17-37
Deleting Tasks.................................................................................................................... 17-38
Setting DAYS_TO_EXPIRE .............................................................................................. 17-38
Using SQLAccess Advisor Constants.................................................................................... 17-38
Examples of Using the SQLAccess Advisor ......................................................................... 17-39
Recommendations From a User-Defined Workload.................................................... 17-39
Generate Recommendations Using a Task Template .................................................. 17-42
Filter a Workload from the SQL Cache .......................................................................... 17-44
Evaluate Current Usage of Indexes and Materialized Views ..................................... 17-46
Tuning Materialized Views for Fast Refresh and Query Rewrite......................................... 17-47
DBMS_ADVISOR.TUNE_MVIEW Procedure ..................................................................... 17-48
TUNE_MVIEW Syntax and Operations......................................................................... 17-48
Accessing TUNE_MVIEW Output Results.................................................................... 17-50
USER_TUNE_MVIEW and DBA_TUNE_MVIEW Views........................................... 17-50
Script Generation DBMS_ADVISOR Function and Procedure .................................. 17-50
Fast Refreshable with Optimized Sub-Materialized View .......................................... 17-56
Volume 2
18 Query Rewrite
Overview of Query Rewrite............................................................................................................ 18-2
Cost-Based Rewrite..................................................................................................................... 18-3
When Does Oracle Rewrite a Query? ...................................................................................... 18-4
Enabling Query Rewrite.................................................................................................................. 18-5
Initialization Parameters for Query Rewrite .......................................................................... 18-5
Controlling Query Rewrite........................................................................................................ 18-6
Accuracy of Query Rewrite ....................................................................................................... 18-7
Query Rewrite Hints ........................................................................................................... 18-8
Privileges for Enabling Query Rewrite.................................................................................... 18-9
Sample Schema and Materialized Views ................................................................................ 18-9
How Oracle Rewrites Queries ...................................................................................................... 18-11
Text Match Rewrite Methods.................................................................................................. 18-11
Text Match Capabilities .................................................................................................... 18-13
General Query Rewrite Methods............................................................................................ 18-13
When are Constraints and Dimensions Needed? ......................................................... 18-13
Join Back.............................................................................................................................. 18-14
Rollup Using a Dimension ............................................................................................... 18-16
Compute Aggregates ........................................................................................................ 18-17
Filtering the Data ............................................................................................................... 18-18
Dropping Selections in the Rewritten Query ................................................................ 18-24
Handling of HAVING Clause in Query Rewrite.......................................................... 18-25
Handling Expressions in Query Rewrite ....................................................................... 18-25
Handling IN-Lists in Query Rewrite .............................................................................. 18-26
Checks Made by Query Rewrite............................................................................................. 18-28
Join Compatibility Check ................................................................................................. 18-28
Data Sufficiency Check ..................................................................................................... 18-33
Grouping Compatibility Check ....................................................................................... 18-34
Aggregate Computability Check..................................................................................... 18-34
Other Cases for Query Rewrite............................................................................................... 18-34
Query Rewrite Using Partially Stale Materialized Views ........................................... 18-35
Query Rewrite Using Nested Materialized Views ....................................................... 18-38
Query Rewrite When Using GROUP BY Extensions ................................................... 18-39
Hint for Queries with Extended GROUP BY ................................................................ 18-44
Query Rewrite with Inline Views ................................................................................... 18-44
Query Rewrite with Selfjoins........................................................................................... 18-45
Query Rewrite and View Constraints ............................................................................ 18-46
Query Rewrite and Expression Matching...................................................................... 18-49
Date Folding Rewrite ........................................................................................................ 18-49
Partition Change Tracking (PCT) Rewrite ............................................................................ 18-52
PCT Rewrite Based on LIST Partitioned Tables............................................................ 18-52
PCT and PMARKER ......................................................................................................... 18-55
PCT Rewrite with Materialized Views Based on Range-List Partitioned Tables .... 18-57
PCT Rewrite Using Rowid as Pmarker .......................................................................... 18-59
Query Rewrite and Bind Variables ........................................................................................ 18-61
Query Rewrite Using Set Operator Materialized Views..................................................... 18-62
UNION ALL Marker......................................................................................................... 18-64
Did Query Rewrite Occur?............................................................................................................ 18-65
Explain Plan............................................................................................................................... 18-65
DBMS_MVIEW.EXPLAIN_REWRITE Procedure ............................................................... 18-66
DBMS_MVIEW.EXPLAIN_REWRITE Syntax .............................................................. 18-66
Using REWRITE_TABLE ................................................................................................. 18-67
Using a Varray ................................................................................................................... 18-69
EXPLAIN_REWRITE Benefit Statistics .......................................................................... 18-71
Support for Query Text Larger than 32KB in EXPLAIN_REWRITE ......................... 18-71
Design Considerations for Improving Query Rewrite Capabilities .................................... 18-72
Query Rewrite Considerations: Constraints......................................................................... 18-72
Query Rewrite Considerations: Dimensions ........................................................................ 18-73
Query Rewrite Considerations: Outer Joins ......................................................................... 18-73
Query Rewrite Considerations: Text Match ......................................................................... 18-73
Query Rewrite Considerations: Aggregates ......................................................................... 18-73
Query Rewrite Considerations: Grouping Conditions ....................................................... 18-74
Query Rewrite Considerations: Expression Matching........................................................ 18-74
Query Rewrite Considerations: Date Folding ...................................................................... 18-74
Query Rewrite Considerations: Statistics.............................................................................. 18-74
Advanced Rewrite Using Equivalences ..................................................................................... 18-75
19 Schema Modeling Techniques
Schemas in Data Warehouses ......................................................................................................... 19-2
Third Normal Form .......................................................................................................................... 19-2
Optimizing Third Normal Form Queries................................................................................ 19-3
Star Schemas ...................................................................................................................................... 19-3
Snowflake Schemas .................................................................................................................... 19-5
Optimizing Star Queries ................................................................................................................. 19-5
Tuning Star Queries.................................................................................................................... 19-6
Using Star Transformation ........................................................................................................ 19-6
Star Transformation with a Bitmap Index ....................................................................... 19-6
Execution Plan for a Star Transformation with a Bitmap Index................................... 19-9
Star Transformation with a Bitmap Join Index ............................................................. 19-10
Execution Plan for a Star Transformation with a Bitmap Join Index......................... 19-10
How Oracle Chooses to Use Star Transformation ........................................................ 19-11
Star Transformation Restrictions..................................................................................... 19-11
GROUP_ID Function................................................................................................................ 20-16
GROUPING SETS Expression ..................................................................................................... 20-17
GROUPING SETS Syntax........................................................................................................ 20-19
Composite Columns....................................................................................................................... 20-20
Concatenated Groupings............................................................................................................... 20-22
Concatenated Groupings and Hierarchical Data Cubes..................................................... 20-24
Considerations when Using Aggregation.................................................................................. 20-26
Hierarchy Handling in ROLLUP and CUBE........................................................................ 20-26
Column Capacity in ROLLUP and CUBE............................................................................. 20-27
HAVING Clause Used with GROUP BY Extensions .......................................................... 20-27
ORDER BY Clause Used with GROUP BY Extensions ....................................................... 20-28
Using Other Aggregate Functions with ROLLUP and CUBE ........................................... 20-28
Computation Using the WITH Clause ....................................................................................... 20-28
Working with Hierarchical Cubes in SQL ................................................................................. 20-29
Specifying Hierarchical Cubes in SQL .................................................................................. 20-29
Querying Hierarchical Cubes in SQL .................................................................................... 20-30
SQL for Creating Materialized Views to Store Hierarchical Cubes ........................... 20-31
Examples of Hierarchical Cube Materialized Views.................................................... 20-32
Treatment of NULLs as Input to Window Functions.......................................................... 21-16
Windowing Functions with Logical Offset ........................................................................... 21-16
Centered Aggregate Function................................................................................................. 21-18
Windowing Aggregate Functions in the Presence of Duplicates ...................................... 21-19
Varying Window Size for Each Row ..................................................................................... 21-20
Windowing Aggregate Functions with Physical Offsets.................................................... 21-21
FIRST_VALUE and LAST_VALUE Functions ..................................................................... 21-21
Reporting Aggregate Functions ................................................................................................... 21-22
RATIO_TO_REPORT Function .............................................................................................. 21-24
LAG/LEAD Functions .................................................................................................................... 21-25
LAG/LEAD Syntax .................................................................................................................. 21-25
FIRST/LAST Functions.................................................................................................................. 21-26
FIRST/LAST Syntax ................................................................................................................. 21-26
FIRST/LAST As Regular Aggregates .................................................................................... 21-26
FIRST/LAST As Reporting Aggregates ................................................................................ 21-27
Inverse Percentile Functions......................................................................................................... 21-28
Normal Aggregate Syntax ....................................................................................................... 21-28
Inverse Percentile Example Basis .................................................................................... 21-28
As Reporting Aggregates ................................................................................................. 21-30
Inverse Percentile Restrictions ................................................................................................ 21-31
Hypothetical Rank and Distribution Functions ....................................................................... 21-32
Hypothetical Rank and Distribution Syntax......................................................................... 21-32
Linear Regression Functions......................................................................................................... 21-33
REGR_COUNT Function......................................................................................................... 21-34
REGR_AVGY and REGR_AVGX Functions ......................................................................... 21-34
REGR_SLOPE and REGR_INTERCEPT Functions ............................................................. 21-34
REGR_R2 Function ................................................................................................................... 21-35
REGR_SXX, REGR_SYY, and REGR_SXY Functions .......................................................... 21-35
Linear Regression Statistics Examples................................................................................... 21-35
Sample Linear Regression Calculation .................................................................................. 21-35
Frequent Itemsets............................................................................................................................ 21-36
Other Statistical Functions............................................................................................................ 21-37
Descriptive Statistics................................................................................................................. 21-37
Hypothesis Testing - Parametric Tests .................................................................................. 21-37
Crosstab Statistics ..................................................................................................................... 21-38
Hypothesis Testing - Non-Parametric Tests ......................................................................... 21-38
Non-Parametric Correlation ................................................................................................... 21-39
WIDTH_BUCKET Function ......................................................................................................... 21-39
WIDTH_BUCKET Syntax........................................................................................................ 21-39
User-Defined Aggregate Functions ............................................................................................. 21-42
CASE Expressions........................................................................................................................... 21-43
Creating Histograms With User-Defined Buckets............................................................... 21-44
Data Densification for Reporting ................................................................................................ 21-45
Partition Join Syntax................................................................................................................. 21-45
Sample of Sparse Data ............................................................................................................. 21-46
Filling Gaps in Data.................................................................................................................. 21-47
Filling Gaps in Two Dimensions ............................................................................................ 21-48
Filling Gaps in an Inventory Table ........................................................................................ 21-50
Computing Data Values to Fill Gaps..................................................................................... 21-52
Time Series Calculations on Densified Data............................................................................. 21-53
Period-to-Period Comparison for One Time Level: Example............................................ 21-55
Period-to-Period Comparison for Multiple Time Levels: Example .................................. 21-56
Creating a Custom Member in a Dimension: Example ...................................................... 21-62
Rules ........................................................................................................................................... 22-17
Single Cell References ....................................................................................................... 22-18
Multi-Cell References on the Right Side ........................................................................ 22-18
Multi-Cell References on the Left Side ........................................................................... 22-19
Use of the ANY Wildcard................................................................................................. 22-20
Nested Cell References ..................................................................................................... 22-20
Order of Evaluation of Rules .................................................................................................. 22-21
Differences Between Update and Upsert .............................................................................. 22-22
Treatment of NULLs and Missing Cells................................................................................ 22-23
Use Defaults for Missing Cells and NULLs................................................................... 22-25
Qualifying NULLs for a Dimension ............................................................................... 22-26
Reference Models...................................................................................................................... 22-26
Advanced Topics in SQL Modeling ............................................................................................ 22-30
FOR Loops ................................................................................................................................. 22-30
Iterative Models ........................................................................................................................ 22-34
Rule Dependency in AUTOMATIC ORDER Models.......................................................... 22-35
Ordered Rules ........................................................................................................................... 22-37
Unique Dimensions Versus Unique Single References....................................................... 22-38
Rules and Restrictions when Using SQL for Modeling ...................................................... 22-40
Performance Considerations with SQL Modeling ................................................................... 22-42
Parallel Execution ..................................................................................................................... 22-42
Aggregate Computation .......................................................................................................... 22-43
Using EXPLAIN PLAN to Understand Model Queries...................................................... 22-45
Using ORDERED FAST: Example................................................................................... 22-45
Using ORDERED: Example ............................................................................................. 22-45
Using ACYCLIC FAST: Example .................................................................................... 22-46
Using ACYCLIC: Example ............................................................................................... 22-46
Using CYCLIC: Example .................................................................................................. 22-47
Examples of SQL Modeling .......................................................................................................... 22-47
Manageability ...................................................................................................................... 23-3
Backup and Recovery ......................................................................................................... 23-3
Security ................................................................................................................................. 23-4
Oracle Data Mining Overview....................................................................................................... 23-4
Enabling Data Mining Applications ........................................................................................ 23-5
Data Mining in the Database .................................................................................................... 23-5
Data Preparation.................................................................................................................. 23-6
Model Building .................................................................................................................... 23-6
Model Evaluation ................................................................................................................ 23-7
Model Apply (Scoring) ....................................................................................................... 23-7
ODM Programmatic Interfaces ......................................................................................... 23-7
ODM Java API ..................................................................................................................... 23-7
ODM PL/SQL Packages..................................................................................................... 23-8
ODM Sequence Similarity Search (BLAST) ............................................................................ 23-8
Parallel Queries on Object Types .................................................................................... 24-15
Parallel DDL .............................................................................................................................. 24-15
DDL Statements That Can Be Parallelized..................................................................... 24-15
CREATE TABLE ... AS SELECT in Parallel ................................................................... 24-16
Recoverability and Parallel DDL..................................................................................... 24-17
Space Management for Parallel DDL ............................................................................. 24-17
Storage Space When Using Dictionary-Managed Tablespaces .................................. 24-18
Free Space and Parallel DDL ........................................................................................... 24-18
Parallel DML.............................................................................................................................. 24-19
Advantages of Parallel DML over Manual Parallelism ............................................... 24-20
When to Use Parallel DML............................................................................................... 24-21
Enabling Parallel DML...................................................................................................... 24-22
Transaction Restrictions for Parallel DML..................................................................... 24-23
Rollback Segments............................................................................................................. 24-24
Recovery for Parallel DML............................................................................................... 24-24
Space Considerations for Parallel DML ......................................................................... 24-24
Lock and Enqueue Resources for Parallel DML ........................................................... 24-25
Restrictions on Parallel DML ........................................................................................... 24-25
Data Integrity Restrictions................................................................................................ 24-26
Trigger Restrictions ........................................................................................................... 24-27
Distributed Transaction Restrictions .............................................................................. 24-27
Examples of Distributed Transaction Parallelization................................................... 24-27
Parallel Execution of Functions .............................................................................................. 24-28
Functions in Parallel Queries ........................................................................................... 24-29
Functions in Parallel DML and DDL Statements.......................................................... 24-29
Other Types of Parallelism ...................................................................................................... 24-29
Initializing and Tuning Parameters for Parallel Execution .................................................... 24-30
Using Default Parameter Settings .......................................................................................... 24-31
Setting the Degree of Parallelism for Parallel Execution .................................................... 24-32
How Oracle Determines the Degree of Parallelism for Operations .................................. 24-33
Hints and Degree of Parallelism...................................................................................... 24-33
Table and Index Definitions............................................................................................. 24-34
Default Degree of Parallelism .......................................................................................... 24-34
Adaptive Multiuser Algorithm ....................................................................................... 24-35
Minimum Number of Parallel Execution Servers......................................................... 24-35
Limiting the Number of Available Instances ................................................................ 24-35
Balancing the Workload .......................................................................................................... 24-36
Parallelization Rules for SQL Statements.............................................................................. 24-37
Rules for Parallelizing Queries........................................................................................ 24-37
Rules for UPDATE, MERGE, and DELETE................................................................... 24-38
Rules for INSERT ... SELECT........................................................................................... 24-40
Rules for DDL Statements ................................................................................................ 24-41
Rules for [CREATE | REBUILD] INDEX or [MOVE | SPLIT] PARTITION ........... 24-41
Rules for CREATE TABLE AS SELECT ......................................................................... 24-42
Summary of Parallelization Rules................................................................................... 24-43
Enabling Parallelism for Tables and Queries ....................................................................... 24-45
Degree of Parallelism and Adaptive Multiuser: How They Interact ................................ 24-45
How the Adaptive Multiuser Algorithm Works .......................................................... 24-46
Forcing Parallel Execution for a Session ............................................................................... 24-46
Controlling Performance with the Degree of Parallelism .................................................. 24-47
Tuning General Parameters for Parallel Execution .................................................................. 24-47
Parameters Establishing Resource Limits for Parallel Operations.................................... 24-47
PARALLEL_MAX_SERVERS .......................................................................................... 24-48
Increasing the Number of Concurrent Users ................................................................ 24-49
Limiting the Number of Resources for a User .............................................................. 24-49
PARALLEL_MIN_SERVERS ........................................................................................... 24-49
SHARED_POOL_SIZE ..................................................................................................... 24-50
Computing Additional Memory Requirements for Message Buffers ....................... 24-51
Adjusting Memory After Processing Begins ................................................................. 24-53
PARALLEL_MIN_PERCENT.......................................................................................... 24-55
Parameters Affecting Resource Consumption ..................................................................... 24-55
PGA_AGGREGATE_TARGET........................................................................................ 24-56
PARALLEL_EXECUTION_MESSAGE_SIZE ............................................................... 24-56
Parameters Affecting Resource Consumption for Parallel DML and Parallel DDL 24-56
Parameters Related to I/O ...................................................................................................... 24-59
DB_CACHE_SIZE ............................................................................................................. 24-60
DB_BLOCK_SIZE .............................................................................................................. 24-60
DB_FILE_MULTIBLOCK_READ_COUNT................................................................... 24-60
DISK_ASYNCH_IO and TAPE_ASYNCH_IO.............................................................. 24-60
Monitoring and Diagnosing Parallel Execution Performance............................................... 24-61
Is There Regression? ................................................................................................................. 24-62
Is There a Plan Change?........................................................................................................... 24-63
Is There a Parallel Plan? ........................................................................................................... 24-63
Is There a Serial Plan? .............................................................................................................. 24-63
Is There Parallel Execution? .................................................................................................... 24-64
Is the Workload Evenly Distributed?..................................................................................... 24-64
Monitoring Parallel Execution Performance with Dynamic Performance Views........... 24-65
V$PX_BUFFER_ADVICE ................................................................................................. 24-65
V$PX_SESSION.................................................................................................................. 24-65
V$PX_SESSTAT ................................................................................................................. 24-65
V$PX_PROCESS ................................................................................................................ 24-65
V$PX_PROCESS_SYSSTAT ............................................................................................. 24-66
V$PQ_SESSTAT................................................................................................................. 24-66
V$FILESTAT....................................................................................................................... 24-66
V$PARAMETER ................................................................................................................ 24-66
V$PQ_TQSTAT .................................................................................................................. 24-67
V$SESSTAT and V$SYSSTAT.......................................................................................... 24-68
Monitoring Session Statistics .................................................................................................. 24-68
Monitoring System Statistics................................................................................................... 24-70
Monitoring Operating System Statistics................................................................................ 24-71
Affinity and Parallel Operations.................................................................................................. 24-71
Affinity and Parallel Queries .................................................................................................. 24-72
Affinity and Parallel DML....................................................................................................... 24-72
Miscellaneous Parallel Execution Tuning Tips ......................................................................... 24-73
Setting Buffer Cache Size for Parallel Operations................................................................ 24-74
Overriding the Default Degree of Parallelism...................................................................... 24-74
Rewriting SQL Statements....................................................................................................... 24-74
Creating and Populating Tables in Parallel .......................................................................... 24-75
Creating Temporary Tablespaces for Parallel Sort and Hash Join .................................... 24-76
Size of Temporary Extents ............................................................................................... 24-76
Executing Parallel SQL Statements ........................................................................................ 24-77
Using EXPLAIN PLAN to Show Parallel Operations Plans............................................... 24-77
Additional Considerations for Parallel DML ....................................................................... 24-78
PDML and Direct-Path Restrictions................................................................................ 24-78
Limitation on the Degree of Parallelism ........................................................................ 24-79
Using Local and Global Striping ..................................................................................... 24-79
Increasing INITRANS....................................................................................................... 24-79
Limitation on Available Number of Transaction Free Lists for Segments ............... 24-79
Using Multiple Archivers................................................................................................. 24-80
Database Writer Process (DBWn) Workload................................................................. 24-80
[NO]LOGGING Clause .................................................................................................... 24-80
Creating Indexes in Parallel .................................................................................................... 24-81
Parallel DML Tips..................................................................................................................... 24-83
Parallel DML Tip 1: INSERT............................................................................................ 24-83
Parallel DML Tip 2: Direct-Path INSERT....................................................................... 24-83
Parallel DML Tip 3: Parallelizing INSERT, MERGE, UPDATE, and DELETE ........ 24-84
Incremental Data Loading in Parallel.................................................................................... 24-85
Updating the Table in Parallel......................................................................................... 24-86
Inserting the New Rows into the Table in Parallel....................................................... 24-87
Merging in Parallel............................................................................................................ 24-87
Using Hints with Query Optimization.................................................................................. 24-87
FIRST_ROWS(n) Hint .............................................................................................................. 24-88
Enabling Dynamic Sampling .................................................................................................. 24-88
Glossary
Index
Send Us Your Comments
Oracle Database Data Warehousing Guide, 10g Release 1 (10.1)
Part No. B10736-01
Oracle Corporation welcomes your comments and suggestions on the quality and usefulness of this
publication. Your input is an important part of the information used for revision.
■ Did you find any errors?
■ Is the information clearly presented?
■ Do you need more information? If so, where?
■ Are the examples correct? Do you need more examples?
■ What features did you like most about this manual?
If you find any errors or have any other suggestions for improvement, please indicate the title and
part number of the documentation and the chapter, section, and page number (if available). You can
send comments to us in the following ways:
■ Electronic mail: [email protected]
■ FAX: (650)506-7227. Attn: Server Technologies Documentation Manager
■ Postal service:
Oracle Corporation
Server Technologies Documentation
500 Oracle Parkway, Mailstop 4op11
Redwood Shores, CA 94065
USA
If you would like a reply, please give your name, address, telephone number, and electronic mail
address (optional).
If you have problems with the software, please contact your local Oracle Support Services.
Preface
Audience
This guide is intended for database administrators, system administrators, and
database application developers who design, maintain, and use data warehouses.
To use this document, you need to be familiar with relational database concepts,
basic Oracle server concepts, and the operating system environment under which
you are running Oracle.
Organization
This document contains:
Part 1: Concepts
Chapter 6, "Indexes"
This chapter describes how to use indexes in data warehouses.
Chapter 7, "Integrity Constraints"
This chapter describes how to use integrity constraints in data warehouses.
Part 5: Data Warehouse Performance
Glossary
The glossary defines important terms used in this guide.
Related Documentation
For more information, see these Oracle resources:
■ Oracle Database Performance Tuning Guide
Many of the examples in this book use the sample schemas of the seed database,
which is installed by default when you install Oracle. Refer to Oracle Database
Sample Schemas for information on how these schemas were created and how you
can use them yourself.
Printed documentation is available for sale in the Oracle Store at
http://oraclestore.oracle.com/
If you already have a username and password for OTN, then you can go directly to
the documentation section of the OTN Web site at
http://otn.oracle.com/documentation
Conventions
This section describes the conventions used in the text and code examples of this
documentation set. It describes:
■ Conventions in Text
■ Conventions in Code Examples
Conventions in Text
We use various conventions in text to help you more quickly identify special terms.
The following table describes those conventions and provides examples of their use.
Convention: UPPERCASE monospace (fixed-width) font
Meaning: Uppercase monospace typeface indicates elements supplied by the system. Such
elements include parameters, privileges, datatypes, RMAN keywords, SQL keywords,
SQL*Plus or utility commands, packages and methods, as well as system-supplied
column names, database objects and structures, usernames, and roles.
Examples: You can specify this clause only for a NUMBER column. You can back up the
database by using the BACKUP command. Query the TABLE_NAME column in the
USER_TABLES data dictionary view. Use the DBMS_STATS.GENERATE_STATS procedure.

Convention: lowercase monospace (fixed-width) font
Meaning: Lowercase monospace typeface indicates executables, filenames, directory
names, and sample user-supplied elements. Such elements include computer and
database names, net service names, and connect identifiers, as well as user-supplied
database objects and structures, column names, packages and classes, usernames and
roles, program units, and parameter values. Note: Some programmatic elements use a
mixture of UPPERCASE and lowercase. Enter these elements as shown.
Examples: Enter sqlplus to open SQL*Plus. The password is specified in the orapwd
file. Back up the datafiles and control files in the /disk1/oracle/dbs directory. The
department_id, department_name, and location_id columns are in the
hr.departments table. Set the QUERY_REWRITE_ENABLED initialization parameter to
TRUE. Connect as oe user. The JRepUtil class implements these methods.

Convention: lowercase italic monospace (fixed-width) font
Meaning: Lowercase italic monospace font represents placeholders or variables.
Examples: You can specify the parallel_clause. Run Uold_release.SQL where
old_release refers to the release you installed prior to upgrading.
The following table describes typographic conventions used in code examples and
provides examples of their use.
Convention: [ ]
Meaning: Brackets enclose one or more optional items. Do not enter the brackets.
Example: DECIMAL (digits [ , precision ])

Convention: { }
Meaning: Braces enclose two or more items, one of which is required. Do not enter
the braces.
Example: {ENABLE | DISABLE}

Convention: |
Meaning: A vertical bar represents a choice of two or more options within brackets or
braces. Enter one of the options. Do not enter the vertical bar.
Examples: {ENABLE | DISABLE} and [COMPRESS | NOCOMPRESS]

Convention: ... (horizontal ellipsis points)
Meaning: Horizontal ellipsis points indicate either that we have omitted parts of the
code that are not directly related to the example, or that you can repeat a portion of
the code.
Examples: CREATE TABLE ... AS subquery; and SELECT col1, col2, ... , coln FROM
employees;

Convention: . . . (vertical ellipsis points)
Meaning: Vertical ellipsis points indicate that we have omitted several lines of code
not directly related to the example.
Example:
SQL> SELECT NAME FROM V$DATAFILE;
NAME
------------------------------------
/fsl/dbs/tbs_01.dbf
/fs1/dbs/tbs_02.dbf
.
.
.
/fsl/dbs/tbs_09.dbf
9 rows selected.

Convention: Other notation
Meaning: You must enter symbols other than brackets, braces, vertical bars, and
ellipsis points as shown.
Examples: acctbal NUMBER(11,2); and acct CONSTANT NUMBER(4) := 3;

Convention: Italics
Meaning: Italicized text indicates placeholders or variables for which you must supply
particular values.
Examples: CONNECT SYSTEM/system_password and DB_NAME = database_name

Convention: UPPERCASE
Meaning: Uppercase typeface indicates elements supplied by the system. We show these
terms in uppercase in order to distinguish them from terms you define. Unless terms
appear in brackets, enter them in the order and with the spelling shown. However,
because these terms are not case sensitive, you can enter them in lowercase.
Examples: SELECT last_name, employee_id FROM employees; SELECT * FROM
USER_TABLES; DROP TABLE hr.employees;

Convention: lowercase
Meaning: Lowercase typeface indicates programmatic elements that you supply. For
example, lowercase indicates names of tables, columns, or files. Note: Some
programmatic elements use a mixture of UPPERCASE and lowercase. Enter these
elements as shown.
Examples: SELECT last_name, employee_id FROM employees; sqlplus hr/hr;
CREATE USER mjones IDENTIFIED BY ty3MU9;
Documentation Accessibility
Our goal is to make Oracle products, services, and supporting documentation
accessible, with good usability, to the disabled community. To that end, our
documentation includes features that make information available to users of
assistive technology. This documentation is available in HTML format, and contains
markup to facilitate access by the disabled community. Standards will continue to
evolve over time, and Oracle is actively engaged with other market-leading
technology vendors to address technical obstacles so that our documentation can be
accessible to all of our customers. For additional information, visit the Oracle
Accessibility Program Web site at
http://www.oracle.com/accessibility/
What's New in Oracle Database?
This section describes the new features of Oracle Database 10g Release 1 (10.1) and
provides pointers to additional information. New features information from
previous releases is also retained to help those users migrating to the current
release.
The following section describes new features in Oracle Database:
■ Oracle Database 10g Release 1 (10.1) New Features in Data Warehousing
Oracle Database 10g Release 1 (10.1) New Features in Data
Warehousing
■ SQL Model Calculations
The MODEL clause enables you to specify complex formulas while avoiding
multiple joins and UNION clauses. This clause supports OLAP queries such as
share of ancestor and prior period comparisons, as well as calculations typically
done in large spreadsheets. The MODEL clause provides building blocks for
budgeting, forecasting, and statistical applications.
■ SQLAccess Advisor
The SQLAccess Advisor tool and its related DBMS_ADVISOR package offer
improved capabilities for recommending indexing and materialized view
strategies.
■ Materialized Views
The TUNE_MVIEW procedure shows how to specify a materialized view so that
it is fast refreshable and can use advanced types of query rewrite.
■ Materialized View Refresh Enhancements
Materialized view refresh has new optimizations for data warehousing and
OLAP environments. The enhancements include more efficient calculation and
update techniques, support for nested refresh, along with improved cost
analysis.
■ Partitioning Enhancements
You can now use partitioning with index-organized tables. Also, materialized
views in OLAP are able to use partitioning. You can now use hash-partitioned
global indexes.
■ ETL Enhancements
Oracle's extraction, transformation, and loading capabilities have been
improved with several MERGE improvements and better external table
capabilities.
■ Storage Management
Oracle Managed Files has simplified the administration of a database by
providing functionality to automatically create and manage files, so the
database administrator no longer needs to manage each database file.
Automatic Storage Management provides additional functionality for
managing not only files, but also the disks. In addition, you can now use
ultralarge data files.
Part I
Concepts
Subject Oriented
Data warehouses are designed to help you analyze data. For example, to learn more
about your company's sales data, you can build a data warehouse that concentrates
on sales. Using this data warehouse, you can answer questions such as "Who was
our best customer for this item last year?" This ability to define a data warehouse by
subject matter, sales in this case, makes the data warehouse subject oriented.
Integrated
Integration is closely related to subject orientation. Data warehouses must put data
from disparate sources into a consistent format. They must resolve such problems
as naming conflicts and inconsistencies among units of measure. When they achieve
this, they are said to be integrated.
Nonvolatile
Nonvolatile means that, once entered into the data warehouse, data should not
change. This is logical because the purpose of a data warehouse is to enable you to
analyze what has occurred.
Time Variant
In order to discover trends in business, analysts need large amounts of data. This is
very much in contrast to online transaction processing (OLTP) systems, where
performance requirements demand that historical data be moved to an archive. A
data warehouse's focus on change over time is what is meant by the term time
variant.
(Figure: OLTP systems use complex data structures (3NF databases), whereas data
warehouses use multidimensional data structures.)
One major difference between the types of system is that data warehouses are not
usually in third normal form (3NF), a type of data normalization common in OLTP
environments.
Data warehouses and OLTP systems have very different requirements. Here are
some examples of differences between typical data warehouses and OLTP systems:
■ Workload
Data warehouses are designed to accommodate ad hoc queries. You might not
know the workload of your data warehouse in advance, so a data warehouse
should be optimized to perform well for a wide variety of possible query
operations.
OLTP systems support only predefined operations. Your applications might be
specifically tuned or designed to support only these operations.
■ Data modifications
A data warehouse is updated on a regular basis by the ETL process (run nightly
or weekly) using bulk data modification techniques. The end users of a data
warehouse do not directly update the data warehouse.
In OLTP systems, end users routinely issue individual data modification
statements to the database. The OLTP database is always up to date, and reflects
the current state of each business transaction.
■ Schema design
Data warehouses often use denormalized or partially denormalized schemas
(such as a star schema) to optimize query performance.
OLTP systems often use fully normalized schemas to optimize
update/insert/delete performance, and to guarantee data consistency.
■ Typical operations
A typical data warehouse query scans thousands or millions of rows. For
example, "Find the total sales for all customers last month."
A typical OLTP operation accesses only a handful of records. For example,
"Retrieve the current order for this customer."
■ Historical data
Data warehouses usually store many months or years of data. This is to support
historical analysis.
OLTP systems usually store data from only a few weeks or months. The OLTP
system stores only historical data as needed to successfully meet the
requirements of the current transaction.
In Figure 1–2, the metadata and raw data of a traditional OLTP system are present, as
is an additional type of data, summary data. Summaries are very valuable in data
warehouses because they pre-compute long operations in advance. For example, a
typical data warehouse query is to retrieve something such as August sales. A
summary in an Oracle database is called a materialized view.
Figure 1–4 Architecture of a Data Warehouse with a Staging Area and Data Marts
This section deals with the issues in logical design in a data warehouse.
It contains the following chapter:
■ Chapter 2, "Logical Design in Data Warehouses"
2
Logical Design in Data Warehouses
This chapter explains how to create a logical design for a data warehousing
environment and includes the following topics:
■ Logical Versus Physical Design in Data Warehouses
■ Creating a Logical Design
■ Data Warehousing Schemas
■ Data Warehousing Objects
Your logical design should result in (1) a set of entities and attributes corresponding
to fact tables and dimension tables and (2) a model of how operational data from
your source systems is transformed into subject-oriented information in your target
data warehouse schema.
You can create the logical design using a pen and paper, or you can use a design
tool such as Oracle Warehouse Builder (specifically designed to support modeling
the ETL process) or Oracle Designer (a general purpose modeling tool).
Star Schemas
The star schema is the simplest data warehouse schema. It is called a star schema
because the diagram resembles a star, with points radiating from a center. The
center of the star consists of one or more fact tables and the points of the star are the
dimension tables, as shown in Figure 2–1.
(Figure 2–1 shows the sales fact table, containing the measures amount_sold and
quantity_sold, at the center of the star, with the products, times, customers, and
channels dimension tables as its points.)
The most natural way to model a data warehouse is as a star schema, where only
one join establishes the relationship between the fact table and any one of the
dimension tables.
A star schema optimizes performance by keeping queries simple and providing fast
response time. All the information about each level is stored in one row.
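To illustrate, the following query is typical of a star schema: it joins the fact table to
two dimension tables with one join per dimension and aggregates the additive
measure. The table and column names are patterned on the sample sh schema and
may differ in your environment.
SELECT p.prod_category, t.calendar_year, SUM(s.amount_sold) AS total_sold
FROM   sales s, products p, times t
WHERE  s.prod_id = p.prod_id
AND    s.time_id = t.time_id
GROUP BY p.prod_category, t.calendar_year;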
Other Schemas
Some schemas in data warehousing environments use third normal form rather
than star schemas. Another schema that is sometimes useful is the snowflake
schema, which is a star schema with normalized dimensions in a tree structure.
Fact Tables
A fact table typically has two types of columns: those that contain numeric facts
(often called measurements), and those that are foreign keys to dimension tables. A
fact table contains either detail-level facts or facts that have been aggregated. Fact
tables that contain aggregated facts are often called summary tables. A fact table
usually contains facts with the same level of aggregation. Though most facts are
additive, they can also be semi-additive or non-additive. Additive facts can be
aggregated by simple arithmetical addition. A common example of this is sales.
Non-additive facts cannot be added at all. An example of this is averages.
Semi-additive facts can be aggregated along some of the dimensions and not along
others. An example of this is inventory levels, where you cannot tell what a level
means simply by looking at it.
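As a sketch of the difference, assume a sales fact table with the additive measure
amount_sold and a hypothetical inventory table with the semi-additive measure
quantity_on_hand:
-- Additive: amount_sold can be summed across any combination of dimensions.
SELECT prod_id, SUM(amount_sold) FROM sales GROUP BY prod_id;

-- Semi-additive (hypothetical inventory table): a level can be summed across
-- warehouses for one day, but across time an average or period-end value is
-- typically used instead of a sum.
SELECT warehouse_id, AVG(quantity_on_hand)
FROM   inventory
GROUP BY warehouse_id;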
Dimension Tables
A dimension is a structure, often composed of one or more hierarchies, that
categorizes data. Dimensional attributes help to describe the dimensional value.
They are normally descriptive, textual values. Several distinct dimensions,
combined with facts, enable you to answer business questions. Commonly used
dimensions are customers, products, and time.
Dimension data is typically collected at the lowest level of detail and then
aggregated into higher level totals that are more useful for analysis. These natural
rollups or aggregations within a dimension table are called hierarchies.
Hierarchies
Hierarchies are logical structures that use ordered levels as a means of organizing
data. A hierarchy can be used to define data aggregation. For example, in a time
dimension, a hierarchy might aggregate data from the month level to the quarter
level to the year level. A hierarchy can also be used to define a navigational drill
path and to establish a family structure.
Within a hierarchy, each level is logically connected to the levels above and below it.
Data values at lower levels aggregate into the data values at higher levels. A
dimension can be composed of more than one hierarchy. For example, in the
product dimension, there might be two hierarchies—one for product categories and
one for product suppliers.
Dimension hierarchies also group levels from general to granular. Query tools use
hierarchies to enable you to drill down into your data to view different levels of
granularity. This is one of the key benefits of a data warehouse.
When designing hierarchies, you must consider the relationships in business
structures. For example, a divisional, multilevel sales organization is one such structure.
Hierarchies impose a family structure on dimension values. For a particular level
value, a value at the next higher level is its parent, and values at the next lower level
are its children. These familial relationships enable analysts to access data quickly.
Hierarchies are also essential components in enabling more complex rewrites. For
example, the database can aggregate an existing sales revenue on a quarterly base to
a yearly aggregation when the dimensional dependencies between quarter and year
are known.
(Figure: a typical dimension hierarchy with the levels region, subregion,
country_name, and customer.)
Unique Identifiers
Unique identifiers are specified for one distinct record in a dimension table.
Artificial unique identifiers are often used to avoid the potential problem of unique
identifiers changing. Unique identifiers are represented with the # character. For
example, #customer_id.
Relationships
Relationships guarantee business integrity. An example is that if a business sells
something, there is obviously a customer and a product. Designing a relationship
between the sales information in the fact table and the dimension tables products
and customers enforces the business rules in databases.
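For example, such relationships are typically implemented as foreign key constraints
from the fact table to the dimension tables, as in the following sketch (the constraint
and column names are illustrative):
ALTER TABLE sales
  ADD CONSTRAINT sales_cust_fk FOREIGN KEY (cust_id)
  REFERENCES customers (cust_id);

ALTER TABLE sales
  ADD CONSTRAINT sales_prod_fk FOREIGN KEY (prod_id)
  REFERENCES products (prod_id);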
(Figure: typical data warehousing objects. The sales fact table carries the foreign keys
cust_id and prod_id and joins to the customers dimension table (#cust_id,
cust_last_name, cust_city, cust_state_province), the products dimension table
(#prod_id), and the times, channels, and promotions dimension tables. A hierarchy is
defined on the customers columns.)
This chapter describes the physical design of a data warehousing environment, and
includes the following topics:
■ Moving from Logical to Physical Design
■ Physical Design
See Also:
■ Chapter 5, "Parallelism and Partitioning in Data Warehouses"
for further information regarding partitioning
■ Oracle Database Concepts for further conceptual material
regarding all design matters
Physical Design
During the logical design phase, you defined a model for your data warehouse
consisting of entities, attributes, and relationships. The entities are linked together
using relationships. Attributes are used to describe the entities. The unique
identifier (UID) distinguishes between one instance of an entity and another.
Figure 3–1 illustrates a graphical way of distinguishing between logical and
physical designs.
(Figure 3–1 contrasts logical design elements, such as entities, relationships,
attributes, and unique identifiers, with physical design elements, such as tables,
columns, integrity constraints (primary key, foreign key, not null), indexes,
materialized views, and dimensions.)
During the physical design process, you translate the expected schemas into actual
database structures. At this time, you have to map:
■ Entities to tables
■ Relationships to foreign key constraints
■ Attributes to columns
■ Primary unique identifiers to primary key constraints
■ Unique identifiers to unique key constraints
Some of these structures require disk space. Others exist only in the data dictionary.
Additionally, the following structures may be created for performance
improvement:
■ Indexes and Partitioned Indexes
■ Materialized Views
Tablespaces
A tablespace consists of one or more datafiles, which are physical structures within
the operating system you are using. A datafile is associated with only one
tablespace. From a design perspective, tablespaces are containers for physical
design structures.
Tablespaces should be separated according to differences in their contents and usage.
For example, tables should be separated from their indexes, and small tables should
be separated from large tables.
Tablespaces should also represent logical business units if possible. Because a
tablespace is the coarsest granularity for backup and recovery or the transportable
tablespaces mechanism, the logical business design affects availability and
maintenance operations.
You can now use ultralarge data files, a significant improvement in very large
databases.
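Ultralarge data files are provided through bigfile tablespaces, which consist of a
single, very large datafile. The following statement is a sketch; the tablespace name,
file path, and size are illustrative:
CREATE BIGFILE TABLESPACE dw_sales_ts
  DATAFILE '/u01/oradata/dw/dw_sales_01.dbf' SIZE 100G
  AUTOEXTEND ON;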
Table Compression
You can save disk space by compressing heap-organized tables. A typical type of
heap-organized table you should consider for table compression is partitioned
tables.
To reduce disk use and memory use (specifically, the buffer cache), you can store
tables and partitioned tables in a compressed format inside the database. This often
leads to a better scaleup for read-only operations. Table compression can also speed
up query execution. There is, however, a cost in CPU overhead.
Table compression should be used with highly redundant data, such as tables with
many foreign keys. You should avoid compressing tables with much update or
other DML activity. Although compressed tables or partitions are updatable, there
is some overhead in updating these tables, and high update activity may work
against compression by causing some space to be wasted.
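For example, you might compress a table of historical, rarely updated data when you
create it, or compress an existing table by moving it. The following statements are a
sketch; the table names and the date predicate are illustrative:
CREATE TABLE sales_history COMPRESS AS
  SELECT * FROM sales
  WHERE  time_id < TO_DATE('01-JAN-1999', 'DD-MON-YYYY');

ALTER TABLE sales_history MOVE COMPRESS;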
Views
A view is a tailored presentation of the data contained in one or more tables or
other views. A view takes the output of a query and treats it as a table. Views do not
require any space in the database.
Integrity Constraints
Integrity constraints are used to enforce business rules associated with your
database and to prevent having invalid information in the tables. Integrity
constraints in data warehousing differ from constraints in OLTP environments. In
OLTP environments, they primarily prevent the insertion of invalid data into a
record, which is not a big problem in data warehousing environments because
accuracy has already been guaranteed. In data warehousing environments,
constraints are only used for query rewrite. NOT NULL constraints are particularly
common in data warehouses. Under some specific circumstances, constraints need
space in the database. These constraints are in the form of the underlying unique
index.
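Because such constraints are usually not needed to validate already-cleansed data,
data warehouses often declare them in RELY DISABLE NOVALIDATE mode, which records
the relationship for query rewrite without the cost of enforcement or validation. A
sketch, using illustrative names:
ALTER TABLE sales
  ADD CONSTRAINT sales_time_fk FOREIGN KEY (time_id)
  REFERENCES times (time_id)
  RELY DISABLE NOVALIDATE;
With such a constraint, query rewrite can rely on the declared relationship (typically
when QUERY_REWRITE_INTEGRITY is set to TRUSTED) without Oracle enforcing it during
loads.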
Materialized Views
Materialized views are query results that have been stored in advance so
long-running calculations are not necessary when you actually execute your SQL
statements. From a physical design point of view, materialized views resemble
tables or partitioned tables and behave like indexes in that they are used
transparently and improve performance.
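For example, a materialized view might store precomputed monthly sales totals so
that queries at the month level can be rewritten against it. The following statement is
a sketch based on the sample sh schema; the refresh options and column names may
differ in your environment:
CREATE MATERIALIZED VIEW sales_by_month_mv
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
  ENABLE QUERY REWRITE AS
SELECT t.calendar_month_desc, SUM(s.amount_sold) AS amount_sold
FROM   sales s, times t
WHERE  s.time_id = t.time_id
GROUP BY t.calendar_month_desc;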
Dimensions
A dimension is a schema object that defines hierarchical relationships between
columns or column sets. A hierarchical relationship is a functional dependency
from one level of a hierarchy to the next one. A dimension is a container of logical
relationships and does not require any space in the database. A typical dimension is
city, state (or province), region, and country.
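The following CREATE DIMENSION statement is a sketch of such a geography dimension;
it assumes a denormalized customers table in which each city determines its state and
each state determines its country:
CREATE DIMENSION customers_dim
  LEVEL customer IS (customers.cust_id)
  LEVEL city     IS (customers.cust_city)
  LEVEL state    IS (customers.cust_state_province)
  LEVEL country  IS (customers.country_id)
  HIERARCHY geog_rollup (
    customer CHILD OF
    city     CHILD OF
    state    CHILD OF
    country);
Such a declaration records the functional dependencies between the levels for query
rewrite; it does not store any data.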
This chapter explains some of the hardware and I/O issues in a data warehousing
environment and includes the following topics:
■ Overview of Hardware and I/O Considerations in Data Warehouses
■ Storage Management
As an example, consider a 200GB data mart. Using 72GB drives, this data mart
could be built with as few as six drives in a fully-mirrored environment. However,
six drives might not provide enough I/O bandwidth to handle a medium number
of concurrent users on a 4-CPU server. Thus, even though six drives provide
sufficient storage, a larger number of drives may be required to provide acceptable
performance for this system.
While it may not be practical to estimate the I/O bandwidth that will be required by
a data warehouse before a system is built, it is generally practical with the guidance
of the hardware manufacturer to estimate how much I/O bandwidth a given server
can potentially utilize, and ensure that the selected I/O configuration will be able to
successfully feed the server. There are many variables in sizing the I/O systems, but
one basic rule of thumb is that your data warehouse system should have multiple
disks for each CPU (at least two disks for each CPU at a bare minimum) in order to
achieve optimal performance.
Use Redundancy
Because data warehouses are often the largest database systems in a company, they
have the most disks and thus are also the most susceptible to the failure of a single
disk. Therefore, disk redundancy is a requirement for data warehouses to protect
against a hardware failure. Like disk-striping, redundancy can be achieved in many
ways using software or hardware.
A key consideration is that occasionally a balance must be made between
redundancy and performance. For example, a storage system in a RAID-5
configuration may be less expensive than a RAID-0+1 configuration, but it may not
perform as well, either. Redundancy is necessary for any data warehouse, but the
approach to redundancy may vary depending upon the performance and cost
constraints of each data warehouse.
Storage Management
Two features to consider for managing disks are Oracle Managed Files and
Automatic Storage Management. Without these features, a database administrator
must manage the database files, which, in a data warehouse, can be hundreds or
even thousands of files. Oracle Managed Files simplifies the administration of a
database by providing functionality to automatically create and manage files, so the
database administrator no longer needs to manage each database file. Automatic
Storage Management provides additional functionality for managing not only files
but also the disks. With Automatic Storage Management, the database
administrator would administer a small number of disk groups. Automatic Storage
Management handles the tasks of striping and providing disk redundancy,
including rebalancing the database files when new disks are added to the system.
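As a sketch of the disk group concept, the following statement, issued in an Automatic
Storage Management instance (the disk group name and device paths are illustrative),
creates a mirrored disk group that the database can then use for its files:
CREATE DISKGROUP dw_group NORMAL REDUNDANCY
  FAILGROUP controller1 DISK '/devices/diska1', '/devices/diska2'
  FAILGROUP controller2 DISK '/devices/diskb1', '/devices/diskb2';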
Data warehouses often contain large tables and require techniques both for
managing these large tables and for providing good query performance across these
large tables. This chapter discusses two key methodologies for addressing these
needs: parallelism and partitioning.
These topics are discussed:
■ Overview of Parallel Execution
■ Granules of Parallelism
■ Partitioning Design Considerations
Granules of Parallelism
Different parallel operations use different types of parallelism. The optimal physical
database layout depends on the parallel operations that are most prevalent in your
application, or even on whether you need to use partitions at all.
The basic unit of work in parallelism is called a granule. Oracle Database divides
the operation being parallelized (for example, a table scan, table update, or index
creation) into granules. Parallel execution processes execute the operation one
granule at a time. The number of granules and their size correlates with the degree
of parallelism (DOP). It also affects how well the work is balanced across query
server processes. There is no way you can enforce a specific granule strategy as
Oracle Database makes this decision internally.
deleting portions of data) might influence partition layout more than performance
considerations.
Partition Granules
When partition granules are used, a query server process works on an entire
partition or subpartition of a table or index. Because partition granules are statically
determined by the structure of the table or index when a table or index is created,
partition granules do not give you the flexibility in parallelizing an operation that
block granules do. The maximum allowable DOP is the number of partitions. This
might limit the utilization of the system and the load balancing across parallel
execution servers.
When partition granules are used for parallel access to a table or index, you should
use a relatively large number of partitions (ideally, three times the DOP), so that
Oracle can effectively balance work across the query server processes.
Partition granules are the basic unit of parallel index range scans and of parallel
operations that modify multiple partitions of a partitioned table or index. These
operations include parallel creation of partitioned indexes, and parallel creation of
partitioned tables.
Types of Partitioning
This section describes the partitioning features that significantly enhance data
access and improve overall application performance. This is especially true for
applications that access tables and indexes with millions of rows and many
gigabytes of data.
Partitioning Methods
Oracle offers four partitioning methods:
■ Range Partitioning
■ Hash Partitioning
■ List Partitioning
■ Composite Partitioning
Each partitioning method has different advantages and design considerations.
Thus, each method is more appropriate for a particular situation.
Note: This table was created with the COMPRESS keyword, thus
all partitions inherit this attribute.
List Partitioning List partitioning enables you to explicitly control how rows map to
partitions. You do this by specifying a list of discrete values for the partitioning
column in the description for each partition. This is different from range
partitioning, where a range of values is associated with a partition, and from hash
partitioning, where you have no control over the row-to-partition mapping. The
advantage of list partitioning is that you can group and organize unordered and
unrelated sets of data in a natural way. The following example creates a list
partitioned table grouping states according to their sales regions:
CREATE TABLE sales_list
(salesman_id NUMBER(5),
salesman_name VARCHAR2(30),
sales_state VARCHAR2(20),
sales_amount NUMBER(10),
sales_date DATE)
PARTITION BY LIST(sales_state)
(PARTITION sales_west VALUES('California', 'Hawaii') COMPRESS,
PARTITION sales_east VALUES('New York', 'Virginia', 'Florida'),
PARTITION sales_central VALUES('Texas', 'Illinois'));
Index Partitioning
You can choose whether or not to inherit the partitioning strategy of the underlying
tables. You can create both local and global indexes on a table partitioned by range,
hash, or composite methods. Local indexes inherit the partitioning attributes of
their related tables. For example, if you create a local index on a composite table,
Oracle automatically partitions the local index using the composite method. See
Chapter 6, "Indexes" for more information.
When to Use Hash Partitioning The way Oracle Database distributes data in hash
partitions does not correspond to a business or a logical view of the data, as it does
in range partitioning. Consequently, hash partitioning is not an effective way to
manage historical data. However, hash partitions share some performance
characteristics with range partitions. For example, partition pruning is limited to
equality predicates. You can also use partition-wise joins, parallel index access, and
parallel DML. See "Partition-Wise Joins" on page 5-20 for more information.
As a general rule, use hash partitioning for these purposes:
■ To improve the availability and manageability of large tables or to enable
parallel DML in tables that do not store historical data.
■ To avoid data skew among partitions. Hash partitioning is an effective means of
distributing data because Oracle hashes the data into a number of partitions,
each of which can reside on a separate device. Thus, data is evenly spread over
a sufficient number of devices to maximize I/O throughput. Similarly, you can
use hash partitioning to distribute data evenly among the nodes of an MPP
platform that uses Oracle Real Application Clusters.
■ If it is important to use partition pruning and partition-wise joins according to a
partitioning key that is mostly constrained by a distinct value or value list.
If you add or merge a hashed partition, Oracle automatically rearranges the rows to
reflect the change in the number of partitions and subpartitions. The hash function
that Oracle uses is especially designed to limit the cost of this reorganization.
Instead of reshuffling all the rows in the table, Oracle uses an "add partition" logic
that splits one and only one of the existing hashed partitions. Conversely, Oracle
coalesces a partition by merging two existing hashed partitions.
Although the hash function's use of "add partition" logic dramatically improves the
manageability of hash partitioned tables, it means that the hash function can cause a
skew if the number of partitions of a hash partitioned table, or the number of
subpartitions in each partition of a composite table, is not a power of two. In the
worst case, the largest partition can be twice the size of the smallest. So for optimal
performance, create a number of partitions and subpartitions for each partition that
is a power of two. For example, 2, 4, 8, 16, 32, 64, 128, and so on.
The following example creates four hashed partitions for the table sales_hash
using the column s_productid as the partition key:
CREATE TABLE sales_hash
(s_productid NUMBER,
s_saledate DATE,
s_custid NUMBER,
s_totalprice NUMBER)
PARTITION BY HASH(s_productid)
PARTITIONS 4;
Specify partition names if you want to choose the names of the partitions.
Otherwise, Oracle automatically generates internal names for the partitions. Also,
you can use the STORE IN clause to assign hash partitions to tablespaces in a
round-robin manner.
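For example, the following variant of the earlier sales_hash statement is a sketch that
lets Oracle name the four hash partitions and uses STORE IN to assign them to four
tablespaces in a round-robin manner (the tablespace names are illustrative):
CREATE TABLE sales_hash
 (s_productid  NUMBER,
  s_saledate   DATE,
  s_custid     NUMBER,
  s_totalprice NUMBER)
PARTITION BY HASH(s_productid)
PARTITIONS 4
STORE IN (tbs1, tbs2, tbs3, tbs4);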
When to Use List Partitioning You should use list partitioning when you want to
specifically map rows to partitions based on discrete values.
Unlike range and hash partitioning, multi-column partition keys are not supported
for list partitioning. If a table is partitioned by list, the partitioning key can only
consist of a single column of the table.
Most large tables in a data warehouse should use range partitioning. Composite
partitioning should be used for very large tables or for data warehouses with a
well-defined need for these conditions. When using the composite method, Oracle
stores each subpartition on a different segment. Thus, the subpartitions may have
properties that differ from the properties of the table or from the partition to which
the subpartitions belong.
The following example partitions the table sales_range_hash by range on the
column s_saledate to create four partitions that order data by time. Then, within
each range partition, the data is further subdivided into eight subpartitions by hash on
the column s_productid:
CREATE TABLE sales_range_hash(
s_productid NUMBER,
s_saledate DATE,
s_custid NUMBER,
s_totalprice NUMBER)
PARTITION BY RANGE (s_saledate)
SUBPARTITION BY HASH (s_productid) SUBPARTITIONS 8
(PARTITION sal99q1 VALUES LESS THAN (TO_DATE('01-APR-1999', 'DD-MON-YYYY')),
PARTITION sal99q2 VALUES LESS THAN (TO_DATE('01-JUL-1999', 'DD-MON-YYYY')),
PARTITION sal99q3 VALUES LESS THAN (TO_DATE('01-OCT-1999', 'DD-MON-YYYY')),
PARTITION sal99q4 VALUES LESS THAN (TO_DATE('01-JAN-2000', 'DD-MON-YYYY')));
Each hashed subpartition contains sales data for a single quarter, distributed by hash
on the product ID. The total number of subpartitions is 4 x 8, or 32.
In addition to this syntax, you can create subpartitions by using a subpartition
template. This makes it easier to name subpartitions and to control the tablespaces
in which they are placed. The following statement illustrates this:
CREATE TABLE sales_range_hash(
s_productid NUMBER,
s_saledate DATE,
s_custid NUMBER,
s_totalprice NUMBER)
PARTITION BY RANGE (s_saledate)
SUBPARTITION BY HASH (s_productid)
SUBPARTITION TEMPLATE(
SUBPARTITION sp1 TABLESPACE tbs1,
SUBPARTITION sp2 TABLESPACE tbs2,
SUBPARTITION sp3 TABLESPACE tbs3,
SUBPARTITION sp4 TABLESPACE tbs4,
SUBPARTITION sp5 TABLESPACE tbs5,
SUBPARTITION sp6 TABLESPACE tbs6,
SUBPARTITION sp7 TABLESPACE tbs7,
SUBPARTITION sp8 TABLESPACE tbs8)
(PARTITION sal99q1 VALUES LESS THAN (TO_DATE('01-APR-1999', 'DD-MON-YYYY')),
PARTITION sal99q2 VALUES LESS THAN (TO_DATE('01-JUL-1999', 'DD-MON-YYYY')),
PARTITION sal99q3 VALUES LESS THAN (TO_DATE('01-OCT-1999', 'DD-MON-YYYY')),
PARTITION sal99q4 VALUES LESS THAN (TO_DATE('01-JAN-2000', 'DD-MON-YYYY')));
In this example, every partition has the same number of subpartitions. A sample
mapping for sal99q1 is illustrated in Table 5–1. Similar mappings exist for
sal99q2 through sal99q4.
inherit the attribute from the table definition or, if nothing is specified on table level,
from the tablespace definition.
The decision whether a partition should be compressed or stay uncompressed
follows the same rules as for a nonpartitioned table. However, due to the capability
of range and composite partitioning to separate data logically into distinct
partitions, such a partitioned table is an ideal candidate for compressing parts of the
data (partitions) that are mainly read-only. It is, for example, beneficial in all rolling
window operations as a kind of intermediate stage before aging out old data. With
data segment compression, you can keep more old data online, minimizing the
burden of additional storage consumption.
You can also change any existing uncompressed table partition later on, add new
compressed and uncompressed partitions, or change the compression attribute as
part of any partition maintenance operation that requires data movement, such as
MERGE PARTITION, SPLIT PARTITION, or MOVE PARTITION. The partitions can
contain data or can be empty.
The access and maintenance of a partially or fully compressed partitioned table are
the same as for a fully uncompressed partitioned table. Everything that applies to
fully uncompressed partitioned tables is also valid for partially or fully compressed
partitioned tables.
operation that causes one or more compressed partitions to become part of the
table. This does not apply to a partitioned table having B-tree indexes only.
This rebuilding of the bitmap index structures is necessary to accommodate the
potentially higher number of rows stored for each data block with table
compression enabled and must be done only for the first time. All subsequent
operations, whether they affect compressed or uncompressed partitions, or change
the compression attribute, behave identically for uncompressed, partially
compressed, or fully compressed partitioned tables.
To avoid the recreation of any bitmap index structure, Oracle recommends creating
every partitioned table with at least one compressed partition whenever you plan to
partially or fully compress the partitioned table in the future. This compressed
partition can stay empty or can even be dropped after the partitioned table is created.
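The following sketch shows one way to follow this recommendation: the table is
created with an initial, empty partition that carries the COMPRESS attribute (all names
and partition bounds are illustrative):
CREATE TABLE sales_hist
 (s_productid  NUMBER,
  s_saledate   DATE,
  s_totalprice NUMBER)
PARTITION BY RANGE (s_saledate)
 (PARTITION p_seed VALUES LESS THAN (TO_DATE('01-JAN-1998', 'DD-MON-YYYY')) COMPRESS,
  PARTITION p_1998 VALUES LESS THAN (TO_DATE('01-JAN-1999', 'DD-MON-YYYY')),
  PARTITION p_1999 VALUES LESS THAN (TO_DATE('01-JAN-2000', 'DD-MON-YYYY')));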
Having a partitioned table with compressed partitions can lead to slightly larger
bitmap index structures for the uncompressed partitions. The bitmap index
structures for the compressed partitions, however, are in most cases smaller than
the corresponding bitmap index structures before table compression. How much
smaller depends heavily on the achieved compression ratio.
If you use the MOVE statement, the local indexes for partition sales_q1_1998
become unusable. You have to rebuild them afterward, as follows:
ALTER TABLE sales
MODIFY PARTITION sales_q1_1998 REBUILD UNUSABLE LOCAL INDEXES;
The following statements merge two existing partitions into a new, compressed
partition residing in a separate tablespace, and then rebuild the local bitmap indexes
(the new partition and tablespace names are illustrative):
ALTER TABLE sales MERGE PARTITIONS sales_q1_1998, sales_q2_1998
INTO PARTITION sales_1_1998 TABLESPACE ts_arch_1_1998 COMPRESS;
ALTER TABLE sales
MODIFY PARTITION sales_1_1998 REBUILD UNUSABLE LOCAL INDEXES;
Partition Pruning
Partition pruning is an essential performance feature for data warehouses. In
partition pruning, the optimizer analyzes FROM and WHERE clauses in SQL
statements to eliminate unneeded partitions when building the partition access list.
This enables Oracle Database to perform operations only on those partitions that are
relevant to the SQL statement. Oracle prunes partitions when you use range, LIKE,
equality, and IN-list predicates on the range or list partitioning columns, and when
you use equality and IN-list predicates on the hash partitioning columns.
Partition pruning dramatically reduces the amount of data retrieved from disk and
shortens processing time, thus improving query performance and resource
utilization. If you partition the index and table on different columns (with a global
partitioned index), partition pruning also eliminates index partitions even when the
partitions of the underlying table cannot be eliminated.
On composite partitioned objects, Oracle can prune at both the range partition level
and at the hash or list subpartition level using the relevant predicates. Refer to the
table sales_range_hash earlier, partitioned by range on the column s_saledate
and subpartitioned by hash on the column s_productid, and consider
the following example:
SELECT * FROM sales_range_hash
WHERE s_saledate BETWEEN (TO_DATE('01-JUL-1999', 'DD-MON-YYYY')) AND
(TO_DATE('01-OCT-1999', 'DD-MON-YYYY')) AND s_productid = 1200;
Oracle uses the predicate on the partitioning columns to perform partition pruning
as follows:
■ When using range partitioning, Oracle accesses only partitions sal99q2 and
sal99q3.
■ When using hash subpartitioning, Oracle accesses only the one subpartition in
each partition that stores the rows with s_productid=1200. The mapping
between the subpartition and the predicate is calculated based on Oracle's
internal hash distribution function.
If the query instead used the DD-MON-RR format, which is not the same as that of the base
partition bounds, the optimizer could still prune properly. However, if you execute an
EXPLAIN PLAN statement on such a query, the PARTITION_START and PARTITION_STOP columns
of the output table do not specify which partitions Oracle is accessing. Instead, you see
the keyword KEY for both columns, which means that partition pruning occurs at run time.
It can also affect the execution plan, because information about the pruned partitions is
not available at compile time, as it would be for the same statement using the same
TO_DATE format as the partition table definition.
Partition-Wise Joins
Partition-wise joins reduce query response time by minimizing the amount of data
exchanged among parallel execution servers when joins execute in parallel. This
significantly reduces response time and improves the use of both CPU and memory
resources. In Oracle Real Application Clusters environments, partition-wise joins
also avoid or at least limit the data traffic over the interconnect, which is the key to
achieving good scalability for massive join operations.
Partition-wise joins can be full or partial. Oracle decides which type of join to use.
Consider, for example, a large join between a sales table and a customer table on the
column customerid. The query "find the records of all customers who bought more than 100
articles in Quarter 3 of 1999" is a typical example of a SQL statement performing such a
join. The following is an example of this:
SELECT c.cust_last_name, COUNT(*)
FROM sales s, customers c
WHERE s.cust_id = c.cust_id AND
s.time_id BETWEEN TO_DATE('01-JUL-1999', 'DD-MON-YYYY') AND
(TO_DATE('01-OCT-1999', 'DD-MON-YYYY'))
GROUP BY c.cust_last_name HAVING COUNT(*) > 100;
This large join is typical in data warehousing environments. The entire customer
table is joined with one quarter of the sales data. In large data warehouse
applications, this might mean joining millions of rows. The join method to use in
that case is obviously a hash join. You can reduce the processing time for this hash
join even more if both tables are equipartitioned on the customerid column. This
enables a full partition-wise join.
When you execute a full partition-wise join in parallel, the granule of parallelism, as
described under "Granules of Parallelism" on page 5-3, is a partition. As a result, the
degree of parallelism is limited to the number of partitions. For example, you
require at least 16 partitions to set the degree of parallelism of the query to 16.
You can use various partitioning methods to equipartition both tables on the
column customerid with 16 partitions. These methods are described in these
subsections.
Hash-Hash This is the simplest method: the customers and sales tables are both
partitioned by hash into 16 partitions, on the s_customerid and c_customerid
columns. This partitioning method enables full partition-wise join when the tables
are joined on c_customerid and s_customerid, both representing the same
customer identification number. Because you are using the same hash function to
distribute the same information (customer ID) into the same number of hash
partitions, you can join the equivalent partitions. They are storing the same values.
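A minimal sketch of such an equipartitioning, using illustrative table definitions, is:

CREATE TABLE customers_hash (
  c_customerid NUMBER,
  c_name       VARCHAR2(50))
PARTITION BY HASH (c_customerid)
PARTITIONS 16;

CREATE TABLE sales_hash (
  s_saledate   DATE,
  s_customerid NUMBER,
  s_amount     NUMBER(10,2))
PARTITION BY HASH (s_customerid)
PARTITIONS 16;

A parallel join on s_customerid = c_customerid can then be executed as 16 independent
joins of matching partition pairs.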
In serial, this join is performed between pairs of matching hash partitions, one at a
time. When one partition pair has been joined, the join of another partition pair
begins. The join completes when the 16 partition pairs have been processed.
Figure 5–1 Full Partition-Wise Join: each parallel execution server joins one pair of
matching hash partitions (P1 through P16) of the sales and customers tables.
In Figure 5–1, assume that the degree of parallelism and the number of partitions
are the same, in other words, 16 for both. Defining more partitions than the degree
of parallelism may improve load balancing and limit possible skew in the execution.
If you have more partitions than query servers, when one query server completes
the join of one pair of partitions, it requests that the query coordinator give it
another pair to join. This process repeats until all pairs have been processed. This
method enables the load to be balanced dynamically when the number of partition
pairs is greater than the degree of parallelism, for example, 64 partitions with a
degree of parallelism of 16.
Figure 5–2 Composite Partitioning of sales: the table is range partitioned on salesdate
(1999 - Q1 through 2000 - Q4) and hash subpartitioned on customerid; the eight
subpartitions that share the same hash value, one from each range partition, form hash
partition #9.
■ The rules for data placement on MPP systems apply here. The only difference is
that a hash partition is now a collection of subpartitions. You must ensure that
all these subpartitions are placed on the same node as the matching hash
partition from the other table. For example, in Figure 5–2, store hash partition 9
of the sales table shown by the eight circled subpartitions, on the same node
as hash partition 9 of the customers table.
Range-Range and List-List You can also join range partitioned tables with range
partitioned tables and list partitioned tables with list partitioned tables in a
partition-wise manner, but this is relatively uncommon. This is more complex to
implement because you must know the distribution of the data before performing
the join. Furthermore, if you do not correctly identify the partition bounds so that
you have partitions of equal size, data skew during the execution may result.
The basic principle for using range-range and list-list is the same as for using
hash-hash: you must equipartition both tables. This means that the number of
partitions must be the same and the partition bounds must be identical. For
example, assume that you know in advance that you have 10 million customers,
and that the values for customerid vary from 1 to 10,000,000. In other words, you
have 10 million possible different values. To create 16 partitions, you can range
partition both tables, sales on s_customerid and customers on c_
customerid. You should define partition bounds for both tables in order to
generate partitions of the same size. In this example, partition bounds should be
defined as 625001, 1250001, 1875001, ... 10000001, so that each partition contains
625000 rows.
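A sketch of such an equipartitioning follows. For brevity it uses four partitions of
2,500,000 customer IDs each; the same pattern extends to 16 partitions with the bounds
625001, 1250001, ..., 10000001. Table and partition names are illustrative.

CREATE TABLE customers_range (
  c_customerid NUMBER,
  c_name       VARCHAR2(50))
PARTITION BY RANGE (c_customerid)
 (PARTITION c_p1 VALUES LESS THAN (2500001),
  PARTITION c_p2 VALUES LESS THAN (5000001),
  PARTITION c_p3 VALUES LESS THAN (7500001),
  PARTITION c_p4 VALUES LESS THAN (10000001));

CREATE TABLE sales_by_cust (
  s_customerid NUMBER,
  s_saledate   DATE,
  s_amount     NUMBER(10,2))
PARTITION BY RANGE (s_customerid)
 (PARTITION s_p1 VALUES LESS THAN (2500001),
  PARTITION s_p2 VALUES LESS THAN (5000001),
  PARTITION s_p3 VALUES LESS THAN (7500001),
  PARTITION s_p4 VALUES LESS THAN (10000001));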
For example, all rows in customers that could have matching rows in partition P1 of
sales are sent to query server 1 in the second set. Rows received by the second set
of query servers are joined with the rows from the corresponding partitions in
sales. Query server number 1 in the second set joins all customers rows that it
receives with partition P1 of sales.
Figure 5–3 Partial Partition-Wise Join: parallel execution server set 1 scans customers
(SELECT) and redistributes its rows by hash(c_customerid) to parallel execution server
set 2, which performs the JOIN against the corresponding sales partitions.
Considerations for full partition-wise joins also apply to partial partition-wise joins:
■ The degree of parallelism does not need to equal the number of partitions. In
Figure 5–3, the query executes with two sets of 16 query servers. In this case,
Oracle assigns 1 partition to each query server of the second set. Again, the
number of partitions should always be a multiple of the degree of parallelism.
Composite As with full partition-wise joins, the prime partitioning method for the
sales table is to use the range method on column s_salesdate. This is because
sales is a typical example of a table that stores historical data. To enable a partial
partition-wise join while preserving this range partitioning, subpartition sales by
hash on column s_customerid using 16 subpartitions for each partition. Pruning
and partial partition-wise joins can be used together if a query joins customers
and sales and if the query has a selection predicate on s_salesdate.
When sales is composite, the granule of parallelism for a partial partition-wise
join is a hash partition and not a subpartition. Refer to Figure 5–2 for an illustration
of a hash partition in a composite table. Again, the number of hash partitions
should be a multiple of the degree of parallelism. Also, on an MPP system, ensure
that each hash partition has affinity to a single node. In the previous example, the
eight subpartitions composing a hash partition should have affinity to the same
node.
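A sketch of such a composite definition, with illustrative table and partition names and
only the 1999 quarters shown, is:

CREATE TABLE sales_composite (
  s_productid  NUMBER,
  s_salesdate  DATE,
  s_customerid NUMBER,
  s_totalprice NUMBER(10,2))
PARTITION BY RANGE (s_salesdate)
SUBPARTITION BY HASH (s_customerid) SUBPARTITIONS 16
 (PARTITION sal99q1 VALUES LESS THAN (TO_DATE('01-APR-1999', 'DD-MON-YYYY')),
  PARTITION sal99q2 VALUES LESS THAN (TO_DATE('01-JUL-1999', 'DD-MON-YYYY')),
  PARTITION sal99q3 VALUES LESS THAN (TO_DATE('01-OCT-1999', 'DD-MON-YYYY')),
  PARTITION sal99q4 VALUES LESS THAN (TO_DATE('01-JAN-2000', 'DD-MON-YYYY')));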
Range Finally, you can use range partitioning on s_customerid to enable a partial
partition-wise join. This works similarly to the hash method, but a side effect of
range partitioning is that the resulting data distribution could be skewed if the size
of the partitions differs. Moreover, this method is more complex to implement
because it requires prior knowledge of the values of the partitioning column that is
also a join key.
Reduction of Memory Requirements Partition-wise joins require less memory than the
equivalent join operation of the complete data set of the tables being joined.
In the case of serial joins, the join is performed on one pair of matching partitions at
a time. If data is evenly distributed across partitions, with no skew, the memory
requirement is divided by the number of partitions.
In the parallel case, memory requirements depend on the number of partition pairs
that are joined in parallel. For example, if the degree of parallelism is 20 and the
number of partitions is 100, 5 times less memory is required because only 20 joins of
two partitions are performed at the same time. The fact that partition-wise joins
require less memory has a direct effect on performance. For example, the join
probably does not need to write blocks to disk during the build phase of a hash join.
MAXVALUE
You can specify the keyword MAXVALUE for any value in the partition bound value_
list. This keyword represents a virtual infinite value that sorts higher than any other
value for the data type, including the NULL value.
For example, you might partition the OFFICE table on STATE (a CHAR(10)
column) into three partitions with the following partition bounds:
■ VALUES LESS THAN ('I'): States whose names start with A through H
■ VALUES LESS THAN ('S'): States whose names start with I through R
■ VALUES LESS THAN (MAXVALUE): States whose names start with S
through Z, plus special codes for non-U.S. regions
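A sketch of this OFFICE table (the non-partitioning columns are illustrative) is:

CREATE TABLE office (
  office_id NUMBER,
  city      VARCHAR2(30),
  state     CHAR(10))
PARTITION BY RANGE (state)
 (PARTITION office_a_h VALUES LESS THAN ('I'),
  PARTITION office_i_r VALUES LESS THAN ('S'),
  PARTITION office_s_z VALUES LESS THAN (MAXVALUE));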
Nulls
NULL cannot be specified as a value in a partition bound value_list. An empty string
also cannot be specified as a value in a partition bound value_list, because it is
treated as NULL within the database server.
For the purpose of assigning rows to partitions, Oracle Database sorts nulls greater
than all other values except MAXVALUE. Nulls sort less than MAXVALUE.
This means that if a table is partitioned on a nullable column, and the column is to
contain nulls, then the highest partition should have a partition bound of
MAXVALUE for that column. Otherwise the rows that contain nulls will map above
the highest partition in the table and the insert will fail.
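Continuing the OFFICE sketch above, a row whose STATE value is NULL maps to the MAXVALUE
partition; if the highest partition were bounded by a literal value instead, the insert
would fail.

-- Maps to partition office_s_z, because NULL sorts higher than any
-- literal partition bound but lower than MAXVALUE.
INSERT INTO office (office_id, city, state) VALUES (1, 'Geneva', NULL);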
DATE Datatypes
If the partition key includes a column that has the DATE datatype and the NLS date
format does not specify the century with the year, you must specify partition
bounds using the TO_DATE function with a 4-character format mask for the year.
Otherwise, you will not be able to create the table or index. For example, with the
sales_range table using a DATE column:
CREATE TABLE sales_range
(salesman_id NUMBER(5),
salesman_name VARCHAR2(30),
sales_amount NUMBER(10),
sales_date DATE)
COMPRESS
PARTITION BY RANGE(sales_date)
(PARTITION sales_jan2000 VALUES LESS THAN(TO_DATE('02/01/2000','MM/DD/YYYY')),
PARTITION sales_feb2000 VALUES LESS THAN(TO_DATE('03/01/2000','MM/DD/YYYY')),
PARTITION sales_mar2000 VALUES LESS THAN(TO_DATE('04/01/2000','MM/DD/YYYY')),
PARTITION sales_apr2000 VALUES LESS THAN(TO_DATE('05/01/2000','MM/DD/YYYY')));
When you query or modify data, it is recommended that you use the TO_DATE
function in the WHERE clause so that the value of the date information can be
determined at compile time. However, the optimizer can prune partitions using a
selection criterion on partitioning columns of type DATE when you use another
format, as in the following examples:
SELECT * FROM sales_range
WHERE sales_date BETWEEN TO_DATE('01-JUL-00', 'DD-MON-YY')
AND TO_DATE('01-OCT-00', 'DD-MON-YY');
In this case, the date value will be complete only at runtime. Therefore you will not
be able to see which partitions Oracle is accessing as is usually shown on the
partition_start and partition_stop columns of the EXPLAIN PLAN
statement output on the SQL statement. Instead, you will see the keyword KEY for
both columns.
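You can observe this behavior with EXPLAIN PLAN; the following sketch assumes the standard
PLAN_TABLE is available:

EXPLAIN PLAN FOR
  SELECT * FROM sales_range
  WHERE sales_date BETWEEN TO_DATE('01-JUL-00', 'DD-MON-YY')
                       AND TO_DATE('01-OCT-00', 'DD-MON-YY');

SELECT operation, options, partition_start, partition_stop
FROM plan_table;

For the partitioned row sources, PARTITION_START and PARTITION_STOP show KEY rather than
explicit partition numbers.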
Index Partitioning
The rules for partitioning indexes are similar to those for tables:
■ An index can be partitioned unless:
– The index is a cluster index
Figure: Index IX1 on DEPTNO is range partitioned on DEPTNO (DEPTNO 0-9, 10-19, ..., 90-99);
table EMP is range partitioned on DEPTNO; table CHECKS is range partitioned on CHKDATE.
The highest partition of a global index must have a partition bound all of whose
values are MAXVALUE. This ensures that all rows in the underlying table can be
represented in the index.
Figure: Global index IX3 on EMPNO is range partitioned on EMPNO (with bounds such as
EMPNO 15, 31, 54, ..., 96); the underlying table EMP is range partitioned on DEPTNO
(0-9, 10-19, ..., 90-99).
Note 1: For a unique local nonprefixed index, the partitioning key must be a subset
of the index key.
Note 2: Although a global partitioned index may be equipartitioned with the
underlying table, Oracle does not take advantage of the partitioning or maintain
equipartitioning after partition maintenance operations such as DROP or SPLIT
PARTITION.
Table 5–3 Comparing Prefixed Local, Nonprefixed Local, and Global Indexes

Characteristic        Prefixed Local     Nonprefixed Local     Global
Unique possible?      Yes                Yes                   Yes. Must be global if using
                                                               indexes on columns other than
                                                               the partitioning columns
Manageability         Easy to manage     Easy to manage        Harder to manage
OLTP                  Good               Bad                   Good
Long Running (DSS)    Good               Good                  Not Good
This chapter describes how to use the following types of indexes in a data
warehousing environment:
■ Using Bitmap Indexes in Data Warehouses
■ Using B-Tree Indexes in Data Warehouses
■ Using Index Compression
■ Choosing Between Local Indexes and Global Indexes
Note: Bitmap indexes are available only if you have purchased the
Oracle Database Enterprise Edition.
Using Bitmap Indexes in Data Warehouses
Parallel query and parallel DML work with bitmap indexes. Bitmap indexing also
supports parallel create indexes and concatenated indexes.
Bitmap indexes are required to take advantage of Oracle's star transformation
capabilities.
Cardinality
The advantages of using bitmap indexes are greatest for columns in which the ratio
of the number of distinct values to the number of rows in the table is small. We refer
to this ratio as the degree of cardinality. A gender column, which has only two
distinct values (male and female), is optimal for a bitmap index. However, data
warehouse administrators also build bitmap indexes on columns with higher
cardinalities.
For example, on a table with one million rows, a column with 10,000 distinct values
is a candidate for a bitmap index. A bitmap index on this column can outperform a
B-tree index, particularly when this column is often queried in conjunction with
other indexed columns. In fact, in a typical data warehouse environment, a bitmap
index can be considered for any non-unique column.
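For example, a bitmap index on the cust_gender column of the customers table can be
created as follows (the index name is illustrative):

CREATE BITMAP INDEX customers_gender_bix
  ON customers (cust_gender);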
B-tree indexes are most effective for high-cardinality data: that is, for data with
many possible values, such as customer_name or phone_number. In a data
warehouse, B-tree indexes should be used only for unique columns or other
columns with very high cardinalities (that is, columns that are almost unique). The
majority of indexes in a data warehouse should be bitmap indexes.
In ad hoc queries and similar situations, bitmap indexes can dramatically improve
query performance. AND and OR conditions in the WHERE clause of a query can be
resolved quickly by performing the corresponding Boolean operations directly on
the bitmaps before converting the resulting bitmap to rowids. If the resulting
number of rows is small, the query can be answered quickly without resorting to a
full table scan.
Each entry (or bit) in the bitmap corresponds to a single row of the customers
table. The value of each bit depends upon the values of the corresponding row in
the table. For example, the bitmap cust_gender='F' contains a one as its first bit
because the gender is F in the first row of the customers table. The bitmap cust_
gender='F' has a zero for its third bit because the gender of the third row is not F.
Bitmap indexes can efficiently process this query by merely counting the number of
ones in the bitmap illustrated in Figure 6–1. The result set will be found by using
bitmap or merge operations without the necessity of a conversion to rowids. To
identify additional specific customer attributes that satisfy the criteria, use the
resulting bitmap to access the table after a bitmap to rowid conversion.
Figure: Bitmap Boolean operations. The bitmaps for the individual predicates are combined
with AND and OR operations to produce the resulting bitmap.
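Consider, for example, a query that counts the customers whose marital status is not
recorded (the exact predicate is an assumption based on the surrounding text):

SELECT COUNT(*) FROM customers WHERE cust_marital_status IS NULL;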
This query uses a bitmap index on cust_marital_status. Note that this query
would not be able to use a B-tree index, because B-tree indexes do not store the
NULL values.
SELECT COUNT(*) FROM customers;
Any bitmap index can be used for this query because all table rows are indexed,
including those that have NULL data. If nulls were not indexed, the optimizer would
be able to use indexes only on columns with NOT NULL constraints.
Example 6–3 Bitmap Join Index: One Dimension Table Columns Joins One Fact Table
Unlike the example in "Bitmap Index" on page 6-3, where a bitmap index on the
cust_gender column on the customers table was built, we now create a bitmap
join index on the fact table sales for the joined column customers(cust_
gender). Table sales stores cust_id values only:
SELECT time_id, cust_id, amount_sold FROM sales;
The following query illustrates the join result that is used to create the
bitmaps that are stored in the bitmap join index:
SELECT sales.time_id, customers.cust_gender, sales.amount_sold
FROM sales, customers
WHERE sales.cust_id = customers.cust_id;
TIME_ID C AMOUNT_SOLD
--------- - -----------
01-JAN-98 M 2291
01-JAN-98 F 114
01-JAN-98 M 553
01-JAN-98 M 0
01-JAN-98 M 195
01-JAN-98 M 280
01-JAN-98 M 32
...
Table 6–2 illustrates the bitmap representation for the bitmap join index in this
example.
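Following the pattern of Examples 6–4 and 6–5, the bitmap join index for this example
might be created as follows (the index name is illustrative):

CREATE BITMAP INDEX sales_cust_gender_bjix
  ON sales (customers.cust_gender)
  FROM sales, customers
  WHERE sales.cust_id = customers.cust_id
  LOCAL NOLOGGING;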
You can create other bitmap join indexes using more than one column or more than
one table, as shown in these examples.
Example 6–4 Bitmap Join Index: Multiple Dimension Columns Join One Fact Table
You can create a bitmap join index on more than one column from a single
dimension table, as in the following example, which uses customers(cust_
gender, cust_marital_status) from the sh schema:
CREATE BITMAP INDEX sales_cust_gender_ms_bjix
ON sales(customers.cust_gender, customers.cust_marital_status)
FROM sales, customers
WHERE sales.cust_id = customers.cust_id
LOCAL NOLOGGING;
Example 6–5 Bitmap Join Index: Multiple Dimension Tables Join One Fact Table
You can create a bitmap join index on multiple dimension tables, as in the
following, which uses customers(gender) and products(category):
CREATE BITMAP INDEX sales_c_gender_p_cat_bjix
ON sales(customers.cust_gender, products.prod_category)
FROM sales, customers, products
WHERE sales.cust_id = customers.cust_id
AND sales.prod_id = products.prod_id
LOCAL NOLOGGING;
If you choose to compress four columns, the repetitiveness will be almost gone, and the
compression ratio will be worse.
Although key compression reduces the storage requirements of an index, it can
increase the CPU time required to reconstruct the key column values during an
index scan. It also incurs some additional storage overhead, because every prefix
entry has an overhead of four bytes associated with it.
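For example, a concatenated B-tree index can be created with key compression on its two
leading columns as follows (the index name and column choice are illustrative):

CREATE INDEX sales_prod_cust_time_idx
  ON sales (prod_id, cust_id, time_id)
  COMPRESS 2;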
Many significant constraint features have been introduced for data warehousing.
Readers familiar with Oracle's constraint functionality in Oracle database version 7
and Oracle database version 8.x should take special note of the functionality
described in this chapter. In fact, many Oracle database version 7-based and
version 8-based data warehouses lacked constraints because of concerns
about constraint performance. Newer constraint functionality addresses these
concerns.
■ RELY Constraints
■ Integrity Constraints and Parallelism
■ Integrity Constraints and Partitioning
■ View Constraints
By default, this constraint is both enabled and validated. Oracle implicitly creates a
unique index on sales_id to support this constraint. However, this index can be
problematic in a data warehouse for three reasons:
■ The unique index can be very large, because the sales table can easily have
millions or even billions of rows.
■ The unique index is rarely used for query execution. Most data warehousing
queries do not have predicates on unique keys, so creating this index will
probably not improve performance.
■ If sales is partitioned along a column other than sales_id, the unique index
must be global. This can detrimentally affect all maintenance operations on the
sales table.
A unique index is required for unique constraints to ensure that each individual
row modified in the sales table satisfies the UNIQUE constraint.
For data warehousing tables, an alternative mechanism for unique constraints is
illustrated in the following statement:
ALTER TABLE sales ADD CONSTRAINT sales_uk
UNIQUE (prod_id, cust_id, promo_id, channel_id, time_id) DISABLE VALIDATE;
This statement creates a unique constraint, but, because the constraint is disabled, a
unique index is not required. This approach can be advantageous for many data
warehousing environments because the constraint now ensures uniqueness without
the cost of a unique index.
However, there are trade-offs for the data warehouse administrator to consider with
DISABLE VALIDATE constraints. Because this constraint is disabled, no DML
statements that modify the unique column are permitted against the sales table.
You can use one of two strategies for modifying this table in the presence of a
constraint:
■ Use DDL to add data to this table (such as exchanging partitions). See the
example in Chapter 15, "Maintaining the Data Warehouse".
■ Before modifying this table, drop the constraint. Then, make all necessary data
modifications. Finally, re-create the disabled constraint. Re-creating the
constraint is more efficient than re-creating an enabled constraint. However, this
approach does not guarantee that data added to the sales table while the
constraint has been dropped is unique.
However, in some situations, you may choose to use a different state for the
FOREIGN KEY constraints, in particular, the ENABLE NOVALIDATE state. A data
warehouse administrator might use an ENABLE NOVALIDATE constraint when
either:
■ The tables contain data that currently disobeys the constraint, but the data
warehouse administrator wishes to create a constraint for future enforcement.
■ An enforced constraint is required immediately.
Suppose that the data warehouse loaded new data into the fact tables every day, but
refreshed the dimension tables only on the weekend. During the week, the
dimension tables and fact tables may in fact disobey the FOREIGN KEY constraints.
Nevertheless, the data warehouse administrator might wish to maintain the
enforcement of this constraint to prevent any changes that might affect the
FOREIGN KEY constraint outside of the ETL process. Thus, you can create the
FOREIGN KEY constraints every night, after performing the ETL process, as shown
here:
ALTER TABLE sales ADD CONSTRAINT sales_time_fk
FOREIGN KEY (time_id) REFERENCES times (time_id)
ENABLE NOVALIDATE;
ENABLE NOVALIDATE can quickly create an enforced constraint, even when the
constraint is believed to be true. Suppose that the ETL process verifies that a
FOREIGN KEY constraint is true. Rather than have the database re-verify this
FOREIGN KEY constraint, which would require time and database resources, the
data warehouse administrator could instead create a FOREIGN KEY constraint using
ENABLE NOVALIDATE.
RELY Constraints
The ETL process commonly verifies that certain constraints are true. For example, it
can validate all of the foreign keys in the data coming into the fact table. This means
that you can trust it to provide clean data, instead of implementing constraints in
the data warehouse. You create a RELY constraint as follows:
ALTER TABLE sales ADD CONSTRAINT sales_time_fk
FOREIGN KEY (time_id) REFERENCES times (time_id)
RELY DISABLE NOVALIDATE;
This statement assumes that the primary key is in the RELY state. RELY constraints,
even though they are not used for data validation, can:
■ Enable more sophisticated query rewrites for materialized views. See
Chapter 18, "Query Rewrite" for further details.
■ Enable other data warehousing tools to retrieve information regarding
constraints directly from the Oracle data dictionary.
Creating a RELY constraint is inexpensive and does not impose any overhead
during DML or load. Because the constraint is not being validated, no data
processing is necessary to create it.
View Constraints
You can create constraints on views. The only type of constraint supported on a
view is a RELY constraint.
This type of constraint is useful when queries typically access views instead of base
tables, and the database administrator thus needs to define the data relationships
between views rather than tables. View constraints are particularly useful in OLAP
environments, where they may enable more sophisticated rewrites for materialized
views.
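A sketch of a view constraint follows; view constraints must be declared DISABLE
NOVALIDATE, and the view and constraint names here are illustrative:

CREATE VIEW sales_by_time_v AS
  SELECT time_id, cust_id, amount_sold FROM sales;

ALTER VIEW sales_by_time_v
  ADD CONSTRAINT sales_by_time_v_time_fk
  FOREIGN KEY (time_id) REFERENCES times (time_id)
  RELY DISABLE NOVALIDATE;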
This chapter introduces you to the use of materialized views, and discusses:
■ Overview of Data Warehousing with Materialized Views
■ Types of Materialized Views
■ Creating Materialized Views
■ Registering Existing Materialized Views
■ Choosing Indexes for Materialized Views
■ Dropping Materialized Views
■ Analyzing Materialized View Capabilities
Materialized views are often referred to as summaries, because they store summarized data.
They can also be used to precompute joins with or without aggregations. A materialized view
eliminates the overhead associated with expensive joins and aggregations for a
large or important class of queries.
Figure: Query rewrite strategy. Oracle generates a plan for each strategy and returns the
query results from the chosen plan.
When using query rewrite, create materialized views that satisfy the largest number
of queries. For example, if you identify 20 queries that are commonly applied to the
detail or fact tables, then you might be able to satisfy them with five or six
well-written materialized views. A materialized view definition can include any
number of aggregations (SUM, COUNT(x), COUNT(*), COUNT(DISTINCT x), AVG,
VARIANCE, STDDEV, MIN, and MAX). It can also include any number of joins. If you
are unsure of which materialized views to create, Oracle provides the SQLAccess
Advisor, which is a set of advisory procedures in the DBMS_ADVISOR package to
help in designing and evaluating materialized views for query rewrite. See
Chapter 17, "SQLAccess Advisor" for further details.
If a materialized view is to be used by query rewrite, it must be stored in the same
database as the detail tables on which it relies. A materialized view can be
partitioned, and you can define a materialized view on a partitioned table. You can
also define one or more indexes on the materialized view.
Unlike indexes, materialized views can be accessed directly using a SELECT
statement. However, it is recommended that you try to avoid writing SQL
statements that directly reference the materialized view, because then it is difficult
to change them without affecting the application. Instead, let query rewrite
transparently rewrite your query to use the materialized view.
Note that the techniques shown in this chapter illustrate how to use materialized
views in data warehouses. Materialized views can also be used by Oracle
Replication. See Oracle Database Advanced Replication for further information.
Figure: Overview of summary management. Data is extracted from operational databases and
staging files, transformed, and incrementally loaded as detail data into the data
warehouse; summary management (incremental load and refresh, query rewrite, workload
statistics) maintains summaries and supplies detail and summary data to MDDB data marts.
Understanding the summary management process during the earliest stages of data
warehouse design can yield large dividends later in the form of higher
performance, lower summary administration costs, and reduced storage
requirements.
If you are concerned with the time required to enable constraints and whether any
constraints might be violated, then use the ENABLE NOVALIDATE clause with the RELY option.
The length of the update window depends on the update frequency (such as daily or weekly)
and the nature of the business. For a daily update frequency, an update window of two to
six hours might be typical.
You need to know your update window for the following activities:
■ Loading the detail data
■ Updating or rebuilding the indexes on the detail data
■ Performing quality assurance tests on the data
■ Refreshing the materialized views
■ Updating the indexes on the materialized views
Fast refresh for a materialized view containing joins and aggregates is possible after
any type of DML to the base tables (direct load or conventional INSERT, UPDATE,
or DELETE). It can be defined to be refreshed ON COMMIT or ON DEMAND. A
REFRESH ON COMMIT materialized view will be refreshed automatically when a
transaction that does DML to one of the materialized view's detail tables commits.
The time taken to complete the commit may be slightly longer than usual when this
method is chosen. This is because the refresh operation is performed as part of the
commit process. Therefore, this method may not be suitable if many users are
concurrently changing the tables upon which the materialized view is based.
Here are some examples of materialized views with aggregates. Note that
materialized view logs are only created because this materialized view will be fast
refreshed.
Fast refresh is allowed because the appropriate materialized view logs have been created
on tables product and sales.
This example creates a materialized view that contains aggregates on a single table.
Because the materialized view log has been created with all referenced columns in
the materialized view's defining query, the materialized view is fast refreshable. If
DML is applied against the sales table, then the changes will be reflected in the
materialized view when the commit is issued.
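A sketch of such a single-table aggregate materialized view and its log follows (object
names and the aggregate list are illustrative):

CREATE MATERIALIZED VIEW LOG ON sales
  WITH SEQUENCE, ROWID (prod_id, quantity_sold, amount_sold)
  INCLUDING NEW VALUES;

CREATE MATERIALIZED VIEW product_sales_mv
  BUILD IMMEDIATE
  REFRESH FAST ON COMMIT
  ENABLE QUERY REWRITE AS
SELECT prod_id, COUNT(*) AS cnt_all,
       SUM(amount_sold)   AS sum_amount,   COUNT(amount_sold)   AS cnt_amount,
       SUM(quantity_sold) AS sum_quantity, COUNT(quantity_sold) AS cnt_quantity
FROM sales
GROUP BY prod_id;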
Note that COUNT(*) must always be present to guarantee all types of fast refresh.
Otherwise, you may be limited to fast refresh after inserts only. Oracle recommends
that you include the optional aggregates in column Z in the materialized view in
order to obtain the most efficient and accurate fast refresh of the aggregates.
If you specify REFRESH FAST, Oracle performs further verification of the query
definition to ensure that fast refresh can be performed if any of the detail tables
change. These additional checks are:
■ A materialized view log must be present for each detail table and the ROWID
column must be present in each materialized view log.
■ The rowids of all the detail tables must appear in the SELECT list of the
materialized view query definition.
■ If there are no outer joins, you may have arbitrary selections and joins in the
WHERE clause. However, if there are outer joins, the WHERE clause cannot have
any selections. Further, if there are outer joins, all the joins must be connected
by ANDs and must use the equality (=) operator.
■ If there are outer joins, unique constraints must exist on the join columns of the
inner table. For example, if you are joining the fact table and a dimension table
and the join is an outer join with the fact table being the outer table, there must
exist unique constraints on the join columns of the dimension table.
If some of these restrictions are not met, you can create the materialized view as
REFRESH FORCE to take advantage of fast refresh when it is possible. If one of the
tables did not meet all of the criteria, but the other tables did, the materialized view
would still be fast refreshable with respect to the other tables for which all the
criteria are met.
Alternatively, if the previous example did not include the columns times_rid and
customers_rid, and if the refresh method was REFRESH FORCE, then this
materialized view would be fast refreshable only if the sales table was updated but
not if the tables times or customers were updated.
CREATE MATERIALIZED VIEW detail_sales_mv
PARALLEL
BUILD IMMEDIATE
REFRESH FORCE AS
SELECT s.rowid "sales_rid", c.cust_id, c.cust_last_name, s.amount_sold,
s.quantity_sold, s.time_id
FROM sales s, times t, customers c
WHERE s.cust_id = c.cust_id(+) AND s.time_id = t.time_id(+);
Even though the materialized view's defining query is almost identical and logically
equivalent to the user's input query, query rewrite does not occur, because it fails the
full text match test, which is the only rewrite possibility for some queries (for
example, queries with a subquery in the WHERE clause).
You can add a column alias list to a CREATE MATERIALIZED VIEW statement. The
column alias list explicitly resolves any column name conflict without attaching
aliases in the SELECT clause of the materialized view. The syntax of the
materialized view column alias list is illustrated in the following example:
CREATE MATERIALIZED VIEW sales_mv (sales_tid, costs_tid)
ENABLE QUERY REWRITE AS
SELECT s.time_id, c.time_id
FROM sales s, products p, costs c
WHERE s.prod_id = p.prod_id AND c.prod_id = p.prod_id AND
p.prod_name IN (SELECT prod_name FROM products);
In this example, the defining query of sales_mv now matches exactly with the user
query Q1, so full text match rewrite will take place.
Note that when aliases are specified in both the SELECT clause and the new alias
list clause, the alias list clause supersedes the ones in the SELECT clause.
Build Methods
Two build methods are available for creating the materialized view, as shown in
Table 8–3. If you select BUILD IMMEDIATE, the materialized view definition is
added to the schema objects in the data dictionary, and then the fact or detail tables
are scanned according to the SELECT expression and the results are stored in the
materialized view. Depending on the size of the tables to be scanned, this build
process can take a considerable amount of time.
An alternative approach is to use the BUILD DEFERRED clause, which creates the
materialized view without data, thereby enabling it to be populated at a later date
using the DBMS_MVIEW.REFRESH procedure described in Chapter 15, "Maintaining
the Data Warehouse".
Refresh Options
When you define a materialized view, you can specify three refresh options: how to
refresh, what type of refresh, and whether trusted constraints can be used. If
unspecified, the defaults are ON DEMAND, FORCE, and ENFORCED constraints,
respectively.
The two refresh execution modes are ON COMMIT and ON DEMAND. Depending on
the materialized view you create, some of the options may not be available.
Table 8–4 describes the refresh modes.
When a materialized view is maintained using the ON COMMIT method, the time
required to complete the commit may be slightly longer than usual. This is because
the refresh operation is performed as part of the commit process. Therefore this
method may not be suitable if many users are concurrently changing the tables
upon which the materialized view is based.
If you anticipate performing insert, update or delete operations on tables referenced
by a materialized view concurrently with the refresh of that materialized view, and
that materialized view includes joins and aggregation, Oracle recommends you use
ON COMMIT fast refresh rather than ON DEMAND fast refresh.
If you think the materialized view did not refresh, check the alert log or trace file.
If a materialized view fails during refresh at COMMIT time, you must explicitly
invoke the refresh procedure using the DBMS_MVIEW package after addressing the
errors specified in the trace files. Until this is done, the materialized view will no
longer be refreshed automatically at commit time.
You can specify how you want your materialized views to be refreshed from the
detail tables by selecting one of four options: COMPLETE, FAST, FORCE, and NEVER.
Table 8–5 describes the refresh options.
Whether the fast refresh option is available depends upon the type of materialized
view. You can call the procedure DBMS_MVIEW.EXPLAIN_MVIEW to determine
whether fast refresh is possible.
You can also specify whether it is acceptable to use trusted constraints and
QUERY_REWRITE_INTEGRITY = TRUSTED during refresh. Any nonvalidated RELY constraint is a
trusted constraint; examples include nonvalidated foreign key/primary key relationships
and functional dependencies defined in dimensions or on a materialized view in the
UNKNOWN state. If query rewrite is enabled during refresh, these can improve the
performance of refresh by enabling more efficient query rewrites. Any materialized view
that uses TRUSTED constraints for refresh is left in a state of trusted freshness (the
UNKNOWN state) after refresh.
This is reflected in the column STALENESS in the view USER_MVIEWS. The column
UNKNOWN_TRUSTED_FD in the same view is also set to Y, which means yes.
You can define this property of the materialized view either at creation time, by
specifying REFRESH USING TRUSTED [ENFORCED] CONSTRAINTS, or later by using the
ALTER MATERIALIZED VIEW statement.
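For example (the materialized view name is illustrative):

ALTER MATERIALIZED VIEW sales_mv
  REFRESH USING TRUSTED CONSTRAINTS;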
■ Materialized aggregate views with outer joins are fast refreshable after
conventional DML and direct loads, provided only the outer table has been
modified. Also, unique constraints must exist on the join columns of the inner
join table. If there are outer joins, all the joins must be connected by ANDs and
must use the equality (=) operator.
■ For materialized views with CUBE, ROLLUP, grouping sets, or concatenation of
them, the following restrictions apply:
■ The SELECT list should contain a grouping distinguisher, which can be either a
GROUPING_ID function on all GROUP BY expressions or one GROUPING function for
each GROUP BY expression. For example, if the GROUP BY clause of the materialized
view is "GROUP BY CUBE(a, b)", then the SELECT list should contain either
"GROUPING_ID(a, b)" or both "GROUPING(a)" and "GROUPING(b)" for the
materialized view to be fast refreshable.
■ GROUP BY should not result in any duplicate groupings. For example,
"GROUP BY a, ROLLUP(a, b)" is not fast refreshable because it results
in the duplicate groupings "(a), (a, b), and (a)".
set to TRUE on the base tables. If you use DBMS_MVIEW.REFRESH, the entire
materialized view chain is refreshed from the top down. With DBMS_
MVIEW.REFRESH_DEPENDENT, the entire chain is refreshed from the bottom up.
This statement will first refresh all child materialized views of sales_mv and
cost_mv based on the dependency analysis and then refresh the two specified
materialized views.
You can query the STALE_SINCE column in the *_MVIEWS views to find out when
a materialized view became stale.
ORDER BY Clause
An ORDER BY clause is allowed in the CREATE MATERIALIZED VIEW statement. It
is used only during the initial creation of the materialized view. It is not used
during a full refresh or a fast refresh.
To improve the performance of queries against large materialized views, store the
rows in the materialized view in the order specified in the ORDER BY clause. This
initial ordering provides physical clustering of the data. If indexes are built on the
columns by which the materialized view is ordered, accessing the rows of the
materialized view using the index often reduces the time for disk I/O due to the
physical clustering.
The ORDER BY clause is not considered part of the materialized view definition. As a
result, there is no difference in the manner in which Oracle Database detects the
various types of materialized views (for example, materialized join views with no
aggregates). For the same reason, query rewrite is not affected by the ORDER BY
clause. This feature is similar to the CREATE TABLE ... ORDER BY capability.
Materialized view logs are created on the detail (master) tables; they are not created on
the materialized view. For fast refresh of materialized views, the definition of the
materialized view logs must normally specify the ROWID clause. In addition, for aggregate
materialized views, it must also contain every column in the table referenced in the
materialized view, the INCLUDING NEW VALUES clause, and the SEQUENCE clause.
An example of a materialized view log is shown as follows where one is created on
the table sales.
CREATE MATERIALIZED VIEW LOG ON sales WITH SEQUENCE, ROWID
(prod_id, cust_id, time_id, channel_id, promo_id, quantity_sold, amount_sold)
INCLUDING NEW VALUES;
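A comment can be added to a materialized view with a statement such as the following (the
comment text is illustrative):

COMMENT ON MATERIALIZED VIEW sales_mv IS 'sales summary materialized view';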
To view the comment after the preceding statement has been executed, the user can query
the catalog views USER_MVIEW_COMMENTS, DBA_MVIEW_COMMENTS, and ALL_MVIEW_COMMENTS.
For example:
SELECT MVIEW_NAME, COMMENTS
FROM USER_MVIEW_COMMENTS WHERE MVIEW_NAME = 'SALES_MV';
Note: If the compatibility is set to 10.0.1 or higher, COMMENT ON TABLE will not be
allowed for the materialized view container table. The following error message will
be thrown if it is issued.
ORA-12098: cannot comment on the materialized view.
In the case of a prebuilt table, if it has an existing comment, the comment will be
inherited by the materialized view after it has been created. The existing comment
will be prefixed with '(from table)'. For example, table sales_summary was
created to contain sales summary information. An existing comment 'Sales
summary data' was associated with the table. A materialized view of the same
name is created to use the prebuilt table as its container table. After the materialized
view creation, the comment becomes '(from table) Sales summary data'.
However, if the prebuilt table, sales_summary, does not have any comment, the
following comment is added: 'Sales summary data'. Then, if we drop the
materialized view, the comment will be passed to the prebuilt table with the
comment: '(from materialized view) Sales summary data'.
You could have compressed this table to save space. See "Storage And Table
Compression" on page 8-22 for details regarding table compression.
In some cases, user-defined materialized views are refreshed on a schedule that is
longer than the update cycle. For example, a monthly materialized view might be
updated only at the end of each month, and the materialized view values always
refer to complete time periods. Reports written directly against these materialized
views implicitly select only data that is not in the current (incomplete) time period.
If a user-defined materialized view already contains a time dimension:
■ It should be registered and then fast refreshed each update cycle.
■ You can create a view that selects the complete time period of interest.
■ The reports should be modified to refer to the view instead of referring directly
to the user-defined materialized view.
If the user-defined materialized view does not contain a time dimension, then:
■ Create a new materialized view that does include the time dimension (if
possible).
■ The view should aggregate over the time column in the new materialized view.
DBMS_MVIEW.EXPLAIN_MVIEW Declarations
The following PL/SQL declarations that are made for you in the DBMS_MVIEW
package show the order and datatypes of these parameters for explaining an
existing materialized view and a potential materialized view with output to a table
and to a VARRAY.
Explain an existing or potential materialized view with output to MV_
CAPABILITIES_TABLE:
DBMS_MVIEW.EXPLAIN_MVIEW (mv IN VARCHAR2,
stmt_id IN VARCHAR2:= NULL);
Using MV_CAPABILITIES_TABLE
One of the simplest ways to use DBMS_MVIEW.EXPLAIN_MVIEW is with the MV_
CAPABILITIES_TABLE, which has the following structure:
CREATE TABLE MV_CAPABILITIES_TABLE
(STMT_ID VARCHAR(30), -- client-supplied unique statement identifier
MV VARCHAR(30), -- NULL for SELECT based EXPLAIN_MVIEW
CAPABILITY_NAME VARCHAR(30), -- A descriptive name of particular
-- capabilities, such as REWRITE.
You can use the utlxmv.sql script found in the admin directory to create MV_
CAPABILITIES_TABLE.
Then, you invoke EXPLAIN_MVIEW with the materialized view to explain. You need
to use the SEQ column in an ORDER BY clause so the rows will display in a logical
order. If a capability is not possible, N will appear in the P column and an
explanation in the MSGTXT column. If a capability is not possible for more than one
reason, a row is displayed for each reason.
EXECUTE DBMS_MVIEW.EXPLAIN_MVIEW ('SH.CAL_MONTH_SALES_MV');
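You can then display the results with a query such as the following (column names are as
defined by the utlxmv.sql script):

SELECT capability_name, possible, related_text, msgtxt
  FROM mv_capabilities_table
 ORDER BY seq;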
MV_CAPABILITIES_TABLE.CAPABILITY_NAME Details
Table 8–7 lists explanations for values in the CAPABILITY_NAME column.
■ The top level partition key must consist of only a single column.
■ The materialized view must contain either the partition key column or a
partition marker or ROWID or join dependent expression of the detail table. See
PL/SQL Packages and Types Reference for details regarding the DBMS_
MVIEW.PMARKER function.
■ If you use a GROUP BY clause, the partition key column or the partition marker
or ROWID or join dependent expression must be present in the GROUP BY clause.
■ If you use an analytic window function or the MODEL clause, the partition key
column or the partition marker or ROWID or join dependent expression must be
present in their respective PARTITION BY subclauses.
■ Data modifications can only occur on the partitioned table. If PCT refresh is
being done for a table which has join dependent expression in the materialized
view, then data modifications should not have occurred in any of the join
dependent tables.
■ The COMPATIBILITY initialization parameter must be a minimum of 9.0.0.0.0.
■ PCT is not supported for a materialized view that refers to views, remote tables,
or outer joins.
■ PCT-based refresh is not supported for UNION ALL materialized views.
Partition Key
Partition change tracking requires sufficient information in the materialized view to
be able to correlate a detail row in the source partitioned detail table to the
corresponding materialized view row. This can be accomplished by including the
detail table partition key columns in the SELECT list and, if GROUP BY is used, in the
GROUP BY list.
Consider an example of a materialized view storing daily customer sales. The
following example uses the sh sample schema and the three detail tables sales,
products, and times to create the materialized view. The sales table is partitioned
by the time_id column and products is partitioned by the prod_id column; times
is not a partitioned table.
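A sketch of this materialized view, with an illustrative select list, follows; the
essential point is that the partitioning key columns time_id and prod_id appear in both
the SELECT list and the GROUP BY clause:

CREATE MATERIALIZED VIEW cust_dly_sales_mv
  BUILD DEFERRED
  ENABLE QUERY REWRITE AS
SELECT s.time_id, s.prod_id, s.cust_id,
       COUNT(*) AS cnt, SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, times t
WHERE s.time_id = t.time_id AND s.prod_id = p.prod_id
GROUP BY s.time_id, s.prod_id, s.cust_id;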
For cust_dly_sales_mv, PCT is enabled on both the sales table and products
table because their respective partitioning key columns time_id and prod_id are
in the materialized view.
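PCT can also be enabled through a join dependent expression. Consider a query of the
following form (a sketch consistent with the discussion that follows):

SELECT s.time_id, t.calendar_month_name
FROM sales s, times t
WHERE s.time_id = t.time_id;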
In this query, the times table is a join dependent table because it is joined to the
sales table on the partitioning key column time_id. Moreover, calendar_month_name is a
dimension hierarchical attribute of times.time_id, because calendar_month_name is an
attribute of times.mon_id and times.mon_id is a dimension hierarchical parent of
times.time_id. Hence, the expression calendar_month_name from the times table is a join
dependent expression. Let's look at another example:
SELECT s.time_id, y.calendar_year_name
FROM sales s, times_d d, times_m m, times_y y
WHERE s.time_id = d.time_id AND d.day_id = m.day_id AND m.mon_id = y.mon_id;
Here, the times table is denormalized into the times_d, times_m, and times_y tables.
The expression calendar_year_name from the times_y table is a join dependent
expression, and the tables times_d, times_m, and times_y are join dependent
tables. This is because the times_y table is joined indirectly through the times_m and
times_d tables to the sales table on its partitioning key column time_id.
This lets users create materialized views containing aggregates on some level higher
than the partitioning key of the detail table. Consider the following example of a
materialized view storing monthly customer sales.
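A sketch of such a materialized view (object names and aggregates are illustrative) is:

CREATE MATERIALIZED VIEW cust_mth_sales_mv
  BUILD DEFERRED
  ENABLE QUERY REWRITE AS
SELECT t.calendar_month_name, s.prod_id, s.cust_id,
       COUNT(*) AS cnt, SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, times t
WHERE s.time_id = t.time_id AND s.prod_id = p.prod_id
GROUP BY t.calendar_month_name, s.prod_id, s.cust_id;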
Here, you can correlate a detail table row to its corresponding materialized view
row using the join dependent table times and the relationship that
times.calendar_month_name is a dimensional attribute determined by
times.time_id. This enables partition change tracking on sales table. In
addition to this, PCT is enabled on products table because of presence of its
partitioning key column prod_id in the materialized view.
Partition Marker
The DBMS_MVIEW.PMARKER function is designed to significantly reduce the
cardinality of the materialized view (see Example 9–3 on page 9-6 for an example).
The function returns a partition identifier that uniquely identifies the partition for a
specified row within a specified partition table. Therefore, the DBMS_
MVIEW.PMARKER function is used instead of the partition key column in the
SELECT and GROUP BY clauses.
Unlike the general case of a PL/SQL function in a materialized view, use of the
DBMS_MVIEW.PMARKER does not prevent rewrite with that materialized view even
when the rewrite mode is QUERY_REWRITE_INTEGRITY=ENFORCED.
As an example of using the PMARKER function, consider calculating a typical
number, such as revenue generated by a product category during a given year. If
there were 1000 different products sold each month, it would result in 12,000 rows
in the materialized view.
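A sketch of a materialized view that uses the partition marker in place of the partition
key column follows (the dimension attributes chosen are illustrative):

CREATE MATERIALIZED VIEW prod_yr_sales_mv
  BUILD DEFERRED
  ENABLE QUERY REWRITE AS
SELECT DBMS_MVIEW.PMARKER(s.rowid) AS pmarker,
       t.fiscal_year, p.prod_category,
       COUNT(*) AS cnt, SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, times t
WHERE s.time_id = t.time_id AND s.prod_id = p.prod_id
GROUP BY DBMS_MVIEW.PMARKER(s.rowid), t.fiscal_year, p.prod_category;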
Partial Rewrite
A subsequent INSERT statement adds a new row to the sales_part3 partition of
table sales. At this point, because cust_dly_sales_mv has PCT available on
table sales using a partition key, Oracle can identify the stale rows in the
materialized view cust_dly_sales_mv corresponding to sales_part3 partition
(The other rows are unchanged in their freshness state). Query rewrite cannot
(PARTITION month1
VALUES LESS THAN (TO_DATE('31-12-1998', 'DD-MM-YYYY'))
PCTFREE 0 PCTUSED 99
STORAGE (INITIAL 64k NEXT 16k PCTINCREASE 0)
TABLESPACE sf1,
PARTITION month2
VALUES LESS THAN (TO_DATE('31-12-1999', 'DD-MM-YYYY'))
PCTFREE 0 PCTUSED 99
STORAGE (INITIAL 64k NEXT 16k PCTINCREASE 0)
TABLESPACE sf2,
PARTITION month3
VALUES LESS THAN (TO_DATE('31-12-2000', 'DD-MM-YYYY'))
PCTFREE 0 PCTUSED 99
STORAGE (INITIAL 64k NEXT 16k PCTINCREASE 0)
TABLESPACE sf3) AS
SELECT s.time_id, s.cust_id, SUM(s.amount_sold) AS sum_dollar_sales,
SUM(s.quantity_sold) AS sum_unit_sales
FROM sales s GROUP BY s.time_id, s.cust_id;
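The registration itself is done with a statement such as the following (a sketch; the
materialized view name must match the prebuilt table name):

CREATE MATERIALIZED VIEW part_sales_tab
  ON PREBUILT TABLE
  ENABLE QUERY REWRITE AS
SELECT s.time_id, s.cust_id, SUM(s.amount_sold) AS sum_dollar_sales,
       SUM(s.quantity_sold) AS sum_unit_sales
FROM sales s GROUP BY s.time_id, s.cust_id;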
In this example, the table part_sales_tab has been partitioned over three
months and then the materialized view was registered to use the prebuilt table. This
materialized view is eligible for query rewrite because the ENABLE QUERY
REWRITE clause has been included.
■ If PCT is enabled using either the partitioning key column or join expressions,
the materialized view should be range or list partitioned.
■ PCT refresh is non-atomic.
OLAP Cubes
While data warehouse environments typically view data in the form of a star
schema, OLAP environments view data in the form of a hierarchical cube. A
hierarchical cube includes the data aggregated along the rollup hierarchy of each of
its dimensions and these aggregations are combined across dimensions. It includes
the typical set of aggregations needed for business intelligence queries.
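For example, the following query sketches a hierarchical cube over a three-level time
hierarchy and a three-level product hierarchy (sh-style column names are assumed); each
ROLLUP produces four grouping levels, so the query computes 4 x 4 = 16 groups:

SELECT t.calendar_year, t.calendar_quarter_desc, t.calendar_month_desc,
       p.prod_category, p.prod_subcategory, p.prod_name,
       GROUPING_ID(t.calendar_year, t.calendar_quarter_desc, t.calendar_month_desc,
                   p.prod_category, p.prod_subcategory, p.prod_name) AS gid,
       SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, times t, products p
WHERE s.time_id = t.time_id AND s.prod_id = p.prod_id
GROUP BY ROLLUP (t.calendar_year, t.calendar_quarter_desc, t.calendar_month_desc),
         ROLLUP (p.prod_category, p.prod_subcategory, p.prod_name);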
Note that as you increase the number of dimensions and levels, the number of
groups to calculate increases dramatically. This example involves 16 groups, but if
you were to add just two more dimensions with the same number of levels, you
would have 4 x 4 x 4 x 4 = 256 different groups. Also, consider that a similar
increase in groups occurs if you have multiple hierarchies in your dimensions. For
example, the time dimension might have an additional hierarchy of fiscal month
rolling up to fiscal quarter and then fiscal year. Handling the explosion of groups
has historically been the major challenge in data storage for OLAP systems.
Typical OLAP queries slice and dice different parts of the cube comparing
aggregations from one level to aggregation from another level. For instance, a query
might find sales of the grocery division for the month of January, 2002 and compare
them with total sales of the grocery division for all of 2001.
Hence, the most effective partitioning scheme for these materialized views is to use
composite partitioning (range-list on (time, GROUPING_ID) columns). By
partitioning the materialized views this way, you enable:
■ PCT refresh, thereby improving refresh performance.
■ Partition pruning: only relevant aggregate groups will be accessed, thereby
greatly reducing the query processing cost.
If you do not want to use PCT refresh, you can just partition by list on GROUPING_
ID column.
Example 9–5 Materialized View Using UNION ALL with Two Join Views
To create a UNION ALL materialized view with two join views, the materialized
view logs must have the rowid column and, in the following example, the UNION
ALL marker column is supplied by the constant expressions 1 marker and 2 marker in
the two branches.
CREATE MATERIALIZED VIEW LOG ON sales WITH ROWID;
CREATE MATERIALIZED VIEW LOG ON customers WITH ROWID;
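A sketch of the materialized view itself follows (the selection predicates that
distinguish the two branches are illustrative):

CREATE MATERIALIZED VIEW unionall_sales_cust_joins_mv
  BUILD IMMEDIATE
  REFRESH FAST ON DEMAND
  ENABLE QUERY REWRITE AS
(SELECT c.rowid crid, s.rowid srid, c.cust_id, s.amount_sold, 1 marker
 FROM sales s, customers c
 WHERE s.cust_id = c.cust_id AND c.cust_last_name = 'Smith')
UNION ALL
(SELECT c.rowid crid, s.rowid srid, c.cust_id, s.amount_sold, 2 marker
 FROM sales s, customers c
 WHERE s.cust_id = c.cust_id AND c.cust_last_name = 'Brown');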
Example 9–6 Materialized View Using UNION ALL with Joins and Aggregates
The following example shows a UNION ALL of a materialized view with joins and a
materialized view with aggregates. A couple of things can be noted in this example.
Nulls or constants can be used to ensure that the data types of the corresponding
SELECT list columns match. Also the UNION ALL marker column can be a string
literal, which is 'Year' umarker, 'Quarter' umarker, or 'Daily' umarker
in the following example:
CREATE MATERIALIZED VIEW LOG ON sales WITH ROWID, SEQUENCE
(amount_sold, time_id)
INCLUDING NEW VALUES;
By using two materialized views, you can incrementally maintain the materialized
view my_groupby_mv. The materialized view my_model_mv is on a much smaller
data set because it is built on my_groupby_mv and can be maintained by a
complete refresh.
Materialized views with models can use complete refresh or PCT refresh only, and
are available for partial text query rewrite only.
See Chapter 22, "SQL for Modeling" for further details about model calculations.
The state of a materialized view can be checked by querying the data dictionary
views USER_MVIEWS or ALL_MVIEWS. The column STALENESS will show one of
the values FRESH, STALE, UNUSABLE, UNKNOWN, UNDEFINED, or NEEDS_COMPILE
to indicate whether the materialized view can be used. The state is maintained
automatically. However, if the staleness of a materialized view is marked as NEEDS_
COMPILE, you could issue an ALTER MATERIALIZED VIEW ... COMPILE statement
to validate the materialized view and get the correct staleness state.
To create a materialized view, you must have been granted the SELECT privilege on any
referenced tables that are in another schema. Moreover, if you enable query rewrite on a
materialized view that references tables outside your schema, you must have the GLOBAL
QUERY REWRITE privilege or the QUERY REWRITE object privilege on each table outside your
schema.
If the materialized view is on a prebuilt container, the creator, if different from the
owner, must have SELECT WITH GRANT privilege on the container table.
If you continue to get a privilege error while trying to create a materialized view
and you believe that all the required privileges have been granted, then the problem
is most likely that a privilege has not been granted explicitly but is instead inherited
through a role. The owner of the materialized view must have been granted
SELECT access to the referenced tables explicitly, not through a role, if the tables are
in a different schema.
If the materialized view is being created with ON COMMIT REFRESH specified, then
the owner of the materialized view requires an additional privilege if any of the
tables in the defining query are outside the owner's schema. In that case, the owner
requires the ON COMMIT REFRESH system privilege or the ON COMMIT REFRESH
object privilege on each table outside the owner's schema.
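For example, assuming a user dwuser who owns such a materialized view and a
detail table sh.sales in another schema (both names are illustrative), the required
grants could be issued as follows:
GRANT GLOBAL QUERY REWRITE TO dwuser;
GRANT SELECT ON sh.sales TO dwuser;
GRANT ON COMMIT REFRESH ON sh.sales TO dwuser;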
up the VPD-generated predicate on the request query with the predicate you
directly specify when you create the materialized view.
The following sections will help you create and manage dimensions in your data
warehouse:
■ What are Dimensions?
■ Creating Dimensions
■ Viewing Dimensions
■ Using Dimensions with Constraints
■ Validating Dimensions
■ Altering Dimensions
■ Deleting Dimensions
What are Dimensions?
Figure 10–1 illustrates a customer dimension hierarchy consisting of the levels
customer, city, state, country, subregion, and region, from the most detailed level
up to the most aggregated.
Data analysis typically starts at higher levels in the dimensional hierarchy and
gradually drills down if the situation warrants such analysis.
Dimensions do not have to be defined. However, if your application uses
dimensional modeling, it is worth spending time creating them because they can
yield significant benefits: they enable query rewrite to perform more complex types
of rewrite, and they also benefit certain types of materialized view refresh
operations and the SQLAccess Advisor. They are only mandatory if you use the
SQLAccess Advisor (a GUI tool for materialized view and index management)
without a workload to recommend which materialized views and indexes to create,
drop, or retain.
In spite of the benefits of dimensions, you must not create dimensions in any
schema that does not fully satisfy the dimensional relationships described in this
chapter. Otherwise, queries can return incorrect results.
Creating Dimensions
Before you can create a dimension object, the dimension tables must exist in the
database, possibly already containing the dimension data. For example, if you create a
customer dimension, one or more tables must exist that contain the city, state, and
country information. In a star schema data warehouse, these dimension tables
already exist. It is therefore a simple task to identify which ones will be used.
Now you can draw the hierarchies of a dimension as shown in Figure 10–1. For
example, city is a child of state (because you can aggregate city-level data up to
state), and state is a child of country. This hierarchical information will be stored in
the database dimension object.
In the case of normalized or partially normalized dimension representation (a
dimension that is stored in more than one table), identify how these tables are
joined. Note whether the joins between the dimension tables can guarantee that
each child-side row joins with one and only one parent-side row. In the case of
denormalized dimensions, determine whether the child-side columns uniquely
determine the parent-side (or attribute) columns. If you use constraints to represent
these relationships, they can be enabled with the NOVALIDATE and RELY clauses if
the relationships represented by the constraints are guaranteed by other means.
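For example, a foreign key from the child-side table to the parent-side table that is
guaranteed by the load process could be declared, but not validated, as in the
following sketch, which uses the sh sample schema's customers and countries
tables (the constraint name is illustrative):
ALTER TABLE customers ADD CONSTRAINT customers_country_fk_rely
FOREIGN KEY (country_id) REFERENCES countries (country_id)
RELY ENABLE NOVALIDATE;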
You create a dimension using either the CREATE DIMENSION statement or the
Dimension Wizard in Oracle Enterprise Manager. Within the CREATE DIMENSION
statement, use the LEVEL clause to identify the names of the dimension levels.
Each child value must roll up to exactly one parent value; for example, a state
must be contained in exactly one country. States that belong to more than one
country, or that belong to no country, violate hierarchical integrity. Hierarchical
integrity is necessary for the correct operation of management functions for
materialized views that include aggregates.
For example, you can declare a dimension products_dim, which contains levels
product, subcategory, and category:
CREATE DIMENSION products_dim
LEVEL product IS (products.prod_id)
LEVEL subcategory IS (products.prod_subcategory)
LEVEL category IS (products.prod_category) ...
Each level in the dimension must correspond to one or more columns in a table in
the database. Thus, level product is identified by the column prod_id in the
products table and level subcategory is identified by a column called prod_
subcategory in the same table.
In this example, the database tables are denormalized and all the columns exist in
the same table. However, this is not a prerequisite for creating dimensions. "Using
Normalized Dimension Tables" on page 10-10 shows how to create a dimension
customers_dim that has a normalized schema design using the JOIN KEY clause.
The next step is to declare the relationship between the levels with the HIERARCHY
statement and give that hierarchy a name. A hierarchical relationship is a functional
dependency from one level of a hierarchy to the next level in the hierarchy. Using
the level names defined previously, the CHILD OF relationship denotes that each
child's level value is associated with one and only one parent level value. The
following statement declares a hierarchy prod_rollup and defines the
relationship between the product, subcategory, and category levels:
HIERARCHY prod_rollup
(product CHILD OF
subcategory CHILD OF
category)
See Also: Chapter 18, "Query Rewrite" for further details of using
dimensional information
The design, creation, and maintenance of dimensions is part of the design, creation,
and maintenance of your data warehouse schema. Once the dimension has been
created, check that it meets these requirements:
■ There must be a 1:n relationship between a parent and children. A parent can
have one or more children, but a child can have only one parent.
■ There must be a 1:1 attribute relationship between hierarchy levels and their
dependent dimension attributes. For example, if there is a column fiscal_
month_desc, then a possible attribute relationship would be fiscal_month_
desc to fiscal_month_name.
■ If the columns of a parent level and child level are in different relations, then the
connection between them also requires a 1:n join relationship. Each row of the
child table must join with one and only one row of the parent table. This
relationship is stronger than referential integrity alone, because it requires that
the child join key must be non-null, that referential integrity must be
maintained from the child join key to the parent join key, and that the parent
join key must be unique.
■ You must ensure (using database constraints if necessary) that the columns of
each hierarchy level are non-null and that hierarchical integrity is maintained.
■ The hierarchies of a dimension can overlap or be disconnected from each other.
However, the columns of a hierarchy level cannot be associated with more than
one dimension.
■ Join relationships that form cycles in the dimension graph are not supported.
For example, a hierarchy level cannot be joined to itself either directly or
indirectly.
Multiple Hierarchies
A single dimension definition can contain multiple hierarchies. Suppose our retailer
wants to track the sales of certain items over time. The first step is to define the time
dimension over which sales will be tracked. Figure 10–2 illustrates a dimension
times_dim with two time hierarchies.
The two hierarchies are a calendar hierarchy (day, month, quarter, year) and a
fiscal hierarchy (day, fis_week, fis_month, fis_quarter, fis_year).
From the illustration, you can construct the denormalized times_dim dimension's
CREATE DIMENSION statement as follows. The complete CREATE
DIMENSION statement as well as the CREATE TABLE statement are shown in Oracle
Database Sample Schemas.
CREATE DIMENSION times_dim
LEVEL day IS times.time_id
LEVEL month IS times.calendar_month_desc
LEVEL quarter IS times.calendar_quarter_desc
LEVEL year IS times.calendar_year
LEVEL fis_week IS times.week_ending_day
LEVEL fis_month IS times.fiscal_month_desc
LEVEL fis_quarter IS times.fiscal_quarter_desc
LEVEL fis_year IS times.fiscal_year
HIERARCHY cal_rollup (
day CHILD OF
month CHILD OF
quarter CHILD OF
year
)
HIERARCHY fis_rollup (
day CHILD OF
fis_week CHILD OF
fis_month CHILD OF
fis_quarter CHILD OF
fis_year
) <attribute determination clauses>;
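For reference, a dimension with a normalized design, such as the customers_dim
dimension mentioned earlier, can be declared as in the following sketch. It assumes
the sh sample schema's customers and countries tables (the level columns are
assumptions) and ends with the JOIN KEY clause shown below:
CREATE DIMENSION customers_dim
LEVEL customer IS (customers.cust_id)
LEVEL city IS (customers.cust_city)
LEVEL state IS (customers.cust_state_province)
LEVEL country IS (countries.country_id)
LEVEL subregion IS (countries.country_subregion)
LEVEL region IS (countries.country_region)
HIERARCHY geog_rollup (
customer CHILD OF
city CHILD OF
state CHILD OF
country CHILD OF
subregion CHILD OF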
region
JOIN KEY (customers.country_id) REFERENCES country);
Viewing Dimensions
Dimensions can be viewed through one of two methods:
■ Using Oracle Enterprise Manager
■ Using the DESCRIBE_DIMENSION Procedure
Using Dimensions with Constraints
HIERARCHY CHANNEL_ROLLUP (
CHANNEL CHILD OF
CHANNEL_CLASS)
This information is also used for query rewrite. See Chapter 18, "Query Rewrite" for
more information.
Validating Dimensions
The information of a dimension object is declarative only and not enforced by the
database. If the relationships described by the dimensions are incorrect, incorrect
results could occur. Therefore, you should verify the relationships specified by
CREATE DIMENSION using the DBMS_DIMENSION.VALIDATE_DIMENSION
procedure periodically.
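For example, the following call validates a dimension named TIME_FN; the
argument order assumed here is dimension name, incremental flag, check-nulls flag,
and statement identifier, and rows that violate the declared relationships are
recorded under that statement identifier in the DIMENSION_EXCEPTIONS table:
EXECUTE DBMS_DIMENSION.VALIDATE_DIMENSION('TIME_FN', FALSE, TRUE, 'my 1st example');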
However, rather than querying the DIMENSION_EXCEPTIONS table directly, it may
be better to use the rowid of the invalid row to retrieve the actual row that has
violated the constraint. In this example, the dimension TIME_FN is checking a table
called month. It has found a
row that violates the constraints. Using the rowid, you can see exactly which row in
the month table is causing the problem, as in the following:
SELECT * FROM month
WHERE rowid IN (SELECT bad_rowid
FROM dimension_exceptions
WHERE statement_id = 'my 1st example');
Altering Dimensions
You can modify the dimension using the ALTER DIMENSION statement. You can
add or drop a level, hierarchy, or attribute from the dimension using this command.
Referring to the time dimension in Figure 10–2 on page 10-9, you can remove the
attribute fis_year, drop the hierarchy fis_rollup, or remove the level
fis_year. In addition, you can add a new level called f_year as in the
following:
ALTER DIMENSION times_dim DROP ATTRIBUTE fis_year;
ALTER DIMENSION times_dim DROP HIERARCHY fis_rollup;
ALTER DIMENSION times_dim DROP LEVEL fis_year;
ALTER DIMENSION times_dim ADD LEVEL f_year IS times.fiscal_year;
If you try to remove anything that has further dependencies inside the dimension,
Oracle Database rejects the alteration. A dimension becomes invalid
if you change any schema object that the dimension is referencing. For example, if
the table on which the dimension is defined is altered, the dimension becomes
invalid.
To check the status of a dimension, view the contents of the INVALID column in the
ALL_DIMENSIONS data dictionary view. To revalidate the dimension, use the
COMPILE option as follows:
ALTER DIMENSION times_dim COMPILE;
Dimensions can also be modified or deleted using Oracle Enterprise Manager. See
Oracle Enterprise Manager Administrator's Guide for more information.
Deleting Dimensions
A dimension is removed using the DROP DIMENSION statement. For example:
DROP DIMENSION times_dim;
This section deals with the tasks for managing a data warehouse.
It contains the following chapters:
■ Chapter 11, "Overview of Extraction, Transformation, and Loading"
■ Chapter 12, "Extraction in Data Warehouses"
■ Chapter 13, "Transportation in Data Warehouses"
■ Chapter 14, "Loading and Transformation"
■ Chapter 15, "Maintaining the Data Warehouse"
■ Chapter 16, "Change Data Capture"
■ Chapter 17, "SQLAccess Advisor"
11
Overview of Extraction, Transformation, and
Loading
This chapter discusses extraction, which is the process of taking data from an
operational system and moving it to your data warehouse or staging system. The
chapter discusses:
■ Overview of Extraction in Data Warehouses
■ Introduction to Extraction Methods in Data Warehouses
■ Data Warehousing Extraction Examples
Full Extraction
The data is extracted completely from the source system. Because this extraction
reflects all the data currently available on the source system, there's no need to keep
track of changes to the data source since the last successful extraction. The source
data will be provided as-is and no additional logical information (for example,
timestamps) is necessary on the source site. An example of a full extraction is an
export file of a single table or a remote SQL statement that scans the complete
source table.
Incremental Extraction
At a specific point in time, only the data that has changed since a well-defined event
back in history will be extracted. This event may be the last time of extraction or a
more complex business event like the last booking day of a fiscal period. To identify
this delta, it must be possible to identify all the information that has changed since
this specific point in time. This information can be provided either by the source
data itself, such as an application column reflecting the last-changed timestamp, or
by a change table in which an appropriate additional mechanism keeps track of the
changes alongside the originating transactions. In most cases, using the latter
method means adding extraction logic to the source system.
Many data warehouses do not use any change-capture techniques as part of the
extraction process. Instead, entire tables from the source systems are extracted to the
data warehouse or staging area, and these tables are compared with a previous
extract from the source system to identify the changed data. This approach may not
have significant impact on the source systems, but it clearly can place a considerable
burden on the data warehouse processes, particularly if the data volumes are large.
Oracle's Change Data Capture mechanism can extract and maintain such delta
information. See Chapter 16, "Change Data Capture" for further details about the
Change Data Capture framework.
Online Extraction
The data is extracted directly from the source system itself. The extraction process
can connect directly to the source system to access the source tables themselves or to
an intermediate system that stores the data in a preconfigured manner (for example,
snapshot logs or change tables). Note that the intermediate system is not necessarily
physically different from the source system.
With online extractions, you need to consider whether the distributed transactions
are using original source objects or prepared source objects.
Offline Extraction
The data is not extracted directly from the source system but is staged explicitly
outside the original source system. The data already has an existing structure (for
example, redo logs, archive logs or transportable tablespaces) or was created by an
extraction routine.
You should consider the following structures:
■ Flat files
Data in a defined, generic format. Additional information about the source
object is necessary for further processing.
■ Dump files
These techniques are based upon the characteristics of the source systems, or may
require modifications to the source systems. Thus, each of these techniques must be
carefully evaluated by the owners of the source system prior to implementation.
Each of these techniques can work in conjunction with the data extraction technique
discussed previously. For example, timestamps can be used whether the data is
being unloaded to a file or accessed through a distributed query. See Chapter 16,
"Change Data Capture" for further details.
Timestamps
The tables in some operational systems have timestamp columns. The timestamp
specifies the time and date that a given row was last modified. If the tables in an
operational system have columns containing timestamps, then the latest data can
easily be identified using the timestamp columns. For example, the following query
might be useful for extracting today's data from an orders table:
SELECT * FROM orders
WHERE TRUNC(CAST(order_date AS DATE)) = TRUNC(SYSDATE);
Partitioning
Some source systems might use range partitioning, such that the source tables are
partitioned along a date key, which allows for easy identification of new data. For
example, if you are extracting from an orders table, and the orders table is
partitioned by week, then it is easy to identify the current week's data.
Triggers
Triggers can be created in operational systems to keep track of recently updated
records. They can then be used in conjunction with timestamp columns to identify
the exact time and date when a given row was last modified. You do this by creating
a trigger on each source table that requires change data capture. Following each
DML statement that is executed on the source table, this trigger updates the
timestamp column with the current time. Thus, the timestamp column provides the
exact time and date when a given row was last modified.
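A minimal sketch of such a trigger, assuming a source table orders with a
timestamp column named last_modified (both names are illustrative):
CREATE OR REPLACE TRIGGER orders_set_timestamp
BEFORE INSERT OR UPDATE ON orders
FOR EACH ROW
BEGIN
  -- record the time of the change in the row itself
  :NEW.last_modified := SYSDATE;
END;
/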
The exact format of the output file can be specified using SQL*Plus system
variables.
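For example, a flat file with a defined column separator could be produced as
follows; the file name, the separator, and the join on the sh sample schema's
countries and customers tables are illustrative:
SET PAGESIZE 0
SET FEEDBACK OFF
SET TRIMSPOOL ON
SET LINESIZE 1000
SET COLSEP '|'
SPOOL country_city.dat
SELECT c.country_name, cu.cust_city
FROM countries c, customers cu
WHERE cu.country_id = c.country_id;
SPOOL OFF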
This extraction technique offers the advantage of storing the result in a customized
format. Note that, using the external table data pump unload facility, you can also
extract the result of an arbitrary SQL operation. The previous example extracts the
results of a join.
This extraction technique can be parallelized by initiating multiple, concurrent
SQL*Plus sessions, each session running a separate query representing a different
portion of the data to be extracted. For example, suppose that you wish to extract
data from an orders table, and that the orders table has been range partitioned
by month, with partitions orders_jan1998, orders_feb1998, and so on. To
extract a single year of data from the orders table, you could initiate 12 concurrent
SQL*Plus sessions, each extracting a single partition. The SQL script for one such
session could be:
SPOOL order_jan.dat
SELECT * FROM orders PARTITION (orders_jan1998);
SPOOL OFF
These 12 SQL*Plus processes would concurrently spool data to 12 separate files. You
can then concatenate them if necessary (using operating system utilities) following
the extraction. If you are planning to use SQL*Loader for loading into the target,
these 12 files can be used as is for a parallel load with 12 SQL*Loader sessions. See
Chapter 13, "Transportation in Data Warehouses" for an example.
Even if the orders table is not partitioned, it is still possible to parallelize the
extraction either based on logical or physical criteria. The logical method is based
on logical ranges of column values, for example:
SELECT ... WHERE order_date
BETWEEN TO_DATE('01-JAN-99') AND TO_DATE('31-JAN-99');
The physical method is based on ranges of physical storage. By viewing the data
dictionary, it is possible to identify the Oracle Database data blocks that make up
the orders table. Using this information, you could then derive a set of rowid-range
queries for
extracting data from the orders table:
SELECT * FROM orders WHERE rowid BETWEEN value1 and value2;
columns. It is also helpful to know the extraction format, which might be the
separator between distinct columns.
The total number of extraction files specified limits the maximum degree of
parallelism for the write operation. Note that the parallelizing of the extraction does
not automatically parallelize the SELECT portion of the statement.
Unlike using any kind of export/import, the metadata for the external table is not
part of the created files when using the external table data pump unload. To extract
the appropriate metadata for the external table, use the DBMS_METADATA package,
as illustrated in the following statement:
SET LONG 2000
SELECT DBMS_METADATA.GET_DDL('TABLE','EXTRACT_CUST') FROM DUAL;
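A distributed CREATE TABLE ... AS SELECT of the following form, assuming a
database link named source_db to the source system, can populate such a table:
CREATE TABLE country_city AS
SELECT DISTINCT t1.country_name, t2.cust_city
FROM countries@source_db t1, customers@source_db t2
WHERE t1.country_id = t2.country_id;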
This statement creates a local table in a data mart, country_city, and populates it
with data from the countries and customers tables on the source system.
This technique is ideal for moving small volumes of data. However, the data is
transported from the source system to the data warehouse through a single Oracle
Net connection. Thus, the scalability of this technique is limited. For larger data
volumes, file-based data extraction and transportation techniques are often more
scalable and thus more appropriate.
See Oracle Database Heterogeneous Connectivity Administrator's Guide and Oracle
Database Concepts for more information on distributed queries.
The following topics provide information about transporting data into a data
warehouse:
■ Overview of Transportation in Data Warehouses
■ Introduction to Transportation Mechanisms in Data Warehouses
hold a copy of the current month's data. Using the CREATE TABLE ... AS SELECT
statement, the current month's data can be efficiently copied to this tablespace:
CREATE TABLE temp_sales_jan NOLOGGING TABLESPACE ts_temp_sales
AS SELECT * FROM sales
WHERE time_id >= TO_DATE('01-JAN-2000','DD-MON-YYYY')
AND time_id < TO_DATE('01-FEB-2000','DD-MON-YYYY');
In this step, we have copied the January sales data into a separate tablespace;
however, in some cases, it may be possible to leverage the transportable tablespace
feature without even moving data to a separate tablespace. If the sales table has
been partitioned by month in the data warehouse and if each partition is in its own
tablespace, then it may be possible to directly transport the tablespace containing
the January data. Suppose the January partition, sales_jan2000, is located in the
tablespace ts_sales_jan2000. Then the tablespace ts_sales_jan2000 could
potentially be transported, rather than creating a temporary copy of the January
sales data in the ts_temp_sales.
However, the same conditions must be satisfied in order to transport the tablespace
ts_sales_jan2000 as are required for the specially created tablespace. First, this
tablespace must be set to READ ONLY. Second, because a single partition of a
partitioned table cannot be transported without the remainder of the partitioned
table also being transported, it is necessary to exchange the January partition into a
separate table (using the ALTER TABLE statement) to transport the January data.
The EXCHANGE operation is very quick, but the January data will no longer be a
part of the underlying sales table, and thus may be unavailable to users until this
data is exchanged back into the sales table after the export of the metadata. The
January data can be exchanged back into the sales table after you complete step 3.
This operation will generate an export file, jan_sales.dmp. The export file will be
small, because it contains only metadata. In this case, the export file will contain
information describing the table temp_sales_jan, such as the column names,
column datatype, and all other information that the target Oracle database will need
in order to access the objects in ts_temp_sales.
Step 3 Copy the Datafiles and Export File to the Target System
Copy the data files that make up ts_temp_sales, as well as the export file jan_
sales.dmp to the data mart platform, using any transportation mechanism for flat
files. Once the datafiles have been copied, the tablespace ts_temp_sales can be
set to READ WRITE mode if desired.
TABLESPACES=ts_temp_sales FILE=jan_sales.dmp
At this point, the tablespace ts_temp_sales and the table temp_sales_jan are
accessible in the data mart. You can incorporate this new data into the data mart's
tables.
You can insert the data from the temp_sales_jan table into the data mart's sales
table in one of two ways:
INSERT /*+ APPEND */ INTO sales SELECT * FROM temp_sales_jan;
Following this operation, you can delete the temp_sales_jan table (and even the
entire ts_temp_sales tablespace).
Alternatively, if the data mart's sales table is partitioned by month, then the new
transported tablespace and the temp_sales_jan table can become a permanent
part of the data mart. The temp_sales_jan table can become a partition of the
data mart's sales table:
ALTER TABLE sales ADD PARTITION sales_00jan VALUES
LESS THAN (TO_DATE('01-feb-2000','dd-mon-yyyy'));
ALTER TABLE sales EXCHANGE PARTITION sales_00jan
WITH TABLE temp_sales_jan INCLUDING INDEXES WITH VALIDATION;
This chapter helps you create and manage a data warehouse, and discusses:
■ Overview of Loading and Transformation in Data Warehouses
■ Loading Mechanisms
■ Transformation Mechanisms
■ Loading and Transformation Scenarios
Transformation Flow
From an architectural perspective, you can transform your data in two ways:
■ Multistage Data Transformation
■ Pipelined Data Transformation
In the multistage approach, data is loaded into a staging table, transformed
through a series of intermediate tables, and finally inserted into the sales
warehouse table.
The new functionality renders some of the formerly necessary process steps obsolete,
while others can be remodeled so that the data flow and the data transformation
become more scalable and non-interruptive. The task shifts from a serial
transform-then-load process (with most of the tasks done outside the database) or a
load-then-transform process to an enhanced transform-while-loading approach.
Oracle offers a wide variety of new capabilities to address all the issues and tasks
relevant in an ETL scenario. It is important to understand that the database offers
toolkit functionality rather than trying to address a one-size-fits-all solution. The
underlying database has to enable the most appropriate ETL process flow for a
specific customer need, and not dictate or constrain it from a technical perspective.
Figure 14–2 illustrates the new functionality, which is discussed throughout later
sections.
In the pipelined approach, data from flat files is transformed while it is being
loaded and is inserted directly into the sales warehouse table.
Loading Mechanisms
You can use the following mechanisms for loading a data warehouse:
■ Loading a Data Warehouse with SQL*Loader
■ Loading a Data Warehouse with External Tables
■ Loading a Data Warehouse with OCI and Direct-Path APIs
■ Loading a Data Warehouse with Export/Import
data to be first loaded in the database. You can then use SQL, PL/SQL, and Java to
access the external data.
External tables enable the pipelining of the loading phase with the transformation
phase. The transformation process can be merged with the loading process without
any interruption of the data streaming. It is no longer necessary to stage the data
inside the database for further processing inside the database, such as comparison
or transformation. For example, the conversion functionality of a conventional load
can be used for a direct-path INSERT AS SELECT statement in conjunction with the
SELECT from an external table.
The main difference between external tables and regular tables is that externally
organized tables are read-only. No DML operations (UPDATE/INSERT/DELETE)
are possible and no indexes can be created on them.
External tables are mostly compatible with the existing SQL*Loader functionality
and provide superior functionality in most cases. External tables are especially useful for
environments where the complete external source has to be joined with existing
database objects or when the data has to be transformed in a complex manner. For
example, unlike SQL*Loader, you can apply any arbitrary SQL transformation and
use the direct path insert method.
You can create an external table named sales_transactions_ext, representing
the structure of the complete sales transaction data, represented in the external file
sh_sales.dat. The product department is especially interested in a cost analysis
on product and time. We thus create a fact table named costs in the sales
history schema. The operational source data is the same as for the sales fact
table. However, because we are not investigating every dimension that is provided,
the data in the costs fact table has a coarser granularity than in the sales fact
table; for example, all different distribution channels are aggregated.
We cannot load the data into the costs fact table without applying the previously
mentioned aggregation of the detailed information, because some of the dimensions
are suppressed.
The external table framework offers a solution to solve this. Unlike SQL*Loader,
where you would have to load the data before applying the aggregation, you can
combine the loading and transformation within a single SQL DML statement, as
shown in the following. You do not have to stage the data temporarily before
inserting into the target table.
The object directories must already exist, and point to the directory containing the
sh_sales.dat file as well as the directory containing the bad and log files.
CREATE TABLE sales_transactions_ext
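-- The remainder of the definition is sketched here; the column names, the
-- directory objects data_dir and log_dir, and the field separator are
-- assumptions for illustration only.
(prod_id NUMBER, cust_id NUMBER, time_id DATE, channel_id NUMBER,
 promo_id NUMBER, quantity_sold NUMBER, amount_sold NUMBER(10,2),
 unit_cost NUMBER(10,2), unit_price NUMBER(10,2))
ORGANIZATION EXTERNAL
(TYPE ORACLE_LOADER
 DEFAULT DIRECTORY data_dir
 ACCESS PARAMETERS
 (RECORDS DELIMITED BY NEWLINE
  BADFILE log_dir:'sh_sales.bad'
  LOGFILE log_dir:'sh_sales.log'
  FIELDS TERMINATED BY '|')
 LOCATION ('sh_sales.dat'))
REJECT LIMIT UNLIMITED;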
The external table can now be used from within the database, accessing some
columns of the external data only, grouping the data, and inserting it into the
costs fact table:
INSERT /*+ APPEND */ INTO COSTS
(TIME_ID, PROD_ID, UNIT_COST, UNIT_PRICE)
SELECT TIME_ID, PROD_ID, AVG(UNIT_COST), AVG(amount_sold/quantity_sold)
FROM sales_transactions_ext GROUP BY time_id, prod_id;
Transformation Mechanisms
You have the following choices for transforming data inside the database:
■ Transformation Using SQL
■ Transformation Using PL/SQL
■ Transformation Using Table Functions
value. For example, you can do this efficiently using a SQL function as part of the
insertion into the target sales table statement:
The structure of source table sales_activity_direct is as follows:
DESC sales_activity_direct
Name Null? Type
------------ ----- ----------------
SALES_DATE DATE
PRODUCT_ID NUMBER
CUSTOMER_ID NUMBER
PROMOTION_ID NUMBER
AMOUNT NUMBER
QUANTITY NUMBER
See Chapter 15, "Maintaining the Data Warehouse" for more information regarding
MERGE operations.
other without the necessity of intermediate staging. You can use table functions to
implement such behavior.
Figure 14–3 illustrates a typical aggregation where you input a set of rows and
output a set of rows, in that case, after performing a SUM operation.
In                          Out
Region   Sales              Region   Sum of Sales
North    10                 North    35
South    20                 South    30
North    25                 West     10
East     5                  East     5
West     10
South    10
...
The table function takes the result of the SELECT on In as input and delivers a set
of records in a different format as output for a direct insertion into Out.
Additionally, a table function can fan out data within the scope of an atomic
transaction. This can be used for many occasions like an efficient logging
mechanism or a fan out for other independent transformations. In such a scenario, a
single staging table will be needed.
In the corresponding illustration, table function tf1 reads from the Source and
passes rows toward the Target through table function tf2, while also writing rows
into Stage Table 1, which is in turn processed by table function tf3.
Processing the data through tf1 and tf2 inserts into target and, as part of tf1,
into Stage Table 1 within the scope of an atomic transaction. The data in Stage
Table 1 can then be processed further, for example:
INSERT INTO target SELECT * FROM TABLE(tf3(CURSOR(SELECT * FROM stage_table1)));
, prod_min_price NUMBER(8,2));
TYPE product_t_rectab IS TABLE OF product_t_rec;
TYPE strong_refcur_t IS REF CURSOR RETURN product_t_rec;
TYPE refcur_t IS REF CURSOR;
END;
/
i:=i+1;
objset.extend;
objset(i):=product_t( prod_id, prod_name, prod_desc, prod_subcategory,
prod_subcategory_desc, prod_category, prod_category_desc,
prod_weight_class, prod_unit_of_measure, prod_pack_size, supplier_id,
prod_status, prod_list_price, prod_min_price);
END IF;
END LOOP;
CLOSE cur;
RETURN objset;
END;
/
You can use the table function in a SQL statement to show the results. Here we use
additional SQL functionality for the output:
SELECT DISTINCT UPPER(prod_category), prod_status
FROM TABLE(obsolete_products(
CURSOR(SELECT prod_id, prod_name, prod_desc, prod_subcategory,
prod_subcategory_desc, prod_category, prod_category_desc, prod_weight_class,
prod_unit_of_measure, prod_pack_size,
supplier_id, prod_status, prod_list_price, prod_min_price
FROM products)));
The following example implements the same filtering as the first one. The main
differences between the two are:
■ This example uses a strongly typed REF CURSOR as input and can be parallelized
based on the objects of the strongly typed cursor, as shown in one of the following
examples.
■ The table function returns the result set incrementally as soon as records are
created.
CREATE OR REPLACE FUNCTION
obsolete_products_pipe(cur cursor_pkg.strong_refcur_t) RETURN product_t_table
PIPELINED
PARALLEL_ENABLE (PARTITION cur BY ANY) IS
prod_id NUMBER(6);
prod_name VARCHAR2(50);
prod_desc VARCHAR2(4000);
prod_subcategory VARCHAR2(50);
prod_subcategory_desc VARCHAR2(2000);
prod_category VARCHAR2(50);
prod_category_desc VARCHAR2(2000);
prod_weight_class NUMBER(2);
prod_unit_of_measure VARCHAR2(20);
prod_pack_size VARCHAR2(30);
supplier_id NUMBER(6);
prod_status VARCHAR2(20);
prod_list_price NUMBER(8,2);
prod_min_price NUMBER(8,2);
sales NUMBER:=0;
BEGIN
LOOP
-- Fetch from cursor variable
FETCH cur INTO prod_id, prod_name, prod_desc, prod_subcategory,
prod_subcategory_desc, prod_category, prod_category_desc,
prod_weight_class, prod_unit_of_measure, prod_pack_size, supplier_id,
prod_status, prod_list_price, prod_min_price;
EXIT WHEN cur%NOTFOUND; -- exit when last row is fetched
IF prod_status='obsolete' AND prod_category !='Electronics' THEN
PIPE ROW (product_t( prod_id, prod_name, prod_desc, prod_subcategory,
prod_subcategory_desc, prod_category, prod_category_desc, prod_weight_class,
prod_unit_of_measure, prod_pack_size, supplier_id, prod_status,
prod_list_price, prod_min_price));
END IF;
END LOOP;
CLOSE cur;
RETURN;
END;
/
We now change the degree of parallelism for the input table products and issue the
same statement again:
ALTER TABLE products PARALLEL 4;
The session statistics show that the statement has been parallelized:
SELECT * FROM V$PQ_SESSTAT WHERE statistic='Queries Parallelized';
1 row selected.
Table functions are also capable of fanning out results into persistent table
structures. This is demonstrated in the next example. The function filters out a
specific prod_category (default Electronics) that was set to status obsolete by
mistake, and returns all other obsolete products. The prod_id values of the wrongly
obsoleted products are stored in a separate table structure,
obsolete_products_errors. Note that if a table function is part of an autonomous
transaction, it must COMMIT or ROLLBACK before each PIPE ROW statement to
avoid an error in the calling subprogram. The example furthermore demonstrates
how normal variables can be used in conjunction with table functions:
CREATE OR REPLACE FUNCTION obsolete_products_dml(cur cursor_pkg.strong_refcur_t,
prod_cat varchar2 DEFAULT 'Electronics') RETURN product_t_table
PIPELINED
PARALLEL_ENABLE (PARTITION cur BY ANY) IS
PRAGMA AUTONOMOUS_TRANSACTION;
prod_id NUMBER(6);
prod_name VARCHAR2(50);
prod_desc VARCHAR2(4000);
prod_subcategory VARCHAR2(50);
prod_subcategory_desc VARCHAR2(2000);
prod_category VARCHAR2(50);
prod_category_desc VARCHAR2(2000);
prod_weight_class NUMBER(2);
prod_unit_of_measure VARCHAR2(20);
prod_pack_size VARCHAR2(30);
supplier_id NUMBER(6);
prod_status VARCHAR2(20);
prod_list_price NUMBER(8,2);
prod_min_price NUMBER(8,2);
sales NUMBER:=0;
BEGIN
LOOP
-- Fetch from cursor variable
FETCH cur INTO prod_id, prod_name, prod_desc, prod_subcategory,
prod_subcategory_desc, prod_category, prod_category_desc, prod_weight_class,
prod_unit_of_measure, prod_pack_size, supplier_id, prod_status,
prod_list_price, prod_min_price;
EXIT WHEN cur%NOTFOUND; -- exit when last row is fetched
IF prod_status='obsolete' THEN
IF prod_category=prod_cat THEN
INSERT INTO obsolete_products_errors VALUES
(prod_id, 'correction: category '||UPPER(prod_cat)||' still
available');
COMMIT;
ELSE
PIPE ROW (product_t( prod_id, prod_name, prod_desc, prod_subcategory,
prod_subcategory_desc, prod_category, prod_category_desc, prod_weight_class,
prod_unit_of_measure, prod_pack_size, supplier_id, prod_status,
prod_list_price, prod_min_price));
END IF;
END IF;
END LOOP;
CLOSE cur;
RETURN;
END;
/
The following query shows all obsolete product groups except the prod_
category Electronics, which was wrongly set to status obsolete:
SELECT DISTINCT prod_category, prod_status FROM TABLE(obsolete_products_dml(
CURSOR(SELECT prod_id, prod_name, prod_desc, prod_subcategory,
prod_subcategory_desc, prod_category, prod_category_desc, prod_weight_class,
prod_unit_of_measure, prod_pack_size, supplier_id, prod_status,
prod_list_price, prod_min_price
FROM products)));
As you can see, there are some products of the prod_category Electronics
that were obsoleted by accident:
SELECT DISTINCT msg FROM obsolete_products_errors;
Taking advantage of the second input variable, you can specify a different product
group than Electronics to be considered:
SELECT DISTINCT prod_category, prod_status
FROM TABLE(obsolete_products_dml(
CURSOR(SELECT prod_id, prod_name, prod_desc, prod_subcategory,
prod_subcategory_desc, prod_category, prod_category_desc, prod_weight_class,
prod_unit_of_measure, prod_pack_size, supplier_id, prod_status,
prod_list_price, prod_min_price
FROM products),'Photo'));
Because table functions can be used like a normal table, they can be nested, as
shown in the following:
SELECT DISTINCT prod_category, prod_status
FROM TABLE(obsolete_products_dml(CURSOR(SELECT *
FROM TABLE(obsolete_products_pipe(CURSOR(SELECT prod_id, prod_name, prod_desc,
prod_subcategory, prod_subcategory_desc, prod_category, prod_category_desc,
prod_weight_class, prod_unit_of_measure, prod_pack_size, supplier_id,
prod_status, prod_list_price, prod_min_price
FROM products))))));
The biggest advantage of Oracle Database's ETL is its toolkit functionality, where
you can combine any of the previously discussed functionality to improve and
speed up your ETL processing. For example, you can take an external table as input,
join it with an existing table, and use the result as input for a parallelized table
function to process complex business logic. This table function can in turn be used
as the input source for a MERGE operation, thus streaming new information
provided in a flat file through the complete ETL process within one single statement.
See PL/SQL User's Guide and Reference for details about table functions and the
PL/SQL programming. For details about table functions implemented in other
languages, see Oracle Data Cartridge Developer's Guide.
In order to execute this transformation, a lookup table must relate the product_id
values to the UPC codes. This table might be the product dimension table, or
perhaps another table in the data warehouse that has been created specifically to
support this transformation. For this example, we assume that there is a table
named product, which has a product_id and an upc_code column.
This data substitution transformation can be implemented using the following
CTAS statement:
CREATE TABLE temp_sales_step2 NOLOGGING PARALLEL AS SELECT sales_transaction_id,
product.product_id sales_product_id, sales_customer_id, sales_time_id,
sales_channel_id, sales_quantity_sold, sales_dollar_amount
FROM temp_sales_step1, product
WHERE temp_sales_step1.upc_code = product.upc_code;
This CTAS statement will convert each valid UPC code to a valid product_id
value. If the ETL process has guaranteed that each UPC code is valid, then this
statement alone may be sufficient to implement the entire transformation.
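If some UPC codes might be invalid, the same statement can be written with an
outer join, as in the following sketch, so that unmatched sales rows are retained:
CREATE TABLE temp_sales_step2 NOLOGGING PARALLEL AS
SELECT sales_transaction_id,
       product.product_id sales_product_id, sales_customer_id, sales_time_id,
       sales_channel_id, sales_quantity_sold, sales_dollar_amount
FROM temp_sales_step1, product
WHERE temp_sales_step1.upc_code = product.upc_code (+);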
Using this outer join, the sales transactions that originally contained invalidated
UPC codes will be assigned a product_id of NULL. These transactions can be
handled later.
Additional approaches to handling invalid UPC codes exist. Some data warehouses
may choose to insert null-valued product_id values into their sales table, while
other data warehouses may not allow any new data from the entire batch to be
inserted into the sales table until all invalid UPC codes have been addressed. The
correct approach is determined by the business requirements of the data warehouse.
Regardless of the specific requirements, exception handling can be addressed by the
same basic SQL techniques as transformations.
Pivoting Scenarios
A data warehouse can receive data from many different sources. Some of these
source systems may not be relational databases and may store data in very different
formats from the data warehouse. For example, suppose that you receive a set of
sales records from a nonrelational database having the form:
product_id, customer_id, weekly_start_date, sales_sun, sales_mon, sales_tue,
sales_wed, sales_thu, sales_fri, sales_sat
PRODUCT_ID CUSTOMER_ID WEEKLY_ST SALES_SUN SALES_MON SALES_TUE SALES_WED SALES_THU SALES_FRI SALES_SAT
---------- ----------- --------- --------- --------- --------- --------- --------- --------- ---------
       111         222 01-OCT-00       100       200       300       400       500       600       700
       222         333 08-OCT-00       200       300       400       500       600       700       800
       333         444 15-OCT-00       300       400       500       600       700       800       900
In your data warehouse, you would want to store the records in a more typical
relational form in a fact table sales of the sh sample schema:
prod_id, cust_id, time_id, amount_sold
Thus, you need to build a transformation such that each record in the input stream
is converted into seven records for the data warehouse's sales table. This
operation is commonly referred to as pivoting, and Oracle Database offers several
ways to do this.
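One way is a multitable INSERT. The following sketch assumes the incoming
records have been staged in a table named sales_input_table (a name chosen for
this example) whose columns match the input format shown previously:
INSERT ALL
  INTO sales (prod_id, cust_id, time_id, amount_sold)
    VALUES (product_id, customer_id, weekly_start_date, sales_sun)
  INTO sales (prod_id, cust_id, time_id, amount_sold)
    VALUES (product_id, customer_id, weekly_start_date+1, sales_mon)
  INTO sales (prod_id, cust_id, time_id, amount_sold)
    VALUES (product_id, customer_id, weekly_start_date+2, sales_tue)
  INTO sales (prod_id, cust_id, time_id, amount_sold)
    VALUES (product_id, customer_id, weekly_start_date+3, sales_wed)
  INTO sales (prod_id, cust_id, time_id, amount_sold)
    VALUES (product_id, customer_id, weekly_start_date+4, sales_thu)
  INTO sales (prod_id, cust_id, time_id, amount_sold)
    VALUES (product_id, customer_id, weekly_start_date+5, sales_fri)
  INTO sales (prod_id, cust_id, time_id, amount_sold)
    VALUES (product_id, customer_id, weekly_start_date+6, sales_sat)
SELECT product_id, customer_id, weekly_start_date, sales_sun, sales_mon,
       sales_tue, sales_wed, sales_thu, sales_fri, sales_sat
FROM sales_input_table;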
This statement scans the source table only once and then inserts the appropriate
data for each day of the week.
This chapter discusses how to load and refresh a data warehouse, and discusses:
■ Using Partitioning to Improve Data Warehouse Refresh
■ Optimizing DML Operations During Refresh
■ Refreshing Materialized Views
■ Using Materialized Views with Partitioned Tables
Apply all constraints to the sales_01_2001 table that are present on the
sales table. This includes referential integrity constraints. A typical constraint
would be:
ALTER TABLE sales_01_2001 ADD CONSTRAINT sales_customer_id
FOREIGN KEY (sales_customer_id) REFERENCES customer(customer_id)
ENABLE NOVALIDATE;
If the partitioned table sales has a primary or unique key that is enforced with
a global index structure, ensure that the constraint on sales_pk_jan01 is
validated without the creation of an index structure, as in the following:
ALTER TABLE sales_01_2001 ADD CONSTRAINT sales_pk_jan01
PRIMARY KEY (sales_transaction_id) DISABLE VALIDATE;
The creation of the constraint with ENABLE clause would cause the creation of a
unique index, which does not match a local index structure of the partitioned
table. You must not have any index structure built on the nonpartitioned table
to be exchanged for existing global indexes of the partitioned table. The
exchange command would fail.
3. Add the sales_01_2001 table to the sales table.
In order to add this new data to the sales table, we need to do two things.
First, we need to add a new partition to the sales table. We will use the ALTER
TABLE ... ADD PARTITION statement. This will add an empty partition to the
sales table:
ALTER TABLE sales ADD PARTITION sales_01_2001
VALUES LESS THAN (TO_DATE('01-FEB-2001', 'DD-MON-YYYY'));
Then, we can add our newly created table to this partition using the EXCHANGE
PARTITION operation. This will exchange the new, empty partition with the
newly loaded table.
ALTER TABLE sales EXCHANGE PARTITION sales_01_2001 WITH TABLE sales_01_2001
INCLUDING INDEXES WITHOUT VALIDATION UPDATE GLOBAL INDEXES;
The EXCHANGE operation will preserve the indexes and constraints that were
already present on the sales_01_2001 table. For unique constraints (such as
the unique constraint on sales_transaction_id), you can use the UPDATE
GLOBAL INDEXES clause, as shown previously. This will automatically
maintain your global index structures as part of the partition maintenance
operation and keep them accessible throughout the whole process. If there were
only foreign-key constraints, the exchange operation would be instantaneous.
The benefits of this partitioning technique are significant. First, the new data is
loaded with minimal resource utilization. The new data is loaded into an entirely
separate table, and the index processing and constraint processing are applied only
to the new partition. If the sales table was 50 GB and had 12 partitions, then a new
month's worth of data contains approximately 4 GB. Only the new month's worth of
data needs to be indexed. None of the indexes on the remaining 46 GB of data needs
to be modified at all. This partitioning scheme additionally ensures that the load
processing time is directly proportional to the amount of new data being loaded,
not to the total size of the sales table.
Second, the new data is loaded with minimal impact on concurrent queries. All of
the operations associated with data loading are occurring on a separate sales_01_
2001 table. Therefore, none of the existing data or indexes of the sales table is
affected during this data refresh process. The sales table and its indexes remain
entirely untouched throughout this refresh process.
Third, in case of the existence of any global indexes, those are incrementally
maintained as part of the exchange command. This maintenance does not affect the
availability of the existing global index structures.
The exchange operation can be viewed as a publishing mechanism. Until the data
warehouse administrator exchanges the sales_01_2001 table into the sales
table, end users cannot see the new data. Once the exchange has occurred, then any
end user query accessing the sales table will immediately be able to see the
sales_01_2001 data.
Partitioning is useful not only for adding new data but also for removing and
archiving data. Many data warehouses maintain a rolling window of data. For
example, the data warehouse stores the most recent 36 months of sales data. Just
as a new partition can be added to the sales table (as described earlier), an old
partition can be quickly (and independently) removed from the sales table. These
two benefits (reduced resources utilization and minimal end-user impact) are just as
pertinent to removing a partition as they are to adding a partition.
Removing data from a partitioned table does not necessarily mean that the old data
is physically deleted from the database. There are two alternatives for removing old
data from a partitioned table. First, you can physically delete all data from the
database by dropping the partition containing the old data, thus freeing the
allocated space:
ALTER TABLE sales DROP PARTITION sales_01_1998;
Also, you can exchange the old partition with an empty table of the same structure;
this empty table is created in the same way as in steps 1 and 2 of the load process.
Note that the old data still exists as the exchanged, nonpartitioned table
sales_archive_01_1998.
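For example, a sketch of this alternative, assuming the empty table has already been
created as sales_archive_01_1998:
ALTER TABLE sales EXCHANGE PARTITION sales_01_1998
WITH TABLE sales_archive_01_1998 INCLUDING INDEXES WITHOUT VALIDATION;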
If the partitioned table was setup in a way that every partition is stored in a
separate tablespace, you can archive (or transport) this table using Oracle
Database's transportable tablespace framework before dropping the actual data (the
tablespace). See "Transportation Using Transportable Tablespaces" on page 15-5 for
further details regarding transportable tablespaces.
In some situations, you might not want to drop the old data immediately, but keep
it as part of the partitioned table; although the data is no longer of main interest,
there are still potential queries accessing this old, read-only data. You can use
Oracle's data compression to minimize the space usage of the old data. We also
assume that at least one compressed partition is already part of the partitioned
table. See Chapter 3, "Physical Design in Data Warehouses" for a generic discussion
of table compression and Chapter 5, "Parallelism and Partitioning in Data
Warehouses" for partitioning and table compression.
Refresh Scenarios
A typical scenario might not only need to compress old data, but also to merge
several old partitions to reflect the granularity for a later backup of several merged
partitions. Let's assume that a backup (partition) granularity is on a quarterly base
for any quarter, where the oldest month is more than 36 months behind the most
recent month. In this case, we are therefore compressing and merging sales_01_
1998, sales_02_1998, and sales_03_1998 into a new, compressed partition
sales_q1_1998.
1. Create the new merged partition in parallel another tablespace. The partition
will be compressed as part of the MERGE operation:
ALTER TABLE sales MERGE PARTITIONS sales_01_1998, sales_02_1998, sales_03_1998
INTO PARTITION sales_q1_1998 TABLESPACE archive_q1_1998
COMPRESS UPDATE GLOBAL INDEXES PARALLEL 4;
2. The partition MERGE operation invalidates the local indexes for the new merged
partition. We therefore have to rebuild them:
ALTER TABLE sales MODIFY PARTITION sales_q1_1998
REBUILD UNUSABLE LOCAL INDEXES;
Alternatively, you can choose to create the new compressed table outside the
partitioned table and exchange it back. The performance and the temporary space
consumption is identical for both methods:
1. Create an intermediate table to hold the new merged information. The
following statement inherits all NOT NULL constraints from the origin table by
default:
CREATE TABLE sales_q1_1998_out TABLESPACE archive_q1_1998 NOLOGGING COMPRESS
PARALLEL 4 AS SELECT * FROM sales
WHERE time_id >= TO_DATE('01-JAN-1998','dd-mon-yyyy')
AND time_id < TO_DATE('01-APR-1998','dd-mon-yyyy');
2. Create the equivalent index structure for table sales_q1_1998_out as for
the existing table sales.
3. Prepare the existing table sales for the exchange with the new compressed table
sales_q1_1998_out. Because the table to be exchanged contains data
covered by three partitions, we have to create one matching partition
with the range boundaries we are looking for. You simply have to drop two
of the existing partitions. Note that you have to drop the lower two partitions
sales_01_1998 and sales_02_1998; the lower boundary of a range
partition is always defined by the upper (exclusive) boundary of the previous
partition:
ALTER TABLE sales DROP PARTITION sales_01_1998;
ALTER TABLE sales DROP PARTITION sales_02_1998;
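After the two DROP operations, the remaining partition sales_03_1998 covers the
complete first quarter and can be exchanged with the compressed table, as in the
following sketch:
ALTER TABLE sales EXCHANGE PARTITION sales_03_1998
WITH TABLE sales_q1_1998_out INCLUDING INDEXES WITHOUT VALIDATION
UPDATE GLOBAL INDEXES;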
Both methods apply to slightly different business scenarios: Using the MERGE
PARTITION approach invalidates the local index structures for the affected
partition, but it keeps all data accessible all the time. Any attempt to access the
affected partition through one of the unusable index structures raises an error. The
limited availability time is approximately the time for re-creating the local bitmap
index structures. In most cases this can be neglected, since this part of the
partitioned table shouldn't be touched too often.
Refresh Scenario 1
Data is loaded daily. However, the data warehouse contains two years of data, so
that partitioning by day might not be desired.
The solution is to partition by week or month (as appropriate). Use INSERT to add
the new data to an existing partition. The INSERT operation only affects a single
partition, so the benefits described previously remain intact. The INSERT operation
could occur while the partition remains a part of the table. Inserts into a single
partition can be parallelized:
INSERT /*+ APPEND*/ INTO sales PARTITION (sales_01_2001)
SELECT * FROM new_sales;
Refresh Scenario 2
New data feeds, although consisting primarily of data for the most recent day,
week, and month, also contain some data from previous time periods.
Solution 1 Use parallel SQL operations (such as CREATE TABLE ... AS SELECT) to
separate the new data from the data in previous time periods. Process the old data
separately using other techniques.
New data feeds are not solely time based. You can also feed new data into a data
warehouse with data from multiple operational systems on a business need basis.
For example, the sales data from direct channels may come into the data warehouse
separately from the data from indirect channels. For business reasons, it may
furthermore make sense to keep the direct and indirect data in separate partitions.
As a typical scenario, suppose that there is a table called new_sales that contains
both inserts and updates that will be applied to the sales table. When designing
the entire data warehouse load process, it was determined that the new_sales
table would contain records with the following semantics:
■ If a given sales_transaction_id of a record in new_sales already exists
in sales, then update the sales table by adding the sales_dollar_amount
and sales_quantity_sold values from the new_sales table to the existing
row in the sales table.
■ Otherwise, insert the entire new record from the new_sales table into the
sales table.
This UPDATE-ELSE-INSERT operation is often called a merge. A merge can be
executed using one SQL statement.
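A sketch of such a merge, assuming for simplicity that both tables contain only the
three columns mentioned here:
MERGE INTO sales s
USING new_sales n
ON (s.sales_transaction_id = n.sales_transaction_id)
WHEN MATCHED THEN UPDATE SET
  s.sales_dollar_amount = s.sales_dollar_amount + n.sales_dollar_amount,
  s.sales_quantity_sold = s.sales_quantity_sold + n.sales_quantity_sold
WHEN NOT MATCHED THEN INSERT
  (sales_transaction_id, sales_dollar_amount, sales_quantity_sold)
  VALUES (n.sales_transaction_id, n.sales_dollar_amount, n.sales_quantity_sold);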
In addition to using the MERGE statement for unconditional UPDATE ELSE INSERT
functionality into a target table, you can also use it to:
■ Perform an UPDATE only or INSERT only statement.
■ Apply additional WHERE conditions for the UPDATE or INSERT portion of the
MERGE statement.
■ The UPDATE operation can even delete rows if a specific condition yields true.
When the INSERT clause is omitted, Oracle performs a regular join of the source
and the target tables. When the UPDATE clause is omitted, Oracle performs an
antijoin of the source and the target tables. This makes the join between the source
and target table more efficient.
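As a combined sketch of the conditional UPDATE and INSERT described next,
assuming a source table new_product with columns prod_id, prod_status, and
prod_new_status:
MERGE INTO products p
USING new_product s
ON (p.prod_id = s.prod_id)
WHEN MATCHED THEN UPDATE
  SET p.prod_status = s.prod_new_status
  WHERE p.prod_status <> 'OBSOLETE'
WHEN NOT MATCHED THEN INSERT
  (prod_id, prod_status)
  VALUES (s.prod_id, s.prod_new_status)
  WHERE s.prod_status <> 'OBSOLETE';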
This shows how the UPDATE operation would be skipped if the condition P.PROD_
STATUS <> "OBSOLETE" is not true. The condition predicate can refer to both the
target and the source table.
This example shows that the INSERT operation would be skipped if the condition
S.PROD_STATUS <> "OBSOLETE" is not true, and INSERT will only occur if the
condition is true. The condition predicate can refer to the source table only. This
predicate would be most likely a column filter.
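The DELETE clause described next can be sketched as follows, again assuming the
new_product source table:
MERGE INTO products d
USING new_product s
ON (d.prod_id = s.prod_id)
WHEN MATCHED THEN UPDATE
  SET d.prod_status = s.prod_new_status
  DELETE WHERE (d.prod_status = 'OBSOLETE');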
Thus when a row is updated in products, Oracle checks the delete condition
D.PROD_STATUS = "OBSOLETE", and deletes the row if the condition yields true.
The DELETE operation is not the same as that of a standalone DELETE statement.
Only rows in the destination table of the MERGE can be deleted, and the only rows
that will be affected by the DELETE are the ones that are updated by this MERGE
statement. Thus, even if a given row of the destination table meets the delete
condition, it will not be deleted if it does not join under the ON clause condition.
If you specify join conditions that always evaluate to FALSE, for example 1=0, such
MERGE statements will be optimized and the join condition will be suppressed.
MERGE INTO Products P -- Destination table
USING New_Product S -- Source/Delta table
ON (1 = 0) -- Search/Join condition
WHEN NOT MATCHED THEN -- insert if no join
INSERT (PROD_ID, PROD_STATUS) VALUES (S.PROD_ID, S.PROD_NEW_STATUS);
Purging Data
Occasionally, it is necessary to remove large amounts of data from a data
warehouse. A very common scenario is the rolling window discussed previously, in
which older data is rolled out of the data warehouse to make room for new data.
However, sometimes other data might need to be removed from a data warehouse.
Suppose that a retail company has previously sold products from MS Software,
and that MS Software has subsequently gone out of business. The business users
of the warehouse may decide that they are no longer interested in seeing any data
related to MS Software, so this data should be deleted.
One approach to removing a large volume of data is to use parallel delete as shown
in the following statement:
DELETE FROM sales WHERE sales_product_id IN (SELECT product_id
FROM product WHERE product_category = 'MS Software');
This SQL statement will spawn one parallel process for each partition. This
approach will be much more efficient than a serial DELETE statement, and none of
the data in the sales table will need to be moved. However, this approach also has
some disadvantages. When removing a large percentage of rows, the DELETE
statement will leave many empty row-slots in the existing partitions. If new data is
being loaded using a rolling window technique (or is being loaded using
direct-path INSERT or load), then this storage space will not be reclaimed.
Moreover, even though the DELETE statement is parallelized, there might be more
efficient methods. An alternative method is to re-create the entire sales table,
keeping the data for all product categories except MS Software.
CREATE TABLE sales2 NOLOGGING PARALLEL (DEGREE 8)
-- add a PARTITION clause here if the new table should be partitioned
AS SELECT sales.* FROM sales, product
WHERE sales.sales_product_id = product.product_id
AND product_category <> 'MS Software';
-- re-create indexes, constraints, and so on
DROP TABLE sales;
RENAME sales2 TO sales;
This approach may be more efficient than a parallel delete. However, it is also costly
in terms of the amount of disk space, because the sales table must effectively be
instantiated twice.
An alternative method to utilize less space is to re-create the sales table one
partition at a time:
CREATE TABLE sales_temp AS SELECT * FROM sales WHERE 1=0;
INSERT INTO sales_temp
SELECT sales.* FROM sales PARTITION (sales_99jan), product
WHERE sales.sales_product_id = product.product_id
AND product_category <> 'MS Software';
<create appropriate indexes and constraints on sales_temp>
ALTER TABLE sales EXCHANGE PARTITION sales_99jan WITH TABLE sales_temp;
Performing a refresh operation requires temporary space to rebuild the indexes and
can require additional space for performing the refresh operation itself. Some sites
might prefer not to refresh all of their materialized views at the same time: as soon
as some underlying detail data has been updated, all materialized views using this
data will become stale. Therefore, if you defer refreshing your materialized views,
you can either rely on your chosen rewrite integrity level to determine whether or
not a stale materialized view can be used for query rewrite, or you can temporarily
disable query rewrite with an ALTER SYSTEM SET QUERY_REWRITE_ENABLED =
FALSE statement. After refreshing the materialized views, you can re-enable query
rewrite as the default for all sessions in the current database instance by specifying
ALTER SYSTEM SET QUERY_REWRITE_ENABLED = TRUE. Refreshing a
materialized view automatically updates all of its indexes. In the case of full refresh,
this requires temporary sort space to rebuild all indexes during refresh. This is
because the full refresh truncates or deletes the table before inserting the new full
data volume. If insufficient temporary space is available to rebuild the indexes, then
you must explicitly drop each index or mark it UNUSABLE prior to performing the
refresh operation.
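For example, a minimal sketch of this approach, where the index name is hypothetical:
ALTER INDEX cust_mth_sales_mv_idx UNUSABLE;
-- perform the complete refresh, then rebuild the index
ALTER INDEX cust_mth_sales_mv_idx REBUILD;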
If you anticipate performing insert, update or delete operations on tables referenced
by a materialized view concurrently with the refresh of that materialized view, and
that materialized view includes joins and aggregation, Oracle recommends you use
ON COMMIT fast refresh rather than ON DEMAND fast refresh.
Complete Refresh
A complete refresh occurs when the materialized view is initially defined as BUILD
IMMEDIATE, unless the materialized view references a prebuilt table. For
materialized views using BUILD DEFERRED, a complete refresh must be requested
before it can be used for the first time. A complete refresh may be requested at any
time during the life of any materialized view. The refresh involves reading the detail
tables to compute the results for the materialized view. This can be a very
time-consuming process, especially if there are huge amounts of data to be read and
processed. Therefore, you should always consider the time required to process a
complete refresh before requesting it.
There are, however, cases when the only refresh method available for an already
built materialized view is complete refresh because the materialized view does not
satisfy the conditions specified in the following section for a fast refresh.
Fast Refresh
Most data warehouses have periodic incremental updates to their detail data. As
described in "Materialized View Schema Design" on page 8-8, you can use the
SQL*Loader or any bulk load utility to perform incremental loads of detail data.
Fast refresh of your materialized views is usually efficient, because instead of
having to recompute the entire materialized view, the changes are applied to the
existing data. Thus, processing only the changes can result in a very fast refresh
time.
ON COMMIT Refresh
A materialized view can be refreshed automatically using the ON COMMIT method.
Whenever a transaction that has updated the tables on which a materialized view is
defined commits, those changes are automatically reflected in the materialized view.
The advantage of using this approach is that you never have to remember to refresh
the materialized view. The only disadvantage is that the time required to complete the
commit will be slightly longer because of the extra processing involved. However, in
a data warehouse, this should not be an issue because there are unlikely to be
concurrent processes trying to update the same table.
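For example, ON COMMIT fast refresh is specified when the materialized view is
created. The following is only a minimal sketch: the materialized view name is
hypothetical, and it assumes that materialized view logs with the required columns
already exist on the sales and times detail tables.
CREATE MATERIALIZED VIEW monthly_sales_oncommit_mv
  REFRESH FAST ON COMMIT
  ENABLE QUERY REWRITE AS
  SELECT t.calendar_month_desc,
         SUM(s.amount_sold) AS dollars,
         COUNT(*) AS cnt_all,
         COUNT(s.amount_sold) AS cnt_amt
  FROM   sales s, times t
  WHERE  s.time_id = t.time_id
  GROUP BY t.calendar_month_desc;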
Three refresh procedures are available in the DBMS_MVIEW package for performing
ON DEMAND refresh. Each has its own unique set of parameters.
Multiple materialized views can be refreshed at the same time, and they do not all
have to use the same refresh method. To give them different refresh methods,
specify multiple method codes in the same order as the list of materialized views
(without commas). For example, the following specifies that cal_month_sales_
mv be completely refreshed and fweek_pscat_sales_mv receive a fast refresh:
DBMS_MVIEW.REFRESH('CAL_MONTH_SALES_MV, FWEEK_PSCAT_SALES_MV', 'CF', '',
TRUE, FALSE, 0,0,0, FALSE);
If the refresh method is not specified, the default refresh method as specified in the
materialized view definition will be used.
If the atomic_refresh parameter is set to TRUE, then all refreshes are done in one
transaction. If set to FALSE, then the refresh of each specified materialized view is
done in a separate transaction, and Oracle can optimize the refresh by using parallel
DML and truncate DDL on the materialized views.
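For example, the following sketch requests a non-atomic complete refresh of one of
the materialized views used earlier, allowing the truncate and parallel DML
optimizations (named parameter notation is used for clarity):
EXECUTE DBMS_MVIEW.REFRESH('CAL_MONTH_SALES_MV', method => 'C', atomic_refresh => FALSE);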
An example of refreshing all materialized views is the following:
DBMS_MVIEW.REFRESH_ALL_MVIEWS(failures,'C','', TRUE, FALSE);
To perform a full refresh on all materialized views that reference the customers
table, specify:
DBMS_MVIEW.REFRESH_DEPENDENT(failures, 'CUSTOMERS', 'C', '', FALSE, FALSE );
To obtain the list of materialized views that are directly dependent on a given object
(table or materialized view), use the procedure DBMS_MVIEW.GET_MV_DEPENDENCIES.
This is useful for determining the dependent materialized views of a given table, or
for deciding the order in which to refresh nested materialized views.
DBMS_MVIEW.GET_MV_DEPENDENCIES(mvlist IN VARCHAR2, deplist OUT VARCHAR2)
The input to this procedure is the name or names of the materialized view, and the
output is a comma-separated list of the materialized views that are defined on it. For
example, the following call:
DBMS_MVIEW.GET_MV_DEPENDENCIES('JOHN.SALES_REG, SCOTT.PROD_TIME', deplist)
populates deplist with the list of materialized views defined on the input
arguments. For example:
deplist <= 'JOHN.SUM_SALES_WEST, JOHN.SUM_SALES_EAST, SCOTT.SUM_PROD_MONTH'
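Because deplist is an OUT parameter, the procedure is typically called from a PL/SQL
block, as in this minimal sketch (the materialized view names are taken from the
example above):
SET SERVEROUTPUT ON
DECLARE
  deplist VARCHAR2(4000);
BEGIN
  DBMS_MVIEW.GET_MV_DEPENDENCIES('JOHN.SALES_REG, SCOTT.PROD_TIME', deplist);
  DBMS_OUTPUT.PUT_LINE(deplist);
END;
/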
Monitoring a Refresh
While a refresh job is running, you can query the V$SESSION_LONGOPS view to
monitor the progress of each materialized view being refreshed.
SELECT * FROM V$SESSION_LONGOPS;
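To restrict the output to operations that are still in progress, a query such as the
following sketch can be used (all columns shown are standard V$SESSION_LONGOPS
columns):
SELECT sid, serial#, opname, target, sofar, totalwork,
       ROUND(sofar/totalwork*100, 1) AS pct_complete
FROM   V$SESSION_LONGOPS
WHERE  totalwork > 0 AND sofar <> totalwork;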
Scheduling Refresh
Very often you will have multiple materialized views in the database. Some of these
can be computed by rewriting against others. This is very common in a data
warehousing environment where you may have nested materialized views or
materialized views at different levels of some hierarchy.
In such cases, you should create the materialized views as BUILD DEFERRED, and
then issue one of the refresh procedures in the DBMS_MVIEW package to refresh all the
materialized views. Oracle Database will compute the dependencies and refresh the
materialized views in the right order. Consider the example of a complete
hierarchical cube described in "Examples of Hierarchical Cube Materialized Views"
on page 20-32. Suppose all the materialized views have been created as BUILD
DEFERRED. Creating the materialized views as BUILD DEFERRED only creates the
metadata for all the materialized views. You can then call one of the refresh
procedures in the DBMS_MVIEW package to refresh all the materialized views in the
right order:
EXECUTE DBMS_MVIEW.REFRESH_DEPENDENT(list=>'SALES', method => 'C');
The procedure will refresh the materialized views in the order of their dependencies
(first sales_hierarchical_mon_cube_mv, followed by sales_
hierarchical_qtr_cube_mv, then sales_hierarchical_yr_cube_mv, and so on).
■ You can refresh your materialized views fast after partition maintenance
operations on the detail tables. See "Partition Change Tracking" on page 9-2 for
details on enabling PCT for materialized views.
■ Partitioning the materialized view will also help refresh performance as refresh
can update the materialized view using parallel DML. For example, assume
that the detail tables and materialized view are partitioned and have a parallel
clause. The following sequence would enable Oracle to parallelize the refresh of
the materialized view.
1. Bulk load into the detail table.
2. Enable parallel DML with an ALTER SESSION ENABLE PARALLEL DML
statement.
3. Refresh the materialized view.
■ For refresh using DBMS_MVIEW.REFRESH, set the parameter atomic_refresh
to FALSE.
■ For COMPLETE refresh, this will TRUNCATE to delete existing rows in the
materialized view, which is faster than a delete.
■ For PCT refresh, if the materialized view is partitioned appropriately, this
will use TRUNCATE PARTITION to delete rows in the affected partitions of
the materialized view, which is faster than a delete.
■ For FAST or FORCE refresh, if COMPLETE or PCT refresh is chosen, this will
be able to use the TRUNCATE optimizations described earlier.
■ When using DBMS_MVIEW.REFRESH with JOB_QUEUES, remember to set
atomic to FALSE. Otherwise, JOB_QUEUES will not get used. Set the number
of job queue processes greater than the number of processors.
If job queues are enabled and there are many materialized views to refresh, it is
faster to refresh all of them in a single command than to call them individually.
■ Use REFRESH FORCE to ensure that a materialized view is refreshed and can
therefore definitely be used for query rewrite. The best refresh method will be chosen.
If a fast refresh cannot be done, a complete refresh will be performed.
■ Refresh all the materialized views in a single procedure call. This gives Oracle
an opportunity to schedule refresh of all the materialized views in the right
order, taking into account dependencies imposed by nested materialized views
and potential for efficient refresh by using query rewrite against other
materialized views.
3. Commit
If many updates are needed, try to group them all into one transaction because
refresh will be performed just once at commit time, rather than after each update.
When you use the DBMS_MVIEW package to refresh a number of materialized views
containing only joins with the ATOMIC parameter set to TRUE, refresh performance
may degrade if you disable parallel DML.
In a data warehousing environment, assuming that the materialized view has a
parallel clause, the following sequence of steps is recommended:
1. Bulk load into the fact table
2. Enable parallel DML with an ALTER SESSION ENABLE PARALLEL DML statement
3. Refresh the materialized view
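A minimal sketch of this sequence follows; the external table new_sales_ext and the
materialized view sales_mv are hypothetical names.
-- 1. Bulk load into the fact table (direct-path insert)
INSERT /*+ APPEND */ INTO sales SELECT * FROM new_sales_ext;
COMMIT;
-- 2. Enable parallel DML
ALTER SESSION ENABLE PARALLEL DML;
-- 3. Refresh the materialized view (non-atomic, so truncate and parallel DML can be used)
EXECUTE DBMS_MVIEW.REFRESH('SALES_MV', method => 'F', atomic_refresh => FALSE);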
These procedures have the following behavior when used with nested materialized
views:
■ If REFRESH is applied to a materialized view my_mv that is built on other
materialized views, then my_mv will be refreshed with respect to the current
contents of the other materialized views (that is, the other materialized views
will not be made fresh first) unless you specify nested => TRUE.
■ If REFRESH_DEPENDENT is applied to materialized view my_mv, then only
materialized views that directly depend on my_mv will be refreshed (that is, a
materialized view that depends on a materialized view that depends on my_mv
will not be refreshed) unless you specify nested => TRUE.
■ If REFRESH_ALL_MVIEWS is used, the order in which the materialized views
will be refreshed is guaranteed to respect the dependencies between nested
materialized views.
■ GET_MV_DEPENDENCIES provides a list of the immediate (or direct)
materialized view dependencies for an object.
The form of a maintenance marker column, column MARKER in the example, must
be numeric_or_string_literal AS column_alias, where each UNION ALL
member has a distinct value for numeric_or_string_literal.
As can be seen from the partial sample output from EXPLAIN_MVIEW, any partition
maintenance operation performed on the sales table will allow PCT fast refresh.
However, PCT is not possible after partition maintenance operations or updates to
the products table as there is insufficient information contained in cust_mth_
sales_mv for PCT refresh to be possible. Note that the times table is not
partitioned and hence can never allow for PCT refresh. Oracle will apply PCT
refresh if it can determine that the materialized view has sufficient information to
support PCT for all the updated tables.
1. Suppose at some later point, a SPLIT operation of one partition in the sales
table becomes necessary.
ALTER TABLE SALES
SPLIT PARTITION month3 AT (TO_DATE('05-02-1998', 'DD-MM-YYYY'))
INTO (PARTITION month3_1 TABLESPACE summ,
PARTITION month3 TABLESPACE summ);
Fast refresh will automatically do a PCT refresh as it is the only fast refresh possible
in this scenario. However, fast refresh will not occur if a partition maintenance
operation occurs when any update has taken place to a table on which PCT is not
enabled. This is shown in "PCT Fast Refresh Scenario 2".
"PCT Fast Refresh Scenario 1" would also be appropriate if the materialized view
was created using the PMARKER clause as illustrated in the following:
CREATE MATERIALIZED VIEW cust_sales_marker_mv
BUILD IMMEDIATE
REFRESH FAST ON DEMAND
ENABLE QUERY REWRITE AS
SELECT DBMS_MVIEW.PMARKER(s.rowid) s_marker, SUM(s.quantity_sold),
SUM(s.amount_sold), p.prod_name, t.calendar_month_name, COUNT(*),
COUNT(s.quantity_sold), COUNT(s.amount_sold)
FROM sales s, products p, times t
WHERE s.time_id = t.time_id AND s.prod_id = p.prod_id
GROUP BY DBMS_MVIEW.PMARKER(s.rowid),
p.prod_name, t.calendar_month_name;
6. Refresh cust_mth_sales_mv.
EXECUTE DBMS_MVIEW.REFRESH('CUST_MTH_SALES_MV', 'F',
'',TRUE,FALSE,0,0,0,FALSE);
ORA-12052: cannot fast refresh materialized view SH.CUST_MTH_SALES_MV
The materialized view is not fast refreshable because DML has occurred to a table
on which PCT fast refresh is not possible. To avoid this occurring, Oracle
recommends performing a fast refresh immediately after any partition maintenance
operation on detail tables for which partition tracking fast refresh is available.
If the situation in "PCT Fast Refresh Scenario 2" occurs, there are two possibilities;
perform a complete refresh or switch to the CONSIDER FRESH option outlined in
the following, if suitable. However, it should be noted that CONSIDER FRESH and
partition change tracking fast refresh are not compatible. Once the ALTER
MATERIALIZED VIEW cust_mth_sales_mv CONSIDER FRESH statement has
been issued, PCT refresh will not longer be applied to this materialized view, until a
complete refresh is done. Moreover, you should not use CONSIDER FRESH unless
you have taken manual action to ensure that the materialized view is indeed fresh.
A common situation in a data warehouse is the use of rolling windows of data. In
this case, the detail table and the materialized view may contain, say, the last 12
months of data. Every month, new data for a month is added to the table and the
oldest month is deleted (or maybe archived). PCT refresh provides a very efficient
mechanism to maintain the materialized view in this case.
3. Now, assuming the materialized view satisfies all conditions for PCT refresh, refresh it:
EXECUTE DBMS_MVIEW.REFRESH('CUST_MTH_SALES_MV', 'F', '', TRUE,
FALSE,0,0,0,FALSE);
Fast refresh will automatically detect that PCT is available and perform a PCT
refresh.
The materialized view is now considered stale and requires a refresh because of
the partition operation. However, as the detail table no longer contains all the
data associated with the partition, fast refresh cannot be attempted.
2. Therefore, alter the materialized view to tell Oracle to consider it fresh.
ALTER MATERIALIZED VIEW cust_mth_sales_mv CONSIDER FRESH;
Because the fast refresh detects that only INSERT statements occurred against
the sales table it will update the materialized view with the new data.
However, the status of the materialized view will remain UNKNOWN. The only
way to return the materialized view to FRESH status is with a complete refresh,
which will also remove the historical data from the materialized view.
Change Data Capture efficiently identifies and captures data that has been added
to, updated in, or removed from, Oracle relational tables and makes this change
data available for use by applications or individuals. Change Data Capture is
provided as a database component beginning with Oracle9i.
This chapter describes Change Data Capture in the following sections:
■ Overview of Change Data Capture
■ Change Sources and Modes of Data Capture
■ Change Sets
■ Change Tables
■ Getting Information About the Change Data Capture Environment
■ Preparing to Publish Change Data
■ Publishing Change Data
■ Subscribing to Change Data
■ Considerations for Asynchronous Change Data Capture
■ Managing Published Data
■ Implementation and System Configuration
See PL/SQL Packages and Types Reference for reference information about the Change
Data Capture publish and subscribe PL/SQL packages.
Moreover, you can obtain the deleted rows and old versions of updated rows with
the following query:
SELECT * FROM old_version
MINUS SELECT * FROM new_version;
■ There is no way to determine which changes were made as part of the same
transaction. For example, suppose a sales manager creates a special discount to
close a deal. The fact that the creation of the discount and the creation of the
sale occurred as part of the same transaction cannot be captured, unless the
source database is specifically designed to do so.
Change-value selection involves capturing the data on the source database by
selecting the new and changed data from the source tables based on the value of a
specific column. For example, suppose the source table has a LAST_UPDATE_DATE
column. To capture changes, you base your selection from the source table on the
LAST_UPDATE_DATE column value.
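For example, a change-value selection query might look like the following sketch; the
orders source table and the bind variable holding the time of the previous capture are
assumptions.
SELECT *
FROM   orders
WHERE  last_update_date > :last_capture_time;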
However, there are also several problems with this method:
■ The overhead of capturing the change data must be borne on the source
database, and you must run potentially expensive queries against the source
table on the source database. The need for these queries may force you to add
indexes that would otherwise be unneeded. There is no way to offload this
overhead to the staging database.
■ This method is no better at capturing intermediate values than the table
differencing method. If the price in the products table fluctuates, you will not
be able to capture all the intermediate values, or even tell if the price had
changed, if the ending value is the same as it was the last time that you
captured change data.
■ This method is also no better than the table differencing method at capturing
which data changes were made together in the same transaction. If you need to
capture information concerning which changes occurred together in the same
transaction, you must include specific designs for this purpose in your source
database.
■ The granularity of the change-value column may not be fine enough to
uniquely identify the new and changed rows. For example, suppose the
following:
– You capture data changes using change-value selection on a date column
such as LAST_UPDATE_DATE.
– The capture happens at a particular instant in time, 14-FEB-2003 17:10:00.
– Additional updates occur to the table during the same second that you
performed your capture.
When you next capture data changes, you will select rows with a LAST_
UPDATE_DATE strictly after 14-FEB-2003 17:10:00, and thereby miss the changes
that occurred during the remainder of that second.
To use change-value selection, you either have to accept that anomaly, add an
artificial change-value column with the granularity you need, or lock out
changes to the source table during the capture process, thereby further
burdening the performance of the source database.
■ You have to design your source database in advance with this capture
mechanism in mind – all tables from which you wish to capture change data
must have a change-value column. If you want to build a data warehouse with
data sources from legacy systems, those legacy systems may not supply the
necessary change-value columns you need.
Change Data Capture does not depend on expensive and cumbersome table
differencing or change-value selection mechanisms. Instead, it captures the change
data resulting from INSERT, UPDATE, and DELETE operations made to user tables.
The change data is then stored in a relational table called a change table, and the
change data is made available to applications or individuals in a controlled way.
Publisher
The publisher is usually a database administrator (DBA) who creates and maintains
the schema objects that make up the Change Data Capture system. Typically, a
publisher deals with two databases:
■ Source database
This is the production database that contains the data of interest. Its associated
tables are referred to as the source tables.
■ Staging database
This is the database where the change data capture takes place. Depending on
the capture mode that the publisher uses, the staging database can be the same
as, or different from, the source database. The following Change Data Capture
objects reside on the staging database:
– Change table
A change table is a relational table that contains change data for a single
source table. To subscribers, a change table is known as a publication.
– Change set
A change set is a set of change data that is guaranteed to be transactionally
consistent. It contains one or more change tables.
– Change source
The change source is a logical representation of the source database. It
contains one or more change sets.
The publisher performs these tasks:
■ Determines the source databases and tables from which the subscribers are
interested in viewing change data, and the mode (synchronous or
asynchronous) in which to capture the change data.
■ Uses the Oracle-supplied package, DBMS_CDC_PUBLISH, to set up the system
to capture change data from the source tables of interest.
■ Allows subscribers to have controlled access to the change data in the change
tables by using the SQL GRANT and REVOKE statements to grant and revoke the
SELECT privilege on change tables for users and roles. (Keep in mind, however,
that subscribers use views, not change tables directly, to access change data.)
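For example, the publisher might grant a subscriber access to one of the change
tables with a statement such as the following sketch (the change table sales_ct is
from Figure 16–1; the publisher schema cdcpub and the subscriber account
subscriber1 are the names used in later examples):
GRANT SELECT ON cdcpub.sales_ct TO subscriber1;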
In Figure 16–1, the publisher determines that subscribers are interested in viewing
change data from the HQ source database. In particular, subscribers are interested in
change data from the SH.SALES and SH.PROMOTIONS source tables.
The publisher creates a change source HQ_SRC on the DW staging database, a change
set, SH_SET, and two change tables: sales_ct and promo_ct. The sales_ct
change table contains all the columns from the source table, SH.SALES. For the
promo_ct change table, however, the publisher has decided to exclude the PROMO_
COST column.
Subscribers
The subscribers are consumers of the published change data. A subscriber performs
the following tasks:
■ Uses the Oracle supplied package, DBMS_CDC_SUBSCRIBE, to:
– Create subscriptions
A subscription controls access to the change data from one or more source
tables of interest within a single change set. A subscription contains one or
more subscriber views.
A subscriber view is a view that specifies the change data from a specific
publication in a subscription. The subscriber is restricted to seeing change
data that the publisher has published and has granted the subscriber access
to use. See "Subscribing to Change Data" on page 16-42 for more
information on choosing a method for specifying a subscriber view.
– Notify Change Data Capture when ready to receive a set of change data
A subscription window defines the time range of rows in a publication that
the subscriber can currently see in subscriber views. The oldest row in the
window is called the low boundary; the newest row in the window is
called the high boundary. Each subscription has its own subscription
window that applies to all of its subscriber views.
– Notify Change Data Capture when finished with a set of change data
■ Uses SELECT statements to retrieve change data from the subscriber views.
A subscriber has the privileges of the user account under which the subscriber is
running, plus any additional privileges that have been granted to the subscriber.
In Figure 16–2, the subscriber is interested in a subset of columns that the publisher
(in Figure 16–1) has published. Note that the publications shown in Figure 16–2, are
represented as change tables in Figure 16–1; this reflects the different terminology
used by subscribers and publishers, respectively.
The subscriber creates a subscription, sales_promos_list and two subscriber
views (spl_sales and spl_promos) on the SH_SET change set on the DW
staging database. Within each subscriber view, the subscriber includes a subset of
the columns that were made available by the publisher. Note that because the
publisher did not create a change table that includes the PROMO_COST column,
there is no way for the subscriber to view change data for that column.
(Figure 16–2 shows the DW staging database with change set SH_SET, which contains
the publication on SH.SALES and the publication on SH.PROMOTIONS.)
Note: Oracle provides the previously listed benefits only when the
subscriber accesses change data through a subscriber view.
Synchronous
This mode uses triggers on the source database to capture change data. It has no
latency because the change data is captured continuously and in real time on the
source database. The change tables are populated when DML operations on the
source table are committed.
While the synchronous mode of Change Data Capture adds overhead to the source
database at capture time, this mode can reduce costs (as compared to attempting to
extract change data using table differencing or change-value selection) by simplifying
the extraction of change data.
There is a single, predefined synchronous change source, SYNC_SOURCE, that
represents the source database. This is the only synchronous change source. It
cannot be altered or dropped.
Change tables for this mode of Change Data Capture must reside locally in the
source database.
Figure 16–3 illustrates the synchronous configuration. Triggers executed after DML
operations occur on the source tables populate the change tables in the change sets
within the SYNC_SOURCE change source.
(Figure 16–3 shows the synchronous configuration: transactions on the source
database populate the change tables in the change sets within the SYNC_SOURCE
change source, and subscribers access the change data through subscriber views.)
Asynchronous
This mode captures change data after the changes have been committed to the
source database by using the database redo log files.
The asynchronous mode of Change Data Capture is dependent on the level of
supplemental logging enabled at the source database. Supplemental logging adds
redo logging overhead at the source database, so it must be carefully balanced with
the needs of the applications or individuals using Change Data Capture. See
"Asynchronous Change Data Capture and Supplemental Logging" on page 16-50
for information on supplemental logging.
There are two methods of capturing change data asynchronously, HotLog and
AutoLog, as described in the following sections:
HotLog
Change data is captured from the online redo log file on the source database. There
is a brief latency between the act of committing source table transactions and the
arrival of change data.
There is a single, predefined HotLog change source, HOTLOG_SOURCE, that
represents the current redo log files of the source database. This is the only HotLog
change source. It cannot be altered or dropped.
Change tables for this mode of Change Data Capture must reside locally in the
source database.
Figure 16–4, illustrates the asynchronous HotLog configuration. The Logwriter
Process (LGWR) records committed transactions in the online redo log files on the
source database. Change Data Capture uses Oracle Streams processes to
automatically populate the change tables in the change sets within the HOTLOG_
SOURCE change source as newly committed transactions arrive.
(Figure 16–4 shows the asynchronous HotLog configuration: transactions on the
source database feed the change sets within the HOTLOG_SOURCE change source
through Streams local capture.)
AutoLog
Change data is captured from a set of redo log files managed by log transport
services. Log transport services control the automated transfer of redo log files from
the source database to the staging database. Using database initialization
parameters (described in "Initialization Parameters for Asynchronous AutoLog
Publishing" on page 16-22), the publisher configures log transport services to copy
the redo log files from the source database system to the staging database system
and to automatically register the redo log files. Change sets are populated
automatically as new redo log files arrive. The degree of latency depends on
frequency of redo log switches on the source database.
There is no predefined AutoLog change source. The publisher provides information
about the source database to create an AutoLog change source. See "Performing
Asynchronous AutoLog Publishing" on page 16-35 for details.
Change sets for this mode of Change Data Capture can be remote from or local to
the source database. Typically, they are remote.
Figure 16–5 shows a typical Change Data Capture asynchronous AutoLog
configuration in which, when the log switches on the source database, archiver
processes archive the redo log file on the source database to the destination
specified by the LOG_ARCHIVE_DEST_1 parameter and copy the redo log file to the
staging database as specified by the LOG_ARCHIVE_DEST_2 parameter. (Although
the image presents these parameters as LOG_ARCHIVE_DEST_1 and LOG_
ARCHIVE_DEST_2, the integer value in these parameter strings can be any value
between 1 and 10.)
Note that the archiver processes use Oracle Net to send redo data over the network
to the remote file server (RFS) process. Transmitting redo log files to a remote
destination requires uninterrupted connectivity through Oracle Net.
On the staging database, the RFS process writes the redo data to the copied log files
in the location specified by the value of the TEMPLATE attribute in the LOG_
ARCHIVE_DEST_2 parameter (specified in the source database initialization
parameter file). Then, Change Data Capture uses Oracle Streams downstream
capture to populate the change tables in the change sets within the AutoLog change
source.
(Figure 16–5 shows the asynchronous AutoLog configuration: LGWR writes source
table transactions to the online redo log files, ARCn archives them to
LOG_ARCHIVE_DEST_1 and ships redo over Oracle Net to the RFS process on the
staging database, where the change tables in the change set are populated and
accessed through subscriber views.)
Change Sets
A change set is a logical grouping of change data that is guaranteed to be
transactionally consistent and that can be managed as a unit. A change set is a
member of one (and only one) change source. A change source can contain one or
more change sets. Conceptually, a change set shares the same mode as its change
source. For example, an AutoLog change set is a change set contained in an
AutoLog change source.
When a publisher includes two or more change tables in the same change set,
subscribers can perform join operations across the tables represented within the
change set and be assured of transactional consistency.
Change Tables
A given change table contains the change data resulting from DML operations
performed on a given source table. A change table consists of two things: the
change data itself, which is stored in a database table; and the system metadata
necessary to maintain the change table, which includes control columns.
The publisher specifies the source columns that are to be included in the change
table. Typically, for a change table to contain useful data, the publisher needs to
include the primary key column in the change table along with any other columns
of interest to subscribers. For example, suppose subscribers are interested in
changes that occur to the UNIT_COST and the UNIT_PRICE columns in the
SH.COSTS table. If the publisher does not include the PROD_ID column in the
change table, subscribers will know only that the unit cost and unit price of some
products have changed, but will be unable to determine for which products these
changes have occurred.
There are optional and required control columns. The required control columns are
always included in a change table; the optional ones are included if specified by the
publisher when creating the change table. Control columns are managed by Change
Data Capture. See "Understanding Change Table Control Columns" on page 16-60
and "Understanding TARGET_COLMAP$ and SOURCE_COLMAP$ Values" on
page 16-62 for detailed information on control columns.
■ A view with the ALL prefix allows the user to display all the information from
the schema of the user issuing the query, as well as information from objects in
other schemas, if the current user has access to those objects by way of grants of
privileges or roles.
■ A view with the USER prefix allows the user to display all the information from
the schema of the user issuing the query without the use of additional special
privileges or roles.
Note: To look at all the views (those intended for both the
publisher and the subscriber), a user must have the SELECT_
CATALOG_ROLE privilege.
Table 16–2 Views Intended for Use by Change Data Capture Publishers
View Name                  Description
CHANGE_SOURCES             Describes existing change sources.
CHANGE_SETS                Describes existing change sets.
CHANGE_TABLES              Describes existing change tables.
DBA_SOURCE_TABLES          Describes all existing source tables in the database.
DBA_PUBLISHED_COLUMNS      Describes all published columns of source tables in the database.
DBA_SUBSCRIPTIONS          Describes all subscriptions.
DBA_SUBSCRIBED_TABLES      Describes all source tables to which any subscriber has subscribed.
DBA_SUBSCRIBED_COLUMNS     Describes the columns of source tables to which any subscriber has subscribed.
Table 16–3 Views Intended for Use by Change Data Capture Subscribers
View Name                  Description
ALL_SOURCE_TABLES          Describes all existing source tables accessible to the current user.
USER_SOURCE_TABLES         Describes all existing source tables owned by the current user.
ALL_PUBLISHED_COLUMNS      Describes all published columns of source tables accessible to the current user.
USER_PUBLISHED_COLUMNS     Describes all published columns of source tables owned by the current user.
ALL_SUBSCRIPTIONS          Describes all subscriptions accessible to the current user.
USER_SUBSCRIPTIONS         Describes all the subscriptions owned by the current user.
ALL_SUBSCRIBED_TABLES      Describes the source tables to which any subscription accessible to the current user has subscribed.
USER_SUBSCRIBED_TABLES     Describes the source tables to which the current user has subscribed.
ALL_SUBSCRIBED_COLUMNS     Describes the columns of source tables to which any subscription accessible to the current user has subscribed.
USER_SUBSCRIBED_COLUMNS    Describes the columns of source tables to which the current user has subscribed.
See Oracle Database Reference for complete information about these views.
This example creates a password file with 10 entries, where the password for SYS is
mypassword. For redo log file transmission to succeed, the password for the SYS
user account must be identical for the source and staging databases.
Table 16–5 Source Database Initialization Parameters for Asynchronous HotLog Publishing
Parameter Recommended Value
COMPATIBLE 10.1.0
JAVA_POOL_SIZE 50000000
JOB_QUEUE_PROCESSES 2
PARALLEL_MAX_SERVERS (current value) + (5 * (the number of change sets planned))
SESSIONS (current value) + (2 * (the number of change sets planned))
Table 16–6 Source Database Initialization Parameters for Asynchronous AutoLog Publishing
Parameter Recommended Value
COMPATIBLE 10.1.0
JAVA_POOL_SIZE 50000000
LOG_ARCHIVE_DEST_1¹ The directory specification on the system hosting the source database where the
archived redo log files are to be kept.
LOG_ARCHIVE_DEST_2¹ This parameter must include the SERVICE, ARCH or LGWR ASYNC, OPTIONAL,
NOREGISTER, and REOPEN attributes so that log transport services are configured
to copy the redo log files from the source database to the staging database. These
attributes are set as follows:
■ SERVICE specifies the network name of the staging database.
■ ARCH or LGWR ASYNC
ARCH specifies that the archiver process (ARCn) copy the redo log files to the
staging database after a source database log switch occurs.
LGWR ASYNC specifies that the log writer process (LGWR) copy redo data to
the staging database as the redo is generated on the source database. Note
that, the copied redo data becomes available to Change Data Capture only
after a source database log switch occurs.
■ OPTIONAL specifies that the copying of a redo log file to the staging database
need not succeed before the corresponding online redo log at the source
database can be overwritten. This is needed to avoid stalling operations on the
source database due to a transmission failure to the staging database. The
original redo log file remains available to the source database in either
archived or backed up form, if it is needed.
■ NOREGISTER specifies that the staging database location is not recorded in the
staging database control file.
■ REOPEN specifies the minimum number of seconds the archiver process
(ARCn) should wait before trying to access the staging database if a previous
attempt to access this location failed.
■ TEMPLATE defines a directory specification and a format template for the file
name used for the redo log files that are copied to the staging database.²
LOG_ARCHIVE_DEST_STATE_1¹ ENABLE
Indicates that log transport services can transmit archived redo log files to this destination.
LOG_ARCHIVE_DEST_STATE_2¹ ENABLE
Indicates that log transport services can transmit redo log files to this destination.
LOG_ARCHIVE_FORMAT "arch1_%s_%t_%r.dbf"
Specifies a format template for the default file name when archiving redo log files.2
The string value (arch1) and the file name extension (.dbf) do not have to be
exactly as specified here.
REMOTE_LOGIN_PASSWORDFILE SHARED
¹ The integer value in this parameter can be any value between 1 and 10. In this manual, the values 1 and 2 are used. For each
LOG_ARCHIVE_DEST_n parameter, there must be a corresponding LOG_ARCHIVE_DEST_STATE_n parameter that specifies
the same value for n.
² In the format template, %t corresponds to the thread number, %s corresponds to the sequence number, and %r corresponds
to the resetlogs ID. Together, these ensure that unique names are constructed for the copied redo log files.
Table 16–7 Staging Database Initialization Parameters for Asynchronous AutoLog Publishing
Parameter Recommended Value
COMPATIBLE 10.1.0
GLOBAL_NAMES TRUE
JAVA_POOL_SIZE 50000000
JOB_QUEUE_PROCESSES 2
PARALLEL_MAX_SERVERS (current value) + (5 * (the number of change sets planned))
PROCESSES (current value) + (7 * (the number of change sets planned))
REMOTE_LOGIN_PASSWORDFILE SHARED
For each change set, asynchronous Change Data Capture configures an underlying Oracle Streams
capture process and a Streams apply process, with an accompanying queue and
queue table. Each Streams configuration uses additional processes, parallel
execution servers, and memory.
Oracle Streams capture and apply processes each have a parallelism parameter that
is used to improve performance. When a publisher first creates a change set, its
capture parallelism value and apply parallelism value are each 1. If desired, a
publisher can increase one or both of these values using Streams interfaces.
If Oracle Streams capture parallelism and apply parallelism values are increased
after the change sets are created, the staging database DBA must adjust
initialization parameter values accordingly. Example 16–1 and Example 16–2
demonstrate how to obtain the capture parallelism and apply parallelism values for
change set CHICAGO_DAILY. By default, each parallelism value is 1, so the amount
by which a given parallelism value has been increased is the returned value minus
1.
The staging database DBA must adjust the staging database initialization
parameters as described in the following list to accommodate the parallel execution
servers and other processes and memory required for asynchronous Change Data
Capture:
■ PARALLEL_MAX_SERVERS
For each change set for which Oracle Streams capture or apply parallelism
values were increased, increase the value of this parameter by the sum of
increased Streams parallelism values.
For example, if the statement in Example 16–1 returns a value of 2, and the
statement in Example 16–2 returns a value of 3, then the staging database DBA
should increase the value of the PARALLEL_MAX_SERVERS parameter by (2-1)
+ (3-1), or 3 for the CHICAGO_DAILY change set. If the Streams capture or apply
parallelism values have increased for other change sets, increases for those
change sets must also be made.
■ PROCESSES
For each change set for which Oracle Streams capture or apply parallelism
values were changed, increase the value of this parameter by the sum of
increased Streams parallelism values. See the previous list item, PARALLEL_
MAX_SERVERS, for an example.
■ STREAMS_POOL_SIZE
For each change set for which Oracle Streams capture or apply parallelism
values were changed, increase the value of this parameter by (10MB * (the
increased capture parallelism value)) + (1MB * increased apply parallelism
value).
For example, if the statement in Example 16–1 returns a value of 2, and the
statement in Example 16–2 returns a value of 3, then the staging database DBA
should increase the value of the STREAMS_POOL_SIZE parameter by (10 MB *
(2-1) + 1MB * (3-1)), or 12MB for the CHICAGO_DAILY change set. If the Oracle
Streams capture or apply parallelism values have increased for other change
sets, increases for those change sets must also be made.
See Oracle Streams Concepts and Administration for more information on Streams
capture parallelism and apply parallelism values. See Oracle Database Reference
for more information about database initialization parameters.
Note that you cannot create change tables on source tables owned by SYS or SYSTEM,
because triggers will not fire and therefore changes will not be captured.
This example shows how to create a change set. If the publisher wants to use the
predefined SYNC_SET, he or she should skip Step 3 and specify SYNC_SET as the
change set name in the remaining steps.
This example assumes that the publisher and the source database DBA are two
different people.
Step 2 Source Database DBA: Create and grant privileges to the publisher.
The source database DBA creates a user (for example, cdcpub), to serve as the
Change Data Capture publisher and grants the necessary privileges to the publisher
so that he or she can perform the operations needed to create Change Data Capture
change sets and change tables on the source database, as described in"Creating a
User to Serve As a Publisher" on page 16-18. This example assumes that the
tablespace ts_cdcpub has already been created.
CREATE USER cdcpub IDENTIFIED BY cdcpub DEFAULT TABLESPACE ts_cdcpub
QUOTA UNLIMITED ON SYSTEM
QUOTA UNLIMITED ON SYSAUX;
GRANT CREATE SESSION TO cdcpub;
GRANT CREATE TABLE TO cdcpub;
GRANT CREATE TABLESPACE TO cdcpub;
GRANT UNLIMITED TABLESPACE TO cdcpub;
GRANT SELECT_CATALOG_ROLE TO cdcpub;
GRANT EXECUTE_CATALOG_ROLE TO cdcpub;
GRANT CONNECT, RESOURCE TO cdcpub;
BEGIN
DBMS_CDC_PUBLISH.CREATE_CHANGE_SET(
change_set_name => 'CHICAGO_DAILY',
description => 'Change set for job history info',
change_source_name => 'SYNC_SOURCE');
END;
/
The change set captures changes from the predefined change source SYNC_SOURCE.
Because begin_date and end_date parameters cannot be specified for
synchronous change sets, capture begins at the earliest available change data and
continues capturing change data indefinitely.
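The publisher then creates a change table within this change set by using the
DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE procedure. A minimal sketch of such a call
follows; the column list and parameter settings are assumptions modeled on the
asynchronous HotLog example later in this chapter.
BEGIN
DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE(
owner => 'cdcpub',
change_table_name => 'jobhist_ct',
change_set_name => 'CHICAGO_DAILY',
source_schema => 'HR',
source_table => 'JOB_HISTORY',
column_type_list => 'EMPLOYEE_ID NUMBER(6),START_DATE DATE,END_DATE DATE,
JOB_ID VARCHAR2(10), DEPARTMENT_ID NUMBER(4)',
capture_values => 'both',
rs_id => 'y',
row_id => 'n',
user_id => 'n',
timestamp => 'n',
object_id => 'n',
source_colmap => 'n',
target_colmap => 'y',
options_string => NULL);
END;
/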
This statement creates a change table named jobhist_ct within the change set
CHICAGO_DAILY. The column_type_list parameter identifies the columns
captured by the change table. The source_schema and source_table
parameters identify the schema and source table that reside in the source database.
The capture_values setting in the example indicates that for update operations,
the change data will contain two separate rows for each row that changed: one row
will contain the row values before the update occurred, and the other row will
contain the row values after the update occurred.
The Change Data Capture synchronous system is now ready for subscriber1 to
create subscriptions.
If you intend to capture all the column values in a row whenever a column in
that row is updated, you can use the following statement instead of listing each
column one-by-one in the ALTER TABLE statement. However, do not use this
form of the ALTER TABLE statement if all columns are not needed. Logging all
columns incurs more overhead than logging selected columns.
ALTER TABLE HR.JOB_HISTORY ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
Step 3 Source Database DBA: Create and grant privileges to the publisher.
The source database DBA creates a user, (for example, cdcpub), to serve as the
Change Data Capture publisher and grants the necessary privileges to the publisher
so that he or she can perform the underlying Oracle Streams operations needed to
create Change Data Capture change sets, and change tables on the source database,
as described in "Creating a User to Serve As a Publisher" on page 16-18. This
example assumes that the ts_cdcpub tablespace has already been created. For
example:
CREATE USER cdcpub IDENTIFIED BY cdcpub DEFAULT TABLESPACE ts_cdcpub
QUOTA UNLIMITED ON SYSTEM
QUOTA UNLIMITED ON SYSAUX;
GRANT CREATE SESSION TO cdcpub;
GRANT CREATE TABLE TO cdcpub;
GRANT CREATE TABLESPACE TO cdcpub;
GRANT UNLIMITED TABLESPACE TO cdcpub;
GRANT SELECT_CATALOG_ROLE TO cdcpub;
GRANT EXECUTE_CATALOG_ROLE TO cdcpub;
GRANT CREATE SEQUENCE TO cdcpub;
GRANT CONNECT, RESOURCE, DBA TO cdcpub;
Note that for HotLog Change Data Capture, the source database and the staging
database are the same database.
The source tables must be prepared for instantiation so that the underlying Oracle
Streams environment records the information it needs to capture each source table's
changes. The source table structure and the column datatypes
must be supported by Change Data Capture. See "Datatypes and Table Structures
Supported for Asynchronous Change Data Capture" on page 16-51 for more
information.
BEGIN
DBMS_CAPTURE_ADM.PREPARE_TABLE_INSTANTIATION(TABLE_NAME => 'hr.job_history');
END;
/
The change set captures changes from the predefined HOTLOG_SOURCE change
source.
Step 6 Staging Database Publisher: Create the change tables that will contain
the changes to the source tables.
The publisher uses the DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE procedure
on the staging database to create change tables.
The publisher creates one or more change tables for each source table to be
published, specifies which columns should be included, and specifies the
combination of before and after images of the change data to capture.
The following example creates a change table on the staging database that captures
changes made to a source table on the source database. The example uses the
sample table HR.JOB_HISTORY as the source table.
BEGIN
DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE(
owner => 'cdcpub',
change_table_name => 'job_hist_ct',
change_set_name => 'CHICAGO_DAILY',
source_schema => 'HR',
source_table => 'JOB_HISTORY',
column_type_list => 'EMPLOYEE_ID NUMBER(6),START_DATE DATE,END_DATE DATE,
JOB_ID VARCHAR2(10), DEPARTMENT_ID NUMBER(4)',
capture_values => 'both',
rs_id => 'y',
row_id => 'n',
user_id => 'n',
timestamp => 'n',
object_id => 'n',
source_colmap => 'n',
target_colmap => 'y',
options_string => 'TABLESPACE TS_CHICAGO_DAILY');
END;
/
This statement creates a change table named job_hist_ct within change set
CHICAGO_DAILY. The column_type_list parameter identifies the columns to be
captured by the change table. The source_schema and source_table
parameters identify the schema and source table that reside on the source database.
The capture_values setting in this statement indicates that for update
operations, the change data will contain two separate rows for each row that
changed: one row will contain the row values before the update occurred and the
other row will contain the row values after the update occurred.
The options_string parameter in this statement specifies a tablespace for the
change table. (This example assumes that the publisher previously created the TS_
CHICAGO_DAILY tablespace.)
BEGIN
DBMS_CDC_PUBLISH.ALTER_CHANGE_SET(
change_set_name => 'CHICAGO_DAILY',
enable_capture => 'y');
END;
/
The Change Data Capture Asynchronous HotLog system is now ready for
subscriber1 to create subscriptions.
Step 1 Source Database DBA: Prepare to copy redo log files from the source
database.
The source database DBA and the staging database DBA must set up log transport
services to copy redo log files from the source database to the staging database and
to prepare the staging database to receive these redo log files, as follows:
1. The source database DBA configures Oracle Net so that the source database can
communicate with the staging database. (See Oracle Net Services Administrator's
Guide for information about Oracle Net).
2. The source database DBA sets the database initialization parameters on the
source database as described in "Setting Initialization Parameters for Change
Data Capture Publishing" on page 16-21. In the following code example
stagingdb is the network name of the staging database:
compatible = 10.1.0
java_pool_size = 50000000
log_archive_dest_1="location=/oracle/dbs mandatory reopen=5"
log_archive_dest_2 = "service=stagingdb arch optional noregister reopen=5
template = /usr/oracle/dbs/arch1_%s_%t_%r.dbf"
log_archive_dest_state_1 = enable
log_archive_dest_state_2 = enable
log_archive_format="arch1_%s_%t_%r.dbf"
remote_login_passwordfile=shared
See Oracle Data Guard Concepts and Administration for information on log
transport services.
1. Place the database into FORCE LOGGING logging mode to protect against
unlogged direct writes in the source database that cannot be captured by
asynchronous Change Data Capture:
ALTER DATABASE FORCE LOGGING;
If you intend to capture all the column values in a row whenever a column in
that row is updated, you can use the following statement instead of listing each
column one-by-one in the ALTER TABLE statement. However, do not use this
form of the ALTER TABLE statement if all columns are not needed. Logging all
columns incurs more overhead than logging selected columns.
ALTER TABLE HR.JOB_HISTORY ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
Step 4 Staging Database DBA: Create and grant privileges to the publisher.
The staging database DBA creates a user, (for example, cdcpub), to serve as the
Change Data Capture publisher and grants the necessary privileges to the publisher
so that he or she can perform the underlying Oracle Streams operations needed to
create Change Data Capture change sources, change sets, and change tables on the
staging database, as described in "Creating a User to Serve As a Publisher" on
page 16-18. For example:
CREATE USER cdcpub IDENTIFIED BY cdcpub DEFAULT TABLESPACE ts_cdcpub
For asynchronous AutoLog publishing to work, it is critical that the source database
DBA build the data dictionary before the source tables are prepared. The source
database DBA must be careful to follow Step 5 and Step 6 in the order they are
presented here.
See Oracle Streams Concepts and Administration for more information on the
LogMiner data dictionary.
Step 7 Source Database DBA: Get the global name of the source database.
In Step 8, the publisher will need to reference the global name of the source
database. The source database DBA can query the GLOBAL_NAME column in the
GLOBAL_NAME view on the source database to retrieve this information for the
publisher:
SELECT GLOBAL_NAME FROM GLOBAL_NAME;
GLOBAL_NAME
----------------------------------------------------------------------------
HQDB
Step 8 Staging Database Publisher: Identify each change source database and
create the change sources.
The publisher uses the DBMS_CDC_PUBLISH.CREATE_AUTOLOG_CHANGE_SOURCE
procedure on the staging database to create change sources.
The process of managing the capture system begins with the creation of a change
source. A change source describes the source database from which the data will be
captured, and manages the relationship between the source database and the
staging database. A change source always specifies the SCN of a data dictionary
build from the source database as its first_scn parameter.
The publisher gets the SCN of the data dictionary build and the global database
name from the source database DBA (as shown in Step 5 and Step 7, respectively). If
the publisher cannot get the value to use for the first_scn parameter value from
the source database DBA, then, with the appropriate privileges, he or she can query
the V$ARCHIVED_LOG view on the source database to determine the value.
The publisher creates one or more change tables for each source table to be
published, specifies which columns should be included, and specifies the
combination of before and after images of the change data to capture.
The publisher can set the options_string field of the DBMS_CDC_
PUBLISH.CREATE_CHANGE_TABLE procedure to have more control over the
physical properties and tablespace properties of the change tables. The options_
string field can contain any option available (except partitioning) on the CREATE
TABLE statement. In this example, it specifies a tablespace for the change set. (This
example assumes that the publisher previously created the TS_CHICAGO_DAILY
tablespace.)
The following example creates a change table on the staging database that captures
changes made to a source table in the source database. The example uses the sample
table HR.JOB_HISTORY.
BEGIN
DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE(
owner => 'cdcpub',
change_table_name => 'JOB_HIST_CT',
change_set_name => 'CHICAGO_DAILY',
source_schema => 'HR',
source_table => 'JOB_HISTORY',
column_type_list => 'EMPLOYEE_ID NUMBER(6),START_DATE DATE,END_DATE DATE,
JOB_ID VARCHAR2(10), DEPARTMENT_ID NUMBER(4)',
capture_values => 'both',
rs_id => 'y',
row_id => 'n',
user_id => 'n',
timestamp => 'n',
object_id => 'n',
source_colmap => 'n',
target_colmap => 'y',
options_string => 'TABLESPACE TS_CHICAGO_DAILY');
END;
/
This example creates a change table named job_hist_ct within change set
CHICAGO_DAILY. The column_type_list parameter identifies the columns
captured by the change table. The source_schema and source_table
parameters identify the schema and source table that reside in the source database,
not the staging database.
The capture_values setting in the example indicates that for update operations,
the change data will contain two separate rows for each row that changed: one row
will contain the row values before the update occurred and the other row will
contain the row values after the update occurred.
Step 12 Source Database DBA: Switch the redo log files at the source
database.
To begin capturing data, a log file must be archived. The source database DBA can
initiate the process by switching the current redo log file:
ALTER SYSTEM SWITCH LOGFILE;
The Change Data Capture asynchronous AutoLog system is now ready for
subscriber1 to create subscriptions.
A subscriber may subscribe to any source tables for which the publisher has created one or more
change tables by doing one of the following:
■ Specifying the source tables and columns of interest.
When there are multiple publications that contain the columns of interest, then
Change Data Capture selects one on behalf of the user.
■ Specifying the publication IDs and columns of interest.
When there are multiple publications on a single source table and these
publications share some columns, the subscriber should specify publication IDs
(rather than source tables) if any of the shared columns will be used in a single
subscription.
The following steps provide an example to demonstrate the second scenario:
Step 1 Find the source tables for which the subscriber has access privileges.
The subscriber queries the ALL_SOURCE_TABLES view to see all the published
source tables for which the subscriber has access privileges:
SELECT * FROM ALL_SOURCE_TABLES;
SOURCE_SCHEMA_NAME SOURCE_TABLE_NAME
------------------------------ ------------------------------
HR JOB_HISTORY
Step 2 Find the change set names and columns for which the subscriber has
access privileges.
The subscriber queries the ALL_PUBLISHED_COLUMNS view to see all the change
sets, columns, and publication IDs for the HR.JOB_HISTORY table for which the
subscriber has access privileges:
SELECT UNIQUE CHANGE_SET_NAME, COLUMN_NAME, PUB_ID
FROM ALL_PUBLISHED_COLUMNS
WHERE SOURCE_SCHEMA_NAME ='HR' AND SOURCE_TABLE_NAME = 'JOB_HISTORY';
Step 4 Subscribe to a source table and the columns in the source table.
The subscriber calls the DBMS_CDC_SUBSCRIBE.SUBSCRIBE procedure to specify
which columns of the source tables are of interest to the subscriber.
A subscription can contain one or more source tables referenced by the same change
set.
In the following example, the subscriber wants to see the EMPLOYEE_ID, START_
DATE, and END_DATE columns from the JOB_HISTORY table. Because all these
columns are contained in the same publication (and the subscriber has privileges to
access that publication) as shown in the query in Step 2, the following call can be
used:
BEGIN
DBMS_CDC_SUBSCRIBE.SUBSCRIBE(
subscription_name => 'JOBHIST_SUB',
source_schema => 'HR',
source_table => 'JOB_HISTORY',
column_list => 'EMPLOYEE_ID, START_DATE, END_DATE, JOB_ID',
subscriber_view => 'JOBHIST_VIEW');
END;
/
However, assume that for security reasons the publisher has not created a single
change table that includes all these columns. Suppose that instead of the results
shown in Step 2, the query of the ALL_PUBLISHED_COLUMNS view shows that the
columns of interest are included in multiple publications as shown in the following
example:
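For illustration, assume the query returns output similar to the following (the exact
output depends on the publisher's configuration; the publication IDs match those
discussed next):
CHANGE_SET_NAME                COLUMN_NAME                        PUB_ID
------------------------------ ------------------------------ ----------
CHICAGO_DAILY                  EMPLOYEE_ID                         34883
CHICAGO_DAILY                  JOB_ID                              34883
CHICAGO_DAILY                  EMPLOYEE_ID                         34885
CHICAGO_DAILY                  START_DATE                          34885
CHICAGO_DAILY                  END_DATE                            34885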
This returned data shows that the EMPLOYEE_ID column is included in both
publication 34883 and publication 34885. A single subscribe call must specify
columns available in a single publication. Therefore, if the subscriber wants to
subscribe to columns in both publications, using EMPLOYEE_ID to join across the
subscriber views, then the subscriber must use two calls, each specifying a different
publication ID:
BEGIN
DBMS_CDC_SUBSCRIBE.SUBSCRIBE(
subscription_name => 'MULTI_PUB',
publication_id => 34885,
column_list => 'EMPLOYEE_ID, START_DATE, END_DATE',
subscriber_view => 'job_dates');
DBMS_CDC_SUBSCRIBE.SUBSCRIBE(
subscription_name => 'MULTI_PUB',
publication_id => 34883,
column_list => 'EMPLOYEE_ID, JOB_ID',
subscriber_view => 'job_type');
END;
/
Note that each DBMS_CDC_SUBSCRIBE.SUBSCRIBE call specifies a unique
subscriber view.
If this is the subscriber's first call to the EXTEND_WINDOW procedure, then the
subscription window contains all the change data in the publication. Otherwise, the
subscription window contains all the new change data that was created since the
last call to the EXTEND_WINDOW procedure.
If no new change data has been added, then the subscription window remains
unchanged.
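For example, a call such as the following (a sketch using the JOBHIST_SUB
subscription created in Step 4) extends the subscription window:
BEGIN
DBMS_CDC_SUBSCRIBE.EXTEND_WINDOW(
subscription_name => 'JOBHIST_SUB');
END;
/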
The subscriber view name, JOBHIST_VIEW, was specified when the subscriber
called the DBMS_CDC_SUBSCRIBE.SUBSCRIBE procedure in Step 4.
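For example, the subscriber might read the current set of change data with a query
such as the following (a sketch; additional control columns are also available in the
subscriber view):
SELECT EMPLOYEE_ID, START_DATE, END_DATE, JOB_ID FROM JOBHIST_VIEW;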
Step 8 Indicate that the current set of change data is no longer needed.
The subscriber uses the DBMS_CDC_SUBSCRIBE.PURGE_WINDOW procedure to let
Change Data Capture know that the subscriber no longer needs the current set of
change data. This helps Change Data Capture to manage the amount of data in the
change table and sets the low boundary of the subscription window. Calling the
DBMS_CDC_SUBSCRIBE.PURGE_WINDOW procedure causes the subscription
window to be empty.
For example:
BEGIN
DBMS_CDC_SUBSCRIBE.PURGE_WINDOW(
subscription_name => 'JOBHIST_SUB');
END;
/
See Oracle Database Administrator's Guide for information about running a database
in ARCHIVELOG mode.
A redo log file used by Change Data Capture must remain available on the staging
database until Change Data Capture has captured it. However, it is not necessary
that the redo log file remain available until the Change Data Capture subscriber is
done with the change data.
To determine which redo log files are no longer needed by Change Data Capture for
a given change set, the publisher alters the change set's Streams capture process,
which causes Streams to perform some internal cleanup and populates the DBA_
LOGMNR_PURGED_LOG view. The publisher follows these steps:
1. Uses the following query on the staging database to get the three SCN values
needed to determine an appropriate new first_scn value for the change set,
CHICAGO_DAILY:
SELECT cap.CAPTURE_NAME, cap.FIRST_SCN, cap.APPLIED_SCN, cap.SAFE_PURGE_SCN
FROM DBA_CAPTURE cap, CHANGE_SETS cset
WHERE cset.SET_NAME = 'CHICAGO_DAILY'
AND cap.CAPTURE_NAME = cset.CAPTURE_NAME;
2. Determines a new first_scn value that is greater than the original first_
scn value and less than or equal to the applied_scn and safe_purge_scn
values returned by the query in step 1. In this example, this value is 778293 and
the capture process name is CDC$C_CHICAGO_DAILY, so the publisher can alter
the first_scn value for the capture process as follows:
BEGIN
DBMS_CAPTURE_ADM.ALTER_CAPTURE(
capture_name => 'CDC$C_CHICAGO_DAILY',
first_scn => 778293);
END;
/
If there is no SCN value that meets these criteria, then the change set still needs
all of its redo log files.
3. Queries the DBA_LOGMNR_PURGED_LOG view to see any log files that are no
longer needed by Change Data Capture:
SELECT FILE_NAME
FROM DBA_LOGMNR_PURGED_LOG;
Note: Redo log files may be required on the staging database for
purposes other than Change Data Capture. Before deleting a redo
log file, the publisher should be sure that no other users need it.
See the information on setting the first SCN for an existing capture process and on
capture process checkpoints in Oracle Streams Concepts and Administration for more
information.
The first_scn value can be updated for all change sets in an AutoLog change
source by using the DBMS_CDC_PUBLISH.ALTER_AUTOLOG_CHANGE_SOURCE
first_scn parameter. Note that the new first_scn value must meet the criteria
stated in step 2 of the preceding list for all change sets in the AutoLog change
source.
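For example, a call such as the following (a sketch using the HQ_SOURCE change
source and the SCN value determined in step 2; actual values depend on the
environment) updates the first_scn value for all change sets in the change source:
BEGIN
DBMS_CDC_PUBLISH.ALTER_AUTOLOG_CHANGE_SOURCE(
change_source_name => 'HQ_SOURCE',
first_scn => 778293);
END;
/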
Both the size of the redo log files and the frequency with which a log switch occurs
can affect the generation of the archived log files at the source database. For Change
Data Capture, the most important factor in deciding what size to make a redo log
file is the tolerance for latency between when a change is made and when that
change data is available to subscribers. However, because the Oracle Database
software attempts a checkpoint at each log switch, if the redo log file is too small,
frequent log switches will lead to frequent checkpointing and negatively impact the
performance of the source database.
See Oracle Data Guard Concepts and Administration for step-by-step instructions on
monitoring log file archival information. Substitute the terms source and staging
database for the Oracle Data Guard terms primary database and archiving
destinations, respectively.
When using log transport services to supply redo log files to an AutoLog change
source, gaps in the sequence of redo log files are automatically detected and
resolved. If a situation arises where it is necessary to manually add a log file to an
AutoLog change set, the publisher can use instructions on explicitly assigning log
files to a downstream capture process described in Oracle Streams Concepts and
Administration. These instructions require the name of the capture process for the
AutoLog change set. The publisher can obtain the name of the capture process for
an AutoLog change set from the CHANGE_SETS data dictionary view.
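For example, a query such as the following (a sketch, assuming the capture process
name is exposed in a CAPTURE_NAME column of the CHANGE_SETS view) returns the
capture process name for a change set:
SELECT CAPTURE_NAME FROM CHANGE_SETS WHERE SET_NAME = 'CHICAGO_DAILY';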
If an unconditional log group is not created for all source table columns to be
captured, then when an update DML operation occurs, some unchanged user
column values in change tables will be null instead of reflecting the actual
source table value.
For example, suppose a source table contains two columns, X and Y, and that
the source database DBA has defined an unconditional log group for that table
that includes only column Y. Furthermore, assume that a user updates only
column Y in that table row. When the subscriber views the change data for that
row, the value of the unchanged column X will be null. However, because the
actual column value for X is excluded from the redo log file and therefore
cannot be included in the change table, the subscriber cannot assume that the
actual source table value for column X is null. The subscriber must rely on the
contents of the TARGET_COLMAP$ control column to determine whether the
actual source table value for column X is null or it is unchanged.
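For example, the source database DBA might create an unconditional log group that
covers all captured columns of the HR.JOB_HISTORY source table with a statement
such as the following (the log group name is arbitrary):
ALTER TABLE HR.JOB_HISTORY
ADD SUPPLEMENTAL LOG GROUP log_group_job_hist
(EMPLOYEE_ID, START_DATE, END_DATE, JOB_ID, DEPARTMENT_ID) ALWAYS;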
See Oracle Database Utilities for more information on the various types of
supplemental logging.
Datatypes and Table Structures Supported for Asynchronous Change Data Capture
Asynchronous Change Data Capture supports columns of all built-in Oracle
datatypes except the following:
■ BFILE
■ BLOB
■ CLOB
■ LONG
■ NCLOB
■ ROWID
■ UROWID
■ object types (for example, XMLType)
Asynchronous Change Data Capture does not support the following table
structures:
The following example creates a change set, JOBHIST_SET, in the AutoLog change
source, HQ_SOURCE, that starts capture two days from now and continues
indefinitely:
BEGIN
DBMS_CDC_PUBLISH.CREATE_CHANGE_SET(
change_set_name => 'JOBHIST_SET',
description => 'Job History Application Change Set',
change_source_name => 'HQ_SOURCE',
stop_on_ddl => 'Y',
begin_date => sysdate+2);
END;
/
The Oracle Streams capture and apply processes for the change set are started when
the change set is enabled.
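For example, the publisher can enable the JOBHIST_SET change set with a call such
as the following:
BEGIN
DBMS_CDC_PUBLISH.ALTER_CHANGE_SET(
change_set_name => 'JOBHIST_SET',
enable_capture => 'y');
END;
/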
The publisher can disable the JOBHIST_SET asynchronous change set with the
following call:
BEGIN
DBMS_CDC_PUBLISH.ALTER_CHANGE_SET(
change_set_name => 'JOBHIST_SET',
enable_capture => 'n');
END;
/
The Oracle Streams capture and apply processes for the change set are stopped
when the change set is disabled.
Although a disabled change set cannot process new change data, it does not lose
any change data provided that the necessary archived redo log files remain
available until the change set is enabled and processes them. Oracle recommends
that change sets be enabled as much as possible to avoid accumulating archived
redo log files. See "Asynchronous Change Data Capture and Redo Log Files" on
page 16-48 for more information.
Change Data Capture can automatically disable an asynchronous change set if DDL
is encountered during capture and the stop_on_ddl parameter is set to 'Y', or if
there is an internal capture error. The publisher must check the alert log for more
information, take any necessary actions to adjust to the DDL or recover from the
internal error, and explicitly enable the change set. See "Recovering from Errors
Returned on Asynchronous Change Sets" on page 16-55 for more information.
The publisher can alter the JOBHIST_SET change set so that it does not stop on
DDL by using the following call:
BEGIN
DBMS_CDC_PUBLISH.ALTER_CHANGE_SET(
change_set_name => 'JOBHIST_SET',
stop_on_ddl => 'n');
END;
/
If a DDL statement causes processing to stop, a message is written to the alert log
indicating the DDL statement and change set involved. For example, if a TRUNCATE
TABLE DDL statement causes the JOB_HIST change set to stop processing, the alert
log contains lines such as the following:
Change Data Capture received DDL for change set JOB_HIST
Change Data Capture received DDL and stopping: truncate table job_history
Because they do not affect the column data itself, the following DDL statements do
not cause Change Data Capture to stop capturing change data when the stop_on_
ddl parameter is set to 'Y':
■ ANALYZE TABLE
■ LOCK TABLE
■ GRANT privileges to access a table
■ REVOKE privileges to access a table
■ COMMENT on a table
■ COMMENT on a column
These statements can be issued on the source database without concern for their
impact on Change Data Capture processing. For example, when an ANALYZE
TABLE command is issued on the JOB_HISTORY source table, the alert log on the
staging database will contain a line similar to the following when the stop_on_
ddl parameter is set to 'Y':
Change Data Capture received DDL and ignoring: analyze table job_history compute
statistics
The publisher must check the alert log for more information and attempt to fix the
underlying problem. The publisher can then attempt to recover from the error by
calling ALTER_CHANGE_SET with the recover_after_error and remove_ddl
parameters set appropriately. The publisher can retry this procedure as many times
as necessary to resolve the problem. When recovery succeeds, the error is removed
from the change set and the publisher can enable the asynchronous change set (as
described in "Enabling and Disabling Asynchronous Change Sets" on page 16-53).
If more information is needed to resolve capture errors, the publisher can query the
DBA_APPLY_ERROR view to see information about Streams apply errors; capture
errors correspond to Streams apply errors. The publisher must always use the
DBMS_CDC_PUBLISH.ALTER_CHANGE_SET procedure to recover from capture
errors because both Streams and Change Data Capture actions are needed for
recovery and only the DBMS_CDC_PUBLISH.ALTER_CHANGE_SET procedure
performs both sets of actions. See Oracle Streams Concepts and Administration for
information about the error queue and apply errors.
The following two scenarios demonstrate how a publisher might investigate and
then recover from two different types of errors returned to Change Data Capture:
An Error Due to Running Out of Disk Space The publisher can view the contents of the
alert log to determine which error is being returned for a given change set and
which SCN is not being processed. For example, the alert log may contain lines such
as the following (where LCR refers to a logical change record):
Change Data Capture has encountered error number: 1688 for change set: CHICAGO_DAILY
Change Data Capture did not process LCR with scn 219337
The publisher can determine the message associated with the error number
specified in the alert log by querying the DBA_APPLY_ERROR view for the error
message text, where the APPLY_NAME in the DBA_APPLY_ERROR view equals the
APPLY_NAME of the change set specified in the alert log. For example:
ERROR_MESSAGE
--------------------------------------------------------------------------------
ORA-01688: unable to extend table LOGADMIN.CT1 partition P1 by 32 in tablespace
TS_CHICAGO_DAILY
After taking action to fix the problem that is causing the error, the publisher can
attempt to recover from the error. For example, the publisher can attempt to recover
the CHICAGO_DAILY change set after an error with the following call:
BEGIN
DBMS_CDC_PUBLISH.ALTER_CHANGE_SET(
change_set_name => 'CHICAGO_DAILY',
recover_after_error => 'y');
END;
/
If the recovery does not succeed, then an error is returned and the publisher can
take further action to attempt to resolve the problem. The publisher can retry the
recovery procedure as many times as necessary to resolve the problem.
An Error Due to Stopping on DDL Suppose a SQL TRUNCATE TABLE statement is issued
against the JOB_HISTORY source table while the stop_on_ddl parameter is set to
'Y'. In that case, an error such as the following is returned from an attempt to
enable the change set:
ERROR at line 1:
ORA-31468: cannot process DDL change record
ORA-06512: at "SYS.DBMS_CDC_PUBLISH", line 79
ORA-06512: at line 2
Because the TRUNCATE TABLE statement removes all rows from a table, the
publisher will want to notify subscribers before taking action to reenable Change
Data Capture processing. He or she might suggest to subscribers that they purge
and extend their subscription windows. The publisher can then attempt to restore
Change Data Capture processing by altering the change set and specifying the
remove_ddl => 'Y' parameter along with the recover_after_error => 'Y'
parameter, as follows:
BEGIN
DBMS_CDC_PUBLISH.ALTER_CHANGE_SET(
change_set_name => 'JOB_HIST',
recover_after_error => 'y',
remove_ddl => 'y');
END;
/
After this procedure completes, the alert log will contain lines similar to the
following:
Mon Jun 9 16:20:17 2003
Change Data Capture received DDL and ignoring: truncate table JOB_HISTORY
The scn for the truncate statement is 202998
– When the publisher creates a change table, he or she can use the options_
string parameter to specify a tablespace for the change table being
created. See Step 4 in "Performing Synchronous Publishing" on page 16-27
for an example.
If both methods are used, the tablespace specified by the publisher in the
options_string parameter takes precedence over the default tablespace
specified in the SQL CREATE USER statement.
■ For asynchronous Change Data Capture, the publisher should be certain that
the source table that will be referenced in a DBMS_CDC_PUBLISH.CREATE_
CHANGE_TABLE procedure has been created prior to calling this procedure,
particularly if the change set that will be specified in the procedure has the
stop_on_ddl parameter set to 'Y'.
Suppose the publisher created a change set with the stop_on_ddl parameter
set to 'Y', then created the change table, and then the source table was created.
In this scenario, the DDL that creates the source table would trigger the stop_
on_ddl condition and cause Change Data Capture processing to stop.
■ For asynchronous Change Data Capture, the source database DBA should
create an unconditional log group for all source table columns that will be
captured in a change table. This should be done before any change tables are
created on a source table. If an unconditional log group is not created for source
table columns to be captured, then when an update DML operation occurs,
some unchanged user column values in change tables will be null instead of
reflecting the actual source table value. This will require the publisher to
evaluate the TARGET_COLMAP$ control column to distinguish unchanged
column values from column values that are actually null. See "Asynchronous
Change Data Capture and Supplemental Logging" on page 16-50 for
information on creating unconditional log groups and see "Understanding
Change Table Control Columns" on page 16-60 for information on control
columns.
In Example 16–3, the first 'FE' is the low order byte and the last '00' is the high order
byte. To correctly interpret the meaning of the values, you must consider which bits
are set in each byte. The bits in the bitmap are counted starting at zero. The first bit
is bit 0, the second bit is bit 1, and so on. Bit 0 is always ignored. For the other bits, if
a particular bit is set to 1, it means that the value for that column has been changed.
To interpret the string of bytes as presented in the Example 16–3, you read from left
to right. The first byte is the string 'FE'. Broken down into bits (again from left to
right) this string is "1111 1110", which maps to columns " 7,6,5,4 3,2,1,-" in the
change table (where the hyphen represents the ignored bit). The first bit tells you if
column 7 in the change table has changed. The right-most bit is ignored. The values
in Example 16–3 indicate that the first 7 columns have a value present. This is
typical; the first several columns in a change table are control columns.
The next byte in Example 16–3 is the string '11'. Broken down into bits, this string is
"0001 0001", which maps to columns "15,14,13,12 11,10,9,8" in the change table.
These bits indicate that columns 8 and 12 are changed. Columns 9, 10, 11, 13, 14, and
15 are not changed. The rest of the string is all '00', indicating that none of the other
columns has been changed.
A publisher can issue the following query to determine the mapping of column
numbers to column names:
SELECT COLUMN_NAME, COLUMN_ID FROM ALL_TAB_COLUMNS
WHERE OWNER='PUBLISHER_STEWART' AND TABLE_NAME='MY_CT';
COLUMN_NAME COLUMN_ID
------------------------------ ----------
OPERATION$ 1
CSCN$ 2
COMMIT_TIMESTAMP$ 3
XIDUSN$ 4
XIDSLT$ 5
XIDSEQ$ 6
RSID$ 7
TARGET_COLMAP$ 8
C_ID 9
C_KEY 10
C_ZIP 11
C_DATE 12
C_1 13
C_3 14
C_5 15
C_7 16
C_9 17
Using Example 16–3, the publisher can conclude that the following columns were
changed in the particular change row in the change table represented by this
TARGET_COLMAP$ value: OPERATION$, CSCN$, COMMIT_TIMESTAMP$, XIDUSN$,
XIDSLT$, XIDSEQ$, RSID$, TARGET_COLMAP$, and C_DATE.
Note that Change Data Capture generates values for all control columns in all
change rows, so the bits corresponding to control columns are always set to 1 in
every TARGET_COLMAP$ column. Bits that correspond to user columns that have
changed are set to 1 for the OPERATION$ column values UN and I, as appropriate.
(See Table 16–8 for information about the OPERATION$ column values.)
A common use for the values in the TARGET_COLMAP$ column is for determining
the meaning of a null value in a change table. A column value in a change table can
be null for two reasons: the value was changed to null by a user or application, or
Change Data Capture inserted a null value into the column because a value was not
present in the redo data from the source table. If a user changed the value to null,
the bit for that column will be set to 1; if Change Data Capture set the value to null,
then the bit for that column will be set to 0.
Values in the SOURCE_COLMAP$ column are interpreted in a similar manner, with
the following exceptions:
■ The SOURCE_COLMAP$ column refers to columns of source tables, not columns
of change tables.
■ The SOURCE_COLMAP$ column does not reference control columns because
these columns are not present in the source table.
■ Changed source columns are set to 1 in the SOURCE_COLMAP$ column for
OPERATION$ column values UO, UU, UN, and I, as appropriate. (See Table 16–8
for information about the OPERATION$ column values.)
■ The SOURCE_COLMAP$ column is valid only for synchronous change tables.
for users and roles. The publisher must grant the SELECT privilege before a
subscriber can subscribe to the change table.
The publisher must not grant any DML access (use of INSERT, UPDATE, or DELETE
statements) to the subscribers on the change tables because of the risk that a
subscriber might inadvertently change the data in the change table, making it
inconsistent with its source. Furthermore, the publisher should avoid creating
change tables in schemas to which subscribers have DML access.
By default, this purge job runs every 24 hours. The publisher who created the
first change table can adjust this interval using the PL/SQL DBMS_JOB.CHANGE
procedure. The value for the JOB parameter for this procedure can be found by
querying the USER_JOBS view for the job number whose WHAT column contains
the string 'SYS.DBMS_CDC_PUBLISH.PURGE();' (see the example following this
list).
See PL/SQL Packages and Types Reference for information about the DBMS_JOB
package and the Oracle Database Reference for information about the USER_JOBS
view.
■ Publisher
The publisher can manually execute a purge operation at any time. The
publisher has the ability to perform purge operations at a finer granularity than
the automatic purge operation performed by Change Data Capture. There are
three purge operations available to the publisher:
– DBMS_CDC_PUBLISH.PURGE
Purges all change tables on the staging database. This is the same PURGE
operation as is performed automatically by Change Data Capture.
– DBMS_CDC_PUBLISH.PURGE_CHANGE_SET
Purges all the change tables in a named change set.
– DBMS_CDC_PUBLISH.PURGE_CHANGE_TABLE
Purges a named change table.
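The following sketch shows how the publisher might locate the automatic purge job
and change its interval; the job number (21) and the 12-hour interval are
hypothetical:
SELECT JOB, WHAT FROM USER_JOBS
WHERE WHAT LIKE '%DBMS_CDC_PUBLISH.PURGE%';
EXECUTE DBMS_JOB.CHANGE(job => 21, what => NULL, next_date => NULL, -
   interval => 'SYSDATE + 12/24');
COMMIT;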
Thus, calls to the DBMS_CDC_SUBSCRIBE.PURGE_WINDOW procedure by
subscribers and calls to the PURGE procedure by Change Data Capture (or one of
the PURGE procedures by the publisher) work together: when each subscriber
purges a subscription window, it indicates change data that is no longer needed; the
PURGE procedure evaluates the sum of the input from all the subscribers before
actually purging data.
Note that it is possible that a subscriber could fail to call PURGE_WINDOW, with the
result being that unneeded rows would not be deleted by the purge job. The
publisher can query the DBA_SUBSCRIPTIONS view to determine if this is
happening. In extreme circumstances, a publisher may decide to manually drop an
active subscription so that space can be reclaimed. One such circumstance is a
subscriber that is an applications program that fails to call the PURGE_WINDOW
procedure when appropriate. The DBMS_CDC_PUBLISH.DROP_SUBSCRIPTION
procedure lets the publisher drop active subscriptions if circumstances require it;
however, the publisher should first consider that subscribers may still be using the
change data.
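For example, the following call (a sketch using the JOBHIST_SUB subscription name
from the earlier examples) drops a subscription:
BEGIN
DBMS_CDC_PUBLISH.DROP_SUBSCRIPTION(
subscription_name => 'JOBHIST_SUB');
END;
/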
If the publisher still wants to drop the change table, in spite of active subscriptions,
he or she must call the DROP_CHANGE_TABLE procedure using the force_flag
=> 'Y' parameter. This tells Change Data Capture to override its normal
safeguards and allow the change table to be dropped despite active subscriptions.
The subscriptions will no longer be valid, and subscribers will lose access to the
change data.
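For example, a call such as the following (a sketch using the JOB_HIST_CT change
table created earlier) drops the change table despite active subscriptions:
BEGIN
DBMS_CDC_PUBLISH.DROP_CHANGE_TABLE(
owner => 'cdcpub',
change_table_name => 'JOB_HIST_CT',
force_flag => 'Y');
END;
/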
Note: The SQL DROP USER CASCADE statement will drop all the
publisher's change tables, and if any other users have active
subscriptions to the (dropped) change table, these will no longer be
valid. In addition to dropping the change tables, the DROP USER
CASCADE statement drops any change sources, change sets, and
subscriptions that are owned by the user specified in the DROP
USER CASCADE statement.
■ Change Data Capture objects are exported and imported only as part of full
database export and import operations (those in which the expdp and impdp
commands specify the FULL=y parameter). Schema-level import and export
operations include some underlying objects (for example, the table underlying a
change table), but not the Change Data Capture metadata needed for change
data capture to occur.
■ AutoLog change sources, change sets, and change tables are not supported.
■ You should export asynchronous change sets and change tables at a time when
users are not making DDL and DML changes to the database being exported.
■ When importing asynchronous change sets and change tables, you must also
import the underlying Oracle Streams configuration; set the Oracle Data Pump
import parameter STREAMS_CONFIGURATION to y explicitly (or implicitly by
accepting the default), so that the necessary Streams objects are imported. If you
perform an import operation and specify STREAMS_CONFIGURATION=n, then
imported asynchronous change sets and change tables will not be able to
continue capturing change data.
■ Change Data Capture objects never overwrite existing objects when they are
imported (similar to the effect of the import command TABLE_EXISTS_
ACTION=skip parameter for tables). Change Data Capture generates warnings
in the import log for these cases.
■ Change Data Capture objects are validated at the end of an import operation to
determine if all expected underlying objects are present in the correct form.
Change Data Capture generates validation warnings in the import log if it
detects validation problems. Imported Change Data Capture objects with
validation warnings usually cannot continue capturing change data.
The following are examples of Data Pump export and import commands that
support Change Data Capture objects:
> expdp system/manager DIRECTORY=dpump_dir FULL=y
> impdp system/manager DIRECTORY=dpump_dir FULL=y STREAMS_CONFIGURATION=y
After a Data Pump full database import operation completes for a database
containing AutoLog Change Data Capture objects, the following steps must be
performed to restore these objects:
1. The publisher must manually drop the change tables with the SQL DROP
TABLE command. This is needed because the tables are imported without the
accompanying Change Data Capture metadata.
2. The publisher must re-create the AutoLog change sources, change sets, and
change tables using the appropriate DBMS_CDC_PUBLISH procedures.
3. Subscribers must re-create their subscriptions to the AutoLog change sets.
Change data may be lost in the interval between a Data Pump full database export
operation involving AutoLog Change Data Capture objects and their re-creation
after a Data Pump full database import operation in the preceding step. This can be
minimized by preventing changes to the source tables during this interval, if
possible.
See Oracle Database Utilities for information on Oracle Data Pump.
The following are publisher considerations for exporting and importing change
tables:
■ When change tables are imported, the job queue is checked for a Change Data
Capture purge job. If no purge job is found, then one is submitted automatically
(using the DBMS_CDC_PUBLISH.PURGE procedure). If a change table is
imported, but no subscriptions are taken out before the purge job runs (24 hours
later, by default), then all rows in the table will be purged.
The publisher can use one of the following methods to prevent the purging of
data from a change table:
– Suspend the purge job using the DBMS_JOB package to either disable the
job (using the BROKEN procedure) or execute the job sometime in the future
when there are subscriptions (using the NEXT_DATE procedure).
– Create a temporary subscription to preserve the change table data until real
subscriptions appear. Then, drop the temporary subscription.
■ When importing data into a source table for which a change table already exists,
the imported data is also recorded in any associated change tables.
Assume that the publisher has a source table SALES that has an associated
change table ct_sales. When the publisher imports data into SALES, that
data is also recorded in ct_sales.
■ When importing a change table having the optional control ROW_ID column,
the ROW_ID columns stored in the change table have meaning only if the
associated source table has not been imported. If a source table is re-created or
imported, each row will have a new ROW_ID that is unrelated to the ROW_ID
that was previously recorded in a change table.
The original level of export and import support available in Oracle9i is retained for
backward compatibility. Synchronous change tables that reside in the SYNC_SET
change set can be exported as part of a full database, schema, or individual table
export operation and can be imported as needed. The following Change Data
Capture objects are not included in the original export and import support: change
sources, change sets, change tables that do not reside in the SYNC_SET change set,
and subscriptions.
To reinstall Change Data Capture, the SQL script initcdc.sql is provided in the
admin directory. It creates the Change Data Capture system triggers and Java
classes that are required by Change Data Capture.
This chapter illustrates how to use the SQLAccess Advisor, which is a tuning tool
that provides advice on materialized views, indexes, and materialized view logs.
The chapter contains:
■ Overview of the SQLAccess Advisor in the DBMS_ADVISOR Package
■ Using the SQLAccess Advisor
■ Tuning Materialized Views for Fast Refresh and Query Rewrite
[Figure: SQLAccess Advisor overview, showing the Oracle warehouse, the SQL cache,
the workload (workload collection is optional), and the recommended materialized
views, indexes, and dimensions.]
Using the SQLAccess Advisor Wizard or API, you can do the following:
■ Recommend materialized views and indexes based on collected or hypothetical
workload information.
■ Manage workloads.
■ Mark, update, and remove recommendations.
In addition, you can use the SQLAccess Advisor API to do the following:
■ Perform a quick tune using a single SQL statement.
■ Show how to make a materialized view fast refreshable.
■ Show how to change a materialized view so that general query rewrite is
possible.
The SQLAccess Advisor's recommendations are significantly improved if you
gather structural statistics about table and index cardinalities, and the distinct
cardinalities of every dimension level column, JOIN KEY column, and fact table key
column. You do this by gathering either exact or estimated statistics with the DBMS_
STATS package. Because gathering statistics is time-consuming and extreme
statistical accuracy is not required, it is generally preferable to estimate statistics.
Without these statistics, any queries referencing a table that lacks statistics will be
marked as invalid in the workload, resulting in no recommendations being made for
those queries. It
is also recommended that all existing indexes and materialized views have been
analyzed. See PL/SQL Packages and Types Reference for more information regarding
the DBMS_STATS package.
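For example, the following call (a sketch using the SH sample schema) estimates
statistics for the SALES fact table and its indexes:
EXECUTE DBMS_STATS.GATHER_TABLE_STATS('SH', 'SALES', -
   estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE, cascade => TRUE);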
the naming conventions for what it recommends. With respect to the workload,
parameters control how long the workload exists and what filtering is to be applied
to the workload.
To set these parameters, use the SET_TASK_PARAMETER and SET_SQLWKLD_
PARAMETER procedures. Parameters are persistent in that they remain set for the
lifespan of the task or workload object. When a parameter value is set using the
SET_TASK_PARAMETER procedure, it does not change until you make another call
to SET_TASK_PARAMETER.
See "Defining the Contents of a Workload" on page 17-14 for more information
about workloads.
Not all recommendations have to be accepted and you can mark the ones that
should be included in the recommendation script.
The final step is then implementing the recommendations and verifying that query
performance has improved.
■ Recommendation Options
■ Generating Recommendations
■ Viewing the Recommendations
■ Access Advisor Journal
■ Stopping the Recommendation Process
■ Marking Recommendations
■ Modifying Recommendations
■ Generating SQL Scripts
■ When Recommendations are No Longer Required
■ Performing a Quick Tune
■ Managing Tasks
■ Using SQLAccess Advisor Constants
[Figure: The recommended flow for using the SQLAccess Advisor API.
Step 1: Create and manage tasks and data (CREATE_TASK, DELETE_TASK,
UPDATE_TASK_ATT.., CREATE_SQLWKLD, DELETE_SQLWKLD, ADD_SQLWKLD_REF), producing a
SQLAccess task and a SQL Wkld object.
Step 2: Prepare tasks for various operations (SET_TASK_PARAMETER,
SET_SQLWKLD_PARAMETER).
Step 3: Gather and manage the workload (IMPORT_SQLWKLD..., ADD_SQLWKLD_STAT..,
UPDATE_SQLWKLD_STAT.., DELETE_SQLWKLD_STAT..).
Step 4: Prepare and analyze data (EXECUTE_TASK, GET_TASK_SCRIPT,
MARK_RECOMMENDATIONS, UPDATE_REC_ATTRIBUTES), producing recommendations.]
To avoid missing critical workload queries, the current database user must have
SELECT privileges on the tables targeted for materialized view analysis. For those
tables, these SELECT privileges cannot be obtained through a role.
Creating Tasks
An Advisor task is where you define what it is you want to analyze and where the
results of this analysis should go. A user can create any number of tasks, each with
its own specialization. All are based on the same Advisor task model and share the
same repository.
You create a task using the CREATE_TASK procedure. The syntax is as follows:
DBMS_ADVISOR.CREATE_TASK (
advisor_name IN VARCHAR2,
task_id OUT NUMBER,
task_name IN OUT VARCHAR2,
task_desc IN VARCHAR2 := NULL,
task_or_template IN VARCHAR2 := NULL,
is_template IN VARCHAR2 := 'FALSE');
See PL/SQL Packages and Types Reference for more information regarding the
CREATE_TASK and CREATE_SQLWKLD procedures and their parameters.
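For example, the following creates a SQLAccess Advisor task named MYTASK, the task
name used in later examples in this chapter (a minimal sketch):
VARIABLE task_id NUMBER;
VARIABLE task_name VARCHAR2(255);
EXECUTE :task_name := 'MYTASK';
EXECUTE DBMS_ADVISOR.CREATE_TASK('SQL Access Advisor', :task_id, :task_name);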
To use a task as a template, you tell the SQLAccess Advisor to use a task when a
new task is created. At that time, the SQLAccess Advisor copies the task template's
data and parameter settings into the newly created task. You can also set an existing
task to be a template by setting the template attribute when creating the task, or
later using the UPDATE_TASK_ATTRIBUTES procedure.
A workload object can also be used as a template for creating new workload objects.
Following the same guidelines for using a task as a template, a workload object can
benefit from having a well-defined starting point. Like a task template, a template
workload object can only be used to create similar workload objects.
Creating Templates
You can create a template as in the following example.
1. Create a template called MY_TEMPLATE.
VARIABLE template_id NUMBER;
VARIABLE template_name VARCHAR2(255);
EXECUTE :template_name := 'MY_TEMPLATE';
EXECUTE DBMS_ADVISOR.CREATE_TASK('SQL Access Advisor',:template_id, -
:template_name, is_template => 'TRUE');
2. Set template parameters. For example, the following sets the naming
conventions for recommended indexes and materialized views and the default
tablespaces:
-- set naming conventions for recommended indexes/mvs
EXECUTE DBMS_ADVISOR.SET_TASK_PARAMETER ( -
:template_name, 'INDEX_NAME_TEMPLATE', 'SH_IDX$$_<SEQ>');
EXECUTE DBMS_ADVISOR.SET_TASK_PARAMETER ( -
:template_name, 'MVIEW_NAME_TEMPLATE', 'SH_MV$$_<SEQ>');
EXECUTE DBMS_ADVISOR.SET_TASK_PARAMETER ( -
:template_name, 'DEF_MVIEW_TABLESPACE', 'SH_MVIEWS');
3. This template can now be used as a starting point to create a task as follows:
VARIABLE task_id NUMBER;
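VARIABLE task_name VARCHAR2(255);
EXECUTE :task_name := 'MYTASK';
-- A sketch of creating the task from the MY_TEMPLATE template; the
-- task_or_template parameter name follows the CREATE_TASK syntax shown earlier.
EXECUTE DBMS_ADVISOR.CREATE_TASK('SQL Access Advisor', :task_id, -
   :task_name, task_or_template => 'MY_TEMPLATE');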
Workload Objects
Because the workload is stored as a separate workload object, it can easily be shared
among many Advisor tasks. Once a workload object has been referenced by an
Advisor task, it cannot be deleted or modified until all Advisor
tasks have removed their dependency on the data. A workload reference will be
removed when a parent Advisor task is deleted or when the workload reference is
manually removed from the Advisor task by the user.
The SQLAccess Advisor performs best when a workload based on usage is
available. The SQLAccess Workload Repository is capable of storing multiple
workloads, so that the different uses of a real-world data warehousing or
transaction processing environment can be viewed over a long period of time and
across the life cycle of database instance startup and shutdown.
Before the actual SQL statements for a workload can be defined, the workload must
be created using the CREATE_SQLWKLD procedure. Then, the workload is loaded
using the appropriate IMPORT_SQLWKLD procedure. A specific workload can be
removed by calling the DELETE_SQLWKLD procedure and passing it a valid
workload name. To remove all workloads for the current user, call DELETE_
SQLWKLD and pass the constant value ADVISOR_ALL or %.
Managing Workloads
The CREATE_SQLWKLD procedure creates a workload and it must exist prior to
performing any other workload operations, such as importing or updating SQL
statements. The workload is identified by its name, so you should define a name
that is unique and is relevant for the operation.
Its syntax is as follows:
DBMS_ADVISOR.CREATE_SQLWKLD (
workload_name IN VARCHAR2,
description IN VARCHAR2 := NULL,
3. Set template parameters. For example, the following sets the filter so only tables
in the sh schema are tuned:
-- set USERNAME_LIST filter to SH
EXECUTE DBMS_ADVISOR.SET_SQLWKLD_PARAMETER( -
:template_name, 'USERNAME_LIST', 'SH');
See PL/SQL Packages and Types Reference for more information regarding the
CREATE_SQLWKLD procedure and its parameters.
Once a link between an Advisor task and a workload is made, the workload is protected from
removal. The syntax is as follows:
DBMS_ADVISOR.ADD_SQLWKLD_REF (task_name IN VARCHAR2,
workload_name IN VARCHAR2);
The following example links the MYTASK task created to the MYWORKLOAD SQL
workload.
EXECUTE DBMS_ADVISOR.ADD_SQLWKLD_REF('MYTASK', 'MYWORKLOAD');
See PL/SQL Packages and Types Reference for more information regarding the ADD_
SQLWKLD_REF procedure and its parameters.
After a workload has been collected and the statements filtered, the SQLAccess
Advisor computes usage statistics with respect to the DML statements in the
workload.
The following example creates a workload from a SQL Tuning Set named MY_STS_
WORKLOAD.
VARIABLE sqlsetname VARCHAR2(30);
VARIABLE workload_name VARCHAR2(30);
VARIABLE saved_stmts NUMBER;
VARIABLE failed_stmts NUMBER;
EXECUTE :sqlsetname := 'MY_STS_WORKLOAD';
EXECUTE :workload_name := 'MY_WORKLOAD';
EXECUTE DBMS_ADVISOR.CREATE_SQLWKLD (:workload_name);
EXECUTE DBMS_ADVISOR.IMPORT_SQLWKLD_STS (:workload_name , -
:sqlsetname, 'NEW', 1, :saved_stmts, :failed_stmts);
The following example loads the MYWORKLOAD workload created earlier, using a user
table SH.USER_WORKLOAD. The table is assumed to be populated with SQL
statements and conforms to the format specified in Table 17–1.
VARIABLE saved_stmts NUMBER;
VARIABLE failed_stmts NUMBER;
EXECUTE DBMS_ADVISOR.IMPORT_SQLWKLD_USER( -
'MYWORKLOAD', 'NEW', 'SH', 'USER_WORKLOAD', :saved_stmts, :failed_stmts);
See PL/SQL Packages and Types Reference for more information regarding the
IMPORT_SQLWKLD_USER procedure and its parameters.
See PL/SQL Packages and Types Reference for more information regarding the
IMPORT_SQLWKLD_SQLCACHE procedure and its parameters.
The following example loads the MYWORKLOAD workload created earlier from the
SQL Cache. The priority of the loaded workload statements is 2 (medium).
VARIABLE saved_stmts NUMBER;
VARIABLE failed_stmts NUMBER;
EXECUTE DBMS_ADVISOR.IMPORT_SQLWKLD_SQLCACHE (-
'MYWORKLOAD', 'APPEND', 2, :saved_stmts, :failed_stmts);
The SQLAccess Advisor can retrieve workload information from the SQL cache. If
the collected data was retrieved from a server with the instance parameter cursor_
sharing set to SIMILAR or FORCE, then user queries with embedded literal values
will be converted to a statement that contains system-generated bind variables. If
you are going to use the SQLAccess Advisor to recommend materialized views,
then the server should set the instance parameter cursor_sharing to EXACT so
that materialized views with WHERE clauses can be recommended.
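For example, the following statement (a sketch; it requires the ALTER SYSTEM
privilege) sets the parameter for the instance:
ALTER SYSTEM SET CURSOR_SHARING = EXACT;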
See PL/SQL Packages and Types Reference for more information regarding the
IMPORT_SQLWKLD_SCHEMA procedure and its parameters. You must configure
external procedures to use this procedure.
The following example creates a hypothetical workload called SCHEMA_WKLD, sets
VALID_TABLE_LIST to sh and calls IMPORT_SQLWKLD_SCHEMA to generate a
hypothetical workload.
VARIABLE workload_name VARCHAR2(255);
VARIABLE saved_stmts NUMBER;
VARIABLE failed_stmts NUMBER;
EXECUTE :workload_name := 'SCHEMA_WKLD';
EXECUTE DBMS_ADVISOR.CREATE_SQLWKLD(:workload_name);
EXECUTE DBMS_ADVISOR.SET_SQLWKLD_PARAMETER (:workload_name, -
'VALID_TABLE_LIST', 'SH');
EXECUTE DBMS_ADVISOR.IMPORT_SQLWKLD_SCHEMA ( -
:workload_name, 'NEW', 2, :saved_stmts, :failed_stmts);
The IMPORT_SQLWKLD_SUMADV procedure syntax is as follows:
DBMS_ADVISOR.IMPORT_SQLWKLD_SUMADV (
workload_name IN VARCHAR2,
import_mode IN VARCHAR2,
priority IN NUMBER := 2,
sumadv_id IN NUMBER,
saved_rows OUT NUMBER,
failed_rows OUT NUMBER);
See PL/SQL Packages and Types Reference for more information regarding the
IMPORT_SQLWKLD_SUMADV procedure and its parameters.
The following example creates a SQL workload from an Oracle9i Summary Advisor
workload. The workload_id of the Oracle9i workload is 777.
1. Create some variables.
VARIABLE workload_name VARCHAR2(255);
VARIABLE saved_stmts NUMBER;
VARIABLE failed_stmts NUMBER;
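-- Steps 2 and 3 (a sketch): create the workload, then import the Oracle9i
-- Summary Advisor workload whose workload_id is 777. The workload name
-- SUMADV_WKLD is hypothetical.
EXECUTE :workload_name := 'SUMADV_WKLD';
EXECUTE DBMS_ADVISOR.CREATE_SQLWKLD(:workload_name);
EXECUTE DBMS_ADVISOR.IMPORT_SQLWKLD_SUMADV(:workload_name, 'NEW', 2, 777, -
   :saved_stmts, :failed_stmts);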
See PL/SQL Packages and Types Reference for details of all the settings for the
JOURNALING parameter.
The information in the journal is for diagnostic purposes only and subject to change
in future releases. It should not be used within any application.
See PL/SQL Packages and Types Reference for more information regarding the ADD_
SQLWKLD_STATEMENT procedure and its parameters. The following example adds
a single statement to the MYWORKLOAD workload.
VARIABLE sql_text VARCHAR2(400);
EXECUTE :sql_text := 'SELECT AVG(amount_sold) FROM sales';
EXECUTE DBMS_ADVISOR.ADD_SQLWKLD_STATEMENT ( -
'MYWORKLOAD', 'MONTHLY', 'ROLLUP', priority=>1, executions=>10, -
username => 'SH', sql_text => :sql_text);
The second form lets you delete statements that match a specified search condition.
DBMS_ADVISOR.DELETE_SQLWKLD_STATEMENT (workload_name IN VARCHAR2,
search IN VARCHAR2,
deleted OUT NUMBER);
The following example deletes from MYWORKLOAD all statements that satisfy the
condition executions less than 5:
VARIABLE deleted_stmts NUMBER;
EXECUTE DBMS_ADVISOR.DELETE_SQLWKLD_STATEMENT ( -
'MYWORKLOAD', 'executions < 5', :deleted_stmts);
The second form enables you to update all SQL statements satisfying a given
search condition.
DBMS_ADVISOR.UPDATE_SQLWKLD_STATEMENT (workload_name IN VARCHAR2,
search IN VARCHAR2,
updated OUT NUMBER,
module IN VARCHAR2,
action IN VARCHAR2,
priority IN NUMBER,
username IN VARCHAR2);
The following example changes the priority to 3 for all statements in MYWORKLOAD
that have executions less than 10. The count of updated statements is returned in
the updated_stmts variable.
VARIABLE updated_stmts NUMBER;
EXECUTE DBMS_ADVISOR.UPDATE_SQLWKLD_STATEMENT ( -
'MYWORKLOAD', 'executions < 10', :updated_stmts, priority => 3);
See PL/SQL Packages and Types Reference for more information regarding the
UPDATE_SQLWKLD_STATEMENT procedure and its parameters.
Maintaining Workloads
There are several other operations that can be performed upon a workload,
including the following:
See PL/SQL Packages and Types Reference for more information regarding the
UPDATE_SQLWKLD_ATTRIBUTES procedure and its parameters.
Resetting Workloads
The RESET_SQLWKLD procedure resets a workload to its initial starting point. This
has the effect of removing all journal and log messages and recalculating volatility
statistics, while leaving the workload data untouched. This procedure should be
executed after any workload adjustments such as adding or removing SQL
statements. The following example resets workload MYWORKLOAD.
EXECUTE DBMS_ADVISOR.RESET_SQLWKLD('MYWORKLOAD');
See PL/SQL Packages and Types Reference for more information regarding the RESET_
SQLWKLD procedure and its parameters.
Removing Workloads
When workloads are no longer needed, they can be removed using the procedure
DELETE_SQLWKLD. You can delete all workloads or a specific collection, but a
workload cannot be deleted if it is still linked to a task.
The following procedure is an example of removing a specific workload. It deletes
an existing workload from the repository.
DBMS_ADVISOR.DELETE_SQLWKLD (workload_name IN VARCHAR2);
EXECUTE DBMS_ADVISOR.DELETE_SQLWKLD('MYWORKLOAD');
See PL/SQL Packages and Types Reference for more information regarding the
DELETE_SQLWKLD procedure and its parameters.
Recommendation Options
Before recommendations can be generated, the parameters for the task must first be
defined using the SET_TASK_PARAMETER procedure. If parameters are not defined,
then the defaults are used.
You can set task parameters by using the SET_TASK_PARAMETER procedure. The
syntax is as follows.
DBMS_ADVISOR.SET_TASK_PARAMETER (
task_name IN VARCHAR2,
parameter IN VARCHAR2,
value IN VARCHAR2);
There are many task parameters and, to help identify the relevant ones, they have
been grouped into categories in Table 17–2.
Table 17–2 (Cont.) Types of Advisor Task Parameters And Their Uses
Workload Filtering      Task Configuration   Schema Attributes      Recommendation Options
SQL_LIMIT                                    DEF_MVIEW_TABLESPACE   MODE
START_TIME                                   DEF_MVLOG_TABLESPACE   REFRESH_MODE
USERNAME_LIST                                INDEX_NAME_TEMPLATE    STORAGE_CHANGE
VALID_TABLE_LIST                             MVIEW_NAME_TEMPLATE    CREATION_COST
WORKLOAD_SCOPE
MODULE_LIMIT
TIME_LIMIT
END_TIME
COMMENTED_FILTER_LIST
The following example sets the storage change of task MYTASK to 100MB. This
indicates 100MB of additional space for recommendations. A zero value would
indicate that no additional space can be allocated. A negative value indicates that
the advisor must attempt to trim the current space utilization by the specified
amount.
EXECUTE DBMS_ADVISOR.SET_TASK_PARAMETER('MYTASK','STORAGE_CHANGE', 100000000);
In the following example, we set the VALID_TABLE_LIST parameter to filter out all
queries that do not reference the tables SH.SALES and SH.CUSTOMERS.
EXECUTE DBMS_ADVISOR.SET_TASK_PARAMETER ( -
'MYTASK', 'VALID_TABLE_LIST', 'SH.SALES, SH.CUSTOMERS');
See PL/SQL Packages and Types Reference for more information regarding the SET_
TASK_PARAMETER procedure and its parameters.
Generating Recommendations
You can generate recommendations by using the EXECUTE_TASK procedure with
your task name. After the procedure finishes, you can check the DBA_ADVISOR_
LOG table for the actual execution status and the number of recommendations and
actions that have been produced. The recommendations can be queried by task
name in {DBA, USER}_ADVISOR_RECOMMENDATIONS and the actions for these
recommendations can be viewed by task in {DBA, USER}_ADVISOR_ACTIONS.
EXECUTE_TASK Procedure
This procedure performs the SQLAccess Advisor analysis or evaluation for the
specified task. Task execution is a synchronous operation, so control will not be
returned to the user until the operation has completed, or a user-interrupt was
detected. Upon return or execution of the task, you can check the DBA_ADVISOR_
LOG table for the actual execution status.
Running EXECUTE_TASK generates recommendations, where a recommendation
comprises one or more actions, such as creating a materialized view log and a
materialized view. The syntax is as follows:
DBMS_ADVISOR.EXECUTE_TASK (task_name IN VARCHAR2);
See PL/SQL Packages and Types Reference for more information regarding the
EXECUTE_TASK procedure and its parameters.
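For example, the following call runs the analysis for the MYTASK task used in the
earlier examples:
EXECUTE DBMS_ADVISOR.EXECUTE_TASK('MYTASK');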
To identify which query benefits from which recommendation, you can use the
DBA_ADVISOR_SQLA_WK_STMTS and USER_ADVISOR_SQLA_WK_STMTS views. The precost and postcost
numbers are in terms of the estimated optimizer cost (shown in EXPLAIN PLAN)
without and with the recommended access structure changes, respectively. To see
recommendations for each query, issue the following statement:
SELECT sql_id, rec_id, precost, postcost,
(precost-postcost)*100/precost AS percent_benefit
FROM USER_ADVISOR_SQLA_WK_STMTS
WHERE TASK_NAME = :task_name AND workload_name = :workload_name;
Each action has several attributes that pertain to the properties of the access
structure. The name and tablespace for each access structure when applicable are
placed in attr1 and attr2 respectively. The space occupied by each new access
structure is in num_attr1. All other attributes are different for each action.
Table 17–3 maps SQLAccess Advisor action information to the corresponding
column in DBA_ADVISOR_ACTIONS.
The following PL/SQL procedure can be used to print out some of the attributes of
the recommendations.
CONNECT SH/SH;
CREATE OR REPLACE PROCEDURE show_recm (in_task_name IN VARCHAR2) IS
CURSOR curs IS
SELECT DISTINCT action_id, command, attr1, attr2, attr3, attr4
FROM user_advisor_actions
WHERE task_name = in_task_name
ORDER BY action_id;
v_action number;
v_command VARCHAR2(32);
v_attr1 VARCHAR2(4000);
v_attr2 VARCHAR2(4000);
v_attr3 VARCHAR2(4000);
v_attr4 VARCHAR2(4000);
v_attr5 VARCHAR2(4000);
BEGIN
OPEN curs;
DBMS_OUTPUT.PUT_LINE('=========================================');
DBMS_OUTPUT.PUT_LINE('Task_name = ' || in_task_name);
LOOP
FETCH curs INTO
v_action, v_command, v_attr1, v_attr2, v_attr3, v_attr4 ;
EXIT when curs%NOTFOUND;
DBMS_OUTPUT.PUT_LINE('Action ID: ' || v_action);
DBMS_OUTPUT.PUT_LINE('Command : ' || v_command);
DBMS_OUTPUT.PUT_LINE('Attr1 (name) : ' || SUBSTR(v_attr1,1,30));
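-- The remainder of the procedure is a sketch that completes the example:
-- print the remaining fetched attributes, then close the cursor.
DBMS_OUTPUT.PUT_LINE('Attr2 (tablespace): ' || SUBSTR(v_attr2,1,30));
DBMS_OUTPUT.PUT_LINE('Attr3 : ' || SUBSTR(v_attr3,1,30));
DBMS_OUTPUT.PUT_LINE('Attr4 : ' || SUBSTR(v_attr4,1,30));
DBMS_OUTPUT.PUT_LINE('-----------------------------------------');
END LOOP;
CLOSE curs;
DBMS_OUTPUT.PUT_LINE('=========================================');
END show_recm;
/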
See PL/SQL Packages and Types Reference for details regarding Attr5 and Attr6.
See PL/SQL Packages and Types Reference for details of all the settings for the
journaling parameter.
The information in the journal is for diagnostic purposes only and subject to change
in future releases. It should not be used within any application.
Canceling Tasks
The CANCEL_TASK procedure causes a currently executing operation to terminate.
An Advisor operation may take a few seconds to respond to the call. Because all
Advisor task procedures are synchronous, to cancel an operation, you must use a
separate database session.
A cancel command effectively restores the task to its condition prior to the start of the
cancelled operation. Therefore, a cancelled task or data object cannot be restarted.
DBMS_ADVISOR.CANCEL_TASK (task_name IN VARCHAR2);
See PL/SQL Packages and Types Reference for more information regarding the
CANCEL_TASK procedure and its parameters.
Marking Recommendations
By default, all SQLAccess Advisor recommendations are ready to be implemented;
however, the user can choose to skip or exclude selected recommendations by using
the MARK_RECOMMENDATION procedure. MARK_RECOMMENDATION allows the user
to annotate a recommendation with a REJECT or IGNORE setting, which will cause
the GET_TASK_SCRIPT to skip it when producing the implementation procedure.
DBMS_ADVISOR.MARK_RECOMMENDATION (
task_name IN VARCHAR2,
id IN NUMBER,
action IN VARCHAR2);
See PL/SQL Packages and Types Reference for more information regarding the MARK_
RECOMMENDATION procedure and its parameters.
Modifying Recommendations
The SQLAccess Advisor names and assigns ownership to new objects such as
indexes and materialized views during the analysis operation. However, it does not
necessarily choose appropriate names, so you can use the UPDATE_REC_ATTRIBUTES
procedure to manually set the owner, name, and tablespace values for new
objects. For recommendations referencing existing database objects, owner and
name values cannot be changed. The syntax is as follows:
DBMS_ADVISOR.UPDATE_REC_ATTRIBUTES (
task_name IN VARCHAR2,
rec_id IN NUMBER,
action_id IN NUMBER,
attribute_name IN VARCHAR2,
value IN VARCHAR2);
See PL/SQL Packages and Types Reference for more information regarding the
UPDATE_REC_ATTRIBUTES procedure and its parameters.
To save the script to a file, a directory path must be supplied so that the procedure
CREATE_FILE knows where to store the script. In addition, read and write
privileges must be granted on this directory. The following example shows how to
save an advisor script CLOB to a file:
-- create a directory and grant permissions to read/write to it
CONNECT SH/SH;
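-- A sketch; the directory name ADVISOR_RESULTS, the operating system path,
-- and the script file name are hypothetical.
CREATE DIRECTORY ADVISOR_RESULTS AS '/tmp/advisor_results';
GRANT READ, WRITE ON DIRECTORY ADVISOR_RESULTS TO PUBLIC;
EXECUTE DBMS_ADVISOR.CREATE_FILE(DBMS_ADVISOR.GET_TASK_SCRIPT('MYTASK'), -
   'ADVISOR_RESULTS', 'advisor_script.sql');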
The following is a fragment of a script generated by this procedure. The script also
includes PL/SQL calls to gather stats on the recommended access structures and
marks the recommendations as IMPLEMENTED at the end.
Rem Access Advisor V10.0.0.0.0 - Beta
Rem
Rem Username: SH
Rem Task: MYTASK
Rem Execution date: 15/04/2003 11:35
Rem
set feedback 1
set linesize 80
set trimspool on
set tab off
set pagesize 60
whenever sqlerror CONTINUE
DBMS_ADVISOR.MARK_RECOMMENDATION('"MYTASK"',2,'IMPLEMENTED');
DBMS_ADVISOR.MARK_RECOMMENDATION('"MYTASK"',3,'IMPLEMENTED');
DBMS_ADVISOR.MARK_RECOMMENDATION('"MYTASK"',4,'IMPLEMENTED');
END;
/
See PL/SQL Packages and Types Reference for more information regarding the RESET_
TASK procedure and its parameters.
The following example shows how to quick tune a single SQL statement:
VARIABLE task_name VARCHAR2(255);
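VARIABLE sql_stmt VARCHAR2(4000);
-- A sketch: the statement text and task name are hypothetical;
-- DBMS_ADVISOR.SQLACCESS_ADVISOR identifies the advisor to use.
EXECUTE :sql_stmt := 'SELECT AVG(amount_sold) FROM sh.sales GROUP BY prod_id';
EXECUTE :task_name := 'MY_QUICKTUNE_TASK';
EXECUTE DBMS_ADVISOR.QUICK_TUNE(DBMS_ADVISOR.SQLACCESS_ADVISOR, -
   :task_name, :sql_stmt);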
See PL/SQL Packages and Types Reference for more information regarding the QUICK_
TUNE procedure and its parameters.
Managing Tasks
Every time recommendations are generated, tasks are created and, unless some
maintenance is performed on these tasks, they will grow over time and will occupy
storage space. There may be tasks that you want to keep and prevent accidental
deletion. Therefore, there are several management operations that can be performed
on tasks:
■ Updating Task Attributes
■ Deleting Tasks
■ Setting DAYS_TO_EXPIRE
See PL/SQL Packages and Types Reference for more information regarding the
UPDATE_TASK_ATTRIBUTES procedure and its parameters.
Deleting Tasks
The DELETE_TASK procedure deletes existing Advisor tasks from the repository.
The syntax is as follows:
DBMS_ADVISOR.DELETE_TASK (task_name IN VARCHAR2);
See PL/SQL Packages and Types Reference for more information regarding the
DELETE_TASK procedure and its parameters.
Setting DAYS_TO_EXPIRE
When a task or workload object is created, the parameter DAYS_TO_EXPIRE is set
to 30. The value indicates the number of days until the task or object will
automatically be deleted by the system. If you wish to save a task or workload
indefinitely, the DAYS_TO_EXPIRE parameter should be set to ADVISOR_
UNLIMITED.
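For example, the following call (a sketch; it assumes the value is passed as the
literal text ADVISOR_UNLIMITED) prevents the MYTASK task from expiring:
EXECUTE DBMS_ADVISOR.SET_TASK_PARAMETER('MYTASK', 'DAYS_TO_EXPIRE', 'ADVISOR_UNLIMITED');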
-- order by
INSERT INTO user_workload (username, module, action, priority, sql_text)
VALUES ('SH', 'Example1', 'Action', 2,
'SELECT c.country_id, c.cust_city, c.cust_last_name
FROM customers c WHERE c.country_id IN (52790, 52789)
ORDER BY c.country_id, c.cust_city, c.cust_last_name')
/
COMMIT;
CONNECT SH/SH;
set serveroutput on;
-- Order by
SELECT c.country_id, c.cust_city, c.cust_last_name
FROM customers c WHERE c.country_id IN ('52790', '52789')
ORDER BY c.country_id, c.cust_city, c.cust_last_name;
CONNECT sh/sh
VARIABLE task_id NUMBER;
VARIABLE task_name VARCHAR2(255);
VARIABLE workload_name VARCHAR2(255);
VARIABLE saved_stmts NUMBER;
VARIABLE failed_stmts NUMBER;
PRINT :updated_stmts;
PRINT :saved_stmts;
PRINT :failed_stmts;
DBMS_ADVISOR.TUNE_MVIEW Procedure
This section discusses the following information:
■ TUNE_MVIEW Syntax and Operations
■ Accessing TUNE_MVIEW Output Results
■ USER_TUNE_MVIEW and DBA_TUNE_MVIEW Views
The TUNE_MVIEW procedure takes two input parameters: task_name and mv_
create_stmt. task_name is a user-provided task identifier used to access the
output results. mv_create_stmt is a complete CREATE MATERIALIZED VIEW
statement that is to be tuned. If the input CREATE MATERIALIZED VIEW statement
does not have the REFRESH FAST or ENABLE QUERY REWRITE clause, or both,
TUNE_MVIEW will use the default clauses REFRESH FORCE and DISABLE QUERY
REWRITE and tune the statement so that it is fast refreshable if possible, and
complete refreshable otherwise.
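For example, a call such as the following (a sketch; the task name, the materialized
view name, and the defining query are hypothetical and use the SH sample schema)
asks TUNE_MVIEW to make the statement fast refreshable:
VARIABLE task_name VARCHAR2(255);
EXECUTE :task_name := 'MY_TUNE_MVIEW_TASK';
BEGIN
DBMS_ADVISOR.TUNE_MVIEW(:task_name,
'CREATE MATERIALIZED VIEW sh.cust_sales_mv
REFRESH FAST ENABLE QUERY REWRITE AS
SELECT s.cust_id, SUM(s.amount_sold) AS total_sold
FROM sh.sales s GROUP BY s.cust_id');
END;
/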
The TUNE_MVIEW procedure handles a broad range of CREATE MATERIALIZED
VIEW statements that can have arbitrary defining queries in them. The defining
query could be a simple SELECT statement or a complex query with set operators
or inline views. When the defining query of the materialized view contains the
clause REFRESH FAST, TUNE_MVIEW analyzes the query and checks to see if it is
fast refreshable. If it is already fast refreshable, the procedure will return a message
saying "the materialized view is already optimal and cannot be further tuned".
Otherwise, the TUNE_MVIEW procedure will start the tuning work on the given
statement.
The TUNE_MVIEW procedure can generate output statements that correct the
defining query by adding extra columns, such as required aggregate columns, or
that fix the materialized view logs to achieve the FAST REFRESH goal. In the case of a
complex defining query, the TUNE_MVIEW procedure decomposes the query and
generates two or more fast refreshable materialized views, or restates the
materialized view in a way that fulfills the fast refresh requirements as much as possible.
The TUNE_MVIEW procedure supports defining queries with the following complex
query constructs:
■ Set operators (UNION, UNION ALL, MINUS, and INTERSECT)
■ COUNT DISTINCT
■ SELECT DISTINCT
■ Inline views
When the ENABLE QUERY REWRITE clause is specified, TUNE_MVIEW also fixes
the statement, using a process similar to the REFRESH FAST case, by redefining the
materialized view so that as many of the advanced forms of query rewrite as
possible are supported.
The TUNE_MVIEW procedure generates two sets of output results as executable
statements. One set of the output (IMPLEMENTATION) is for implementing
materialized views and required components such as materialized view logs or
rewrite equivalences to achieve fast refreshability and query rewritability as much as
possible. The other set of the output (UNDO) is for dropping the materialized views
and the rewrite equivalences in case you decide they are not required.
The output statements for the IMPLEMENTATION process include:
■ CREATE MATERIALIZED VIEW LOG statements: creates any missing
materialized view logs required for fast refresh.
■ ALTER MATERIALIZED VIEW LOG FORCE statements: fixes any materialized
view log related requirements such as missing filter columns, sequence, and so
on, required for fast refresh.
■ One or more CREATE MATERIALIZED VIEW statements: In the case of one output
statement, the original defining query is directly restated and transformed. A
simple query transformation could be just adding required columns; for
example, adding a rowid column for a materialized join view, or adding an
aggregate column for a materialized aggregate view. In the case of
decomposition, multiple CREATE MATERIALIZED VIEW statements are generated
and form a nested materialized view hierarchy in which one or more
submaterialized views are referenced by a new top-level materialized view
modified from the original statement. This is to achieve fast refresh and query
rewrite as much as possible. Submaterialized views are often fast refreshable.
■ BUILD_SAFE_REWRITE_EQUIVALENCE statement: enables rewrite of the
top-level materialized view using the submaterialized views. It is required to
enable query rewrite when a decomposition occurs.
Note that the decomposition result implies no sharing of submaterialized views.
That is, in the case of decomposition, the TUNE_MVIEW output will always contain
new submaterialized views; it will not reference existing materialized views.
The output statements for the UNDO process include:
■ DROP MATERIALIZED VIEW statements to reverse the materialized view
creations (including submaterialized views) in the IMPLEMENTATION process.
■ DROP_REWRITE_EQUIVALENCE statement to remove the rewrite equivalence
relationship built in the IMPLEMENTATION process if needed.
Note that the UNDO process does not include statements to drop materialized view
logs. This is because materialized view logs can be shared by many different
materialized views, some of which may reside on remote Oracle instances.
Now generate both the implementation and undo scripts and place them in
/tmp/script_dir/mv_create.sql and /tmp/script_dir/mv_undo.sql,
respectively.
EXECUTE DBMS_ADVISOR.CREATE_FILE(DBMS_ADVISOR.GET_TASK_SCRIPT(:task_name),-
'TUNE_RESULTS', 'mv_create.sql');
EXECUTE DBMS_ADVISOR.CREATE_FILE(DBMS_ADVISOR.GET_TASK_SCRIPT(:task_name, -
'UNDO'), 'TUNE_RESULTS', 'mv_undo.sql');
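These calls assume that a directory object named TUNE_RESULTS already maps to
/tmp/script_dir; a minimal sketch of that setup (the grant shown is illustrative):
CREATE DIRECTORY TUNE_RESULTS AS '/tmp/script_dir';
GRANT READ, WRITE ON DIRECTORY TUNE_RESULTS TO PUBLIC;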
COUNT(*) M3
FROM SH.SALES, SH.CUSTOMERS
WHERE SH.CUSTOMERS.CUST_ID = SH.SALES.CUST_ID
GROUP BY SH.SALES.PROD_ID, SH.CUSTOMERS.CUST_ID;
The original defining query of cust_mv has been modified by adding aggregate
columns in order to be fast refreshable.
EXECUTE DBMS_ADVISOR.CREATE_FILE(DBMS_ADVISOR.GET_TASK_SCRIPT(:task_cust_mv), -
'TUNE_RESULTS', 'mv_create.sql');
The materialized view defining query contains a UNION set operator and does not
support general query rewrite. In order to support general query rewrite, the
MATERIALIZED VIEW defining query will be decomposed.
The projected output for the IMPLEMENTATION statement will be created along
with materialized view log statements and two submaterialized views as follows:
CREATE MATERIALIZED VIEW LOG ON "SH"."SALES"
WITH ROWID, SEQUENCE("CUST_ID")
INCLUDING NEW VALUES;
BEGIN
DBMS_ADVANCED_REWRITE.BUILD_SAFE_REWRITE_EQUIVALENCE ('SH.CUST_MV$RWEQ',
'SELECT s.prod_id, s.cust_id, COUNT(*) cnt,
SUM(s.amount_sold) sum_amount
FROM sales s, customers cs, countries cn
WHERE s.cust_id = cs.cust_id AND cs.country_id = cn.country_id
AND cn.country_name IN (''USA'',''Canada'')
GROUP BY s.prod_id, s.cust_id
UNION
SELECT s.prod_id, s.cust_id, COUNT(*) cnt,
SUM(s.amount_sold) sum_amount
FROM sales s, customers cs
WHERE s.cust_id = cs.cust_id AND s.cust_id IN (1005,1010,1012)
GROUP BY s.prod_id, s.cust_id',
'(SELECT "CUST_MV$SUB2"."C3" "PROD_ID","CUST_MV$SUB2"."C2" "CUST_ID",
SUM("CUST_MV$SUB2"."M3") "CNT",
SUM("CUST_MV$SUB2"."M1") "SUM_AMOUNT"
FROM "SH"."CUST_MV$SUB2" "CUST_MV$SUB2"
GROUP BY "CUST_MV$SUB2"."C3","CUST_MV$SUB2"."C2")
UNION
(SELECT "CUST_MV$SUB1"."C2" "PROD_ID","CUST_MV$SUB1"."C1" "CUST_ID",
"CUST_MV$SUB1"."M3" "CNT","CUST_MV$SUB1"."M1" "SUM_AMOUNT"
FROM "SH"."CUST_MV$SUB1" "CUST_MV$SUB1")',-1553577441)
END;
/;
The original defining query of cust_mv has been decomposed into two
submaterialized views seen as cust_mv$SUB1 and cust_mv$SUB2. One
additional column count(amount_sold) has been added in cust_mv$SUB1 to
make that materialized view fast refreshable.
The original defining query of cust_mv has been modified to query the two
submaterialized views instead where both submaterialized views are fast
refreshable and support general query rewrite.
The required materialized view logs are added to enable fast refresh of the
submaterialized views. Note that for each detail table, two materialized view log
statements are generated: one is a CREATE MATERIALIZED VIEW LOG statement and
the other is an ALTER MATERIALIZED VIEW LOG FORCE statement. This is to ensure
that the CREATE script can be run multiple times.
The BUILD_SAFE_REWRITE_EQUIVALENCE statement connects the original
defining query to the defining query of the new top-level materialized view. It
ensures that query rewrite will make use of the new top-level materialized view to
answer the query.
The materialized view defining query contains a UNION set operator, so the
materialized view itself is not fast refreshable. However, the two subselect queries in
the materialized view defining query can be combined into one single query.
The projected output for the CREATE statement will be created with an optimized
submaterialized view combining the two subselect queries, and the submaterialized
view is referenced by a new top-level materialized view as follows:
CREATE MATERIALIZED VIEW LOG ON "SH"."SALES"
WITH ROWID, SEQUENCE ("PROD_ID","CUST_ID","AMOUNT_SOLD")
INCLUDING NEW VALUES
ALTER MATERIALIZED VIEW LOG FORCE ON "SH"."SALES"
ADD ROWID, SEQUENCE ("PROD_ID","CUST_ID","AMOUNT_SOLD")
INCLUDING NEW VALUES
CREATE MATERIALIZED VIEW LOG ON "SH"."CUSTOMERS"
WITH ROWID, SEQUENCE ("CUST_ID") INCLUDING NEW VALUES
ALTER MATERIALIZED VIEW LOG FORCE ON "SH"."CUSTOMERS"
ADD ROWID, SEQUENCE ("CUST_ID") INCLUDING NEW VALUES
CREATE MATERIALIZED VIEW SH.CUST_MV$SUB1
REFRESH FAST WITH ROWID
ENABLE QUERY REWRITE AS
SELECT SH.SALES.CUST_ID C1, SH.SALES.PROD_ID C2,
SUM("SH"."SALES"."AMOUNT_SOLD") M1,
COUNT("SH"."SALES"."AMOUNT_SOLD")M2, COUNT(*) M3
FROM SH.CUSTOMERS, SH.SALES
WHERE SH.SALES.CUST_ID = SH.CUSTOMERS.CUST_ID AND
(SH.SALES.CUST_ID IN (2005, 1020, 1012, 1010, 1005))
GROUP BY SH.SALES.CUST_ID, SH.SALES.PROD_ID
CREATE MATERIALIZED VIEW SH.CUST_MV
REFRESH FORCE WITH ROWID ENABLE QUERY REWRITE AS
(SELECT "CUST_MV$SUB1"."C2" "PROD_ID","CUST_MV$SUB1"."C1" "CUST_ID",
"CUST_MV$SUB1"."M3" "CNT","CUST_MV$SUB1"."M1" "SUM_AMOUNT"
FROM "SH"."CUST_MV$SUB1" "CUST_MV$SUB1"
WHERE "CUST_MV$SUB1"."C1"=2005 OR "CUST_MV$SUB1"."C1"=1020)
UNION
(SELECT "CUST_MV$SUB1"."C2" "PROD_ID","CUST_MV$SUB1"."C1" "CUST_ID",
"CUST_MV$SUB1"."M3" "CNT","CUST_MV$SUB1"."M1" "SUM_AMOUNT"
DBMS_ADVANCED_REWRITE.BUILD_SAFE_REWRITE_EQUIVALENCE ('SH.CUST_MV$RWEQ',
'SELECT s.prod_id, s.cust_id, COUNT(*) cnt,
SUM(s.amount_sold) sum_amount
FROM sales s, customers cs
WHERE s.cust_id = cs.cust_id AND s.cust_id in (2005,1020)
GROUP BY s.prod_id, s.cust_id UNION
SELECT s.prod_id, s.cust_id, COUNT(*) cnt,
SUM(s.amount_sold) sum_amount
FROM sales s, customers cs
WHERE s.cust_id = cs.cust_id AND s.cust_id IN (1005,1010,1012)
GROUP BY s.prod_id, s.cust_id',
'(SELECT "CUST_MV$SUB1"."C2" "PROD_ID",
"CUST_MV$SUB1"."C1" "CUST_ID",
"CUST_MV$SUB1"."M3" "CNT","CUST_MV$SUB1"."M1" "SUM_AMOUNT"
FROM "SH"."CUST_MV$SUB1" "CUST_MV$SUB1"
WHERE "CUST_MV$SUB1"."C1"=2005OR "CUST_MV$SUB1"."C1"=1020)
UNION
(SELECT "CUST_MV$SUB1"."C2" "PROD_ID",
"CUST_MV$SUB1"."C1" "CUST_ID",
"CUST_MV$SUB1"."M3" "CNT","CUST_MV$SUB1"."M1" "SUM_AMOUNT"
FROM "SH"."CUST_MV$SUB1" "CUST_MV$SUB1"
WHERE "CUST_MV$SUB1"."C1"=1012 OR "CUST_MV$SUB1"."C1"=1010 OR
"CUST_MV$SUB1"."C1"=1005)',
1811223110)
The original defining query of cust_mv has been optimized by combining the
predicates of the two subselect queries in the submaterialized view CUST_MV$SUB1.
The required materialized view logs are also added to enable fast refresh of the
submaterialized views.
This section deals with ways to improve your data warehouse's performance, and
contains the following chapters:
■ Chapter 18, "Query Rewrite"
■ Chapter 19, "Schema Modeling Techniques"
■ Chapter 20, "SQL for Aggregation in Data Warehouses"
■ Chapter 21, "SQL for Analysis and Reporting"
■ Chapter 22, "SQL for Modeling"
■ Chapter 23, "OLAP and Data Mining"
■ Chapter 24, "Using Parallel Execution"
18
Query Rewrite
Cost-Based Rewrite
Query rewrite is available with cost-based optimization. Oracle Database optimizes
the input query with and without rewrite and selects the least costly alternative.
The optimizer rewrites a query by rewriting one or more query blocks, one at a
time.
If query rewrite has a choice between several materialized views to rewrite a query
block, it selects the one that results in reading the least amount of data.
After a materialized view has been selected for a rewrite, the optimizer then tests
whether the rewritten query can be rewritten further with other materialized views.
This process continues until no further rewrites are possible. Then the rewritten
query is optimized and the original query is optimized. The optimizer compares
these two optimizations and selects the least costly alternative.
Because optimization is based on cost, it is important to collect statistics both on
tables involved in the query and on the tables representing materialized views.
Statistics are fundamental measures, such as the number of rows in a table, that are
used to calculate the cost of a rewritten query. They are created by using the DBMS_
STATS package.
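For example, a minimal sketch of gathering statistics on a detail table and on a
materialized view container table used in this chapter's examples (statistics options
left at their defaults):
EXECUTE DBMS_STATS.GATHER_TABLE_STATS('SH', 'SALES');
EXECUTE DBMS_STATS.GATHER_TABLE_STATS('SH', 'SUM_SALES_PSCAT_MONTH_CITY_MV');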
Queries that contain inline or named views are also candidates for query rewrite.
When a query contains a named view, the view name is used to do the matching
between a materialized view and the query. When a query contains an inline view,
the inline view can be merged into the query before matching between a
materialized view and the query occurs.
In addition, if the inline view's text definition exactly matches with that of an inline
view present in any eligible materialized view, general rewrite may be possible.
This is because, whenever a materialized view contains exactly identical inline view
text to the one present in a query, query rewrite treats such an inline view as a
named view or a table.
Figure 18–1 presents a graphical view of the cost-based approach used during the
rewrite process.
[Figure 18–1: the user's SQL is optimized by Oracle both with rewrite (generating a
rewrite plan) and without rewrite (generating a plan); the optimizer chooses between
the two based on cost and executes the chosen plan.]
■ Either all or part of the results requested by the query must be obtainable from
the precomputed result stored in the materialized view or views.
To determine this, the optimizer may depend on some of the data relationships
declared by the user using constraints and dimensions. Such data relationships
include hierarchies, referential integrity, and uniqueness of key data, and so on.
The NOREWRITE hint disables query rewrite in a SQL statement, overriding the
QUERY_REWRITE_ENABLED parameter, and the REWRITE hint (when used with
mv_name) restricts the eligible materialized views to those named in the hint.
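As a sketch of both hints (the materialized view named in the REWRITE hint is taken
from a later example in this chapter):
SELECT /*+ NOREWRITE */ p.prod_subcategory, SUM(s.amount_sold)
FROM sales s, products p
WHERE s.prod_id = p.prod_id
GROUP BY p.prod_subcategory;

SELECT /*+ REWRITE(sum_sales_pscat_month_city_mv) */
       p.prod_subcategory, SUM(s.amount_sold)
FROM sales s, products p
WHERE s.prod_id = p.prod_id
GROUP BY p.prod_subcategory;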
You can use TUNE_MVIEW to optimize a CREATE MATERIALIZED VIEW statement to
enable general QUERY REWRITE. This procedure is described in "Tuning
Materialized Views for Fast Refresh and Query Rewrite" on page 17-47.
You can set the level of query rewrite for a session, thus allowing different users to
work at different integrity levels. The possible statements are:
ALTER SESSION SET QUERY_REWRITE_INTEGRITY = STALE_TOLERATED;
ALTER SESSION SET QUERY_REWRITE_INTEGRITY = TRUSTED;
ALTER SESSION SET QUERY_REWRITE_INTEGRITY = ENFORCED;
pending following bulk load or DML operations to one or more detail tables of
a materialized view. At some data warehouse sites, this situation is desirable
because it is not uncommon for some materialized views to be refreshed at
certain time intervals.
■ The relationships implied by the dimension objects are invalid. For example,
values at a certain level in a hierarchy do not roll up to exactly one parent value.
■ The values stored in a prebuilt materialized view table might be incorrect.
■ A wrong answer can occur because of bad data relationships defined by
unenforced table or view constraints.
Note that the scope of a rewrite hint is a query block. If a SQL statement consists of
several query blocks (SELECT clauses), you need to specify a rewrite hint on each
query block to control the rewrite for the entire statement.
Using the REWRITE_OR_ERROR hint in a query causes the following error if the
query fails to rewrite:
ORA-30393: a query block in the statement did not rewrite
For example, the following query issues the ORA-30393 error when there are no
suitable materialized views for query rewrite to use:
SELECT /*+ REWRITE_OR_ERROR */ p.prod_subcategory, SUM(s.amount_sold)
FROM sales s, products p WHERE s.prod_id = p.prod_id
GROUP BY p.prod_subcategory;
The following illustrates a statistics collection for all newly created objects without
statistics:
EXECUTE DBMS_STATS.GATHER_SCHEMA_STATS ( 'SH', -
  options => 'GATHER EMPTY');
In full text match rewrite, the text of a query is compared with the text of a
materialized view definition (that is, the entire SELECT expression), ignoring the
white space during text comparison. Given the following query:
SELECT p.prod_subcategory, t.calendar_month_desc, c.cust_city,
SUM(s.amount_sold) AS sum_amount_sold,
COUNT(s.amount_sold) AS count_amount_sold
FROM sales s, products p, times t, customers c
WHERE s.time_id=t.time_id
AND s.prod_id=p.prod_id
AND s.cust_id=c.cust_id
GROUP BY p.prod_subcategory, t.calendar_month_desc, c.cust_city;
When full text match fails, the optimizer then attempts a partial text match. In this
method, the text starting from the FROM clause of a query is compared against the
text starting with the FROM clause of a materialized view definition. Therefore, the
following query can be rewritten:
SELECT p.prod_subcategory, t.calendar_month_desc, c.cust_city,
AVG(s.amount_sold)
FROM sales s, products p, times t, customers c
WHERE s.time_id=t.time_id AND s.prod_id=p.prod_id
AND s.cust_id=c.cust_id
GROUP BY p.prod_subcategory, t.calendar_month_desc, c.cust_city;
Note that, under the partial text match rewrite method, the average of sales
aggregate required by the query is computed using the sum of sales and count of
sales aggregates stored in the materialized view.
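For illustration only, assuming the SUM and COUNT aggregates are stored under the
names sum_amount_sold and count_amount_sold in a materialized view called
sum_sales_pscat_month_city_mv (as in a later example), the rewritten form of
the AVG query might be:
SELECT mv.prod_subcategory, mv.calendar_month_desc, mv.cust_city,
  mv.sum_amount_sold/mv.count_amount_sold AS avg_amount_sold
FROM sum_sales_pscat_month_city_mv mv;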
When neither text match succeeds, the optimizer uses a general query rewrite
method.
Table 18–1 (Cont.) Dimension and Constraint Requirements for Query Rewrite

Rewrite Checks              Dimensions       Primary Key/Foreign Key/Not Null Constraints
Join Back                   Required OR      Required
Rollup Using a Dimension    Required         Not Required
Aggregate Rollup            Not Required     Not Required
Compute Aggregates          Not Required     Not Required
Join Back
If some column data requested by a query cannot be obtained from a materialized
view, the optimizer further determines if it can be obtained based on a data
relationship called a functional dependency. When the data in a column can
determine data in another column, such a relationship is called a functional
dependency or functional determinance. For example, if a table contains a primary
key column called prod_id and another column called prod_name, then, given a
prod_id value, it is possible to look up the corresponding prod_name. The
opposite is not true, which means a prod_name value need not relate to a unique
prod_id.
When the column data required by a query is not available from a materialized
view, such column data can still be obtained by joining the materialized view back
to the table that contains required column data provided the materialized view
contains a key that functionally determines the required column data. For example,
consider the following query:
SELECT p.prod_category, t.week_ending_day, SUM(s.amount_sold)
FROM sales s, products p, times t
WHERE s.time_id=t.time_id AND s.prod_id=p.prod_id AND p.prod_category='CD'
GROUP BY p.prod_category, t.week_ending_day;
Here the products table is called a joinback table because it was originally joined
in the materialized view but joined again in the rewritten query.
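For illustration only, if the materialized view stored SUM(s.amount_sold) grouped
by s.prod_id and t.week_ending_day (call it sum_sales_prod_week_mv, a
hypothetical name), the rewritten query might join back to products to obtain
prod_category:
SELECT p.prod_category, mv.week_ending_day, SUM(mv.sum_amount_sold)
FROM sum_sales_prod_week_mv mv, products p
WHERE mv.prod_id = p.prod_id AND p.prod_category = 'CD'
GROUP BY p.prod_category, mv.week_ending_day;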
You can declare functional dependency in two ways:
■ Using the primary key constraint (as shown in the previous example)
■ Using the DETERMINES clause of a dimension
The DETERMINES clause of a dimension definition might be the only way you could
declare functional dependency when the column that determines another column
cannot be a primary key. For example, the products table is a denormalized
dimension table that has columns prod_id, prod_name, prod_subcategory, and
prod_category, where prod_subcategory functionally determines
prod_subcategory_desc and prod_category determines prod_category_desc.
The first functional dependency can be established by declaring prod_id as the
primary key, but not the second functional dependency because the prod_
subcategory column contains duplicate values. In this situation, you can use the
DETERMINES clause of a dimension to declare the second functional dependency.
The following dimension definition illustrates how functional dependencies are
declared:
CREATE DIMENSION products_dim
LEVEL product IS (products.prod_id)
LEVEL subcategory IS (products.prod_subcategory)
LEVEL category IS (products.prod_category)
HIERARCHY prod_rollup (
product CHILD OF
subcategory CHILD OF
category
)
ATTRIBUTE product DETERMINES products.prod_name
ATTRIBUTE product DETERMINES products.prod_desc
ATTRIBUTE subcategory DETERMINES products.prod_subcategory_desc
ATTRIBUTE category DETERMINES products.prod_category_desc;
The hierarchy prod_rollup declares hierarchical relationships that are also 1:n
functional dependencies. The 1:1 functional dependencies are declared using the
DETERMINES clause, as seen when prod_subcategory functionally determines
prod_subcategory_desc.
Consider the following query:
SELECT p.prod_subcategory_desc, t.week_ending_day, SUM(s.amount_sold)
FROM sales s, products p, times t
Compute Aggregates
Query rewrite can also occur when the optimizer determines if the aggregates
requested by a query can be derived or computed from one or more aggregates
stored in a materialized view. For example, if a query requests AVG(X) and a
materialized view contains SUM(X) and COUNT(X), then AVG(X) can be computed
as SUM(X)/COUNT(X).
In addition, if it is determined that the rollup of aggregates stored in a materialized
view is required, then, if it is possible, query rewrite also rolls up each aggregate
requested by the query using aggregates in the materialized view.
For example, SUM(sales) at the city level can be rolled up to SUM(sales) at the
state level by summing all SUM(sales) aggregates in a group with the same state
value. However, AVG(sales) cannot be rolled up to a coarser level unless
COUNT(sales) is also available in the materialized view. Similarly,
VARIANCE(sales) or STDDEV(sales) cannot be rolled up unless
COUNT(sales) and SUM(sales) are also available in the materialized view. For
example, consider the following query:
ALTER TABLE times MODIFY CONSTRAINT time_pk RELY;
ALTER TABLE customers MODIFY CONSTRAINT customers_pk RELY;
ALTER TABLE sales MODIFY CONSTRAINT sales_time_fk RELY;
ALTER TABLE sales MODIFY CONSTRAINT sales_customer_fk RELY;
SELECT p.prod_subcategory, AVG(s.amount_sold) AS avg_sales
FROM sales s, products p WHERE s.prod_id = p.prod_id
GROUP BY p.prod_subcategory;
Because the query requests aggregates at a coarser grouping level than the
materialized view sum_sales_pscat_month_city_mv stores, the data in the
materialized view will have to be rolled up. The optimizer rewrites the query as
the following:
SELECT mv.prod_subcategory, SUM(mv.sum_amount_sold)/COUNT(mv.count_amount_sold)
AS avg_sales
FROM sum_sales_pscat_month_city_mv mv
GROUP BY mv.prod_subcategory;
Query Rewrite Definitions Before describing what is possible when query rewrite
works with filtered data, the following definitions are useful:
■ join relop
Is one of the following (=, <, <=, >, >=)
■ selection relop
Is one of the following (=, <, <=, >, >=, !=, [NOT] BETWEEN | IN | LIKE | NULL)
■ join predicate
Is of the form (column1 join relop column2), where columns are from different
tables within the same FROM clause in the current query block. So, for example,
an outer reference is not possible.
■ selection predicate
Is of the form LHS-expression relop RHS-expression, where LHS means left-hand
side and RHS means right-hand side. All non-join predicates are selection
predicates. The left-hand side usually contains a column and the right-hand
side contains the values. For example, color='red' means the left-hand side
is color and the right-hand side is 'red' and the relational operator is (=).
■ LHS-constrained
When comparing a selection from the query with a selection from the
materialized view, if the left-hand side of both selections match, the selections
are said to be LHS-constrained or just constrained for short.
■ RHS-constrained
When comparing a selection from the query with a selection from the
materialized view, if the right-hand side of both selections match, the selections
are said to be RHS-constrained or just constrained. Note that before comparing
the selections, the LHS/RHS-expression is converted to a canonical form and
then the comparison is done. This means that expressions such as column1 + 5
and 5 + column1 will match and be constrained.
WHERE Clause Guidelines Although query rewrite on filtered data does not restrict
the general form of the WHERE clause, there is an optimal pattern and, normally,
most queries fall into this pattern as follows:
(join predicate AND join predicate AND ....) AND
(selection predicate AND|OR selection predicate .... )
If the WHERE clause has an OR at the top, then the optimizer first checks for common
predicates under the OR. If found, the common predicates are factored out from
under the OR, then joined with an AND back to the OR. This helps to put the WHERE
clause into the optimal pattern. This is done only if OR occurs at the top of the
WHERE clause. For example, if the WHERE clause is the following:
(sales.prod_id = prod.prod_id AND prod.prod_name = 'Kids Polo Shirt')
OR (sales.prod_id = prod.prod_id AND prod.prod_name = 'Kids Shorts')
then the common predicate sales.prod_id = prod.prod_id is factored out from
under the OR, giving:
(sales.prod_id = prod.prod_id) AND
(prod.prod_name = 'Kids Polo Shirt' OR prod.prod_name = 'Kids Shorts')
This puts the WHERE clause into the optimal pattern.
When comparing a selection from the query with a selection from the materialized
view, the left-hand side of both selections are compared and if they match they are
said to be LHS-constrained or constrained for short.
If the selections are constrained, then the right-hand side values are checked for
containment. That is, the RHS values of the query selection must be contained by
right-hand side values of the materialized view selection.
Examples of Query Rewrite Selection Here are a number of examples showing how
query rewrite can still occur when the data is being filtered.
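The example queries and materialized views themselves are not reproduced here; as
a minimal illustration of the kind of selections being compared (predicates invented
for this sketch):
-- selection in the materialized view definition
WHERE prod_id BETWEEN 10 AND 200
-- selection in the query
WHERE prod_id = 102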
Then, the selections are constrained on prod_id and the right-hand side value of
the query 102 is within the range of the materialized view, so query rewrite is
possible.
Then, the selections are constrained on prod_id and the query range is within the
materialized view range. In this example, notice that both query selections are
constrained by the same materialized view selection.
If the left-hand side and the right-hand side are constrained and the selection_relop is
the same, then the selection can usually be dropped from the rewritten query.
Otherwise, the selection must be kept to filter out extra data from the materialized
view.
If query rewrite can drop the selection from the rewritten query, then all the
columns from the selection may not have to be in the materialized view, so more
rewrites can be done. This check ensures that the materialized view data is not more
restrictive than the query.
the materialized view selects prod_name or selects a column that can be joined
back to the detail table to get prod_name, then the query rewrite is possible.
Then, the materialized view selection with prod_name is not constrained. The
materialized view is more restrictive than the query because it only contains the
product Shorts; therefore, query rewrite will not occur.
Then, the materialized view IN-lists are constrained by the columns in the query
multi-column IN-list. Furthermore, the right-hand side values of the query selection
are contained by the materialized view so that rewrite will occur.
Then, the materialized view IN-list columns are fully constrained by the columns in
the query selections. Furthermore, the right-hand side values of the query selection
are contained by the materialized view. So rewrite succeeds.
Then, the query has a single disjunct (group of selections separated by AND) and the
materialized view has two disjuncts separated by OR. The query disjunct is
contained by the second materialized view disjunct so selection compatibility
succeeds. It is clear that the materialized view contains more data than needed by
the query so the query can be rewritten.
Because the predicate s.cust_id = 10 selects the same data in the query and in
the materialized view, it is dropped from the rewritten query. This means the
rewritten query is the following:
SELECT mv.calendar_month_desc, mv.dollars FROM cal_month_sales_id_mv mv;
You can also use expressions in selection predicates. This process resembles the
following:
expression relational operator constant
Where expression can be any arbitrary arithmetic expression allowed by the Oracle
Database. The expression in the materialized view and the query must match.
Oracle attempts to discern expressions that are logically equivalent, such as A+B
and B+A, and will always recognize identical expressions as being equivalent.
You can also use queries with an expression on both sides of the operator or
user-defined functions as operators. Query rewrite occurs when the complex
predicate in the materialized view and the query are logically equivalent. This
means that, unlike exact text match, terms could be in a different order and rewrite
can still occur, as long as the expressions are equivalent.
In addition, selection predicates can be joined with an AND operator in a query and
the query can still be rewritten to use a materialized view as long as every
restriction on the data selected by the query is matched by a restriction in the
definition of the materialized view. Again, this does not mean an exact text match,
but that the restrictions on the data selected must be a logical match. Also, the query
may be more restrictive in its selection of data and still be eligible, but it can never
be less restrictive than the definition of the materialized view and still be eligible for
rewrite. For example, given the preceding materialized view definition, a query
such as the following can be rewritten:
SELECT p.promo_name, SUM(s.amount_sold)
FROM promotions p, sales s
WHERE s.promo_id = p.promo_id AND promo_name = 'coupon'
GROUP BY promo_name
HAVING SUM(s.amount_sold) > 1000;
In this case, the query is more restrictive than the definition of the materialized
view, so rewrite can occur. However, if the query had selected promo_category,
then it could not have been rewritten against the materialized view, because the
materialized view definition does not contain that column.
For another example, if the definition of a materialized view restricts a city name
column to Boston, then a query that selects Seattle as a value for this column can
never be rewritten with that materialized view, but a query that restricts city name
to Boston and restricts a column value that is not restricted in the materialized view
could be rewritten to use the materialized view.
All the rules noted previously also apply when predicates are combined with an OR
operator. The simple predicates, or simple predicates connected by OR operators,
are considered separately. Each predicate in the query must be contained in the
materialized view if rewrite is to occur. For example, the query could have a
restriction such as city='Boston' OR city ='Seattle' and to be eligible for
rewrite, the materialized view that the query might be rewritten against must have
the same restriction. In fact, the materialized view could have additional
restrictions, such as city='Boston' OR city='Seattle' OR
city='Cleveland' and rewrite might still be possible.
Note, however, that the reverse is not true. If the query had the restriction city =
'Boston' OR city='Seattle' OR city='Cleveland' and the materialized
view only had the restriction city='Boston' OR city='Seattle', then rewrite
would not be possible because, with a single materialized view, the query seeks
more data than is contained in the restricted subset of data stored in the
materialized view.
Materialized
view join
graph
customers products times
sales
Common MV
subgraph delta
Common Joins The common join pairs between the two must be of the same type, or
the join in the query must be derivable from the join in the materialized view. For
example, if a materialized view contains an outer join of table A with table B, and a
query contains an inner join of table A with table B, the result of the inner join can
be derived by filtering the antijoin rows from the result of the outer join. For
example, consider the following query:
SELECT p.prod_name, t.week_ending_day, SUM(amount_sold)
FROM sales s, products p, times t
WHERE s.time_id=t.time_id AND s.prod_id = p.prod_id
AND t.week_ending_day BETWEEN TO_DATE('01-AUG-1999', 'DD-MON-YYYY')
AND TO_DATE('10-AUG-1999', 'DD-MON-YYYY')
GROUP BY prod_name, week_ending_day;
The common joins between this query and the materialized view join_sales_
time_product_mv are:
s.time_id = t.time_id AND s.prod_id = p.prod_id
In general, if you use an outer join in a materialized view containing only joins, you
should put in the materialized view either the primary key or the rowid on the right
side of the outer join. For example, in the previous example, join_sales_time_
product_oj_mv, there is a primary key on both sales and products.
Another example of when a materialized view containing only joins is used is the
case of semijoin rewrites. That is, a query contains either an EXISTS or an IN
subquery with a single table. Consider the following query, which reports the
products that had sales greater than $1,000:
SELECT DISTINCT prod_name
FROM products p
WHERE EXISTS (SELECT * FROM sales s
WHERE p.prod_id=s.prod_id AND s.amount_sold > 1000);
Rewrites with semi-joins are restricted to materialized views with joins only and are
not possible for materialized views with joins and aggregates.
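Assuming the joins-only materialized view join_sales_time_product_mv used
earlier includes the prod_name and amount_sold columns, the semijoin rewrite
might resemble the following sketch:
SELECT DISTINCT mv.prod_name
FROM join_sales_time_product_mv mv
WHERE mv.amount_sold > 1000;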
Query Delta Joins A query delta join is a join that appears in the query but not in the
materialized view. Any number and type of delta joins in a query are allowed and
they are simply retained when the query is rewritten with a materialized view. In
order for the retained join to work, the materialized view must contain the joining
key. Upon rewrite, the materialized view is joined to the appropriate tables in the
query delta. For example, consider the following query:
SELECT p.prod_name, t.week_ending_day, c.cust_city, SUM(s.amount_sold)
FROM sales s, products p, times t, customers c
WHERE s.time_id=t.time_id AND s.prod_id = p.prod_id
AND s.cust_id = c.cust_id
GROUP BY prod_name, week_ending_day, cust_city;
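Assuming join_sales_time_product_mv also carries the cust_id joining key
(an assumption made for this sketch), the rewritten query would retain the delta join
to customers roughly as follows:
SELECT mv.prod_name, mv.week_ending_day, c.cust_city, SUM(mv.amount_sold)
FROM join_sales_time_product_mv mv, customers c
WHERE mv.cust_id = c.cust_id
GROUP BY mv.prod_name, mv.week_ending_day, c.cust_city;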
Materialized View Delta Joins A materialized view delta join is a join that appears in
the materialized view but not the query. All delta joins in a materialized view are
required to be lossless with respect to the result of common joins. A lossless join
guarantees that the result of common joins is not restricted. A lossless join is one
where, if two tables called A and B are joined together, rows in table A will always
match with rows in table B and no data will be lost, hence the term lossless join. For
example, every row with the foreign key matches a row with a primary key
provided no nulls are allowed in the foreign key. Therefore, to guarantee a lossless
join, it is necessary to have FOREIGN KEY, PRIMARY KEY, and NOT NULL constraints
on appropriate join keys. Alternatively, if the join between tables A and B is an outer
join (A being the outer table), it is lossless as it preserves all rows of table A.
All delta joins in a materialized view are required to be non-duplicating with
respect to the result of common joins. A non-duplicating join guarantees that the
result of common joins is not duplicated. For example, a non-duplicating join is one
where, if table A and table B are joined together, rows in table A will match with at
most one row in table B and no duplication occurs. To guarantee a non-duplicating
join, the key in table B must be constrained to unique values by using a primary key
or unique constraint.
Consider the following query that joins sales and times:
SELECT t.week_ending_day, SUM(s.amount_sold)
FROM sales s, times t
WHERE s.time_id = t.time_id AND t.week_ending_day BETWEEN TO_DATE
('01-AUG-1999', 'DD-MON-YYYY') AND TO_DATE('10-AUG-1999', 'DD-MON-YYYY')
GROUP BY week_ending_day;
The query can also be rewritten with the materialized view
join_sales_time_product_oj_mv, where foreign key constraints are not needed.
This view contains an outer join (s.prod_id=p.prod_id(+)) between sales and
products. This makes the join lossless. If p.prod_id is a primary key, then the
non-duplicating condition is satisfied as well and the optimizer rewrites the query
as follows:
SELECT week_ending_day, SUM(amount_sold)
FROM join_sales_time_product_oj_mv
WHERE week_ending_day BETWEEN TO_DATE('01-AUG-1999', 'DD-MON-YYYY')
  AND TO_DATE('10-AUG-1999', 'DD-MON-YYYY')
GROUP BY week_ending_day;
column A.X in a query with column B.X in a materialized view or vice versa. For
example, consider the following query:
SELECT p.prod_name, s.time_id, t.week_ending_day, SUM(s.amount_sold)
FROM sales s, products p, times t
WHERE s.time_id=t.time_id AND s.prod_id = p.prod_id
GROUP BY p.prod_name, s.time_id, t.week_ending_day;
Also suppose that new data will be inserted for December 2000, which will be
assigned to partition sales_q4_2000. For testing purposes, you can apply an
arbitrary DML operation on sales that changes a partition other than
sales_q1_2000, because the following query requests data only from that
partition while the materialized view is fresh. For example, the following:
INSERT INTO SALES VALUES(17, 10, '01-DEC-2000', 4, 380, 123.45, 54321);
Until a refresh is done, the materialized view is generically stale and cannot be used
for unlimited rewrite in enforced mode. However, because the table sales is
partitioned and not all partitions have been modified, Oracle can identify all
partitions that have not been touched. The optimizer can identify the fresh rows in
the materialized view (the data which is unaffected by updates since the last refresh
operation) by implicitly adding selection predicates to the materialized view
defining query as follows:
SELECT s.time_id, p.prod_subcategory, c.cust_city,
SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c
WHERE s.cust_id = c.cust_id AND s.prod_id = p.prod_id
AND (s.time_id < TO_DATE('01-OCT-2000','DD-MON-YYYY')
OR s.time_id >= TO_DATE('01-JAN-2001','DD-MON-YYYY'))
GROUP BY time_id, prod_subcategory, cust_city;
Oracle Database knows that those ranges of rows in the materialized view are fresh
and can therefore rewrite the query with the materialized view. The rewritten query
looks as follows:
SELECT time_id, prod_subcategory, cust_city, sum_amount_sold
FROM sum_sales_per_city_mv
WHERE time_id BETWEEN TO_DATE('01-JAN-2000', 'DD-MON-YYYY')
AND TO_DATE('01-JUL-2000', 'DD-MON-YYYY');
Instead of the partitioning key, a partition marker (a function that identifies the
partition given a rowid) can be present in the select (and GROUP BY list) of the
materialized view. You can use the materialized view to rewrite queries that require
data from only certain partitions (identifiable by the partition-marker), for instance,
queries that have a predicate specifying ranges of the partitioning keys containing
entire partitions. See Chapter 9, "Advanced Materialized Views" for details
regarding the supplied partition marker function DBMS_MVIEW.PMARKER.
The following example illustrates the use of a partition marker in the materialized
view instead of directly using the partition key column:
CREATE MATERIALIZED VIEW sum_sales_per_city_2_mv
ENABLE QUERY REWRITE AS
SELECT DBMS_MVIEW.PMARKER(s.rowid) AS pmarker,
t.fiscal_quarter_desc, p.prod_subcategory, c.cust_city,
SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c, times t
WHERE s.cust_id = c.cust_id AND s.prod_id = p.prod_id
AND s.time_id = t.time_id
GROUP BY DBMS_MVIEW.PMARKER(s.rowid),
prod_subcategory, cust_city, fiscal_quarter_desc;
Suppose you know that the partition sales_q1_2000 is fresh and DML changes
have taken place for other partitions of the sales table. For testing purposes, you
can apply an arbitrary DML operation on sales that changes a partition other
than sales_q1_2000 while the materialized view is fresh. An example is the
following:
INSERT INTO SALES VALUES(17, 10, '01-DEC-2000', 4, 380, 123.45, 54321);
Note that rewrite with a partially stale materialized view that contains a PMARKER
function can only take place when the complete data content of one or more
partitions is accessed and the predicate condition is on the partitioned fact table
itself, as shown in the earlier example.
The DBMS_MVIEW.PMARKER function gives you exactly one distinct value for each
partition. This dramatically reduces the number of rows in a potential materialized
view compared to the partitioning key itself, but you are also giving up any
detailed information about this key. The only thing you know is the partition
number and, therefore, the lower and upper boundary values. This is the trade-off
for reducing the cardinality of the range partitioning column and thus the number
of rows.
Assuming the value of pmarker for partition sales_q1_2000 is 31070, the
previously shown queries can be rewritten against the materialized view as follows:
SELECT mv.prod_subcategory, mv.cust_city, SUM(mv.sum_amount_sold)
FROM sum_sales_per_city_2_mv mv
WHERE mv.pmarker = 31070 AND mv.cust_city= 'Nuernberg'
GROUP BY prod_subcategory, cust_city;
So the query can be rewritten against the materialized view without accessing stale
data.
Oracle first tries to rewrite it with a materialized aggregate view and finds there is
none eligible (note that the single-table aggregate materialized view
sum_sales_time_product_mv cannot yet be used), and then tries a rewrite with a
materialized join view and finds that join_sales_time_product_mv is eligible for rewrite.
The rewritten query has this form:
SELECT mv.prod_name, mv.week_ending_day, SUM(mv.amount_sold)
FROM join_sales_time_product_mv mv
GROUP BY mv.prod_name, mv.week_ending_day;
Because a rewrite occurred, Oracle tries the process again. This time, the query can
be rewritten with the single-table aggregate materialized view
sum_sales_time_product_mv into the following form:
SELECT mv.prod_name, mv.week_ending_day, mv.sum_amount_sold
FROM sum_sales_time_product_mv mv;
The term base grouping for queries with GROUP BY extensions denotes all unique
expressions present in the GROUP BY clause. In the previous query, the following
grouping (p.prod_subcategory, t.calendar_month_desc, c.cust_
city) is a base grouping.
The extensions can be present in user queries and in the queries defining
materialized views. In both cases, materialized view rewrite applies and you can
distinguish rewrite capabilities into the following scenarios:
■ Materialized View has Simple GROUP BY and Query has Extended GROUP BY
■ Materialized View has Extended GROUP BY and Query has Simple GROUP BY
■ Both Materialized View and Query Have Extended GROUP BY
Materialized View has Simple GROUP BY and Query has Extended GROUP BY When a query
contains an extended GROUP BY clause, it can be rewritten with a materialized view
if its base grouping can be rewritten using the materialized view as listed in the
rewrite rules explained in "When Does Oracle Rewrite a Query?" on page 18-4. For
example, in the following query:
SELECT p.prod_subcategory, t.calendar_month_desc, c.cust_city,
SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, customers c, products p, times t
WHERE s.time_id=t.time_id AND s.prod_id = p.prod_id AND s.cust_id = c.cust_id
GROUP BY GROUPING SETS
((p.prod_subcategory, t.calendar_month_desc),
(c.cust_city, p.prod_subcategory));
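For instance, if the base grouping (p.prod_subcategory, t.calendar_month_desc,
c.cust_city) can be rewritten using a materialized view such as
sum_sales_pscat_month_city_mv (seen earlier), the rewritten query might keep
the extended GROUP BY on top of the materialized view, as in this sketch:
SELECT mv.prod_subcategory, mv.calendar_month_desc, mv.cust_city,
  SUM(mv.sum_amount_sold) AS sum_amount_sold
FROM sum_sales_pscat_month_city_mv mv
GROUP BY GROUPING SETS
  ((mv.prod_subcategory, mv.calendar_month_desc),
   (mv.cust_city, mv.prod_subcategory));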
A special situation arises if the query uses the EXPAND_GSET_TO_UNION hint. See
"Hint for Queries with Extended GROUP BY" on page 18-44 for an example of using
EXPAND_GSET_TO_UNION.
Materialized View has Extended GROUP BY and Query has Simple GROUP BY In order for a
materialized view with an extended GROUP BY to be used for rewrite, it must satisfy
two additional conditions:
■ It must contain a grouping distinguisher, which is the GROUPING_ID function
on all GROUP BY expressions. For example, if the GROUP BY clause of the
materialized view is GROUP BY CUBE(a, b), then the SELECT list should
contain GROUPING_ID(a, b).
■ The GROUP BY clause of the materialized view should not result in any
duplicate groupings. For example, GROUP BY GROUPING SETS ((a, b),
(a, b)) would disqualify a materialized view from general rewrite.
A materialized view with an extended GROUP BY contains multiple groupings.
Oracle finds the grouping with the lowest cost from which the query can be
computed and uses that for rewrite. For example, consider the following
materialized view:
CREATE MATERIALIZED VIEW sum_grouping_set_mv
ENABLE QUERY REWRITE AS
SELECT p.prod_category, p.prod_subcategory, c.cust_state_province, c.cust_city,
GROUPING_ID(p.prod_category,p.prod_subcategory,
c.cust_state_province,c.cust_city) AS gid,
SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c
WHERE s.prod_id = p.prod_id AND s.cust_id = c.cust_id
GROUP BY GROUPING SETS
((p.prod_category, p.prod_subcategory, c.cust_city),
(p.prod_category, p.prod_subcategory, c.cust_state_province, c.cust_city),
(p.prod_category, p.prod_subcategory));
This query will be rewritten with the closest matching grouping from the
materialized view. That is, the (prod_category, prod_subcategory,
cust_city) grouping:
SELECT prod_subcategory, cust_city, SUM(sum_amount_sold) AS sum_amount_sold
FROM sum_grouping_set_mv
WHERE gid = grouping identifier of (prod_category,prod_subcategory, cust_city)
GROUP BY prod_subcategory, cust_city;
Both Materialized View and Query Have Extended GROUP BY When both materialized
view and the query contain GROUP BY extensions, Oracle uses two strategies for
rewrite: grouping match and UNION ALL rewrite. First, Oracle tries grouping match.
The groupings in the query are matched against groupings in the materialized view
and if all are matched with no rollup, Oracle selects them from the materialized
view. For example, consider the following query:
SELECT p.prod_category, p.prod_subcategory, c.cust_city,
SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c
WHERE s.prod_id = p.prod_id AND s.cust_id = c.cust_id
GROUP BY GROUPING SETS
((p.prod_category, p.prod_subcategory, c.cust_city),
(p.prod_category, p.prod_subcategory));
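Because both query groupings exactly match groupings present in
sum_grouping_set_mv, a grouping-match rewrite might, as a sketch, select the
matching groupings by their grouping identifiers:
SELECT prod_category, prod_subcategory, cust_city, sum_amount_sold
FROM sum_grouping_set_mv
WHERE gid = <grouping id of (prod_category, prod_subcategory, cust_city)>
   OR gid = <grouping id of (prod_category, prod_subcategory)>;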
If grouping match fails, Oracle tries a general rewrite mechanism called UNION ALL
rewrite. Oracle first represents the query with the extended GROUP BY clause as an
equivalent UNION ALL query. Every grouping of the original query is placed in a
separate UNION ALL branch. The branch will have a simple GROUP BY clause. For
example, consider this query:
SELECT p.prod_category, p.prod_subcategory, c.cust_state_province,
t.calendar_month_desc, SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c, times t
WHERE s.prod_id = p.prod_id AND s.cust_id = c.cust_id
GROUP BY GROUPING SETS
((p.prod_subcategory, t.calendar_month_desc),
(t.calendar_month_desc),
(p.prod_category, p.prod_subcategory, c.cust_state_province),
(p.prod_category, p.prod_subcategory));
UNION ALL
SELECT null, null, null,
t.calendar_month_desc, SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c, times t
WHERE s.prod_id = p.prod_id AND s.cust_id = c.cust_id
GROUP BY t.calendar_month_desc
UNION ALL
SELECT p.prod_category, p.prod_subcategory, c.cust_state_province,
null, SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c, times t
WHERE s.prod_id = p.prod_id AND s.cust_id = c.cust_id
GROUP BY p.prod_category, p.prod_subcategory, c.cust_state_province
UNION ALL
SELECT p.prod_category, p.prod_subcategory, null,
null, SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c, times t
WHERE s.prod_id = p.prod_id AND s.cust_id = c.cust_id
GROUP BY p.prod_category, p.prod_subcategory;
Each branch is then rewritten separately using the rules from "When Does Oracle
Rewrite a Query?" on page 18-4. Using the materialized view sum_grouping_
set_mv, Oracle can rewrite only branches three (which requires materialized view
rollup) and four (which matches the materialized view exactly). The unrewritten
branches will be converted back to the extended GROUP BY form. Thus, eventually,
the query is rewritten as:
SELECT null, p.prod_subcategory, null,
t.calendar_month_desc, SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c, times t
WHERE s.prod_id = p.prod_id AND s.cust_id = c.cust_id
GROUP BY GROUPING SETS
((p.prod_subcategory, t.calendar_month_desc),
(t.calendar_month_desc))
UNION ALL
SELECT prod_category, prod_subcategory, cust_state_province,
null, SUM(sum_amount_sold) AS sum_amount_sold
FROM sum_grouping_set_mv
WHERE gid = <grouping id of (prod_category, prod_subcategory, cust_state_province, cust_city)>
GROUP BY prod_category, prod_subcategory, cust_state_province
UNION ALL
SELECT prod_category, prod_subcategory, null,
null, sum_amount_sold
FROM sum_grouping_set_mv
WHERE gid = <grouping id of (prod_category,prod_subcategory)>
And here is the query that will be rewritten to use the materialized view:
SELECT t.calendar_month_name, t.calendar_year, p.prod_category,
SUM(X1.revenue) AS sum_revenue
FROM times t, products p,
(SELECT time_id, prod_id, amount_sold*0.2 AS revenue FROM sales) X1
WHERE t.time_id = X1.time_id AND p.prod_id = X1.prod_id
GROUP BY calendar_month_name, calendar_year, prod_category;
The following query fails the exact text match test but is rewritten because the
aliases for the table references match:
SELECT s.prod_id, t2.fiscal_week_number - t1.fiscal_week_number AS lag
FROM times t1, sales s, times t2
WHERE t1.time_id = s.time_id AND t2.time_id = s.time_id_ship;
Note that Oracle performs other checks to ensure the correct match of an instance of
a multiply instanced table in the request query with the corresponding table
instance in the materialized view. For instance, in the following example, Oracle
correctly determines that the matching alias names used for the multiple instances
of table times do not establish a match between the multiple instances of table
times in the materialized view.
The following query cannot be rewritten using sales_shipping_lag_mv, even
though the alias names of the multiply instanced table times match, because the
joins are not compatible between the instances of times aliased by t2:
SELECT s.prod_id, t2.fiscal_week_number - t1.fiscal_week_number AS lag
FROM times t1, sales s, times t2
WHERE t1.time_id = s.time_id AND t2.time_id = s.time_id_paid;
This request query joins the instance of the times table aliased by t2 on the
s.time_id_paid column, while the materialized view joins the instance of the
times table aliased by t2 on the s.time_id_ship column. Because the join
conditions differ, Oracle correctly determines that rewrite cannot occur.
The following query does not have any matching alias in the materialized view,
sales_shipping_lag_mv, for the table, times. But query rewrite will now
compare the joins between the query and the materialized view and correctly match
the multiple instances of times.
SELECT s.prod_id, x2.fiscal_week_number - x1.fiscal_week_number AS lag
FROM times x1, sales s, times x2
WHERE x1.time_id = s.time_id AND x2.time_id = s.time_id_ship;
constraints on base tables is necessary, not only for data correctness and cleanliness,
but also for materialized view query rewrite purposes using the original base
objects.
Materialized view rewrite extensively uses constraints for query rewrite. They are
used for determining lossless joins, which, in turn, determine if joins in the
materialized view are compatible with joins in the query and thus if rewrite is
possible.
DISABLE NOVALIDATE is the only valid state for a view constraint. However, you
can choose RELY or NORELY as the view constraint state to enable more
sophisticated query rewrites. For example, a view constraint in the RELY state
allows query rewrite to occur when the query integrity level is set to TRUSTED.
Table 18–2 illustrates when view constraints are used for determining lossless joins.
Note that view constraints cannot be used for query rewrite integrity level
ENFORCED. This level requires the highest degree of constraint enforcement,
ENABLE VALIDATE.
You can now establish a foreign key/primary key relationship (in RELY mode)
between the view and the fact table by adding the following constraints, and thus
rewrite will take place as described in Table 18–2. Rewrite will then work, for
example, in TRUSTED mode.
ALTER VIEW time_view ADD (CONSTRAINT time_view_pk
PRIMARY KEY (time_id) DISABLE NOVALIDATE);
ALTER VIEW time_view MODIFY CONSTRAINT time_view_pk RELY;
ALTER TABLE sales ADD (CONSTRAINT time_view_fk FOREIGN KEY (time_id)
REFERENCES time_view(time_id) DISABLE NOVALIDATE);
The following query, omitting the dimension table products, will also be rewritten
without the primary key/foreign key relationships, because the suppressed join
between sales and products is known to be lossless.
SELECT t.day_in_year, SUM(s.amount_sold) AS sum_amount_sold
FROM time_view t, sales s WHERE t.time_id = s.time_id
GROUP BY t.day_in_year;
To undo the changes you have made to the sh schema, issue the following
statements:
ALTER TABLE sales DROP CONSTRAINT time_view_fk;
DROP VIEW time_view;
an Oracle DATE. The expression matching is done based on the use of canonical
forms for the expressions.
DATE is a built-in datatype which represents ordered time units such as seconds,
days, and months, and incorporates a time hierarchy (second -> minute -> hour ->
day -> month -> quarter -> year). This hard-coded knowledge about DATE is used
in folding date ranges from lower-date granules to higher-date granules.
Specifically, folding a date value to the beginning of a month, quarter, year, or to the
end of a month, quarter, year is supported. For example, the date value
1-jan-1999 can be folded into the beginning of either year 1999 or quarter
1999-1 or month 1999-01. And, the date value 30-sep-1999 can be folded into
the end of either quarter 1999-03 or month 1999-09.
Note: Due to the way date folding works, you should be careful
when using BETWEEN and date columns. The best way to use
BETWEEN and date columns is to increment the later date by 1. In
other words, instead of using date_col BETWEEN
'1-jan-1999' AND '30-jun-1999', you should use date_
col BETWEEN '1-jan-1999' AND '1-jul-1999'. You could
also use the TRUNC function to get the equivalent result, as in
TRUNC(date_col) BETWEEN '1-jan-1999' AND
'30-jun-1999'. TRUNC will, however, strip time values.
Because date values are ordered, any range predicate specified on date columns can
be folded from lower level granules into higher level granules provided the date
range represents an integral number of higher level granules. For example, the
range predicate date_col >= '1-jan-1999' AND date_col <
'1-jul-1999' can be folded into either a month range or a quarter range using
the TO_CHAR function, which extracts specific date components from a date value.
The advantage of aggregating data by folded date values is the compression of data
achieved. Without date folding, the data is aggregated at the lowest granularity
level, resulting in increased disk space for storage and increased I/O to scan the
materialized view.
Consider a query that asks for the sum of sales by product type for the year 1998:
SELECT p.prod_category, SUM(s.amount_sold)
FROM sales s, products p
WHERE s.prod_id=p.prod_id AND s.time_id >= TO_DATE('01-jan-1998', 'dd-mon-yyyy')
AND s.time_id < TO_DATE('01-jan-1999', 'dd-mon-yyyy')
GROUP BY p.prod_category;
The range specified in the query represents an integral number of years, quarters, or
months. Assume that there is a materialized view mv3 that contains
pre-summarized sales by prod_type and is defined as follows:
CREATE MATERIALIZED VIEW mv3
ENABLE QUERY REWRITE AS
SELECT prod_type, TO_CHAR(sale_date,'yyyy-mm') AS month, SUM(sales) AS sum_sales
FROM sales, products WHERE sales.prod_id = products.prod_id
GROUP BY prod_type, TO_CHAR(sale_date, 'yyyy-mm');
The query can be rewritten by first folding the date range into the month range and
then matching the expressions representing the months with the month expression
in mv3. This rewrite is shown in two steps (first folding the date range followed by
the actual rewrite).
SELECT prod_type, SUM(sales) AS sum_sales
FROM sales, products
WHERE sales.prod_id = products.prod_id
  AND TO_CHAR(sale_date, 'yyyy-mm') >= TO_CHAR(TO_DATE('01-jan-1998', 'dd-mon-yyyy'), 'yyyy-mm')
  AND TO_CHAR(sale_date, 'yyyy-mm') < TO_CHAR(TO_DATE('01-jan-1999', 'dd-mon-yyyy'), 'yyyy-mm')
GROUP BY prod_type;
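The second step (the actual rewrite against mv3) is not reproduced here; assuming
the month column of mv3 matches the folded expressions, it might look like the
following sketch:
SELECT prod_type, SUM(sum_sales) AS sum_sales
FROM mv3
WHERE month >= TO_CHAR(TO_DATE('01-jan-1998', 'dd-mon-yyyy'), 'yyyy-mm')
  AND month < TO_CHAR(TO_DATE('01-jan-1999', 'dd-mon-yyyy'), 'yyyy-mm')
GROUP BY prod_type;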
PARTITION Europe
VALUES ('France', 'Spain', 'Ireland'))
AS SELECT t.calendar_year, t.calendar_month_number,
t.day_number_in_month, c1.country_name, s.prod_id,
s.quantity_sold, s.amount_sold
FROM times t, countries c1, sales s, customers c2
WHERE s.time_id = t.time_id and s.cust_id = c2.cust_id and
c2.country_id = c1.country_id and
c1.country_name IN ('United States of America', 'Argentina',
'Japan', 'India', 'France', 'Spain', 'Ireland');
[Figure: materialized view sales_per_country_mv with FRESHNESS regions over
table sales_par_list, determined by country_name; the products table is
marked stale.]
You have deleted rows from partition Asia in table sales_par_list. Now
sales_per_dt_partition_mv is stale, but PCT rewrite (in ENFORCED and
TRUSTED modes) is possible as this materialized view supports PCT (pmarker
based) against table sales_par_list.
Now consider the following query:
SELECT p.prod_name, SUM(s.amount_sold) AS sum_sales, COUNT(*) AS cnt
FROM sales_par_list s, products p
WHERE s.prod_id = p.prod_id AND s.calendar_year = 2001 AND
s.country_name IN ('United States of America', 'Argentina')
GROUP BY p.prod_name;
An example of a statement that does rewrite after the INSERT statement is the
following, because it accesses fresh material:
SELECT s.calendar_year, SUM(s.amount_sold) AS sum_sales, COUNT(*) AS cnt
FROM sales_par_range_list s
WHERE s.calendar_year = 2000 AND s.calendar_month_number BETWEEN 2 AND 6
GROUP BY s.calendar_year;
Figure 18–4 offers a graphical illustration of what is stale and what is fresh.
[Figure 18–4: FRESHNESS regions of sum_sales_per_year_month_mv over table
sales_par_range_list, determined by calendar_month_number; the q1 Asia
region (calendar_month_number < 4) is stale because it was updated, while the
America and Europe regions of q1 and all regions of q2 through q4 are fresh.]
All the limitations that apply to pmarker rewrite will apply here as well. The
incoming query should access a whole partition for the query to be rewritten. The
following pmarker table is used in this case:
product_par_list pmarker value
---------------- -------------
prod_cat1 1000
prod_cat2 1001
prod_cat3 1002
Consider the following query, which has a user bind variable, :user_id, in its
WHERE clause:
SELECT CUST_ID, PROD_ID, SUM(AMOUNT_SOLD) AS SUM_AMOUNT
FROM SALES WHERE CUST_ID > :user_id
GROUP BY CUST_ID, PROD_ID;
Because the materialized view, customer_mv, has a selection in its WHERE clause,
query rewrite is dependent on the actual value of the user bind variable, user_id,
to compute the containment. Because user_id is not available during query
rewrite time and query rewrite is dependent on the bind value of user_id, this
query cannot be rewritten.
Even though the preceding example has a user bind variable in the WHERE clause,
the same is true regardless of where the user bind variable appears in the query. In
other words, irrespective of where a user bind variable appears in a query, if query
rewrite is dependent on its value, then the query cannot be rewritten.
Now consider the following query which has a user bind variable, :user_id, in its
SELECT list:
SELECT CUST_ID + :user_id, PROD_ID, SUM(AMOUNT_SOLD) AS TOTAL_AMOUNT
FROM SALES WHERE CUST_ID >= 2000
GROUP BY CUST_ID, PROD_ID;
Because the value of the user bind variable, user_id, is not required during query
rewrite time, the preceding query will rewrite.
If you have the following query, which displays the postal codes for male customers
from San Francisco or Los Angeles:
SELECT c.cust_city, c.cust_postal_code
FROM customers c
WHERE c.cust_city = 'Los Angeles' AND c.cust_gender = 'M'
UNION ALL
SELECT c.cust_city, c.cust_postal_code
FROM customers c
WHERE c.cust_city = 'San Francisco' AND c.cust_gender = 'M';
The rewritten query has dropped the UNION ALL and replaced it with the
materialized view. Normally, query rewrite has to use the existing set of general
eligibility rules to determine if the SELECT subselects under the UNION ALL are
equivalent in the query and the materialized view.
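Assuming the materialized view cust_male_postal_mv is defined as that same two-branch UNION ALL, a sketch of the rewritten query would be simply:
SELECT mv.cust_city, mv.cust_postal_code
FROM cust_male_postal_mv mv;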
If, for example, you have a query that retrieves the postal codes for male customers
from San Francisco, Palmdale, or Los Angeles, the same rewrite can occur as in the
previous example but query rewrite must keep the UNION ALL with the base tables,
as in the following:
SELECT c.cust_city, c.cust_postal_code
FROM customers c
WHERE c.cust_city= 'Palmdale' AND c.cust_gender ='M'
UNION ALL
SELECT c.cust_city, c.cust_postal_code
FROM customers c
WHERE c.cust_city = 'Los Angeles' AND c.cust_gender = 'M'
UNION ALL
SELECT c.cust_city, c.cust_postal_code
FROM customers c
WHERE c.cust_city = 'San Francisco' AND c.cust_gender = 'M';
So query rewrite will detect the case where a subset of the UNION ALL can be
rewritten using the materialized view cust_male_postal_mv.
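For example, assuming cust_male_postal_mv covers the Los Angeles and San Francisco branches, the partially rewritten statement might look like the following sketch:
SELECT mv.cust_city, mv.cust_postal_code
FROM cust_male_postal_mv mv
UNION ALL
SELECT c.cust_city, c.cust_postal_code
FROM customers c
WHERE c.cust_city = 'Palmdale' AND c.cust_gender = 'M';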
UNION, UNION ALL, and INTERSECT are commutative, so query rewrite can rewrite
regardless of the order the subselects are found in the query or materialized view.
However, MINUS is not commutative. A MINUS B is not equivalent to B MINUS A.
Therefore, the subselects under the MINUS operator must appear in the same order
in the query and the materialized view for rewrite to occur.
As an example, consider the case where there exists an old version of the customers
table called customers_old and you want to find the difference between the old
one and the current customers table, but only for male customers who live in Los Angeles.
That is, you want to find those customers in the current one that were not in the old
one. The following example shows how this is done using a MINUS:
SELECT c.cust_city, c.cust_postal_code
FROM customers c
WHERE c.cust_city= 'Los Angeles' AND c.cust_gender = 'M'
MINUS
SELECT c.cust_city, c.cust_postal_code
FROM customers_old c
WHERE c.cust_city = 'Los Angeles' AND c.cust_gender = 'M';
Switching the subselects would yield a different answer. This illustrates that MINUS
is not commutative.
The WHERE clause of the first subselect includes mv.marker = 2 and mv.cust_
gender = 'M', which selects only the rows that represent male customers in the
second subselect of the UNION ALL. The WHERE clause of the second subselect
includes mv.marker = 1 and mv.cust_gender = 'F', which selects only those
rows that represent female customers in the first subselect of the UNION ALL. Note
that query rewrite cannot take advantage of set operators that drop duplicate or
distinct rows. For example, UNION drops duplicates so query rewrite cannot tell
what rows have been dropped.
The rules for using a marker are that it must:
■ Be a constant number or string and be the same datatype for all UNION ALL
subselects.
■ Yield a constant, distinct value for each UNION ALL subselect. You cannot reuse
the same value in more than one subselect.
■ Be in the same ordinal position for all subselects.
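For illustration, a minimal sketch of a UNION ALL materialized view that follows these rules (the view name cust_postal_marker_mv and the marker values 1 and 2 are illustrative):
CREATE MATERIALIZED VIEW cust_postal_marker_mv
ENABLE QUERY REWRITE AS
SELECT 1 AS marker, c.cust_gender, c.cust_city, c.cust_postal_code
FROM customers c WHERE c.cust_gender = 'F'
UNION ALL
SELECT 2 AS marker, c.cust_gender, c.cust_city, c.cust_postal_code
FROM customers c WHERE c.cust_gender = 'M';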
Explain Plan
The EXPLAIN PLAN facility is used as described in Oracle Database SQL Reference.
For query rewrite, all you need to check is that the object_name column in PLAN_
TABLE contains the materialized view name. If it does, then query rewrite has
occurred when this query is executed. An example is the following, which creates
the materialized view cal_month_sales_mv:
CREATE MATERIALIZED VIEW cal_month_sales_mv
ENABLE QUERY REWRITE AS
SELECT t.calendar_month_desc, SUM(s.amount_sold) AS dollars
FROM sales s, times t WHERE s.time_id = t.time_id
GROUP BY t.calendar_month_desc;
If EXPLAIN PLAN is used on the following SQL statement, the results are placed in
the default table PLAN_TABLE. However, PLAN_TABLE must first be created using
the utlxplan.sql script. Note that EXPLAIN PLAN does not actually execute the
query.
EXPLAIN PLAN FOR
SELECT t.calendar_month_desc, SUM(s.amount_sold)
FROM sales s, times t WHERE s.time_id = t.time_id
GROUP BY t.calendar_month_desc;
For the purposes of query rewrite, the only information of interest from PLAN_
TABLE is the OBJECT_NAME, which identifies the objects that will be used to
execute this query. Therefore, you would expect to see the object name
cal_month_sales_mv in the output as illustrated in the following:
SELECT OPERATION, OBJECT_NAME FROM PLAN_TABLE;
OPERATION OBJECT_NAME
-------------------- -----------
SELECT STATEMENT
MAT_VIEW REWRITE ACCESS CAL_MONTH_SALES_MV
DBMS_MVIEW.EXPLAIN_REWRITE Procedure
It can be difficult to understand why a query did not rewrite. The rules governing
query rewrite eligibility are quite complex, involving various factors such as
constraints, dimensions, query rewrite integrity modes, freshness of the
materialized views, and the types of queries themselves. In addition, you may want
to know why query rewrite chose a particular materialized view instead of another.
To help with this matter, Oracle provides the DBMS_MVIEW.EXPLAIN_REWRITE
procedure to advise you when a query can be rewritten and, if not, why not. Using
the results from DBMS_MVIEW.EXPLAIN_REWRITE, you can take the appropriate
action needed to make a query rewrite if at all possible.
Note that the query specified in the EXPLAIN_REWRITE statement is never actually
executed.
DBMS_MVIEW.EXPLAIN_REWRITE Syntax
You can obtain the output from DBMS_MVIEW.EXPLAIN_REWRITE in two ways.
The first is to use a table, while the second is to create a varray. The following shows
the basic syntax for using an output table:
DBMS_MVIEW.EXPLAIN_REWRITE (
query VARCHAR2,
mv VARCHAR2(30),
statement_id VARCHAR2(30));
The query parameter is a text string representing the SQL query. The parameter,
mv, is a fully qualified materialized view name in the form of schema.mv. This is
an optional parameter. When it is not specified, EXPLAIN_REWRITE returns any
relevant messages regarding all the materialized views considered for rewriting the
given query. When schema is omitted and only mv is specified, EXPLAIN_REWRITE
looks for the materialized view in the current schema.
Therefore, the syntax for calling the EXPLAIN_REWRITE procedure with an output
table is as follows:
DBMS_MVIEW.EXPLAIN_REWRITE (
query [VARCHAR2 | CLOB],
mv VARCHAR2(30),
statement_id VARCHAR2(30));
Note that if the query is less than 256 characters long, EXPLAIN_REWRITE can be
easily invoked with the EXECUTE command from SQL*Plus. Otherwise, the
recommended method is to use a PL/SQL BEGIN... END block, as shown in the
examples in /rdbms/demo/smxrw*.
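For instance, a short query can be passed in directly (the query text, materialized view name, and statement ID here are only illustrative):
EXECUTE DBMS_MVIEW.EXPLAIN_REWRITE('SELECT SUM(amount_sold) FROM sh.sales', 'SH.SUM_SALES_MV', 'ID1');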
Using REWRITE_TABLE
Output of EXPLAIN_REWRITE can be directed to a table named REWRITE_TABLE.
You can create this output table by running the utlxrw.sql script. This script can
be found in the admin directory. The format of REWRITE_TABLE is as follows.
CREATE TABLE REWRITE_TABLE(
statement_id VARCHAR2(30), -- ID for the query
mv_owner VARCHAR2(30), -- MV's schema
mv_name VARCHAR2(30), -- Name of the MV
sequence INTEGER, -- Seq # of error msg
query VARCHAR2(2000), -- user query
message VARCHAR2(512), -- EXPLAIN_REWRITE error msg
pass VARCHAR2(3), -- Query Rewrite pass no
mv_in_msg VARCHAR2(30), -- MV in current message
measure_in_msg VARCHAR2(30), -- Measure in current message
join_back_tbl VARCHAR2(30), -- Join back table in current msg
join_back_col VARCHAR2(30), -- Join back column in current msg
original_cost NUMBER(10), -- Cost of original query
rewritten_cost NUMBER(10), -- Cost of rewritten query
flags NUMBER, -- Associated flags
reserved1 NUMBER, -- For future use
reserved2 VARCHAR2(10)); -- For future use
The following is another example where you can see a more detailed explanation of
why some materialized views were not considered and eventually the materialized
view sales_mv was chosen as the best one.
DECLARE
qrytext VARCHAR2(500) :='SELECT cust_first_name, cust_last_name,
SUM(amount_sold) AS dollar_sales FROM sales s, customers c WHERE s.cust_id=
c.cust_id GROUP BY cust_first_name, cust_last_name';
idno VARCHAR2(30) :='ID1';
BEGIN
DBMS_MVIEW.EXPLAIN_REWRITE(qrytext, '', idno);
END;
/
SELECT message FROM rewrite_table ORDER BY sequence;
MESSAGE
--------------------------------------------------------------------------------
QSM-01082: Joining materialized view, CAL_MONTH_SALES_MV, with table, SALES, not possible
QSM-01022: a more optimal materialized view than PRODUCT_SALES_MV was used to rewrite
QSM-01022: a more optimal materialized view than FWEEK_PSCAT_SALES_MV was used to rewrite
QSM-01033: query rewritten with materialized view, SALES_MV
Using a Varray
You can save the output of EXPLAIN_REWRITE in a PL/SQL varray. The elements
of this array are of the type RewriteMessage, which is predefined in the SYS
schema as shown in the following:
TYPE RewriteMessage IS OBJECT(
mv_owner VARCHAR2(30), -- MV's schema
mv_name VARCHAR2(30), -- Name of the MV
sequence INTEGER, -- Seq # of error msg
query_text VARCHAR2(2000),-- user query
message VARCHAR2(512), -- EXPLAIN_REWRITE error msg
pass VARCHAR2(3), -- Query Rewrite pass no
mv_in_msg VARCHAR2(30), -- MV in current message
measure_in_msg VARCHAR2(30), -- Measure in current message
join_back_tbl VARCHAR2(30), -- Join back table in current msg
join_back_col VARCHAR2(30), -- Join back column in current msg
original_cost NUMBER(10), -- Cost of original query
rewritten_cost NUMBER(10), -- Cost of rewritten query
flags NUMBER, -- Associated flags
reserved1 NUMBER, -- For future use
reserved2 VARCHAR2(10) -- For future use
);
■ The mv_name field defines the name of a materialized view that is relevant to
the message.
■ The sequence field defines the sequence in which messages should be
ordered.
■ The query_text field contains the first 2000 characters of the query text under
analysis.
■ The message field contains the text of the message relevant to the rewrite
processing of the query.
■ The flags, reserved1, and reserved2 fields are reserved for future use.
The query will not rewrite with this materialized view. This can be quite confusing
to a novice user as it seems like all information required for rewrite is present in the
materialized view. You can find out from DBMS_MVIEW.EXPLAIN_REWRITE that
AVG cannot be computed from the given materialized view. The problem is that a
ROLLUP is required here and AVG requires a COUNT or a SUM to do ROLLUP.
An example PL/SQL block for the previous query, using a varray as its output, is as
follows:
SET SERVEROUTPUT ON
DECLARE
Rewrite_Array SYS.RewriteArrayType := SYS.RewriteArrayType();
querytxt VARCHAR2(1500) := 'SELECT c.cust_state_province,
AVG(s.amount_sold)
FROM sales s, customers c WHERE s.cust_id = c.cust_id
GROUP BY c.cust_state_province';
i NUMBER;
BEGIN
DBMS_MVIEW.EXPLAIN_REWRITE(querytxt, 'AVG_SALES_CITY_STATE_MV',
Rewrite_Array);
FOR i IN 1..Rewrite_Array.count
LOOP
DBMS_OUTPUT.PUT_LINE(Rewrite_Array(i).message);
END LOOP;
END;
/
The second argument, mv, and the third argument, statement_id, can be NULL.
Similarly, the syntax for calling EXPLAIN_REWRITE with a CLOB query and
obtaining the output in a varray is as follows:
DBMS_MVIEW.EXPLAIN_REWRITE(
query IN CLOB,
mv IN VARCHAR2,
msg_array IN OUT SYS.RewriteArrayType);
As before, the second argument, mv, can be NULL. Note that long query texts in
CLOB can be generated using the procedures provided in the DBMS_LOB package.
You should avoid using the ON DELETE clause as it can lead to unexpected results.
When several materialized views are candidates, query rewrite makes a
cost-based choice among them. Materialized views should thus have statistics
collected using the DBMS_STATS package.
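For example, a minimal call to gather statistics on a materialized view (the schema and view names here are illustrative):
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => 'SH', tabname => 'SALES_MV');
END;
/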
OLAP Server would want to materialize this query for quick results. Unfortunately,
the resulting materialized view occupies too much disk space. However, if you have
a dimension rolling up city to state to region, you can easily compress the three
grouping columns into one column using a DECODE expression (this is also known
as an embedded total):
DECODE (gid, 0, city, 1, state, 3, region, 7, 'grand_total')
This uses the lowest non-null level of the hierarchy to represent the entire
grouping. For example, Boston stands for Boston, MA, New England Region, and
CA stands for CA, Western Region. OLAP Server stores these
embedded total results into a table, say, embedded_total_sales.
However, when returning the result back to the user, you would want to have all
the data columns (city, state, region). In order to return the results efficiently and
quickly, OLAP Server may use a custom table function (et_function) to retrieve
the data back from the embedded_total_sales table in the expanded form as
follows:
SELECT * FROM TABLE (et_function);
In other words, this feature allows OLAP Server to declare the equivalence of the
user's preceding query to the alternative query OLAP Server uses to compute it, as
in the following:
DBMS_ADVANCED_REWRITE.DECLARE_REWRITE_EQUIVALENCE (
'OLAPI_EMBEDDED_TOTAL',
'SELECT g.region, g.state, g.city,
GROUPING_ID(g.city, g.state, g.region), SUM(sales)
FROM sales_fact f, geog_dim g
WHERE f.geog_key = g.geog_key
GROUP BY ROLLUP(g.region, g.state, g.city)',
'SELECT * FROM TABLE(et_function)');
By specifying this equivalence, Oracle would use the more efficient second form of
the query to compute the ROLLUP query asked by the user.
DBMS_ADVANCED_REWRITE.DECLARE_REWRITE_EQUIVALENCE (
'OLAPI_ROLLUP',
'SELECT g.region, g.state, g.city,
GROUPING_ID(g.city, g.state, g.region), SUM(sales)
FROM sales_fact f, geog_dim g
WHERE f.geog_key = g.geog_key
GROUP BY ROLLUP(g.region, g.state, g.city)',
' SELECT * FROM T1
UNION ALL
SELECT region, state, NULL, 1 as gid, sales FROM T2
UNION ALL
Instead of asking the user to write SQL that does the extra computation, OLAP
Server does it for them by using this feature. In this example, Seasonal_Agg is
computed using the spreadsheet functionality (see Chapter 22, "SQL for Modeling").
Note that even though Seasonal_Agg is a user-defined aggregate, the required
behavior is to add extra rows to the query's answer, which cannot be easily done
with simple PL/SQL functions.
DBMS_ADVANCED_REWRITE.DECLARE_REWRITE_EQUIVALENCE (
'OLAPI_SEASONAL_AGG',
'SELECT g.region, t.monthname, Seasonal_Agg(sales, region) AS sales
FROM sales_fact f, geog_dim g, time t
WHERE f.geog_key = g.geog_key and f.time_key = t.time_key
GROUP BY g.region, t.monthname',
'SELECT g.region, t.monthname, SUM(sales) AS sales
FROM sales_fact f, geog_dim g, time t
WHERE f.geog_key = g.geog_key and t.time_key = f.time_key
GROUP BY g.region, t.monthname
DIMENSION BY g.region, t.monthname
(sales ['New England', 'Winter'] = AVG(sales) OVER monthname IN
('Dec', 'Jan', 'Feb', 'Mar'),
sales ['Western', 'Summer' ] = AVG(sales) OVER monthname IN
('May', 'Jun', 'July', 'Aug'), ...)');
(Figure: a normalized schema with customers, orders, order items, and products tables.)
Star Schemas
The star schema is perhaps the simplest data warehouse schema. It is called a star
schema because the entity-relationship diagram of this schema resembles a star,
with points radiating from a central table. The center of the star consists of a large
fact table and the points of the star are the dimension tables.
A star query is a join between a fact table and a number of dimension tables. Each
dimension table is joined to the fact table using a primary key to foreign key join,
but the dimension tables are not joined to each other. The optimizer recognizes star
queries and generates efficient execution plans for them.
A typical fact table contains keys and measures. For example, in the sh sample
schema, the fact table, sales, contains the measures quantity_sold, amount, and
cost, and the keys cust_id, time_id, prod_id, channel_id, and promo_id.
The dimension tables are customers, times, products, channels, and
promotions. The products dimension table, for example, contains information
about each product number that appears in the fact table.
A star join is a primary key to foreign key join of the dimension tables to a fact table.
The main advantages of star schemas are that they:
■ Provide a direct and intuitive mapping between the business entities being
analyzed by end users and the schema design.
■ Provide highly optimized performance for typical star queries.
■ Are widely supported by a large number of business intelligence tools, which
may anticipate or even require that the data warehouse schema contain
dimension tables.
Star schemas are used for both simple data marts and very large data warehouses.
Figure 19–2 presents a graphical representation of a star schema.
(Figure 19–2: a star schema with the fact table sales (amount_sold, quantity_sold) at the center and the dimension tables products, times, customers, and channels at the points.)
Snowflake Schemas
The snowflake schema is a more complex data warehouse model than a star
schema, and is a type of star schema. It is called a snowflake schema because the
diagram of the schema resembles a snowflake.
Snowflake schemas normalize dimensions to eliminate redundancy. That is, the
dimension data has been grouped into multiple tables instead of one large table. For
example, a product dimension table in a star schema might be normalized into a
products table, a product_category table, and a product_manufacturer table
in a snowflake schema. While this saves space, it increases the number of dimension
tables and requires more foreign key joins. The result is more complex queries and
reduced query performance. Figure 19–3 presents a graphical representation of a
snowflake schema.
(Figure 19–3: a snowflake schema in which the sales fact table (amount_sold, quantity_sold) joins to products, times, customers, and channels, with products further normalized to suppliers and customers further normalized to countries.)
Note: Bitmap indexes are available only if you have purchased the
Oracle Database Enterprise Edition. In Oracle Database Standard
Edition, bitmap indexes and star transformation are not available.
For example, the sales table of the sh sample schema has bitmap indexes on the
time_id, channel_id, cust_id, prod_id, and promo_id columns.
Consider the following star query:
SELECT ch.channel_class, c.cust_city, t.calendar_quarter_desc,
SUM(s.amount_sold) sales_amount
FROM sales s, times t, customers c, channels ch
WHERE s.time_id = t.time_id
AND s.cust_id = c.cust_id
AND s.channel_id = ch.channel_id
AND c.cust_state_province = 'CA'
AND ch.channel_desc in ('Internet','Catalog')
AND t.calendar_quarter_desc IN ('1999-Q1','1999-Q2')
GROUP BY ch.channel_class, c.cust_city, t.calendar_quarter_desc;
This query is processed in two phases. In the first phase, Oracle Database uses the
bitmap indexes on the foreign key columns of the fact table to identify and retrieve
only the necessary rows from the fact table. That is, Oracle Database will retrieve
the result set from the fact table using essentially the following query:
SELECT ... FROM sales
WHERE time_id IN
(SELECT time_id FROM times
WHERE calendar_quarter_desc IN('1999-Q1','1999-Q2'))
AND cust_id IN
(SELECT cust_id FROM customers WHERE cust_state_province='CA')
AND channel_id IN
(SELECT channel_id FROM channels WHERE channel_desc
IN('Internet','Catalog'));
This is the transformation step of the algorithm, because the original star query has
been transformed into this subquery representation. This method of accessing the
fact table leverages the strengths of bitmap indexes. Intuitively, bitmap indexes
provide a set-based processing scheme within a relational database. Oracle has
implemented very fast methods for doing set operations such as AND (an
intersection in standard set-based terminology), OR (a set-based union), MINUS, and
COUNT.
In this star query, a bitmap index on time_id is used to identify the set of all rows
in the fact table corresponding to sales in 1999-Q1. This set is represented as a
bitmap (a string of 1's and 0's that indicates which rows of the fact table are
members of the set).
A similar bitmap is retrieved for the fact table rows corresponding to sales from
1999-Q2. The bitmap OR operation is used to combine this set of Q1 sales with the
set of Q2 sales.
Additional set operations will be done for the customer dimension and the
channel dimension. At this point in the star query processing, there are three
bitmaps. Each bitmap corresponds to a separate dimension table, and each bitmap
represents the set of rows of the fact table that satisfy that individual dimension's
constraints.
These three bitmaps are combined into a single bitmap using the bitmap AND
operation. This final bitmap represents the set of rows in the fact table that satisfy
all of the constraints on the dimension table. This is the result set, the exact set of
rows from the fact table needed to evaluate the query. Note that none of the actual
data in the fact table has been accessed. All of these operations rely solely on the
bitmap indexes and the dimension tables. Because of the bitmap indexes'
compressed data representations, the bitmap set-based operations are extremely
efficient.
Once the result set is identified, the bitmap is used to access the actual data from the
sales table. Only those rows that are required for the end user's query are retrieved
from the fact table. At this point, Oracle has effectively joined all of the dimension
tables to the fact table using bitmap indexes. This technique provides excellent
performance because Oracle is joining all of the dimension tables to the fact table
with one logical join operation, rather than joining each dimension table to the fact
table independently.
The second phase of this query is to join these rows from the fact table (the result
set) to the dimension tables. Oracle will use the most efficient method for accessing
and joining the dimension tables. Many dimension tables are very small, and table scans
are typically the most efficient access method for these dimension tables. For large
dimension tables, table scans may not be the most efficient access method. For
example, a bitmap index on product.department can be used to
quickly identify all of those products in the grocery department. Oracle's optimizer
automatically determines which access method is most appropriate for a given
dimension table, based upon the optimizer's knowledge about the sizes and data
distributions of each dimension table.
The specific join method (as well as indexing method) for each dimension table will
likewise be intelligently determined by the optimizer. A hash join is often the most
efficient algorithm for joining the dimension tables. The final answer is returned to
the user once all of the dimension tables have been joined. The query technique of
retrieving only the matching rows from one table and then joining to another table
is commonly known as a semijoin.
In this plan, the fact table is accessed through a bitmap access path based on a
bitmap AND, of three merged bitmaps. The three bitmaps are generated by the
BITMAP MERGE row source being fed bitmaps from row source trees underneath it.
Each such row source tree consists of a BITMAP KEY ITERATION row source which
fetches values from the subquery row source tree, which in this example is a full
table access. For each such value, the BITMAP KEY ITERATION row source retrieves
the bitmap from the bitmap index. After the relevant fact table rows have been
retrieved using this access path, they are joined with the dimension tables and
temporary tables to produce the answer to the query.
The processing of the same star query using the bitmap join index is similar to the
previous example. The only difference is that Oracle will utilize the join index,
instead of a single-table bitmap index, to access the customer data in the first phase
of the star query.
The difference between this plan as compared to the previous one is that the inner
part of the bitmap index scan for the customer dimension has no subselect. This is
because the join predicate information on customer.cust_state_province
can be satisfied with the bitmap join index sales_c_state_bjix.
■ Tables that are really unmerged views, which are not view partitions
The star transformation may not be chosen by the optimizer for the following cases:
■ Tables that have a good single-table access path
■ Tables that are too small for the transformation to be worthwhile
In addition, temporary tables will not be used by star transformation under the
following conditions:
■ The database is in read-only mode
■ The star query is part of a transaction that is in serializable mode
■ Show total sales across all products at increasing aggregation levels for a
geography dimension, from state to country to region, for 1999 and 2000.
■ Create a cross-tabular analysis of our operations showing expenses by territory
in South America for 1999 and 2000. Include all possible subtotals.
■ List the top 10 sales representatives in Asia according to 2000 sales revenue for
automotive products, and rank their commissions.
All these requests involve multiple dimensions. Many multidimensional questions
require aggregated data and comparisons of data sets, often across time, geography
or budgets.
To visualize data that has many dimensions, analysts commonly use the analogy of
a data cube, that is, a space where facts are stored at the intersection of n
dimensions. Figure 20–1 shows a data cube and how it can be used differently by
various groups. The cube stores sales data organized by the dimensions of product,
market, sales, and time. Note that this is only a metaphor: the actual data is
physically stored in normal tables. The cube data consists of both detail and
aggregated data.
You can retrieve slices of data from the cube. These correspond to cross-tabular
reports such as the one shown in Table 20–1. Regional managers might study the
data by comparing slices of the cube applicable to different markets. In contrast,
product managers might compare slices that apply to different products. An ad hoc
user might work with a wide variety of constraints, working in a subset cube.
Answering multidimensional questions often involves accessing and querying huge
quantities of data, sometimes in millions of rows. Because the flood of detailed data
generated by large organizations cannot be interpreted at the lowest level,
aggregated views of the information are essential. Aggregations, such as sums and
counts, across many dimensions are vital to multidimensional analyses. Therefore,
analytical tasks require convenient and efficient data aggregation.
Optimized Performance
Not only multidimensional issues, but all types of processing can benefit from
enhanced aggregation facilities. Transaction processing, financial and
manufacturing systems—all of these generate large numbers of production reports
needing substantial system resources. Improved efficiency when creating these
reports will reduce system load. In fact, any computer process that aggregates data
from details to higher levels will benefit from optimized aggregation performance.
These extensions provide aggregation features and bring many benefits, including:
■ Simplified programming requiring less SQL code for many tasks.
■ Quicker and more efficient query processing.
■ Reduced client processing loads and network traffic because aggregation work
is shifted to servers.
■ Opportunities for caching aggregations because similar queries can leverage
existing work.
An Aggregate Scenario
To illustrate the use of the GROUP BY extension, this chapter uses the sh data of the
sample schema. All the examples refer to data from this scenario. The hypothetical
company has sales across the world and tracks sales by both dollars and
quantities. Because there are many rows of data, the queries shown here typically
have tight constraints on their WHERE clauses to limit the results to a small number
of rows.
Consider that even a simple report such as this, with just nine values in its grid,
generates four subtotals and a grand total. Half of the values needed for this report
would not be calculated with a query that requested SUM(amount_sold) and did
a GROUP BY(channel_desc, country_id). To get the higher-level aggregates
would require additional queries. Database commands that offer improved
calculation of subtotals bring major benefits to querying, reporting, and analytical
operations.
SELECT channels.channel_desc, countries.country_iso_code,
TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$
FROM sales, customers, times, channels, countries
WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND
sales.channel_id= channels.channel_id AND channels.channel_desc IN
('Direct Sales', 'Internet') AND times.calendar_month_desc='2000-09'
AND customers.country_id=countries.country_id
AND countries.country_iso_code IN ('US','FR')
GROUP BY CUBE(channels.channel_desc, countries.country_iso_code);
CHANNEL_DESC CO SALES$
-------------------- -- --------------
833,224
FR 70,799
US 762,425
Internet 133,821
Internet FR 9,597
Internet US 124,224
Direct Sales 699,403
Direct Sales FR 61,202
Direct Sales US 638,201
ROLLUP Syntax
ROLLUP appears in the GROUP BY clause in a SELECT statement. Its form is:
SELECT … GROUP BY ROLLUP(grouping_column_reference_list)
Partial Rollup
You can also roll up so that only some of the sub-totals will be included. This partial
rollup uses the following syntax:
GROUP BY expr1, ROLLUP(expr2, expr3);
In this case, the GROUP BY clause creates subtotals at (2+1=3) aggregation levels.
That is, at level (expr1, expr2, expr3), (expr1, expr2), and (expr1).
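For instance, the following sketch against the sh data (column choices are illustrative) creates subtotals at (channel_desc, calendar_year, calendar_quarter_desc), (channel_desc, calendar_year), and (channel_desc):
SELECT ch.channel_desc, t.calendar_year, t.calendar_quarter_desc,
  SUM(s.amount_sold) AS sales$
FROM sales s, times t, channels ch
WHERE s.time_id = t.time_id AND s.channel_id = ch.channel_id
GROUP BY ch.channel_desc, ROLLUP(t.calendar_year, t.calendar_quarter_desc);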
CUBE Syntax
CUBE appears in the GROUP BY clause in a SELECT statement. Its form is:
SELECT … GROUP BY CUBE (grouping_column_reference_list)
Partial CUBE
Partial CUBE resembles partial ROLLUP in that you can limit it to certain dimensions
and precede it with columns outside the CUBE operator. In this case, subtotals of all
possible combinations are limited to the dimensions within the cube list (in
parentheses), and they are combined with the preceding items in the GROUP BY list.
The syntax for partial CUBE is as follows:
GROUP BY expr1, CUBE(expr2, expr3)
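For instance, a sketch of a partial CUBE against the sh data (column choices are illustrative):
SELECT ch.channel_desc, t.calendar_year, co.country_iso_code,
  SUM(s.amount_sold) AS sales$
FROM sales s, times t, channels ch, customers c, countries co
WHERE s.time_id = t.time_id AND s.channel_id = ch.channel_id
  AND s.cust_id = c.cust_id AND c.country_id = co.country_id
GROUP BY ch.channel_desc, CUBE(t.calendar_year, co.country_iso_code);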
GROUPING Functions
Two challenges arise with the use of ROLLUP and CUBE. First, how can you
programmatically determine which result set rows are subtotals, and how do you
find the exact level of aggregation for a given subtotal? You often need to use
subtotals in calculations such as percent-of-totals, so you need an easy way to
determine which rows are the subtotals. Second, what happens if query results
contain both stored NULL values and "NULL" values created by a ROLLUP or CUBE?
How can you differentiate between the two? See Oracle Database SQL Reference for
syntax and restrictions.
GROUPING Function
GROUPING handles these problems. Using a single column as its argument,
GROUPING returns 1 when it encounters a NULL value created by a ROLLUP or CUBE
operation. That is, if the NULL indicates the row is a subtotal, GROUPING returns a 1.
Any other type of value, including a stored NULL, returns a 0.
GROUPING appears in the selection list portion of a SELECT statement. Its form is:
SELECT … [GROUPING(dimension_column)…] …
GROUP BY … {CUBE | ROLLUP| GROUPING SETS} (dimension_column)
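For instance, a sketch against the sh data with one GROUPING flag per ROLLUP column (the CH, MO, and CO aliases are illustrative):
SELECT channels.channel_desc, times.calendar_month_desc,
  countries.country_iso_code,
  TO_CHAR(SUM(amount_sold), '9,999,999,999') AS SALES$,
  GROUPING(channels.channel_desc) AS CH,
  GROUPING(times.calendar_month_desc) AS MO,
  GROUPING(countries.country_iso_code) AS CO
FROM sales, customers, times, channels, countries
WHERE sales.time_id = times.time_id AND sales.cust_id = customers.cust_id
  AND customers.country_id = countries.country_id
  AND sales.channel_id = channels.channel_id
GROUP BY ROLLUP(channels.channel_desc, times.calendar_month_desc,
  countries.country_iso_code);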
A program can easily identify the detail rows by a mask of "0 0 0" on the T, R, and D
columns. The first level subtotal rows have a mask of "0 0 1", the second level
subtotal rows have a mask of "0 1 1", and the overall total row has a mask of "1 1 1".
You can improve the readability of result sets by using the GROUPING and DECODE
functions as shown in Example 20–7.
To understand the previous statement, note its first column specification, which
handles the channel_desc column. Consider the first line of the previous statement:
SELECT DECODE(GROUPING(channel_desc), 1, 'All Channels', channel_desc)AS Channel
CHANNEL_DESC C CO SALES$ CH MO CO
-------------------- - -- -------------- ---------- ---------- ----------
GB 2,910,870 1 1 0
US 2,910,870 1 1 0
Direct Sales 4,886,784 0 1 1
Internet 934,955 0 1 1
5,821,739 1 1 1
Compare the result set of Example 20–8 with that in Example 20–3 on page 20-8 to
see how Example 20–8 is a precisely specified group: it contains only the yearly
totals, regional totals aggregated over time and department, and the grand total.
GROUPING_ID Function
To find the GROUP BY level of a particular row, a query must return GROUPING
function information for each of the GROUP BY columns. If we do this using the
GROUPING function, every GROUP BY column requires another column using the
GROUPING function. For instance, a four-column GROUP BY clause needs to be
analyzed with four GROUPING functions. This is inconvenient to write in SQL and
increases the number of columns required in the query. When you want to store the
query result sets in tables, as with materialized views, the extra columns waste
storage space.
To address these problems, you can use the GROUPING_ID function. GROUPING_
ID returns a single number that enables you to determine the exact GROUP BY level.
For each row, GROUPING_ID takes the set of 1's and 0's that would be generated if
you used the appropriate GROUPING functions and concatenates them, forming a
bit vector. The bit vector is treated as a binary number, and the number's base-10
value is returned by the GROUPING_ID function. For instance, if you group with the
expression CUBE(a, b) the possible values are as shown in Table 20–2.
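For reference, the mapping for CUBE(a, b) (a reconstruction of what Table 20–2 describes) is:
Aggregation Level   Bit Vector   GROUPING_ID
a, b                0 0          0
a                   0 1          1
b                   1 0          2
Grand Total         1 1          3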
GROUP_ID Function
While the extensions to GROUP BY offer power and flexibility, they also allow
complex result sets that can include duplicate groupings. The GROUP_ID function
lets you distinguish among duplicate groupings. If there are multiple sets of rows
calculated for a given level, GROUP_ID assigns the value of 0 to all the rows in the
first set. All other sets of duplicate rows for a particular grouping are assigned
higher values, starting with 1. For example, consider the following query, which
generates a duplicate grouping:
ROLLUP(country_iso_code, cust_state_province));
This statement computes all the 8 (2 *2 *2) groupings, though only the previous 3
groups are of interest to you.
Another alternative is the following statement, which is lengthy due to several
unions. This statement requires three scans of the base table, making it inefficient.
CUBE and ROLLUP can be thought of as grouping sets with very specific semantics.
For example, consider the following statement:
CUBE(a, b, c)
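This is equivalent to enumerating all 2*2*2 = 8 groupings with GROUPING SETS:
GROUPING SETS ((a, b, c), (a, b), (a, c), (b, c), (a), (b), (c), ())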
In the absence of an optimizer that looks across query blocks to generate the
execution plan, a query based on UNION would need multiple scans of the base
table, sales. This could be very inefficient as fact tables will normally be huge. Using
GROUPING SETS statements, all the groupings of interest are available in the same
query block.
Composite Columns
A composite column is a collection of columns that are treated as a unit during the
computation of groupings. You specify the columns in parentheses as in the
following statement:
ROLLUP (year, (quarter, month), day)
In this statement, the data is never aggregated at the (year, quarter) level; the
statement is instead equivalent to the following groupings of a UNION ALL:
■ (year, quarter, month, day),
■ (year, quarter, month),
■ (year)
■ ()
Here, (quarter, month) form a composite column and are treated as a unit. In
general, composite columns are useful in ROLLUP, CUBE, GROUPING SETS, and
concatenated groupings. For example, in CUBE or ROLLUP, composite columns
would mean skipping aggregation across certain levels. That is, the following
statement:
GROUP BY ROLLUP(a, (b, c))
Here, (b, c) are treated as a unit and rollup will not be applied across (b, c). It is
as if you have an alias, for example z, for (b, c) and the GROUP BY expression
reduces to GROUP BY ROLLUP(a, z). Compare this with the normal rollup as in the
following:
GROUP BY ROLLUP(a, b, c)
Concatenated Groupings
Concatenated groupings offer a concise way to generate useful combinations of
groupings. Groupings specified with concatenated groupings yield the
cross-product of groupings from each grouping set. The cross-product operation
enables even a small number of concatenated groupings to generate a large number
of final groups. The concatenated groupings are specified simply by listing multiple
grouping sets, cubes, and rollups, and separating them with commas. Here is an
example of concatenated grouping sets:
GROUP BY GROUPING SETS(a, b), GROUPING SETS(c, d)
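The cross-product of these two grouping sets yields the following four groupings:
(a, c), (a, d), (b, c), (b, d)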
The ROLLUPs in the GROUP BY specification generate the following groups, four for
each dimension.
The concatenated grouping sets specified in the previous SQL will take the ROLLUP
aggregations listed in the table and perform a cross-product on them. The
cross-product will create the 96 (4x4x6) aggregate groups needed for a hierarchical
cube of the data. There are major advantages in using three ROLLUP expressions to
replace what would otherwise require 96 grouping set expressions: the concise SQL
is far less error-prone to develop and far easier to maintain, and it enables much
better query optimization. You can picture how a cube with more dimensions and
more levels would make the use of concatenated groupings even more
advantageous.
See "Working with Hierarchical Cubes in SQL" on page 20-29 for more information
regarding hierarchical cubes.
CHANNEL_DESC CHANNEL_TOTAL
-------------------- -------------
Direct Sales 57875260.6
Note that this example could also be performed efficiently using the reporting
aggregate functions described in Chapter 21, "SQL for Analysis and Reporting".
This concatenated rollup takes the ROLLUP aggregations similar to those listed in
Table 20–4, "Hierarchical CUBE Example" in the prior section and performs a
cross-product on them. The cross-product will create the 16 (4x4) aggregate groups
needed for a hierarchical cube of the data.
The inner hierarchical cube specified defines a simple cube, with two dimensions
and four levels in each dimension. It would generate 16 groups (4 Time levels * 4
Product levels). The GROUPING_ID function in the query identifies the specific
group each row belongs to, based on the aggregation level of the grouping-columns
in its argument.
The outer query applies the constraints needed for our specific query, limiting
Division to a value of 25 and Month to a value of 200201 (representing January 2002
in this case). In conceptual terms, it slices a small chunk of data from the cube. The
outer query's constraint on the GID column, indicated in the query by
gid-for-division-month would be the value of a key indicating that the data is
grouped as a combination of division and month. The GID constraint selects only
those rows that are aggregated at the level of a GROUP BY month, division clause.
Oracle Database removes unneeded aggregation groups from query processing
based on the outer query conditions. The outer conditions of the previous query
limit the result set to a single group aggregating division and month. Any other
groups involving year, month, brand, and item are unnecessary here. The group
pruning optimization recognizes this and transforms the query into:
SELECT month, division, sum_sales
FROM (SELECT null, null, month, division, null, null, SUM(sales) sum_sales,
GROUPING_ID(grouping-columns) gid
FROM sales, products, time WHERE join-condition
GROUP BY month, division)
WHERE division = 25 AND month = 200201 AND gid = gid-for-Division-Month;
The bold items highlight the changed SQL. The inner query now has a simple
GROUP BY clause of month, division. The columns year, quarter, brand and
item have been converted to null to match the simplified GROUP BY clause. Because
the query now requests just one group, fifteen out of sixteen groups are removed
from the processing, greatly reducing the work. For a cube with more dimensions
and more levels, the savings possible through group pruning can be far greater.
Note that the group pruning transformation works with all the GROUP BY
extensions: ROLLUP, CUBE, and GROUPING SETS.
While the optimizer has simplified the previous query to a simple GROUP BY, faster
response times can be achieved if the group is precomputed and stored in a
materialized view. Because OLAP queries can ask for any slice of the cube many
groups may need to be precomputed and stored in a materialized view. This is
discussed in the next section.
A materialized view containing a full hierarchical cube takes longer to build and
occupies far more disk space than one containing only a small set of aggregate
groups. The trade-off in processing time and disk space versus query performance
needs to be considered before deciding to create it.
An additional possibility you could consider is to use data compression to lessen
your disk space requirements.
See Oracle Database SQL Reference for compression syntax and restrictions and
"Storage And Table Compression" on page 8-22 for details regarding compression.
These materialized views can be created as BUILD DEFERRED, and then you can
execute DBMS_MVIEW.REFRESH_DEPENDENT(number_of_failures,
'SALES', 'C' ...) so that the complete refresh of each of the materialized
views defined on the detail table SALES is scheduled in the most efficient order.
Refer to "Scheduling Refresh" on page 15-22.
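A minimal sketch of such a call from PL/SQL (the variable name is illustrative):
DECLARE
  failures BINARY_INTEGER;
BEGIN
  DBMS_MVIEW.REFRESH_DEPENDENT(failures, 'SALES', 'C');
END;
/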
Because each of these materialized views is partitioned on the time level (month,
quarter, or year) present in the SELECT list, PCT is enabled on SALES table for each
one of them, thus providing an opportunity to apply PCT refresh method in
addition to FAST and COMPLETE refresh methods.
The following topics provide information about how to improve analytical SQL
queries in a data warehouse:
■ Overview of SQL for Analysis and Reporting
■ Ranking Functions
■ Windowing Aggregate Functions
■ Reporting Aggregate Functions
■ LAG/LEAD Functions
■ FIRST/LAST Functions
■ Inverse Percentile Functions
■ Hypothetical Rank and Distribution Functions
■ Linear Regression Functions
■ Frequent Itemsets
■ Other Statistical Functions
■ WIDTH_BUCKET Function
■ User-Defined Aggregate Functions
■ CASE Expressions
■ Data Densification for Reporting
■ Time Series Calculations on Densified Data
To perform these operations, the analytic functions add several new elements to
SQL processing. These elements build on existing SQL to allow flexible and
powerful calculation expressions. With just a few exceptions, the analytic functions
have these new elements. The processing flow is represented in Figure 21–1.
A query result set may be partitioned into just one partition holding all the rows,
into a few large partitions, or into many small partitions holding just a few rows each.
■ Window
For each row in a partition, you can define a sliding window of data. This
window determines the range of rows used to perform the calculations for the
current row. Window sizes can be based on either a physical number of rows or
a logical interval such as time. The window has a starting row and an ending
row. Depending on its definition, the window may move at one or both ends.
For instance, a window defined for a cumulative sum function would have its
starting row fixed at the first row of its partition, and its ending row would
slide from the starting point all the way to the last row of the partition. In
contrast, a window defined for a moving average would have both its starting
and end points slide so that they maintain a constant physical or logical range.
A window can be set as large as all the rows in a partition or just a sliding
window of one row within a partition. When a window is near a border, the
function returns results for only the available rows, rather than warning you
that the results are not what you want.
When using window functions, the current row is included during calculations,
so you should only specify (n-1) when you are dealing with n items.
■ Current row
Each calculation performed with an analytic function is based on a current row
within a partition. The current row serves as the reference point determining
the start and end of the window. For instance, a centered moving average
calculation could be defined with a window that holds the current row, the six
preceding rows, and the following six rows. This would create a sliding
window of 13 rows, as shown in Figure 21–2.
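For illustration, a sketch of such a centered 13-day window over daily sales in the sh data (aliases are illustrative):
SELECT t.time_id, SUM(s.amount_sold) AS day_sales,
  AVG(SUM(s.amount_sold)) OVER (ORDER BY t.time_id
    ROWS BETWEEN 6 PRECEDING AND 6 FOLLOWING) AS centered_13_day_avg
FROM sales s, times t
WHERE s.time_id = t.time_id
GROUP BY t.time_id;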
Ranking Functions
A ranking function computes the rank of a record compared to other records in the
data set based on the values of a set of measures. The types of ranking function are:
■ RANK and DENSE_RANK Functions
■ CUME_DIST Function
■ PERCENT_RANK Function
■ NTILE Function
■ ROW_NUMBER Function
The difference between RANK and DENSE_RANK is that DENSE_RANK leaves no gaps
in ranking sequence when there are ties. That is, if you were ranking a competition
using DENSE_RANK and had three people tie for second place, you would say that
all three were in second place and that the next person came in third. The RANK
function would also give three people in second place, but the next person would
be in fifth place.
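For illustration, a minimal sketch that computes both functions side by side over the sh data (aliases are illustrative):
SELECT ch.channel_desc, SUM(s.amount_sold) AS sales$,
  RANK() OVER (ORDER BY SUM(s.amount_sold) DESC) AS rnk,
  DENSE_RANK() OVER (ORDER BY SUM(s.amount_sold) DESC) AS drnk
FROM sales s, channels ch
WHERE s.channel_id = ch.channel_id
GROUP BY ch.channel_desc;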
The following are some relevant points about RANK:
■ Ascending is the default sort order, which you may want to change to
descending.
■ The expressions in the optional PARTITION BY clause divide the query result
set into groups within which the RANK function operates. That is, RANK gets
reset whenever the group changes. In effect, the value expressions of the
PARTITION BY clause define the reset boundaries.
■ If the PARTITION BY clause is missing, then ranks are computed over the entire
query result set.
■ The ORDER BY clause specifies the measures (<value expression>) on which
ranking is done and defines the order in which rows are sorted in each group
(or partition). Once the data is sorted within each partition, ranks are given to
each row starting from 1.
■ The NULLS FIRST | NULLS LAST clause indicates the position of NULLs in the
ordered sequence, either first or last in the sequence. The order of the sequence
would make NULLs compare either high or low with respect to non-NULL
values. If the sequence were in ascending order, then NULLS FIRST implies that
NULLs are smaller than all other non-NULL values and NULLS LAST implies
they are larger than non-NULL values. It is the opposite for descending order.
See the example in "Treatment of NULLs" on page 21-10.
■ If the NULLS FIRST | NULLS LAST clause is omitted, then the ordering of the
null values depends on the ASC or DESC arguments. Null values are considered
larger than any other values. If the ordering sequence is ASC, then nulls will
appear last; nulls will appear first otherwise. Nulls are considered equal to
other nulls and, therefore, the order in which nulls are presented is
non-deterministic.
Ranking Order
The following example shows how the [ASC | DESC] option changes the ranking
order.
While the data in this result is ordered on the measure SALES$, in general, it is not
guaranteed by the RANK function that the data will be sorted on the measures. If
you want the data to be sorted on SALES$ in your result, you must specify it
explicitly with an ORDER BY clause, at the end of the SELECT statement.
The sales_count column breaks the ties for three pairs of values.
Note that, in the case of DENSE_RANK, the largest rank value gives the number of
distinct values in the data set.
A single query block can contain more than one ranking function, each partitioning
the data into different groups (that is, reset on different boundaries). The groups can
be mutually exclusive. The following query ranks products based on their dollar
sales within each month (rank_of_product_per_region) and within each
channel (rank_of_product_total).
Treatment of NULLs
NULLs are treated like normal values. Also, for rank computation, a NULL value is
assumed to be equal to another NULL value. Depending on the ASC | DESC options
provided for measures and the NULLS FIRST | NULLS LAST clause, NULLs will
either sort low or high and hence, are given ranks appropriately. The following
example shows how NULLs are ranked in different cases:
SELECT times.time_id time, sold,
RANK() OVER (ORDER BY (sold) DESC NULLS LAST) AS NLAST_DESC,
22-JAN-99 20 1 1 20
04-JAN-99 20 1 1 20
24-JAN-99 20 1 1 20
15-JAN-99 20 1 1 20
Bottom N Ranking
Bottom N is similar to top N except for the ordering sequence within the rank
expression. Using the previous example, you can order SUM(s_amount) ascending
instead of descending.
CUME_DIST Function
The CUME_DIST function (defined as the inverse of percentile in some statistical
books) computes the position of a specified value relative to a set of values. The
order can be ascending or descending. Ascending is the default. The range of values
for CUME_DIST is from greater than 0 to 1. To compute the CUME_DIST of a value x
in a set S of size N, you use the formula:
CUME_DIST(x) = (number of values in S coming before
and including x in the specified order) / N
The semantics of various options in the CUME_DIST function are similar to those in
the RANK function. The default order is ascending, implying that the lowest value
gets the lowest CUME_DIST (as all other values come later than this value in the
order). NULLs are treated the same as they are in the RANK function. They are
counted toward both the numerator and the denominator as they are treated like
non-NULL values. The following example finds cumulative distribution of sales by
channel within each month:
SELECT calendar_month_desc AS MONTH, channel_desc,
TO_CHAR(SUM(amount_sold) , '9,999,999,999') SALES$,
CUME_DIST() OVER (PARTITION BY calendar_month_desc ORDER BY
SUM(amount_sold) ) AS CUME_DIST_BY_CHANNEL
FROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id AND sales.cust_id=customers.cust_id AND
sales.time_id=times.time_id AND sales.channel_id=channels.channel_id AND
times.calendar_month_desc IN ('2000-09', '2000-07','2000-08')
GROUP BY calendar_month_desc, channel_desc;
PERCENT_RANK Function
PERCENT_RANK is similar to CUME_DIST, but it uses rank values rather than row
counts in its numerator. Therefore, it returns the percent rank of a value relative to a
group of values. The function is available in many popular spreadsheets. PERCENT_
RANK of a row is calculated as:
(rank of row in its partition - 1) / (number of rows in the partition - 1)
PERCENT_RANK returns values in the range zero to one. The row(s) with a rank of 1
will have a PERCENT_RANK of zero. Its syntax is:
PERCENT_RANK () OVER ([query_partition_clause] order_by_clause)
NTILE Function
NTILE allows easy calculation of tertiles, quartiles, deciles and other common
summary statistics. This function divides an ordered partition into a specified
number of groups called buckets and assigns a bucket number to each row in the
partition. NTILE is a very useful calculation because it lets users divide a data set
into fourths, thirds, and other groupings.
The buckets are calculated so that each bucket has exactly the same number of rows
assigned to it or at most 1 row more than the others. For instance, if you have 100
rows in a partition and ask for an NTILE function with four buckets, 25 rows will be
assigned a value of 1, 25 rows will have value 2, and so on. These buckets are
referred to as equiheight buckets.
If the number of rows in the partition does not divide evenly (without a remainder)
into the number of buckets, then the number of rows assigned for each bucket will
differ by one at most. The extra rows will be distributed one for each bucket starting
from the lowest bucket number. For instance, if there are 103 rows in a partition
which has an NTILE(5) function, the first 21 rows will be in the first bucket, the
next 21 in the second bucket, the next 21 in the third bucket, the next 20 in the
fourth bucket and the final 20 in the fifth bucket.
The NTILE function has the following syntax:
NTILE (expr) OVER ([query_partition_clause] order_by_clause)
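For illustration, a sketch that assigns each customer's total sales to a quartile (aliases are illustrative):
SELECT cust_id, SUM(amount_sold) AS cust_sales,
  NTILE(4) OVER (ORDER BY SUM(amount_sold) DESC) AS sales_quartile
FROM sales
GROUP BY cust_id;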
ROW_NUMBER Function
The ROW_NUMBER function assigns a unique number (sequentially, starting from 1,
as defined by ORDER BY) to each row within the partition. It has the following
syntax:
ROW_NUMBER ( ) OVER ( [query_partition_clause] order_by_clause )
Note that there are three pairs of tie values in these results. Like NTILE, ROW_
NUMBER is a non-deterministic function, so each tied value could have its row
number switched. To ensure deterministic results, you must order on a unique key.
In most cases, that will require adding a new tie breaker column to the query and
using it in the ORDER BY specification.
Windowing aggregate functions provide access to more than one row of a table
without a self-join. The syntax of the windowing functions is:
{SUM|AVG|MAX|MIN|COUNT|STDDEV|VARIANCE|FIRST_VALUE|LAST_VALUE}
({value expression1 | *}) OVER
([PARTITION BY value expression2 [,...]]
ORDER BY value expression3 [collate clause]
[ASC | DESC] [NULLS FIRST | NULLS LAST] [,...]
{ROWS | RANGE} {BETWEEN
{UNBOUNDED PRECEDING | CURRENT ROW | value_expr {PRECEDING | FOLLOWING}} AND
{UNBOUNDED FOLLOWING | CURRENT ROW | value_expr {PRECEDING | FOLLOWING}}
| {UNBOUNDED PRECEDING | CURRENT ROW | value_expr PRECEDING}})
In this example, the analytic function SUM defines, for each row, a window that
starts at the beginning of the partition (UNBOUNDED PRECEDING) and ends, by
default, at the current row.
Nested SUMs are needed in this example since we are performing a SUM over a value
that is itself a SUM. Nested aggregations are used very often in analytic aggregate
functions.
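A minimal sketch of this pattern over the sh data, computing a cumulative monthly sum (column choices are illustrative):
SELECT t.calendar_month_desc, SUM(s.amount_sold) AS month_sales,
  SUM(SUM(s.amount_sold)) OVER (ORDER BY t.calendar_month_desc
    ROWS UNBOUNDED PRECEDING) AS cumulative_sales
FROM sales s, times t
WHERE s.time_id = t.time_id AND t.calendar_year = 2000
GROUP BY t.calendar_month_desc;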
Note that the first two rows for the three month moving average calculation in the
output data are based on a smaller interval size than specified because the window
calculation cannot reach past the data retrieved by the query. You need to consider
the different window sizes found at the borders of result sets. In other words, you
may need to modify the query to include exactly what you want.
The starting and ending rows for each product's centered moving average
calculation in the output data are based on just two days, since the window
calculation cannot reach past the data retrieved by the query. Users need to consider
the different window sizes found at the borders of result sets: the query may need
to be adjusted.
In the output of this example, all dates except May 6 and May 12 return two rows
with duplicate dates. Examine the commented numbers to the right of the output to
see how the values are calculated. Note that each group in parentheses represents
the values returned for a single day.
Note that this example applies only when you use the RANGE keyword rather than
the ROWS keyword. It is also important to remember that with RANGE, you can only
use one ORDER BY expression in the analytic function's ORDER BY clause. With the
ROWS keyword, you can use multiple order by expressions in the analytic function's
ORDER BY clause.
One way to handle this problem would be to add the prod_id column to the result
set and order on both time_id and prod_id.
If the IGNORE NULLS option is used with FIRST_VALUE, it will return the first
non-null value in the set, or NULL if all values are NULL. If IGNORE NULLS is used
with LAST_VALUE, it will return the last non-null value in the set, or NULL if all
values are NULL. The IGNORE NULLS option is particularly useful in populating an
inventory table properly.
RATIO_TO_REPORT Function
The RATIO_TO_REPORT function computes the ratio of a value to the sum of a set
of values. If the expression value expression evaluates to NULL, RATIO_TO_
REPORT also evaluates to NULL, but it is treated as zero for computing the sum of
values for the denominator. Its syntax is:
RATIO_TO_REPORT ( expr ) OVER ( [query_partition_clause] )
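For illustration, a sketch that computes each channel's share of total sales in the sh data (aliases are illustrative):
SELECT ch.channel_desc, SUM(s.amount_sold) AS sales$,
  RATIO_TO_REPORT(SUM(s.amount_sold)) OVER () AS ratio_to_total
FROM sales s, channels ch
WHERE s.channel_id = ch.channel_id
GROUP BY ch.channel_desc;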
LAG/LEAD Functions
The LAG and LEAD functions are useful for comparing values when the relative
positions of rows can be known reliably. They work by specifying the count of rows
which separate the target row from the current row. Because the functions provide
access to more than one row of a table at the same time without a self-join, they can
enhance processing speed. The LAG function provides access to a row at a given
offset prior to the current position, and the LEAD function provides access to a row
at a given offset after the current position.
LAG/LEAD Syntax
These functions have the following syntax:
{LAG | LEAD} ( value_expr [, offset] [, default] )
OVER ( [query_partition_clause] order_by_clause )
See "Data Densification for Reporting" for information showing how to use the
LAG/LEAD functions for doing period-to-period comparison queries on sparse data.
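For illustration, a sketch that compares each month's sales with the previous month's in the sh data (column choices are illustrative):
SELECT t.calendar_month_desc, SUM(s.amount_sold) AS month_sales,
  LAG(SUM(s.amount_sold), 1) OVER (ORDER BY t.calendar_month_desc)
    AS prev_month_sales
FROM sales s, times t
WHERE s.time_id = t.time_id AND t.calendar_year = 2000
GROUP BY t.calendar_month_desc;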
FIRST/LAST Functions
The FIRST/LAST aggregate functions allow you to rank a data set and work with
its top-ranked or bottom-ranked rows. After finding the top or bottom ranked rows,
an aggregate function is applied to any desired column. That is, FIRST/LAST lets
you rank on column A but return the result of an aggregate applied on the
first-ranked or last-ranked rows of column B. This is valuable because it avoids the
need for a self-join or subquery, thus improving performance. These functions'
syntax begins with a regular aggregate function (MIN, MAX, SUM, AVG, COUNT,
VARIANCE, STDDEV) that produces a single return value per group. To specify the
ranking used, the FIRST/LAST functions add a new clause starting with the word
KEEP.
FIRST/LAST Syntax
These functions have the following syntax:
aggregate_function KEEP ( DENSE_RANK { FIRST | LAST } ORDER BY
  expr [ DESC | ASC ] [NULLS { FIRST | LAST }]
  [, expr [ DESC | ASC ] [NULLS { FIRST | LAST }]]...)
[OVER query_partitioning_clause]
SELECT prod_subcategory,
  MAX(prod_list_price) KEEP (DENSE_RANK LAST ORDER BY prod_min_price)
    AS LP_OF_HI_MINP,
  MAX(prod_min_price) AS HI_MINP
FROM products WHERE prod_category='Electronics'
GROUP BY prod_subcategory;
Using the FIRST and LAST functions as reporting aggregates makes it easy to
include the results in calculations such as "Salary as a percent of the highest salary."
PERC_DISC PERC_CONT
--------- ---------
5000 5000
linear interpolation between the values from rows at row numbers CRN = CEIL(RN)
and FRN = FLOOR(RN). The final result is:
PERCENTILE_CONT(X) =
  if (CRN = FRN = RN) then (value of expression from row at RN)
  else (CRN - RN) * (value of expression for row at FRN)
     + (RN - FRN) * (value of expression for row at CRN)
Consider the previous example query, where we compute PERCENTILE_CONT(0.5).
Here n is 17. The row number RN = (1 + 0.5*(n-1)) = 9 for both groups.
Putting this into the formula (FRN = CRN = 9), we return the value from row 9 as the
result.
As another example, suppose you want to compute PERCENTILE_CONT(0.66). The
computed row number RN = (1 + 0.66*(n-1)) = (1 + 0.66*16) = 11.56, so FRN = 11 and
CRN = 12. PERCENTILE_CONT(0.66) = (12 - 11.56)*(value of row 11) +
(11.56 - 11)*(value of row 12). These results are:
SELECT PERCENTILE_DISC(0.66) WITHIN GROUP
(ORDER BY cust_credit_limit) AS perc_disc, PERCENTILE_CONT(0.66) WITHIN GROUP
(ORDER BY cust_credit_limit) AS perc_cont
FROM customers WHERE cust_city='Marshal';
PERC_DISC PERC_CONT
---------- ----------
9000 8040
Inverse percentile aggregate functions can appear in the HAVING clause of a query
like other existing aggregate functions.
As Reporting Aggregates
You can also use the aggregate functions PERCENTILE_CONT, PERCENTILE_DISC
as reporting aggregate functions. When used as reporting aggregate functions, the
syntax is similar to those of other reporting aggregates.
[PERCENTILE_CONT | PERCENTILE_DISC](constant expression)
WITHIN GROUP ( ORDER BY single order by expression
[ASC|DESC] [NULLS FIRST| NULLS LAST])
OVER ( [PARTITION BY value expression [,...]] )
This query computes the same thing (median credit limit for customers in this result
set), but reports the result for every row in the result set, as the following query
illustrates:
SELECT cust_id, cust_credit_limit, PERCENTILE_DISC(0.5) WITHIN GROUP
 (ORDER BY cust_credit_limit) OVER () AS perc_disc, PERCENTILE_CONT(0.5) WITHIN GROUP
 (ORDER BY cust_credit_limit) OVER () AS perc_cont
FROM customers WHERE cust_city='Marshal';
You can use the NULLS FIRST/NULLS LAST option in the ORDER BY clause of these
functions, but it will be ignored, because NULLs are ignored.
Unlike the inverse percentile aggregates, the ORDER BY clause in the sort
specification for hypothetical rank and distribution functions may take multiple
expressions. The number of arguments and the expressions in the ORDER BY clause
should be the same, and the arguments must be constant expressions of the same type
as, or a type compatible with, the corresponding ORDER BY expression. A query can use
two arguments in several hypothetical ranking functions, as in the following sketch.
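This sketch ranks a hypothetical list price of 700 and a hypothetical minimum price of 550 within each product subcategory of the sh sample schema; the constant values are illustrative:
SELECT prod_subcategory,
  RANK(700, 550) WITHIN GROUP
    (ORDER BY prod_list_price DESC, prod_min_price DESC) hyp_rank,
  TO_CHAR(PERCENT_RANK(700, 550) WITHIN GROUP
    (ORDER BY prod_list_price DESC, prod_min_price DESC), '9.999') hyp_perc_rank
FROM products
WHERE prod_category = 'Electronics'
GROUP BY prod_subcategory;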
These functions can appear in the HAVING clause of a query just like other
aggregate functions. They cannot be used as either reporting aggregate functions or
windowing aggregate functions.
■ REGR_COUNT Function
■ REGR_AVGY and REGR_AVGX Functions
■ REGR_SLOPE and REGR_INTERCEPT Functions
■ REGR_R2 Function
■ REGR_SXX, REGR_SYY, and REGR_SXY Functions
Oracle applies the function to the set of (e1, e2) pairs after eliminating all pairs for
which either of e1 or e2 is null. e1 is interpreted as a value of the dependent
variable (a "y value"), and e2 is interpreted as a value of the independent variable
(an "x value"). Both expressions must be numbers.
The regression functions are all computed simultaneously during a single pass
through the data. They are frequently combined with the COVAR_POP, COVAR_
SAMP, and CORR functions.
REGR_COUNT Function
REGR_COUNT returns the number of non-null number pairs used to fit the
regression line. If applied to an empty set (or if there are no (e1, e2) pairs where
neither of e1 or e2 is null), the function returns 0.
REGR_R2 Function
The REGR_R2 function computes the coefficient of determination (usually called
"R-squared" or "goodness of fit") for the regression line.
REGR_R2 returns values between 0 and 1 when the regression line is defined (slope
of the line is not null), and it returns NULL otherwise. The closer the value is to 1,
the better the regression line fits the data.
The columns SLOPE, INTCPT, and RSQR are the slope, intercept, and coefficient of
determination of the regression line, respectively. The (integer) value COUNT is the
number of products in each channel for which both quantity sold and list price data
are available.
SELECT s.channel_id, REGR_SLOPE(s.quantity_sold, p.prod_list_price) SLOPE,
REGR_INTERCEPT(s.quantity_sold, p.prod_list_price) INTCPT,
REGR_R2(s.quantity_sold, p.prod_list_price) RSQR,
REGR_COUNT(s.quantity_sold, p.prod_list_price) COUNT,
REGR_AVGX(s.quantity_sold, p.prod_list_price) AVGLISTP,
REGR_AVGY(s.quantity_sold, p.prod_list_price) AVGQSOLD
FROM sales s, products p WHERE s.prod_id=p.prod_id
AND p.prod_category='Electronics' AND s.time_id=TO_DATE('10-OCT-2000')
GROUP BY s.channel_id;
Frequent Itemsets
Instead of counting how often a given event occurs (for example, how often
someone has purchased milk at the grocery), frequent itemsets provide a
mechanism for counting how often multiple events occur together (for example,
how often someone has purchased both milk and cereal together at the grocery
store).
The input to the frequent-itemsets operation is a set of data that represents
collections of items (itemsets). Some examples of itemsets could be all of the
products that a given customer purchased in a single trip to the grocery store
(commonly called a market basket), the web-pages that a user accessed in a single
session, or the financial services that a given customer utilizes. The notion of a
frequent itemset is to find those itemsets that occur most often. If you apply the
frequent-itemset operator to a grocery store's point-of-sale data, you might, for
example, discover that milk and bananas are the most commonly bought pair of
items.
Frequent itemsets have been used in business intelligence environments for
many years, with the most common application being market basket analysis in the
retail industry. Frequent itemsets are integrated with the database, operating on top
of relational tables and accessed through SQL. This integration provides a couple of
key benefits:
Descriptive Statistics
You can calculate the following descriptive statistics:
■ Median of a Data Set
Median (expr) [OVER (query_partition_clause)]
■ Paired-Samples T-Test
STATS_T_TEST_PAIRED (expr1, expr2 [, return_value])
■ The F-Test
STATS_F_TEST (expr1, expr2 [, return_value])
■ One-Way ANOVA
STATS_ONE_WAY_ANOVA (expr1, expr2 [, return_value])
Crosstab Statistics
You can calculate crosstab statistics using the following syntax:
STATS_CROSSTAB (expr1, expr2 [, return_value])
■ Mann-Whitney Test
STATS_MW_TEST (expr1, expr2 [, return_value])
■ Kolmogorov-Smirnov Test
STATS_KS_TEST (expr1, expr2 [, return_value])
Non-Parametric Correlation
You can calculate the following non-parametric statistics:
■ Spearman's rho Coefficient
CORR_S (expr1, expr2 [, return_value])
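For example, Spearman's rho between list price and minimum price across products in the sh sample schema can be computed as follows (a minimal sketch):
SELECT CORR_S(prod_list_price, prod_min_price) spearman_rho
FROM products;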
In addition to the functions, this release has a new PL/SQL package, DBMS_STAT_
FUNCS. It contains the descriptive statistical function SUMMARY along with functions
to support distribution fitting. The SUMMARY function summarizes a numerical
column of a table with a variety of descriptive statistics. The five distribution fitting
functions support normal, uniform, Weibull, Poisson, and exponential distributions.
WIDTH_BUCKET Function
For a given expression, the WIDTH_BUCKET function returns the bucket number
that the result of this expression will be assigned after it is evaluated. You can
generate equiwidth histograms with this function. Equiwidth histograms divide
data sets into buckets whose interval size (highest value to lowest value) is equal.
The number of rows held by each bucket will vary. A related function, NTILE,
creates equiheight buckets.
Equiwidth histograms can be generated only for numeric, date or datetime types.
So the first three parameters should be all numeric expressions or all date
expressions. Other types of expressions are not allowed. If the first parameter is
NULL, the result is NULL. If the second or the third parameter is NULL, an error
message is returned, as a NULL value cannot denote any end point (or any point) for
a range in a date or numeric value dimension. The last parameter (number of
buckets) should be a numeric expression that evaluates to a positive integer value;
0, NULL, or a negative value will result in an error.
Buckets are numbered from 0 to (n+1). Bucket 0 holds the count of values less than
the minimum. Bucket(n+1) holds the count of values greater than or equal to the
maximum specified value.
WIDTH_BUCKET Syntax
The WIDTH_BUCKET function takes four expressions as parameters. The first parameter is the
expression that the equiwidth histogram is for. The second and third parameters are
expressions that denote the end points of the acceptable range for the first
parameter. The fourth parameter denotes the number of buckets.
WIDTH_BUCKET(expression, minval expression, maxval expression, num buckets)
Consider the following data from table customers, which shows the credit limits of
23 customers. This data is gathered in the query shown in Example 21–19 on
page 21-41.
CUST_ID CUST_CREDIT_LIMIT
--------- -----------------
10346 7000
35266 7000
41496 15000
35225 11000
3424 9000
28344 1500
31112 7000
8962 1500
15192 3000
21380 5000
36651 1500
30420 5000
8270 3000
17268 11000
14459 11000
13808 5000
32497 1500
100977 9000
102077 3000
103066 10000
101784 5000
100421 11000
102343 3000
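For example, a query of the following form assigns each of these credit limits to one of four equiwidth buckets spanning 0 to 20000 (a sketch; the cust_city filter is assumed to match the data shown above):
SELECT cust_id, cust_credit_limit,
  WIDTH_BUCKET(cust_credit_limit, 0, 20000, 4) credit_bucket
FROM customers
WHERE cust_city = 'Marshal'
ORDER BY cust_credit_limit;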
Values of 20000 and above are assigned to the overflow bucket, which is numbered 5
(num buckets + 1 in general). See Figure 21–3
for a graphical illustration of how the buckets are assigned.
[Figure 21–3: Credit limits from 0 to 20000 are divided into four equiwidth buckets
numbered 1 through 4; values below 0 fall into underflow bucket 0, and values of
20000 and above fall into overflow bucket 5.]
You can specify the bounds in the reverse order, for example, WIDTH_BUCKET
(cust_credit_limit, 20000, 0, 4). When the bounds are reversed, the buckets
will be open-closed intervals. In this example, bucket number 1 is (15000,20000],
bucket number 2 is (10000,15000], and bucket number 4 is (0,5000]. The
overflow bucket will be numbered 0 (20000, +infinity), and the underflow
bucket will be numbered 5 (-infinity, 0].
It is an error if the bucket count parameter is 0 or negative.
36651 1500 1 4
30420 5000 2 4
8270 3000 1 4
17268 11000 3 2
14459 11000 3 2
13808 5000 2 4
32497 1500 1 4
100977 9000 2 3
102077 3000 1 4
103066 10000 3 3
101784 5000 2 4
100421 11000 3 2
102343 3000 1 4
USERDEF_SKEW
============
0.583891
CASE Expressions
Oracle now supports simple and searched CASE expressions. CASE expressions are
similar in purpose to the DECODE function, but they offer more flexibility and
logical power. They are also easier to read than traditional DECODE expressions, and
offer better performance as well. They are commonly used when breaking
categories into buckets like age (for example, 20-29, 30-39, and so on). The syntax
for simple CASE expressions is:
CASE expr WHEN comparison_expr THEN return_expr
  [WHEN comparison_expr THEN return_expr]... [ELSE else_expr] END
You can specify at most 255 arguments, and each WHEN ... THEN pair counts as two
arguments. For a workaround to this limit, see Oracle Database SQL Reference.
Consider, for example, a query that computes AVG(foo(e.sal)) over the emps table, where
foo is a function that returns its input if the input is greater than 2000, and
returns 2000 otherwise. The query has performance implications because it needs to
invoke a function for each row. Writing custom functions can also add to the
development load.
Using CASE expressions in the database without PL/SQL, this query can be
rewritten as:
SELECT AVG(CASE when e.sal > 2000 THEN e.sal ELSE 2000 end) FROM emps e;
Using a CASE expression lets you avoid developing custom functions and can also
perform faster.
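For example, a histogram over the customer credit limits shown earlier can be built with a searched CASE expression (a sketch; the cust_city filter is assumed to match the data shown above):
SELECT (CASE WHEN cust_credit_limit BETWEEN     0 AND  3999 THEN ' 0 - 3999'
             WHEN cust_credit_limit BETWEEN  4000 AND  7999 THEN ' 4000 - 7999'
             WHEN cust_credit_limit BETWEEN  8000 AND 11999 THEN ' 8000 - 11999'
             WHEN cust_credit_limit BETWEEN 12000 AND 16000 THEN '12000 - 16000'
        END) AS bucket,
  COUNT(*) AS count_in_group
FROM customers
WHERE cust_city = 'Marshal'
GROUP BY (CASE WHEN cust_credit_limit BETWEEN     0 AND  3999 THEN ' 0 - 3999'
               WHEN cust_credit_limit BETWEEN  4000 AND  7999 THEN ' 4000 - 7999'
               WHEN cust_credit_limit BETWEEN  8000 AND 11999 THEN ' 8000 - 11999'
               WHEN cust_credit_limit BETWEEN 12000 AND 16000 THEN '12000 - 16000'
          END)
ORDER BY 1;
Over the data shown earlier, such a query produces output of the following form: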
BUCKET COUNT_IN_GROUP
------------- --------------
0 - 3999 8
4000 - 7999 7
8000 - 11999 7
12000 - 16000 1
SELECT .....
FROM table_reference
LEFT OUTER JOIN table_reference
PARTITION BY (expr [, expr ]...)
Note that FULL OUTER JOIN is not supported with a partitioned outer join.
In this example, we would expect 22 rows of data (11 weeks each from 2 years) if
the data were dense. However we get only 18 rows because weeks 25 and 26 are
missing in 2000, and weeks 26 and 28 in 2001.
Bounce 2001 26 0
Bounce 2001 27 2125.12
Bounce 2001 28 0
Bounce 2001 29 2467.92
Bounce 2001 30 2620.17
Note that in this query, a WHERE condition was placed for weeks between 20 and 30
in the inline view for the time dimension. This was introduced to keep the result set
small.
In this query, the WITH subquery factoring clause v1 summarizes sales data at the
product, country, and year level. This result is sparse, but users may want to see all
the country and year combinations for each product. To achieve this, we take each
partition of v1 based on product values and outer join it on the country dimension
first. This gives us all values of country for each product. We then take that
result, partition it on product and country values, and outer join it on the time
dimension. This gives us all time values for each product and country
combination.
PROD_ID COUNTRY_ID CALENDAR_YEAR UNITS SALES
---------- ---------- ------------- ---------- ----------
147 52782 1998
147 52782 1999 29 209.82
147 52782 2000 71 594.36
147 52782 2001 345 2754.42
147 52782 2002
147 52785 1998 1 7.99
147 52785 1999
147 52785 2000
147 52785 2001
147 52785 2002
147 52786 1998 1 7.99
147 52786 1999
147 52786 2000 2 15.98
147 52786 2001
147 52786 2002
147 52787 1998
147 52787 1999
147 52787 2000
147 52787 2001
147 52787 2002
147 52788 1998
147 52788 1999
147 52788 2000 1 7.99
147 52788 2001
147 52788 2002
148 52782 1998 139 4046.67
148 52782 1999 228 5362.57
148 52782 2000 251 5629.47
148 52782 2001 308 7138.98
148 52782 2002
148 52785 1998
148 52785 1999
148 52785 2000
148 52785 2001
148 52785 2002
148 52786 1998
148 52786 1999
For reporting purposes, users may want to see this inventory data differently. For
example, they may want to see all values of time for each product. This can be
accomplished using partitioned outer join. In addition, for the newly inserted rows
of missing time periods, users may want to see the values for quantity of units
column to be carried over from the most recent existing time period. The latter can
be accomplished using the analytic window function LAST_VALUE. Here is the
query and the desired output:
WITH v1 AS
(SELECT time_id
FROM times
WHERE times.time_id BETWEEN
TO_DATE('01/04/01', 'DD/MM/YY')
AND TO_DATE('07/04/01', 'DD/MM/YY'))
SELECT product, time_id, quant quantity,
LAST_VALUE(quant IGNORE NULLS)
OVER (PARTITION BY product ORDER BY time_id)
repeated_quantity
FROM
(SELECT product, v1.time_id, quant
FROM invent_table PARTITION BY (product)
RIGHT OUTER JOIN v1
ON (v1.time_id = invent_table.time_id))
ORDER BY 1, 2;
The inner query computes a partitioned outer join on time within each product. The
inner query densifies the data on the time dimension (meaning the time dimension
will now have a row for each day of the week). However, the measure column
quantity will have nulls for the newly added rows (see the output in the column
quantity in the following results).
The outer query uses the analytic function LAST_VALUE. Applying this function
partitions the data by product and orders the data on the time dimension column
(time_id). For each row, the function finds the last non-null value in the window
due to the option IGNORE NULLS, which you can use with both LAST_VALUE and
FIRST_VALUE. We see the desired output in the column repeated_quantity in
the following output:
PRODUCT TIME_ID QUANTITY REPEATED_QUANTITY
---------- --------- -------- -----------------
bottle 01-APR-01 10 10
bottle 02-APR-01 10
bottle 03-APR-01 10
bottle 04-APR-01 10
bottle 05-APR-01 10
bottle 06-APR-01 8 8
bottle 07-APR-01 8
can 01-APR-01 15 15
can 02-APR-01 15
can 03-APR-01 15
can 04-APR-01 11 11
can 05-APR-01 11
can 06-APR-01 11
can 07-APR-01 11
WITH V AS
(SELECT substr(p.prod_name,1,12) prod_name, calendar_month_desc,
SUM(quantity_sold) units, SUM(amount_sold) sales
FROM sales s, products p, times t
WHERE s.prod_id in (122,136) AND calendar_year = 2000
AND t.time_id = s.time_id
AND p.prod_id = s.prod_id
GROUP BY p.prod_name, calendar_month_desc)
SELECT v.prod_name, calendar_month_desc, units, sales,
NVL(units, AVG(units) OVER (partition by v.prod_name)) computed_units,
NVL(sales, AVG(sales) OVER (partition by v.prod_name)) computed_sales
FROM
(SELECT DISTINCT calendar_month_desc
FROM times
WHERE calendar_year = 2000) t
LEFT OUTER JOIN v
PARTITION BY (v.prod_name)
USING (calendar_month_desc);
WITH v AS
(SELECT SUBSTR(p.Prod_Name,1,6) Prod, t.Calendar_Year Year,
t.Calendar_Week_Number Week, SUM(Amount_Sold) Sales
FROM Sales s, Times t, Products p
WHERE s.Time_id = t.Time_id AND
s.Prod_id = p.Prod_id AND p.Prod_name in ('Y Box') AND
t.Calendar_Year in (2000,2001) AND
t.Calendar_Week_Number BETWEEN 30 AND 40
GROUP BY p.Prod_Name, t.Calendar_Year, t.Calendar_Week_Number)
SELECT Prod , Year, Week, Sales,
Weekly_ytd_sales, Weekly_ytd_sales_prior_year
FROM
(SELECT Prod, Year, Week, Sales, Weekly_ytd_sales,
LAG(Weekly_ytd_sales, 1) OVER
(PARTITION BY Prod , Week ORDER BY Year) Weekly_ytd_sales_prior_year
FROM
(SELECT v.Prod Prod , t.Year Year, t.Week Week,
NVL(v.Sales,0) Sales, SUM(NVL(v.Sales,0)) OVER
(PARTITION BY v.Prod , t.Year ORDER BY t.week) weekly_ytd_sales
FROM v
PARTITION BY (v.Prod )
RIGHT OUTER JOIN
(SELECT DISTINCT Calendar_Week_Number Week, Calendar_Year Year
FROM Times
WHERE Calendar_Year IN (2000, 2001)) t
ON (v.week = t.week AND v.Year = t.Year)
) dense_sales
) year_over_year_sales
WHERE Year = 2001 AND Week BETWEEN 30 AND 40
ORDER BY 1, 2, 3;
Weekly_ytd
_sales_
PROD YEAR WEEK SALES WEEKLY_YTD_SALES prior_year
------ ---------- ---------- ---------- ---------------- ----------
Y Box 2001 30 7877.45 7877.45 0
Y Box 2001 31 13082.46 20959.91 1537.35
Y Box 2001 32 11569.02 32528.93 9531.57
Y Box 2001 33 38081.97 70610.9 39048.69
Y Box 2001 34 33109.65 103720.55 69100.79
Y Box 2001 35 0 103720.55 71265.35
Y Box 2001 36 4169.3 107889.85 81156.29
Y Box 2001 37 24616.85 132506.7 95433.09
Y Box 2001 38 37739.65 170246.35 107726.96
Y Box 2001 39 284.95 170531.3 118817.4
Y Box 2001 40 10868.44 181399.74 120969.69
In the FROM clause of the in-line view dense_sales, we use a partitioned outer
join of aggregate view v and time view t to fill gaps in the sales data along the time
dimension. The output of the partitioned outer join is then processed by the analytic
function SUM ... OVER to compute the weekly year-to-date sales (the weekly_
ytd_sales column). Thus, the view dense_sales computes the year-to-date
sales data for each week, including those missing in the aggregate view v. The
in-line view year_over_year_sales then computes the year ago weekly
year-to-date sales using the LAG function. The LAG function labeled weekly_ytd_
sales_prior_year specifies a PARTITION BY clause that pairs rows for the same
week of years 2000 and 2001 into a single partition. We then pass an offset of 1 to the
LAG function to get the weekly year to date sales for the prior year.
The outermost query block selects data from year_over_year_sales with the
condition Year = 2001, and thus the query returns, for each product, its weekly
year-to-date sales in the specified weeks of years 2001 and 2000.
GROUP BY
ROLLUP(calendar_year, calendar_quarter_desc, calendar_month_desc, t.time_id),
ROLLUP(prod_category, prod_subcategory, p.prod_id);
Because this view is limited to two products, it returns just over 2200 rows. Note
that the column Hierarchical_Time contains string representations of time from
all levels of the time hierarchy. The CASE expression used for the Hierarchical_
Time column appends a marker (_0, _1, ...) to each date string to denote the time
level of the value. A _0 represents the year level, _1 is quarters, _2 is months, and _3
is day. Note that the GROUP BY clause is a concatenated ROLLUP which specifies the
rollup hierarchy for the time and product dimensions. The GROUP BY clause is what
determines the hierarchical cube contents.
Step 2 Create the view edge_time, which is a complete set of date values
edge_time is the source for filling time gaps in the hierarchical cube using a
partitioned outer join. The column Hierarchical_Time in edge_time will be
used in a partitioned join with the Hierarchical_Time column in the view
cube_prod_time. The following statement defines edge_time:
CREATE OR REPLACE VIEW edge_time AS
SELECT
(CASE
WHEN ((GROUPING(calendar_year)=0 )
AND (GROUPING(calendar_quarter_desc)=1 ))
THEN (TO_CHAR(calendar_year) || '_0')
WHEN ((GROUPING(calendar_quarter_desc)=0 )
AND (GROUPING(calendar_month_desc)=1 ))
THEN (TO_CHAR(calendar_quarter_desc) || '_1')
WHEN ((GROUPING(calendar_month_desc)=0 )
AND (GROUPING(time_id)=1 ))
THEN (TO_CHAR(calendar_month_desc) || '_2')
ELSE (TO_CHAR(time_id) || '_3')
END) Hierarchical_Time,
calendar_year yr, calendar_quarter_number qtr_num,
calendar_quarter_desc qtr, calendar_month_number mon_num,
calendar_month_desc mon, time_id - TRUNC(time_id, 'YEAR') + 1 day_num,
time_id day,
GROUPING_ID(calendar_year, calendar_quarter_desc,
calendar_month_desc, time_id) gid_t
FROM TIMES
GROUP BY ROLLUP
(calendar_year, (calendar_quarter_desc, calendar_quarter_number),
(calendar_month_desc, calendar_month_number), time_id);
Some of the calculations we can achieve for each time level are:
■ Sum of sales for prior period at all levels of time.
■ Variance in sales over prior period.
■ Sum of sales in the same period a year ago at all levels of time.
■ Variance in sales over the same period last year.
The following example performs all four of these calculations. It uses a partitioned
outer join of the views cube_prod_time and edge_time to create an in-line view
of dense data called dense_cube_prod_time. The query then uses the LAG
function in the same way as the prior single-level example. The outer WHERE clause
specifies time at three levels: the days of August 2001, the entire month, and the
entire third quarter of 2001. Note that the last two rows of the results contain the
month level and quarter level aggregations.
Note: To make the results easier to read if you are using SQL*Plus, the column
headings should be adjusted with the following commands. The commands will
fold the column headings to reduce line length:
col sales_prior_period heading 'sales_prior|_period'
col variance_prior_period heading 'variance|_prior|_period'
col sales_same_period_prior_year heading 'sales_same|_period_prior|_year'
col variance_same_period_p_year heading 'variance|_same_period|_prior_year'
Here is the query comparing current sales to prior and year ago sales:
SELECT SUBSTR(prod,1,4) prod, SUBSTR(Hierarchical_Time,1,12) ht,
sales, sales_prior_period,
sales - sales_prior_period variance_prior_period,
sales_same_period_prior_year,
sales - sales_same_period_prior_year variance_same_period_p_year
FROM
(SELECT cat, subcat, prod, gid_p, gid_t,
Hierarchical_Time, yr, qtr, mon, day, sales,
LAG(sales, 1) OVER (PARTITION BY gid_p, cat, subcat, prod,
gid_t ORDER BY yr, qtr, mon, day)
sales_prior_period,
LAG(sales, 1) OVER (PARTITION BY gid_p, cat, subcat, prod,
gid_t, qtr_num, mon_num, day_num ORDER BY yr)
sales_same_period_prior_year
FROM
(SELECT c.gid, c.cat, c.subcat, c.prod, c.gid_p,
t.gid_t, t.yr, t.qtr, t.qtr_num, t.mon, t.mon_num,
t.day, t.day_num, t.Hierarchical_Time, NVL(s_sold,0) sales
FROM cube_prod_time c
PARTITION BY (gid_p, cat, subcat, prod)
RIGHT OUTER JOIN edge_time t
ON ( c.gid_t = t.gid_t AND
c.Hierarchical_Time = t.Hierarchical_Time)
) dense_cube_prod_time
) --side by side current and prior year sales
WHERE prod IN (139) AND gid_p=0 AND --1 product and product level data
( (mon IN ('2001-08' ) AND gid_t IN (0, 1)) OR --day and month data
(qtr IN ('2001-03' ) AND gid_t IN (3))) --quarter level data
ORDER BY day;
139 30-AUG-01_3 0 0 0 0 0
139 31-AUG-01_3 0 0 0 0 0
139 2001-08_2 8347.43 7213.21 1134.22 8368.98 -21.55
139 2001-03_1 24356.8 28862.14 -4505.34 24168.99 187.81
The first LAG function (sales_prior_period) partitions the data on gid_p, cat,
subcat, prod, gid_t and orders the rows on all the time dimension columns. It
gets the sales value of the prior period by passing an offset of 1. The second LAG
function (sales_same_period_prior_year) partitions the data on additional
columns qtr_num, mon_num, and day_num and orders it on yr so that, with an
offset of 1, it can compute the year ago sales for the same period. The outermost
SELECT clause computes the variances.
In this statement, the view time_c is defined by performing a UNION ALL of the
edge_time view (defined in the prior example) and the user-defined 13th month.
The gid_t value of 8 was chosen to differentiate the custom member from the
standard members. The UNION ALL specifies the attributes for a 13th month
member by doing a SELECT from the DUAL table. Note that the grouping id,
column gid_t, is set to 8, and the quarter number is set to 5.
Then, the second step is to use an inline view of the query to perform a partitioned
outer join of cube_prod_time with time_c. This step creates sales data for the
13th month at each level of product aggregation. In the main query, the analytic
function SUM is used with a CASE expression to compute the 13th month, which is
defined as the summation of the first month's sales of each quarter.
SELECT * FROM
(SELECT SUBSTR(cat,1,12) cat, SUBSTR(subcat,1,12) subcat,
prod, mon, mon_num,
SUM(CASE WHEN mon_num IN (1, 4, 7, 10)
THEN s_sold
ELSE NULL
END)
OVER (PARTITION BY gid_p, prod, subcat, cat, yr) sales_month_13
FROM
(SELECT c.gid, c.prod, c.subcat, c.cat, gid_p,
t.gid_t, t.day, t.mon, t.mon_num,
t.qtr, t.yr, NVL(s_sold,0) s_sold
FROM cube_prod_time c
PARTITION BY (gid_p, prod, subcat, cat)
RIGHT OUTER JOIN time_c t
ON (c.gid_t = t.gid_t AND
c.Hierarchical_Time = t.Hierarchical_Time)
)
)
WHERE mon_num=13;
The SUM function uses a CASE to limit the data to months 1, 4, 7, and 10 within each
year. Due to the tiny data set, with just 2 products, the rollup values of the results
are necessarily repetitions of lower level aggregations. For a more realistic set of
rollup values, you can include more products from the Game Console and Y Box
Games subcategories in the underlying materialized view.
[Figure: MODEL clause processing. The original data (country, product, year, sales):
  A  Prod1  2000  10        B  Prod1  2000  21
  A  Prod1  2001  15        B  Prod1  2001  23
  A  Prod2  2000  12        B  Prod2  2000  28
  A  Prod2  2001  16        B  Prod2  2001  29
is passed through the rules, which produce the following new rows (rule results):
  A  Prod1  2002  25        B  Prod1  2002  44
  A  Prod2  2002  28        B  Prod2  2002  57]
The data retrieved by the query is organized according to the PARTITION BY,
DIMENSION BY, and MEASURES clauses and rearranged into an array. Once the array is
defined, rules are applied one by one to the data. Finally, the data, including both its
updated values and newly created values, is rearranged into row form and presented
as the results of the query.
MODEL
DIMENSION BY (prod, year)
MEASURES (sales s)
RULES UPSERT
(s[ANY, 2000]=s[CV(prod), CV(year)-1]*2, --Rule 1
s[vcr, 2002]=s[vcr, 2001]+s[vcr, 2000], --Rule 2
s[dvd, 2002]=AVG(s)[CV(prod), year<2001]) --Rule 3
[Figure: array contents by product (columns vcr, dvd, tv, pc) and year.
After Rule 1:                      After Rule 2:
  1999:  1  2  3  4                 1999:  1  2  3  4
  2000:  2  4  6  8                 2000:  2  4  6  8
  2001:  9  0  1  2                 2001:  9  0  1  2
                                    2002: 11 (vcr only)]
Note that, while the sales values for Bounce and Y Box exist in the input, the values
for All_Products are derived.
This sets the sales in Spain for the year 2001 to the sum of sales in Spain for 1999
and 2000. An example involving a range of cells is the following:
sales[country='Spain',year=2001] =
MAX(sales)['Spain',year BETWEEN 1997 AND 2000]
This sets the sales in Spain for the year 2001 equal to the maximum sales in
Spain between 1997 and 2000.
■ The UPSERT and UPDATE options
Using the UPSERT option, which is the default, you can create cell values that
do not exist in the input data. If the cell referenced exists in the data, it is
updated. Otherwise, it is inserted. The UPDATE option, on the other hand,
would not insert any new cells.
You can specify these options globally, in which case they apply to all rules, or
for each rule. If you specify an option at the rule level, it overrides the global
option. Consider the following rules:
UPDATE sales['Spain', 1999] = 3567.99,
UPSERT sales['Spain', 2001] = sales['Spain', 2000]+ sales['Spain', 1999]
The first rule updates the cell for sales in Spain for 1999. The second rule
updates the cell for sales in Spain for 2001 if it exists, otherwise, it creates a new
cell.
■ Wildcard specification of dimensions
You can use ANY and IS ANY to specify all values in a dimension. As an
example, consider the following statement:
sales[ANY, 2001] = sales['Japan', 2000]
This rule sets the 2001 sales of all countries equal to the sales value of Japan for
the year 2000. All values for the dimension, including nulls, satisfy the ANY
specification. You can specify the same in symbolic form using an IS ANY
predicate as in the following:
sales[country IS ANY, 2001] = sales['Japan', 2000]
Observe that the CV function passes the value for the country dimension from
the left to the right side of the rule.
■ Ordered computation
For rules updating a set of cells, the result may depend on the ordering of
dimension values. You can force a particular order for the dimension values by
specifying an ORDER BY in the rule. An example is the following rule:
sales[country IS ANY, year BETWEEN 2000 AND 2003] ORDER BY year =
1.05 * sales[CV(country), CV(year)-1]
This ensures that the years are referenced in increasing chronological order.
■ Automatic rule ordering
Rules in the MODEL clause can be automatically ordered based on dependencies
among the cells using the AUTOMATIC ORDER keywords. For example, in the
following assignments, the last two rules will be processed before the first rule
because the first depends on the second and third:
RULES AUTOMATIC ORDER
(sales[c='Spain', y=2001] = sales[c='Spain', y=2000]
   + sales[c='Spain', y=1999],
 sales[c='Spain', y=2000] = 50000,
 sales[c='Spain', y=1999] = 40000)
■ Scalable computation
You can partition data and evaluate rules within each partition independent of
other partitions. This enables parallelization of model computation based on
partitions. For example, consider the following model:
MODEL PARTITION BY (c) DIMENSION BY (y) MEASURES (s)
  (s[y=2001] = AVG(s)[y BETWEEN 1990 AND 2000])
The data is partitioned by country and, within each partition, you can compute
the sales in 2001 to be the average of sales in the years between 1990 and 2000.
Partitions can be processed in parallel and this results in a scalable execution of
the model.
■ Cell Referencing
■ Differences Between Update and Upsert
Base Schema
This chapter's examples are based on the following view sales_view, which is
derived from the sh sample schema.
CREATE VIEW sales_view AS
SELECT country_name country, prod_name product, calendar_year year,
SUM(amount_sold) sales, COUNT(amount_sold) cnt,
MAX(calendar_year) KEEP (DENSE_RANK FIRST ORDER BY SUM(amount_sold) DESC)
OVER (PARTITION BY country_name, prod_name) best_year,
MAX(calendar_year) KEEP (DENSE_RANK LAST ORDER BY SUM(amount_sold) DESC)
OVER (PARTITION BY country_name, prod_name) worst_year
FROM sales, times, customers, countries, products
WHERE sales.time_id = times.time_id AND sales.prod_id = products.prod_id AND
sales.cust_id =customers.cust_id AND customers.country_id=countries.country_id
GROUP BY country_name, prod_name, calendar_year;
This query computes SUM and COUNT aggregates on the sales data grouped by
country, product, and year. It reports, for each product sold in a country, the year
when the sales were the highest for that product in that country. This is called the
best_year of the product. Similarly, worst_year gives the year when the sales
were the lowest.
MEASURES (<cols>)
[<reference options>]
[RULES] <rule options>
(<rule>, <rule>,.., <rule>)
Each rule represents an assignment. Its left side references a cell or a set of cells and
the right side can contain expressions involving constants, host variables,
individual cells or aggregates over ranges of cells. For example, consider
Example 22–1.
This query defines model computation on the rows from sales_view for the
countries Italy and Japan. This model has been given the name simple_model. It
partitions the data on country and defines, within each partition, a two-dimensional
array on product and year. Each cell in this array holds the measure value sales. The
first rule of this model sets the sales of Bounce in year 2001 to 1000. The next two
rules define that the sales of Bounce in 2002 are the sum of its sales in years 2001
and 2000, and the sales of Y Box in 2002 are the same as those of the previous year, 2001.
Specifying RETURN UPDATED ROWS makes the preceding query return only those
rows that are updated or inserted by the model computation. By default or if you
use RETURN ALL ROWS, you would get all rows not just the ones updated or
inserted by the MODEL clause. The query produces the following output:
COUNTRY PRODUCT YEAR SALES
-------------------- --------------- ---------- ----------
Italy Bounce 2001 1000
Italy Bounce 2002 5333.69
Italy Y Box 2002 81207.55
Japan Bounce 2001 1000
Japan Bounce 2002 6133.53
Japan Y Box 2002 89634.83
Note that the MODEL clause does not update or insert rows into database tables. The
following query illustrates this by showing that sales_view has not been altered:
SELECT SUBSTR(country,1,20) country, SUBSTR(product,1,15) product, year, sales
FROM sales_view
WHERE country IN ('Italy', 'Japan');
Observe that the update of the sales value for Bounce in 2001 made by this
MODEL clause is not reflected in the database. If you want to update or insert rows
in the database tables, you should use the INSERT, UPDATE, or MERGE statements.
In the preceding example, columns are specified in the PARTITION BY, DIMENSION
BY, and MEASURES list. You can also specify constants, host variables, single-row
functions, aggregate functions, analytical functions, or expressions involving them
as partition and dimension keys and measures. However, you need to alias them in
PARTITION BY, DIMENSION BY, and MEASURES lists. You need to use aliases to
refer to these expressions in the rules, SELECT list, and the query ORDER BY. The
following example shows how to use expressions and aliases:
SELECT country, p product, year, sales, profits
FROM sales_view
See Oracle Database SQL Reference for more information regarding MODEL clause
syntax.
This keeps null and absent cell values unchanged. It is useful for making
exceptions when IGNORE NAV is specified at the global level. This is the default,
and can be omitted.
Calculation Definition
■ MEASURES
The set of values that are modified or created by the model.
■ RULES
The rules that assign values to measures.
■ AUTOMATIC ORDER
This causes all rules to be evaluated in an order based on their dependencies.
■ SEQUENTIAL ORDER
This causes rules to be evaluated in the order they are written. This is the
default.
■ UNIQUE DIMENSION
This is the default, and it means that the PARTITION BY and DIMENSION BY
columns in the MODEL clause must uniquely identify each and every cell in the
model. This uniqueness is explicitly verified at run time when necessary, in
which case it causes some overhead.
■ UNIQUE SINGLE REFERENCE
The PARTITION BY and DIMENSION BY clauses uniquely identify single point
references on the right-hand side of the rules. This may reduce processing time
by avoiding explicit checks for uniqueness at run time.
■ RETURN [ALL|UPDATED] ROWS
This enables you to specify whether to return all rows selected or only those
rows updated by the rules. The default is ALL, while the alternative is UPDATED
ROWS.
Cell Referencing
In the MODEL clause, a relation can be viewed as a multi-dimensional array of cells.
A cell of this multi-dimensional array contains the measure values and is indexed
using DIMENSION BY keys, within each partition defined by the PARTITION BY
keys. For example, consider the following:
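A sketch of such a specification over sales_view (the rule shown is just a placeholder so that the statement is complete) is:
SELECT country, product, year, sales, best_year
FROM sales_view
MODEL
  PARTITION BY (country)
  DIMENSION BY (product, year)
  MEASURES (sales, best_year)
  RULES (sales['Bounce', 2003] = sales['Bounce', 2002])
ORDER BY country, product, year;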
This partitions the data by country and defines within each partition, a
two-dimensional array on product and year. The cells of this array contain two
measures: sales and best_year.
Accessing the measure value of a cell by specifying the DIMENSION BY keys
constitutes a cell reference. An example of a cell reference is sales[product=
'Bounce', year=2000].
Here, we are accessing the sales value of a cell referenced by product Bounce and
the year 2000. In a cell reference, you can specify DIMENSION BY keys either
symbolically as in the preceding cell reference or positionally as in
sales['Bounce', 2000].
Only rules with positional references can insert new cells. Assuming the DIMENSION
BY keys are product and year in that order, a positional reference such as
sales['Bounce', 2001] accesses the sales value for Bounce and 2001.
Based on how they are specified, cell references are either single cell or multi-cell
reference.
A reference such as sales['Bounce', year=2001] is a single cell reference in which a
single value is specified for the first dimension (product) positionally and a single
value for the second dimension (year) is specified symbolically.
Multi-Cell References
Cell references that are not single cell references are called multi-cell references.
Examples of multi-cell references are:
sales[year >= 2001]
sales[product='Bounce', year < 2001]
sales[product LIKE '%Finding Fido%', year IN (1999, 2000, 2001)]
Rules
Model computation is expressed in rules that manipulate the cells of the
multi-dimensional array defined by PARTITION BY, DIMENSION BY, and
MEASURES clauses. A rule is an assignment statement whose left side represents a
cell or a range of cells and whose right side is an expression involving constants,
bind variables, individual cells or an aggregate function on a range of cells. Rules
can use wild cards and looping constructs for maximum expressiveness. An
example of a rule is the following:
sales['Bounce', 2003] = 1.2 * sales['Bounce', 2002]
This rule says that, for the product Bounce, the sales for 2003 are 20% more than
that of 2002.
Note that this rule has single cell references on both left and right side and is
relatively simple. Complex rules can be written with multi-cell references,
aggregates, and nested cell references. These are discussed in the following sections.
The regression slope function REGR_SLOPE can also be used in rules. This
function computes the slope of the change of a measure with respect to a dimension
of the measure; here, it would give the slope of the changes in the sales value with
respect to year. Such a model could, for example, project Finding Fido sales for 2003
to be the sales in 2002 scaled by the growth (or slope) in sales for years less than 2002.
Aggregate functions can appear only on the right side of rules. Arguments to the
aggregate function can be constants, bind variables, measures of the MODEL clause,
or expressions involving them. For example, the rule that computes the sales of Bounce
for 2003 as the weighted average of its sales for the years from 1998 to 2002 would
be:
sales['Bounce', 2003] =
AVG(sales * weight)['Bounce', year BETWEEN 1998 AND 2002]
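Consider, for example, a rule of the following form, which has a multi-cell reference on its left side:
sales[product='Standard Mouse Pad', year>2000] = 0.2 * sales['Finding Fido', 2000]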
This rule accesses a range of cells on the left side (cells for product Standard Mouse
Pad and year greater than 2000) and assigns sales measure of each such cell to the
value computed by the right side expression. Computation by the preceding rule is
described as "sales of Standard Mouse Pad for years after 2000 is 20% of the sales of
Finding Fido for year 2000". This computation is simple in that the right side cell
references and hence the right side expression are the same for all cells referenced
on the left.
Now consider a rule whose right side also refers back to the cell currently being
computed on the left side:
sales[product='Standard Mouse Pad', year>2000] =
  sales[CV(product), CV(year)] + 0.2 * sales['Finding Fido', 2000]
The CV function provides the value of a DIMENSION BY key of the cell currently
referenced on the left side. When the left side references the cell Standard Mouse
Pad and 2001, the right side expression would be:
sales['Standard Mouse Pad', 2001] + 0.2 * sales['Finding Fido', 2000]
Similarly, when the left side references the cell Standard Mouse Pad and 2002, the
right side expression we would evaluate is:
sales['Standard Mouse Pad', 2002] + 0.2 * sales['Finding Fido', 2000]
The use of the CV function provides the capability of relative indexing where
dimension values of the cell referenced on the left side are used on the right side cell
references. The CV function takes a dimension key as its argument. It is also possible
to use CV without any argument, as in CV(), in which case positional referencing is
implied. CV may also be used outside a cell reference; when used in this way, its
argument must contain the name of the desired dimension. You can also
write the preceding rule as:
sales[product='Standard Mouse Pad', year>2000] =
sales[CV(), CV()] + 0.2 * sales['Finding Fido', 2000]
The first CV() reference corresponds to CV(product) and the latter corresponds
to CV(year). The CV function can be used only in right side cell references.
Another example of the usage of CV function is the following:
sales[product IN ('Finding Fido','Standard Mouse Pad','Bounce'), year
BETWEEN 2002 AND 2004] = 2 * sales[CV(product), CV(year)-10]
This rule says that, for products Finding Fido, Standard Mouse Pad, and Bounce,
the sales for years between 2002 and 2004 will be twice of what their sales were 10
years ago.
Here, the nested cell reference best_year['Bounce', 2003] provides value for
the dimension key year and is used in the symbolic reference for year. Measures
best_year and worst_year give, for each year (y) and product (p) combination,
the year for which sales of product p were highest or lowest. The following rule
computes the sales of Standard Mouse Pad for 2003 to be the average of Standard
Mouse Pad sales for the years in which Finding Fido sales were highest and lowest:
sales['Standard Mouse Pad', 2003] = (sales[CV(), best_year['Finding Fido',
CV(year)]] + sales[CV(), worst_year['Finding Fido', CV(year)]]) / 2
Oracle allows only one level of nesting, and only single cell references can be used
as nested cell references. Aggregates on multi-cell references cannot be used in
nested cell references.
Alternatively, the option AUTOMATIC ORDER enables Oracle to determine the order
of evaluation of rules automatically. Oracle examines the cell references within rules
and constructs a dependency graph based on dependencies among rules. If cells
referenced on the left side of rule R1 are referenced on the right side of another rule
R2, then R2 is considered to depend on R1. In other words, rule R1 should be
evaluated before rule R2. If you specify AUTOMATIC ORDER in the preceding
example as in:
RULES AUTOMATIC ORDER
(sales['Bounce', 2001] = sales['Bounce', 2000] + sales['Bounce', 1999],
sales['Bounce', 2000] = 50000,
sales['Bounce', 1999] = 40000)
Rules 2 and 3 are evaluated, in some arbitrary order, before rule 1. This is because
rule 1 depends on rules 2 and 3 and hence needs to be evaluated after rules 2 and 3.
The order of evaluation among second and third rules can be arbitrary as they do
not depend on one another. The order of evaluation among rules independent of
one another can be arbitrary. A dependency graph is analyzed to come up with the
rule evaluation order. SQL models with automatic order of evaluation, as in the
preceding fragment, are called automatic order models.
In an automatic order model, multiple assignments to the same cell are not allowed.
In other words, measure of a cell can be assigned only once. Oracle will return an
error in such cases as results would be non-deterministic. For example, the
following rule specification will generate an error as sales['Bounce', 2001] is
assigned more than once:
RULES AUTOMATIC ORDER
(sales['Bounce', 2001] = sales['Bounce', 2000] + sales['Bounce', 1999],
sales['Bounce', 2001] = 50000,
sales['Bounce', 2001] = 40000)
The rules assigning the sales of product Bounce for 2001 do not depend on one
another and hence, no particular evaluation order can be fixed among them. This
leads to non-deterministic results as the evaluation order is arbitrary -
sales['Bounce', 2001] can be 40000 or 50000 or sum of Bounce sales for years
1999 and 2000. Oracle prevents this by disallowing multiple assignments when
AUTOMATIC ORDER is specified. However, multiple assignments are fine in
sequential order models. If SEQUENTIAL ORDER was specified instead of
AUTOMATIC ORDER in the preceding example, the result of sales['Bounce',
2001] would be 40000.
The cell for product Bounce and year 2003, if it exists, gets updated with the sum of
Bounce sales for years 2001 and 2002, otherwise, it gets created. An optional
UPSERT keyword can be specified in the MODEL clause to make this upsert semantic
explicit.
Alternatively, the UPDATE option forces strict update mode. In this mode, the rule is
ignored if the cell it references on the left side does not exist.
You can specify an UPDATE or UPSERT option at the global level in the RULES
clause in which case all rules operate in the respective mode. These options can be
specified at a local level with each rule and in which case, they override the global
behavior. For example, in the following specification:
RULES UPDATE
(UPDATE s['Bounce',2001] = sales['Bounce',2000] + sales['Bounce',1999],
UPSERT s['Y Box', 2001] = sales['Y Box', 2000] + sales['Y Box', 1999],
sales['Mouse Pad', 2001] = sales['Mouse Pad', 2000] +
sales['Mouse Pad',1999])
The UPDATE option is specified at the global level, so the first and third rules operate in
update mode. The second rule operates in upsert mode as an UPSERT keyword is
specified with that rule. Note that no option was specified for the third rule and
hence it inherits the update behavior from the global option.
Using UPSERT would create a new cell corresponding to the one referenced on the
left side of the rule when the cell is missing, and the cell reference contains only
positional references qualified by constants.
Assuming we do not have cells for years greater than 2003, consider the following
rule:
UPSERT sales['Bounce', year = 2004] = 1.1 * sales['Bounce', 2002]
This would not create any new cell because of the symbolic reference year = 2004.
However, consider the following:
UPSERT sales['Bounce', 2004] = 1.1 * sales['Bounce', 2002]
This would create a new cell for product Bounce for year 2004. On a related note,
new cells will not be created if any of the positional references is ANY. This is because
ANY is a predicate that qualifies all dimensional values, including NULL. If there is a
positional reference ANY for a dimension d, then it can be considered as the predicate
(d IS NOT NULL OR d IS NULL).
The MODEL clause provides options to treat NULL values and missing cells in other
useful ways according to business logic, for example, to treat nulls as zero for
arithmetic operations.
By default, NULL cell measure values are treated the same way as nulls are treated
elsewhere in SQL. For example, in the following rule:
sales['Bounce', 2001] = sales['Bounce', 1999] + sales['Bounce', 2000]
The right side expression would evaluate to NULL if Bounce sales for one of the
years 1999 and 2000 is NULL. Similarly, aggregate functions in rules would treat
NULL values in the same way as their regular behavior where NULL values are
ignored during aggregation.
Missing cells are treated as cells with NULL measure values. For example, in the
preceding rule, if the cell for Bounce and 2000 is missing, then it is treated as a NULL
value and the right side expression would evaluate to NULL.
Distinguishing Missing Cells from NULLs
The functions PRESENTV and PRESENTNNV enable you to identify missing cells and
distinguish them from NULL values. These functions take a single cell reference and
two expressions as arguments as in PRESENTV(cell, expr1, expr2).
PRESENTV returns the first expression expr1 if the cell cell is existent in the data
input to the MODEL clause. Otherwise, it returns the second expression expr2. For
example, consider the following:
PRESENTV(sales['Bounce', 2000], 1.1*sales['Bounce', 2000], 100)
If the cell for product Bounce and year 2000 exists, it returns the corresponding sales
multiplied by 1.1, otherwise, it returns 100. Note that if sales for the product Bounce
for year 2000 is NULL, the preceding specification would return NULL.
The PRESENTNNV function not only checks for the presence of a cell but also
whether it is NULL or not. It returns the first expression expr1 if the cell exists and
is not NULL, otherwise, it returns the second expression expr2. For example,
consider the following:
PRESENTNNV(sales['Bounce', 2000], 1.1*sales['Bounce', 2000], 100)
This is equivalent to the following expression:
CASE WHEN sales['Bounce', 2000] IS PRESENT AND sales['Bounce', 2000] IS NOT NULL
THEN
1.1 * sales['Bounce', 2000]
ELSE
100
END
The IS PRESENT predicate, like the PRESENTV and PRESENTNNV functions, checks
for cell existence in the input data, that is, the data as existed before the execution of
the MODEL clause. This enables you to initialize multiple measures of a cell newly
inserted by an UPSERT rule. For example, if you want to initialize sales and profit
values of a cell, if it does not exist in the data, for product Bounce and year 2003 to
1000 and 500 respectively, you can do so by the following:
RULES
(UPSERT sales['Bounce', 2003] =
PRESENTV(sales['Bounce', 2003], sales['Bounce', 2003], 1000),
UPSERT profit['Bounce', 2003] =
PRESENTV(profit['Bounce', 2003], profit['Bounce', 2003], 500))
The PRESENTV functions used in this formulation are evaluated against the data as it
existed before rule evaluation. If the cell for Bounce and 2003 gets inserted by one of
the rules, based on their evaluation order, the PRESENTV function in the other rule
still treats that cell as absent. You can consider this
behavior as a preprocessing step to rule evaluation that evaluates and replaces all
PRESENTV and PRESENTNNV functions and IS PRESENT predicate by their
respective values.
In this, the input to the MODEL clause does not have a cell for product Bounce and
year 2002. Because of IGNORE NAV option, sales['Bounce', 2002] value
would default to 0 (as sales is of numeric type) instead of NULL. Thus,
the sales['Bounce', 2003] value would be the same as that of sales['Bounce',
2001].
Reference Models
In addition to the multi-dimensional array on which rules operate, which is called
the main model, one or more read-only multi-dimensional arrays, called reference
models, can be created and referenced in the MODEL clause to act as look-up tables.
Like the main model, a reference model is defined over a query block and has
DIMENSION BY and MEASURES clauses to indicate its dimensions and measures
respectively. A reference model is created by the following subclause:
REFERENCE model_name ON (query) DIMENSION BY (cols) MEASURES (cols)
[reference options]
Like the main model, a multi-dimensional array for the reference model is built
before evaluating the rules. But, unlike the main model, reference models are
read-only in that their cells cannot be updated and no new cells can be inserted after
they are built. Thus, the rules in the main model can access cells of a reference
model, but they cannot update or insert new cells into the reference model.
References to the cells of a reference model can only appear on the right side of
rules. You can view reference models as look-up tables on which the rules of the
main model perform look-ups to obtain cell values. The following is an example
using a currency conversion table as a reference model:
CREATE TABLE dollar_conv_tbl(country VARCHAR2(30), exchange_rate NUMBER);
INSERT INTO dollar_conv_tbl VALUES('Poland', 0.25);
INSERT INTO dollar_conv_tbl VALUES('France', 0.14);
...
Now, to convert the projected sales of Poland and France for 2003 to the US dollar,
you can use the dollar conversion table as a reference model as in the following:
SELECT country, year, sales, dollar_sales
FROM sales_view
GROUP BY country, year
MODEL
REFERENCE conv_ref ON (SELECT country, exchange_rate FROM dollar_conv_tbl)
DIMENSION BY (country) MEASURES (exchange_rate) IGNORE NAV
MAIN conversion
DIMENSION BY (country, year)
MEASURES (SUM(sales) sales, SUM(sales) dollar_sales) IGNORE NAV
RULES
(dollar_sales['France', 2003] = sales[CV(country), 2002] * 1.02 *
conv_ref.exchange_rate['France'],
dollar_sales['Poland', 2003] =
sales['Poland', 2002] * 1.05 * exchange_rate['Poland']);
Then the following query computes the projected sales in dollars for 2003 for all
countries:
SELECT country, year, sales, dollar_sales
FROM sales_view
GROUP BY country, year
MODEL
REFERENCE conv_ref ON
(SELECT country, exchange_rate FROM dollar_conv_tbl)
DIMENSION BY (country c) MEASURES (exchange_rate) IGNORE NAV
REFERENCE growth_ref ON
(SELECT country, year, growth_rate FROM growth_rate_tbl)
DIMENSION BY (country c, year y) MEASURES (growth_rate) IGNORE NAV
MAIN projection
DIMENSION BY (country, year) MEASURES (SUM(sales) sales, 0 dollar_sales)
IGNORE NAV
RULES
(dollar_sales[ANY, 2003] = sales[CV(country), 2002] *
growth_rate[CV(country), CV(year)] *
exchange_rate[CV(country)]);
This query shows the capability of the MODEL clause in dealing with and relating
objects of different dimensionality. Reference model conv_ref has one dimension
while the reference model growth_ref and the main model have two dimensions.
Dimensions in the single cell references on reference models are specified using the
CV function, thus relating the cells in the main model with the reference model.
A view of year-to-sequence-number mappings (year_2_seq in the following query) can
define two lookup tables: integer-to-year i2y, which maps sequence numbers to years,
and year-to-integer y2i, which performs the reverse mapping.
The references y2i.i[year] and y2i.i[year] - 1 return sequence numbers of
the current and previous years respectively and the reference
i2y.y[y2i.i[year]-1] returns the year key value of the previous year. The
following query demonstrates such a usage of reference models:
SELECT country, product, year, sales, prior_period
FROM sales_view
MODEL
REFERENCE y2i ON (SELECT year, i FROM year_2_seq) DIMENSION BY (year y)
MEASURES (i)
REFERENCE i2y ON (SELECT year, i FROM year_2_seq) DIMENSION BY (i)
MEASURES (year y)
MAIN projection2 PARTITION BY (country)
DIMENSION BY (product, year)
MEASURES (sales, CAST(NULL AS NUMBER) prior_period)
(prior_period[ANY, ANY] = sales[CV(product), i2y.y[y2i.i[CV(year)]-1]])
ORDER BY country, product, year;
Nesting of reference model cell references is evident in the preceding example. Cell
reference on the reference model y2i is nested inside the cell reference on i2y
which, in turn, is nested in the cell reference on the main SQL model. There is no
limitation on the levels of nesting you can have on reference model cell references.
However, you can only have two levels of nesting on the main SQL model cell
references.
Finally, the following are restrictions on the specification and usage of reference
models:
■ Reference models cannot have a PARTITION BY clause.
■ The query block on which the reference model is defined cannot be correlated
to an outer query.
■ Reference models must be named and their names should be unique.
■ All references to the cells of a reference model should be single cell references.
FOR Loops
The MODEL clause provides a FOR construct that can be used inside rules to express
computations more compactly. It can be used on both the left and right side of a
rule. For example, consider the following computation, which estimates the sales of
several products for 2004 to be 10% higher than their sales for 2003:
RULES UPSERT
(sales['Bounce', 2004] = 1.1 * sales['Bounce', 2003],
sales['Standard Mouse Pad', 2004] = 1.1 * sales['Standard Mouse Pad', 2003],
...
sales['Y Box', 2004] = 1.1 * sales['Y Box', 2003])
The UPSERT option is used in this computation so that cells for these products and
2004 will be inserted if they are not previously present in the multi-dimensional
array. This is rather bulky as you have to have as many rules as there are products.
Using the FOR construct, this computation can be represented compactly and with
exactly the same semantics as in:
RULES UPSERT
(sales[FOR product IN ('Bounce', 'Standard Mouse Pad', ..., 'Y Box'), 2004] =
1.1 * sales[CV(product), 2003])
If you write a specification similar to this, but without the FOR keyword as in the
following:
RULES UPSERT
 (sales[product IN ('Bounce', 'Standard Mouse Pad', ..., 'Y Box'), 2004] =
   1.1 * sales[CV(product), 2003])
You would get UPDATE semantics even though you have specified UPSERT. In other
words, existing cells will be updated but no new cells will be created by this
specification. This is because of the symbolic multi-cell reference on product that is
treated as a predicate. You can view a FOR construct as a macro that generates
multiple rules with positional references from a single rule, thus preserving the
UPSERT semantics. Conceptually, the following rule:
sales[FOR product IN ('Bounce', 'Standard Mouse Pad', ..., 'Y Box'),
FOR year IN (2004, 2005)] = 1.1 * sales[CV(product),
CV(year)-1]
is treated as if it were expanded into a separate rule, with positional references, for
each combination of product and year values in the lists.
The FOR construct in the preceding examples is of type FOR dimension IN (list of
values). Values in the list should either be constants or expressions involving
constants. In this example, there are separate FOR constructs on product and year. It
is also possible to specify all dimensions using one FOR construct. Consider, for
example, a case where we want to estimate sales only for Bounce in 2004, Standard
Mouse Pad in 2005, and Y Box in 2004 and 2005. This can be formulated as the following:
sales[FOR (product, year) IN (('Bounce', 2004), ('Standard Mouse Pad', 2005),
('Y Box', 2004), ('Y Box', 2005))] =
1.1 * sales[CV(product), CV(year)-1]
This FOR construct should be of the form FOR (d1, ..., dn) IN ((d1_val1,
..., dn_val1), ..., (d1_valm, ..., dn_valm)) when there are n
dimensions d1, ..., dn and m values in the list.
In some cases, the list of values for a dimension in FOR can be stored in a table or
they can be the result of a subquery. Oracle Database provides a flavor of FOR
construct as in FOR dimension IN (subquery) to handle these cases. For example,
consider the scenario where you want to introduce a new
country, called new_country, with sales that mimic those of Poland. This is
accomplished by issuing the following statement:
SELECT country, product, year, s
FROM sales_view
MODEL
DIMENSION BY (country, product, year)
MEASURES (sales s) IGNORE NAV
RULES UPSERT
(s[FOR (country, product, year) IN (SELECT DISTINCT 'new_country', product, year
FROM sales_view
WHERE country = 'Poland')] = s['Poland',CV(),CV()])
ORDER BY country, year, product;
A FOR construct of the form FOR dimension FROM value1 TO value2
[INCREMENT | DECREMENT] value3 can be used for dimensions of numeric, date, and
datetime datatypes. The increment/decrement expression value3 should be
numeric for numeric dimensions and can be numeric or interval for dimensions of
date or datetime types. Also, value3 should be positive. Oracle will return an error
if you use FOR year FROM 2005 TO 2001 INCREMENT -1. You should use
either FOR year FROM 2005 TO 2001 DECREMENT 1 or FOR year FROM
2001 TO 2005 INCREMENT 1. Oracle will also report an error if the domain (or
the range) is empty, as in FOR year FROM 2005 TO 2001 INCREMENT 1.
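As a sketch only, assuming a model dimensioned by a single DATE column named month
with a sales measure, a rule of the following form would generate one cell for each
month of 2003, each derived from the same month of the prior year:
sales[FOR month FROM DATE '2003-01-01' TO DATE '2003-12-01'
      INCREMENT INTERVAL '1' MONTH] = sales[ADD_MONTHS(CV(month), -12)] * 1.2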
To generate string values, you can use the FOR construct FOR dimension LIKE
string FROM value1 TO value2 [INCREMENT | DECREMENT] value3.
The string string should contain only one % character. This specification generates
dimension values by replacing the % character in string with values between value1
and value2, using the appropriate
increment/decrement value value3. For example, the following rule:
sales[FOR product LIKE 'product-%' FROM 1 TO 3 INCREMENT 1, 2003] =
sales[CV(product), 2002] * 1.2
is treated as the following:
sales['product-1', 2003] = sales['product-1', 2002] * 1.2,
sales['product-2', 2003] = sales['product-2', 2002] * 1.2,
sales['product-3', 2003] = sales['product-3', 2002] * 1.2
For this kind of FOR construct, value1, value2, and value3 should all be of
numeric type.
Iterative Models
Using the ITERATE option of the MODEL clause, you can evaluate rules iteratively
for a certain number of times, which you can specify as an argument to the
ITERATE clause. ITERATE can be specified only for SEQUENTIAL ORDER models
and such models are referred to as iterative models. For example, consider the
following:
SELECT x, s FROM DUAL
MODEL
DIMENSION BY (1 AS x) MEASURES (1024 AS s)
RULES UPDATE ITERATE (4)
(s[1] = s[1]/2);
In Oracle, the table DUAL has only one row. Hence this model defines a
1-dimensional array, dimensioned by x with a measure s, with a single element
s[1] = 1024. Evaluation of the rule s[1] = s[1]/2 will be repeated four times.
The result of this query will be a single row with values 1 and 64 for columns x and
s respectively. The number of iterations argument for the ITERATE clause should
be a positive integer constant. Optionally, you can specify an early termination
condition to stop rule evaluation before reaching the maximum iteration. This
condition is specified in the UNTIL subclause of ITERATE and is checked at the end
of an iteration. So, you will have at least one iteration when ITERATE is specified.
The syntax of the ITERATE clause is:
ITERATE (number_of_iterations) [ UNTIL (condition) ]
Iterative evaluation will stop either after finishing the specified number of iterations
or when the termination condition evaluates to TRUE, whichever comes first.
In some cases, you may want the termination condition to be based on the change,
across iterations, in value of a cell. Oracle provides a mechanism to specify such
conditions in that it enables you to access cell values as they existed before and after
the current iteration in the UNTIL condition. Oracle's PREVIOUS function takes a
single cell reference as an argument and returns the measure value of the cell as it
existed after the previous iteration. You can also access the current iteration number
by using the system variable ITERATION_NUMBER, which starts at value 0 and is
incremented after each iteration. By using PREVIOUS and ITERATION_NUMBER,
you can construct complex termination conditions.
Consider the following iterative model that specifies iteration over rules till the
change in the value of s[1] across successive iterations falls below 1, up to a
maximum of 1000 times:
SELECT x, s, iterations FROM DUAL
MODEL
DIMENSION BY (1 AS x) MEASURES (1024 AS s, 0 AS iterations)
RULES ITERATE (1000) UNTIL ABS(PREVIOUS(s[1]) - s[1]) < 1
(s[1] = s[1]/2, iterations[1] = ITERATION_NUMBER);
The absolute value function (ABS) can be helpful for termination conditions because
you may not know if the most recent value is positive or negative. Rules in this
model will be iterated over 11 times as after 11th iteration the value of s[1] would
be 0.5. This query results in a single row with values 1, 0.5, 10 for x, s and iterations
respectively.
You can use the PREVIOUS function only in the UNTIL condition. However,
ITERATION_NUMBER can be anywhere in the main model. In the following
example, ITERATION_NUMBER is used in cell references:
SELECT country, product, year, sales
FROM sales_view
MODEL
PARTITION BY (country) DIMENSION BY (product, year) MEASURES (sales sales)
IGNORE NAV
RULES ITERATE(3)
(sales['Bounce', 2002 + ITERATION_NUMBER] = sales['Bounce', 1999
+ ITERATION_NUMBER]);
This statement achieves an array copy of sales of Bounce from cells in the array
1999-2001 to 2002-2004.
Rules in an AUTOMATIC ORDER model can have circular (or cyclical) dependencies. A
cyclic dependency can be of the form "rule A depends on B and rule B depends on A" or
of the self-cyclic "rule depending on itself" form. An example of the former is:
sales['Bounce', 2002] = 1.5 * sales['Y Box', 2002],
sales['Y Box', 2002] = 100000 / sales['Bounce', 2002]
However, there is no self-cycle in the following rule as different measures are being
accessed on the left and right side:
projected_sales['Bounce', 2002] = 25000 / sales['Bounce', 2002]
When the analysis of an AUTOMATIC ORDER model finds that the rules have no
circular dependencies (that is, the dependency graph is acyclic), Oracle Database
will evaluate the rules in their dependency order. For example, in the following
AUTOMATIC ORDER model:
MODEL DIMENSION BY (prod, year) MEASURES (sale sales) IGNORE NAV
RULES AUTOMATIC ORDER
(sales['SUV', 2001] = 10000,
sales['Standard Mouse Pad', 2001] = sales['Finding Fido', 2001]
* 0.10 + sales['Boat', 2001] * 0.50,
sales['Boat', 2001] = sales['Finding Fido', 2001]
* 0.25 + sales['SUV', 2001]* 0.75,
sales['Finding Fido', 2001] = 20000)
Rule 2 depends on rules 3 and 4, while rule 3 depends on rules 1 and 4, and rules 1
and 4 do not depend on any rule. Oracle, in this case, will find that the rule
dependencies are acyclic and will evaluate rules in one of the possible evaluation
orders (1, 4, 3, 2) or (4, 1, 3, 2). This type of rule evaluation is called an ACYCLIC
algorithm.
In some cases, Oracle Database may not be able to ascertain that your model is
acyclic even though there is no cyclical dependency among the rules. This can
happen if you have complex expressions in your cell references. Oracle Database
assumes that the rules are cyclic and employs a CYCLIC algorithm that evaluates
the model iteratively based on the rules and data. Iteration will stop as soon as
convergence is reached and the results will be returned. Convergence is defined as
the state in which further executions of the model will not change the values of any of
the cells in the model. Convergence is certain to be reached when there are no
cyclical dependencies.
If your AUTOMATIC ORDER model has rules with cyclical dependencies, Oracle will
employ the earlier mentioned CYCLIC algorithm. Results are produced if
convergence can be reached within the number of iterations Oracle uses to run the
algorithm. Otherwise, Oracle will report a cycle detection error. You can
circumvent this problem by manually ordering the rules and specifying
SEQUENTIAL ORDER.
Ordered Rules
An ordered rule is one that has ORDER BY specified on the left side. It accesses cells
in the order prescribed by ORDER BY and applies the right side computation. When
you have a positional ANY or a symbolic reference on the left side of a rule but
without the ORDER BY clause, Oracle might return an error saying that the rule's
results depend on the order in which cells are accessed and hence are
non-deterministic. Consider the following SEQUENTIAL ORDER model:
SELECT t, s
FROM sales, times
WHERE sales.time_id = times.time_id
GROUP BY calendar_year
MODEL
DIMENSION BY (calendar_year t) MEASURES (SUM(amount_sold) s)
RULES SEQUENTIAL ORDER
(s[ANY] = s[CV(t)-1]);
This query attempts to set, for each year t, the sales value s to the sales value of the
prior year. Unfortunately, the result of this rule depends on the order in
which the cells are accessed. If cells are accessed in the ascending order of year, the
result would be that of column 3 in Table 22–1. If they are accessed in descending
order, the result would be that of column 4.
If you want the cells to be considered in descending order and get the result given
in column 4, you should specify:
SELECT t, s
FROM sales, times
WHERE sales.time_id = times.time_id
GROUP BY calendar_year
MODEL
DIMENSION BY (calendar_year t) MEASURES (SUM(amount_sold) s)
RULES SEQUENTIAL ORDER
(s[ANY] ORDER BY t DESC = s[CV(t)-1]);
In general, you can use any ORDER BY specification as long as it produces a unique
order among cells that qualify the left side cell reference. Expressions in the ORDER
BY of a rule can involve constants, measures and dimension keys and you can
specify the ordering options [ASC | DESC] [NULLS FIRST | NULLS LAST] to
get the order you want.
You can also specify ORDER BY for rules in an AUTOMATIC ORDER model to make
Oracle consider cells in a particular order during rule evaluation. Rules are never
considered self-cyclic if they have ORDER BY. For example, to make the following
AUTOMATIC ORDER model with a self-cyclic formula:
MODEL
DIMENSION BY (calendar_year t) MEASURES (SUM(amount_sold) s)
RULES AUTOMATIC ORDER
(s[ANY] = s[CV(t)-1])
acyclic, you need to provide the order in which cells need to be accessed for
evaluation using ORDER BY. For example, you can say:
s[ANY] ORDER BY t = s[CV(t) - 1]
Then Oracle will pick an ACYCLIC algorithm (which is certain to produce the
result) for formula evaluation.
A query dimensioned by only country and product would return a uniqueness violation
error, because the rowset input to the model is not unique on country and product:
ERROR at line 2:
ORA-32638: Non unique addressing in MODEL dimensions
Input to the MODEL clause in this case is unique on country, product, and year
as shown in the following output:
COUNTRY PRODUCT YEAR SALES
------- ----------------------------- ---- --------
Italy 1.44MB External 3.5" Diskette 1998 3141.84
Italy 1.44MB External 3.5" Diskette 1999 3086.87
Italy 1.44MB External 3.5" Diskette 2000 3440.37
Italy 1.44MB External 3.5" Diskette 2001 855.23
...
If you want to relax this uniqueness checking, you can specify the UNIQUE SINGLE
REFERENCE keyword. This can save processing time. In this case, the MODEL clause
checks the uniqueness of only the single cell references appearing on the right side
of rules. So the query that returned the uniqueness violation error would be
successful if you specify UNIQUE SINGLE REFERENCE instead of UNIQUE
DIMENSION.
Another difference between UNIQUE DIMENSION and UNIQUE SINGLE REFERENCE
semantics is the number of cells that can be updated by a rule with a single cell
reference on left side. In the case of UNIQUE DIMENSION, such a rule can update at
most one row as only one cell would qualify the single cell reference on the left side.
This is because the input rowset would be unique on PARTITION BY and
DIMENSION BY keys. With UNIQUE SINGLE REFERENCE, all cells that qualify the
left side single cell reference would be updated by the rule.
■ When there is a multi-cell reference on the right hand side of a rule, you need to
apply a function to aggregate the measure values of multiple cells referenced
into a single value. You can use any kind of aggregate function for this purpose:
regular, OLAP aggregate (inverse percentile, hypothetical rank and
distribution), or user-defined aggregate.
■ You cannot use analytic functions (functions that use the OVER clause) in rules.
■ Only rules with positional single cell references on the left side have UPSERT
semantics. All other rules have UPDATE semantics, even when you specify the
UPSERT option for them.
■ Negative increments are not allowed in FOR loops. Also, no empty FOR loops
are allowed. FOR d FROM 2005 TO 2001 INCREMENT -1 is illegal. You
should use FOR d FROM 2005 TO 2001 DECREMENT 1 instead. FOR d
FROM 2005 TO 2001 INCREMENT 1 is illegal as it designates an empty loop.
■ You cannot use nested query expressions (subqueries) in rules except in the FOR
construct. For example, it would be illegal to issue the following:
SELECT *
FROM sales_view WHERE country = 'Poland'
MODEL DIMENSION BY (product, year)
MEASURES (sales sales)
RULES UPSERT
(sales['Bounce', 2003] = sales['Bounce', 2002] +
(SELECT SUM(sales) FROM sales_view));
This is because the rule has a subquery on its right side. Instead, you can
rewrite the preceding query in the following legal way:
SELECT *
FROM sales_view WHERE country = 'Poland'
MODEL DIMENSION BY (product, year)
MEASURES (sales sales, (SELECT SUM(sales) FROM sales_view) AS grand_total)
RULES UPSERT
(sales['Bounce', 2003] =sales['Bounce', 2002] +
grand_total['Bounce', 2002]);
■ You can also use subqueries in the FOR construct specified on the left side of a
rule. However, they:
■ Cannot be correlated
■ Must return fewer than 10000 rows
■ Cannot be a query defined in the WITH clause
■ Will make the cursor unsharable
■ Nested cell references must be single cell references. Aggregates on nested cell
references are not supported. So, it would be illegal to say s['Bounce',
MAX(best_year)['Bounce', ANY]].
■ Only one level of nesting is supported for nested cell references on the main
model. So, for example, s['Bounce', best_year['Bounce', 2001]] is
legal, but s['Bounce', best_year['Bounce', best_year['Bounce',
2001]]] is not.
■ Nested cell references appearing on the left side of rules in an AUTOMATIC
ORDER model should not be updated in any rule of the model. This restriction
ensures that the rule dependency relationships do not arbitrarily change (and
hence cause non-deterministic results) due to updates to reference measures.
There is no such restriction on nested cell references in a SEQUENTIAL ORDER
model. Also, this restriction is not applicable on nested references appearing on
the right side of rules in both SEQUENTIAL or AUTOMATIC ORDER models.
■ Reference models have the following restrictions:
■ The query defining the reference model cannot be correlated to any outer
query. It can, however, be a query with subqueries, views and so on.
■ Reference models cannot have a PARTITION BY clause.
■ Reference models cannot be updated.
Parallel Execution
MODEL clause computation is scalable in terms of the number of processors you
have. Scalability is achieved by performing the MODEL computation in parallel
across the partitions defined by the PARTITION BY clause. Data is distributed
among processing elements (also called parallel query slaves) based on the
PARTITION BY key values such that all rows with the same values for the
PARTITION BY keys will go to the same slave. Note that the internal processing of
partitions will not create a one-to-one match of logical and internally processed
partitions. This way, each slave can finish MODEL clause computation independent
of other slaves. The data partitioning can be hash based or range based. Consider
the following MODEL clause:
MODEL
PARTITION BY (country) DIMENSION BY (product, time) MEASURES (sales)
RULES UPDATE
(sales['Bounce', 2002] = 1.2 * sales['Bounce', 2001],
sales['Car', 2002] = 0.8 * sales['Car', 2001])
Here input data will be partitioned among slaves based on the PARTITION BY key
country and this partitioning can be hash or range based. Each slave will evaluate
the rules on the data it receives.
Parallelism of the model computation is governed or limited by the way you specify
the MODEL clause. If your MODEL clause has no PARTITION BY keys, then the
computation cannot be parallelized (with exceptions mentioned in the following). If
PARTITION BY keys have very low cardinality, then the degree of parallelism will
be limited. In such cases, Oracle identifies the DIMENSION BY keys that can be used
for partitioning. For example, consider a MODEL clause equivalent to the preceding
one, but without PARTITION BY keys as in the following:
MODEL
DIMENSION BY (country, product, time) MEASURES (sales)
RULES UPDATE
(sales[ANY, 'Bounce', 2002] = 1.2 * sales[CV(country), 'Bounce', 2001],
sales[ANY, 'Car', 2002] = 0.8 * sales[CV(country), 'Car', 2001])
In this case, Oracle Database will identify that it can use the DIMENSION BY key
country as the basis of internal partitioning. It partitions the data among slaves on
country and thus effects parallel execution.
Aggregate Computation
The MODEL clause processes aggregates in two different ways: first, the regular
fashion in which data in the partition is scanned and aggregated and second, an
efficient window style aggregation. The first type as illustrated in the following
introduces a new dimension member ALL_2002_products and computes its value
to be the sum of year 2002 sales for all products:
MODEL PARTITION BY (country) DIMENSION BY (product, time) MEASURES (sale sales)
RULES UPSERT
(sales['ALL_2002_products', 2002] = SUM(sales)[ANY, 2002])
To evaluate the aggregate sum in this case, each partition will be scanned to find the
cells for 2002 for all products and they will be aggregated. If the left side of the rule
were to reference multiple cells, then Oracle will have to compute the right side
aggregate by scanning the partition for each cell referenced on the left. For example,
consider the following example:
MODEL PARTITION BY (country) DIMENSION BY (product, time)
MEASURES (sale sales, 0 avg_exclusive)
RULES UPDATE
(avg_exclusive[ANY, 2002] = AVG(sales)[product <> CV(product), CV(time)])
This rule calculates a measure called avg_exclusive for every product in 2002.
The measure avg_exclusive is defined as the average sales of all products
excluding the current product. In this case, Oracle scans the data in a partition for
every product in 2002 to calculate the aggregate, and this may be expensive.
Oracle Database will optimize the evaluation of such aggregates in some scenarios
with window-style computation as used in analytic functions. These scenarios
involve rules with multi-cell references on their left side and computing window
computations such as moving averages, cumulative sums and so on. Consider the
following example:
MODEL PARTITION BY (country) DIMENSION BY (product, time)
MEASURES (sale sales, 0 mavg)
RULES UPDATE
(mavg[product IN ('Bounce', 'Y Box', 'Mouse Pad'), ANY] =
AVG(sales)[CV(product), time BETWEEN CV(time) - 2
AND CV(time)])
It computes the moving average of sales for products Bounce, Y Box, and Mouse
Pad over a three year period. It would be very inefficient to evaluate the aggregate
by scanning the partition for every cell referenced on the left side. Oracle identifies
the computation as being in window-style and evaluates it efficiently. It sorts the
input on product, time and then scans the data once to compute the moving
average. You can view this rule as an analytic function being applied on the sales
data for products Bounce, Y Box, and Mouse Pad:
AVG(sales) OVER (PARTITION BY product ORDER BY time
RANGE BETWEEN 2 PRECEDING AND CURRENT ROW)
This computation style is called WINDOW (IN MODEL) SORT. This style of
aggregation is applicable when the rule has a multi-cell reference on its left side
with no ORDER BY, has a simple aggregate (SUM, COUNT, MIN, MAX, STDDEV, and
VARIANCE) on its right side, only one dimension on the right side has a boolean predicate
(<, <=, >, >=, BETWEEN), and all other dimensions on the right are qualified with CV.
The following query estimates 2003 sales for Italy and Japan: the 2003 sales of Bounce
are set to 24% above the average 2002 sales of all products, and the 2003 sales of every
other product are set to 25% of Bounce's 2003 sales:
SELECT country, prod, year, sales
FROM sales_view
WHERE country IN ('Italy', 'Japan')
MODEL UNIQUE DIMENSION
PARTITION BY (country) DIMENSION BY (prod, year) MEASURES (sale sales)
RULES UPSERT
(sales['Bounce', 2003] = AVG(sales)[ANY, 2002] * 1.24,
sales[prod <> 'Bounce', 2003] = sales['Bounce', 2003] * 0.25);
Similarly, a rule in a model dimensioned by country can compute the difference
between Italian and Spanish sales as a new row:
RULES UPSERT
(sales['DIFF ITALY-SPAIN'] = sales['Italy'] - sales['Spain']);
To calculate the NPV using a discount rate of 0.14, issue the following statement:
SELECT year, i, prod, amount, npv
FROM cash_flow
MODEL PARTITION BY (prod)
DIMENSION BY (i)
MEASURES (amount, 0 npv, year)
RULES
(npv[0] = amount[0],
npv[i !=0] ORDER BY i =
amount[CV()]/ POWER(1.14,CV(i)) + npv[CV(i)-1]);
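As a sketch only, the cash_flow table referenced by this statement might be shaped as
follows; the column names come from the query itself, while the datatypes are
assumptions:
CREATE TABLE cash_flow (
  year   NUMBER(4),     -- year of the cash flow (datatype assumed)
  i      NUMBER,        -- period index; 0 is the initial investment
  prod   VARCHAR2(30),  -- product identifier (length assumed)
  amount NUMBER         -- cash amount; negative values are outflows
);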
The mortgage calculation example that follows uses two tables: mortgage_facts, which
holds the loan amount, annual interest rate, and number of payments for each customer,
and mortgage:
■ mortgage
Holds output information for the calculations. The columns are customer,
payment number (pmt_num), principal applied in that payment (principalp),
interest applied in that payment (interestp), and remaining loan balance
(mort_balance). In order to upsert new cells into a partition, you need to
have at least one row pre-existing per partition. Therefore, we seed the
mortgage table with the values for the two customers before they have made
any payments. This seed information could be easily generated using a SQL
INSERT statement based on the mortgage_facts table.
CREATE TABLE mortgage_facts (customer VARCHAR2(20), fact VARCHAR2(20),
amount NUMBER(10,2));
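For example, assuming the mortgage table has the columns described above and that each
customer's loan amount is stored in mortgage_facts under the fact value 'Loan', the
seed rows could be generated with an INSERT statement along these lines:
INSERT INTO mortgage (customer, pmt_num, principalp, interestp, mort_balance)
  SELECT customer, 0, 0, 0, amount   -- payment 0: nothing paid yet, full balance owed
  FROM mortgage_facts
  WHERE fact = 'Loan';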
The following SQL statement is complex, so individual lines have been annotated as
needed. These lines are explained in more detail later.
SELECT c, p, m, pp, ip
FROM MORTGAGE
MODEL --See 1
REFERENCE R ON
(SELECT customer, fact, amt --See 2
FROM mortgage_facts
MODEL DIMENSION BY (customer, fact) MEASURES (amount amt) --See 3
RULES
(amt[any, 'PaymentAmt']= (amt[CV(),'Loan']*
Power(1+ (amt[CV(),'Annual_Interest']/100/12),
amt[CV(),'Payments']) *
(amt[CV(),'Annual_Interest']/100/12)) /
(Power(1+(amt[CV(),'Annual_Interest']/100/12),
amt[CV(),'Payments']) - 1)
)
)
DIMENSION BY (customer cust, fact) measures (amt) --See 4
MAIN amortization
PARTITION BY (customer c) --See 5
DIMENSION BY (0 p) --See 6
MEASURES (principalp pp, interestp ip, mort_balance m, customer mc) --See 7
RULES
ITERATE(1000) UNTIL (ITERATION_NUMBER+1 =
r.amt[mc[0],'Payments']) --See 8
(ip[ITERATION_NUMBER+1] = m[CV()-1] *
r.amt[mc[0], 'Annual_Interest']/1200, --See 9
pp[ITERATION_NUMBER+1] = r.amt[mc[0], 'PaymentAmt'] - ip[CV()], --See 10
m[ITERATION_NUMBER+1] = m[CV()-1] - pp[CV()] --See 11
)
ORDER BY c, p;
1: This line marks the beginning of the MODEL clause of the main query.
2 through 4: These lines mark the start and end of the reference model labeled R.
This model defines a SELECT statement that calculates the monthly
payment amount for each customer's loan. The SELECT statement uses its own
MODEL clause starting at the line labeled 3 with a single rule that defines the amt
value based on information from the mortgage_facts table. The measure
returned by reference model R is amt, dimensioned by customer name cust and
fact value fact as defined in the line labeled 4.
The reference model is computed once and the values are then used in the main
model for computing other calculations. Reference model R will return a row for
each existing row of mortgage_fact, and it will return the newly calculated rows
for each customer where the fact type is Payment and the amt is the monthly
payment amount. If we wish to use a specific amount from the R output, we
address it with the expression r.amt[<customer_name>,<fact_name>].
5: This is the continuation of the main model definition. We will partition the output
by customer, aliased as c.
6: The main model is dimensioned with a constant value of 0, aliased as p. This
represents the payment number of a row.
7: Four measures are defined: principalp (pp) is the principal amount applied
to the loan in the month, interestp (ip) is the interest paid that month, mort_
balance (m) is the remaining mortgage value after the payment of the loan, and
customer (mc) is used to support the partitioning.
8: This begins the rules block. It will perform the rule calculations up to 1000 times.
Because the calculations are performed once for each month for each customer, the
maximum number of months that can be specified for a loan is 1000. Iteration is
stopped when the ITERATION_NUMBER+1 equals the number of payments derived
from reference R. Note that the value from reference R is the amt (amount) measure
defined in the reference clause. This reference value is addressed as
r.amt[<customer_name>,<fact>]. The expression used in the iterate line,
"r.amt[mc[0], 'Payments']" is resolved to be the amount from reference R,
where the customer name is the value resolved by mc[0]. Since each partition
contains only one customer, mc[0] can have only one value. Thus
"r.amt[mc[0], 'Payments']" yields the reference clause's value for the
number of payments for the current customer. This means that the rules will be
performed as many times as there are payments for that customer.
9 through 11: The first two rules in this block use the same type of r.amt reference
that was explained in 8. The difference is that the ip rule defines the fact value as
Annual_Interest. Note that each rule refers to the value of one of the other
measures. The expression used on the left side of each rule, "[ITERATION_
NUMBER+1]" will create a new dimension value, so the measure will be upserted
into the result set. Thus the result will include a monthly amortization row for all
payments for each customer.
The final line of the example sorts the results by customer and loan payment
number.
In large data warehouse environments, many different types of analysis can occur.
In addition to SQL queries, you may also apply more advanced analytical
operations to your data. Two major types of such analysis are OLAP (On-Line
Analytic Processing) and data mining. Rather than having a separate OLAP or data
mining engine, Oracle has integrated OLAP and data mining capabilities directly
into the database server. Oracle OLAP and Oracle Data Mining (ODM) are options
to the Oracle Database. This chapter provides a brief introduction to these
technologies, and more detail can be found in these products' respective
documentation.
The following topics provide an introduction to Oracle's OLAP and data mining
capabilities:
■ OLAP Overview
■ Oracle Data Mining Overview
OLAP Overview
Oracle Database OLAP adds the query performance and calculation capability
previously found only in multidimensional databases to Oracle's relational
platform. In addition, it provides a Java OLAP API that is appropriate for the
development of internet-ready analytical applications. Unlike other combinations of
OLAP and RDBMS technology, Oracle Database OLAP is not a multidimensional
database using bridges to move data from the relational data store to a
multidimensional data store. Instead, it is truly an OLAP-enabled relational
database. As a result, this release provides the benefits of a multidimensional
database along with the scalability, accessibility, security, manageability, and high
availability of the Oracle Database. The Java OLAP API, which is specifically
designed for internet-based analytical applications, offers productive data access.
See Oracle OLAP Application Developer's Guide for further information regarding
OLAP.
Scalability
Oracle Database OLAP is highly scalable. In today's environment, there is
tremendous growth along three dimensions of analytic applications: number of
users, size of data, complexity of analyses. There are more users of analytical
applications, and they need access to more data to perform more sophisticated
analysis and target marketing. For example, a telephone company might want a
customer dimension to include detail such as all telephone numbers as part of an
application that is used to analyze customer turnover. This would require support
for multi-million row dimension tables and very large volumes of fact data. Oracle
Database can handle very large data sets using parallel execution and partitioning,
as well as offering support for advanced hardware and clustering.
Availability
Oracle Database includes many features that support high availability. One of the
most significant is partitioning, which allows management of precise subsets of
tables and indexes, so that management operations affect only small pieces of these
data structures. By partitioning tables and indexes, data management processing
time is reduced, thus minimizing the time data is unavailable. Another feature
supporting high availability is transportable tablespaces. With transportable
tablespaces, large data sets, including tables and indexes, can be added with almost
no processing to other databases. This enables extremely rapid data loading and
updates.
Manageability
Oracle enables you to precisely control resource utilization. The Database Resource
Manager, for example, provides a mechanism for allocating the resources of a data
warehouse among different sets of end-users. Consider an environment where the
marketing department and the sales department share an OLAP system. Using the
Database Resource Manager, you could specify that the marketing department
receive at least 60 percent of the CPU resources of the machines, while the sales
department receive 40 percent of the CPU resources. You can also further specify
limits on the total number of active sessions, and the degree of parallelism of
individual queries for each department.
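As a minimal sketch (the plan name, consumer group names, and percentages are
illustrative, and sessions must still be mapped to the consumer groups), such an
allocation could be defined with the DBMS_RESOURCE_MANAGER package:
BEGIN
  DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();
  DBMS_RESOURCE_MANAGER.CREATE_PLAN(
    plan => 'DW_PLAN', comment => 'CPU split between marketing and sales');
  DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP(
    consumer_group => 'MARKETING', comment => 'Marketing department');
  DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP(
    consumer_group => 'SALES', comment => 'Sales department');
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    plan => 'DW_PLAN', group_or_subplan => 'MARKETING',
    comment => '60% of CPU at level 1', cpu_p1 => 60);
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    plan => 'DW_PLAN', group_or_subplan => 'SALES',
    comment => '40% of CPU at level 1', cpu_p1 => 40);
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    plan => 'DW_PLAN', group_or_subplan => 'OTHER_GROUPS',
    comment => 'required catch-all directive', cpu_p2 => 100);
  DBMS_RESOURCE_MANAGER.VALIDATE_PENDING_AREA();
  DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
END;
/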
Another resource management facility is the progress monitor, which gives end
users and administrators the status of long-running operations. Oracle Database 10g
maintains statistics describing the percent-complete of these operations. Oracle
Enterprise Manager enables you to view a bar-graph display of these operations
showing what percent complete they are. Moreover, any other tool or any database
administrator can also retrieve progress information directly from the Oracle data
server, using system views.
Security
Just as the demands of real-world transaction processing required Oracle to develop
robust features for scalability, manageability and backup and recovery, they led
Oracle to create industry-leading security features. The security features in Oracle
have reached the highest levels of U.S. government certification for database
trustworthiness. Oracle's fine grained access control feature enables cell-level
security for OLAP users. Fine grained access control works with minimal burden on
query processing, and it enables efficient centralized security management.
■ Several algorithms:
■ Classification: Naive Bayes, Adaptive Bayes Network, Support Vector
Machine
■ Regression: Support Vector Machine
■ Clustering: k-Means, O-Cluster
■ Association: Apriori
■ Attribute Importance: Predictor Variance
■ Feature Extraction: Non-Negative Matrix Factorization
■ Real-time and batch scoring modes
Oracle Data Mining also supports sequence similarity search and annotation
(BLAST) in the database.
Data Preparation
Data preparation usually requires the creation of new tables or views based on
existing data. Both options perform faster than moving data to an external data
mining utility and offer the programmer the option of snapshots or real-time
updates.
Oracle Data Mining provides utilities for complex, data mining-specific tasks. For
example, for some types of models, binning improves model build time and model
performance; therefore, ODM provides a utility for user-defined binning.
ODM accepts data in either non-transactional (single-record case) format or
transactional (multi-record case) format. ODM provides a pivoting utility for
converting multiple non-transactional tables into a single transactional table.
ODM data exploration and model evaluation features are extended by Oracle's
statistical functions and OLAP capabilities. Because these also operate within the
database, they can all be incorporated into a seamless application that shares
database objects. This allows for more functional and faster applications.
Model Building
Oracle Data Mining supports all the major data mining functions: classification,
regression, association rules, clustering, attribute importance, and feature
extraction.
These algorithms address a broad spectrum of business problems, ranging from
predicting the future likelihood of a customer purchasing a given product, to
understanding which products are likely to be purchased together in a single trip to the
grocery store. Since all model building takes place inside the database, the data
never needs to move outside the database, and therefore the entire data-mining
process is accelerated.
Model Evaluation
Models are stored in the database and are directly accessible for evaluation,
reporting, and further analysis by a wide variety of tools and application functions.
ODM provides APIs for calculating confusion matrix and lift charts. ODM stores
the models, the underlying data, and the results of model evaluation together in the
database to enable further analysis, reporting, and application-specific model
management.
When using SQL*Loader for a parallel load, you can specify the load options on the
command line or in a parameter file. An important point to remember is that indexes
are not maintained during a parallel load.
When a SQL statement is executed in parallel, the process that coordinates the work is
known as the parallel execution coordinator or query coordinator. The query
coordinator does the following:
■ Parses the query and determines the degree of parallelism
■ Allocates one or two sets of slaves (threads or processes)
■ Controls the query and sends instructions to the PQ slaves
■ Determines which tables or indexes need to be scanned by the PQ slaves
■ Produces the final output to the user
Degree of Parallelism
The parallel execution coordinator may enlist two or more of the instance's parallel
execution servers to process a SQL statement. The number of parallel execution
servers associated with a single operation is known as the degree of parallelism.
A single operation is a part of a SQL statement, such as an ORDER BY operation or a
full table scan to perform a join on a nonindexed column.
Note that the degree of parallelism applies directly only to intra-operation
parallelism. If inter-operation parallelism is possible, the total number of parallel
execution servers for a statement can be twice the specified degree of parallelism.
No more than two sets of parallel execution servers can run simultaneously. Each
set of parallel execution servers may process multiple operations. Only two sets of
parallel execution servers need to be active to guarantee optimal inter-operation
parallelism.
Parallel execution is designed to effectively use multiple CPUs and disks to answer
queries quickly. When multiple users use parallel execution at the same time, it is
easy to quickly exhaust available CPU, memory, and disk resources.
Oracle Database provides several ways to manage resource utilization in
conjunction with parallel execution environments, including:
■ The adaptive multiuser algorithm, which is enabled by default, reduces the
degree of parallelism as the load on the system increases.
■ User resource limits and profiles, which allow you to set limits on the amount
of various system resources available to each user as part of a user's security
domain.
■ The Database Resource Manager, which lets you allocate resources to different
groups of users.
See "Setting the Degree of Parallelism for Parallel Execution" on page 24-32 for more
information.
Each communication channel has at least one, and sometimes up to four memory
buffers. Multiple memory buffers facilitate asynchronous communication among
the parallel execution servers.
A single-instance environment uses at most three buffers for each communication
channel. An Oracle Real Application Clusters environment uses at most four buffers
for each channel. Figure 24–1 illustrates message buffers and how producer parallel
execution servers connect to consumer parallel execution servers.
Figure 24–1: Parallel execution server set 1 connected to parallel execution server
set 2 through message buffers.
When a connection is between two processes on the same instance, the servers
communicate by passing the buffers back and forth. When the connection is
between processes in different instances, the messages are sent using external
high-speed network protocols. In Figure 24–1, the DOP is equal to the number of
parallel execution servers, which in this case is n. Figure 24–1 does not show the
parallel execution coordinator. Each parallel execution server actually has an
additional connection to the parallel execution coordinator.
After the optimizer determines the execution plan of a statement, the parallel
execution coordinator determines the parallelization method for each operation in
the plan. For example, the parallelization method might be to parallelize a full table
scan by block range or parallelize an index range scan by partition. The coordinator
must decide whether an operation can be performed in parallel and, if so, how
many parallel execution servers to enlist. The number of parallel execution servers
in one set is the DOP. See "Setting the Degree of Parallelism for Parallel Execution"
on page 24-32 for more information.
Note that hints have been used in the query to force the join order and join method,
and to specify the DOP of the tables employees and departments. In general,
you should let the optimizer determine the order and method.
Figure 24–2 illustrates the data flow graph or query plan for this query.
Figure 24–2: Query plan in which a GROUP BY SORT operation consumes the output of a
HASH JOIN, with the parallel execution coordinator at the top of the tree.
Assume that two sets of parallel execution servers, SS1 and SS2, are allocated for the
query, and that the query specifies a DOP of four. In other words, the DOP will be four
because each set of parallel execution servers will have four processes.
Slave set SS1 first scans the table employees while SS2 will fetch rows from SS1
and build a hash table on the rows. In other words, the parent servers in SS2 and the
child servers in SS1 work concurrently: one in scanning employees in parallel, the
other in consuming rows sent to it from SS1 and building the hash table for the hash
join in parallel. This is an example of inter-operation parallelism.
After SS1 has finished scanning the entire table employees (that is, all granules or
task units for employees are exhausted), it scans the table departments in
parallel. It sends its rows to servers in SS2, which then perform the probes to finish
the hash-join in parallel. After SS1 is done scanning the table departments in
parallel and sending the rows to SS2, it switches to performing the GROUP BY in
parallel. This is how two server sets run concurrently to achieve inter-operation
parallelism across various operators in the query tree while achieving
intra-operation parallelism in executing each operation in parallel.
Another important aspect of parallel execution is the re-partitioning of rows while
they are sent from servers in one server set to another. For the query plan in
Figure 24–2, after a server process in SS1 scans a row of employees, which server
process of SS2 should it send it to? The partitioning of rows flowing up the query
tree is decided by the operator into which the rows are flowing. In this case, the
partitioning of rows flowing up from SS1 performing the parallel scan of
employees into SS2 performing the parallel hash-join is done by hash partitioning
on the join column value. That is, a server process scanning employees computes a
hash function of the value of the column employees.employee_id to decide the
number of the server process in SS2 to send it to. The partitioning method used in
parallel queries is explicitly shown in the EXPLAIN PLAN of the query. Note that the
partitioning of rows being sent between sets of execution servers should not be
confused with Oracle's partitioning feature whereby tables can be partitioned using
hash, range, and other methods.
Producer and Consumer Operations
Operations that require the output of other operations are known as consumer
operations. In Figure 24–2, the GROUP BY SORT operation is the consumer of the
HASH JOIN operation because GROUP BY SORT requires the HASH JOIN output.
Consumer operations can begin consuming rows as soon as the producer operations
have produced rows. In the previous example, while the parallel execution servers
are producing rows in the FULL SCAN departments operation, another set of
parallel execution servers can begin to perform the HASH JOIN operation to
consume the rows.
Each of the two operations performed concurrently is given its own set of parallel
execution servers. Therefore, both query operations and the data flow tree itself
have parallelism. The parallelism of an individual operation is called intraoperation
parallelism and the parallelism between operations in a data flow tree is called
interoperation parallelism. Due to the producer-consumer nature of the Oracle
server's operations, only two operations in a given tree need to be performed
simultaneously to minimize execution time. To illustrate intraoperation and
interoperation parallelism, consider the following statement:
SELECT * FROM employees ORDER BY last_name;
The execution plan implements a full scan of the employees table. This operation
is followed by a sorting of the retrieved rows, based on the value of the last_name
column. For the sake of this example, assume the last_name column is not
indexed. Also assume that the DOP for the query is set to 4, which means that four
parallel execution servers can be active for any given operation.
Figure 24–3 illustrates the parallel execution of the example query.
Figure 24–3: The user process sends SELECT * FROM employees ORDER BY last_name to the
parallel execution coordinator; four parallel execution servers scan the employees
table while four others sort the returned rows into the last_name ranges A-G, H-M,
N-S, and T-Z.
As you can see from Figure 24–3, there are actually eight parallel execution servers
involved in the query even though the DOP is 4. This is because a parent and child
operator can be performed at the same time (interoperation parallelism).
Also note that all of the parallel execution servers involved in the scan operation
send rows to the appropriate parallel execution server performing the SORT
operation. If a row scanned by a parallel execution server contains a value for the
last_name column between A and G, that row gets sent to the first ORDER BY
parallel execution server. When the scan operation is complete, the sorting
processes can return the sorted results to the coordinator, which, in turn, returns the
complete query results to the user.
Types of Parallelism
The following types of parallelism are discussed in this section:
■ Parallel Query
■ Parallel DDL
■ Parallel DML
■ Parallel Execution of Functions
■ Other Types of Parallelism
Parallel Query
You can parallelize queries and subqueries in SELECT statements. You can also
parallelize the query portions of DDL statements and DML statements (INSERT,
UPDATE, and DELETE). You can also query external tables in parallel.
See Also:
■ "Operations That Can Be Parallelized" on page 24-3 for
information on the query operations that Oracle can parallelize
■ "Parallelizing SQL Statements" on page 24-8 for an explanation
of how the processes perform parallel queries
■ "Distributed Transaction Restrictions" on page 24-27 for
examples of queries that reference a remote object
■ "Rules for Parallelizing Queries" on page 24-37 for information
on the conditions for parallelizing a query and the factors that
determine the DOP
Parallel DDL
This section includes the following topics on parallelism for DDL statements:
■ DDL Statements That Can Be Parallelized
■ CREATE TABLE ... AS SELECT in Parallel
■ Recoverability and Parallel DDL
■ Space Management for Parallel DDL
DDL Statements That Can Be Parallelized
You can parallelize DDL statements for tables and indexes that are nonpartitioned or
partitioned. The parallel DDL statements for nonpartitioned tables and indexes are:
■ CREATE INDEX
■ CREATE TABLE ... AS SELECT
■ ALTER INDEX ... REBUILD
The parallel DDL statements for partitioned tables and indexes are:
■ CREATE INDEX
■ CREATE TABLE ... AS SELECT
■ ALTER TABLE ... [MOVE|SPLIT|COALESCE] PARTITION
■ ALTER INDEX ... [REBUILD|SPLIT] PARTITION
■ ALTER INDEX ... SPLIT PARTITION can be executed in parallel only if the (global)
index partition being split is usable.
All of these DDL operations can be performed in no-logging mode for either
parallel or serial execution.
CREATE TABLE for an index-organized table can be parallelized either with or
without an AS SELECT clause.
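For example, the following sketches show a parallel, no-logging index build and a
parallel CREATE TABLE ... AS SELECT; the object names and degrees are illustrative:
CREATE INDEX sales_time_idx ON sales (time_id)
  NOLOGGING PARALLEL 8;

CREATE TABLE sales_summary
  PARALLEL 8 NOLOGGING
  AS SELECT prod_id, SUM(amount_sold) AS amount_sold
     FROM sales
     GROUP BY prod_id;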
Different parallelism is used for different operations (see Table 24–3 on page 24-44).
Parallel CREATE TABLE ... AS SELECT statements on partitioned tables and parallel
CREATE INDEX statements on partitioned indexes execute with a DOP equal to the
number of partitions.
Partition parallel analyze table is made less necessary by the ANALYZE {TABLE,
INDEX} PARTITION statements, since parallel analyze of an entire partitioned table
can be constructed with multiple user sessions.
Parallel DDL cannot occur on tables with object columns. Parallel DDL cannot occur
on non-partitioned tables with LOB columns.
Free space within the internal table extents of a datafile cannot be coalesced with
other free space and cannot be allocated as extents.
See Oracle Database Performance Tuning Guide for more information about creating
tables and indexes in parallel.
Figure: While executing CREATE TABLE emp AS SELECT ... in parallel, each parallel
execution server inserts into its own extent (EXTENT 1 through EXTENT 3) of datafile
DATA1.ORA in the USERS tablespace; unused free space for INSERTs remains within each
extent.
Parallel DML
Parallel DML (PARALLEL INSERT, UPDATE, DELETE, and MERGE) uses parallel
execution mechanisms to speed up or scale up large DML operations against large
database tables and indexes.
Refreshing Tables in a Data Warehouse System In a data warehouse system, large tables
need to be refreshed (updated) periodically with new or modified data from the
production system. You can do this efficiently by using parallel DML combined
with updatable join views. You can also use the MERGE statement.
The data that needs to be refreshed is generally loaded into a temporary table before
starting the refresh process. This table contains either new rows or rows that have
been updated since the last refresh of the data warehouse. You can use an updatable
join view with parallel UPDATE to refresh the updated rows, and you can use an
anti-hash join with parallel INSERT to refresh the new rows.
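As a sketch of such a refresh, assuming a staging table new_sales whose columns match
those of the sales table, a parallel MERGE might look like the following (parallel DML
must also be enabled in the session, as described later in this section):
MERGE /*+ PARALLEL(s, 4) */ INTO sales s
USING new_sales n
ON (s.prod_id = n.prod_id AND s.cust_id = n.cust_id AND s.time_id = n.time_id)
WHEN MATCHED THEN
  UPDATE SET s.amount_sold = n.amount_sold
WHEN NOT MATCHED THEN
  INSERT (prod_id, cust_id, time_id, amount_sold)
  VALUES (n.prod_id, n.cust_id, n.time_id, n.amount_sold);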
Using Scoring Tables Many DSS applications score customers periodically based on a
set of criteria. The scores are usually stored in large DSS tables. The score
information is then used in making a decision, for example, inclusion in a mailing
list.
This scoring activity queries and updates a large number of rows in the large table.
Parallel DML can speed up the operations against these large tables.
Running Batch Jobs Batch jobs executed in an OLTP database during off hours have a
fixed time window in which the jobs must complete. A good way to ensure timely
job completion is to parallelize their operations. As the work load increases, more
machine resources can be added; the scaleup property of parallel operations ensures
that the time constraint can be met.
The default mode of a session is DISABLE PARALLEL DML. When parallel DML is
disabled, no DML will be executed in parallel even if the PARALLEL hint is used.
When parallel DML is enabled in a session, all DML statements in this session will
be considered for parallel execution. However, even if parallel DML is enabled, the
DML operation may still execute serially if there are no parallel hints or no tables
with a parallel attribute or if restrictions on parallel operations are violated.
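For example, the following statements enable parallel DML for the session and then
issue a hinted update; the table, column, and predicate are illustrative:
ALTER SESSION ENABLE PARALLEL DML;

UPDATE /*+ PARALLEL(sales, 4) */ sales
SET amount_sold = amount_sold * 1.10
WHERE time_id < DATE '1999-01-01';

COMMIT;
After a parallel DML statement, the modified table cannot be read or modified again
within the same transaction until you commit or roll back.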
The session's PARALLEL DML mode does not influence the parallelism of SELECT
statements, DDL statements, and the query portions of DML statements. Thus, if
this mode is not set, the DML operation is not parallelized, but scans or join
operations within the DML statement may still be parallelized.
See Also:
■ "Space Considerations for Parallel DML" on page 24-24
■ "Lock and Enqueue Resources for Parallel DML" on page 24-25
■ "Restrictions on Parallel DML" on page 24-25
Rollback Segments
If you use rollback segments instead of Automatic Undo Management, there are
some restrictions when using parallel DML. See Oracle Database SQL Reference for
information about restrictions for parallel DML and rollback segments.
System Recovery Recovery from a system failure requires a new startup. Recovery is
performed by the SMON process and any recovery server processes spawned by
SMON. Parallel DML statements may be recovered using parallel rollback. If the
initialization parameter COMPATIBLE is set to 8.1.3 or greater, Fast-Start
On-Demand Rollback enables terminated transactions to be recovered on demand,
one block at a time.
■ Parallel DML can be done on tables with LOB columns provided the table is
partitioned. However, intra-partition parallelism is not supported.
■ A transaction involved in a parallel DML operation cannot be or become a
distributed transaction.
■ Clustered tables are not supported.
Violations of these restrictions cause the statement to execute serially without
warnings or error messages (except for the restriction on statements accessing the
same table in a transaction, which can cause error messages). For example, an
update is serialized if it is on a nonpartitioned table.
Partitioning Key Restriction You can only update the partitioning key of a partitioned
table to a new value if the update does not cause the row to move to a new
partition. The update is possible if the table is defined with the row movement
clause enabled.
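For example, to permit such updates on an illustrative partitioned sales table, enable
row movement first:
ALTER TABLE sales ENABLE ROW MOVEMENT;   -- partition key updates may now move rows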
Function Restrictions The function restrictions for parallel DML are the same as those
for parallel DDL and parallel query. See "Parallel Execution of Functions" on
page 24-28 for more information.
NOT NULL and CHECK These types of integrity constraints are allowed. They are not a
problem for parallel DML because they are enforced on the column and row level,
respectively.
UNIQUE and PRIMARY KEY These types of integrity constraints are allowed.
Delete Cascade Delete on tables having a foreign key with delete cascade is not
parallelized because parallel execution servers will try to delete rows from multiple
partitions (parent and child tables).
Deferrable Integrity Constraints If any deferrable constraints apply to the table being
operated on, the DML operation will not be parallelized.
Trigger Restrictions
A DML operation will not be parallelized if the affected tables contain enabled
triggers that may get fired as a result of the statement. This implies that DML
statements on tables that are being replicated will not be parallelized.
Relevant triggers must be disabled in order to parallelize DML on the table. Note
that, if you enable or disable triggers, the dependent shared cursors are invalidated.
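For example, a batch job might disable the triggers on an illustrative table, run its
parallel DML, and then re-enable them (remembering that this invalidates dependent
shared cursors):
ALTER TABLE sales DISABLE ALL TRIGGERS;
-- run the parallel DML statements here
ALTER TABLE sales ENABLE ALL TRIGGERS;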
Like parallel SQL, parallel recovery and propagation are performed by a parallel
execution coordinator and multiple parallel execution servers. Parallel load,
however, uses a different mechanism.
The behavior of the parallel execution coordinator and parallel execution servers
may differ, depending on what kind of operation they perform (SQL, recovery, or
propagation). For example, if all parallel execution servers in the pool are occupied
and the maximum number of parallel execution servers has been started:
■ In parallel SQL, the parallel execution coordinator switches to serial processing.
■ In parallel propagation, the parallel execution coordinator returns an error.
For a given session, the parallel execution coordinator coordinates only one kind of
operation. A parallel execution coordinator cannot coordinate, for example, parallel
SQL and parallel recovery or propagation at the same time.
See Also:
■ Oracle Database Utilities for information about parallel load and
SQL*Loader
■ Oracle Database Backup and Recovery Basics for information about
parallel media recovery
■ Oracle Database Performance Tuning Guide for information about
parallel instance recovery
■ Oracle Database Advanced Replication for information about
parallel propagation
Note that you can set some parameters in such a way that Oracle will be
constrained. For example, if you set PROCESSES to 20, you will not be able to get 25
slaves.
See Also:
■ "The Parallel Execution Server Pool" on page 24-6
■ "Parallelism Between Operations" on page 24-10
■ "Default Degree of Parallelism" on page 24-34
■ "Parallelization Rules for SQL Statements" on page 24-37
■ The PARALLEL hint is used only for operations on tables. You can use it to
parallelize queries and DML statements (INSERT, UPDATE, MERGE, and
DELETE).
■ The PARALLEL_INDEX hint parallelizes an index range scan of a partitioned
index. (In an index operation, the PARALLEL hint is not valid and is ignored.)
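For example, a query-level request for parallelism might look like the following
sketch; the table, alias, columns, and degree are illustrative:
SELECT /*+ PARALLEL(s, 8) */ s.prod_id, SUM(s.amount_sold)
FROM sales s
GROUP BY s.prod_id;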
See Oracle Database Performance Tuning Guide for information about using hints in
SQL statements and the specific syntax for the PARALLEL, NO_PARALLEL,
PARALLEL_INDEX, CACHE, and NOCACHE hints.
Real Application Clusters Deployment and Performance Guide for more information
about instance groups.
Degree of Parallelism The DOP for a query is determined by the following rules:
■ The query uses the maximum DOP taken from all of the table declarations
involved in the query and all of the potential indexes that are candidates to
satisfy the query (the reference objects). That is, the table or index that has the
greatest DOP determines the query's DOP (maximum query directive).
■ If a table has both a parallel hint specification in the query and a parallel
declaration in its table specification, the hint specification takes precedence over
parallel declaration specification. See Table 24–3 on page 24-44 for precedence
rules.
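For example, with table-level declarations such as the following (degrees
illustrative), a query joining the two tables and using no hints would be considered
for a DOP of 8, the maximum among its reference objects:
ALTER TABLE sales PARALLEL 8;
ALTER TABLE customers PARALLEL 4;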
Decision to Parallelize The following rule determines whether the UPDATE, MERGE, or
DELETE operation should be parallelized:
The UPDATE or DELETE operation will be parallelized if and only if at least one of
the following is true:
■ The table being updated or deleted has a PARALLEL specification.
■ The PARALLEL hint is specified in the DML statement.
Degree of Parallelism The DOP is determined by the same rules as for the queries.
Note that in the case of UPDATE and DELETE operations, only the target table to be
modified (the only reference object) is involved. Thus, the UPDATE or DELETE
parallel hint specification takes precedence over the parallel declaration
specification of the target table. In other words, the precedence order is: MERGE,
UPDATE, DELETE hint > Session > Parallel declaration specification of target table.
See Table 24–3 on page 24-44 for precedence rules.
A parallel execution server can update or merge into, or delete from multiple
partitions, but each partition can only be updated or deleted by one parallel
execution server.
If the DOP is less than the number of partitions, then the first process to finish work
on one partition continues working on another partition, and so on until the work is
finished on all partitions. If the DOP is greater than the number of partitions
involved in the operation, then the excess parallel execution servers will have no
work to do.
If tbl_1 is a partitioned table and its table definition has a parallel clause, then the
update operation is parallelized even if the scan on the table is serial (such as an
index scan), assuming that the table has more than one partition with c1 greater
than 100.
Both the scan and update operations on tbl_2 will be parallelized with degree
four.
Decision to Parallelize The following rule determines whether the INSERT operation
should be parallelized in an INSERT ... SELECT statement:
The INSERT operation will be parallelized if and only if at least one of the following
is true:
■ The PARALLEL hint is specified after the INSERT in the DML statement.
■ The table being inserted into (the reference object) has a PARALLEL declaration
specification.
■ An ALTER SESSION FORCE PARALLEL DML statement has been issued
previously during the session.
The decision to parallelize the INSERT operation is made independently of the
SELECT operation, and vice versa.
Once the decision to parallelize the SELECT or INSERT operation is made, one
parallel directive is picked for deciding the DOP of the whole statement, using the
following precedence rule: Insert hint directive > Session > Parallel declaration
specification of the inserting table > Maximum query directive.
In this context, maximum query directive means that among multiple tables and
indexes, the table or index that has the maximum DOP determines the parallelism
for the query operation.
The chosen parallel directive is applied to both the SELECT and INSERT operations.
Parallel CREATE INDEX or ALTER INDEX ... REBUILD The CREATE INDEX and ALTER
INDEX ... REBUILD statements can be parallelized only by a PARALLEL clause or an
ALTER SESSION FORCE PARALLEL DDL statement.
ALTER INDEX ... REBUILD can be parallelized only for a nonpartitioned index, but
ALTER INDEX ... REBUILD PARTITION can be parallelized by a PARALLEL clause
or an ALTER SESSION FORCE PARALLEL DDL statement.
The scan operation for ALTER INDEX ... REBUILD (nonpartitioned), ALTER INDEX ...
REBUILD PARTITION, and CREATE INDEX has the same parallelism as the
REBUILD or CREATE operation and uses the same DOP. If the DOP is not specified
for REBUILD or CREATE, the default is the number of CPUs.
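For example, either of the following approaches requests a parallel rebuild (a sketch;
the index name is illustrative):
ALTER INDEX sales_idx REBUILD PARALLEL 8;

ALTER SESSION FORCE PARALLEL DDL PARALLEL 8;
ALTER INDEX sales_idx REBUILD;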
Parallel MOVE PARTITION or SPLIT PARTITION The ALTER INDEX ... MOVE PARTITION
and ALTER INDEX ... SPLIT PARTITION statements can be parallelized only by a
PARALLEL clause or an ALTER SESSION FORCE PARALLEL DDL statement. Their
scan operations have the same parallelism as the corresponding MOVE or SPLIT
operations. If the DOP is not specified, the default is the number of CPUs.
Decision to Parallelize (Query Part) The query part of a CREATE TABLE ... AS SELECT
statement can be parallelized only if the following conditions are satisfied:
■ The query includes a parallel hint specification (PARALLEL or PARALLEL_
INDEX) or the CREATE part of the statement has a PARALLEL clause
specification or the schema objects referred to in the query have a
PARALLEL declaration associated with them.
■ At least one of the tables specified in the query requires one of the following: a
full table scan or an index range scan spanning multiple partitions.
Degree of Parallelism (Query Part) The DOP for the query part of a CREATE TABLE ...
AS SELECT statement is determined by one of the following rules:
■ The query part uses the values specified in the PARALLEL clause of the CREATE
part.
■ If the PARALLEL clause is not specified, the default DOP is the number of CPUs.
■ If the CREATE is serial, then the DOP is determined by the query.
Note that any values specified in a hint for parallelism are ignored.
Decision to Parallelize (CREATE Part) The CREATE operation of CREATE TABLE ... AS
SELECT can be parallelized only by a PARALLEL clause or an ALTER SESSION
FORCE PARALLEL DDL statement.
When the CREATE operation of CREATE TABLE ... AS SELECT is parallelized, Oracle
also parallelizes the scan operation if possible. The scan operation cannot be
parallelized if, for example:
■ The SELECT clause has a NO_PARALLEL hint
■ The operation scans an index of a nonpartitioned table
When the CREATE operation is not parallelized, the SELECT can be parallelized if it
has a PARALLEL hint or if the selected table (or partitioned index) has a parallel
declaration.
Degree of Parallelism (CREATE Part) The DOP for the CREATE operation, and for the
SELECT operation if it is parallelized, is specified by the PARALLEL clause of the
CREATE statement, unless it is overridden by an ALTER SESSION FORCE PARALLEL
DDL statement. If the PARALLEL clause does not specify the DOP, the default is the
number of CPUs.
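For example, a statement of the following shape parallelizes both the CREATE and
the underlying SELECT with a DOP of eight (a sketch; the table and column names
are illustrative):
CREATE TABLE sales_summary PARALLEL 8 AS
  SELECT prod_id, SUM(amount_sold) AS total_sold
  FROM sales
  GROUP BY prod_id;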
Once Oracle determines the DOP for a query, the DOP does not change for the
duration of the query.
It is best to use the parallel adaptive multiuser feature when users process
simultaneous parallel execution operations. By default, PARALLEL_ADAPTIVE_
MULTI_USER is set to TRUE, which optimizes the performance of systems with
concurrent parallel SQL execution operations. If PARALLEL_ADAPTIVE_MULTI_
USER is set to FALSE, each parallel SQL execution operation receives the requested
number of parallel execution server processes regardless of the impact to the
performance of the system as long as sufficient resources have been configured.
PARALLEL_MAX_SERVERS
The PARALLEL_MAX_SERVERS parameter sets a resource limit on the maximum
number of processes available for parallel execution. Most parallel operations need
at most twice the number of query server processes as the maximum DOP
attributed to any table in the operation.
Oracle sets PARALLEL_MAX_SERVERS to a default value that is sufficient for most
systems. The default value for PARALLEL_MAX_SERVERS is as follows:
(CPU_COUNT x PARALLEL_THREADS_PER_CPU x (2 if PGA_AGGREGATE_TARGET > 0;
otherwise 1) x 5)
This might not be enough for parallel queries on tables with higher DOP attributes.
Users who expect to run queries with a higher DOP should set
PARALLEL_MAX_SERVERS as follows:
2 x DOP x NUMBER_OF_CONCURRENT_USERS
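For example, if you expect a DOP of 16 and four concurrent users, 2 x 16 x 4 = 128 (a
sketch; substitute your own workload figures):
ALTER SYSTEM SET PARALLEL_MAX_SERVERS = 128;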
When Users Have Too Many Processes When concurrent users have too many query
server processes, memory contention (paging), I/O contention, or excessive context
switching can occur. This contention can reduce system throughput to a level lower
than if parallel execution were not used. Increase the PARALLEL_MAX_SERVERS
value only if the system has sufficient memory and I/O bandwidth for the resulting
load.
You can use operating system performance monitoring tools to determine how
much memory, swap space and I/O bandwidth are free. Look at the runq lengths
for both your CPUs and disks, as well as the service time for I/Os on the system.
Verify that sufficient swap space exists on the machine to add more
processes. Limiting the total number of query server processes might restrict the
number of concurrent users who can execute parallel operations, but system
throughput tends to remain stable.
PARALLEL_MIN_SERVERS
The recommended value for the PARALLEL_MIN_SERVERS parameter is 0 (zero),
which is the default.
This parameter lets you specify in a single instance the number of processes to be
started and reserved for parallel operations. The syntax is:
PARALLEL_MIN_SERVERS=n
The n variable is the number of processes you want to start and reserve for parallel
operations.
Setting PARALLEL_MIN_SERVERS balances the startup cost against memory usage.
Processes started using PARALLEL_MIN_SERVERS do not exit until the database is
shut down. This way, when a query is issued the processes are likely to be available.
It is desirable, however, to recycle query server processes periodically since the
memory these processes use can become fragmented and cause the high water mark
to slowly increase. When you do not set PARALLEL_MIN_SERVERS, processes exit
after they are idle for five minutes.
SHARED_POOL_SIZE
Parallel execution requires memory resources in addition to those required by serial
SQL execution. Additional memory is used for communication and passing data
between query server processes and the query coordinator.
Oracle Database allocates memory for query server processes from the shared pool.
Tune the shared pool as follows:
■ Allow for other clients of the shared pool, such as shared cursors and stored
procedures.
■ Remember that larger values improve performance in multiuser systems, but
smaller values use less memory.
■ You must also take into account that using parallel execution generates more
cursors. Look at statistics in the V$SQLAREA view to determine how often
Oracle recompiles cursors. If the cursor hit ratio is poor, increase the size of the
pool. This happens only when you have a large number of distinct queries.
You can then monitor the number of buffers used by parallel execution and
compare the shared pool PX msg pool to the current high water mark
reported in output from the view V$PX_PROCESS_SYSSTAT.
By default, Oracle allocates parallel execution buffers from the shared pool.
If the database fails to start because there is not enough memory, reduce the value of
SHARED_POOL_SIZE until the database starts. After reducing the value of
SHARED_POOL_SIZE, you might see the error:
ORA-04031: unable to allocate 16084 bytes of shared memory
("SHARED pool","unknown object","SHARED pool heap","PX msg pool")
If so, execute the following query to determine why Oracle could not allocate the
16,084 bytes:
SELECT NAME, SUM(BYTES) FROM V$SGASTAT WHERE POOL='SHARED POOL'
GROUP BY ROLLUP (NAME);
If you specify SHARED_POOL_SIZE and the amount of memory you need to reserve
is bigger than the pool, Oracle does not allocate all the memory it can get. Instead, it
leaves some space. When the query runs, Oracle tries to get what it needs. In this
example, Oracle had already used 560 KB and failed when it needed another 16 KB.
The error does not report the cumulative amount that is needed. The best way to
determine how much more memory is needed is to use the formulas in "Adding
Memory for Message Buffers" on page 24-52.
To resolve the problem in the current example, increase the value for SHARED_
POOL_SIZE. As shown in the sample output, the SHARED_POOL_SIZE is about 2
MB. Depending on the amount of memory available, you could increase the value
of SHARED_POOL_SIZE to 4 MB and attempt to start your database. If Oracle
continues to display an ORA-4031 message, gradually increase the value for
SHARED_POOL_SIZE until startup is successful.
Adding Memory for Message Buffers You must increase the value for the SHARED_
POOL_SIZE parameter to accommodate message buffers. The message buffers
allow query server processes to communicate with each other.
Oracle uses a fixed number of buffers for each virtual connection between producer
query servers and consumer query servers. Connections increase as the square of
the DOP increases. For this reason, the maximum amount of memory used by
parallel execution is bound by the highest DOP allowed on your system. You can
control this value by using either the PARALLEL_MAX_SERVERS parameter or by
using policies and profiles.
To calculate the amount of memory required, use one of the following formulas:
■ For SMP systems:
mem in bytes = (3 x size x users x groups x connections)
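As a worked illustration only (the values are assumptions, not recommendations),
with size = 4096 bytes (PARALLEL_EXECUTION_MESSAGE_SIZE), users = 10,
groups = 2, and connections = 80:
mem in bytes = 3 x 4096 x 10 x 2 x 80 = 19,660,800 (approximately 19 MB)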
Add this amount to your original setting for the shared pool. However, before
setting a value for either of these memory structures, you must also consider
additional memory for cursors, as explained in the following section.
Calculating Additional Memory for Cursors Parallel execution plans consume more space
in the SQL area than serial execution plans. You should regularly monitor shared
pool resource use to ensure that the memory used by both messages and cursors
can accommodate your system's processing requirements.
Evaluate the memory used as shown in your output, and alter the setting for
SHARED_POOL_SIZE based on your processing needs.
To obtain more memory usage statistics, execute the following query:
SELECT * FROM V$PX_PROCESS_SYSSTAT WHERE STATISTIC LIKE 'Buffers%';
The amount of memory used appears in the Buffers Current and Buffers HWM
statistics. Calculate a value in bytes by multiplying the number of buffers by the
value for PARALLEL_EXECUTION_MESSAGE_SIZE. Compare the high water mark
to the parallel execution message pool size to determine if you allocated too much
memory. For example, in the first output, the value for large pool as shown in px
msg pool is 38,092,812 or 38 MB. The Buffers HWM from the second output is
3,620, which when multiplied by a parallel execution message size of 4,096 is
14,827,520, or approximately 15 MB. In this case, the high water mark has reached
approximately 40 percent of its capacity.
PARALLEL_MIN_PERCENT
The recommended value for the PARALLEL_MIN_PERCENT parameter is 0 (zero).
This parameter enables users to wait for an acceptable DOP, depending on the
application in use. Setting this parameter to values other than 0 (zero) causes Oracle
to return an error when the requested DOP cannot be satisfied by the system at a
given time. For example, if you set PARALLEL_MIN_PERCENT to 50, which
translates to 50 percent, and the DOP is reduced by 50 percent or greater because of
the adaptive algorithm or because of a resource limitation, then Oracle returns
ORA-12827. For example:
SELECT /*+ PARALLEL(e, 8, 1) */ d.department_id, SUM(SALARY)
FROM employees e, departments d WHERE e.department_id = d.department_id
GROUP BY d.department_id ORDER BY d.department_id;
The total amount of process memory and the number of processes can vary greatly.
Use the PGA_AGGREGATE_TARGET parameter to control both the process memory
and the number of processes.
PGA_AGGREGATE_TARGET
You can simplify and improve the way PGA memory is allocated by enabling
automatic PGA memory management. In this mode, Oracle dynamically adjusts the
size of the portion of the PGA memory dedicated to work areas, based on an overall
PGA memory target explicitly set by the DBA. To enable automatic PGA memory
management, you have to set the initialization parameter PGA_AGGREGATE_
TARGET. See Oracle Database Performance Tuning Guide for descriptions of how to use
PGA_AGGREGATE_TARGET in different scenarios.
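For example, the following statement sets an overall PGA target of 1 GB (the value is
illustrative; size it to your system's memory):
ALTER SYSTEM SET PGA_AGGREGATE_TARGET = 1G;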
HASH_AREA_SIZE HASH_AREA_SIZE has been deprecated and you should use PGA_
AGGREGATE_TARGET instead.
SORT_AREA_SIZE SORT_AREA_SIZE has been deprecated and you should use PGA_
AGGREGATE_TARGET instead.
PARALLEL_EXECUTION_MESSAGE_SIZE
The PARALLEL_EXECUTION_MESSAGE_SIZE parameter specifies the size of the
buffer used for parallel execution messages. The default value is operating
system-specific, but is typically 2 KB. This value should be adequate for most
applications; however, increasing it can improve performance. Consider increasing
this value if you have adequate free memory in the shared pool, or if you have
sufficient operating system memory and can increase your shared pool size to
accommodate the additional amount of memory required.
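For example, if you use a server parameter file, a change along these lines takes effect
at the next instance restart (the value 4096 is illustrative):
ALTER SYSTEM SET PARALLEL_EXECUTION_MESSAGE_SIZE = 4096 SCOPE=SPFILE;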
Parameters Affecting Resource Consumption for Parallel DML and Parallel DDL
The parameters that affect parallel DML and parallel DDL resource consumption
are:
■ TRANSACTIONS
■ FAST_START_PARALLEL_ROLLBACK
■ LOG_BUFFER
■ DML_LOCKS
■ ENQUEUE_RESOURCES
Parallel inserts, updates, and deletes require more resources than serial DML
operations. Similarly, PARALLEL CREATE TABLE ... AS SELECT and PARALLEL
CREATE INDEX can require more resources. For this reason, you may need to
increase the value of several additional initialization parameters. These parameters
do not affect resources for queries.
TRANSACTIONS For parallel DML and DDL, each query server process starts a
transaction. The parallel coordinator uses the two-phase commit protocol to commit
transactions; therefore, the number of transactions being processed increases by the
DOP. As a result, you might need to increase the value of the TRANSACTIONS
initialization parameter.
The TRANSACTIONS parameter specifies the maximum number of concurrent
transactions. The default assumes no parallelism. For example, if you have a DOP
of 20, you will have 20 more new server transactions (or 40, if you have two server
sets) and 1 coordinator transaction. In this case, you should increase
TRANSACTIONS by 21 (or 41) if the transactions are running in the same instance. If
you do not set this parameter, Oracle sets it to a value equal to 1.1 x SESSIONS. This
discussion does not apply if you are using server-managed undo.
DML_LOCKS This parameter specifies the maximum number of DML locks. Its value
should equal the total number of locks on all tables referenced by all users. A
parallel DML operation's lock and enqueue resource requirement is very different
from serial DML. Parallel DML holds many more locks, so you should increase the
value of the ENQUEUE_RESOURCES and DML_LOCKS parameters by equal amounts.
Table 24–4 shows the types of locks acquired by coordinator and parallel execution
server processes for different types of parallel DML statements. Using this
information, you can determine the value required for these parameters.
Consider a table with 600 partitions running with a DOP of 100. Assume all
partitions are involved in a parallel UPDATE or DELETE statement with no
row-migrations.
The coordinator acquires:
■ 1 table lock SX
■ 600 partition locks X
In total, the parallel execution server processes acquire:
■ 100 table locks SX
■ 600 partition locks NULL
■ 600 partition-wait locks S
The parameters that affect I/O for parallel execution are:
■ DB_CACHE_SIZE
■ DB_BLOCK_SIZE
■ DB_FILE_MULTIBLOCK_READ_COUNT
■ DISK_ASYNCH_IO and TAPE_ASYNCH_IO
These parameters also affect the optimizer, which ensures optimal performance for
parallel execution I/O operations.
DB_CACHE_SIZE
When you perform parallel updates, merges, and deletes, the buffer cache behavior
is very similar to any OLTP system running a high volume of updates.
DB_BLOCK_SIZE
The recommended value for this parameter is 8 KB or 16 KB.
Set the database block size when you create the database. If you are creating a new
database, use a large block size such as 8 KB or 16 KB.
DB_FILE_MULTIBLOCK_READ_COUNT
The recommended value for this parameter is eight for 8 KB block size, or four for
16 KB block size. The default is 8.
This parameter determines how many database blocks are read with a single
operating system READ call. The upper limit for this parameter is
platform-dependent. If you set DB_FILE_MULTIBLOCK_READ_COUNT to an
excessively high value, your operating system will lower the value to the highest
allowable level when you start your database. In this case, each platform uses the
highest value possible. Maximum values generally range from 64 KB to 1 MB.
[Figure: timelines contrasting a synchronous read with an asynchronous read; each
timeline shows I/O activity (read block #1, read block #2) and CPU activity (process
block #1, process block #2).]
Asynchronous operations are currently supported for parallel table scans, hash
joins, sorts, and serial table scans. However, this feature can require operating
system specific configuration and may not be supported on all platforms. Check
your Oracle operating system-specific documentation.
Is There Regression?
Does parallel execution's actual performance deviate from what you expected? If
performance is as you expected, could there be an underlying performance
problem? Perhaps you have a desired outcome in mind to which you are comparing
the current outcome. Perhaps you have justifiable performance expectations that the
system does not achieve. You might have achieved this level of performance or a
particular execution plan in the past, but now, with a similar environment and
operation, the system is not meeting this goal.
If performance is not as you expected, can you quantify the deviation? For data
warehousing operations, the execution plan is key. For critical data warehousing
operations, save the EXPLAIN PLAN results. Then, as you analyze and reanalyze the
data, upgrade Oracle, and load new data, over time you can compare new
execution plans with old plans. Take this approach either proactively or reactively.
Alternatively, you might find that plan performance improves if you use hints. You
might want to understand why hints are necessary and determine how to get the
optimizer to generate the desired plan without hints. Try increasing the statistical
sample size: better statistics can give you a better plan.
See Oracle Database Performance Tuning Guide for information on preserving plans
throughout changes to your system, using plan stability and outlines.
■ Use the CREATE TABLE ... AS SELECT statement to break a complex operation
into smaller pieces. With a large query referencing five or six tables, it may be
difficult to determine which part of the query is taking the most time. You can
isolate bottlenecks in the query by breaking it into steps and analyzing each
step.
■ Is the system CPU-bound with too much parallelism? Check the operating
system CPU monitor to see whether a lot of time is being spent in system calls.
The resource might be overcommitted, and too much parallelism might cause
processes to compete with themselves.
■ Are there more concurrent users than the system can support?
V$PX_BUFFER_ADVICE
The V$PX_BUFFER_ADVICE view provides statistics on historical and projected
maximum buffer usage by all parallel queries. You can consult this view to
reconfigure SGA size in response to insufficient memory problems for parallel
queries.
V$PX_SESSION
The V$PX_SESSION view shows data about query server sessions, groups, sets, and
server numbers. It also displays real-time data about the processes working on
behalf of parallel execution. This table includes information about the requested
DOP and the actual DOP granted to the operation.
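For example, a query of this shape (a sketch) compares the requested and granted
DOP for each active parallel execution session:
SELECT QCSID, SID, SERVER_GROUP, SERVER_SET, DEGREE, REQ_DEGREE
FROM V$PX_SESSION
ORDER BY QCSID, SERVER_GROUP, SERVER_SET;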
V$PX_SESSTAT
The V$PX_SESSTAT view provides a join of the session information from V$PX_
SESSION and the V$SESSTAT table. Thus, all session statistics available to a normal
session are available for all sessions performed using parallel execution.
V$PX_PROCESS
The V$PX_PROCESS view contains information about the parallel processes,
including status, session ID, process ID, and other information.
V$PX_PROCESS_SYSSTAT
The V$PX_PROCESS_SYSSTAT view shows the status of query servers and
provides buffer allocation statistics.
V$PQ_SESSTAT
The V$PQ_SESSTAT view shows the status of all current server groups in the
system such as data about how queries allocate processes and how the multiuser
and load balancing algorithms are affecting the default and hinted values. V$PQ_
SESSTAT will be obsolete in a future release.
You might need to adjust some parameter settings to improve performance after
reviewing data from these views. In this case, refer to the discussion of "Tuning
General Parameters for Parallel Execution" on page 24-47. Query these views
periodically to monitor the progress of long-running parallel operations.
For many dynamic performance views, you must set the parameter TIMED_
STATISTICS to TRUE in order for Oracle to collect statistics for each view. You can
use the ALTER SYSTEM or ALTER SESSION statements to turn TIMED_
STATISTICS on and off.
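For example, either of the following statements enables timed statistics (shown here
at both levels):
ALTER SYSTEM SET TIMED_STATISTICS = TRUE;
ALTER SESSION SET TIMED_STATISTICS = TRUE;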
V$FILESTAT
The V$FILESTAT view sums read and write requests, the number of blocks, and
service times for every datafile in every tablespace. Use V$FILESTAT to diagnose
I/O and workload distribution problems.
You can join statistics from V$FILESTAT with statistics in the DBA_DATA_FILES
view to group I/O by tablespace or to find the filename for a given file number.
Using a ratio analysis, you can determine the percentage of the total tablespace
activity used by each file in the tablespace. If you make a practice of putting just one
large, heavily accessed object in a tablespace, you can use this technique to identify
objects that have a poor physical layout.
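For example, a query along these lines (a sketch) reports physical reads and writes by
file and tablespace:
SELECT D.TABLESPACE_NAME, D.FILE_NAME, F.PHYRDS, F.PHYWRTS
FROM V$FILESTAT F, DBA_DATA_FILES D
WHERE F.FILE# = D.FILE_ID
ORDER BY D.TABLESPACE_NAME, D.FILE_NAME;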
You can further diagnose disk space allocation problems using the DBA_EXTENTS
view. Ensure that space is allocated evenly from all files in the tablespace.
Monitoring V$FILESTAT during a long-running operation and then correlating I/O
activity to the EXPLAIN PLAN output is a good way to follow progress.
V$PARAMETER
The V$PARAMETER view lists the name, current value, and default value of all
system parameters. In addition, the view shows whether a parameter is a session
parameter that you can modify online with an ALTER SYSTEM or ALTER SESSION
statement.
V$PQ_TQSTAT
As a simple example, consider a hash join between two tables, with a join on a
column with only two distinct values. At best, the hash function will send the rows
for one value to parallel execution server A and the rows for the other value to
parallel execution server B. A DOP of two is fine, but if it is four, then at least two
parallel execution servers have no work. To discover this type of skew, use a query
similar to the following example:
SELECT dfo_number, tq_id, server_type, process, num_rows
FROM V$PQ_TQSTAT ORDER BY dfo_number DESC, tq_id, server_type, process;
The best way to resolve this problem might be to choose a different join method; a
nested loop join might be the best option. Alternatively, if one of the join tables is
small relative to the other, a BROADCAST distribution method can be hinted using
the PQ_DISTRIBUTE hint. Note that the optimizer considers the BROADCAST
distribution method only if OPTIMIZER_FEATURES_ENABLE is set to 9.0.2 or
higher.
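For example, a hint of the following shape requests that the rows of a small
dimension table be broadcast instead of hash-distributed (a sketch; the table and
column names are hypothetical, and the appropriate distribution keywords depend
on which table is the inner table of the join, so verify the PQ_DISTRIBUTE semantics
in Oracle Database Performance Tuning Guide):
SELECT /*+ PARALLEL(f) PARALLEL(d) PQ_DISTRIBUTE(d NONE, BROADCAST) */
       f.key_col, SUM(f.amount)
FROM big_fact f, small_dim d
WHERE f.key_col = d.key_col
GROUP BY f.key_col;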
Now, assume that you have a join key with high cardinality, but one of the values
contains most of the data, for example, lava lamp sales by year. The only year that
had big sales was 1968, and thus, the parallel execution server for the 1968 records
will be overwhelmed. You should use the same corrective actions as described
previously.
The V$PQ_TQSTAT view provides a detailed report of message traffic at the table
queue level. V$PQ_TQSTAT data is valid only when queried from a session that is
executing parallel SQL statements. A table queue is the pipeline between query
server groups, between the parallel coordinator and a query server group, or
between a query server group and the coordinator. The table queues are represented
explicitly in the operation column by PX SEND <partitioning type> (for
example, PX SEND HASH) and PX RECEIVE. For backward compatibility, the row
labels of PARALLEL_TO_PARALLEL, SERIAL_TO_PARALLEL, or PARALLEL_TO_
SERIAL will continue to have the same semantics as previous releases and can be
used as before to infer the table queue allocation. In addition, the top of the parallel
plan is marked by a new node with operation PX COORDINATOR.
V$PQ_TQSTAT has a row for each query server process that reads from or writes to
each table queue. A table queue connecting 10 consumer processes to 10 producer
processes has 20 rows in the view. Sum the bytes column and group by TQ_ID, the
table queue identifier, to obtain the total number of bytes sent through each table
queue. Compare this with the optimizer estimates; large variations might indicate a
need to analyze the data using a larger sample.
Compute the variance of bytes grouped by TQ_ID. Large variances indicate
workload imbalances. You should investigate large variances to determine whether
the producers start out with unequal distributions of data, or whether the
distribution itself is skewed. If the data itself is skewed, this might indicate a low
cardinality, or low number of distinct values.
Note that the V$PQ_TQSTAT view will be renamed V$PX_TQSTAT in a future
release.
For a single instance, query V$PX_SESSION and do not include the INST_ID
column.
The processes shown in the output from the previous example using
GV$PX_SESSION collaborate to complete the same task. The next example shows
the execution of a join query to determine the progress of these processes in terms
of physical reads. Use this query to track any specific statistic:
SELECT QCSID, SID, INST_ID "Inst", SERVER_GROUP "Group", SERVER_SET "Set",
NAME "Stat Name", VALUE
FROM GV$PX_SESSTAT A, V$STATNAME B
WHERE A.STATISTIC# = B.STATISTIC# AND NAME LIKE 'physical reads'
AND VALUE > 0 ORDER BY QCSID, QCINST_ID, SERVER_GROUP, SERVER_SET;
Use the previous type of query to track statistics in V$STATNAME. Repeat this query
as often as required to observe the progress of the query server processes.
The next query uses V$PX_PROCESS to check the status of the query servers.
SELECT * FROM V$PX_PROCESS;
The following query shows the current wait state of each slave and QC process on
the system:
SELECT px.SID "SID", p.PID, p.SPID "SPID", px.INST_ID "Inst",
px.SERVER_GROUP "Group", px.SERVER_SET "Set",
px.DEGREE "Degree", px.REQ_DEGREE "Req Degree", w.event "Wait Event"
FROM GV$SESSION s, GV$PX_SESSION px, GV$PROCESS p, GV$SESSION_WAIT w
WHERE s.sid (+) = px.sid AND s.inst_id (+) = px.inst_id AND
      s.sid = w.sid (+) AND s.inst_id = w.inst_id (+) AND
      s.paddr = p.addr (+) AND s.inst_id = p.inst_id (+)
ORDER BY px.QCSID, px.SERVER_GROUP, px.SERVER_SET;
Oracle considers affinity when allocating work to parallel execution servers. The
use of affinity for parallel execution of SQL statements is transparent to users.
You can also use the utlxplp.sql script to present the EXPLAIN PLAN output
with all relevant parallel information.
You can increase the optimizer's ability to generate parallel plans by converting
subqueries, especially correlated subqueries, into joins. Oracle can parallelize joins
more efficiently than subqueries. This also applies to updates. See "Updating the
Table in Parallel" on page 24-86 for more information.
These tables can also be incrementally loaded with parallel INSERT. You can take
advantage of intermediate tables using the following techniques:
■ Common subqueries can be computed once and referenced many times. This
can allow some queries against star schemas (in particular, queries without
selective WHERE-clause predicates) to be better parallelized. Note that star
queries with selective WHERE-clause predicates using the star-transformation
technique can be effectively parallelized automatically without any
modification to the SQL.
■ Decompose complex queries into simpler steps in order to provide
application-level checkpoint or restart. For example, a complex multitable join
on a database 1 terabyte in size could run for dozens of hours. A failure during
this query would mean starting over from the beginning. Using CREATE TABLE
... AS SELECT or PARALLEL INSERT AS SELECT, you can rewrite the query as a
sequence of simpler queries that run for a few hours each. If a system failure
occurs, the query can be restarted from the last completed step.
■ Implement manual parallel deletes efficiently by creating a new table that omits
the unwanted rows from the original table, and then dropping the original
table. Alternatively, you can use the convenient parallel delete feature, which
directly deletes rows from the original table.
■ Create summary tables for efficient multidimensional drill-down analysis. For
example, a summary table might store the sum of revenue grouped by month,
brand, region, and salesman.
■ Reorganize tables, eliminating chained rows, compressing free space, and so on,
by copying the old table to a new table. This is much faster than export/import
and easier than reloading.
Be sure to use the DBMS_STATS package on newly created tables. Also consider
creating indexes. To avoid I/O bottlenecks, specify a tablespace with at least as
many devices as CPUs. To avoid fragmentation in allocating space, the number of
files in a tablespace should be a multiple of the number of CPUs. See Chapter 4,
"Hardware and I/O Considerations in Data Warehouses", for more information
about bottlenecks.
Temporary extent sizes are typically in the range of 1 MB to 10 MB. Once you allocate an extent, it is available for the duration of an
operation. If you allocate a large extent but only need to use a small amount of
space, the unused space in the extent is unavailable.
At the same time, temporary extents should be large enough that processes do not
have to wait for space. Temporary tablespaces use less overhead than permanent
tablespaces when allocating and freeing a new extent. However, obtaining a new
temporary extent still requires the overhead of acquiring a latch and searching
through the SGA structures, as well as SGA space consumption for the sort extent
pool.
See Oracle Database Performance Tuning Guide for information regarding
locally-managed temporary tablespaces.
be an indication that the tables are not analyzed or that the optimizer has made
an incorrect estimate about the correlation of multiple predicates on the same
table. A hint may be required to force the optimizer to use another join method.
Consequently, if the plan says only one row is produced from any particular
stage and this is incorrect, consider hints or gather statistics.
■ Be careful when using a hash join on low cardinality join keys. If a join key has few distinct values,
then a hash join may not be optimal. If the number of distinct values is less than
the DOP, then some parallel query servers may be unable to work on the
particular query.
■ Consider data skew. If a join key involves excessive data skew, a hash join may
require some parallel query servers to work more than others. Consider using a
hint to cause a BROADCAST distribution method if the optimizer did not choose
it. Note that the optimizer will consider the BROADCAST distribution method
only if the OPTIMIZER_FEATURES_ENABLE is set to 9.0.2 or higher. See
"V$PQ_TQSTAT" on page 24-67 for further details.
Increasing INITRANS
If you have global indexes, a global index segment and global index blocks are
shared by server processes of the same parallel DML statement. Even if the
operations are not performed against the same row, the server processes can share
the same index blocks. Each server transaction needs one transaction entry in the
index block header before it can make changes to a block. Therefore, in the CREATE
INDEX or ALTER INDEX statements, you should set INITRANS, the initial number
of transactions allocated within each data block, to a large value, such as the
maximum DOP against this index.
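For example, if the maximum DOP you expect to run against the index is 32 (the
index name is illustrative):
ALTER INDEX orders_global_idx INITRANS 32;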
limitation the next time you re-create the segment header by decreasing the number
of process free lists; this leaves more room for transaction free lists in the segment
header.
For UPDATE and DELETE operations, each server process can require its own
transaction free list. The parallel DML DOP is thus effectively limited by the
smallest number of transaction free lists available on the table and on any of the
global indexes the DML statement must maintain. For example, if the table has 25
transaction free lists and the table has two global indexes, one with 50 transaction
free lists and one with 30 transaction free lists, the DOP is limited to 25. If the table
had had 40 transaction free lists, the DOP would have been limited to 30.
The FREELISTS parameter of the STORAGE clause is used to set the number of
process free lists. By default, no process free lists are created.
The default number of transaction free lists depends on the block size. For example,
if the number of process free lists is not set explicitly, a 4 KB block has about 80
transaction free lists by default. The minimum number of transaction free lists is 25.
In this case, you should consider increasing the DBWn processes. If there are no
waits for free buffers, the query will not return any rows.
[NO]LOGGING Clause
The [NO]LOGGING clause applies to tables, partitions, tablespaces, and indexes.
Virtually no log is generated for certain operations (such as direct-path INSERT) if
the NOLOGGING clause is used. The NOLOGGING attribute is not specified at the
INSERT statement level but is instead specified when using the ALTER or CREATE
statement for a table, partition, index, or tablespace.
When a table or index has NOLOGGING set, neither parallel nor serial direct-path
INSERT operations generate redo logs. Processes running with the NOLOGGING
option set run faster because no redo is generated. However, after a NOLOGGING
operation against a table, partition, or index, if a media failure occurs before a
backup is taken, then all tables, partitions, and indexes that have been modified
might be corrupted.
With NOLOGGING set, direct-path INSERT operations (except for dictionary updates)
do not generate redo logs. The NOLOGGING attribute does not affect undo, only
redo. To be precise,
NOLOGGING allows the direct-path INSERT operation to generate a negligible
amount of redo (range-invalidation redo, as opposed to full image redo).
For backward compatibility, [UN]RECOVERABLE is still supported as an alternate
keyword with the CREATE TABLE statement. This alternate keyword might not be
supported, however, in future releases.
At the tablespace level, the logging clause specifies the default logging attribute for
all tables, indexes, and partitions created in the tablespace. When an existing
tablespace logging attribute is changed by the ALTER TABLESPACE statement, then
all tables, indexes, and partitions created after the ALTER statement will have the
new logging attribute; existing ones will not change their logging attributes. The
tablespace-level logging attribute can be overridden by the specifications at the
table, index, or partition level.
The default logging attribute is LOGGING. However, if you have put the database in
NOARCHIVELOG mode, by issuing ALTER DATABASE NOARCHIVELOG, then all
operations that can be done without logging will not generate logs, regardless of the
specified logging attribute.
Parallel local index creation uses a single server set. Each server process in the set is
assigned a table partition to scan and for which to build an index partition. Because
half as many server processes are used for a given DOP, parallel local index creation
can be run with a higher DOP. However, the DOP is restricted to be less than or
equal to the number of index partitions you wish to create. To avoid this limitation,
you can use the DBMS_PCLXUTIL package.
You can optionally specify that no redo and undo logging should occur during
index creation. This can significantly improve performance but temporarily renders
the index unrecoverable. Recoverability is restored after the new index is backed
up. If your application can tolerate a window where recovery of the index requires
it to be re-created, then you should consider using the NOLOGGING clause.
The PARALLEL clause in the CREATE INDEX statement is the only way in which you
can specify the DOP for creating the index. If the DOP is not specified in the parallel
clause of CREATE INDEX, then the number of CPUs is used as the DOP. If there is no
PARALLEL clause, index creation is done serially.
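For example, the following statement creates an index with a DOP of eight and
suppresses redo generation (a sketch; the table, column, and index names are
illustrative, and the NOLOGGING recoverability caveats described earlier apply):
CREATE INDEX sales_cust_idx ON sales (cust_id)
  PARALLEL 8 NOLOGGING;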
When creating an index in parallel, the STORAGE clause refers to the storage of each
of the subindexes created by the query server processes. Therefore, an index created
with an INITIAL of 5 MB and a DOP of 12 consumes at least 60 MB of storage
during index creation because each process starts with an extent of 5 MB. When the
query coordinator process combines the sorted subindexes, some of the extents
might be trimmed, and the resulting index might be smaller than the requested 60
MB.
When you add or enable a UNIQUE or PRIMARY KEY constraint on a table, you
cannot automatically create the required index in parallel. Instead, manually create
an index on the desired columns, using the CREATE INDEX statement and an
appropriate PARALLEL clause, and then add or enable the constraint. Oracle then
uses the existing index when enabling or adding the constraint.
Multiple constraints on the same table can be enabled concurrently and in parallel if
all the constraints are already in the ENABLE NOVALIDATE state. In the following
example, the ALTER TABLE ... ENABLE CONSTRAINT statement performs the table
scan that checks the constraint in parallel:
CREATE TABLE a (a1 NUMBER CONSTRAINT ach CHECK (a1 > 0) ENABLE NOVALIDATE)
PARALLEL;
INSERT INTO a values (1);
COMMIT;
ALTER TABLE a ENABLE CONSTRAINT ach;
If parallel DML is enabled and there is a PARALLEL hint or PARALLEL attribute set
for the table in the data dictionary, then inserts are parallel and appended, unless a
restriction applies. If either the PARALLEL hint or PARALLEL attribute is missing,
the insert is performed serially.
used when recovery is needed for the table or partition. If recovery is needed, be
sure to take a backup immediately after the operation. Use the ALTER TABLE
[NO]LOGGING statement to set the appropriate value.
Parallelizing INSERT ... SELECT In the INSERT ... SELECT statement you can specify a
PARALLEL hint after the INSERT keyword, in addition to the hint after the SELECT
keyword. The PARALLEL hint after the INSERT keyword applies to the INSERT
operation only, and the PARALLEL hint after the SELECT keyword applies to the
SELECT operation only. Thus, parallelism of the INSERT and SELECT operations
are independent of each other. If one operation cannot be performed in parallel, it
has no effect on whether the other operation can be performed in parallel.
The ability to parallelize inserts causes a change in existing behavior if the user has
explicitly enabled the session for parallel DML and if the table in question has a
PARALLEL attribute set in the data dictionary entry. In that case, existing INSERT ...
SELECT statements that have the select operation parallelized can also have their
insert operation parallelized.
If you query multiple tables, you can specify multiple SELECT PARALLEL hints and
multiple PARALLEL attributes.
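For example, a statement of the following shape places one hint after the INSERT
keyword and one after the SELECT keyword (a sketch; the table names are
illustrative, and parallel DML must be enabled for the session):
ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ PARALLEL(sales_history, 4) */ INTO sales_history
SELECT /*+ PARALLEL(sales, 4) */ * FROM sales;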
The APPEND keyword is not required in this example because it is implied by the
PARALLEL hint.
Parallelizing UPDATE and DELETE The PARALLEL hint (placed immediately after the
UPDATE or DELETE keyword) applies not only to the underlying scan operation, but
also to the UPDATE or DELETE operation. Alternatively, you can specify UPDATE or
DELETE parallelism in the PARALLEL clause specified in the definition of the table
to be modified.
If you have explicitly enabled parallel DML for the session or transaction, UPDATE
or DELETE statements that have their query operation parallelized can also have
their UPDATE or DELETE operation parallelized. Any subqueries or updatable views
in the statement can have their own separate PARALLEL hints or clauses, but these
parallel directives do not affect the decision to parallelize the update or delete. If
these operations cannot be performed in parallel, it has no effect on whether the
UPDATE or DELETE portion can be performed in parallel.
Tables must be partitioned in order to support parallel UPDATE and DELETE.
The PARALLEL hint is applied to the UPDATE operation as well as to the underlying
scan of the employees table.
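For example (a sketch, assuming the employees table is partitioned and parallel
DML has been enabled for the session; the column and predicate are illustrative):
UPDATE /*+ PARALLEL(employees, 4) */ employees
SET salary = salary * 1.05
WHERE department_id = 50;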
contains either new rows or rows that have been updated since the last refresh of
the data warehouse. In this example, the updated data is shipped from the
production system to the data warehouse system by means of ASCII files. These
files must be loaded into a temporary table, named diff_customer, before
starting the refresh process. You can use SQL*Loader with both the parallel and
direct options to efficiently perform this task. You can use the APPEND hint when
loading in parallel as well.
Once diff_customer is loaded, the refresh process can be started. It can be
performed in two phases or by merging in parallel, as demonstrated in the
following:
■ Updating the Table in Parallel
■ Inserting the New Rows into the Table in Parallel
■ Merging in Parallel
You can then update the customers table with the following SQL statement:
UPDATE /*+ PARALLEL(cust_joinview) */
(SELECT /*+ PARALLEL(customers) PARALLEL(diff_customer) */
customers.c_name AS c_name, customers.c_addr AS c_addr,
diff_customer.c_name AS c_newname, diff_customer.c_addr AS c_newaddr
FROM customers, diff_customer
WHERE customers.c_key = diff_customer.c_key) cust_joinview
SET c_name = c_newname, c_addr = c_newaddr;
The base scans feeding the join view cust_joinview are done in parallel. You can
then parallelize the update to further improve performance, but only if the
customers table is partitioned.
However, you can guarantee that the subquery is transformed into an anti-hash join
by using the HASH_AJ hint. Doing so enables you to use parallel INSERT to execute
the preceding statement efficiently. Parallel INSERT is applicable even if the table is
not partitioned.
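A statement of the following shape (a sketch based on the customers and
diff_customer tables used in this section) combines the HASH_AJ hint with a
parallel INSERT:
ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ PARALLEL(customers, 4) */ INTO customers
SELECT /*+ PARALLEL(diff_customer, 4) */ *
FROM diff_customer
WHERE c_key NOT IN (SELECT /*+ HASH_AJ */ c_key FROM customers);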
Merging in Parallel
You can combine updates and inserts into one statement, commonly known as a
merge. The following statement achieves the same result as all of the statements in
"Updating the Table in Parallel" on page 24-86 and "Inserting the New Rows into
the Table in Parallel" on page 24-87:
MERGE INTO customers USING diff_customer
ON (diff_customer.c_key = customers.c_key)
WHEN MATCHED THEN
UPDATE SET c_name = diff_customer.c_name, c_addr = diff_customer.c_addr
WHEN NOT MATCHED THEN
INSERT VALUES (diff_customer.c_key, diff_customer.c_data);
advantage. In such cases, begin with the execution plan recommended by query
optimization, and go on to test the effect of hints only after you have quantified
your performance expectations. Remember that hints are powerful. If you use them
and the underlying data changes, you might need to change the hints. Otherwise,
the effectiveness of your execution plans might deteriorate.
FIRST_ROWS(n) Hint
The FIRST_ROWS(n) hint enables the optimizer to use a new optimization mode to
optimize the query to return n rows in the shortest amount of time. Oracle
Corporation recommends that you use this new hint in place of the old FIRST_
ROWS hint for online queries because the new optimization mode may improve the
response time compared to the old optimization mode.
Use the FIRST_ROWS(n) hint in cases where you want the first n number of rows
in the shortest possible time. For example, to obtain the first 10 rows in the shortest
possible time, use the hint as follows:
SELECT /*+ FIRST_ROWS(10) */ article_id
FROM articles_tab WHERE CONTAINS(article, 'Oracle')>0 ORDER BY pub_date DESC;
additive
Describes a fact (or measure) that can be summarized through addition. An
additive fact is the most common type of fact. Examples include sales, cost, and
profit. Contrast with nonadditive and semi-additive.
advisor
See: SQLAccess Advisor.
aggregate
Summarized data. For example, unit sales of a particular product could be
aggregated by day, month, quarter and yearly sales.
aggregation
The process of consolidating data values into a single value. For example, sales data
could be collected on a daily basis and then be aggregated to the week level, the
week data could be aggregated to the month level, and so on. The data can then be
referred to as aggregate data. Aggregation is synonymous with summarization,
and aggregate data is synonymous with summary data.
ancestor
A value at any level higher than a given value in a hierarchy. For example, in a Time
dimension, the value 1999 might be the ancestor of the values Q1-99 and Jan-99.
attribute
A descriptive characteristic of one or more levels. For example, the product
dimension for a clothing manufacturer might contain a level called item, one of
whose attributes is color. Attributes represent logical groupings that enable end
users to select data based on like characteristics.
Note that in relational modeling, an attribute is defined as a characteristic of an
entity. In Oracle Database 10g, an attribute is a column in a dimension that
characterizes elements of a single level.
cardinality
From an OLTP perspective, this refers to the number of rows in a table. From a data
warehousing perspective, this typically refers to the number of distinct values in a
column. For most data warehouse DBAs, a more important issue is the degree of
cardinality.
change set
A set of logically grouped change data that is transactionally consistent. It contains
one or more change tables.
change table
A relational table that contains change data for a single source table. To Change
Data Capture subscribers, a change table is known as a publication.
child
A value at the level under a given value in a hierarchy. For example, in a Time
dimension, the value Jan-99 might be the child of the value Q1-99. A value can be a
child for more than one parent if the child value belongs to multiple hierarchies.
See Also:
■ hierarchy
■ level
■ parent
cleansing
The process of resolving inconsistencies and fixing the anomalies in source data,
typically as part of the ETL process.
See Also: ETL
cross product
A procedure for combining the elements in multiple sets. For example, given two
columns, each element of the first column is matched with every element of the
second column. A simple example is illustrated as follows:
Col1  Col2  Cross Product
----  ----  -------------
a     c     ac
b     d     ad
            bc
            bd
Cross products are performed when grouping sets are concatenated, as described in
Chapter 20, "SQL for Aggregation in Data Warehouses".
data mart
A data warehouse that is designed for a particular line of business, such as sales,
marketing, or finance. In a dependent data mart, the data can be derived from an
enterprise-wide data warehouse. In an independent data mart, data can be collected
directly from sources.
data source
A database, application, repository, or file that contributes data to a warehouse.
data warehouse
A relational database that is designed for query and analysis rather than transaction
processing. A data warehouse usually contains historical data that is derived from
transaction data, but it can include data from other sources. It separates analysis
workload from transaction workload and enables a business to consolidate data
from several sources.
In addition to a relational database, a data warehouse environment often consists of
an ETL solution, an OLAP engine, client analysis tools, and other applications that
manage the process of gathering data and delivering it to business users.
degree of cardinality
The number of unique values of a column divided by the total number of rows in
the table. This is particularly important when deciding which indexes to build. You
typically want to use bitmap indexes on low degree of cardinality columns and
B-tree indexes on high degree of cardinality columns. As a general rule, a
cardinality of under 1% makes a good candidate for a bitmap index.
denormalize
The process of allowing redundancy in a table. Contrast with normalize.
detail
See: fact table.
detail table
See: fact table.
dimension
The term dimension is commonly used in two ways:
■ A general term for any characteristic that is used to specify the members of a
data set. The 3 most common dimensions in sales-oriented data warehouses are
time, geography, and product. Most dimensions have hierarchies.
■ An object defined in a database to enable queries to navigate dimensions. In
Oracle Database 10g, a dimension is a database object that defines hierarchical
(parent/child) relationships between pairs of column sets. In Oracle Express, a
dimension is a database object that consists of a list of values.
dimension table
Dimension tables describe the business entities of an enterprise, represented as
hierarchical, categorical information such as time, departments, locations, and
products. Dimension tables are sometimes called lookup or reference tables.
dimension value
One element in the list that makes up a dimension. For example, a computer
company might have dimension values in the product dimension called LAPPC
and DESKPC. Values in the geography dimension might include Boston and Paris.
Values in the time dimension might include MAY96 and JAN97.
drill
To navigate from one item to a set of related items. Drilling typically involves
navigating up and down through the levels in a hierarchy. When selecting data, you
can expand or collapse a hierarchy by drilling down or up in it, respectively.
drill down
To expand the view to include child values that are associated with parent values in
the hierarchy.
drill up
To collapse the list of descendant values that are associated with a parent value in
the hierarchy.
element
An object or process. For example, a dimension is an object, a mapping is a process,
and both are elements.
entity
Entity is used in database modeling. In relational databases, it typically maps to a
table.
ETL
Extraction, transformation, and loading. ETL refers to the methods involved in
accessing and manipulating source data and loading it into a data warehouse. The
order in which these processes are performed varies.
Note that ETT (extraction, transformation, transportation) and ETM (extraction,
transformation, move) are sometimes used instead of ETL.
See Also:
■ data warehouse
■ extraction
■ transformation
■ transportation
extraction
The process of taking data out of a source as part of an initial phase of ETL.
fact
Data, usually numeric and additive, that can be examined and analyzed. Examples
include sales, cost, and profit. Fact and measure are synonymous; fact is more
commonly used with relational environments, measure is more commonly used
with multidimensional environments.
fact table
A table in a star schema that contains facts. A fact table typically has two types of
columns: those that contain facts and those that are foreign keys to dimension
tables. The primary key of a fact table is usually a composite key that is made up of
all of its foreign keys.
A fact table might contain either detail level facts or facts that have been aggregated
(fact tables that contain aggregated facts are often instead called summary tables). A
fact table usually contains facts with the same level of aggregation.
fast refresh
An operation that applies only the data changes to a materialized view, thus
eliminating the need to rebuild the materialized view from scratch.
file-to-table mapping
Maps data from flat files to tables in the warehouse.
hierarchy
A logical structure that uses ordered levels as a means of organizing data. A
hierarchy can be used to define data aggregation; for example, in a time dimension,
a hierarchy might be used to aggregate data from the Month level to the Quarter
level to the Year level. Hierarchies can be defined in Oracle as part of the dimension
object. A hierarchy can also be used to define a navigational drill path, regardless of
whether the levels in the hierarchy represent aggregated totals.
high boundary
The newest row in a subscription window.
level
A position in a hierarchy. For example, a time dimension might have a hierarchy
that represents data at the Month, Quarter, and Year levels.
low boundary
The oldest row in a subscription window.
mapping
The definition of the relationship and data flow between source and target objects.
materialized view
A pre-computed table comprising aggregated or joined data from fact and possibly
dimension tables. Also known as a summary or aggregate table.
measure
See: fact.
metadata
Data that describes data and other structures, such as objects, business rules, and
processes. For example, the schema design of a data warehouse is typically stored in
a repository as metadata, which is used to generate scripts used to build and
populate the data warehouse. A repository contains metadata.
Examples include: for data, the definition of a source to target transformation that is
used to generate and populate the data warehouse; for information, definitions of
tables, columns and associations that are stored inside a relational modeling tool;
for business rules, discount by 10 percent after selling 1,000 items.
model
An object that represents something to be made. A representative style, plan, or
design. Metadata that defines the structure of the data warehouse.
nonadditive
Describes a fact (or measure) that cannot be summarized through addition. An
example includes Average. Contrast with additive and semi-additive.
normalize
In a relational database, the process of removing redundancy in data by separating
the data into multiple tables. Contrast with denormalize.
OLAP
See: online analytical processing (OLAP).
OLAP tools can run against a multidimensional database or interact directly with a
relational database.
OLTP
See: online transaction processing (OLTP).
parallelism
Breaking down a task so that several processes do part of the work. When multiple
CPUs each do their portion simultaneously, very large performance gains are
possible.
parallel execution
Breaking down a task so that several processes do part of the work. When multiple
CPUs each do their portion simultaneously, very large performance gains are
possible.
parent
A value at the level above a given value in a hierarchy. For example, in a Time
dimension, the value Q1-99 might be the parent of the value Jan-99.
See Also:
■ child
■ hierarchy
■ level
partition
Very large tables and indexes can be difficult and time-consuming to work with. To
improve manageability, you can break your tables and indexes into smaller pieces
called partitions.
pivoting
A transformation where each record in an input stream is converted to many
records in the appropriate table in the data warehouse. This is particularly
important when taking data from nonrelational databases.
publication
A relational table that contains change data for a single source table. Change Data
Capture publishers refer to a publication as a change table.
publication ID
A publication ID is a unique numeric value that Change Data Capture assigns to
each change table defined by a publisher.
publisher
Usually a database administrator who is in charge of creating and maintaining
schema objects that make up the Change Data Capture system.
refresh
The mechanism whereby materialized views are changed to reflect new data.
schema
A collection of related database objects. Relational schemas are grouped by database
user ID and include tables, views, and other objects. The sample schema sh is used
throughout this guide.
semi-additive
Describes a fact (or measure) that can be summarized through addition along some,
but not all, dimensions. Examples include headcount and on hand stock. Contrast
with additive and nonadditive.
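To make the distinction concrete, a small sketch against a hypothetical inventory
table with one row per warehouse per day:

-- On-hand stock adds up meaningfully across warehouses for a single day...
SELECT stock_date, SUM(on_hand_qty) AS total_on_hand
FROM   inventory
WHERE  stock_date = DATE '2003-12-01'
GROUP  BY stock_date;

-- ...but summing the same column across days double counts; an average
-- (or the closing value of the period) is reported instead.
SELECT warehouse_id, AVG(on_hand_qty) AS avg_on_hand
FROM   inventory
GROUP  BY warehouse_id;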
snowflake schema
A type of star schema in which the dimension tables are partly or fully normalized.
source
A database, application, file, or other storage facility from which the data in a data
warehouse is derived.
source system
A database, application, file, or other storage facility from which the data in a data
warehouse is derived.
source tables
The tables in a source database.
SQLAccess Advisor
The SQLAccess Advisor helps you achieve your performance goals by
recommending the proper set of materialized views, materialized view logs, and
indexes for a given workload. It is available through a graphical interface in Oracle
Enterprise Manager and offers capabilities similar to those of the DBMS_ADVISOR
package.
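For illustration, a single statement can also be tuned from the command line with
the related DBMS_ADVISOR package; the task name and statement below are
placeholders:

EXECUTE DBMS_ADVISOR.QUICK_TUNE(DBMS_ADVISOR.SQLACCESS_ADVISOR, -
  'my_quick_tune_task', -
  'SELECT cust_id, SUM(amount_sold) FROM sh.sales GROUP BY cust_id');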
staging area
A place where data is processed before entering the warehouse.
staging file
A file used when data is processed before entering the warehouse.
star query
A join between a fact table and a number of dimension tables. Each dimension table
is joined to the fact table using a primary key to foreign key join, but the dimension
tables are not joined to each other.
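As an illustration only, a typical star query over the sh sample schema (column
names are assumed from that schema):

-- The sales fact table joins to each dimension; times and customers
-- are not joined to each other.
SELECT t.calendar_quarter_desc, c.cust_city, SUM(s.amount_sold) AS amount_sold
FROM   sales s, times t, customers c
WHERE  s.time_id = t.time_id
AND    s.cust_id = c.cust_id
GROUP  BY t.calendar_quarter_desc, c.cust_city;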
star schema
A relational schema whose design represents a multidimensional data model. The
star schema consists of one or more fact tables and one or more dimension tables
that are related through foreign keys.
subject area
A classification system that represents or distinguishes parts of an organization or
areas of knowledge. A data mart is often developed to support a subject area such
as sales, marketing, or geography.
subscribers
Consumers of the published change data. These are normally applications.
subscription
A mechanism for Change Data Capture subscribers that controls access to the
change data from one or more source tables of interest within a single change set. A
subscription contains one or more subscriber views.
subscription window
A mechanism that defines the range of rows in a Change Data Capture publication
that the subscriber can currently see in subscriber views.
summary
See: materialized view.
Summary Advisor
Replaced by the SQLAccess Advisor. See: SQLAccess Advisor.
target
Holds the intermediate or final results of any part of the ETL process. The target of
the entire ETL process is the data warehouse.
third normal form
A classical relational schema design that minimizes data redundancy through
normalization. Third normal form schemas are sometimes chosen for large data
warehouses, especially environments with significant data loading requirements
that are used to feed data marts and execute long-running queries.
transformation
The process of manipulating data. Any manipulation beyond copying is a
transformation. Examples include cleansing, aggregating, and integrating data from
multiple sources.
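For illustration, a simple cleansing-plus-aggregation transformation expressed in
SQL; the staging and target tables here are hypothetical:

-- Standardize the product code while aggregating into the target table.
INSERT INTO sales_summary (prod_code, total_amount)
SELECT UPPER(TRIM(prod_code)), SUM(sale_amount)
FROM   sales_staging
GROUP  BY UPPER(TRIM(prod_code));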
transportation
The process of moving copied or transformed data from a source to a data
warehouse.
unique identifier
An identifier whose purpose is to differentiate between occurrences of the same
item when it appears in more than one place.
update window
The length of time available for updating a warehouse. For example, you might
have 8 hours at night to update your warehouse.
update frequency
How often a data warehouse is updated with new information. For example, a
warehouse might be updated nightly from an OLTP system.
validation
The process of verifying metadata definitions and configuration parameters.
versioning
The ability to create new versions of a data warehouse project for new requirements
and changes.
Index
A applications
data warehouses
adaptive multiuser star queries, 19-3
algorithm for, 24-46 decision support, 24-2
definition, 24-45 decision support systems (DSS), 6-2
affinity parallel SQL, 24-16
parallel DML, 24-72 direct-path INSERT, 24-22
partitions, 24-71 parallel DML, 24-21
aggregates, 8-12, 18-73 ARCH processes
computability check, 18-34 multiple, 24-80
ALL_PUBLISHED_COLUMNS view, 16-70 architecture
ALTER, 18-5 data warehouse, 1-5
ALTER INDEX statement MPP, 24-72
partition attributes, 5-42 SMP, 24-72
ALTER MATERIALIZED VIEW statement, 8-21 asynchronous AutoLog publishing
ALTER SESSION statement requirements for, 16-19
ENABLE PARALLEL DML clause, 24-22 asynchronous AutoLog publishing
FORCE PARALLEL DDL clause, 24-41, 24-44 latency for, 16-20
create or rebuild index, 24-42, 24-44 location of staging database, 16-20
create table as select, 24-43, 24-44 setting database initialization parameters
move or split partition, 24-42, 24-45 for, 16-22
FORCE PARALLEL DML clause asynchronous AutoLog publishing
insert, 24-40, 24-44 source database performance impact, 16-20
update and delete, 24-38, 24-39, 24-44 Asynchronous Change Data Capture
ALTER TABLE statement columns of built-in Oracle datatypes supported
NOLOGGING clause, 24-84 by, 16-51
altering dimensions, 10-14 asynchronous Change Data Capture
amortization archived redo log files and, 16-48
calculating, 22-51 ARCHIVELOGMODE and, 16-48
analytic functions supplemental logging, 16-50
concepts, 21-3 supplemental logging and, 16-11
analyzing data asynchronous change sets
for parallel processing, 24-65 disabling, 16-53
APPEND hint, 24-83 enabling, 16-53
exporting, 16-68 cardinality
importing, 16-68 degree of, 6-3
managing, 16-52 CASE expressions, 21-43
recovering from capture errors, 16-55 cell referencing, 22-15
example of, 16-56, 16-57 Change Data Capture, 12-5
removing DDL, 16-58 asynchronous
specifying ending values for, 16-52 Streams apply process and, 16-25
specifying starting values for, 16-52 Streams capture process and, 16-25
stopping capture on DDL, 16-54 benefits for subscribers, 16-9
excluded statements, 16-55 choosing a mode, 16-20
asynchronous change tables effects of stopping on DDL, 16-54
exporting, 16-68 latency, 16-20
importing, 16-68 location of staging database, 16-20
asynchronous HotLog publishing modes of data capture
latency for, 16-20 asynchronous AutoLog, 16-13
location of staging database, 16-20 asynchronous HotLog, 16-12
requirements for, 16-19 synchronous, 16-10
setting database initialization parameters Oracle Data Pump and, 16-67
for, 16-21, 16-22 removing from database, 16-71
asynchronous HotLog publishing restriction on direct-path INSERT
source database performance impact, 16-20 statement, 16-72
asynchronous I/O, 24-60 setting up, 16-18
attributes, 2-3, 10-6 source database performance impact, 16-20
AutoLog change sets, 16-15 static data dictionary views, 16-16
Automatic Storage Management, 4-4 supported export utility, 16-67
supported import utility, 16-67
systemwide triggers installed by, 16-71
B Change Data Capture publisher
bandwidth, 5-2, 24-2 default tablespace for, 16-19
bind variables change sets
with query rewrite, 18-61 AutoLog, 16-15
bitmap indexes, 6-2 AutoLog change sources and, 16-15
nulls and, 6-5 defined, 16-14
on partitioned tables, 6-6 effects of disabling, 16-54
parallel query and DML, 6-3 HotLog, 16-15
bitmap join indexes, 6-6 HOTLOG_SOURCE change sources and, 16-15
block range granules, 5-3 managing asynchronous, 16-52
B-tree indexes, 6-10 synchronous, 16-15
bitmap indexes versus, 6-3 synchronous Change Data Capture and, 16-15
build methods, 8-23 valid combinations with change sources, 16-15
change sources
C asynchronous AutoLog Change Data Capture
and, 16-13
capture errors database instance represented, 16-15
recovering from, 16-55 defined, 16-6
HOTLOG_SOURCE, 16-12 CONNECT role, 16-19
SYNC_SOURCE, 16-10 constraints, 7-2, 10-12
valid combinations with change sets, 16-15 foreign key, 7-5
change tables parallel create table, 24-42
adding a column to, 16-70 RELY, 7-6
control columns, 16-60 states, 7-3
defined, 16-6 unique, 7-4
dropping, 16-67 view, 7-7, 18-46
dropping with active subscribers, 16-67 with partitioning, 7-7
effect of SQL DROP USER CASCADE statement with query rewrite, 18-72
on, 16-67 control columns
exporting, 16-67 used to indicate changed columns in a
granting subscribers access to, 16-64 row, 16-62
importing, 16-67 control columns
importing for Change Data Capture, 16-69 COMMIT_TIMESTAMP$, 16-61
managing, 16-58 CSCN$, 16-61
purging all in a named change set, 16-66 OPERATION$, 16-61
purging all on staging database, 16-66 ROW_ID$, 16-62
purging by name, 16-66 RSID$, 16-61
purging of unneeded data, 16-65 SOURCE_COLMAP$, 16-61
source tables referenced by, 16-59 interpreting, 16-62
tablespaces created in, 16-59 SYS_NC_OID$, 16-62
change-value selection, 16-3 TARGET_COLMAP$, 16-61
columns interpreting, 16-62
cardinality, 6-3 TIMESTAMP$, 16-61
COMMIT_TIMESTAMP$ USERNAME$, 16-62
control column, 16-61 XIDSEQ$, 16-62
common joins, 18-29 XIDSLT$, 16-62
COMPLETE clause, 8-26 XIDUSN$, 16-62
complete refresh, 15-15 cost-based rewrite, 18-3
complex queries CPU
snowflake schemas, 19-5 utilization, 5-2, 24-2
composite CREATE DIMENSION statement, 10-4
columns, 20-20 CREATE INDEX statement, 24-82
partitioning, 5-8 partition attributes, 5-42
partitioning methods, 5-8 rules of parallelism, 24-42
performance considerations, 5-12, 5-14 CREATE MATERIALIZED VIEW statement, 8-21
compression enabling query rewrite, 18-5
See data segment compression, 8-22 CREATE SESSION privilege, 16-19
concatenated groupings, 20-22 CREATE TABLE AS SELECT
concatenated ROLLUP, 20-29 rules of parallelism
concurrent users index-organized tables, 24-3
increasing the number of, 24-49 CREATE TABLE AS SELECT statement, 24-64,
configuration 24-75
bandwidth, 4-2 rules of parallelism
index-organized tables, 24-16 data segment compression, 3-5
CREATE TABLE privilege, 16-19 bitmap indexes, 5-17
CREATE TABLE statement materialized views, 8-22
AS SELECT partitioning, 3-5, 5-16
decision support systems, 24-16 data transformation
rules of parallelism, 24-42 multistage, 14-2
space fragmentation, 24-18 pipelined, 14-3
temporary storage space, 24-18 data warehouse, 8-2
parallelism, 24-16 architectures, 1-5
index-organized tables, 24-3, 24-16 dimension tables, 8-7
CREATE TABLESPACE privilege, 16-19 dimensions, 19-3
CSCN$ fact tables, 8-7
control column, 16-61 logical design, 2-2
CUBE clause, 20-9 partitioned tables, 5-10
partial, 20-11 physical design, 3-2
when to use, 20-9 refresh tips, 15-20
cubes refreshing table data, 24-21
hierarchical, 9-10 star queries, 19-3
CUME_DIST function, 21-12 database
scalability, 24-21
staging, 8-2
D database initialization parameters
data adjusting when Streams values change, 16-25
integrity of determining current setting of, 16-25
parallel DML restrictions, 24-26 retaining settings when database is
partitioning, 5-4 restarted, 16-25
purging, 15-12 database writer process (DBWn)
sufficiency check, 18-33 tuning, 24-80
transformation, 14-8 DATE datatype
transportation, 13-2 partition pruning, 5-32
data compression partitioning, 5-32
See data segment compression, 8-22 date folding
data cubes with query rewrite, 18-49
hierarchical, 20-24 DB_BLOCK_SIZE initialization parameter, 24-60
data densification, 21-45 and parallel query, 24-60
time series calculation, 21-53 DB_FILE_MULTIBLOCK_READ_COUNT
with sparse data, 21-46 initialization parameter, 24-60
data dictionary DBA role, 16-19
asynchronous change data capture and, 16-38 DBA_DATA_FILES view, 24-66
data extraction DBA_EXTENTS view, 24-66
with and without Change Data Capture, 16-5 DBMS_ADVISOR package, 17-2
data manipulation language DBMS_CDC_PUBLISH package, 16-6
parallel DML, 24-19 privileges required to use, 16-19
transaction model for parallel DML, 24-23 DBMS_CDC_PUBLISH.DROP_CHANGE_TABLE
data marts, 1-6 PL/SQL procedure, 16-67
DBMS_CDC_PUBLISH.PURGE PL/SQL dropping, 10-14
procedure, 16-65, 16-66 hierarchies, 2-6
DBMS_CDC_PUBLISH.PURGE_CHANG_SET hierarchies overview, 2-6
PL/SQL procedure, 16-66 multiple, 20-2
DBMS_CDC_PUBLISH.PURGE_CHANGE_TABLE star joins, 19-4
PL/SQL procedure, 16-66 star queries, 19-3
DBMS_CDC_SUBSCRIBE package, 16-7 validating, 10-12
DBMS_CDC_SUBSCRIBE.PURGE_WINDOW with query rewrite, 18-73
PL/SQL procedure, 16-65 direct-path INSERT
DBMS_JOB PL/SQL procedure, 16-65 restrictions, 24-25
DBMS_MVIEW package, 15-16, 15-17 direct-path INSERT statement
EXPLAIN_MVIEW procedure, 8-37 Change Data Capture restriction, 16-72
EXPLAIN_REWRITE procedure, 18-66 disk affinity
REFRESH procedure, 15-14, 15-17 parallel DML, 24-72
REFRESH_ALL_MVIEWS procedure, 15-14 partitions, 24-71
REFRESH_DEPENDENT procedure, 15-14 disk redundancy, 4-3
DBMS_STATS package, 17-4, 18-3 disk striping
decision support systems (DSS) affinity, 24-71
bitmap indexes, 6-2 DISK_ASYNCH_IO initialization parameter, 24-60
disk striping, 24-72 distributed transactions
parallel DML, 24-21 parallel DDL restrictions, 24-4
parallel SQL, 24-16, 24-21 parallel DML restrictions, 24-4, 24-27
performance, 24-21 DML access
scoring tables, 24-22 subscribers, 16-65
default partition, 5-8 DML_LOCKS initialization parameter, 24-58
degree of cardinality, 6-3 downstream capture, 16-35
degree of parallelism, 24-5, 24-32, 24-37, 24-39 drilling down, 10-2
and adaptive multiuser, 24-45 hierarchies, 10-2
between query operations, 24-12 DROP MATERIALIZED VIEW statement, 8-21
parallel SQL, 24-33 prebuilt tables, 8-35
DELETE statement dropping
parallel DELETE statement, 24-38 dimensions, 10-14
DENSE_RANK function, 21-5 materialized views, 8-37
design dropping change tables, 16-67
logical, 3-2 DSS database
physical, 3-2 partitioning indexes, 5-42
dimension tables, 2-5, 8-7
normalized, 10-10
dimensional modeling, 2-3
E
dimensions, 2-6, 10-2, 10-12 ENFORCED mode, 18-7
altering, 10-14 ENQUEUE_RESOURCES initialization
analyzing, 20-2 parameter, 24-58
creating, 10-4 entity, 2-2
definition, 10-2 equipartitioning
dimension tables, 8-7 examples, 5-35
local indexes, 5-34 star joins, 19-4
errors star queries, 19-3
ORA-31424, 16-67 facts, 10-2
ORA-31496, 16-67 FAST clause, 8-26
ETL. See extraction, transformation, and loading fast refresh, 15-15
(ETL), 11-2 restrictions, 8-27
EXCHANGE PARTITION statement, 7-7 with UNION ALL, 15-28
EXECUTE_CATALOG_ROLE privilege, 16-19 FAST_START_PARALLEL_ROLLBACK
EXECUTE_TASK procedure, 17-26 initialization parameter, 24-57
execution plans features, new, 1-xxxix
parallel operations, 24-63 files
star transformations, 19-9 ultralarge, 3-4
EXPLAIN PLAN statement, 18-65, 24-63 FIRST_ROWS(n) hint, 24-88
partition pruning, 5-33 FIRST_VALUE function, 21-21
query parallelization, 24-77 FIRST/LAST functions, 21-26
star transformations, 19-9 FORCE clause, 8-26
EXPLAIN_MVIEW procedure, 17-47 foreign key
exporting constraints, 7-5
a change table, 16-67 joins
asynchronous change sets, 16-68 snowflake schemas, 19-5
asynchronous change tables, 16-68 fragmentation
EXP utility, 12-10 parallel DDL, 24-18
expression matching FREELISTS parameter, 24-80
with query rewrite, 18-49 full partition-wise joins, 5-20
extents full table scans
parallel DDL, 24-18 parallel execution, 24-2
external tables, 14-5 functions
extraction, transformation, and loading (ETL), 11-2 COUNT, 6-5
overview, 11-2 CUME_DIST, 21-12
process, 7-2 DENSE_RANK, 21-5
extractions FIRST_VALUE, 21-21
data files, 12-7 FIRST/LAST, 21-26
distributed operations, 12-11 GROUP_ID, 20-16
full, 12-3 GROUPING, 20-12
incremental, 12-3 GROUPING_ID, 20-15
OCI, 12-9 LAG/LEAD, 21-25
online, 12-4 LAST_VALUE, 21-21
overview, 12-2 linear regression, 21-33
physical, 12-4 NTILE, 21-13
Pro*C, 12-9 parallel execution, 24-28
SQL*Plus, 12-8 PERCENT_RANK, 21-13
RANK, 21-5
ranking, 21-5
F RATIO_TO_REPORT, 21-24
fact tables, 2-5 REGR_AVGX, 21-34
REGR_AVGY, 21-34 overview, 2-6
REGR_COUNT, 21-34 rolling up and drilling down, 10-2
REGR_INTERCEPT, 21-34 high boundary
REGR_SLOPE, 21-34 defined, 16-8
REGR_SXX, 21-35 hints
REGR_SXY, 21-35 FIRST_ROWS(n), 24-88
REGR_SYY, 21-35 PARALLEL, 24-34
reporting, 21-22 PARALLEL_INDEX, 24-34
ROW_NUMBER, 21-15 query rewrite, 18-5, 18-8
WIDTH_BUCKET, 21-39, 21-41 histograms
windowing, 21-15 creating with user-defined buckets, 21-44
HotLog change sets, 16-15
HOTLOG_SOURCE change sources, 16-12
G change sets and, 16-15
global hypothetical rank, 21-32
indexes, 24-79
global indexes
partitioning, 5-37 I
managing partitions, 5-38 importing
summary of index types, 5-39 a change table, 16-67, 16-69
granules, 5-3 asynchronous change sets, 16-68
block range, 5-3 asynchronous change tables, 16-68
partition, 5-4 data into a source table, 16-69
GROUP_ID function, 20-16 indexes
grouping bitmap indexes, 6-6
compatibility check, 18-34 bitmap join, 6-6
conditions, 18-74 B-tree, 6-10
GROUPING function, 20-12 cardinality, 6-3
when to use, 20-15 creating in parallel, 24-81
GROUPING_ID function, 20-15 global, 24-79
GROUPING_SETS expression, 20-17 global partitioned indexes, 5-37
groups, instance, 24-35 managing partitions, 5-38
local, 24-79
GV$FILESTAT view, 24-65 local indexes, 5-34
nulls and, 6-5
parallel creation, 24-81, 24-82
H parallel DDL storage, 24-18
hash partitioning, 5-7 parallel local, 24-82
HASH_AREA_SIZE initialization parameter partitioned tables, 6-6
and parallel execution, 24-56 partitioning, 5-9
hierarchical cubes, 9-10, 20-29 partitioning guidelines, 5-41
in SQL, 20-29 partitions, 5-33
hierarchies, 10-2 index-organized tables
how used, 2-6 parallel CREATE, 24-3, 24-16
multiple, 10-9 parallel queries, 24-14
initcdc.sql script, 16-72 J
initialization parameters
DB_BLOCK_SIZE, 24-60 Java
DB_FILE_MULTIBLOCK_READ_ used by Change Data Capture, 16-71
COUNT, 24-60 JOB_QUEUE_PROCESSES initialization
DISK_ASYNCH_IO, 24-60 parameter, 15-20
DML_LOCKS, 24-58 join compatibility, 18-28
ENQUEUE_RESOURCES, 24-58 joins
FAST_START_PARALLEL_ROLLBACK, 24-57 full partition-wise, 5-20
HASH_AREA_SIZE, 24-56 partial partition-wise, 5-26
JOB_QUEUE_PROCESSES, 15-20 partition-wise, 5-20
LARGE_POOL_SIZE, 24-50 star joins, 19-4
LOG_BUFFER, 24-57 star queries, 19-4
NLS_LANGUAGE, 5-31
NLS_SORT, 5-31 K
OPTIMIZER_MODE, 15-21, 18-6
key lookups, 14-21
PARALLEL_ADAPTIVE_MULTI_USER, 24-46
keys, 8-7, 19-4
PARALLEL_EXECUTION_MESSAGE_
SIZE, 24-56
PARALLEL_MAX_SERVERS, 15-21, 24-7, 24-48 L
PARALLEL_MIN_PERCENT, 24-35, 24-48, LAG/LEAD functions, 21-25
24-55 LARGE_POOL_SIZE initialization
PARALLEL_MIN_SERVERS, 24-6, 24-7, 24-49 parameter, 24-50
PGA_AGGREGATE_TARGET, 15-21 LAST_VALUE function, 21-21
QUERY_REWRITE_ENABLED, 18-5, 18-6 level relationships, 2-6
QUERY_REWRITE_INTEGRITY, 18-6 purpose, 2-6
SHARED_POOL_SIZE, 24-50 levels, 2-6
STAR_TRANSFORMATION_ENABLED, 19-6 linear regression functions, 21-33
TAPE_ASYNCH_IO, 24-60 list partitioning, 5-7
TIMED_STATISTICS, 24-66 LOB datatypes
TRANSACTIONS, 24-57 restrictions
INSERT statement parallel DDL, 24-3, 24-16
functionality, 24-83 parallel DML, 24-25, 24-26
parallelizing INSERT ... SELECT, 24-40 local indexes, 5-34, 5-39, 6-3, 6-6, 24-79
instance groups for parallel operations, 24-35 equipartitioning, 5-34
instances locks
instance groups, 24-35 parallel DML, 24-25
integrity constraints, 7-2 LOG_BUFFER initialization parameter
integrity rules and parallel execution, 24-57
parallel DML restrictions, 24-26 LOGGING clause, 24-80
invalidating logging mode
materialized views, 9-14 parallel DDL, 24-3, 24-16, 24-17
I/O logical design, 3-2
asynchronous, 24-60 logs
parallel execution, 5-2, 24-2 materialized views, 8-31
lookup tables rewrites
See dimension tables, 8-7 enabling, 18-5
low boundary schema design, 8-8
defined, 16-8 schema design guidelines, 8-8
security, 9-14
set operators, 9-11
M storage characteristics, 8-22
manual tuning, 17-47
refresh, 15-17 types of, 8-12
manual refresh uses for, 8-2
with DBMS_MVIEW package, 15-16 with VPD, 9-15
massively parallel processing (MPP) MAXVALUE
affinity, 24-71, 24-72 partitioned tables and indexes, 5-31
massively parallel systems, 5-2, 24-2 measures, 8-7, 19-4
materialized view logs, 8-31 memory
materialized views configure at 2 levels, 24-55
aggregates, 8-12 MERGE statement, 15-8
altering, 9-17 Change Data Capture restriction, 16-72
build methods, 8-23 MINIMUM EXTENT parameter, 24-18
checking status, 15-22 MODEL clause, 22-2
containing only joins, 8-15 cell referencing, 22-15
creating, 8-20 data flow, 22-4
data segment compression, 8-22 keywords, 22-14
delta joins, 18-31 parallel execution, 22-42
dropping, 8-35, 8-37 rules, 22-17
invalidating, 9-14 monitoring
logs, 12-7 parallel processing, 24-65
naming, 8-22 refresh, 15-21
nested, 8-17 mortgage calculation, 22-51
OLAP, 9-9 MOVE PARTITION statement
OLAP cubes, 9-9 rules of parallelism, 24-42
Partition Change Tracking (PCT), 9-2 multiple archiver processes, 24-80
partitioned tables, 15-29 multiple hierarchies, 10-9
partitioning, 9-2 MV_CAPABILITIES_TABLE table, 8-38
prebuilt, 8-20
query rewrite
hints, 18-5, 18-8 N
matching join graphs, 8-24 National Language Support (NLS)
parameters, 18-5 DATE datatype and partitions, 5-32
privileges, 18-9 nested materialized views, 8-17
refresh dependent, 15-19 refreshing, 15-27
refreshing, 8-26, 15-14 restrictions, 8-20
refreshing all, 15-18 net present value
registration, 8-34 calculating, 22-48
restrictions, 8-24 NEVER clause, 8-26
new features, 1-xxxix indexes, 5-40
NLS_LANG environment variable, 5-31 partitioned indexes, 5-40
NLS_LANGUAGE parameter, 5-31 optimizations
NLS_SORT parameter parallel SQL, 24-8
no effect on partitioning keys, 5-31 query rewrite
NOAPPEND hint, 24-83 enabling, 18-5
NOARCHIVELOG mode, 24-81 hints, 18-5, 18-8
nodes matching join graphs, 8-24
disk affinity in Real Application Clusters, 24-71 query rewrites
NOLOGGING clause, 24-75, 24-80, 24-82 privileges, 18-9
with APPEND hint, 24-83 optimizer
NOLOGGING mode with rewrite, 18-2
parallel DDL, 24-3, 24-16, 24-17 OPTIMIZER_MODE initialization
nonprefixed indexes, 5-36, 5-40 parameter, 15-21, 18-6
global partitioned indexes, 5-38 ORA-31424 error, 16-67
nonvolatile data, 1-3 ORA-31496 error, 16-67
NOPARALLEL attribute, 24-74 Oracle Data Pump
NOREWRITE hint, 18-5, 18-8 using with Change Data Capture, 16-67
NTILE function, 21-13 Oracle Real Application Clusters
nulls disk affinity, 24-71
indexes and, 6-5 instance groups, 24-35
partitioned tables and indexes, 5-32 ORDER BY clause, 8-31
outer joins
with query rewrite, 18-73
O
object types
parallel query, 24-15 P
restrictions, 24-15 packages
restrictions DBMS_ADVISOR, 17-2
parallel DDL, 24-3, 24-16
parallel DML, 24-25, 24-26
OLAP, 23-2 PARALLEL clause, 24-83, 24-84
materialized views, 9-9 parallelization rules, 24-37
OLAP cubes PARALLEL CREATE INDEX statement, 24-57
materialized views, 9-9 PARALLEL CREATE TABLE AS SELECT statement
OLTP database resources required, 24-57
batch jobs, 24-22 parallel DDL, 24-15
parallel DML, 24-21 extent allocation, 24-18
partitioning indexes, 5-41 parallelization rules, 24-37
ON COMMIT clause, 8-25 partitioned tables and indexes, 24-16
ON DEMAND clause, 8-25 restrictions
OPERATION$ LOBs, 24-3, 24-16
control column, 16-61 object types, 24-3, 24-15, 24-16
optimization parallel delete, 24-38
partition pruning parallel DELETE statement, 24-38
parallel DML, 24-19 parallel update, 24-38
applications, 24-21 parallel UPDATE statement, 24-38
bitmap indexes, 6-3 PARALLEL_ADAPTIVE_MULTI_USER
degree of parallelism, 24-37, 24-39 initialization parameter, 24-46
enabling PARALLEL DML, 24-22 PARALLEL_EXECUTION_MESSAGE_SIZE
lock and enqueue resources, 24-25 initialization parameter, 24-56
parallelization rules, 24-37 PARALLEL_INDEX hint, 24-34
recovery, 24-24 PARALLEL_MAX_SERVERS initialization
restrictions, 24-25 parameter, 15-21, 24-7, 24-48
object types, 24-15, 24-25, 24-26 and parallel execution, 24-48
remote transactions, 24-27 PARALLEL_MIN_PERCENT initialization
transaction model, 24-23 parameter, 24-35, 24-48, 24-55
parallel execution PARALLEL_MIN_SERVERS initialization
full table scans, 24-2 parameter, 24-6, 24-7, 24-49
index creation, 24-81 PARALLEL_THREADS_PER_CPU initialization
interoperator parallelism, 24-12 parameter, 24-47
intraoperator parallelism, 24-12 parallelism, 5-2
introduction, 5-2 degree, 24-5, 24-32
I/O parameters, 24-60 degree, overriding, 24-74
plans, 24-63 enabling for tables and queries, 24-45
query optimization, 24-87 interoperator, 24-12
resource parameters, 24-55 intraoperator, 24-12
rewriting SQL, 24-74 parameters
solving problems, 24-74 FREELISTS, 24-80
tuning, 5-2, 24-2 partition
PARALLEL hint, 24-34, 24-74, 24-83 default, 5-8
parallelization rules, 24-37 granules, 5-4
UPDATE and DELETE, 24-38 Partition Change Tracking (PCT), 9-2, 15-29, 18-52
parallel partition-wise joins with Pmarkers, 18-55
performance considerations, 5-29 partitioned outer join, 21-45
parallel query, 24-13 partitioned tables
bitmap indexes, 6-3 data warehouses, 5-10
index-organized tables, 24-14 materialized views, 15-29
object types, 24-15 partitioning, 12-6
restrictions, 24-15 composite, 5-8
parallelization rules, 24-37 data, 5-4
parallel SQL data segment compression, 5-16
allocating rows to parallel execution bitmap indexes, 5-17
servers, 24-9 hash, 5-7
degree of parallelism, 24-33 indexes, 5-9
instance groups, 24-35 list, 5-7
number of parallel execution servers, 24-6 materialized views, 9-2
optimizer, 24-8 prebuilt tables, 9-7
parallelization rules, 24-37 range, 5-5
shared server, 24-6 range-list, 5-14
partitions DBMS_CDC_PUBLISH.PURGE, 16-65, 16-66
affinity, 24-71 DBMS_CDC_PUBLISH.PURGE_CHANGE_
bitmap indexes, 6-6 SET, 16-66
DATE datatype, 5-32 DBMS_CDC_PUBLISH.PURGE_CHANGE_
equipartitioning TABLE, 16-66
examples, 5-35 DBMS_CDC_SUBSCRIBE.PURGE_
local indexes, 5-34 WINDOW, 16-65
global indexes, 5-37 DBMS_JOB, 16-65
local indexes, 5-34 Pmarkers
multicolumn keys, 5-33 with PCT, 18-55
nonprefixed indexes, 5-36, 5-40 prebuilt materialized views, 8-20
parallel DDL, 24-16 predicates
partition bounds, 5-31 partition pruning
partition pruning indexes, 5-40
DATE datatype, 5-32 prefixed indexes, 5-35, 5-39
disk striping and, 24-72 PRIMARY KEY constraints, 24-82
indexes, 5-40 privileges
partitioning indexes, 5-33, 5-41 SQLAccess Advisor, 17-9
partitioning keys, 5-30 privileges required
physical attributes, 5-42 to publish change data, 16-19
prefixed indexes, 5-35 procedures
pruning, 5-19 EXPLAIN_MVIEW, 17-47
range partitioning TUNE_MVIEW, 17-47
disk striping and, 24-72 process monitor process (PMON)
restrictions parallel DML process recovery, 24-24
datatypes, 5-32 processes
rules of parallelism, 24-42 and memory contention in parallel
partition-wise joins, 5-20 processing, 24-48
benefits of, 5-28 pruning
full, 5-20 partitions, 5-19, 24-72
partial, 5-26 using DATE columns, 5-20
PERCENT_RANK function, 21-13 pruning partitions
performance DATE datatype, 5-32
DSS database, 24-21 EXPLAIN PLAN, 5-33
prefixed and nonprefixed indexes, 5-40 indexes, 5-40
PGA_AGGREGATE_TARGET initialization publication
parameter, 15-21 defined, 16-6
physical design, 3-2 publishers
structures, 3-3 components associated with, 16-7
pivoting, 14-23 defined, 16-5
plans determining the source tables, 16-6
star transformations, 19-9 privileges for reading views, 16-16
PL/SQL procedures purpose, 16-6
DBMS_CDC_PUBLISH_DROP_CHANGE_ table partitioning properties and, 16-59
TABLE, 16-67 tasks, 16-6
publishing VPD, 9-16
asynchronous AutoLog mode when it occurs, 18-4
step-by-step example, 16-35 with bind variables, 18-61
asynchronous HotLog mode with DBMS_MVIEW package, 18-66
step-by-step example, 16-30 with expression matching, 18-49
synchronous mode with inline views, 18-44
step-by-step example, 16-27 with partially stale materialized views, 18-35
publishing change data with selfjoins, 18-45
preparations for, 16-18 with set operator materialized views, 18-62
privileges required, 16-19 with view constraints, 18-46
purging change tables QUERY_REWRITE_ENABLED initialization
automatically, 16-65 parameter, 18-5, 18-6
by name, 16-66 QUERY_REWRITE_INTEGRITY initialization
in a named changed set, 16-66 parameter, 18-6
on the staging database, 16-66
publishers, 16-66
subscribers, 16-65
R
purging data, 15-12 range partitioning, 5-5
key comparison, 5-31, 5-33
partition bounds, 5-31
Q performance considerations, 5-9
queries range-list partitioning, 5-14
ad hoc, 24-16 RANK function, 21-5
enabling parallelism for, 24-45 ranking functions, 21-5
star queries, 19-3 RATIO_TO_REPORT function, 21-24
query delta joins, 18-31 REBUILD INDEX PARTITION statement
query optimization, 24-87 rules of parallelism, 24-42
parallel execution, 24-87 REBUILD INDEX statement
query rewrite rules of parallelism, 24-42
advanced, 18-75 recovery
checks made by, 18-28 from asychronous change set capture
controlling, 18-6 errors, 16-55
correctness, 18-7 parallel DML, 24-24
date folding, 18-49 redo buffer allocation retries, 24-57
enabling, 18-5 redo log files
hints, 18-5, 18-8 archived
matching join graphs, 8-24 asynchronous Change Data Capture
methods, 18-11 and, 16-48
parameters, 18-5 determining which are no longer needed by
privileges, 18-9 Change Data Capture, 16-48
restrictions, 8-25 reference tables
using equivalences, 18-75 See dimension tables, 8-7
using GROUP BY extensions, 18-39 refresh
using nested materialized views, 18-38 monitoring, 15-21
using PCT, 18-52 options, 8-25
scheduling, 15-22 rewrites
with UNION ALL, 15-28 hints, 18-8
refreshing parameters, 18-5
materialized views, 15-14 privileges, 18-9
nested materialized views, 15-27 query optimizations
partitioning, 15-2 hints, 18-5, 18-8
REGR_AVGX function, 21-34 matching join graphs, 8-24
REGR_AVGY function, 21-34 rmcdc.sql script, 16-71
REGR_COUNT function, 21-34 rolling up hierarchies, 10-2
REGR_INTERCEPT function, 21-34 ROLLUP, 20-6
REGR_R2 function, 21-35 concatenated, 20-29
REGR_SLOPE function, 21-34 partial, 20-8
REGR_SXX function, 21-35 when to use, 20-6
REGR_SXY function, 21-35 root level, 2-6
REGR_SYY function, 21-35 ROW_ID$
regression control column, 16-62
detecting, 24-62 ROW_NUMBER function, 21-15
RELY constraints, 7-6 RSID$
remote transactions control column, 16-61
parallel DML and DDL restrictions, 24-4 rules
removing in MODEL clause, 22-17
Change Data Capture from source in SQL modeling, 22-17
database, 16-71 order of evaluation, 22-21
replication
restrictions
parallel DML, 24-25
S
reporting functions, 21-22 sar UNIX command, 24-71
RESOURCE role, 16-19 scalability
resources batch jobs, 24-22
consumption, parameters affecting, 24-55, 24-57 parallel DML, 24-21
limiting for users, 24-49 scalable operations, 24-77
limits, 24-48 scans
parallel query usage, 24-55 full table
restrictions parallel query, 24-2
direct-path INSERT, 24-25 schema-level export operations, 16-68
fast refresh, 8-27 schema-level import operations, 16-68
nested materialized views, 8-20 schemas, 19-2
parallel DDL, 24-3, 24-16 design guidelines for materialized views, 8-8
parallel DML, 24-25 snowflake, 2-3
remote transactions, 24-27 star, 2-3
partitions third normal form, 19-2
datatypes, 5-32 scripts
query rewrite, 8-25 initcdc.sql for Change Data Capture, 16-72
result set, 19-6 rmcdc.sql for Change Data Capture, 16-71
REWRITE hint, 18-5, 18-8 SELECT_CATALOG_ROLE privilege, 16-17, 16-19
sessions SQL Workload Journal, 17-20
enabling parallel DML, 24-22 SQL*Loader, 24-4
set operators SQLAccess Advisor, 17-2, 17-10
materialized views, 9-11 constants, 17-38
shared server creating a task, 17-4
parallel SQL execution, 24-6 defining the workload, 17-4
SHARED_POOL_SIZE initialization EXECUTE_TASK procedure, 17-26
parameter, 24-50 generating the recommendations, 17-6
SHARED_POOL_SIZE parameter, 24-50 implementing the recommendations, 17-6
simultaneous equations, 22-49 maintaining workloads, 17-22
single table aggregate requirements, 8-15 privileges, 17-9
skewing parallel DML workload, 24-36 quick tune, 17-36
SMP architecture recommendation process, 17-32
disk affinity, 24-72 steps in using, 17-4
snowflake schemas, 19-5 workload objects, 17-12
complex queries, 19-5 SQLAccess Advisor workloads
SORT_AREA_SIZE initialization parameter maintaining, 17-22
and parallel execution, 24-56 staging
source database areas, 1-6
defined, 16-6 databases, 8-2
source systems, 12-2 files, 8-2
source tables staging database
importing for Change Data Capture, 16-69 defined, 16-6
referenced by change tables, 16-59 STALE_TOLERATED mode, 18-7
SOURCE_COLMAP$ star joins, 19-4
control column, 16-61 star queries, 19-3
interpreting, 16-62 star transformation, 19-6
space management star schemas
MINIMUM EXTENT parameter, 24-18 advantages, 2-4
parallel DDL, 24-17 defining fact tables, 2-5
sparse data dimensional model, 2-4, 19-3
data densification, 21-46 star transformations, 19-6
SPLIT PARTITION clause restrictions, 19-11
rules of parallelism, 24-42 STAR_TRANSFORMATION_ENABLED
SQL GRANT statement, 16-64 initialization parameter, 19-6
SQL modeling, 22-2 statistics, 18-74
cell referencing, 22-15 estimating, 24-63
keywords, 22-14 operating system, 24-71
order of evaluation, 22-21 storage
performance, 22-42 fragmentation in parallel DDL, 24-18
rules, 22-17 index partitions, 5-42
rules and restrictions, 22-40 STORAGE clause
SQL REVOKE statement, 16-64 parallel execution, 24-18
SQL statements Streams apply parallelism value
parallelizing, 24-3, 24-8 determining, 16-26
Streams apply process synchronous Change Data Capture
asynchronous Change Data Capture and, 16-25 change sets and, 16-15
Streams capture parallelism value synchronous change sets
determining, 16-26 defined, 16-15
Streams capture process disabling, 16-53
asynchronous Change Data Capture and, 16-25 enabling, 16-53
striping, 4-3 synchronous publishing
subpartition latency for, 16-20
mapping, 5-14 location of staging database, 16-20
template, 5-13 requirements for, 16-27
subqueries setting database initialization parameters
in DDL statements, 24-16 for, 16-21
subscriber view source database performace impact, 16-20
defined, 16-8 SYS_NC_OID$
returning DML changes in order, 16-60 control column, 16-62
subscribers system monitor process (SMON)
access to change tables, 16-64 parallel DML system recovery, 24-24
ALL_PUBLISHED_COLUMNS view, 16-70
components associated with, 16-8
controlling access to tables, 16-64
T
defined, 16-5 table differencing, 16-2
DML access, 16-65 table partitioning
privileges, 16-8 publisher and, 16-59
purpose, 16-7 table queues, 24-67
retrieve change data from the subscriber tables
views, 16-8 detail tables, 8-7
tasks, 16-7 dimension tables (lookup tables), 8-7
subscribing dimensions
step-by-step example, 16-42 star queries, 19-3
subscription windows enabling parallelism for, 24-45
defined, 16-8 external, 14-5
subscriptions fact tables, 8-7
changes to change tables and, 16-70 star queries, 19-3
defined, 16-7 historical, 24-22
effect of SQL DROP USER CASCADE statement lookup, 19-3
on, 16-67 parallel creation, 24-16
summary management parallel DDL storage, 24-18
components, 8-5 refreshing in data warehouse, 24-21
summary tables, 2-5 STORAGE clause with parallel execution, 24-18
supplemental logging summary or rol, 24-16
asynchronous Change Data Capture, 16-50 tablespace
asynchronous Change Data Capture and, 16-11 specifying default for Change Data Capture
symmetric multiprocessors, 5-2, 24-2 publisher, 16-19
SYNC_SET predefined change set, 16-15 tablespaces
SYNC_SOURCE change source, 16-10 change tables and, 16-59
transportable, 12-5, 13-3, 13-6 U
TAPE_ASYNCH_IO initialization parameter, 24-60
TARGET_COLMAP$ ultralarge files, 3-4
control column, 16-61 unique
interpreting, 16-62 constraints, 7-4, 24-82
templates identifier, 2-3, 3-2
SQLAccess Advisor, 17-10 UNLIMITED TABLESPACE privilege, 16-19
temporary segments update frequencies, 8-11
parallel DDL, 24-18 UPDATE statement
text match, 18-11 parallel UPDATE statement, 24-38
with query rewrite, 18-73 update windows, 8-11
third normal form user resources
queries, 19-3 limiting, 24-49
schemas, 19-2 USERNAME$
time series calculations, 21-53 control column, 16-62
TIMED_STATISTICS initialization
parameter, 24-66 V
TIMESTAMP$
V$FILESTAT view
control column, 16-61
and parallel query, 24-66
timestamps, 12-6
V$PARAMETER view, 24-66
TO_DATE function
V$PQ_SESSTAT view, 24-64, 24-66
partitions, 5-32
V$PQ_SYSSTAT view, 24-64
transactions
V$PQ_TQSTAT view, 24-64, 24-67
distributed
V$PX_PROCESS view, 24-65, 24-66
parallel DDL restrictions, 24-4
V$PX_SESSION view, 24-65
parallel DML restrictions, 24-4, 24-27
V$PX_SESSTAT view, 24-65
TRANSACTIONS initialization parameter, 24-57
V$SESSTAT view, 24-68, 24-71
transformations, 14-2
V$SYSSTAT view, 24-57, 24-68, 24-80
scenarios, 14-21
validating dimensions, 10-12
SQL and PL/SQL, 14-8
VALUES LESS THAN clause, 5-31
SQL*Loader, 14-5
MAXVALUE, 5-32
transportable tablespaces, 12-5, 13-3, 13-6
view constraints, 7-7, 18-46
transportation
views
definition, 13-2
DBA_DATA_FILES, 24-66
distributed operations, 13-2
DBA_EXTENTS, 24-66
flat files, 13-2
V$FILESTAT, 24-66
triggers, 12-6
V$PARAMETER, 24-66
installed by Change Data Capture, 16-71
V$PQ_SESSTAT, 24-66
restrictions, 24-27
V$PQ_TQSTAT, 24-67
parallel DML, 24-25
V$PX_PROCESS, 24-66
TRUSTED mode, 18-7
V$SESSTAT, 24-68, 24-71
TUNE_MVIEW procedure, 17-47
V$SYSSTAT, 24-68
two-phase commit, 24-57
vmstat UNIX command, 24-71
VPD
and materialized views, 9-15
restrictions with materialized views, 9-16
W
WIDTH_BUCKET function, 21-39, 21-41
windowing functions, 21-15
workload objects, 17-12
workloads
deleting, 17-14
distribution, 24-64
skewing, 24-36
X
XIDSEQ$
control column, 16-62
XIDSLT$
control column, 16-62
XIDUSN$
control column, 16-62