Lecture 4, Data Cube Computation: CSI 4352, Introduction To Data Mining
Young-Rae Cho
Associate Professor
Department of Computer Science
Baylor University
Full Cube
Full materialization
Materializing all the cells of all of the cuboids for a given data cube
Issues in time and space
Iceberg Cube
Partial materialization
Materializing the cells of only interesting cuboids
Materializing only the cells in a cuboid whose measure value is above
the minimum threshold
Closed Cube
Materializing only closed cells
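The difference between full and iceberg materialization can be sketched on a toy table. This is a minimal illustration, not an efficient algorithm: the function names `full_cube` and `iceberg_cube`, the two-dimensional sample rows, and the use of count as the measure with a "count ≥ min_sup" test are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def full_cube(rows, n_dims):
    """Count every cell of every cuboid; '*' marks an aggregated dimension."""
    cells = Counter()
    for row in rows:
        # Each subset of kept dimensions corresponds to one cuboid.
        for k in range(n_dims + 1):
            for kept in combinations(range(n_dims), k):
                cell = tuple(row[d] if d in kept else "*" for d in range(n_dims))
                cells[cell] += 1
    return dict(cells)

def iceberg_cube(rows, n_dims, min_sup):
    """Keep only the cells whose count reaches min_sup (assumed measure: count)."""
    return {c: n for c, n in full_cube(rows, n_dims).items() if n >= min_sup}

# Hypothetical 2-D base table.
rows = [("a1", "b1"), ("a1", "b2"), ("a2", "b1")]
full = full_cube(rows, 2)
iceberg = iceberg_cube(rows, 2, min_sup=2)
print(len(full), iceberg)
```

The sketch computes the full cube first and filters afterwards, which is exactly the cost the later algorithms try to avoid.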
11/19/2015
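Example
[Figure: lattice of cuboids over the dimensions time, item, location and supplier, from the apex cuboid "all" through 2-D cuboids such as (time,supplier) and (item,supplier) to the 3-D cuboids (time,item,location), (time,location,supplier), (time,item,supplier) and (item,location,supplier)]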
Computation Techniques
Aggregating
Aggregating from the smallest child cuboid
Caching
Caching the result of a cuboid for the computation of other cuboids to
reduce disk I/O
Pruning
Apriori pruning: discards the cells whose support is lower than the minimum threshold
Multi-Way Array Aggregation
Features
Array-based “bottom-up” approach
Uses multi-dimensional chunks
No direct tuple comparisons
Simultaneous aggregation on multiple dimensions
Intermediate aggregate values are re-used for computing ancestor cuboids
Full materialization (No iceberg optimization, No Apriori pruning)
[Figure: cuboid lattice with all; A, B, C; AB, AC, BC; ABC]
Aggregation Strategy
Data addressing
Uses chunk id and offset
Multi-way Aggregation
Computes aggregates in multi-way
Visits chunks in the order
• to minimize memory access
• to minimize memory space
[Figure: 3-D array over A (a0 to a3), B (b0 to b3) and C (c0 to c3), partitioned into 4×4×4 chunks numbered 1 to 64]
Aggregation Process
Example
[Figure: the 4×4×4 chunk array over A (a0 to a3), B (b0 to b3) and C (c0 to c3), with chunks numbered 1 to 64]
Example
Suppose the data size on each dimension A, B and C is 40, 400 and 4000, respectively, so that the 4 partitions of each dimension have size 10, 100 and 1000
Minimum memory required when traversing the chunks in the order 1, 2, 3, 4, 5, …, 64
• the whole AB plane: 40×400
• one row of chunks of the AC plane: 40×1000
• one chunk of the BC plane: 100×1000
Total memory required is 100×1000 + 40×1000 + 40×400 = 156,000
[Figure: the 4×4×4 chunk array over A, B and C, with chunks numbered 1 to 64]
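The memory requirement for a given chunk traversal order can be checked with a short calculation. The sketch below assumes 4 equal partitions per dimension (as in the 4×4×4 chunk figure) and a scan in which the first dimension in `order` varies fastest; the function name `plane_memory` is hypothetical.

```python
from itertools import permutations

def plane_memory(sizes, chunks_per_dim, order):
    """Memory units needed to aggregate the three 2-D planes in one scan.

    order lists the dimensions from fastest- to slowest-varying.
    """
    chunk = {d: sizes[d] // chunks_per_dim for d in sizes}
    fast, mid, slow = order
    # The (fast, mid) plane is only complete after the whole scan: keep it all.
    whole_plane = sizes[fast] * sizes[mid]
    # The (fast, slow) plane needs one row of chunks at a time.
    one_row = sizes[fast] * chunk[slow]
    # The (mid, slow) plane needs a single chunk at a time.
    one_chunk = chunk[mid] * chunk[slow]
    return whole_plane + one_row + one_chunk

sizes = {"A": 40, "B": 400, "C": 4000}
memory_abc = plane_memory(sizes, 4, ("A", "B", "C"))
memory_cba = plane_memory(sizes, 4, ("C", "B", "A"))
best_order = min(permutations(sizes), key=lambda o: plane_memory(sizes, 4, o))
print(memory_abc, memory_cba, best_order)
```

Under these assumptions the order 1, 2, 3, …, 64 (A fastest, C slowest) reproduces the 156,000 figure, while the reversed order needs more than ten times as much memory.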
Summary of Multi-Way
Method
Cuboids should be sorted and computed according to the data size
on each dimension
Keeps the smallest plane in the main memory, fetches and computes
only one chunk at a time for the largest plane
Limitations
Full materialization
Computes well only for a small number of dimensions
( high dimensional data → partial materialization )
Reference
Zhao, Y., Deshpande, P.M. and Naughton, J.F., “An Array-Based Algorithm
for Simultaneous Multidimensional Aggregates”, In Proceedings of SIGMOD
(1997)
BUC (Bottom-Up Computation)
Characteristics
“Top-down” approach
Partial materialization (iceberg cube computation)
Divides dimensions into partitions and facilitates iceberg pruning
No simultaneous aggregation
[Figure: cuboid lattice numbered in processing order: 1 all; 2 A, 10 B, 14 C, 16 D; 3 AB, 7 AC, 9 AD, 11 BC, 13 BD, 15 CD; 5 ABCD]
Partitioning
Sorts data values
Partitions into blocks that fit in memory
Apriori Pruning
For each partition
• If it does not satisfy min_sup, its descendants are pruned
• If it satisfies min_sup, it is materialized and a recursive call is made that includes the next dimension
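The partition-and-prune recursion described above can be sketched as follows. This is a simplified illustration rather than the published algorithm: it groups rows with a dictionary instead of sorting them into blocks, uses count as the measure, and the function name `buc` and the sample rows are assumptions.

```python
from collections import defaultdict

def buc(rows, n_dims, min_sup, start=0, cell=None, out=None):
    """Recursively partition rows by dimension, pruning small partitions."""
    if cell is None:
        cell = ["*"] * n_dims
    if out is None:
        out = {}
    if len(rows) < min_sup:
        return out  # Apriori pruning: no descendant cell can reach min_sup
    out[tuple(cell)] = len(rows)  # materialize the current cell
    for d in range(start, n_dims):
        # Partition the current rows on dimension d.
        partitions = defaultdict(list)
        for row in rows:
            partitions[row[d]].append(row)
        for value, part in partitions.items():
            cell[d] = value
            buc(part, n_dims, min_sup, d + 1, cell, out)
        cell[d] = "*"  # restore before moving on to the next dimension
    return out

# Hypothetical base table with four dimensions A, B, C, D.
rows = [
    ("a1", "b1", "c1", "d1"),
    ("a1", "b1", "c4", "d3"),
    ("a1", "b2", "c2", "d2"),
    ("a2", "b3", "c3", "d4"),
    ("a2", "b4", "c3", "d4"),
]
cube = buc(rows, 4, min_sup=2)
print(len(cube), cube[("a1", "b1", "*", "*")])
```

Because a partition that fails min_sup is never expanded, cells such as (a1, b2, *, *) and all of their descendants are skipped without being counted.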
Iceberg cube computation
Description in SQL
compute cube iceberg_cube as
select A, B, C, D, count(*)
from R
cube by A, B, C, D
having count(*) ≥ min_sup
Example
(*, *, *, *) 0-D cuboid
Summary of BUC
Method
Computation of sparse data cubes
Limitations
Sensitive to the order of dimensions
→ The most discriminating dimension should be used first
→ Dimensions should be in the order of decreasing cardinality
→ Dimensions should be in the order of increasing maximum number
of duplicates
Reference
Beyer, K. and Ramakrishnan, R., “Bottom-Up Computation of Sparse and
Iceberg CUBEs”, In Proceedings of SIGMOD (1999)
Star-Cubing
Characteristics of Star-Cubing
Integrated method of “top-down” and “bottom-up” cube computation
Explores both multidimensional aggregation (as Multi-way) and apriori pruning (as BUC)
Explores shared dimensions
e.g., dimension A is a shared dimension of ACD and AD
e.g., ABD/AB means ABD has the shared dimension AB
[Figure: cuboid lattice annotated with shared dimensions: all; A/A, B/B, C/C, D; AB/AB, AC/AC, AD/A, BC/BC, BD/B, CD; ABCD]
Cuboid Trees
Tree structure to represent cuboids
Base cuboid tree, 3-D cuboid trees, 2-D cuboid trees, …
Each level represents a dimension
Each node represents an attribute
Each node includes the attribute value
Example
• count(a1,b1,*,*) = 10
• count(a1,b1,c1,*) = 5
[Figure: fragment of a base cuboid tree with nodes root: 100, c1: 5, c2: 5, d1: 2, d2: 3]
Star Nodes
If the single dimensional aggregate does not satisfy min_sup, no need to consider the node in the iceberg cube computation
The nodes are replaced by *
Example (min_sup = 2)
Base table
A   B   C   D   count
a1  b1  c1  d1  1
a1  b1  c4  d3  1
a1  b2  c2  d2  1
a2  b3  c3  d4  1
a2  b4  c3  d4  1
After replacing the star nodes
A   B   C   D   count
a1  b1  *   *   1
a1  b1  *   *   1
a1  *   *   *   1
a2  *   c3  d4  1
a2  *   c3  d4  1
After merging identical rows
A   B   C   D   count
a1  b1  *   *   2
a1  *   *   *   1
a2  *   c3  d4  2
Star Tree
A cuboid tree that is compressed using star nodes
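The star-node reduction of the example table can be reproduced with a few lines of code. This is a sketch under the assumption that the measure is count; the function name `star_reduce` is hypothetical, and the result is the compressed table from which a star tree would be built, not the tree itself.

```python
from collections import Counter

def star_reduce(rows, min_sup):
    """Replace values whose 1-D count is below min_sup by '*', then merge rows."""
    n_dims = len(rows[0])
    # One-dimensional aggregates: how often each value occurs in its dimension.
    counts = [Counter(row[d] for row in rows) for d in range(n_dims)]
    reduced = Counter()
    for row in rows:
        key = tuple(v if counts[d][v] >= min_sup else "*" for d, v in enumerate(row))
        reduced[key] += 1  # identical reduced rows collapse into one
    return dict(reduced)

# Base table from the example (each row has count 1).
rows = [
    ("a1", "b1", "c1", "d1"),
    ("a1", "b1", "c4", "d3"),
    ("a1", "b2", "c2", "d2"),
    ("a2", "b3", "c3", "d4"),
    ("a2", "b4", "c3", "d4"),
]
compressed = star_reduce(rows, min_sup=2)
print(compressed)
```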
Aggregation
DFS (depth-first-search) from the root of a star tree
Creates star trees for the cuboids on the next level
Pruning
Prunes if the aggregates do not satisfy min_sup
Prunes if all the nodes in the generated tree are star nodes
Method
Multi-way aggregation & iceberg pruning
[Figure: star-tree nodes root: 5, a1: 3, a2: 2 and BCD: 3, b*: 3, b1: 2, shown alongside the lattice all; A/A, C/C, D/D; ABC/ABC, ABD/AB, ACD/A, BCD, with two branches marked as pruned]
Limitations
Sensitive to the order of dimensions
→ The order of decreasing cardinality
Reference
Xin, D., Han, J., Li, X. and Wah, B.W., “Star-Cubing: Computing Iceberg
Cubes by Top-Down and Bottom-Up Integration”, In Proceedings of VLDB
(2003)
High-Dimensional OLAP
Motivations
The computation and storage of an iceberg cube are still costly for high dimensional data ( “the curse of dimensionality” )
Hard to determine an appropriate iceberg threshold
An iceberg cube cannot be incrementally updated
Features
Reduces a high dimensional cube into a set of lower dimensional cubes
Lossless reduction
Online re-construction of high-dimensional data cube
Fragmentation Strategy
Observation
OLAP occurs only on a small subset of dimensions at a time
Fragmentation
Partitions the set of dimensions into shell fragments
Computes data cubes for each shell fragment
( Computing 20 3-D data cubes is much cheaper than computing 1 60-D data cube. )
Retains inverted indices or value-list indices
Semi-Online Computation
Given the pre-computed fragment cubes, dynamically compute cube cells
of the high-dimensional cube online
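The gain from fragmentation can be made concrete by counting cuboids. The sketch below assumes that a cube over n dimensions is counted as its 2^n − 1 non-empty dimension subsets (the apex is left out, so a 3-D fragment has 7 cuboids) and that the 60 dimensions are split into 20 disjoint 3-D fragments; the function name is hypothetical.

```python
def cuboids_in_cube(n_dims):
    """Number of non-empty dimension subsets, i.e. cuboids excluding the apex."""
    return 2 ** n_dims - 1

fragments = 20
fragment_dims = 3
shell_total = fragments * cuboids_in_cube(fragment_dims)   # 20 * 7
full_total = cuboids_in_cube(fragments * fragment_dims)    # 2**60 - 1
print(shell_total, full_total)
```

The shell fragments need 140 cuboids in total, against roughly 1.15 × 10^18 for the full 60-D cube.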
Inverted Index
Example
TID  A   B   C   D   E
1    a1  b1  c1  d1  e1
2    a1  b2  c1  d2  e1
3    a1  b2  c1  d1  e2
4    a2  b1  c1  d1  e2
5    a2  b1  c1  d1  e3
Value  TID List   Size
a1     {1, 2, 3}  3
…      …          …
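The 1-D inverted index for the example table can be built in a single pass over the tuples. This is a minimal sketch: the function name `build_inverted_index` is hypothetical, and it relies on the assumption that attribute values are unique across dimensions (a1, b1, …), so the value alone can serve as the key.

```python
from collections import defaultdict

# Example table: TID -> values of the dimensions (A, B, C, D, E).
table = {
    1: ("a1", "b1", "c1", "d1", "e1"),
    2: ("a1", "b2", "c1", "d2", "e1"),
    3: ("a1", "b2", "c1", "d1", "e2"),
    4: ("a2", "b1", "c1", "d1", "e2"),
    5: ("a2", "b1", "c1", "d1", "e3"),
}

def build_inverted_index(table):
    """Map each attribute value to the sorted list of TIDs that contain it."""
    index = defaultdict(list)
    for tid in sorted(table):
        for value in table[tid]:
            index[value].append(tid)
    return dict(index)

index = build_inverted_index(table)
for value, tids in index.items():
    print(value, tids, len(tids))
```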
Cube Computation
Process
Generalizes the 1-D inverted indices to multi-dimensional inverted indices
Computes cuboids using the inverted indices
Example
Shell fragment cube ABC contains 7 cuboids, A, B, C, AB, AC, BC, ABC
The cuboid AB is computed using the 2-D inverted index below
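A 2-D inverted index for the cuboid AB can be derived by intersecting 1-D TID lists. The sketch below hard-codes the 1-D lists for A and B as computed from the five-tuple example table; the function name `cuboid_2d` is hypothetical, and cells with an empty TID list are simply left out.

```python
from itertools import product

# 1-D inverted index entries for dimensions A and B (from the example table).
index = {
    "a1": {1, 2, 3},
    "a2": {4, 5},
    "b1": {1, 4, 5},
    "b2": {2, 3},
}

def cuboid_2d(index, values_x, values_y):
    """Intersect the TID lists of every value pair; keep the non-empty cells."""
    cells = {}
    for x, y in product(values_x, values_y):
        tids = index[x] & index[y]
        if tids:
            cells[(x, y)] = tids
    return cells

ab = cuboid_2d(index, ["a1", "a2"], ["b1", "b2"])
for cell, tids in ab.items():
    print(cell, sorted(tids), len(tids))
```

The size of each intersected list is the count of the corresponding cell, so no access to the base table is needed.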
Process
Given shell fragment cubes
Divides the query into shell fragments
Fetches the corresponding TID list for each fragment from the cubes
Intersects the TID lists from each fragment to construct instantiated
base table
Computes the online cube using the base table with any data cube
computation algorithm
Computes the query
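The fetch-and-intersect steps of the process above can be sketched for a point query. Everything about the representation is an assumption made for illustration: the fragments (A, B, C) and (D, E), the dictionary layout of the fragment cubes, the few cells that are hard-coded (taken from the five-tuple example table), and the function name `fetch_tids`. Building the instantiated base table and the online cube is not shown.

```python
# Hypothetical fragment cubes: fragment dimensions -> {cell -> TID set},
# where a cell is a tuple of (dimension, value) pairs in dimension order.
fragment_cubes = {
    ("A", "B", "C"): {
        (("A", "a1"),): {1, 2, 3},
        (("A", "a2"),): {4, 5},
        (("A", "a1"), ("B", "b2")): {2, 3},
    },
    ("D", "E"): {
        (("D", "d1"),): {1, 3, 4, 5},
        (("E", "e2"),): {3, 4},
        (("D", "d1"), ("E", "e2")): {3, 4},
    },
}

def fetch_tids(fragment_cubes, query):
    """Split the query by fragment, fetch each TID list, and intersect them."""
    tids = None
    for dims, cube in fragment_cubes.items():
        # The part of the query that falls into this fragment.
        part = tuple((d, query[d]) for d in dims if d in query)
        if not part:
            continue
        fetched = cube.get(part, set())
        tids = fetched if tids is None else tids & fetched
    return tids

tids_a1_d1 = fetch_tids(fragment_cubes, {"A": "a1", "D": "d1"})
tids_a1_b2_d1 = fetch_tids(fragment_cubes, {"A": "a1", "B": "b2", "D": "d1"})
print(sorted(tids_a1_d1), sorted(tids_a1_b2_d1))
```

The resulting TID set identifies the tuples that form the instantiated base table for the query.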
[Figure: dimensions A B C D E F … split into the fragment cubes ABC and DEF; the DEF cube contains the D cuboid, the EF cuboid, the DE cuboid, …]
DE Cuboid
Cell    T-ID List
d1 e1   {1, 3, 8, 9}
d1 e2   {2, 4, 6, 7}
d2 e1   {5, 10}
…       …
[Figure: dimensions A B C D E F G H I J K L M N … feeding the instantiated base table and the online cube]
Advantages
Significant increase of the speed of data cube computation
Significant decrease of the storage usage
Various applications in very high dimensional data
Limitations
Tradeoffs between the preprocessing time to construct a data cube and
the online computation time for queries
Tradeoffs between the storage for a data cube and the memory for
online computation
Reference
Li, X., Han, J. and Gonzalez, H., “High-Dimensional OLAP: A Minimal
Cubing Approach”, In Proceedings of VLDB (2004)
Questions?