Materialized View Generation Using Apriori Algorithm
International Journal of Database Management Systems ( IJDMS ) Vol.7, No.6, December 2015
DOI : 10.5121/ijdms.2015.7602
ABSTRACT
Data analysis is important to the business world in many respects. Business organizations employ
data scientists and knowledge workers to analyze business patterns and customer behavior.
Scrutinizing past data to predict future results has many aspects, and understanding the nature of
the queries is one of them. Business analysts try to do this on a large data set, which may be stored
in the form of a data warehouse. In this context, the analysis of historical data has become a subject
of interest, and different techniques are being developed to study patterns of customer behavior. A
materialized view is a database object that can be used extensively in data analysis. Different
approaches exist for generating an optimum materialized view. This paper proposes an algorithm that
generates a materialized view by considering the frequencies of the attributes taken from a database,
with the help of the Apriori algorithm.
KEYWORDS
Data Warehouse, OLAP, Materialized View, Apriori Algorithm, Minimum Support Value
1. INTRODUCTION
Business enterprises deal with large amounts of data, and their profits depend significantly on
how the data are interpreted. Data analysis has therefore become an important research topic
nowadays and has huge potential, especially in the e-commerce sector. It also contributes notably
to the field of social media. Data analysts and data scientists are developing different algorithms
to analyze data and to store the data that are of greater importance, with the aim of increasing the
business intelligence of a commercial organization. Among the approaches prevalent today, a
materialized view can be used to store the important data. A materialized view stores the outputs
of queries but, unlike a logical view, it stores them permanently in physical storage. Because of
this, it can be used to store the results of frequently asked queries. Instead of fetching data from
the database each time, the results can be obtained directly from the materialized view. This type
of view can be used as a cache that can be accessed quickly. It reduces the network load if the data
are stored in a distributed environment and, at the same time, reduces the query execution time.
The problem remains of deciding which data, out of a huge set of data transactions, are to be
materialized. Different algorithms have been proposed to identify the optimal data set for
materialization, and most of them are based on a greedy approach to selection. Of late, genetic
algorithms have also been used to select data for materialization. The research work presented in
this paper is based on the Apriori algorithm proposed in [4]. This algorithm has been used to design
a method that identifies the data to be materialized based on their frequencies and their
dependencies on other data. The next section gives an overview of some useful algorithms discussed
in different research papers in connection with materialized view selection. Section 3 gives an
overall idea of the steps of the Apriori algorithm applied in the development of the present work.
Section 4 describes the steps followed in this research work and puts forward the algorithm for the
selection of a materialized view. The section after that shows and analyzes the results obtained by
applying the proposed algorithm to different data sets. Finally, the last section presents the
concluding points.
2. RELATED WORK
A data warehouse is a collection of historical data gathered from a number of queries previously
executed on a specific database. The data stored in a data warehouse are used for On-Line
Analytical Processing, or OLAP, which is a method for decision support systems. A materialized view
is normally created on the data available in a data warehouse. Since a data warehouse contains a
huge amount of data, extracting a specific set of data often becomes time consuming and may thus
lead to inefficient processing. This is exactly where a materialized view has its role: it is a
database object that stores data physically in order to minimize data processing time. Since
materialized views are essentially used with data warehouses, the usefulness of constructing a data
warehouse is also a point of concern. A detailed study on this aspect was done in [19] to explain
the role of OLAP along with the data warehouse.
Research has been going on to extract the optimal data set to be used for materialization. Earlier
work, described in [7], showed that optimal materialized view selection is an NP-complete problem
and proposed a greedy approach to view materialization that optimizes the query evaluation cost.
The approach in [7] depended on a data structure called the data cube. Another data structure, the
tree, was used for view materialization in [8], where the decisive parameter for view generation
was the overall workload of query execution. Since the nature of the queries may change from time
to time, more data may have to be added to the data set held by the materialized view, so a
materialized view has to be scalable. An approach to this issue was discussed in [5] for OLAP
processing. Another such work was described in [9]; its main characteristic was to deal with a
portion of a query instead of the entire query.
The use of materialized views can also be extended to knowledge discovery in data, i.e., to data
mining applications, and quite a few studies have been done in this field. View materialization for
data mining using data clustering techniques was proposed in [1], and the method was reported to
generate effective results. If the data are processed continuously, i.e., if they are streamed, a
materialized view can still be formed dynamically; one such method was proposed and discussed in
[6]. A dynamic programming model was also used in the same domain and, as described in [10], this
model can be used effectively for view materialization. Oracle, one of the earliest commercial
database packages, has used materialized views with large volumes of data, as discussed in [11].
Different research papers have made comparative studies of the approaches to view selection. One
such review was done in [15], which showed that a greedy approach with polynomial time complexity
would be an optimal way of selecting views for materialization. Based on the greedy approach, a
cost model was developed in [18]. In that work, the total cost and the benefits involved in each
materialized view selection were calculated and, based on the outcome, the most optimized
materialized view was selected for a data warehouse. Along with the selection of views to be
materialized, their maintenance is also very important and a subject of research. In one such work
[17], common subexpressions were used for selecting and maintaining materialized views, and three
different kinds of materialization were described: transient, permanent and incremental, which
were closely inter-dependent. That paper focused mainly on the maintenance of the optimally
generated materialized view. In [20], the features of dynamic materialized views were used in
designing a special type of progressive query (a set of step-queries) known as the monotonic linear
progressive query.
Of late, modern approaches such as evolutionary algorithms have been used in the view selection
process. One of the initial papers in this regard is [14], where an evolutionary approach was
proposed to find the optimal set of views based on the total view selection cost; the paper also
discussed the proposed method in detail. A method based on a genetic algorithm was proposed and
discussed in [3], where the views were represented in the form of chromosomes and each gene
represented a selected view. The views to be incorporated within a chromosome, out of a population
of chromosomes, were selected on the basis of a fitness function. The main parameter considered in
forming the fitness function was the size of each view in question. Steps such as crossover and
mutation were used to reorganize the chromosomes. The same paper also showed, with graphical
representations, that generating materialized views using a genetic algorithm produced more optimal
materialized views than the earlier greedy approaches.
In other words, the parameter support identifies the percentage of transactions where both A and
B occur and the parameter confidence identifies the percentage of transactions containing A that
also contain B.
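As an illustration of these two measures, the following Python sketch computes them directly from the verbal definitions given above. The function names and the toy transactions are assumptions made for this example only and are not taken from [4].

```python
def support(itemset, transactions):
    """Fraction of all transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= frozenset(t))
    return hits / len(transactions)


def confidence(a, b, transactions):
    """Fraction of the transactions containing A that also contain B."""
    a, b = frozenset(a), frozenset(b)
    containing_a = [frozenset(t) for t in transactions if a <= frozenset(t)]
    if not containing_a:
        return 0.0  # A never occurs, so the rule A => B has no evidence
    both = sum(1 for t in containing_a if b <= t)
    return both / len(containing_a)


# Hypothetical toy data, not taken from the paper.
toy = [{"A", "B"}, {"A", "C"}, {"A", "B", "C"}, {"B"}]
print(support({"A", "B"}, toy))       # A and B occur together in 2 of 4 transactions
print(confidence({"A"}, {"B"}, toy))  # 2 of the 3 transactions with A also contain B
```

Both values are reported here as fractions between 0 and 1; multiplying by 100 gives the percentages mentioned in the text.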
With all these parameters defined, the Apriori algorithm identifies the frequent item sets. An item
set is said to be frequent if its support reaches a pre-defined limit, the minimum support value.
This process involves multiple passes over the given large data set, one per iteration. The details
of the process are described in [4]. The entire method is divided into two basic steps: the join
step and the prune step. The join step generates the kth candidate item sets from the (k-1)st item
sets by joining them. Each kth candidate contains k items considered for the final selection. This
selection is based on the pre-defined minimum support, so the first step builds larger item sets
from smaller ones. The prune step then removes irrelevant item sets, if any. Irrelevance is
identified by predefined conditions imposed on the item sets, depending on the applicability of the
considered data set. The pseudo-codes for these two steps are given and explained in [4].
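The two steps can be sketched in Python as follows. This is a minimal illustration and not the pseudo-code of [4]: the pruning condition used here is the usual Apriori one (a candidate is dropped if any of its (k-1)-item subsets is not frequent), which is an assumption about what "irrelevant" means above. The function names are hypothetical, the minimum support is expressed as an absolute count, and the 18 example transactions are the attribute sets used in the first test case of this paper.

```python
from itertools import combinations


def join_step(prev_frequent, k):
    """Join pairs of frequent (k-1)-item sets whose union has exactly k items."""
    prev = list(prev_frequent)
    return {a | b for i, a in enumerate(prev) for b in prev[i + 1:] if len(a | b) == k}


def prune_step(candidates, prev_frequent, k):
    """Drop every candidate that has a (k-1)-item subset which is not frequent."""
    prev = set(prev_frequent)
    return {c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))}


def apriori(transactions, min_count):
    """Return {k: {item set: frequency}} for every iteration k with frequent sets."""
    transactions = [frozenset(t) for t in transactions]
    level = {frozenset([item]) for t in transactions for item in t}
    result, k = {}, 1
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        if not frequent:
            break
        result[k] = frequent
        k += 1
        level = prune_step(join_step(frequent, k), frequent, k)
    return result


# Attribute sets of the 18 transactions of the first test case.
TABLE_1 = [
    {1, 2, 3, 4}, {1, 2, 3}, {2, 3}, {3, 4}, {5, 6}, {4, 5, 6},
    {1, 2, 3, 4}, {4, 5, 6}, {1, 2, 3}, {4, 6}, {2, 6}, {3, 4, 6},
    {3, 4, 6}, {2, 4, 6}, {2, 4}, {1, 2, 3, 5}, {1, 2, 3, 5}, {1, 2, 3, 5},
]

levels = apriori(TABLE_1, min_count=2)
for itemset, count in sorted(levels[4].items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), count)
# [1, 2, 3, 4] 2
# [1, 2, 3, 5] 3
```

The loop stops when the pruned join result is empty, which matches the remark that the number of iterations depends on how soon the joined set becomes a null set.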
4. PROPOSED WORK
From a given set of database transactions, the frequently accessed attributes can be identified by
the Apriori algorithm. Each transaction is basically the execution of a query, and each query deals
with a set of attributes. So, each transaction can be thought of as the set of attributes on which
the query is to be executed. All of these sets of attributes are given to the Apriori algorithm.
Since the Apriori algorithm can be used effectively to find frequent item sets, the present
research work uses it to find the frequent attributes that can be considered for materialization.
For this purpose, the attributes involved in the transactions are treated as analogous to the items
considered in the original description of the Apriori algorithm in [4].
The output of the algorithm will be the sets of the most frequently requested attributes. There may
be three different cases:
Case 1: The algorithm may generate more than one set of attributes.
Case 2: The algorithm may generate a single set of attributes.
Case 3: The algorithm may generate a null set.
In the first case, the intersection of the output sets is considered for materialization. In the
second case, the output set itself is considered for materialization. In the final case, the output
of the last-but-one iteration is considered for materialization.
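The handling of the three cases can be sketched as below. The function name and the list-of-iterations input format are assumptions made for this illustration. The sketch also assumes that, in Case 3, the sets of the last-but-one iteration are intersected in the same way as in Case 1 when there is more than one of them, which the description above does not state explicitly.

```python
from functools import reduce


def select_attributes(iterations):
    """Pick the attributes to materialize from the per-iteration Apriori outputs.

    `iterations[i]` is the list of attribute sets produced by iteration i + 1;
    the last entry may be empty (Case 3).
    """
    if not iterations:
        return frozenset()
    final = [frozenset(s) for s in iterations[-1]]
    if not final and len(iterations) > 1:
        # Case 3: null set, so fall back to the last-but-one iteration.
        final = [frozenset(s) for s in iterations[-2]]
    if not final:
        return frozenset()
    # Case 1: intersect all output sets. Case 2 (a single set) is the
    # degenerate form of the same operation and returns that set unchanged.
    return reduce(frozenset.intersection, final)


print(sorted(select_attributes([[{1, 2, 3, 4}, {1, 2, 3, 5}]])))  # Case 1: [1, 2, 3]
print(sorted(select_attributes([[{1, 4}]])))                      # Case 2: [1, 4]
print(sorted(select_attributes([[{1, 2}, {1, 3}], []])))          # Case 3: [1]
```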
The number of iterations depends on a pre-defined threshold minimum support value. This
threshold value is application specific and should be assigned by the business analysts depending
on the nature of the business operation and the nature of the desired output.
Some other attributes may need to be attached to this set for materialization, and these are
identified next. This is done by finding the confidence values of the attributes that were not
selected initially for materialization with respect to the attributes that were.
For example, if a transaction has five attributes A1 to A5 and only A1 and A3 are selected for
materialization in the first phase, then in the second phase the confidence values of A2, A4 and A5
on A1 and A3 are computed. If any confidence value is greater than or equal to the pre-defined
threshold confidence value, which plays the same role as the minimum support value described in
[4], the attributes corresponding to these confidence values are added to the materialized view.
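A possible reading of this second phase is sketched below. It tests each unselected attribute individually against the selected set, which is one interpretation of the example above; the function name, the attribute labels and the toy transactions are hypothetical and serve only as an illustration.

```python
def extend_by_confidence(selected, all_attributes, transactions, min_conf):
    """Add every attribute whose confidence on `selected` reaches `min_conf`."""
    selected = frozenset(selected)
    # Transactions that contain all of the already selected attributes.
    base = [frozenset(t) for t in transactions if selected <= frozenset(t)]
    if not base:
        return selected
    extra = set()
    for attr in set(all_attributes) - selected:
        conf = sum(1 for t in base if attr in t) / len(base)
        if conf >= min_conf:
            extra.add(attr)
    return selected | extra


# Hypothetical transactions over the attributes A1 to A5.
toy = [
    {"A1", "A3", "A2"},
    {"A1", "A3", "A2"},
    {"A1", "A3", "A4"},
    {"A2", "A5"},
]
attributes = {"A1", "A2", "A3", "A4", "A5"}
print(sorted(extend_by_confidence({"A1", "A3"}, attributes, toy, min_conf=0.5)))
# ['A1', 'A2', 'A3']
```

In this toy data A2 occurs in two of the three transactions that contain both A1 and A3, so it passes the threshold of 0.5, while A4 (one of three) and A5 (none) do not.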
The present method, named Materialized View Generation using Apriori Algorithm or MVG_AA, is
based on the two steps mentioned above. In the next section, the method MVG_AA is applied to two
different test cases. The data sets used for explaining the algorithm have been generated
randomly.
The following is the pseudo-code for MVG_AA:
Algorithm MVG_AA ( )
{
    Input: T = A set of n database transactions and ATR = A set of attributes on
           which the different transactions are to be executed
    Output: M = A set of attributes to be materialized
    Let T = {T1, T2, T3, ..., Tn}
    Initialize M to ∅, i.e., the null set
    for i = 1 to n
    do
        Let A = the set of attributes involved in the ith transaction Ti
        R = Apriori (A)    /* R is a set which stores the output generated by the Apriori
                              algorithm; Apriori ( ) is a method that invokes the Apriori
                              algorithm and takes A as its parameter */
        M = M ∪ R
    done
Table 1.
Transaction ID   Attribute Set Involved   Binary Value   Decimal Value
1                1,2,3,4                  1111           15
2                1,2,3                    111            7
3                2,3                      110            6
4                3,4                      1100           12
5                5,6                      110000         48
6                4,5,6                    111000         56
7                1,2,3,4                  1111           15
8                4,5,6                    111000         56
9                1,2,3                    111            7
10               4,6                      101000         40
11               2,6                      100010         34
12               3,4,6                    101100         44
13               3,4,6                    101100         44
14               2,4,6                    101010         42
15               2,4                      1010           10
16               1,2,3,5                  10111          23
17               1,2,3,5                  10111          23
18               1,2,3,5                  10111          23
For example, if the tuple corresponding to transaction ID 4 in Table 1 is considered, only
attributes A3 and A4 are selected, so its equivalent binary entry is 1100, where the two leftmost
1s indicate that attributes A4 and A3 are selected in this transaction and the two rightmost 0s
signify that attributes A2 and A1 are missing, i.e., not participating. Finally, the fourth
column, Decimal Value, stores the decimal equivalents of the binary values stored in the third
column. The next table, Table 2, stores all the frequency values against the iterations of the
Apriori algorithm. According to this algorithm, as stated in (R. Agrawal & R. Srikant, 1994), a
number of iterations are required to find the most frequent item sets. The number of iterations
depends on how soon the set resulting from the join step becomes a null set. After the final
iteration is over, the attribute sets whose frequencies reach a threshold value are identified and
chosen as the most frequent ones. In the experiments described in this paper, the threshold
frequency value has been chosen to be 2.
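The binary and decimal encoding described above amounts to a bitmask in which attribute Ak sets bit k-1. A minimal Python sketch of this encoding is given below; the helper names `encode` and `decode` are assumptions made for this illustration.

```python
def encode(attributes):
    """Map attribute numbers to an integer bitmask: attribute k sets bit k - 1."""
    mask = 0
    for k in attributes:
        mask |= 1 << (k - 1)
    return mask


def decode(mask):
    """Recover the sorted attribute numbers from a bitmask."""
    return [k + 1 for k in range(mask.bit_length()) if (mask >> k) & 1]


print(format(encode([3, 4]), "b"), encode([3, 4]))  # 1100 12
print(decode(23))                                   # [1, 2, 3, 5]
```

With this encoding, a transaction `t` contains an attribute set `s` exactly when `(t & s) == s`, and the intersection of two attribute sets is their bitwise AND.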
Table 2. The outputs of Apriori Algorithm applied on the transaction set as shown in Table 1.
Iteration   Frequent Attribute Sets   Frequency
1           1                         7
1           2                         11
1           4                         11
1           8                         10
1           16                        6
1           32                        8
2           3                         7
2           5                         7
2           9                         2
2           17                        3
2           6                         8
2           10                        4
2           18                        3
2           34                        2
2           12                        5
2           20                        3
2           36                        2
2           24                        2
2           40                        6
2           48                        3
3           7                         7
3           11                        2
3           19                        3
3           13                        2
3           21                        3
3           14                        2
3           22                        3
3           44                        2
3           56                        2
4           15                        2
4           23                        3
From Table 2, it is clear that the required frequent attribute sets are 15 and 23, i.e., 1111₂ and
10111₂ respectively, because the frequency of each of these two sets, which is termed the support
value in (R. Agrawal & R. Srikant, 1994), reaches the threshold of 2. In other words, the attribute
sets {A1, A2, A3, A4} and {A1, A2, A3, A5} are identified separately. Since two different sets of
attributes are selected, their intersection is computed; it gives A1, A2 and A3, which has an
equivalent decimal value of 7. So, according to the algorithm MVG_AA ( ), the attributes A1, A2
and A3 are to be materialized.
Table 3. The list of confidence values obtained from the result as shown in Table 2.
Confidence on 15                  Confidence on 23
1=>14 = 0.2857142857142857        1=>22 = 0.42857142857142855
2=>13 = 0.18181818181818182       2=>21 = 0.2727272727272727
3=>12 = 0.2857142857142857        3=>20 = 0.42857142857142855
4=>11 = 0.18181818181818182       4=>19 = 0.2727272727272727
5=>10 = 0.2857142857142857        5=>18 = 0.42857142857142855
6=>9 = 0.25                       6=>17 = 0.375
7=>8 = 0.2857142857142857         7=>16 = 0.42857142857142855
8=>7 = 0.2                        16=>7 = 0.5
9=>6 = 1.0                        17=>6 = 1.0
10=>5 = 0.5                       18=>5 = 1.0
11=>4 = 1.0                       19=>4 = 1.0
12=>3 = 0.4                       20=>3 = 1.0
13=>2 = 1.0                       21=>2 = 1.0
14=>1 = 1.0                       22=>1 = 1.0
As the next step, the process tries to identify whether any other attribute should be materialized
along with the attributes already chosen. For this, the confidence values of the other attributes
on the already selected attributes, as defined for association rules and stated in the previous
section, are calculated using the standard expression given in (Han, J. & Kamber, M., 2006). The
calculated confidence values are shown in Table 3. The confidence threshold value considered is
0.5. According to the proposed method, if any other attribute has a confidence value greater than
or equal to the threshold confidence value, that attribute is considered along with the already
selected list of attributes. Table 3 contains two columns because, in the previous step, two sets
of attributes, with decimal values 15 and 23 respectively, satisfied the minimum support value.
Among the entries shown in Table 3, the rules that depend on 7, i.e., on A1, A2 and A3, are 7=>8
and 7=>16, with confidence values of about 0.29 and 0.43 respectively. Neither reaches the
threshold value of 0.5. So, no more attributes will be added to the already obtained list of
attributes to be materialized, and the final content of the materialized view will be A1, A2 and
A3.
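The arithmetic of this test case can be re-traced with a few lines of Python operating on the decimal values of Table 1. This is only a sketch for checking the numbers: the helper names are hypothetical, and the confidence of a rule X=>Y is computed as the frequency of the combined set divided by the frequency of X.

```python
# Decimal values of the 18 transactions of Table 1, in order of transaction ID.
TABLE_1 = [15, 7, 6, 12, 48, 56, 15, 56, 7, 40, 34, 44, 44, 42, 10, 23, 23, 23]


def frequency(mask, transactions):
    """Number of transactions that contain every attribute encoded in `mask`."""
    return sum(1 for t in transactions if (t & mask) == mask)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent => consequent."""
    both = frequency(antecedent | consequent, transactions)
    return both / frequency(antecedent, transactions)


selected = 15 & 23  # intersection of the two frequent sets
print(selected)     # 7, i.e. A1, A2 and A3
for frequent in (15, 23):
    rest = frequent & ~selected  # attributes of the frequent set not yet selected
    print(f"{selected}=>{rest} = {confidence(selected, rest, TABLE_1)}")
# 7=>8 = 0.2857142857142857
# 7=>16 = 0.42857142857142855
```

Both values fall below the confidence threshold of 0.5, in agreement with the corresponding entries of Table 3.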
Test case 2:
Table 4 stores the transaction details of another example along with their binary and decimal
values, in the same way as the data were stored in Table 1. The next table, Table 5, stores all the
frequency values against the iterations of the Apriori algorithm. The same threshold frequency
value of 2 has been chosen for this test as well. From Table 5, it is clear that the frequent
attribute sets are 15 and 57, i.e., 1111₂ and 111001₂ respectively, because the support values of
these two sets reach the threshold of 2. In other words, the attribute sets {A1, A2, A3, A4} and
{A1, A4, A5, A6} are identified separately. Since two different sets of attributes are selected,
their intersection is computed; it gives A1 and A4, or 1001₂, which has an equivalent decimal
value of 9. So, A1 and A4 are to be materialized.
To find the other attributes, if any, on the basis of the confidence values, the same confidence
threshold of 0.5 has been chosen for this test as well.
Among the entries shown in Table 6, the rules that depend on 9, i.e., on A1 and A4, are 9=>6 and
9=>48. The rule 9=>6 has a confidence value of 0.4, which is below the threshold value of 0.5,
whereas the rule 9=>48 has a confidence value of 0.6, which is above it. So only 48 (or 110000₂),
i.e., the set containing A5 and A6, is dependent on 9, and these two attributes will be added to
the already obtained list of attributes to be materialized. The final content of the materialized
view will therefore be A1, A4, A5 and A6.
In this way, different attributes can be identified to be added to the final materialized view. The
number of attributes and the attributes themselves may vary if the confidence level and the
support value are altered. These two parameters depend exclusively on the requirements of the
applications for which the data analysis is to be performed.
Table 4. The attributes in all transactions along with their binary and decimal values.
Transaction ID   Attribute Set Involved   Binary Value   Decimal Value
1                1,2,3,4                  1111           15
2                1,2,3                    111            7
3                2,3                      110            6
4                3,4                      1100           12
5                5,6                      110000         48
6                4,5,6                    111000         56
7                1,2,3,4                  1111           15
8                4,5,6                    111000         56
9                1,2,3                    111            7
10               4,6                      101000         40
11               2,6                      100010         34
12               3,4,6                    101100         44
13               3,4,6                    101100         44
14               2,4,6                    101010         42
15               2,4                      1010           10
16               1,4,5,6                  111001         57
17               1,4,5,6                  111001         57
18               1,4,5,6                  111001         57
Table 5. The outputs of Apriori Algorithm applied on the transaction set as shown in Table 4.
Iteration   Frequent Attribute Sets   Frequency
1           1                         7
1           2                         8
1           4                         8
1           8                         13
1           16                        6
1           32                        11
2           3                         4
2           5                         4
2           9                         5
2           17                        3
2           33                        3
2           6                         5
2           10                        4
2           34                        2
2           12                        5
2           36                        2
2           24                        5
2           40                        9
2           48                        6
3           7                         4
3           11                        2
3           13                        2
3           25                        3
3           41                        3
3           49                        3
3           14                        2
3           44                        2
3           56                        5
4           15                        2
4           57                        3
Table 6. The list of confidence values obtained from the result as shown in Table 5.
Confidence on 15                  Confidence on 57
1=>14 = 0.2857142857142857        1=>56 = 0.42857142857142855
2=>13 = 0.25                      8=>49 = 0.23076923076923078
3=>12 = 0.5                       9=>48 = 0.6
4=>11 = 0.25                      16=>41 = 0.5
5=>10 = 0.5                       17=>40 = 1.0
6=>9 = 0.4                        24=>33 = 0.6
7=>8 = 0.5                        25=>32 = 1.0
8=>7 = 0.15384615384615385
9=>6 = 0.4
10=>5 = 0.5
11=>4 = 1.0
12=>3 = 0.4
13=>2 = 1.0
14=>1 = 1.0
REFERENCES
[1] Aouiche, K., & Jouve, P. (2006). Clustering-based materialized view selection in data
    warehouses. In Proceedings of 10th East European conference on Advances in Databases and
    Information Systems (pp. 81-95).
[2] Dey, Kashi Nath, & Datta, Debabrata. (2013). Materialized View A Novel Approach. In
    Proceedings of the 2nd International Conference on Computing and Systems (pp. 280-282).
[3] Vijay Kumar, T.V., & Kumar, S. (2012). Materialized View Selection using Genetic Algorithm.
    In Proceedings of the 5th International Conference on Communications and Information Science
    (pp. 225-237).
[4] Agrawal, R., & Srikant, R. (1994). Fast Algorithms for Mining Association Rules. In
    Proceedings of the 20th International Conference on Very Large Data Bases (pp. 487-499).
[5] Nadeau, Thomas P., & Teorey, Toby J. (2002). Achieving Scalability in OLAP Materialized View
    Selection. In Proceedings of the 5th ACM International workshop on Data Warehousing and OLAP
    (pp. 28-34).
[6] Chandrasekaran, S., & Franklin, M.J. (2002). Streaming queries over Streaming Data. In
    Proceedings of the 28th International Conference on Very Large Data Bases (pp. 203-214).
[7] Harinarayan, V., Rajaraman, A., & Ullman, J. (1996). Implementing data cubes efficiently. In
    Proceedings of ACM SIGMOD International Conference on Management of Data (pp. 205-216).
[8] Bhagat, A.P., & Harle, B.R. (2011). Materialized View Management in Peer to Peer
    Environment. In Proceedings of International Conference & Workshop on Emerging Trends in
    Technology (pp. 480-484).
[9] Goldstein, J., & Larson, Per-Åke. (2001). Optimizing Queries Using Materialized Views: A
    Practical, Scalable Solution. In Proceedings of the ACM SIGMOD International Conference on
    Management of Data (pp. 331-342).
[10] Chaudhuri, S., Krishnamurthy, S., Potamianos, S., & Shim, K. (1995). Optimizing Queries with
    Materialized Views. In Proceedings of the International Conference on Data Engineering
    (pp. 190-200).
[11] Bello, R.G., Dias, K., Feenan, J., Finnerty, J., Norcott, W.D., Sun, H., Witkowski, A., &
    Ziauddin, M. (1998). Materialized Views in Oracle. In Proceedings of the 24th International
    Conference on Very Large Data Bases (pp. 659-664).
[12] Baralis, E., Paraboschi, S., & Teniente, E. (1997). Materialized view selection in a
    multidimensional database. In Proceedings of the 23rd International Conference on Very Large
    Data Bases (pp. 156-165).
[13] Gupta, H., & Mumick, I. (2005). Selection of Views to Materialize in a Data Warehouse. IEEE
    Transactions on Knowledge and Data Engineering, 17(1) (pp. 24-43).
[14] Horng, J.T., Chang, Y.J., Liu, B.J., & Kao, C.Y. (1999). Materialized view selection using
    genetic algorithms in a data warehouse system. In Proceedings of the World Congress on
    Evolutionary Computation (pp. 2221-2227).
[15] Vijay Kumar, T.V., & Ghosal, A. (2009). Greedy Selection of Materialized Views.
    International Journal of Communication Technology, Volume 1, Number 1 (pp. 156-172).
[16] Zhang, C., Yao, X., & Yang, J. (2001). An evolutionary approach to materialized views
    selection in a data warehouse environment. IEEE Transactions on Systems, Man, and
    Cybernetics, Part C: Applications and Reviews, Volume 31, Number 3 (pp. 282-294).
[17] Mistry, H., Roy, P., Sudarshan, S., & Ramamritham, K. (2001). Materialized view selection
    and maintenance using multi-query optimization. In Proceedings of ACM SIGMOD International
    Conference on Management of Data (pp. 307-318).
[18] Chan, G. K.Y., Li, Q., & Feng, L. (1999). Design and selection of materialized views in a
    data warehousing environment: A case study. In Proceedings of 2nd ACM International Workshop
    on Data Warehousing and OLAP (pp. 42-47).
[19] Chaudhuri, S., & Dayal, U. (1997). An Overview of Data Warehousing and OLAP Technology. In
    ACM Sigmod Record, Volume 26, Issue 1 (pp. 65-74).
[20] Zhu, C., Zhu, Q., & Zuzarte, C. (2010). Efficient Processing of Monotonic Linear Progressive
    Queries via Dynamic Materialized Views. In Proceedings of Conference of Center for Advanced
    Studies on Collaborative Research (pp. 224-237).
[21] Han, J., & Kamber, M. (2006). Data Mining: Concepts and Techniques, Second Edition. Morgan
    Kaufmann Publishers.
AUTHORS
Debabrata Datta is presently an Assistant Professor in the Department of Computer
Science, St. Xavier's College (Autonomous), Kolkata. He has more than 10 years of teaching
experience. His research interests include data warehousing and data
mining. He has published fifteen research papers in different national and international
conferences as well as journals.
Presently, Kashi Nath Dey is an Associate Professor in the Department of Computer
Science & Engineering, University of Calcutta, India. He has about 8 years of industrial
experience in different IT companies in India. He has about 26 years of experience in
teaching and research. He has more than 35 research publications in different
International conferences and International Journals and authored 7 books published by
Pearson Education (India).