Security Issues in Data Warehouse
SAIQA ALEEM, LUIZ FERNANDO CAPRETZ
Department of Electrical & Computer Engineering
Western University
London, ON, Canada
{saleem4, lcapretz}@uwo.ca
FAHEEM AHMED
Department of Computing Science
Thompson Rivers University
Kamloops, BC, Canada
[email protected]
Abstract: A data warehouse (DWH) provides storage for huge amounts of historical data from heterogeneous
operational sources in the form of multidimensional views, thus supplying sensitive and useful information which
helps decision-makers to improve the organization's business processes. A data warehouse environment must ensure
that data collected and stored in one big repository are not vulnerable. This paper reviews security approaches designed
specifically for data warehouse environments and discusses the issues associated with each type of approach.
ISBN: 978-1-61804-264-4 15
Recent Advances in Information Technology
requirement for DWH development, starting from requirements and continuing through implementation and maintenance. Security solutions for on-line transaction processing (OLTP) systems are not appropriate for DWHs: in OLTP, security controls are applied to rows, columns, or tables, whereas a DWH must be accessed by varying numbers of users for different content, because multidimensionality is a basic principle of a DWH [1, 5].
Data extraction, transformation, cleaning, and preparation are all completed before the data are loaded into the DWH. Security concerns must be addressed at all layers of a DWH system. Moreover, DWH security cannot be ensured unless the security of the underlying operating system and the network has been addressed [6]. Various security solutions have been proposed in the DWH literature and are described below, categorized according to how they address the basic security concerns of confidentiality, integrity, and availability (CIA).

2.1 DWH Security Approaches for Confidentiality Issues
Confidentiality emphasizes protection of information from unauthorized disclosure, either by indirect logical inference or by direct retrieval [3]. To address DWH confidentiality concerns, many approaches dealing with access control have been proposed. Access-control mechanisms involve controlling both invocation and administration of the DWH and the source databases. Authentication and audit mechanisms also fall under access control and must be installed in a DWH environment.
Conventionally, DWHs have been accessed by high-level users such as business analysts and executive management. Therefore, critical access-control issues also arise at the front end of a DWH. Most DWH or OLAP vendors assume that there is no need to provide fine-grained access-control support for a DWH front end because it hinders discovery of analytical information. However, this assumption is not appropriate because many users can access analytical tools to query the DWH. Front-end DWH applications can provide both static and dynamic reporting. Imposing access control on static reports is not a problem because it can be defined on a per-report basis. For dynamic reporting, such as data-mining queries, it is difficult to provide appropriate access-control policies. This leads to the problem of data inference; for example, a user may not be authorized to obtain particular information, but may retrieve it through an aggregated query.

2.2 DWH Security Approaches for Integrity
Integrity involves data protection from accidental or malicious changes such as false data insertion, contamination, or destruction. The disadvantage of access-control mechanisms is that they do not capture inferences on data in the case of an aggregated OLAP query. Inferences on data lead to the integrity issue. For more than thirty years, inference-control approaches have been studied in statistical and census databases [7, 8, 9]. The proposed approaches can be categorized into restriction-based and perturbation-based techniques. Restriction-based inference-control techniques simply deny unsafe queries to prevent malicious inference. Perturbation techniques add noise to data, swap data, or modify the original data, and can also apply data modification to each query dynamically. The approaches presented to solve the integrity issue can be classified further as described below.

2.2.1 Restriction-based Approaches
In restriction-based inference-control techniques, the safety of a query is determined based on the maximum number of values aggregated in common by different queries [8], the minimum number of values aggregated by a query [10], and the highest rank of the matrix expressing answered queries [11].
Micro-aggregation and partitioning consider specific types of aggregation. In partitioning methods, a partition is defined on sensitive data, and a restriction is applied on a complete block of a partition for aggregate queries [12, 13]. Micro-aggregation replaces sensitive values with their cluster averages [14]. Neither method is based on dimensional hierarchies, so both may produce meaningless blocks that are not useful for users.

2.2.2 Combined Access- and Inference-Control Approaches
Access control and inference control together can provide a good solution for removing security threats. Ensuring security should not affect the usefulness of DWH and OLAP systems. Wang & Jajodia [15] proposed a three-tier security architecture for a DWH. Usually, two tiers can be found in statistical databases: sensitive data and aggregation queries. This two-tier architecture has some inherent drawbacks: inference checking during run-time query processing may result in unacceptable delays, and under this two-tier architecture inference-control techniques cannot benefit from the special characteristics of OLAP. To overcome these drawbacks, the authors defined a three-tier architecture to provide access
control between the first and second tiers and inference control between the second and third tiers.
The basic lattice-based inference method [16] can be used and implemented on the three-tier inference-control model. Two methodologies were described: the first used existing inference-control methods for statistical databases, whereas the second was designed to remove the limitations of existing inference-control methods. The authors claim that both methods could be applied on the basis of a three-tier inference-control architecture, which is more appropriate for DWH and OLAP systems specifically.

2.2.3 Modelling-based Approaches to DWH Security
The approach proposed by Triki et al. [17] provides semi-automatic inference detection at the DWH design level. It consists of three phases. The first phase identifies sensitive data from DWH schemata with the collaboration of security designers and experts in the field. In the second phase, an inference graph based on a class diagram is constructed to detect elements which may cause inferences in the future. The security designer also distinguishes between elements leading to precise and partial inferences. Precise inference means that exact information is disclosed, whereas partial inference leads only to partial disclosure of information.
The inference graph consists of a set of nodes representing the data. The nodes are connected to each other by oriented arcs representing the direction of inference and its type (partial or precise). In the third phase, DWH schemata are enriched automatically by UML annotations which flag the elements that may lead to both types of inference. The authors claimed that their approach had two advantages: independence of the data domain, and use of available data to detect inferences.

2.2.4 Data Masking and Perturbation-Based Security Approaches
Data-masking approaches can prevent data disclosure by replacing or changing original data values. Best practices for data masking are currently implemented by Oracle in its DBMS [18]. Within data masking, encryption is an advanced form of enforcing privacy. Oracle has also developed Transparent Data Encryption (TDE) in the 10g and 11g versions of its DBMS. TDE incorporates the well-known AES and 3DES encryption algorithms [19, 20].
Santos et al. [21] proposed a data-masking technique for data warehouses consisting only of numerical values. The proposed approach was based on mathematical modulus operations (division and remainder) and two simple arithmetic operations, which can be used without changing DBMS source code and user applications. They claimed that the proposed formula required low computational effort and that, as a result, query response-time overheads became relatively small while still providing an appropriate security level.

2.3 DWH Security Approaches for the Availability Issues
Data availability is of utmost importance in any DWH system. It involves recovering data from real-time corruption or incorrect modification and providing continuous 24/7 user access. Many proposed solutions replicate data so that damaged data can be restored. In this way, database downtime due to maintenance interventions can also be avoided, and query-processing efforts can be divided, avoiding data-access hotspots. Well-known RAID architectures can be used for mirroring data [22, 23] on systems where centralized servers contain the database. However, organizations have been implementing their DWHs on low-cost machines for cost-optimization purposes. RAID technology is not suitable for this kind of situation because typically only one disk drive is present.
In today's market, commercial solutions for the DWH data-availability issue are available, such as Oracle RAC [24] and Aster Data [25]. Error-correcting codes such as Hamming codes provide another way to recover corrupted data. The proposed data-storage systems make it possible to recover corrupted data blocks by using error-correcting codes, remapping bad blocks, and replicating blocks [26, 27]. Marsh & Schneider [28] proposed a technique for distributed storage that used the same features as described above plus encryption methods. Other researchers [29, 30, 31, 32, 33] have also proposed architecture assessment and self-healing methods to address the availability issue. Recently, Darwish et al. [34] have established cloud-based protocols to defend against denial-of-service attacks.

3 Discussion
A literature review of the various approaches to DWH security has been presented above. A DWH needs powerful security features in addition to its normal functionalities. The primary security requirements are summarized by the Confidentiality, Integrity and Availability (CIA) acronym. A full set of security
features can be defined under these three basic properties, such as access control, inference control, non-repudiation, authentication, authorization, and availability. The best security model is one that provides end-to-end security in all phases of a DWH, starting from modelling and continuing through implementation and maintenance. Moreover, the security model must address the three basic CIA security requirements. Some of the approaches reviewed address the confidentiality requirement. Approaches that discussed integrity issues were further classified by how they address this type of security concern. Some of the approaches also tried to address the issue of DWH availability. In short, all the proposed approaches addressed only some aspects of security, and a DWH security model is needed that covers all the security requirements and also helps in developing a secure DWH. The identified issues with security approaches in DWHs are listed below:

a) Proper identification of security policies is a highly critical starting point in implementing security in a DWH.
b) Most of the approaches used standard encryption methods and tried to provide strong data privacy. However, use of this type of encryption method makes them inefficient for DWH use. Encryption algorithms like AES and 3DES require large computational effort and have a huge impact on performance. A technique is therefore needed that provides strong data privacy with less computational effort and also maintains high performance, which is the basic requirement of DWH use.
c) A method is also needed that specifically addresses the DWH availability issue. It should improve existing data-recovery methods to repair or restore corrupted data quickly, efficiently, and effectively.
d) Evaluation methods for DWH security are also needed. None of the approaches examined addresses the issue of how one can assess the maturity level of security in a DWH.
e) Confidentiality, data integrity, and availability are also basic requirements for DWH security. A combination of the approaches discussed above could be helpful in providing a solution to this problem.
f) Most of the approaches are domain-dependent, not generic, or are somehow constraint-based.
g) A DWH security maintenance mechanism is needed that takes specific security requirements into consideration and applies them appropriately.
h) A model is needed that helps to identify security requirements automatically throughout the DWH life cycle and makes it possible to provide proper authentication. None of the existing approaches addressed this issue. The proper identification of security policies is a highly critical starting point in implementing security in a DWH.

In order to provide DWH security, the real goal is to protect data and to preserve an appropriate level of privacy. Security requirements must be considered in all layers of the system involved. No efforts have been made until now to integrate security into the complete DWH development cycle. Some approaches consider security requirements in the early stages of the DWH development life cycle. More effort has been put into logical modelling of DWH security requirements, but these approaches have not provided any tool support for implementing the modelled security requirements automatically in the target DWH system. A holistic approach to security throughout the software life cycle [35] may also benefit from a neuro-fuzzy framework [36, 37], as has been applied in other application domains.

4 Conclusion
This study has provided a literature review of existing DWH security solutions, discussing their issues and their impact on DWH scalability and performance requirements. It has become apparent that the proposed solutions are infeasible or inefficient for use in DWH environments. A DWH requires specific functionality with tight scalability and performance requirements. A complete solution is therefore needed that makes it possible to address these requirements. DWH security is an active research area of relevance to any industrial project. Further research in DWH security is needed to address the issues discussed above because many more aspects remain to be considered, and there are many open questions to be answered.
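
As an illustration of the two families of inference control surveyed in Section 2.2, the following minimal Python sketch contrasts a restriction-based check (deny an aggregate whose query set is smaller than a threshold, in the spirit of the minimum-size criterion of Section 2.2.1) with a perturbation-based answer (add random noise to the result). The table contents, the threshold `MIN_QUERY_SET_SIZE`, the noise scale, and the function names are all assumptions made for illustration; they are not taken from any of the surveyed approaches.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical fact table: (department, salary); salary is the sensitive measure.
ROWS: List[Tuple[str, int]] = [
    ("sales", 52000), ("sales", 48000), ("sales", 61000),
    ("hr", 45000), ("hr", 47000),
    ("legal", 90000),
]

MIN_QUERY_SET_SIZE = 3  # assumed threshold; a real policy would tune this


def restricted_sum(predicate: Callable[[str], bool]) -> Optional[int]:
    """Restriction-based control: deny aggregates computed over too few rows."""
    values = [salary for dept, salary in ROWS if predicate(dept)]
    if len(values) < MIN_QUERY_SET_SIZE:
        return None  # query denied: the aggregate could expose individual values
    return sum(values)


def perturbed_sum(predicate: Callable[[str], bool],
                  rng: random.Random,
                  scale: float = 1000.0) -> float:
    """Perturbation-based control: answer every query, but add bounded noise."""
    values = [salary for dept, salary in ROWS if predicate(dept)]
    return sum(values) + rng.uniform(-scale, scale)


if __name__ == "__main__":
    print(restricted_sum(lambda d: d == "sales"))   # 161000 (three rows, allowed)
    print(restricted_sum(lambda d: d == "legal"))   # None (single row, denied)
    print(perturbed_sum(lambda d: d == "legal", random.Random(0)))
```

The sketch checks only the size of each query set in isolation. It does not track overlap between successive queries or the rank of the answered-query matrix, which are the other safety criteria mentioned in Section 2.2.1, so it should be read as a starting point rather than a complete control.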
References
[1] W.H. Inmon, Building the Data Warehouse, 3rd ed., John Wiley, USA, 2002.
[2] N. Yuhanna, Your Enterprise Database Security Strategy, Forrester Research, 2010.
[3] C. Farkas and S. Jajodia, The Inference Problem: a Survey, ACM SIGKDD Explorations Newsletter, Vol. 4, Issue 2, pp. 6-11, December 2002.
[4] P. Devanbu and S. Stubblebine, Software Engineering for Security: a Road Map, Proceedings of Conference on the Future of Software Engineering, pp. 227-239, ACM Press, NY, 2000.
[5] N. Kaite, M. Stolba and A.Y. Tjoa, A Prototype Model for Data Warehouse Security Based on Metadata, International Conference of Database and Expert Systems, Vienna, pp. 300-308, IEEE Press, August 1998.
[6] E.R. Weippl, Security in Data Warehouses, Data Warehousing Design and Advanced Engineering Applications: Methods for Complex Construction, L. Bellatreche (Ed.), Chapter 15, pp. 272-27, Information Science Reference, 2010.
[7] N.M. Adam and J.C. Wortmann, Security-Control Methods for Statistical Databases: a Comparative Study, ACM Computing Surveys, Vol. 21, Issue 4, pp. 515-556, December 1989.
[8] D.E. Denning and J. Schlorer, Inference Controls for Statistical Databases, IEEE Computer, Vol. 16, Issue 7, pp. 69-82, IEEE Computer Society, 1983.
[9] L. Willenborg and T. de Waal, Statistical Disclosure Control in Practice, Springer Verlag, New York, 1996.
[10] D. Dobkin, A.K. Jones and R.J. Lipton, Secure Databases: Protection Against User Influence, ACM Transactions on Database Systems, Vol. 4, Issue 1, pp. 97-106, 1979.
[11] F.H. Chin and G. Ozsoyoglu, Auditing and Inference Control in Statistical Databases, IEEE Transactions on Software Engineering, Vol. 8, Issue 6, pp. 574-582, 1982.
[12] F.H. Chin and G. Ozsoyoglu, Statistical Database Design, ACM Transactions on Database Systems, Vol. 6, Issue 1, pp. 113-139, 1981.
[13] C.T. Yu and F.Y. Chin, A Study on the Protection of Statistical Databases, Proceedings of ACM SIGMOD International Conference on Management of Data, pp. 169-181, 1977.
[14] J.M. Mateo-Sanz and J. Domingo-Ferrer, A Method for Data-oriented Multivariate Micro Aggregation, Proceedings of Conference on Statistical Data Protection, pp. 89-99, 1998.
[15] L. Wang and S. Jajodia, Security in Data Warehouses and OLAP Systems, in Handbook of Database Security, Springer Verlag, pp. 191-212, 2008.
[16] L. Wang, S. Jajodia and D. Wijesekera, Lattice-based Inference Control in Data Cubes, in Preserving Privacy in On-Line Analytical Processing (OLAP), Springer, pp. 119-145, 2007.
[17] S. Triki, H. Ben-Abdallah, N. Harbi and O. Boussaid, Securing the Data Warehouse: a Semi-Automatic Approach for Inference Prevention at the Design Level, Model and Data Engineering, Lecture Notes in Computer Science, Vol. 6918, pp. 71-84, Springer-Verlag, 2011.
[18] Oracle Corporation, Oracle Advanced Security Transparent Data Encryption Best Practices, Oracle White Paper, July 2010.
[19] Oracle Corporation, Security and the Data Warehouse, Oracle White Paper, April 2005.
[20] Oracle Corporation, Data Masking Best Practices, Oracle White Paper, July 2010.
[21] R.J. Santos, J. Bernardino and M. Vieira, A Data Masking Technique for Data Warehouses, Proceedings of the 15th Symposium on International Database Engineering & Applications, pp. 61-69, ACM Digital Library, 2011.
[22] IBM Corporation, Understanding RAID Level 5, IBM Systems Software Information Center, 2007.
[23] IBM Corporation, Understanding RAID Level 6, IBM Systems Software Information Center, 2007.
[24] Oracle, Oracle Real Application Clusters (RAC), www.oracle.com/us/products/database/options/real-applicationclusters/index.htm, September 2010.
[25] AsterData Systems, Aster Data nCluster: Always on, for 24x7 Big Data Analytics, http://www.asterdata.com/product/alwayson.php, 2010.
[26] V. Prabhakaran, L.N. Bairavasundaram, N. Agrawal, H.S. Gunawi, A.C. Arpaci-Dusseau and R.H. Arpaci-Dusseau, IRON File Systems, International Symposium on Operating System Principles (SOSP), pp. 206-220, Brighton, UK, October 2005.
[27] K. Vijayasankar, G. Sivathanu, S. Swaminathan and E. Zadok, Exploiting Type-Awareness in a Self-Recovery Disk, Proceedings of Workshop on Storage Security and Survivability, VA, USA, pp. 25-30, October 2007.
[28] M.A. Marsh and F.B. Schneider, CODEX: a Robust and Secure Secret Distribution System, IEEE Transactions on Dependable and Secure Computing, Vol. 1, Issue 1, pp. 34-47, 2004.
[29] P. Bohannon, R. Rastogi, S. Seshadri, A. Silberschatz and S. Sudarshan, Detection and Recovery Techniques for Database Corruption, IEEE Transactions on Knowledge and Data Engineering, Vol. 15, Issue 5, pp. 1120-1136, 2003.
[30] A. Chakraborty, A.K. Majumdar and S. Sural, A Column Dependency-based Approach for Static and Dynamic Recovery of Databases from Malicious Transactions, International Journal of Information Security, Vol. 9, Issue 1, pp. 51-67, 2010.
[31] T. Chiueh and D. Pilania, Design, Implementation, and Evaluation of a Repairable Database Management System, Proceedings of 20th Annual Computer Security Applications Conference, pp. 179-188, IEEE Computer Society, 2004.
[32] P. Liu and J. Jing, Architectures for Self-healing Databases under Cyber-Attacks, International Journal of Computer Science and Network Security, Vol. 6, Issue 1B, pp. 204-215, 2006.
[33] P. Luenam and P. Liu, ODAM: An on-the-Fly Damage Assessment and Repair System for Commercial Database Applications, Proceedings of International Conference on DataBase Security (DBSec), pages 10, 2001.
[34] M. Darwish, A. Ouda and L.F. Capretz, Cloud-Based DDoS Attacks and Defenses, IEEE International Conference on Information Society (i-Society 2013), Toronto, Canada, pp. 67-71, IEEE Press, June 2013.
[35] L.F. Capretz and P.A. Lee, Reusability and Life Cycle Issues within an Object-Oriented Design Methodology, in R. Ege, M. Singh and B. Meyer (Eds.), Technology of Object-Oriented Languages and Systems, Prentice Hall, Englewood Cliffs, USA, pp. 139-150, 1992.
[36] A.B. Nassif, L.F. Capretz and D. Ho, Estimating Software Effort Based on Use Case Point Model Using Sugeno Fuzzy Inference System, 23rd IEEE International Conference on Tools with Artificial Intelligence (ICTAI), Boca Raton, Florida, USA, pp. 393-398, DOI: 10.1109/ICTAI.2011.64, IEEE Press, November 2011.
[37] F. Ahmed, L.F. Capretz and J. Samarabandu, Fuzzy Inference System for Software Product Family Process Evaluation, Information Sciences, Vol. 178, Issue 13, pp. 2780-2793, DOI: 10.1016/j.ins.2008.03.002, Elsevier, July 2008.