Overview of Data Warehousing and OLAP
Payal Wankhede, Aswini Chinnenahalli Siddareddy, Divyasri Gundala, Varasri Boddupalli, Juhi
Parvanda
● Virtual Warehouse
● Data mart
● Enterprise Warehouse
Virtual Warehouse
● A virtual warehouse is a set of views over operational
databases.
● It is easy to build a virtual warehouse.
● Building a virtual warehouse requires excess capacity
on operational database servers.
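As a minimal sketch of this idea, the snippet below defines a view over an operational table using Python's built-in sqlite3 module. The table `orders`, the view `sales_by_region` and the sample rows are hypothetical and serve only as an illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical operational table; in practice this would live on the operational server.
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "East", 100.0), (2, "West", 80.0), (3, "East", 20.0)],
)

# The "virtual warehouse" is only a view: no data is copied or stored separately.
conn.execute(
    "CREATE VIEW sales_by_region AS "
    "SELECT region, SUM(amount) AS total FROM orders GROUP BY region"
)

def query_view(connection):
    """Every query against the view is executed on the operational table itself."""
    return connection.execute(
        "SELECT region, total FROM sales_by_region ORDER BY region"
    ).fetchall()

before = query_view(conn)
conn.execute("INSERT INTO orders VALUES (4, 'West', 10.0)")
after = query_view(conn)  # the view reflects the new operational row immediately
```

Because each query on the view is evaluated against the operational table, analytical queries compete with transactional work, which is why excess capacity on the operational servers is required.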
Data Mart
A data mart contains a subset of organization-wide data. This subset is valuable to specific groups within an organization.
Points to remember about data marts:
● The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks rather than months or years.
● Data marts are small in size.
● Data marts are customized by department.
● The source of a data mart is a departmentally structured data warehouse.
● Data marts are flexible.
Enterprise Warehouse
● An enterprise warehouse collects all the information and the subjects spanning an entire organization.
● It provides enterprise-wide data integration.
● The data is integrated from operational systems and external information providers.
● This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or beyond.

Fig 1: Data Warehouse architecture
As we can see, the first layer is the Data Source layer, which refers to various data stores in multiple formats such as relational databases, Excel files and others. These stores can consist of different types of data: operational data including business data like Sales, Customer, Finance, Product and others, web server logs, Internet research data and data relating to third parties, such as census and survey data.
The next step is Extract, where the data from the data sources is extracted and put into the warehouse staging area. The extracted data is minimally cleaned with no major transformations.
Then comes the Staging area, which is divided into two stages: data cleaning and data ordering. As the name suggests, this layer takes care of data processing methods, i.e. cleaning (removing data redundancy, filtering bad data) and ordering (allowing proper integration) of data. Overall, this stage allows the application of business intelligence logic to transform transactional data into analytical data. It is the most time-consuming phase in the whole DWH architecture and is the chief process between the data source and presentation layers of the DWH.
Finally, we have the Data Presentation layer, which is the target data warehouse: the place where the successfully cleaned, integrated, transformed and ordered data is stored in a multi-dimensional environment. Now the data is available for analysis and query purposes. The information is also available to end-users in the form of data marts.

III. DATA MODELLING FOR DATA WAREHOUSES
“A data model is a graphical view of data created for analysis and design purposes”. There are three levels of data modelling: the conceptual data model, the logical data model and the physical data model. The figures below show the conceptual, logical, and physical levels of a single data model.
A. Conceptual data model
This model helps in identifying the highest-level relationships among the different entities.

We always start with the conceptual model so that we can better understand the entities in our data and how they relate to each other. Next, we move on to the logical model to understand the details of the data, and finally we look into the physical model to know how to implement our data model in a database.
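As a rough illustration of that progression, the sketch below expresses one small model at all three levels. The entities, attributes and table names are hypothetical and are not taken from the paper's figures; SQLite merely stands in for a specific persistence technology.

```python
import sqlite3

# Conceptual level: only entities and their relationships (hypothetical example).
conceptual = {
    "entities": ["Customer", "Order"],
    "relationships": [("Customer", "places", "Order")],
}

# Logical level: attributes and keys are added, still independent of any database product.
logical = {
    "Customer": {"attributes": ["customer_id", "name"], "primary_key": "customer_id"},
    "Order": {
        "attributes": ["order_id", "customer_id", "amount"],
        "primary_key": "order_id",
        "foreign_keys": {"customer_id": "Customer"},
    },
}

# Physical level: concrete tables, column types and constraints for one specific database.
physical_ddl = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    amount      REAL NOT NULL
);
"""

def build_physical_model() -> list[str]:
    """Create the physical tables in an in-memory database and return their names."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(physical_ddl)
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]
```

Each level adds detail to the previous one: the conceptual model names what exists, the logical model says what each entity holds, and the physical model fixes how it is stored.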
TABLE I
COMPARISON OF DIFFERENT LEVELS OF DATA MODEL
Fig 3: A logical data model
C. Physical data model
This is a fully-attributed data model which denotes how the model will be built in the database and is dependent on a specific version of a data persistence technology.
E. Dimensional model
Data warehouses combine several different data sources in multidimensional structures in support of the decision-making process. Traditional databases generally deal with two-dimensional data, similar to a spreadsheet. Compared to the two-dimensional model, the multi-dimensional data storage model is much more efficient in query performance. Examples of dimensions used in a corporate data warehouse include fiscal periods, product categories and geographic regions.
The figures below show examples of the two-dimensional and multidimensional models.
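To make the contrast between two-dimensional and multidimensional data concrete, the following minimal sketch stores one measure keyed by three dimensions (fiscal period, product category, geographic region) and aggregates it along any subset of them. The sales figures and the helper names `roll_up` and `slice_cube` are invented for illustration.

```python
from collections import defaultdict

# Each fact is addressed by three dimensions: (fiscal period, product category, region).
# The figures are invented for illustration only.
sales_cube = {
    ("Q1", "Laptops", "East"): 100.0,
    ("Q1", "Laptops", "West"): 80.0,
    ("Q1", "Phones", "East"): 50.0,
    ("Q2", "Laptops", "East"): 120.0,
    ("Q2", "Phones", "West"): 70.0,
}

DIMENSIONS = ("period", "category", "region")

def roll_up(cube, keep):
    """Aggregate the cube, keeping only the dimensions named in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(float)
    for key, value in cube.items():
        totals[tuple(key[i] for i in positions)] += value
    return dict(totals)

def slice_cube(cube, dimension, member):
    """Return the sub-cube where one dimension is fixed to a single member."""
    pos = DIMENSIONS.index(dimension)
    return {key: value for key, value in cube.items() if key[pos] == member}

by_period = roll_up(sales_cube, ["period"])
east_by_category = roll_up(slice_cube(sales_cube, "region", "East"), ["category"])
```

A spreadsheet-style table can show only two of these dimensions at once; the cube representation keeps all three and lets each query choose which ones to keep.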
or fact tables.
Major Building Blocks of Data Warehouse
A. Extraction, Transformation and Loading
Data Extraction
The process of extracting data from distributed applications in business and departmental units across the organization and importing it into the data warehouse is called ETL (Extraction, Transformation and Loading). The initial step of the procedure is the extraction of data from operational information sources. These data sources are normally databases; however, information is sometimes stored in flat files or XML documents.

Data Extraction Strategies
● Full Extraction
● Partial Extraction with update notification
● Partial Extraction without update notification

Data Transformation
The transformation procedure requires the conversion and standardization of information. This procedure can be automated with ETL software, which supports applying functions and a series of rules to the extracted data. The rules installed in the ETL software ensure that the data is in the correct format and error free. This process of transformation is also called cleansing.

Data extracted to the server is raw data and cannot be used as it is; it must be cleansed, mapped and transformed. The transformation tasks to be performed are selection, matching, data cleansing and consolidation. The process of returning cleaned data to the source is called backflushing.

Data Loading
Data loading fetches the prepared data, applies it to the data warehouse and stores it in the database.

Types of Loading
Initial Load: Populates all the data warehouse tables for the first time.
Incremental Load: Applies ongoing changes as necessary in a periodic manner.
Refresh Data: Completely erases the contents of one or more tables and reloads them with fresh data.

B. Storing data in Data warehouse
● Storing the data according to the data model of the warehouse.
● Creating and maintaining required data structures.
● Creating and maintaining appropriate access paths. Constructing the warehouse initially is simple, but the sheer volume of data in the warehouse generally makes it impossible to reload the warehouse entirely later on.
● Providing for time-variant data as new data are added. Data may come from different systems, language areas and time zones.
● Supporting the updating of warehouse data.
● Refreshing the data. Alternatives include selective (partial) refreshing of data and separate warehouse versions (which requires double storage capacity for the warehouse).
● Purging data. Data may need to be purged periodically.
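The three types of loading (initial load, incremental load, refresh) can be sketched as follows. This is only an illustration under stated assumptions: the `sales` table, its columns and the sample rows are hypothetical, and SQLite's `INSERT OR REPLACE` is used as one simple way to apply changed rows.

```python
import sqlite3

def initial_load(conn, rows):
    """Initial load: create the (hypothetical) warehouse table and populate it for the first time."""
    conn.execute(
        "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

def incremental_load(conn, changed_rows):
    """Incremental load: apply only new or changed rows, replacing rows whose key already exists."""
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", changed_rows)

def refresh(conn, rows):
    """Refresh: erase the table contents completely and reload with fresh data."""
    conn.execute("DELETE FROM sales")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

def snapshot(conn):
    """Return the current table contents ordered by key."""
    return conn.execute(
        "SELECT sale_id, region, amount FROM sales ORDER BY sale_id"
    ).fetchall()

conn = sqlite3.connect(":memory:")
initial_load(conn, [(1, "East", 100.0), (2, "West", 80.0)])
after_initial = snapshot(conn)
incremental_load(conn, [(2, "West", 95.0), (3, "East", 40.0)])
after_incremental = snapshot(conn)
refresh(conn, [(10, "North", 5.0)])
after_refresh = snapshot(conn)
conn.close()
```

The incremental load touches only the rows that changed, which is why it is the usual choice once the warehouse is too large to reload in full.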
C. Data Warehouse Design Considerations
Usage projections: Before designing the warehouse, anticipate who will use it and how they will use it.
The fit of the data model: The data comes from various operational sources, all of which should be represented in the data model.
Modular component design: Modular design is a practical necessity to allow the warehouse to evolve with the organization and its information environment.
Design for manageability and change: A well-built data warehouse should be designed for maintainability, enabling the warehouse managers to plan, change, manage and provide optimal support to users.
Distributed warehouse: A distributed data warehouse deals with issues related to distributed databases. A distributed architecture can provide benefits particularly important to warehouse performance, such as improved load balancing, scalability of performance and higher availability.
Federated warehouse: A federated warehouse is a set of autonomous data warehouses, each with its own repository.
Metadata component: The metadata repository is a key data warehouse component which includes both technical and business data.
Technical data: Technical data covers details of acquisition, processing, storage structures, data descriptions, warehouse operations, maintenance and access support.
Business data: Business data includes the relevant business rules and organizational details supporting the warehouse.

TABLE II
COMPARISON OF TWO APPROACHES
                            Inmon               Kimball
Building a data warehouse   Time consuming      Takes less time
Maintenance                 Easy                Difficult
Cost                        High initial cost   Low initial cost
Time                        High initial time   Shorter time for initial setup
Skill requirement           Specialist team     Generalist team
Integration requirements    Enterprise-wide     Individual business areas
Recent Research
The table below compares three major OLAP systems: MOLAP, ROLAP and HOLAP.
TABLE III
COMPARISON OF MOLAP, ROLAP, HOLAP SYSTEMS

VI. DATA WAREHOUSE Vs. VIEWS
A data warehouse consists of data extracted from different data sources, whereas views are just virtual tables that consist of data extracted from a table or a set of tables in a database.

Views and data warehouses are similar in that they both hold read-only extracts from databases. Thus, data in either cannot be edited or updated. Many people believe that data warehouses are extensions of views, but in reality, views only provide a subset of the functions and capabilities of data warehouses. [1]

However, data warehouses are different from views in the following ways:
● Data warehouses exist as persistent storage instead of being materialized on demand, whereas views are virtual tables and do not occupy space on disk or in memory.
● Data warehouses are not just relational, but rather multi-dimensional with multiple levels of aggregation, whereas views are relational.
● Data warehouses can be indexed for optimal performance. Views cannot be indexed directly.
● Data warehouses provide specific support of functionality; views cannot.
● Data warehouses deal with large volumes of integrated data that is generally contained in more than one database, whereas views are an extract of a database.
● Data warehouses bring in data periodically from multiple sources via a complex ETL process, whereas views are an extract from a database through a predefined query. [1]

VII. DIFFICULTIES OF IMPLEMENTING DATA WAREHOUSE
Challenges that need to be considered before building a data warehouse are as follows:

1. Operational Challenge
a. Construction:
● Lead time is huge in building a data warehouse.
● Potentially it takes years to build, implement and efficiently maintain a data warehouse. [1]

b. Administration:
● The administration of a data warehouse is proportional to the size and complexity of the warehouse.
● On updating the source database, the warehouse's schema and components must be able to handle these changes. [1]

c. Quality Control:
● Both quality and consistency of data as well as data management are major concerns.
● Melding data from heterogeneous and disparate sources is a major challenge given differences in naming, domain definitions, identification numbers, and the like. [1]

2. Quality Assurance
● The end user of data warehousing who is using Big Data reporting will expect 100% accuracy in data.
● This requires testing to be a higher priority, which consequently requires a lot of resources. [13]

3. Performance
● The initial overall design must be carefully thought out to provide a stable foundation from which to start. [13]

4. User Acceptance
● People are not keen on changing their daily routine, especially if the new process is not intuitive. [13]

5. Cost
● All the above-mentioned factors ultimately increase computational cost.

RECENT RESEARCH:
Literature Review of Issues in Data Warehousing and OLTP, OLAP Technology

Issues and problems discussed:
● Storing historical data: A data warehouse contains very old data in its repository. Maintaining such a volume of data is difficult.
● Storing transactional data: There is a large number of transactions per day in any organization. Managing this daily transaction data is, again, an overhead.
● Mismatch in data type of data: Merging incoming data from different data sources leads to data type mismatch issues.
● Costing problem: Managing data, security, and resources leads to a large computational cost.
● Representation of data to user: Dashboards and reports generated for data analysis from the data warehouse have to be modified to make them friendlier to (non-technical) users.
● Data profiling: Data profiling is about matching the patterns and formats of incoming data against the stored data and the displayed data.
● Real time data feed and access: This issue mostly concerns storing current data that has not been properly analyzed, which can make it difficult to forecast information.
● Data backtracking: In such a huge data warehouse, it is difficult to track changes in the data back through the repository. [14]

3. Data movement:
● There exist potential security implications while moving the data.
● When the data is loaded into the data warehouse, the following questions are raised:
● Where will the data file be stored?
● Who has access to that disk space?