Data Warehouse Lab Manual
INTRODUCTION:
WEKA is an open-source Java tool created by researchers at the University of Waikato in New
Zealand. It provides many different machine learning algorithms and can be used through the
following applications:
• Explorer:
An environment for exploring data with WEKA (the rest of this documentation deals with
this application in more detail).
• Experimenter:
An environment for performing experiments and conducting statistical tests between
learning schemes.
• Knowledge Flow:
This environment supports essentially the same functions as the Explorer but with a drag-and-drop
interface. One advantage is that it supports incremental learning.
• Simple CLI:
Provides a simple command-line interface that allows direct execution of WEKA
commands for operating systems that do not provide their own command-line interface.
WEKA Explorer:
The Explorer is WEKA's main graphical interface. It provides panels for pre-processing data,
classification, clustering, association rules, attribute selection, and visualization.
PROGRAM
@RELATION STUDENT
% The @ATTRIBUTE declarations were missing; the names and types below are inferred
% from the data rows and are assumptions for illustration.
@ATTRIBUTE sno NUMERIC
@ATTRIBUTE name STRING
@ATTRIBUTE rollno NUMERIC
@ATTRIBUTE year NUMERIC
@ATTRIBUTE section {A,B}
@ATTRIBUTE mark NUMERIC
@ATTRIBUTE gender {male,female}
@ATTRIBUTE phone NUMERIC
@ATTRIBUTE fees NUMERIC
@ATTRIBUTE community {bc,mbc,sc}
@DATA
1,barath,101,2,A,100,male,456546,12000,bc
2,gayathri,102,2,A,50,female,456456,14000,bc
3,laya,103,2,A,45,female,123213,15000,bc
4,suganya,104,2,A,78,female,1321321,12000,mbc
5,harish,105,2,A,58,male,12312321,12000,mbc
6,minion,106,2,A,65,male,21665456,12000,mbc
7,prdhi,107,2,A,12,female,53546456,14000,mbc
8,ashvi,108,2,B,100,female,456456546,15000,bc
9,bansa,109,2,B,65,male,54645645,15000,sc
10,barath,110,2,B,35,male,54654654,15000,sc
11,kiruthi,111,2,B,26,female,456456,14000,bc
12,rajesh,112,2,B,55,male,45645654,12000,mbc
13,elakkiya,113,2,B,55,female,54654654,14000,sc
14,palani,114,2,B,48,male,45645645,15000,bc
15,anu,115,2,B,52,female,45645,14000,mbc
OUTPUT:
Classification – decision tree (j48)
RESULT:
Thus the classification and clustering algorithms, applied to the given training data and
tested with an unknown sample using the Weka tool, were implemented successfully.
EX.NO:2 APPLY WEKA TOOL FOR DATA VALIDATION
AIM
To work with the Weka tool for data validation in classification and clustering algorithms,
applying them to the given training data and testing with an unknown sample.
INTRODUCTION
Data validation processes check for the validity of the data. Using a set of rules, it
checks whether the data is within the acceptable values defined for the field or not. The
system ensures the inputs stick to the set rules, for instance, the type, uniqueness, format, or
consistency of the data. The data validation checks are employed by declaring integrity
rules through business rules. This enforcement of rules needs to be done at the start of the
process to ensure the system does not accept any invalid data.
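As an illustration of such rule-based checks, the following minimal pandas sketch validates range, allowed-value, and uniqueness rules before accepting records; the field names and the rules themselves are assumptions chosen for illustration, not part of the original exercise.

import pandas as pd

# Hypothetical incoming records; the field names and rules are assumptions for illustration
records = pd.DataFrame({
    "roll_no":   [101, 102, 103, 103],
    "mark":      [100, 50, -5, 45],
    "community": ["bc", "mbc", "sc", "xyz"],
})

rules = {
    "mark is within 0..100":         records["mark"].between(0, 100),
    "community is an accepted code": records["community"].isin(["bc", "mbc", "sc"]),
    "roll_no is unique":             ~records["roll_no"].duplicated(keep=False),
}

# A record is accepted only if it satisfies every rule
valid_mask = pd.concat(rules, axis=1).all(axis=1)
accepted, rejected = records[valid_mask], records[~valid_mask]
print("accepted records:\n", accepted)
print("rejected records (fail at least one rule):\n", rejected)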
PROGRAM
@RELATION WEATHER
% The @ATTRIBUTE declarations were missing; the names and types below are inferred
% from the data rows and are assumptions for illustration.
@ATTRIBUTE day STRING
@ATTRIBUTE outlook {sunny,overcast,rain}
@ATTRIBUTE temperature {hot,mild,cool}
@ATTRIBUTE humidity {high,normal}
@ATTRIBUTE wind {weak,strong}
@ATTRIBUTE play {yes,no}
@DATA
D1,sunny,hot,high,weak,no
D2,sunny,hot,high,strong,no
D3,overcast,hot,high,weak,yes
D4,rain,mild,high,weak,yes
D5,rain,cool,normal,weak,yes
D6,rain,cool,normal,strong,no
D7,overcast,cool,normal,strong,yes
D8,sunny,mild,high,weak,no
D9,sunny,cool,normal,weak,yes
D10,rain,mild,normal,weak,yes
D11,sunny,mild,normal,strong,yes
D12,overcast,mild,high,strong,yes
D13,overcast,hot,normal,weak,yes
D14,rain,mild,high,strong,no
OUTPUT:
Classification
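The J48 tree and its accuracy statistics are normally captured here from the Weka Explorer (Classify tab). Purely as an optional cross-check outside Weka, a scikit-learn sketch of the same decision-tree idea on the weather relation above could look like the following; the column names and the unknown sample are assumptions, and scikit-learn's CART tree is only an analogue of Weka's J48 (C4.5), not the same algorithm.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Re-creation of the weather relation above (outlook, temperature, humidity, wind -> play)
rows = [
    ("sunny","hot","high","weak","no"), ("sunny","hot","high","strong","no"),
    ("overcast","hot","high","weak","yes"), ("rain","mild","high","weak","yes"),
    ("rain","cool","normal","weak","yes"), ("rain","cool","normal","strong","no"),
    ("overcast","cool","normal","strong","yes"), ("sunny","mild","high","weak","no"),
    ("sunny","cool","normal","weak","yes"), ("rain","mild","normal","weak","yes"),
    ("sunny","mild","normal","strong","yes"), ("overcast","mild","high","strong","yes"),
    ("overcast","hot","normal","weak","yes"), ("rain","mild","high","strong","no"),
]
df = pd.DataFrame(rows, columns=["outlook", "temperature", "humidity", "wind", "play"])

# One-hot encode the nominal attributes and fit a decision tree
X = pd.get_dummies(df.drop(columns="play"))
clf = DecisionTreeClassifier().fit(X, df["play"])
print(export_text(clf, feature_names=list(X.columns)))

# Classify an unknown sample, e.g. a sunny/cool/high-humidity/strong-wind day (assumed test case)
sample = pd.DataFrame([("sunny", "cool", "high", "strong")],
                      columns=["outlook", "temperature", "humidity", "wind"])
print(clf.predict(pd.get_dummies(sample).reindex(columns=X.columns, fill_value=0)))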
RESULT:
Thus the classification algorithm, applied to the given training data and tested with an
unknown sample using the Weka tool, was implemented successfully.
EX.NO:3 PLAN THE ARCHITECTURE FOR REAL TIME APPLICATION
AIM:
To write and implement the architecture for a real-time application using the iris dataset.
PROCEDURE:
This dataset consists of the petal and sepal measurements of 3 different types of irises
(Setosa, Versicolour, and Virginica), stored in a 150x4 numpy.ndarray.
The rows are the samples and the columns are: Sepal Length, Sepal Width, Petal Length,
and Petal Width.
The plot below uses the first two features.
from sklearn import datasets
iris = datasets.load_iris()
PROGRAM:
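The program is typically a short Python script. A minimal sketch, assuming scikit-learn and matplotlib are installed, that loads the iris data and plots the first two features as described in the procedure:

from sklearn import datasets
import matplotlib.pyplot as plt

# Load the 150x4 iris data described above
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Scatter plot of the first two features: sepal length vs sepal width
scatter = plt.scatter(X[:, 0], X[:, 1], c=y)
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.legend(scatter.legend_elements()[0], iris.target_names.tolist(), title="Species")
plt.title("Iris dataset: first two features")
plt.show()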
OUTPUT:
RESULT:
Thus the architecture for a real-time application was planned and implemented using the
iris dataset.
EX.NO: 4 STUDY OF DATA WAREHOUSE SCHEMAS
AIM:
To study the different types of Data Warehouse schemas (Star and SnowFlake), along with
their benefits and disadvantages.
The fact table maintains one-to-many relations with all the dimension tables. Every row in a
fact table is associated with its dimension table rows with a foreign key reference.
Due to the above reason, navigation among the tables in this model is easy for querying
aggregated data. An end-user can easily understand this structure. Hence all the Business
Intelligence (BI) tools greatly support the Star schema model.
While designing star schemas the dimension tables are purposefully de-normalized. They are
wide with many attributes to store the contextual data for better analysis and reporting.
Advantages Of Star Schema
Queries use very simple joins while retrieving the data, and thereby query performance
is increased.
It is simple to retrieve data for reporting at any point of time, for any period.
Disadvantages Of Star Schema
If the requirements change frequently, modifying and reusing the existing star schema is
not recommended in the long run.
Data redundancy is higher as the tables are not hierarchically divided.
An example of a Star Schema is given below.
An end-user can request a report using Business Intelligence tools. All such requests will be
processed by creating a chain of “SELECT queries” internally. The performance of these
queries will have an impact on the report execution time.
From the above Star schema example, if a business user wants to know how many Novels
and DVDs have been sold in the state of Kerala in January in 2018, then you can apply the
query as follows on Star schema tables:
SELECT pdim.Name Product_Name,
Sum(sfact.sales_units) Quantity_Sold
FROM Sales sfact,
Product pdim,
Store sdim,
Date ddim
WHERE sfact.product_id = pdim.product_id
AND sfact.store_id = sdim.store_id
AND sfact.date_id = ddim.date_id
AND sdim.state = 'Kerala'
AND ddim.month = 1
AND ddim.year = 2018
AND pdim.Name IN ('Novels', 'DVDs')
GROUP BY pdim.Name
Results:
Product_Name Quantity_Sold
Novels 12,702
DVDs 32,919
Star schema acts as an input to design a SnowFlake schema. Snowflaking is a process that
completely normalizes all the dimension tables from a star schema.
Due to the normalized dimension tables, the ETL system has to load an increased number of
tables.
You may need complex joins to perform a query due to the number of tables added.
Hence query performance will be degraded.
Different levels of hierarchies from the above diagram can be referred to as follows:
Quarterly id, Monthly id, and Weekly ids are the new surrogate keys that are created
for Date dimension hierarchies and those have been added as foreign keys in the
Date dimension table.
State id is the new surrogate key created for Store dimension hierarchy and it has
been added as the foreign key in the Store dimension table.
Brand id is the new surrogate key created for the Product dimension hierarchy and it
has been added as the foreign key in the Product dimension table.
City id is the new surrogate key created for Customer dimension hierarchy and it has
been added as the foreign key in the Customer dimension table.
We can generate the same kind of reports for end-users as that of star schema structures with
SnowFlake schemas as well. But the queries are a bit complicated here.
From the above SnowFlake schema example, we are going to generate the same query that
we have designed during the Star schema query example.
That is if a business user wants to know how many Novels and DVDs have been sold in the
state of Kerala in January in 2018, you can apply the query as follows on SnowFlake schema
tables.
SELECT pdim.Name Product_Name, Sum(sfact.sales_units) Quantity_Sold
FROM Sales sfact
INNER JOIN Product pdim ON sfact.product_id = pdim.product_id
INNER JOIN Store sdim ON sfact.store_id = sdim.store_id
INNER JOIN State stdim ON sdim.state_id = stdim.state_id
INNER JOIN Date ddim ON sfact.date_id = ddim.date_id
INNER JOIN Month mdim ON ddim.month_id = mdim.month_id
WHERE stdim.state = 'Kerala'
AND mdim.month = 1
AND ddim.year = 2018
AND pdim.Name IN ('Novels', 'DVDs')
GROUP BY pdim.Name
Results:
Product_Name Quantity_Sold
Novels 12,702
DVDs 32,919
SELECT Clause:
The attributes specified in the select clause are shown in the query results.
The SELECT statement also uses aggregate functions to compute grouped values, and hence
a GROUP BY clause must be used after the WHERE clause.
FROM Clause:
All the essential fact tables and dimension tables have to be chosen as per the
context.
WHERE Clause:
Appropriate dimension attributes are mentioned in the where clause by joining with
the fact table attributes. Surrogate keys from the dimension tables are joined with the
respective foreign keys from the fact tables to fix the range of data to be queried.
Please refer to the star schema query example written above to understand this. You
can also filter data in the FROM clause itself if you are using inner/outer joins
there, as written in the SnowFlake schema example.
Dimension attributes are also mentioned as constraints on data in the where clause.
By filtering the data with all the above steps, appropriate data is returned for the
reports.
As per the business needs, you can add (or) remove the facts, dimensions, attributes, and
constraints to a star schema (or) SnowFlake schema query by following the above structure.
You can also add sub-queries (or) merge different query results to generate data for any
complex reports.
RESULT:
Thus the different types of Data Warehouse schemas, along with their benefits and
disadvantages, were studied.
EX.NO: 5 DESIGN DATA WAREHOUSE FOR REAL TIME
APPLICATION
AIM:
To write and implement the design of a data warehouse for a real-time application.
INTRODUCTION:
A data warehouse is a single data repository where records from multiple data
sources are integrated for online business analytical processing (OLAP). This implies a data
warehouse needs to meet the requirements from all the business stages within the entire
organization. Thus, data warehouse design is a hugely complex, lengthy, and hence error-
prone process. Furthermore, business analytical functions change over time, which results in
changes in the requirements for the systems. Therefore, data warehouse and OLAP systems
are dynamic, and the design process is continuous.
In industry, data warehouse design takes a method different from view materialization.
It sees data warehouses as database systems with particular needs, such as
answering management-related queries. The target of the design becomes how the records
from multiple data sources should be extracted, transformed, and loaded (ETL) to be
organized in a database as the data warehouse.
1. "top-down" approach
2. "bottom-up" approach
Data marts include the lowest grain data and, if needed, aggregated data too. Instead
of a normalized database for the data warehouse, a denormalized dimensional database is
adopted to meet the data delivery requirements of data warehouses. Using this method, to
use the set of data marts as the enterprise data warehouse, data marts should be built with
conformed dimensions in mind, meaning that common objects are represented the same way in
different data marts. The conformed dimensions connect the data marts to form a data
warehouse, which is generally called a virtual data warehouse.
The advantage of the "bottom-up" design approach is that it has quick ROI, as
developing a data mart, a data warehouse for a single subject, takes far less time and effort
than developing an enterprise-wide data warehouse. Also, the risk of failure is even less.
This method is inherently incremental. This method allows the project team to learn and
grow.
Compared with the top-down approach, the locations of the data warehouse and the data
marts are reversed in the bottom-up design approach.
PROGRAM:
@RELATION IRIS
% The @ATTRIBUTE declarations were missing; the names and types below are inferred
% from the data rows and are assumptions for illustration.
@ATTRIBUTE class_id NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE species {Setosa,Versicolor,Verginica}
@DATA
1,0.2,1.4,3.5,5.1, Setosa
1,0.2,1.4,3,4.9, Setosa
1,0.2,1.3,3.2,4.7, Setosa
1,0.2,1.5,3.1,4.6, Setosa
1,0.2,1.4,3.6,5, Setosa
1,0.4,1.7,3.9,5.4, Setosa
1,0.3,1.4,3.4,4.6, Setosa
1,0.2,1.5,3.4,5, Setosa
1,0.2,1.4,2.9,4.4, Setosa
1,0.1,1.5,3.1,4.9, Setosa
1,0.2,1.5,3.7,5.4, Setosa
1,0.2,1.6,3.4,4.8, Setosa
1,0.1,1.4,3,4.8, Setosa
1,0.1,1.1,3,4.3, Setosa
1,0.2,1.2,4,5.8, Setosa
1,0.4,1.5,4.4,5.7, Setosa
1,0.4,1.3,3.9,5.4, Setosa
1,0.3,1.4,3.5,5.1, Setosa
1,0.3,1.7,3.8,5.7, Setosa
1,0.3,1.5,3.8,5.1, Setosa
1,0.2,1.7,3.4,5.4, Setosa
1,0.4,1.5,3.7,5.1, Setosa
1,0.2,1,3.6,4.6, Setosa
1,0.5,1.7,3.3,5.1, Setosa
1,0.2,1.9,3.4,4.8, Setosa
1,0.2,1.6,3,5, Setosa
1,0.4,1.6,3.4,5, Setosa
1,0.2,1.5,3.5,5.2, Setosa
1,0.2,1.4,3.4,5.2, Setosa
1,0.2,1.6,3.2,4.7, Setosa
1,0.2,1.6,3.1,4.8, Setosa
1,0.4,1.5,3.4,5.4, Setosa
1,0.1,1.5,4.1,5.2, Setosa
1,0.2,1.4,4.2,5.5, Setosa
1,0.2,1.5,3.1,4.9, Setosa
1,0.2,1.2,3.2,5, Setosa
1,0.2,1.3,3.5,5.5, Setosa
1,0.1,1.4,3.6,4.9, Setosa
1,0.2,1.3,3,4.4, Setosa
1,0.2,1.5,3.4,5.1, Setosa
1,0.3,1.3,3.5,5, Setosa
1,0.3,1.3,2.3,4.5, Setosa
1,0.2,1.3,3.2,4.4, Setosa
1,0.6,1.6,3.5,5, Setosa
1,0.4,1.9,3.8,5.1, Setosa
1,0.3,1.4,3,4.8, Setosa
1,0.2,1.6,3.8,5.1, Setosa
1,0.2,1.4,3.2,4.6, Setosa
1,0.2,1.5,3.7,5.3, Setosa
1,0.2,1.4,3.3,5, Setosa
2,1.4,4.7,3.2,7, Versicolor
2,1.5,4.5,3.2,6.4, Versicolor
2,1.5,4.9,3.1,6.9, Versicolor
2,1.3,4,2.3,5.5, Versicolor
2,1.5,4.6,2.8,6.5, Versicolor
2,1.3,4.5,2.8,5.7, Versicolor
2,1.6,4.7,3.3,6.3, Versicolor
2,1,3.3,2.4,4.9, Versicolor
2,1.3,4.6,2.9,6.6, Versicolor
2,1.4,3.9,2.7,5.2, Versicolor
2,1,3.5,2,5, Versicolor
2,1.5,4.2,3,5.9, Versicolor
2,1,4,2.2,6, Versicolor
2,1.4,4.7,2.9,6.1, Versicolor
2,1.3,3.6,2.9,5.6, Versicolor
2,1.4,4.4,3.1,6.7, Versicolor
2,1.5,4.5,3,5.6, Versicolor
2,1,4.1,2.7,5.8, Versicolor
2,1.5,4.5,2.2,6.2, Versicolor
2,1.1,3.9,2.5,5.6, Versicolor
2,1.8,4.8,3.2,5.9, Versicolor
2,1.3,4,2.8,6.1, Versicolor
2,1.5,4.9,2.5,6.3, Versicolor
2,1.2,4.7,2.8,6.1, Versicolor
2,1.3,4.3,2.9,6.4, Versicolor
2,1.4,4.4,3,6.6, Versicolor
2,1.4,4.8,2.8,6.8, Versicolor
2,1.7,5,3,6.7, Versicolor
2,1.5,4.5,2.9,6, Versicolor
2,1,3.5,2.6,5.7, Versicolor
2,1.1,3.8,2.4,5.5, Versicolor
2,1,3.7,2.4,5.5, Versicolor
2,1.2,3.9,2.7,5.8, Versicolor
2,1.6,5.1,2.7,6, Versicolor
2,1.5,4.5,3,5.4, Versicolor
2,1.6,4.5,3.4,6, Versicolor
2,1.5,4.7,3.1,6.7, Versicolor
2,1.3,4.4,2.3,6.3, Versicolor
2,1.3,4.1,3,5.6, Versicolor
2,1.3,4,2.5,5.5, Versicolor
2,1.2,4.4,2.6,5.5, Versicolor
2,1.4,4.6,3,6.1, Versicolor
2,1.2,4,2.6,5.8, Versicolor
2,1,3.3,2.3,5, Versicolor
2,1.3,4.2,2.7,5.6, Versicolor
2,1.2,4.2,3,5.7, Versicolor
2,1.3,4.2,2.9,5.7, Versicolor
2,1.3,4.3,2.9,6.2, Versicolor
2,1.1,3,2.5,5.1, Versicolor
2,1.3,4.1,2.8,5.7, Versicolor
3,2.5,6,3.3,6.3, Verginica
3,1.9,5.1,2.7,5.8, Verginica
3,2.1,5.9,3,7.1, Verginica
3,1.8,5.6,2.9,6.3, Verginica
3,2.2,5.8,3,6.5, Verginica
3,2.1,6.6,3,7.6, Verginica
3,1.7,4.5,2.5,4.9, Verginica
3,1.8,6.3,2.9,7.3, Verginica
3,1.8,5.8,2.5,6.7, Verginica
3,2.5,6.1,3.6,7.2, Verginica
3,2,5.1,3.2,6.5, Verginica
3,1.9,5.3,2.7,6.4, Verginica
3,2.1,5.5,3,6.8, Verginica
3,2,5,2.5,5.7, Verginica
3,2.4,5.1,2.8,5.8, Verginica
3,2.3,5.3,3.2,6.4, Verginica
3,1.8,5.5,3,6.5, Verginica
3,2.2,6.7,3.8,7.7, Verginica
3,2.3,6.9,2.6,7.7, Verginica
3,1.5,5,2.2,6, Verginica
3,2.3,5.7,3.2,6.9, Verginica
3,2,4.9,2.8,5.6, Verginica
3,2,6.7,2.8,7.7, Verginica
3,1.8,4.9,2.7,6.3, Verginica
3,2.1,5.7,3.3,6.7, Verginica
3,1.8,6,3.2,7.2, Verginica
3,1.8,4.8,2.8,6.2, Verginica
3,1.8,4.9,3,6.1, Verginica
3,2.1,5.6,2.8,6.4, Verginica
3,1.6,5.8,3,7.2, Verginica
3,1.9,6.1,2.8,7.4, Verginica
3,2,6.4,3.8,7.9, Verginica
3,2.2,5.6,2.8,6.4, Verginica
3,1.5,5.1,2.8,6.3, Verginica
3,1.4,5.6,2.6,6.1, Verginica
3,2.3,6.1,3,7.7, Verginica
3,2.4,5.6,3.4,6.3, Verginica
3,1.8,5.5,3.1,6.4, Verginica
3,1.8,4.8,3,6, Verginica
3,2.1,5.4,3.1,6.9, Verginica
3,2.4,5.6,3.1,6.7, Verginica
3,2.3,5.1,3.1,6.9, Verginica
3,1.9,5.1,2.7,5.8, Verginica
3,2.3,5.9,3.2,6.8, Verginica
3,2.5,5.7,3.3,6.7, Verginica
3,2.3,5.2,3,6.7, Verginica
3,1.9,5,2.5,6.3, Verginica
3,2,5.2,3,6.5, Verginica
3,2.3,5.4,3.4,6.2, Verginica
3,1.8,5.1,3,5.9, Verginica
OUTPUT:
RESULT:
Thus the data warehouse for a real-time application was designed using the iris dataset.
EX.NO: 6 CASE STUDY: ANALYSE THE DIMENSIONAL MODELLING
AIM:
To write and analyse the dimensional modelling for a data warehouse.
PROCEDURE:
Dimensional modeling provides a set of methods and concepts that are used in DW
design. According to DW consultant, Ralph Kimball, dimensional modeling is a design
technique for databases intended to support end-user queries in a data warehouse. It is
oriented around understandability and performance. According to him, although transaction-
oriented ER is very useful for the transaction capture, it should be avoided for end-user
delivery.
Dimensional modeling always uses facts and dimension tables. Facts are numeric values
that can be aggregated and analyzed. Dimensions define hierarchies and descriptive
context for the fact values.
Dimension Table
Dimension table stores the attributes that describe objects in a Fact table. A Dimension table
has a primary key that uniquely identifies each dimension row. This key is used to associate
the Dimension table to a Fact table.
Dimension tables are normally de-normalized as they are not created to execute transactions
and only used to analyze data in detail.
Example
In the following dimension table, the customer dimension normally includes the name of
customers, address, customer id, gender, income group, education levels, etc.
1 Brian Edge M 2 3 4
2 Fred Smith M 3 5 1
3 Sally Jones F 1 7 3
Fact Tables
A Fact table contains numeric values that are known as measurements. A Fact table has two
types of columns − facts and foreign keys to dimension tables.
Example
Time ID  Product ID  Customer ID  Units Sold
4  17  2  1
8  21  3  2
8  4  1  1
This fact table contains foreign keys for the time dimension, product dimension, and
customer dimension, and the measurement value units sold.
Suppose a company sells products to customers. Every sale is a fact that happens within the
company, and the fact table is used to record these facts.
Common facts are − number of units sold, margin, sales revenue, etc. The dimension tables
list factors like customer, time, product, etc. by which we want to analyze the data.
Now if we consider the above Fact table and Customer dimension then there will also be a
Product and time dimension. Given this fact table and these three dimension tables, we can
ask questions like: How many watches were sold to male customers in 2010?
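Purely as an illustration, the sketch below answers that question with pandas by joining the fact table to its dimension tables. The customer and fact rows are the ones shown above; the product and time dimension contents, and all column names, are assumptions for illustration.

import pandas as pd

# Hypothetical miniature fact and dimension tables
fact_sales = pd.DataFrame({"time_id": [4, 8, 8], "product_id": [17, 21, 4],
                           "customer_id": [2, 3, 1], "units_sold": [1, 2, 1]})
dim_customer = pd.DataFrame({"customer_id": [1, 2, 3],
                             "name": ["Brian Edge", "Fred Smith", "Sally Jones"],
                             "gender": ["M", "M", "F"]})
dim_product = pd.DataFrame({"product_id": [4, 17, 21],
                            "product": ["watch", "watch", "shoes"]})
dim_time = pd.DataFrame({"time_id": [4, 8], "year": [2010, 2010]})

# "How many watches were sold to male customers in 2010?"
joined = (fact_sales
          .merge(dim_customer, on="customer_id")
          .merge(dim_product, on="product_id")
          .merge(dim_time, on="time_id"))
answer = joined.query("product == 'watch' and gender == 'M' and year == 2010")["units_sold"].sum()
print(answer)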
The functional difference between dimension tables and fact tables is that fact tables hold
the data we want to analyze and dimension tables hold the information required to allow us
to query it.
Aggregate Table
Aggregate table contains aggregated data which can be calculated by using different
aggregate functions.
An aggregate function is a function where the values of multiple rows are grouped together
as input on certain criteria to form a single value of more significant meaning or
measurement. Commonly used aggregate functions include:
Average()
Count()
Maximum()
Median()
Minimum()
Mode()
Sum()
These aggregate tables are used for performance optimization to run complex queries in a
data warehouse.
Example
Suppose you save tables with aggregated data at yearly (1 row), quarterly (4 rows), and
monthly (12 rows) granularity and then have to compare data: at the yearly level only 1 row
will be processed, whereas in an un-aggregated table all the detail rows will be processed.
Select Avg(salary) from employee where title = 'Developer'. This statement will return the
average salary of all employees whose title is 'Developer'.
Aggregations can be applied at database level. You can create aggregates and save them in
aggregate tables in the database or you can apply aggregate on the fly at the report level.
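As a rough sketch of how such an aggregate table could be built during ETL, the following uses pandas; the detail table and its column names are assumed for illustration.

import pandas as pd

# Hypothetical daily sales detail table
detail = pd.DataFrame({
    "sale_date": pd.to_datetime(["2018-01-05", "2018-01-20", "2018-02-11", "2018-04-03"]),
    "sales_units": [10, 5, 8, 12],
})

# Monthly aggregate table: one row per month instead of one row per sale
monthly_agg = (detail
               .groupby(detail["sale_date"].dt.to_period("M"))["sales_units"]
               .sum()
               .reset_index(name="total_units"))
print(monthly_agg)

# A yearly comparison now processes far fewer rows than the raw detail table
yearly_agg = detail.groupby(detail["sale_date"].dt.year)["sales_units"].sum()
print(yearly_agg)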
RESULT:
Thus the dimensional modelling for a data warehouse was analysed.
AIM:
To write and analyse the OLAP operations.
INTRODUCTION:
Relational OLAP
ROLAP servers are placed between relational back-end server and client front-end tools. To
store and manage warehouse data, ROLAP uses relational or extended-relational DBMS.
Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional views of data.
With multidimensional data stores, the storage utilization may be low if the data set is sparse.
Hybrid OLAP
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of
ROLAP and the faster computation of MOLAP. HOLAP servers allow storing large volumes of
detailed data, while the aggregations are stored separately in a MOLAP store.
Specialized SQL servers provide advanced query language and query processing support for
SQL queries over star and snowflake schemas in a read-only environment.
OLAP Operations
Since OLAP servers are based on multidimensional view of data, we will discuss OLAP
operations in multidimensional data.
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube, either by climbing up a concept hierarchy for a
dimension or by dimension reduction.
Drill-down
Drill-down is the reverse of roll-up; it navigates from less detailed data to more detailed
data, either by stepping down a concept hierarchy or by introducing a new dimension.
Slice
The slice operation selects one particular dimension from a given cube and provides a new
sub-cube. Consider the following diagram that shows how slice works.
Here Slice is performed for the dimension "time" using the criterion time = "Q1".
It will form a new sub-cube by selecting one or more dimensions.
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Consider the following diagram that shows the dice operation.
The dice operation on the cube based on the following selection criteria involves three
dimensions.
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to
provide an alternative presentation of data. Consider the following diagram that shows the
pivot operation.
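To make these operations concrete, a small pandas sketch is given below with an assumed toy table of quarter, city, item, and sales; the data values and column names are illustrative assumptions only.

import pandas as pd

# Hypothetical data cube held as a flat table: quarter x city x item -> sales
cube = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q3", "Q4", "Q4"],
    "city":    ["Chennai", "Delhi", "Delhi", "Chennai", "Delhi", "Chennai", "Delhi", "Chennai"],
    "item":    ["mobile", "mobile", "modem", "modem", "mobile", "mobile", "modem", "modem"],
    "sales":   [605, 825, 14, 30, 680, 400, 20, 35],
})

# Roll-up: aggregate the city dimension away (dimension reduction up the location hierarchy)
rollup = cube.groupby(["quarter", "item"])["sales"].sum()

# Slice: fix one dimension, e.g. time = "Q1", producing a sub-cube
slice_q1 = cube[cube["quarter"] == "Q1"]

# Dice: select on two or more dimensions at once
dice = cube[cube["quarter"].isin(["Q1", "Q2"]) & cube["item"].eq("mobile")]

# Pivot: rotate the axes to get an alternative presentation of the same data
pivoted = cube.pivot_table(index="item", columns="quarter", values="sales", aggfunc="sum")

print(rollup, slice_q1, dice, pivoted, sep="\n\n")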
RESULT:
Thus the OLAP operations (roll-up, drill-down, slice, dice, and pivot) were analysed.
AIM:
To study OLTP (Online Transaction Processing) systems and compare them with OLAP.
INTRODUCTION:
An OLTP system captures and processes an organization's day-to-day transactional data.
For example, the POS (point of sale) system of any supermarket is an OLTP system.
Every industry in today's world uses OLTP systems to record their transactional data. The
main concern of OLTP systems is to enter, store, and retrieve the data. They cover all
day-to-day operations of an organization, such as purchasing, manufacturing, payroll, and
accounting. Such systems have large numbers of users who conduct short transactions. They
support simple database queries, so the response time of any user action is very fast.
The data acquired through an OLTP system is stored in commercial RDBMS, which can
be used by an OLAP System for data analytics and other business intelligence operations.
Some other examples of OLTP systems include order entry, retail sales, and financial
transaction systems.
OLTP vs OLAP: OLTP databases are backed up religiously, whereas OLAP systems need only
periodic backups.
RESULT:
Thus the OLTP system was studied and compared with OLAP.
AIM:
To study the goals of Data Warehouse testing, ETL testing responsibilities, errors in the
DW, and ETL deployment in detail.
In general, a defect found at the later stages of the software development life cycle costs
more to fix that defect. This situation in the DW can be worsened because the wrong data
found at the later stages might have been used in important business decisions by that time.
Thus, the fix in the DW is more expensive in terms of process, people and technology
changes. You can begin the DW testing right from the requirements gathering phase.
A requirement traceability matrix is prepared & reviewed, and this mainly maps the DW
features with their respective business requirements. The traceability matrix acts as an input
to the DW test plan that is prepared by the testers. The test plan describes the tests to be
performed to validate the DW system.
It also describes the types of tests that will be performed on the system. After the test plan is
ready all the detailed test cases will be prepared for various DW scenarios. Then all the test
cases will be executed and defects will be logged.
There is a standard in the operational world that maintains different environments for
development, testing, and production. In the DW world, both the developers and testers will
make sure that the development and test environments are available with the replica of
production data before starting their work.
This is copied for a list of tables with limited or full data depending on the project needs, as
the production data is really large. The developers develop their code in the developer’s
environment and deliver it to the testers.
The testers will test the code delivered in the testing environments to ensure if all the
systems are working. Then the code will go live in the production environments. The DW
code is also maintained in different versions based on the defects fixed in each release.
Maintaining multiple environments and code versions helps to build a good quality system.
#1) Data Completeness: Ensure that all data from various sources is loaded into a Data
Warehouse. The testing team validates if all the DW records are loaded, against the source
database and flat files by following the below sample strategies.
The total count of records uploaded from the source system should match the total
number of records loaded into the DW. If there is a difference, the rejected records
should be investigated (a sketch of such a check is given below).
Compare the data loaded into each field of DW with the source system data fields.
This will bring out the data errors if any.
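For instance, the count reconciliation and field comparison could be sketched as follows; the tiny inline tables and the customer_id/email fields are assumptions for illustration only.

import pandas as pd

# Hypothetical source-system extract and loaded DW table (tiny inline samples)
source = pd.DataFrame({"customer_id": [1, 2, 3], "email": ["a@x.com", "b@x.com", "c@x.com"]})
dw     = pd.DataFrame({"customer_id": [1, 2],    "email": ["a@x.com", "B@x.com"]})

# Record-count reconciliation: a difference points at rejected or missing records
print("source rows:", len(source), "| DW rows:", len(dw), "| difference:", len(source) - len(dw))

# Field-by-field comparison on the common key, to surface data errors
merged = source.merge(dw, on="customer_id", suffixes=("_src", "_dw"))
mismatches = merged[merged["email_src"] != merged["email_dw"]]
print("rows where the email field differs from the source:\n", mismatches)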
#2) Data Transformation: While uploading the source data to the Data Warehouse, a few
fields can be directly loaded with the source data, but other fields are loaded with data
that is transformed as per the business logic. This is the complex portion of testing the DW
(ETL).
#3) Data Quality: Data warehouse (ETL) system must ensure the quality of the data loaded
into it by rejecting (or) correcting the data.
DW may reject a few of the source system data based on the business requirements
logic. For Example, reject a record if a certain field has non-numeric data. All the rejected
records are loaded into the reject table for reference.
The rejected data is reported to the clients because otherwise there is no way of knowing
about this missed data, as it will not be loaded into the DW system. The DW may also correct
the data, for example by loading zero in place of null values (a sketch of this
reject-and-correct flow is given below).
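A minimal sketch of this reject-and-correct behaviour is given below; the staging table and its field names are assumptions, not the actual ETL code.

import pandas as pd

# Hypothetical staging records arriving from the source system
staging = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age":         ["34", "abc", "29", None],   # "abc" is non-numeric bad data
    "balance":     [120.0, 50.0, None, 75.0],
})

# Reject rule: age must be numeric; rejected rows go to a reject table for reference
age_numeric = pd.to_numeric(staging["age"], errors="coerce")
reject_table = staging[age_numeric.isna() & staging["age"].notna()]   # non-numeric values only
loaded = staging.drop(reject_table.index).copy()

# Correction rule: load zero in place of null values
loaded["balance"] = loaded["balance"].fillna(0)
loaded["age"] = pd.to_numeric(loaded["age"], errors="coerce").fillna(0)

print("rejected records:\n", reject_table)
print("records loaded into the DW:\n", loaded)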
#4) Scalability and Performance: Data warehouse must ensure the scalability of the system
with increasing loads. With this, there should not be any degradation in the performance
while executing the queries, and the anticipated results should be returned within specific
time frames. Thus performance testing uncovers any issues so that they can be fixed before
production.
Below are sample strategies for Performance and Scalability Testing:
Do the performance testing by loading production volumes of data and ensure that
the time frames are not missed.
Validate the performance of each query with bulk data. Test the performance by
using simple joins and multiple joins.
Load double (or) triple the volumes of data expected, to approximately calculate the
capacity of the system.
Test by running jobs for all the listed reports at the same time.
#5) Integration Testing: Data warehouse should perform Integration Testing with other
upstream and downstream applications. If possible, it is better to copy the production data
into the test environment for Integration Testing.
All system teams should be involved in this phase to bridge the gaps while understanding
and testing all the systems together.
#6) Unit Testing: This is performed by the individual developers on their deliverables.
Developers will prepare unit test scenarios based on their understanding of the requirements,
run the unit tests and document the results. This helps the developers to fix any bugs if
found, before delivering the code to the testing team.
#7) Regression Testing: Validates that the DW system is not malfunctioning after fixing
any defects. This is performed many times with every new code change.
#8) User Acceptance Testing: This testing is performed by business users to validate
system functionality. UAT environment is different from the QA environment. The sign off
from UAT implies that we are ready to move the code to production.
From the Data Warehouse and Business Intelligence system perspective, business users can
validate various reports through a User Interface (UI). They can validate the report
specifications against the requirements, can validate the correctness of data in the reports,
can validate how quickly the system is returning the results, etc.
Enlisted below are the various teams involved in delivering a successful DW system:
Business Analysts: Gather all the business requirements for the system and
document them for everyone's reference.
Infrastructure Team: Set up various environments as required for both developers
and testers.
Developers: Develop ETL code as per the requirements and perform unit tests.
QA (Quality Assurance)/Testers: Develop the test plan, test cases, etc., identify
defects in the system by executing the test cases, and perform various levels of testing.
DBAs: DBAs take charge of converting logical ETL database scenarios into physical
ETL database scenarios and are also involved in performance testing.
Business Users: Involve in User Acceptance Testing, run queries and reports on DW
tables.
Errors In Data Warehouse
When you are Extracting, Transforming and Loading (ETL) data from multiple sources
there are chances that you will get bad data that may abort the long-running jobs.
ETL Deployment
The deployment documentation tells others about the sequence of jobs to run and the failure
recovery scenarios, and provides training materials to the DW support teams to monitor the
system after deployment and to the administrative support team to execute the reports.
RESULT:
Thus the goals of Data Warehouse testing, ETL testing responsibilities, errors in the
DW, and ETL deployment were studied in detail.