Model For Data Migration
A noticeable change is that the biggest complaint recorded is nearly always query speed, according to Barney Finucane, a BARC analyst and lead author of the BI Survey. BARC has been conducting the survey of BI end users since 2001; this year's survey included responses from nearly 2,200 end users, or consultants responding on behalf of end users, most from Europe and North America. On the impact of data quality on business domains, the survey emphasizes that bad data means big business intelligence problems. The International Association for Information and Data Quality (IAIDQ) was established in 2004 to provide a focal point for professionals and researchers in this field [5]. The IAIDQ completed a survey in 2010 whose full report shows the importance of data quality in the present business world. Manjunath T. N. et al. [2011] highlighted different data quality dimensions at different stages of the data migration or ETL cycle. No literature was found on developing a uniform data quality assessment model that uses Key Performance Indicators (KPIs) and a decision tree method to assess the data quality of a data migration in a business enterprise.
III. DATA QUALITY ASSESSMENT FRAMEWORK AND ALGORITHM
After the data migration / ETL process from the legacy system to the data warehouse or target system, the data quality of the target system must be assessed with respect to the underlying data. The author derives the data quality characteristics in terms of KPIs (Key Performance Indicators), computes the KPI values for the target system, compares the computed KPIs with threshold values, constructs a decision tree to predict the data quality of the target system, and gives feedback on the data quality to the end user [6] [7].
Data quality measurement is necessary for any business decision; Figure-1 describes the components used in the framework. Accuracy is a measure of the degree to which data agrees with the data contained in an original source. Validity is a measure of the degree of conformance of data values to their domain and business rules. Derivation integrity is the correctness with which two or more pieces of data are combined to create new data. Completeness is the characteristic of having all required values for the data fields. Timeliness is the relative availability of data to support a given process within the timetable required to perform the process [4] [5]. Figure-2 presents the algorithm for data quality assessment.
Algorithm: Data Quality Assessment Framework
Begin algorithm
  for each table in the source database SD (1...N)
    for each attribute Aij in SD (i rows, j columns)
      for each record in the target database TD (1...M)
        for each attribute Bij in TD (i rows, j columns)
          Evaluate the DQ KPIs:
          for i = 1 to n (where n = number of entities in D)
            Begin
              Accuracy()
              Derivation_Integrity()
              Validity()
              Completeness()
              Timeliness()
              Accessibility()
              Consistency()
            End
          Store all DQ KPIs in a temporary table
        end for
      end for
    end for
  end for
  Construct the decision tree DT from the different KPIs
  If DT is acceptable, accept the target data for decision making
  Else, flag the target data as not acceptable for decision making
End algorithm
Figure-2: Algorithm for data quality assessment framework
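As a hedged illustration, the Python sketch below shows one possible realization of the Figure-2 loop: it evaluates placeholder KPI functions for every target table, stores the results in a temporary structure, and compares them with acceptance thresholds. The function names and data layout are assumptions made for demonstration; the paper does not prescribe an implementation.

# Sketch only: an assumed realization of the Figure-2 assessment loop.
# The KPI entries are placeholders for the formulas of Section IV.

def evaluate_kpis(table_name, records):
    # Each entry would hold the computed value of the corresponding KPI for this table.
    return {
        "table": table_name,
        "accuracy": None,
        "derivation_integrity": None,
        "validity": None,
        "completeness": None,
        "timeliness": None,
        "accessibility": None,
        "consistency": None,
    }

def assess_target(target_tables, thresholds):
    """Evaluate the DQ KPIs for every target table, store them in a 'temp table' (a list),
    and flag KPIs that fall below their acceptance thresholds (cf. the training set of Figure-8)."""
    temp_table = []
    for name, records in target_tables.items():
        kpi_row = evaluate_kpis(name, records)
        kpi_row["acceptable"] = {
            k: (v is not None and v >= thresholds.get(k, 0.0))
            for k, v in kpi_row.items() if k != "table"
        }
        temp_table.append(kpi_row)
    return temp_table   # these rows then feed the decision-tree step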
IV. DATA QUALITY COMPUTATION CASE STUDY: LOAN REPAYMENT DATA MART
This section gives a concise description of the loan repayment data mart. A leading bank has operations in all the states across the country. The bank offers a wide range of banking products and financial services to corporate and retail customers through a variety of delivery channels and through its specialized subsidiaries and affiliates in the areas of investment banking, venture capital and asset management. The bank has ERP systems and legacy systems for capturing transaction data, and has decided to implement a data warehousing solution to address the loan repayment business need. Figure-3 describes the complete business requirements for performing the ETL process / data migration from the legacy system to the data mart. Figure-4 gives the multi-dimensional model with a star schema.
4.1 Typical Requirements Specification for legacy Data system
The bank wants to design a data mart that meets the following requirements; it should be able to view the following information from the data mart.
1. List of defaulters
2. List of customers who have repaid the loan amount completely.
3. List of customers who have opted for partial prepayment of loan.
4. List of customers who have opted for full prepayment of loan.
5. List of customers who have completed 25 percent of their loan.
4.2 Design: Loan Repayment Data Mart
The data mart will have the following tables.
1. Loan product
2. Customer
3. Branch
4. Payment mode
5. Time
4.3 Mapping Document with typical business rules
Field Name | Data Type | Description | Business Rules
Actual_date_payment | Date type | Date when the loan installment (EMI) is actually paid by the customer | Entered in dd-mm-yyyy format
Address | Text | The address of the customer | Maximum characters: 40; can also include special characters
Balance_loan_amount | Currency | The outstanding loan amount for the customer | Can take only integers (no decimals); values from Re 1 to Rs 10 million
Branch | Text | The name of the branch from where the loan is availed | Can take only values mentioned in the drop-down box
Branch_id | Text | The unique id given to the branch | Can take only values mentioned in the drop-down box
City | Text | The city of the customer | Maximum characters: 20; cannot include integers or special characters
Credit_rating | Text | Credit rating given to the customer | Can take only values mentioned in the drop-down box; four values: A, B, C and D
Customer_id | Text | The unique customer id allotted to the customer | Unique customer id allotted to the customer; same for all banking
Delayed_payment_days | Number | The payment delay (in number of days); equal to "Actual_date_payment" minus "Due_date_payment" | Must be an integer
Description | Text | Loan description (Housing loan, Vehicle loan, Personal loan) | Can take only values mentioned in the drop-down box
DOB | Date type | Date of birth of the customer | Entered in dd-mm-yyyy format
Due_date_payment | Date type | Date when the loan installment (EMI) needs to be paid by the customer | Entered in dd-mm-yyyy format
Duration (months) | Number | The total number of months for which the loan has been sanctioned | Should be between 12 and 240 months; must be an integer
Eff_date | Date type | The date from which the revised interest rate is effective |
Equated_Monthly_Installment | Currency | The monthly installment which the customer has to pay every month | Calculated amount; decimal values are rounded off
Exception_id | Number | The id for all the exceptions which can happen during the tenure of the loan |
First_name | Text | The first name of the customer | Maximum characters: 20; cannot include integers or special characters
Fulldate | Date type | The date in dd-mm-yyyy format |
Last_name | Text | The last name of the customer | Maximum characters: 20; cannot include integers or special characters
Loan_End_date | Date type | Date when the last installment for the loan shall be paid by the customer | Entered in dd-mm-yyyy format
Loan_id | Text | The id allocated to the loan taken by the customer | Should be unique; one customer id can have multiple loan ids
Loan_product_id | Text | The id allocated to various types of loans such as HL, PL and VL | Should accept only allocated ids
Loan_Start_date | Date type | The date when the disbursement of the loan takes place |
Month | Text | Contains all the months of the year |
No_of_installments_defaulted | Number | The number of months for which the customer has defaulted | Can accept up to one decimal
Pay_mode_description | Text | The description of various modes of payment such as Cash, ECS and Cheque | Should not accept any other value
Payment_made | Date type | The date when the payment is actually made by the customer | Entered in dd-mm-yyyy format
Prepaid_full_penalty_charges | Number | The penalty (in %) charged on the outstanding principal amount when the customer wants to foreclose the loan |
Repayment_no | Auto Number | The installment number being paid by the customer; should increase by one over the previous repayment number | Auto generated
ROI | Number | The interest rate being charged to the customer | For HL, it should be equal to the bank rate plus the spread; can include up to 2 decimals
ROI_type | Text | Includes two types: fixed rate and floating rate | Should not accept any other value
Salesperson_id | Text | The id allocated to the salesperson | Should be unique for every salesperson
Salesperson_name | Text | Name of the salesperson against whom this loan id is tagged | Maximum characters: 40; cannot include integers or special characters
Spread | Number | The percentage mark-up for the customer over the bank rate; depends on the credit rating | Should increase with a fall in the credit rating; can include up to 2 decimals
Status_flag | Text | Contains two flags: Yes and No | Should not accept any other value
Total_Loan_Amount | Currency | The total loan amount sanctioned by the bank | Can take values between Rs 5000 and Rs 10 million
The exercised data mart consists of six dimension tables, namely loan product, customer, time, payment mode, marketing team and branch, and one fact table, loan fact; all these tables are arranged in a star schema. The transactional data present in the legacy system is loaded into the data mart on a daily basis. Ensuring the data quality of the data mart is a challenging and tedious task, and the proposed KPI (Key Performance Indicator) method helps in ensuring the data quality. The exercised requirements are highlighted in the mapping document in Figure-3 and the multi-dimensional model in Figure-4.
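To make the star arrangement concrete, the following Python/SQLite sketch creates a simplified version of the loan fact table surrounded by the six dimension tables, and shows how requirement 1 (the list of defaulters) could then be answered with a join. The column choices are assumptions drawn loosely from the mapping document, not the bank's physical schema.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables of the star (a subset of columns, named loosely after the mapping document).
cur.execute("CREATE TABLE dim_loan_product (loan_product_id TEXT PRIMARY KEY, description TEXT)")
cur.execute("CREATE TABLE dim_customer (customer_id TEXT PRIMARY KEY, first_name TEXT, last_name TEXT, city TEXT, credit_rating TEXT)")
cur.execute("CREATE TABLE dim_branch (branch_id TEXT PRIMARY KEY, branch TEXT)")
cur.execute("CREATE TABLE dim_payment_mode (pay_mode_id TEXT PRIMARY KEY, pay_mode_description TEXT)")
cur.execute("CREATE TABLE dim_time (fulldate TEXT PRIMARY KEY, month TEXT)")
cur.execute("CREATE TABLE dim_marketing_team (salesperson_id TEXT PRIMARY KEY, salesperson_name TEXT)")

# Fact table at the centre of the star, carrying foreign keys to every dimension.
cur.execute("""
    CREATE TABLE loan_fact (
        loan_id TEXT, customer_id TEXT, loan_product_id TEXT, branch_id TEXT,
        pay_mode_id TEXT, fulldate TEXT, salesperson_id TEXT,
        equated_monthly_installment REAL, balance_loan_amount REAL,
        delayed_payment_days INTEGER)""")

# Requirement 1 (list of defaulters) answered with a join across the star:
defaulters = cur.execute("""
    SELECT DISTINCT c.customer_id, c.first_name, c.last_name
    FROM loan_fact f JOIN dim_customer c ON f.customer_id = c.customer_id
    WHERE f.delayed_payment_days > 0""").fetchall()
print(defaulters)   # empty until the daily ETL loads rows from the legacy system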
Accuracy: Is a measure of the degree to which data agrees with data contained in an original source. In the same ratio form used for the other dimensions, for data items $i = 1, 2, 3, \ldots, n$ it can be expressed as
$$\mathrm{Accuracy} = \frac{\text{number of data items that agree with the original source}}{\text{total number of data items } n}$$
Derivation integrity: Is the correctness with which two or more pieces of data are combined to create new data. One should be very cautious when computing derivation integrity: it cannot literally be calculated by one single formula, but should be considered under different circumstances and conditions based on the database instances.
Validity: Is a measure of the degree of conformance of data values to their domain and business rules. Automated data assessments can test validity and reasonability, but they cannot assess the accuracy of the values.
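As an illustration of such an automated validity check, the following Python sketch tests a loan record against a few of the business rules from the mapping document (Figure-3). The record layout, the field names and the selection of rules are assumptions made for demonstration, not the paper's implementation.

# Sketch of an automated validity check built from a few of the Figure-3 business rules.
# The record layout and the rules chosen here are illustrative assumptions.

VALID_CREDIT_RATINGS = {"A", "B", "C", "D"}
VALID_ROI_TYPES = {"fixed rate", "floating rate"}

def is_valid(record):
    """True when a loan record conforms to the sampled domain and business rules."""
    duration = record.get("Duration_months")
    amount = record.get("Total_Loan_Amount", 0)
    return (
        record.get("Credit_rating") in VALID_CREDIT_RATINGS
        and record.get("ROI_type") in VALID_ROI_TYPES
        and isinstance(duration, int) and 12 <= duration <= 240
        and 5000 <= amount <= 10_000_000
    )

def validity(records):
    """Share of records conforming to their domain and business rules."""
    return 1.0 if not records else sum(is_valid(r) for r in records) / len(records)

# Example: one fully conforming record gives a validity of 1.0.
print(validity([{"Credit_rating": "A", "ROI_type": "fixed rate",
                 "Duration_months": 120, "Total_Loan_Amount": 500000}]))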
Completeness: Is the characteristic of having all required values for the data fields, i.e. completeness can be measured by taking the ratio of the number of incomplete items to the total number of items and subtracting it from 1:
$$\mathrm{Completeness} = 1 - \frac{\text{number of incomplete items}}{\text{total number of items}}$$
Timeliness: Is the relative availability of data to support a given process within the timetable required to perform the process. Timeliness is measured as the maximum of two terms: 0, and one minus the ratio of currency to volatility. Here, currency is the age plus the delivery time minus the input time; volatility refers to the length of time the data remains valid; and delivery time refers to when the data is delivered to the user.
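Written as an equation consistent with that description:
$$\mathrm{Timeliness} = \max\left(0,\; 1 - \frac{\mathrm{Currency}}{\mathrm{Volatility}}\right), \qquad \mathrm{Currency} = \mathrm{Age} + \mathrm{DeliveryTime} - \mathrm{InputTime}$$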
Consistency: It can be measured as the ratio of the violations of a specific consistency type to the total number of consistency checks, subtracted from one:
$$\mathrm{Consistency} = 1 - \frac{\text{number of violations of a specific consistency type}}{\text{total number of consistency checks}}$$
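The following Python fragment is a small numeric illustration of the ratio-based KPIs defined above; the sample records, violation counts and timing figures are invented for demonstration and are not taken from the case study data.

# Numeric illustration of the ratio-based KPIs (all values are invented for demonstration).

records = [
    {"Customer_id": "C001", "City": "Pune",  "Total_Loan_Amount": 250000},
    {"Customer_id": "C002", "City": None,    "Total_Loan_Amount": 400000},   # incomplete
    {"Customer_id": "C003", "City": "Delhi", "Total_Loan_Amount": 800000},
]
required = ["Customer_id", "City", "Total_Loan_Amount"]

incomplete = sum(1 for r in records if any(r[f] in (None, "") for f in required))
completeness = 1 - incomplete / len(records)            # 1 - 1/3 = 0.667

violations, checks = 2, 50                              # e.g. 2 failed consistency checks out of 50
consistency = 1 - violations / checks                   # 0.96

currency, volatility = 2.0, 30.0                        # currency and validity period in days
timeliness = max(0.0, 1 - currency / volatility)        # 0.933

print(round(completeness, 3), consistency, round(timeliness, 3))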
The data quality of the decision / target database is validated using the decision tree method. Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class, the task is to find a model for the class attribute as a function of the values of the other attributes. The goal is that previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model: usually the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it. A decision tree is a flow-chart-like tree structure in which each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label or class distribution. At the start, all the training examples are at the root, and the examples are then partitioned recursively based on selected attributes. Below are the training and test data sets for the different databases of the loan repayment data mart. Figure-8 represents the training set for the different KPIs, Figure-9 shows the test set for the different databases with respect to the KPIs, and Figure-10 represents the decision tree construction for data quality classification with respect to the KPIs.
Training Data Set

Dimension | Value
Completeness | 80 to 100
Business Conformance | 80 to 100
Non Duplicates | 100
Accuracy Field | 90 to 100
Accuracy Entity | 80 to 100
Derivation Integrity | 80 to 100
Consistency | 80 to 100
Class | Yes

Test Data Set

Database | Completeness | Business Conformance | Non Duplicates | Accuracy Field | Accuracy Entity | Derivation Integrity | Consistency | Class
Oracle | 98 | 98 | 100 | 100 | 66.66 | 100 | 100 | Yes
SQL Server | 77 | 78 | 84 | 100 | 85 | 76 | 80 | No
Flat Files | 97 | 96 | 77 | 100 | 100 | 98 | 100 | Yes
DB2 | 100 | 100 | 100 | 98 | 77 | 96 | 98 | Yes

Figure-9: Test set for the different KPIs and data sets
Figure-10: Decision tree construction for the different test sets based on the training set
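The paper does not list its decision tree implementation; the sketch below, assuming the scikit-learn library is available, shows how a classifier of this kind could be trained on labeled KPI profiles and then used to predict whether the KPI profile of a newly migrated target database is acceptable for decision making. The labeled rows mirror the layout of Figure-9 and are illustrative only.

# Sketch using scikit-learn (an assumed dependency); the labeled KPI profiles below
# follow the layout of Figure-9 and are illustrative, not the paper's training data.
from sklearn.tree import DecisionTreeClassifier

# Column order of each profile vector.
FEATURES = ["completeness", "business_conformance", "non_duplicates",
            "accuracy_field", "accuracy_entity", "derivation_integrity", "consistency"]

X_train = [
    [98, 98, 100, 100, 66.66, 100, 100],    # acceptable
    [77, 78, 84, 100, 85, 76, 80],          # not acceptable
    [97, 96, 77, 100, 100, 98, 100],        # acceptable
    [100, 100, 100, 98, 77, 96, 98],        # acceptable
]
y_train = ["Yes", "No", "Yes", "Yes"]

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Classify the KPI profile of a newly migrated target database.
new_profile = [[95, 92, 99, 97, 90, 94, 96]]
print(tree.predict(new_profile))   # e.g. ['Yes'] -> data quality acceptable for decision making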
VI. CONCLUSIONS
In current information technology, process improvement is a vital part of any business; in this regard, data migration from legacy systems to newer systems or data warehouses is essential for business decisions, and hence there is a need to develop a data quality assessment model to assess the underlying data in the decision databases. The proposed data quality assessment model evaluates the data along different dimensions to give end users the confidence to rely on it for their businesses. The author has extended the model to classify various data sets as suitable or not suitable for decision making. The results reveal that the proposed model achieves an average improvement of 12.8 percent across the evaluation-criteria dimensions for the selected case study.
REFERENCES
[1] Alex Berson and Larry Dobov [2007]. "Master Data Management and Customer Data Integration for a Global Enterprise”, Tata McGraw-
Hill Publishing Company Limited.
[2] Allen Dreibelbis, Eberhard Hechler, Ivan Milman, Martin Oberhofer, Paul van Run, Dan Wolfson [2008]. "Enterprise Master Data
Management: An SOA Approach to Managing Core Information”, Dorling Kindersley (India) Pvt. Ltd.
[3] Jack E. Olson [2003]. "Data Quality: The Accuracy Dimension”, Elsevier.
[4] Ralph Kimball and Joe Caserta [2004]. "The Data Warehouse ETL Toolkit”, Wiley Publishing, Inc.
[5] Data Quality: Concepts, Methodologies and Techniques. Data-Centric Systems and Applications - Batini, Scannapieco - 2006
[6] A Data Quality Management Maturity Model, Kyung-Seok Ryu, Joo-Seok Park, and Jae-Hong Park, ETRI Journal, Volume 28, Number
2, April 2006
[7] Manjunath T.N., Ravindra S. Hegadi, Ravikumar G.K., "Analysis of Data Quality Aspects in Data Warehouse Systems", (IJCSIT) International Journal of Computer Science and Information Technologies, 2(1), 2011, 477-485.
[8] Manjunath T.N., Ravindra S. Hegadi, Ravikumar G.K., "Design and Analysis of DWH and BI in Education Domain", IJCSI International Journal of Computer Science Issues, 8(2), March 2011, ISSN (Online): 1694-0814, 545-551.
[9] Manjunath T.N., Ravindra S. Hegadi and Mohan H.S., "Automated Data Validation for Data Migration Security", International Journal of Computer Applications, 30(6), 41-46, September 2011. Published by Foundation of Computer Science, New York.
[10] G. John and P. Langley, “Static versus Dynamic Sampling for Data Mining”, Proceedings of the 5th International Conference on
Knowledge Discovery and Data Mining (KDD-96), pp. 367-370, AAAI Press, Menlo Park, CA, 1996.
[11] Microsoft® CRM Data Migration Framework White Paper by Parul Manek, Program Manager Published: April 2003.
[12] “Data Migration Best Practices NetApp Global Services”, January 2006.
[13] Liew, C. K., Choi, U. J., and Liew, C. J. 1985. “A Data Distortion by Probability Distribution,”ACM Transactions on Database Systems
(10:3), pp.395-411.
[14] Xiao-Bai Li, Luvai Motiwalla BY “Protecting Patient Privacy with Data Masking” WISP 2009.
[15] Domingo-Ferrer J., and Mateo-Sanz, J. M. 2002. “Practical Data-Oriented Microaggregation for Statistical Disclosure Control,” IEEE
Transactions on Knowledge and Data Engineering (14:1), pp. 189-201.
[16] A. Bonifati, F. Cattaneo, S. Ceri, A. Fuggetta, and S. Paraboschi. Designing data marts for data warehouses. ACM Transactions on Software Engineering and Methodology, 10(4):452-483, 2001.