
ETL Testing Goals

Extract, transform and load (ETL) is the core process of data integration and is typically associated with data
warehousing. ETL tools extract data from one or more chosen sources, transform it into new formats according to
business rules, and then load it into the target data structures. The increasing diversity of data sources and the high
volumes of data that ETL must accommodate make management, performance and cost the primary challenges for
users. The traditional ETL approach requires users to map each physical data item to a unique metadata description;
newer ETL tools allow the user to create an abstraction layer of common business definitions and map all similar
data items to the same definition before applying target-specific business rules, isolating business rules from data
and allowing easier ETL management.

Testing Goals

There is an exponentially increasing cost associated with finding software defects later in the development lifecycle.
In data warehousing, this is compounded because of the additional business costs of using incorrect data to make
critical business decisions. Given the importance of early detection of software defects, let's first review some general
goals of testing an ETL application:

 Data completeness. Ensures that all expected data is loaded.
 Data transformation. Ensures that all data is transformed correctly according to business rules and/or
design specifications.
 Data quality. Ensures that the ETL application correctly rejects, substitutes default values for, corrects,
ignores, and reports invalid data.
 Performance and scalability. Ensures that data loads and queries perform within expected time frames
and that the technical architecture is scalable.
 Integration testing. Ensures that the ETL process functions well with other upstream and downstream
processes.
 User-acceptance testing. Ensures the solution meets users' current expectations and anticipates their
future expectations.
 Regression testing. Ensures existing functionality remains intact each time a new release of code is
completed.

Data Completeness

One of the most basic tests of data completeness is to verify that all expected data loads into the data warehouse.
This includes validating that all records, all fields and the full contents of each field are loaded. Strategies to consider
include:
 Comparing record counts between source data, data loaded to the warehouse and rejected records (see
the sketch after this list).
 Comparing unique values of key fields between source data and data loaded to the warehouse. This is a
valuable technique that points out a variety of possible data errors without doing a full validation on all fields.
 Utilizing a data profiling tool that shows the range and value distributions of fields in a data set. This can
be used during testing and in production to compare source and target data sets and point out any data
anomalies from source systems that may be missed even when the data movement is correct.
 Populating the full contents of each field to validate that no truncation occurs at any step in the process.
For example, if the source data field is a string(30) make sure to test it with 30 characters.
 Testing the boundaries of each field to find any database limitations. For example, for a decimal(3) field
include values of -99 and 999, and for date fields include the entire range of dates expected. Depending on the
type of database and how it is indexed, it is possible that the range of values the database accepts is too small.
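As a minimal sketch of the first two strategies, the following Python fragment compares record counts and distinct
key counts between source and target. The connection objects, table names and column names (src.orders,
dw.fact_orders, order_id) are hypothetical placeholders; any DB-API-compatible connections would work.

def fetch_one(conn, sql):
    """Run a single-value query and return the first column of the first row."""
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchone()[0]

def check_completeness(src_conn, tgt_conn):
    # Source rows should equal loaded rows plus rejected rows.
    # Table names here are illustrative assumptions, not a real schema.
    src_count = fetch_one(src_conn, "SELECT COUNT(*) FROM src.orders")
    tgt_count = fetch_one(tgt_conn, "SELECT COUNT(*) FROM dw.fact_orders")
    rejects = fetch_one(tgt_conn, "SELECT COUNT(*) FROM dw.rejected_orders")
    assert src_count == tgt_count + rejects, (
        f"count mismatch: source={src_count}, loaded={tgt_count}, rejected={rejects}")

    # Comparing distinct key counts catches duplicated or dropped keys
    # without validating every field.
    src_keys = fetch_one(src_conn, "SELECT COUNT(DISTINCT order_id) FROM src.orders")
    tgt_keys = fetch_one(tgt_conn, "SELECT COUNT(DISTINCT order_id) FROM dw.fact_orders")
    assert src_keys == tgt_keys, f"key mismatch: source={src_keys}, target={tgt_keys}"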

Data Transformation

Validating that data is transformed correctly based on business rules can be the most complex part of testing an ETL
application with significant transformation logic. One typical method is to pick some sample records and "stare and
compare" to validate data transformations manually. This can be useful but requires manual testing steps and testers
who understand the ETL logic. A combination of automated data profiling and automated data movement validations
is a better long-term strategy. Here are some simple automated data movement techniques:
 Create a spreadsheet of scenarios of input data and expected results and validate these with the
business customer. This is a good requirements elicitation exercise during design and can also be used during
testing (a sketch of this scenario-driven check follows this list).
 Create test data that includes all scenarios. Elicit the help of an ETL developer to automate the process
of populating data sets with the scenario spreadsheet to allow for flexibility because scenarios will change.
 Utilize data profiling results to compare range and distribution of values in each field between source and
target data.
 Validate correct processing of ETL-generated fields such as surrogate keys.
 Validate that data types in the warehouse are as specified in the design and/or the data model.
 Set up data scenarios that test referential integrity between tables. For example, what happens when the
data contains foreign key values not in the parent table?
 Validate parent-to-child relationships in the data. Set up data scenarios that test how orphaned child
records are handled.
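The scenario-spreadsheet approach above can be automated with a small test harness. The sketch below uses a
hypothetical CSV export of the spreadsheet and a stand-in transform function; the real ETL rule and file layout would
come from the design documents.

import csv

def transform(raw_state):
    """Stand-in for the ETL rule under test; here, normalize a state code."""
    return raw_state.strip().upper()

def run_scenarios(path="transform_scenarios.csv"):
    # Each spreadsheet row pairs an input value with its expected result.
    # The file name and column headers are illustrative assumptions.
    failures = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            actual = transform(row["input"])
            if actual != row["expected"]:
                failures.append((row["input"], row["expected"], actual))
    for inp, exp, act in failures:
        print(f"FAIL: input={inp!r} expected={exp!r} got={act!r}")
    assert not failures, f"{len(failures)} scenario(s) failed"

Because the scenarios live in data rather than in test code, new cases can be added by the business customer
without changing the harness.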

Data Quality

For the purposes of this discussion, data quality is defined as "how the ETL system handles data rejection,
substitution, correction and notification without modifying data." To ensure success in testing data quality, include as
many data scenarios as possible. Typically, data quality rules are defined during design, for example:
 Reject the record if a certain decimal field has nonnumeric data.
 Substitute null if a certain decimal field has nonnumeric data.
 Validate and correct the state field if necessary based on the ZIP code.
 Compare product code to values in a lookup table, and if there is no match load anyway but report to
users.
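A sketch of how the four example rules above might be applied to a single record follows. The field names, lookup
tables and reporting mechanism are all illustrative assumptions, not a prescribed implementation.

# Hypothetical lookup tables; in practice these come from reference data.
ZIP_TO_STATE = {"10001": "NY", "60601": "IL"}
VALID_PRODUCTS = {"A100", "B200"}

def apply_quality_rules(rec, report):
    """Return the cleaned record, or None if it was rejected; log actions to report."""
    # Rule 1: reject the record if the decimal field is nonnumeric.
    try:
        rec["amount"] = float(rec["amount"])
    except (TypeError, ValueError):
        report.append(("REJECTED", rec))
        return None
    # Rule 2: substitute null if another decimal field is nonnumeric.
    try:
        rec["discount"] = float(rec["discount"])
    except (TypeError, ValueError):
        rec["discount"] = None
    # Rule 3: correct the state from the ZIP code when they disagree.
    expected = ZIP_TO_STATE.get(rec["zip"])
    if expected and rec["state"] != expected:
        report.append(("CORRECTED_STATE", rec["state"], expected))
        rec["state"] = expected
    # Rule 4: unknown product codes load anyway, but are reported to users.
    if rec["product_code"] not in VALID_PRODUCTS:
        report.append(("UNKNOWN_PRODUCT", rec["product_code"]))
    return rec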

Depending on the data quality rules of the application being tested, scenarios to test might include null key values,
duplicate records in source data and invalid data types in fields (e.g., alphabetic characters in a decimal field).
Review the detailed test scenarios with business users and technical designers to ensure that all are on the same
page. Data quality rules applied to the data will usually be invisible to the users once the application is in production;
users will only see what's loaded to the database. For this reason, it is important to ensure that what is done with
invalid data is reported to the users. These data quality reports present valuable data that sometimes reveals
systematic issues with source data. In some cases, it may be beneficial to populate the "before" data in the database
for users to view.

Performance and Scalability

As the volume of data in a data warehouse grows, ETL load times can be expected to increase and performance of
queries can be expected to degrade. This can be mitigated by having a solid technical architecture and good ETL
design. The aim of performance testing is to point out any potential weaknesses in the ETL design, such as
reading a file multiple times or creating unnecessary intermediate files. The following strategies will help discover
performance issues:
 Load the database with peak expected production volumes to ensure that this volume of data can be
loaded by the ETL process within the agreed-upon window.
 Compare these ETL loading times to loads performed with a smaller amount of data to anticipate
scalability issues. Compare the ETL processing times component by component to point out any areas of
weakness (see the timing sketch after this list).
 Monitor the timing of the reject process and consider how large volumes of rejected data will be handled.
 Perform simple and multiple join queries to validate query performance on large database volumes. Work
with business users to develop sample queries and acceptable performance criteria for each query.
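One way to implement the volume comparison above is a simple timing harness. In the sketch below,
run_etl_load is a hypothetical hook into the load process under test; the volumes and the half-of-baseline
throughput threshold are arbitrary illustrations.

import time

def run_etl_load(row_count):
    """Hypothetical hook: run the ETL job against a generated data set of this size."""
    # In a real harness this would trigger the load and block until it completes.
    ...

def profile_load(volumes=(100_000, 1_000_000, 10_000_000)):
    baseline = None
    for rows in volumes:
        start = time.perf_counter()
        run_etl_load(rows)
        elapsed = time.perf_counter() - start
        rate = rows / elapsed if elapsed else float("inf")
        print(f"{rows:>12,} rows in {elapsed:8.1f}s ({rate:,.0f} rows/s)")
        # A falling rows/s rate at higher volumes flags a scalability problem.
        if baseline and rate < 0.5 * baseline:
            print("  WARNING: throughput below half of baseline")
        baseline = baseline or rate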

Integration Testing

Typically, system testing only includes testing within the ETL application. The endpoints for system testing are the
input and output of the ETL code being tested. Integration testing shows how the application fits into the overall flow
of all upstream and downstream applications. When creating integration test scenarios, consider how the overall
process can break and focus on touch points between applications rather than within one application. Consider how
process failures at each step would be handled and how data would be recovered or deleted if necessary.

Most issues found during integration testing are either data-related or result from false assumptions about the
design of another application. Therefore, it is important to integration test with production-like data. Real production
data is ideal, but depending on the contents of the data, there could be privacy or security concerns that require
certain fields to be randomized before using it in a test environment. As always, don't forget the importance of good
communication between the testing and design teams of all systems involved. To help bridge this communication
gap, gather team members from all systems together to formulate test scenarios and discuss what could go wrong
in production. Run the overall process from end to end in the same order and with the same dependencies as in
production. Integration testing should be a combined effort and not the responsibility solely of the team testing the
ETL application.
