Assignment On Chapter 8 Data Warehousing and Management
I. OBJECTIVES
At the end of this chapter, the students should be able to:
Understand how we can use Business Process Modeling Notation (BPMN) for
conceptual modeling of ETL processes.
Understand extraction, transformation, and loading in data warehouses.
Describe ETL and other data integration methods.
Design a conceptual model for the ETL process.
III. PROCEDURE
A. Preliminaries
Pre-Assessment
1. Brief introduction and overview of Business Process Modeling Notation (BPMN) and the ETL process.
2. Discuss in-depth Extraction, Transformation, and Loading.
3. Present and describe the common tools used for the ETL process.
4. Examine and describe ETL and other data integration methods.
5. Present and give an overview of the benefits and challenges of ETL.
B. Lesson Proper
Extraction, transformation, and loading (ETL) processes are used to extract data from
internal and external sources of an organization, transform these data, and load them into a
data warehouse. Since ETL processes are complex and costly, it is important to reduce their
development and maintenance costs. Modeling ETL processes at a conceptual level is a way to
achieve this goal. However, existing ETL tools, like Microsoft Integration Services or Pentaho
Data Integration (also known as Kettle), have their own specific language to define ETL
processes. Further, there is no agreed-upon conceptual model to specify such processes. In
this chapter, we study the design of ETL processes using a conceptual approach. The model we
use is based on the Business Process Modeling Notation (BPMN), a de facto standard for
specifying business processes. The model provides a set of primitives that cover the
requirements of frequently used ETL processes. Since BPMN is already used for specifying
business processes, users already familiar with BPMN do not need to learn another language
for defining ETL processes. Further, BPMN provides a conceptual and implementation-
independent specification of such processes, which hides technical details and allows users
and designers to focus on essential characteristics of such processes. Finally, ETL processes
expressed in BPMN can be translated into executable specifications for ETL tools.
ETL, which stands for extract, transform and load, is a data integration process that combines
data from multiple data sources into a single, consistent data store that is loaded into a data
warehouse or other target system.
As databases grew in popularity in the 1970s, ETL was introduced as a process for integrating and loading data for computation and analysis, eventually becoming the primary method to process data for data warehousing projects.
ETL provides the foundation for data analytics and machine learning workstreams. Through a series of business rules, ETL cleanses and organizes data in a way that addresses specific business intelligence needs, like monthly reporting, but it can also tackle more advanced analytics, which can improve back-end processes or end-user experiences. ETL is often used by an organization to:
during this extraction process. The size of the extracted data varies from hundreds of kilobytes up
to gigabytes, depending on the source system and the business situation. The same is true for the
time delta between two (logically) identical extractions: the time span may vary between
days/hours and minutes to near real-time. Web server log files, for example, can easily grow to
hundreds of megabytes in a very short period.
Transportation of Data
After data is extracted, it has to be physically transported to the target system or to an
intermediate system for further processing. Depending on the chosen way of transportation,
some transformations can be done during this process, too. For example, a SQL statement which
directly accesses a remote target through a gateway can concatenate two columns as part of the
SELECT statement.
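To make this concrete, the following sketch (in Python, using an in-memory SQLite database as a stand-in for a remote source reached through a gateway; the table and column names are illustrative assumptions) concatenates two columns directly in the SELECT statement that pulls the data, so the transformation happens during transport:

```python
import sqlite3

# Sketch: concatenate two columns as part of the SELECT that extracts the data,
# so the transformation happens during transport. An in-memory SQLite database
# stands in for the remote source; table and column names are illustrative.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE customer (first_name TEXT, last_name TEXT)")
src.execute("INSERT INTO customer VALUES ('Ada', 'Lovelace'), ('Alan', 'Turing')")

rows = src.execute(
    "SELECT first_name || ' ' || last_name AS full_name FROM customer"
).fetchall()
print(rows)  # [('Ada Lovelace',), ('Alan Turing',)]
```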
Why do you need ETL?
There are many reasons for adopting ETL in the organization:
It helps companies to analyze their business data to make critical business decisions.
Transactional databases cannot answer complex business questions that ETL can help answer.
ETL provides a method of moving the data from various sources into a data warehouse.
A well-designed and documented ETL system is almost essential to the success of a Data Warehouse project.
The ETL process allows sample data comparison between the source and the target system.
The ETL process can perform complex transformations and requires an extra area (the staging area) to store the data.
ETL helps to migrate data into a Data Warehouse and convert it to various formats and types to adhere to one consistent system.
ETL is a predefined process for accessing and manipulating source data into the target
database.
ETL in a data warehouse offers deep historical context for the business.
It helps to improve productivity because it codifies and reuses processes without a need for technical skills.
In this step of the ETL architecture, data is extracted from the source system into the staging area. Transformations, if any, are done in the staging area so that the performance of the source system is not degraded. Also, if corrupted data is copied directly from the source into the Data Warehouse database, rollback will be a challenge. The staging area gives an opportunity to validate extracted data before it moves into the Data Warehouse.
A data warehouse needs to integrate systems that have different DBMSs, hardware, operating systems, and communication protocols. Sources could include legacy applications like mainframes, customized applications, point-of-contact devices like ATMs, call switches, text files, spreadsheets, ERP systems, and data from vendors and partners, among others.
Hence one needs a logical data map before data is extracted and loaded physically. This data map
describes the relationship between sources and target data.
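As a simple illustration, a logical data map can be represented as a small structure that records, for each target column, its source and the transformation that applies. The sketch below is only illustrative; the system, table, and column names are assumptions:

```python
# Sketch of a logical data map: for each target column, record the source
# column(s) and the transformation that applies. All names are illustrative.
logical_data_map = [
    {"target": "dim_customer.full_name",
     "source": "crm.customers.first_name + crm.customers.last_name",
     "transformation": "concatenate with a space"},
    {"target": "fact_sales.amount_usd",
     "source": "erp.orders.amount_local",
     "transformation": "convert to USD using the daily exchange rate"},
]
for entry in logical_data_map:
    print(f"{entry['source']:55} -> {entry['target']} ({entry['transformation']})")
```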
Logical Extraction Methods
Full Extraction
The data is extracted completely from the source system. Because this extraction reflects all the
data currently available on the source system, there's no need to keep track of changes to the
data source since the last successful extraction. The source data will be provided as-is and no
additional logical information (for example, timestamps) is necessary on the source site. An example of a full extraction may be an export file of a distinct table or a remote SQL statement scanning the complete source table.
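A minimal sketch of a full extraction is shown below: the complete source table is scanned and written to an export file, with no change tracking needed. SQLite and a CSV file stand in for the real source and export format; all names are illustrative assumptions:

```python
import csv
import sqlite3

# Sketch of a full extraction: scan the complete source table and write it to
# an export file; no change tracking is needed. Names are illustrative.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
src.execute("INSERT INTO orders VALUES (1, 19.90), (2, 5.50)")

with open("orders_full_extract.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "amount"])
    writer.writerows(src.execute("SELECT order_id, amount FROM orders"))
```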
Incremental Extraction
At a specific point in time, only the data that has changed since a well-defined event back in
history is extracted. This event may be the last time of extraction or a more complex business
event like the last booking day of a fiscal period. To identify this delta change, there must be a way to identify all the information that has changed since this specific time event. This information can be provided either by the source data itself, such as an application column reflecting the last-changed timestamp, or by a change table where an appropriate additional mechanism keeps track of the changes besides the originating transactions. In most cases, using the latter method means adding extraction logic to the source system.
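The sketch below illustrates the first approach, assuming the source application maintains a last_changed timestamp column; only rows changed after the previous successful extraction are pulled. The table, columns, and dates are illustrative assumptions:

```python
import sqlite3

# Sketch of an incremental extraction: only rows changed since the last
# successful extraction are pulled, using a last_changed timestamp column
# maintained by the source application. Names and dates are illustrative.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, last_changed TEXT)")
src.execute("""INSERT INTO orders VALUES
    (1, 19.90, '2024-01-10 08:00:00'),
    (2,  5.50, '2024-01-15 14:30:00')""")

last_extraction = '2024-01-12 00:00:00'  # time of the previous successful run
delta = src.execute(
    "SELECT order_id, amount FROM orders WHERE last_changed > ?",
    (last_extraction,),
).fetchall()
print(delta)  # [(2, 5.5)] -- only the row changed after the last extraction
```

With the change-table approach mentioned above, the WHERE clause on a timestamp would instead be replaced by a read of the change table maintained on the source side.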
Physical Extraction Methods
Depending on the chosen logical extraction method and the capabilities and restrictions on the
source side, the extracted data can be physically extracted by two mechanisms. The data can
either be extracted online from the source system or from an offline structure. Such an offline
structure might already exist or it might be generated by an extraction routine.
The following are the methods of physical extraction:
Online Extraction
The data is extracted directly from the source system itself. The extraction process can connect
directly to the source system to access the source tables themselves or to an intermediate system
that stores the data in a preconfigured manner (for example, snapshot logs or change tables).
Note that the intermediate system is not necessarily physically different from the source system.
Offline Extraction
The data is not extracted directly from the source system but is staged explicitly outside the
original source system. The data already has an existing structure (for example, redo logs, archive
logs or transportable tablespaces) or was created by an extraction routine.
You should consider the following structures:
Flat files
Data in a defined, generic format. Additional information about the source object is necessary for
further processing.
Dump files
Oracle-specific format. Information about the containing objects may or may not be included,
depending on the chosen utility.
Transportable tablespaces
Some validations are also performed during extraction.
Transformation of Data
Data extracted from the source server is raw and not usable in its original form. Therefore, it needs to be cleansed, mapped, and transformed. In fact, this is the key step where the ETL process adds value and changes data so that insightful BI reports can be generated.
Transformation is one of the important ETL concepts, in which you apply a set of functions to the extracted data. Data that does not require any transformation is called direct move or pass-through data.
In the transformation step, you can perform customized operations on data. For instance, the user may want sum-of-sales revenue, which is not in the database, or the first name and the last name in a table may be stored in different columns; it is possible to concatenate them before loading.
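A minimal sketch of these two customized transformations follows; the field names are illustrative assumptions. It concatenates the name columns and derives a sum-of-sales figure that does not exist in the source:

```python
# Sketch of two customized transformations: concatenating the first and last
# name columns, and deriving sum-of-sales revenue per customer, a figure that
# does not exist in the source database. Field names are illustrative.
extracted_rows = [
    {"first_name": "Maria", "last_name": "Santos", "sale_amount": 120.0},
    {"first_name": "Maria", "last_name": "Santos", "sale_amount": 80.0},
    {"first_name": "Jose",  "last_name": "Reyes",  "sale_amount": 50.0},
]

for row in extracted_rows:
    row["full_name"] = f"{row['first_name']} {row['last_name']}"  # concatenate before loading

sales_by_customer = {}
for row in extracted_rows:
    name = row["full_name"]
    sales_by_customer[name] = sales_by_customer.get(name, 0.0) + row["sale_amount"]

print(sales_by_customer)  # {'Maria Santos': 200.0, 'Jose Reyes': 50.0}
```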
Transformation Flow
Multistage Data Transformation
The data transformation logic for most data warehouses consists of multiple steps. For example,
in transforming new records to be inserted into a sales table, there may be separate logical
transformation steps to validate each dimension key.
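As a simple illustration of one such step (the key sets and record layout are assumptions made for illustration), the sketch below validates the dimension keys of new sales records against lookup sets and routes records with unknown keys to a reject list:

```python
# Sketch of one logical transformation step: validate the dimension keys of
# new sales records against lookup sets before they are inserted into the
# sales table; records with unknown keys go to a reject list.
valid_product_keys = {10, 11, 12}
valid_customer_keys = {100, 101}

new_sales = [
    {"product_key": 10, "customer_key": 100, "amount": 25.0},
    {"product_key": 99, "customer_key": 101, "amount": 40.0},  # unknown product key
]

accepted, rejected = [], []
for record in new_sales:
    if (record["product_key"] in valid_product_keys
            and record["customer_key"] in valid_customer_keys):
        accepted.append(record)
    else:
        rejected.append(record)

print(len(accepted), "accepted,", len(rejected), "rejected")  # 1 accepted, 1 rejected
```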
Data coming from different sources often has integrity problems, for example:
1. Different spellings of the same person, like Jon, John, etc.
2. There are multiple ways to denote a company name, like Google and Google Inc.
3. Use of different names, like Cleaveland and Cleveland.
4. There may be a case that different account numbers are generated by various applications for the same customer.
5. In some data, required fields remain blank.
6. Invalid products collected at POS, as manual entry can lead to mistakes.
Validations and transformations done during this stage include:
Data threshold validation check (for example, age cannot be more than two digits).
Data flow validation from the staging area to the intermediate tables.
Cleaning (for example, mapping NULL to 0, or gender Male to "M" and Female to "F", etc.).
Splitting a column into multiple columns and merging multiple columns into a single column.
Using any complex data validation (for example, if the first two columns in a row are empty, the row is automatically rejected from processing), as sketched after this list.
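The sketch below combines several of the transformations listed above: rejecting rows whose first two columns are empty, splitting one column into two, normalizing gender values, and mapping NULL to 0. The column names and rules are illustrative assumptions:

```python
# Sketch combining several transformations from the list above: reject rows
# whose first two columns are empty, split one column into two, normalize
# gender values, and map NULL (None) to 0. Names and rules are illustrative.
raw_rows = [
    {"name": "Ana Cruz", "gender": "Female", "sales": None},
    {"name": "Ben Lim",  "gender": "Male",   "sales": 300},
    {"name": "",         "gender": "",       "sales": 50},   # rejected by the rule below
]

cleaned = []
for row in raw_rows:
    if not row["name"] and not row["gender"]:        # complex validation: reject the row
        continue
    first, _, last = row["name"].partition(" ")      # split one column into two
    cleaned.append({
        "first_name": first,
        "last_name": last,
        "gender": {"Male": "M", "Female": "F"}.get(row["gender"], "U"),
        "sales": row["sales"] if row["sales"] is not None else 0,   # NULL -> 0
    })

print(cleaned)
```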
Loading of Data
Loading data into the target data warehouse database is the last step of the ETL process. In a typical data warehouse, a huge volume of data needs to be loaded in a relatively short period (nights). Hence, the load process should be optimized for performance.
In case of load failure, recovery mechanisms should be configured to restart from the point of failure without loss of data integrity. Data warehouse admins need to monitor, resume, and cancel loads as per the prevailing server performance.
Types of Loading:
Initial Load: populating all the Data Warehouse tables. Though there may be times this is useful for research purposes, initial loading produces data sets that grow exponentially and can quickly become difficult to maintain.
Full Refresh: erasing the contents of one or more tables and reloading them with fresh data.
Load verification
Ensure that the key field data is neither missing nor null.
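A minimal sketch of such a load verification check is shown below, assuming an illustrative customer_key field; rows with a missing or null key are flagged:

```python
# Sketch of a load verification check: the key field must be neither missing
# nor null in the loaded rows. The customer_key field name is illustrative.
loaded_rows = [
    {"customer_key": 100,  "name": "Ana Cruz"},
    {"customer_key": None, "name": "Ben Lim"},   # null key
    {"name": "Carl Tan"},                        # key field missing entirely
]

bad_rows = [r for r in loaded_rows if r.get("customer_key") is None]
if bad_rows:
    print(f"Load verification failed: {len(bad_rows)} row(s) with a missing or null key")
```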
ETL solutions improve quality by performing data cleansing prior to loading the data into a different repository. A time-consuming batch operation, ETL is recommended more often for creating smaller target data repositories that require less frequent updating, while other data integration methods, including ELT (extract, load, transform), change data capture (CDC), and data virtualization, are used to integrate increasingly larger volumes of changing data or real-time data streams.
ETL Tools
In the past, organizations wrote their own ETL code. There are now many open source and
commercial ETL tools and cloud services to choose from. Typical capabilities of these products
include the following:
Comprehensive automation and ease of use: Leading ETL tools automate the entire data
flow, from data sources to the target data warehouse. Many tools recommend rules for
extracting, transforming and loading the data.
A visual, drag-and-drop interface: This functionality can be used for specifying rules and
data flows.
Security and compliance: The best ETL tools encrypt data both in motion and at rest and
are certified compliant with industry or government regulations, like HIPAA and GDPR.
Implementing ETL in Data Warehouse
When an ETL process is used to move data into a data warehouse, a separate layer represents
each phase:
Mirror/Raw layer: This layer is a copy of the source files or tables, with no logic or enrichment.
The process copies and adds source data to the target mirror tables, which then hold
historical raw data that is ready to be transformed.
Staging layer: Once the raw data from the mirror tables has been transformed, the transformed data winds up in staging tables. These tables hold the final form of the data for the incremental part of the ETL cycle in progress.
Schema layer: These are the destination tables, which contain all the data in its final form after
cleansing, enrichment, and transformation.
Aggregating layer: In some cases, it's beneficial to aggregate data to a daily or store level from
the full dataset. This can improve report performance, enable the addition of business logic to
calculate measures, and make it easier for report developers to understand the data.
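To make the layered flow concrete, the sketch below pushes a couple of rows through illustrative mirror, staging, schema, and aggregating tables in a single SQLite database; the table names and the text-to-number cast are assumptions made for illustration:

```python
import sqlite3

# Sketch of the layered approach: one table per layer in a single SQLite
# database. The tables, the text-to-number cast, and the daily aggregate are
# all illustrative assumptions.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE mirror_sales    (sale_id INTEGER, amount TEXT);   -- mirror/raw layer
    CREATE TABLE staging_sales   (sale_id INTEGER, amount REAL);   -- staging layer
    CREATE TABLE dw_sales        (sale_id INTEGER, amount REAL);   -- schema layer
    CREATE TABLE agg_daily_sales (total REAL);                     -- aggregating layer

    INSERT INTO mirror_sales VALUES (1, '19.50'), (2, '5.50');
    INSERT INTO staging_sales SELECT sale_id, CAST(amount AS REAL) FROM mirror_sales;
    INSERT INTO dw_sales SELECT sale_id, amount FROM staging_sales;
    INSERT INTO agg_daily_sales SELECT SUM(amount) FROM dw_sales;
""")
print(db.execute("SELECT total FROM agg_daily_sales").fetchone())  # (25.0,)
```

In practice, each layer usually lives in its own schema or database rather than side by side as in this sketch, but the flow from raw copy to transformed, final, and aggregated data is the same.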
Best Practices of ETL
The following are best practices for the ETL process steps:
Never try to cleanse all the data:
Every organization would like to have all its data clean, but most are not ready to pay for it or to wait. Cleaning it all would simply take too long, so it is better not to try to cleanse all the data.
Never cleanse anything:
Always plan to clean something, because the biggest reason for building the Data Warehouse is to offer cleaner and more reliable data.
ACTIVITY 1: SPIDERGRAM
What comes to your mind when you hear the word "ETL"? Write the key words inside the circles. Also, discuss the reasons why it is essential for different organizations to adopt ETL.
ETL
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
ACTIVITY 2: SHORT VIDEO CLIP VIEWING
Now you are going to watch a short video clip about "What is ETL?" As you view the clip, take down notes. Type this link to access the video:
https://youtu.be/yicphAV80rA
After watching the videos, answer the following questions. You may use your notes as reference.