Lecture 5

The document discusses the extract, transform, and load (ETL) process which is the foundation of any data warehouse system. It involves extracting data from various source systems, transforming it to fit the data warehouse structures and requirements, and loading it into the data warehouse. The ETL process consumes significant resources but is not visible to end users. It requires input from various stakeholders to integrate data from different sources and keep the data warehouse updated as the business needs change.


CST3340

Extraction, Transformation and Load Process

Joanna Loveday

CST3340 _ Business Intelligence 1


What is the ETL process?
• The process that takes data from the operational systems
and loads it into the data warehouse.
• Extract
– Extract relevant data from the various operational
data sources.
• Transform
– Transform extracted data to Data Warehouse format.
– Includes cleaning the data.
• Load
– Load data into Data Warehouse.
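The three steps above can be sketched as a minimal in-memory pipeline. The source rows, field names and cleaning rules below are illustrative assumptions, not part of the lecture material:

```python
# Minimal sketch of an ETL pipeline (hypothetical in-memory source and target).

def extract(source_rows):
    """Extract: pull only the relevant fields from an operational source."""
    return [{"id": r["id"], "name": r["name"], "amount": r["amount"]}
            for r in source_rows]

def transform(rows):
    """Transform: clean and reshape extracted rows into warehouse format."""
    return [{"id": r["id"],
             "name": r["name"].strip().title(),      # basic cleaning
             "amount": round(float(r["amount"]), 2)}  # standardise data type
            for r in rows]

def load(rows, warehouse):
    """Load: append transformed rows to the warehouse table."""
    warehouse.extend(rows)
    return warehouse

warehouse = []
source = [{"id": 1, "name": "  alice smith ", "amount": "19.999", "extra": "x"}]
load(transform(extract(source)), warehouse)
```

Note how the unwanted `extra` field is dropped at extraction, and cleaning happens during transformation, before anything reaches the warehouse.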

The ETL Process
• The ETL process is the foundation of any Data Warehouse system.
• A complex and technically challenging process which is often
underrated!
• Consumes about 70-80% of resources needed for
implementation & maintenance of a Data Warehouse.
• Construction of the ETL process is not visible to the end
users.
• Requires input from many stakeholders such as developers,
analysts, managers.
• Needs to be agile, automated and well documented to allow
the Data Warehouse to change as the business changes.

Why is an ETL process required?
• Makes data available for business decision making by
integrating data from various sources to the data
warehouse.
• Allows the Data Warehouse to change as the data sources
change.
• Well documented and designed ETL process is critical to
the success of a data warehouse.
• Keeps track of the transformations used to format the data
for the data warehouse, such as calculations required,
aggregations made and transformations undertaken.

Why is an ETL Process Required? Cont.

• Tracks movement of data between source


systems and the data warehouse.
• A staging area is used to transform data, which avoids
affecting the efficiency of the source system or the
data warehouse.
• Allows a historic view of the business data.
• Improves productivity as can be coded and
reused.

ETL and the Data Warehouse
Diagram: various source data (DBMS, SQL systems, other systems, files) flows into a staging area and from there into the EDW.


Extraction
• Extraction
– The operation of extracting data from operational Data Sources
for use in a data warehouse environment
– Data placed in a staging area, to allow the data to be
transformed before it is loaded into the data warehouse.
• Extraction: advantages of a staging area
– Does not affect the efficiency of the operational systems or the
data warehouse.
– Minimises corrupt data being loaded into the data warehouse.
– Allows data from multiple systems to be integrated:
• Different DBMS; operating systems; communication protocols;
• Sources: Mainframes; Customised apps; text files; spreadsheets;
ATMs; management systems;

Extraction Methods 1
• Full Extraction
– Entire Data Warehouse is periodically
refreshed.
– Used if system not able to identify which data
is updated.
– Heavily taxes the network connections
between the source and staging area.
– Requires a copy of previously extracted data
to identify which data is new.
– Should be used as a last resort.

Extraction Methods 2
• Partial Extraction
– Update notification
• The operational system notifies when data is changed.
• Many databases support this mechanism for database
replication.
– Without update notification
• The system can identify which data has changed but does
not send notifications.
• An extraction of this changed data is provided.
• May not identify deleted data.
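A common way to do partial extraction without notifications is to filter on a last-modified timestamp. The column name and row layout below are illustrative assumptions:

```python
from datetime import datetime

# Hypothetical change-data extraction using a "last_modified" column,
# assuming the source records when rows change but sends no notifications.
def extract_changed(rows, last_run):
    """Return only the rows modified since the previous extraction run."""
    return [r for r in rows if r["last_modified"] > last_run]

rows = [
    {"id": 1, "last_modified": datetime(2024, 1, 1)},
    {"id": 2, "last_modified": datetime(2024, 3, 1)},
]
# Rows deleted from the source simply vanish from `rows`, so this method
# cannot detect deletions - matching the limitation noted above.
changed = extract_changed(rows, last_run=datetime(2024, 2, 1))
```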

Data Cleaning
• Data in a data warehouse is used during the
decision-making process, therefore quality
data is required.
• If the data input into the data warehouse is
dirty, inaccurate or incorrect, the decision-
making process will be misleading.
• Most time-consuming stage of ETL process.
• Note: Garbage in – Garbage out.

Validation during extraction

• Validation is implemented during the extraction process:
– Making sure data matches the source data.
– Extracting only relevant data – removing unwanted
data and spam.
– Check data types match and are relevant.
– Remove all redundant or fragmented data.
– Check keys are relevant and complete.
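A sketch of such extraction-time validation is below; the expected columns, types and rejection rules are illustrative assumptions:

```python
# Hypothetical validation step: enforce expected columns and types,
# reject rows with missing keys, and drop unwanted fields (e.g. spam).
EXPECTED = {"id": int, "amount": float}

def validate(row):
    if row.get("id") is None:                  # key must be present
        return None
    try:
        # keep only expected columns, coercing each to its expected type
        return {col: typ(row[col]) for col, typ in EXPECTED.items()}
    except (KeyError, TypeError, ValueError):  # missing or mistyped value
        return None

raw = [
    {"id": "7", "amount": "12.5", "spam": "ignore me"},
    {"id": None, "amount": "3.0"},
]
clean = [v for v in map(validate, raw) if v is not None]
```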
Pre-Transformation: Data Cleansing
• Find and remove duplicate tuples, e.g. Julie Mary
Green vs Ms J M Green.
• Detect inconsistent or wrong data, e.g. text values in
numeric attributes.
• Detect unreadable or incomplete data.
• Attribute mismatches: check attribute order and format.
E.g. American date vs European date; Currency;
• Combining data from various sources to enrich data.
E.g. combining data from sales, purchasing and
marketing.
• Use metadata for schema related transformations.
• Functions used for data cleaning should be stored for
reuse.
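Duplicate detection of the "Julie Mary Green vs Ms J M Green" kind usually works by normalising names before comparing them. The normalisation rule below (strip titles, compare initials plus surname) is a deliberately crude assumption for illustration:

```python
# Illustrative duplicate removal: normalise names so title/initial
# variants of the same person compare equal. Real cleansing tools use
# far more sophisticated matching than this sketch.
TITLES = {"ms", "mr", "mrs", "dr"}

def name_key(name):
    """Reduce a name to (first initials..., surname), ignoring titles."""
    parts = [p.lower().strip(".") for p in name.split()]
    parts = [p for p in parts if p not in TITLES]
    return tuple(p[0] for p in parts[:-1]) + (parts[-1],)

def dedupe(names):
    """Keep the first occurrence of each normalised name."""
    seen, unique = set(), []
    for n in names:
        k = name_key(n)
        if k not in seen:
            seen.add(k)
            unique.append(n)
    return unique

unique = dedupe(["Julie Mary Green", "Ms J M Green", "Bob Hall"])
```

Note that such cleaning functions, once written, should be stored for reuse, as the slide says.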

Data Cleaning issues
• Redundancy.
• Inconsistent spelling.
• Domain Constraints violated.
• Data entry errors from source systems.
• Integrity Constraints violated.
• Mismatched data values.
• Data from different source systems can have:
– Naming conflicts.
– Different data structures.
– Different data aggregation.

Data Transformation

• Most complex and time-consuming part of the
ETL process.
• Carried out in the staging area.
• Structures data ready for loading into
the data warehouse.
• Interrelated with data cleaning process.

Transformation Operations
• Standardises Data Format: e.g. data type and
length.
• Mapping data values to coded meaning: e.g.
coding ‘Female’ to ‘F’ or null values by ‘0’.
• Implementing Constraints: e.g. Establishment
and checking of key constraints across tables.
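The value-mapping and null-handling operations above can be sketched as follows; the code table and field names are illustrative assumptions:

```python
# Sketch of transformation operations: map descriptive values to
# warehouse codes and replace null values by 0 (codes are illustrative).
GENDER_CODES = {"female": "F", "male": "M"}

def transform_row(row):
    return {
        # standardise and map, defaulting to "U" (unknown) for bad input
        "gender": GENDER_CODES.get(str(row.get("gender", "")).lower(), "U"),
        # replace nulls by 0 so downstream aggregation does not break
        "amount": row["amount"] if row.get("amount") is not None else 0,
    }

out = transform_row({"gender": "Female", "amount": None})
```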

Transformation Operations Cont.
• Recoding of Records from Multiple Sources: data coming from
multiple sources can be coded in different, often incompatible or
poorly documented, forms. Recoding makes the data readily
available to the business user.
• Merging of Related Information: related records can be stored
together in one single entity, e.g. item, item price, item type,
description, etc.
• Splitting large records: a large text field can be split into smaller fields,
e.g. splitting address into Street, City, County and Country.
• Calculated and Derived attributes: aggregated or calculated attributes
can be created before loading into the Data Warehouse, e.g. Revenue,
Profit or Total Costs.
• Calculate Summaries: attribute values can be summarised and stored as
business facts in multidimensional tables, e.g. average sales.
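Splitting a large field and computing a derived attribute can be sketched together; the address layout and the profit formula are illustrative assumptions:

```python
# Sketch of two transformation operations: splitting a large text field,
# and computing a derived attribute before loading.

def split_address(address):
    """Split a comma-separated address into Street, City, County, Country."""
    street, city, county, country = [p.strip() for p in address.split(",")]
    return {"street": street, "city": city,
            "county": county, "country": country}

def derive_profit(revenue, costs):
    """Derived attribute computed once at transform time, not per query."""
    return revenue - costs

rec = split_address("10 High St, London, Greater London, UK")
profit = derive_profit(revenue=1000.0, costs=750.0)
```

Pre-computing derived values like this trades extra ETL work for faster, simpler queries against the warehouse.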

Data Loading
• Final stage of the ETL process where the data is
loaded into the data warehouse’s fact and
dimension tables.
• Load process considerations:
– Building indices.
– Checking integrity constraints. This can be disabled for
very large files and maintained by the ETL process
instead.
– Aggregation of fact by dimensions: stored in the
multidimensional database.
– Managing Partitions: data split by dimension values,
e.g. Year, Month etc.
Loading Methods
• For a typical data warehouse, vast amounts of
data need to be loaded in a short period of time,
e.g. overnight.
• Initial Load: data loaded into all data warehouse
tables.
• Incremental Load: periodically load data that
has changed since last load.
• Full Refresh: Subsections of the data warehouse
are deleted and completely reloaded. E.g.
specific tables.
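The incremental load is essentially a merge (or "upsert") keyed on the table's identifier. A sketch under that assumption, with an in-memory list standing in for a warehouse table:

```python
# Sketch of an incremental load: merge the rows changed since the last
# load into the warehouse table, keyed by id - new ids are inserted,
# existing ids are overwritten (an "upsert").
def incremental_load(warehouse, changed_rows):
    table = {r["id"]: r for r in warehouse}
    for r in changed_rows:
        table[r["id"]] = r            # insert new or update existing row
    return sorted(table.values(), key=lambda r: r["id"])

wh = [{"id": 1, "sales": 10}, {"id": 2, "sales": 20}]   # after initial load
wh = incremental_load(wh, [{"id": 2, "sales": 25},       # changed row
                           {"id": 3, "sales": 5}])       # new row
```

In a real warehouse this merge is typically done by the database (e.g. a SQL `MERGE` statement or a bulk loader) rather than in application code.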
Data Integration and the Extraction,
Transformation, and Load Process
Diagram: data is extracted from a packaged application, a legacy system and other internal applications into a transient data source, then loaded into the data warehouse and on into data marts. Source: Sharda, Delen, Turban (2018)


Reading

Chapter 3, Section 3.5:

Sharda, Delen, Turban (2018), Business Intelligence,
Analytics, and Data Science: A Managerial
Perspective. Pearson.

