ETL (Extract, Transform, and Load) Process
What is ETL?
ETL is an abbreviation of Extract, Transform, and Load. In this process, an ETL tool
extracts data from different RDBMS source systems, then transforms the data by
applying calculations, concatenations, and so on, and finally loads the data into the
Data Warehouse system.
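The three steps can be sketched in a few lines of Python. This is only an illustration: the source rows, field names, and in-memory "warehouse" list are hypothetical stand-ins for real databases.

```python
# A minimal sketch of the three ETL steps, using plain Python
# structures in place of real source systems and a warehouse.

def extract(source_rows):
    """Extract: pull raw rows from a source system."""
    return list(source_rows)

def transform(rows):
    """Transform: apply calculations and concatenations."""
    out = []
    for r in rows:
        out.append({
            "customer": f'{r["first"]} {r["last"]}',  # concatenation
            "total": r["qty"] * r["unit_price"],      # calculation
        })
    return out

def load(target, rows):
    """Load: append transformed rows into the warehouse table."""
    target.extend(rows)
    return target

warehouse = []
source = [{"first": "Ada", "last": "Lovelace", "qty": 2, "unit_price": 10.0}]
load(warehouse, transform(extract(source)))
print(warehouse)
```

In a real pipeline each function would talk to a database or a staging area, but the shape of the flow is the same: extract, then transform, then load.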
It's tempting to think that creating a Data Warehouse is simply a matter of extracting
data from multiple sources and loading it into the database of a Data Warehouse. This
is far from the truth: a complex ETL process is required. The ETL process demands
active input from various stakeholders, including developers, analysts, testers, and
top executives, and it is technically challenging.
In order to maintain its value as a tool for decision-makers, a Data Warehouse system
needs to change with the business. ETL is a recurring activity (daily, weekly,
monthly) of a Data Warehouse system and needs to be agile, automated, and well
documented.
• It helps companies analyze their business data in order to make critical business
decisions.
• Transactional databases cannot answer the complex business questions that a
Data Warehouse populated by ETL can.
• A Data Warehouse provides a common data repository.
• ETL provides a method of moving data from various sources into a Data
Warehouse.
• As data sources change, the Data Warehouse is brought up to date on the next
ETL run.
• A well-designed and documented ETL system is almost essential to the
success of a Data Warehouse project.
• ETL allows verification of data transformation, aggregation, and calculation rules.
• The ETL process allows sample data comparison between the source and the
target system.
Hence, one needs a logical data map before data is extracted and loaded physically.
This data map describes the relationship between source and target data.
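A logical data map can be sketched as a simple lookup from source fields to target fields. The table and column names below are hypothetical examples, not part of any real schema.

```python
# A logical data map: (source table, source column) -> (target table, target column).
logical_data_map = {
    ("crm.customers", "cust_name"): ("dim_customer", "customer_name"),
    ("crm.customers", "cust_city"): ("dim_customer", "city"),
    ("erp.orders", "order_amt"): ("fact_sales", "sales_amount"),
}

def target_for(source_table, source_column):
    """Look up where a source field lands in the warehouse."""
    return logical_data_map[(source_table, source_column)]

print(target_for("erp.orders", "order_amt"))
```

In practice the map is usually maintained as a spreadsheet or metadata table, but the idea is the same: every extracted field has a documented destination before any physical load happens.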
There are three data extraction methods:
1. Full Extraction
2. Partial Extraction - without update notification
3. Partial Extraction - with update notification
Irrespective of the method used, extraction should not affect the performance or
response time of the source systems. These source systems are live production
databases; any slowdown or locking could affect the company's bottom line.
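Partial extraction without update notification is often done with a "high-water mark": each run pulls only rows modified since the previous run's cutoff, which keeps the load on the live source small. The sketch below assumes a hypothetical `last_modified` column on the source rows.

```python
from datetime import datetime

# Source rows with a last-modified timestamp (hypothetical layout).
rows = [
    {"id": 1, "last_modified": datetime(2024, 1, 1)},
    {"id": 2, "last_modified": datetime(2024, 2, 1)},
    {"id": 3, "last_modified": datetime(2024, 3, 1)},
]

def partial_extract(rows, high_water_mark):
    """Return only the rows changed since the last extraction run."""
    return [r for r in rows if r["last_modified"] > high_water_mark]

changed = partial_extract(rows, datetime(2024, 1, 15))
print([r["id"] for r in changed])
```

After each run, the pipeline records the newest timestamp it saw and uses it as the next run's high-water mark.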
In this step, you apply a set of functions to the extracted data. Data that does not
require any transformation is called direct move or pass-through data.
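The distinction can be shown in a small transformation function: one field passes through untouched, while others get concatenations and calculations applied. The field names are illustrative only.

```python
# A sketch of the transformation step: direct-move fields pass
# through unchanged; other fields get calculations/concatenations.
def transform_row(row):
    return {
        # direct move / pass-through data:
        "order_id": row["order_id"],
        # concatenation (with cleanup):
        "full_name": row["first"].strip().title() + " " + row["last"].strip().title(),
        # calculation:
        "net_total": round(row["gross"] - row["discount"], 2),
    }

result = transform_row({"order_id": 7, "first": " ada ", "last": "LOVELACE",
                        "gross": 100.0, "discount": 12.5})
print(result)
```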
Types of Loading:
1. Initial Load: populating all the Data Warehouse tables for the first time.
2. Incremental Load: applying ongoing changes periodically as they occur.
3. Full Refresh: erasing the contents of one or more tables and reloading them with fresh data.
Load verification
• Ensure that the key field data is neither missing nor null.
• Test modeling views based on the target tables.
• Check combined values and calculated measures.
• Run data checks in the dimension tables as well as the history tables.
• Check the BI reports on the loaded fact and dimension tables.
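The first verification check above, ensuring key fields are neither missing nor null, can be sketched as a simple scan over the loaded rows; a real system would run the equivalent SQL against the target tables. The column name is hypothetical.

```python
# Load verification sketch: find rows whose key field is missing or null.
def verify_load(rows, key_field):
    """Return the indexes of rows that fail the key-field check."""
    problems = []
    for i, row in enumerate(rows):
        if key_field not in row or row[key_field] is None:
            problems.append(i)
    return problems

loaded = [{"customer_key": 1}, {"customer_key": None}, {"name": "x"}]
print(verify_load(loaded, "customer_key"))
```

A clean load returns an empty list; anything else flags rows for investigation before the warehouse is opened to reporting.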
1. MarkLogic:
MarkLogic is a data warehousing solution which makes data integration easier and
faster using an array of enterprise features. It can query different types of data like
documents, relationships, and metadata.
http://developer.marklogic.com/products
2. Oracle:
https://www.oracle.com/index.html
3. Amazon RedShift:
https://aws.amazon.com/redshift/?nc2=h_m1
Every organization would like all of its data to be clean, but most are neither ready
to pay for full cleansing nor ready to wait for it. Cleaning everything would simply
take too long, so it is better not to try to cleanse all the data.
Always plan to clean something because the biggest reason for building the Data
Warehouse is to offer cleaner and more reliable data.
Before cleansing all the dirty data, it is important for you to determine the
cleansing cost for every dirty data element.
To reduce storage costs, store summarized data on tape storage. A trade-off is also
required between the volume of data to be stored and how much detail is actually
used: trade off at the level of data granularity to decrease storage costs.
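Trading granularity for storage can be sketched by collapsing detail rows into one summary row per dimension value before archiving the detail. The `region`/`amount` layout is an illustrative assumption.

```python
from collections import defaultdict

# Detail rows at full granularity (hypothetical layout).
detail = [
    {"region": "east", "amount": 10.0},
    {"region": "east", "amount": 15.0},
    {"region": "west", "amount": 7.0},
]

def summarize(rows):
    """Aggregate detail rows to one total per region."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["region"]] += r["amount"]
    return dict(totals)

print(summarize(detail))
```

Here three detail rows collapse into two summary rows; at warehouse scale the same idea can shrink storage by orders of magnitude, at the cost of losing row-level detail.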
Summary:
• ETL is an abbreviation of Extract, Transform and Load.
• ETL provides a method of moving the data from various sources into a data
warehouse.
• In the first step extraction, data is extracted from the source system into the
staging area.
• In the transformation step, the data extracted from source is cleansed and
transformed.
• Loading data into the target Data Warehouse is the last step of the ETL
process.