Unit 2
The ETL (extract, transform, load) process requires active input from various stakeholders,
including developers, analysts, testers, and top executives, and it is technically challenging.
To maintain its value as a tool for decision-makers, a data warehouse needs to
change as the business changes. ETL is a recurring activity (daily, weekly, monthly) of a data
warehouse system and needs to be agile, automated, and well documented.
1. Extraction
Extraction is the process of pulling data from a source system for
further use in the data warehouse environment. This is the first phase of the ETL
process.
The extraction process is often the most time-consuming task in ETL.
2. Cleansing
The cleansing phase is important in data warehousing because its job is to
improve data quality. The main data cleansing feature found in ETL tools
is homogenisation. These tools use dedicated dictionaries to correct typing
mistakes and to recognise synonyms.
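
As a minimal sketch of dictionary-based homogenisation, the snippet below maps misspellings
and synonyms to one canonical value. The dictionary entries and the function name
homogenise are hypothetical, chosen only for illustration; real ETL tools ship their own
dictionaries.

```python
# Hypothetical correction dictionary: misspellings and synonyms mapped
# to one canonical value. The entries are illustrative only.
CANONICAL = {
    "calfornia": "California",   # typing mistake
    "ca": "California",          # synonym / abbreviation
    "california": "California",
    "n.y.": "New York",
    "new york": "New York",
}

def homogenise(value: str) -> str:
    """Return the canonical form of value, or the trimmed input if unknown."""
    key = value.strip().lower()
    return CANONICAL.get(key, value.strip())

raw_states = ["Calfornia", " CA ", "N.Y.", "Texas"]
clean_states = [homogenise(s) for s in raw_states]
print(clean_states)
```

Values that are not in the dictionary (here "Texas") pass through unchanged, so the step
never discards data it does not recognise.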
3. Transformation
Transformation is the key to the reconciliation phase. It converts records from
their operational source format into a specific data warehouse format. If we deploy a
three-layer architecture, this phase produces the reconciled data layer.
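
The conversion from a source format to a warehouse format could look roughly like the
sketch below. The source field names (CustID, OrderDt, Amt), the target column names, and
the date format are all assumptions made for this example, not part of any specific tool.

```python
from datetime import datetime

def transform(record: dict) -> dict:
    """Map one source record (hypothetical layout) to a warehouse row."""
    return {
        # Source keeps IDs as zero-padded text; the warehouse uses integers.
        "customer_id": int(record["CustID"]),
        # Source dates are assumed to be DD/MM/YYYY; the warehouse uses ISO 8601.
        "order_date": datetime.strptime(record["OrderDt"], "%d/%m/%Y").date().isoformat(),
        # Amounts arrive as text and are stored as numbers with two decimals.
        "amount_usd": round(float(record["Amt"]), 2),
    }

source_record = {"CustID": "0042", "OrderDt": "05/03/2024", "Amt": "199.5"}
warehouse_row = transform(source_record)
print(warehouse_row)
```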
4. Loading
Loading is the mechanism of writing the data into the target database. In the
loading step, we must ensure that the data is loaded correctly. We can
perform data loading in two ways:
Refresh: We completely rewrite the data warehouse data. This means that we
replace the older file. Generally, refresh is used together with static
extraction to populate the data warehouse initially.
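
A refresh load can be sketched with Python's built-in sqlite3 module as shown below. The
table name sales, its columns, and the function name refresh_load are hypothetical; an
in-memory database stands in for the warehouse.

```python
import sqlite3

def refresh_load(conn, rows):
    """Replace the full contents of the target table with rows."""
    with conn:  # one transaction: the whole refresh is applied, or none of it
        conn.execute("DELETE FROM sales")
        conn.executemany(
            "INSERT INTO sales (region, amount) VALUES (?, ?)", rows
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")

refresh_load(conn, [("North", 100.0), ("South", 250.0)])
refresh_load(conn, [("North", 120.0)])  # second refresh overwrites the first

loaded = conn.execute("SELECT region, amount FROM sales").fetchall()
print(loaded)
```

Running the delete and the inserts in a single transaction is a design choice: if the
insert fails, the old data is kept rather than leaving the table empty.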
What is MetaData in Data Warehouse?
Metadata is data that describes other data. In data warehousing, metadata refers to
information that describes the characteristics and structure of the data held in the
warehouse. Metadata can include information such as column names, data types,
relationships between tables, and any constraints or business rules that apply to the data.
Metadata is vital for managing and maintaining a data warehouse because it provides a clear
understanding of the data, helps ensure data quality and consistency, and can help improve
query performance. Metadata also supports data lineage, which is the ability to trace the
origin and history of data in the warehouse. This is important for compliance, auditing, and
troubleshooting.
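
To make the idea of column names and data types as metadata concrete, the sketch below
asks a database for the description of one of its tables. It uses SQLite (through Python's
built-in sqlite3 module) purely as a stand-in for a warehouse; the customer table is
hypothetical.

```python
import sqlite3

meta_conn = sqlite3.connect(":memory:")
meta_conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
column_metadata = [
    (name, col_type)
    for _cid, name, col_type, _notnull, _default, _pk
    in meta_conn.execute("PRAGMA table_info(customer)")
]
print(column_metadata)
```

The query returns data about the table, not the customer rows themselves, which is exactly
the distinction between metadata and warehouse data.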
1. File metadata: This includes information about a file, such as its name, size, type,
and creation date.
2. Image metadata: This includes information about an image, such as its resolution,
color depth, and camera settings.
3. Music metadata: This includes information about a piece of music, such as its title,
artist, album, and genre.
4. Video metadata: This includes information about a video, such as its length,
resolution, and frame rate.
5. Document metadata: This includes information about a document, such as its author,
title, and creation date.
6. Database metadata: This includes information about a database, such as its structure,
tables, and fields.
7. Web metadata: This includes information about a web page, such as its title,
keywords, and description.
Categories of Metadata
Metadata can be broadly grouped into three categories:
4. Searchability: You can use Metadata to improve the searchability of data. This
improvement makes it easier for users to find and access the information they
need.
Applications of MetaData
In a data warehouse, metadata plays a critical role. Although it has a different function than
the warehouse data, metadata nonetheless has a significant impact. Some of its essential
roles are:
Metadata acts as a directory. The decision support system uses this directory to locate
the data warehouse's contents.
3. Control over Data: In big organizations that have many stakeholders, it isn't
easy to keep the policies or standards in place.
4. Data Integrity: When you are working on the integration of data from different
sources, you have to ensure the consistency of the Metadata.
5. Security over Data: When working with confidential or sensitive data in any
organization, you have to handle privacy and security, which is quite tricky.
Features:
The foundation of an information system is long-term planning.
It gives a comprehensive perspective of the organization's dynamics and structure.
It functions as a full and comprehensive system that encompasses all interconnected
organizational subsystems.
It is designed from the top down, with the decision makers or management providing
clear guidance throughout the information system's development phase.
What is OLAP?
OLAP, or online analytical processing, is technology for performing high-speed complex
queries or multidimensional analysis on large volumes of data in a data warehouse, data
lake or other data repository. OLAP is used in business intelligence (BI), decision support,
and a variety of business forecasting and reporting applications.
OLAP provides a high level of functionality and decision support, which is tied to
analyzing large collections of historical data. The
functionality of an OLAP tool is based purely on the existing (current) data.
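
Multidimensional analysis can be illustrated with a small roll-up over a few fact rows.
This is only a sketch in plain Python: the dimensions (year, region, product), the sample
figures, and the function name roll_up are invented for the example, and real OLAP engines
work on far larger data with precomputed aggregates.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, product, sales amount).
facts = [
    (2023, "North", "Laptop", 1200),
    (2023, "South", "Laptop", 800),
    (2024, "North", "Phone", 500),
    (2024, "North", "Laptop", 1500),
]

DIMENSIONS = ("year", "region", "product")

def roll_up(rows, by):
    """Sum the sales measure, grouped by the chosen dimensions."""
    idx = [DIMENSIONS.index(d) for d in by]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in idx)
        totals[key] += row[-1]  # last field is the measure
    return dict(totals)

sales_by_year = roll_up(facts, by=("year",))
sales_by_year_region = roll_up(facts, by=("year", "region"))
print(sales_by_year)
print(sales_by_year_region)
```

Grouping by fewer dimensions gives a coarser summary (roll-up); adding a dimension back
gives a more detailed view (drill-down).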