Unit 2

Uploaded by

chaitanyagndh
Copyright
© All Rights Reserved
UNIT 2

ETL (Extract, Transform, and Load) Process


What is ETL?
The mechanism of extracting information from source systems and bringing it into the data
warehouse is commonly called ETL, which stands for Extraction, Transformation and
Loading.

The ETL process is technically challenging and requires active input from various
stakeholders, including developers, analysts, testers, and top executives.

To remain valuable as a tool for decision-makers, the data warehouse needs to change as
the business changes. ETL is a recurring activity (daily, weekly, monthly) of a data
warehouse system and needs to be agile, automated, and well documented.

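How Does ETL Work?


ETL is named for three phases, but the process described here has four steps, because
cleansing is listed separately from transformation:
Phases of ETL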

1. Extraction

Extraction is the process of reading data from one or more source systems for
further use in the data warehouse environment. This is the first phase of the ETL
process.
Extraction is often the most time-consuming task in ETL.

2. Cleansing

The cleansing phase is important in data warehousing because its job is to
improve data quality. The main data cleansing feature found in ETL tools is
homogenisation: the tools use dedicated dictionaries to rectify typing mistakes
and to recognise synonyms.

3. Transformation

Transformation is the key to the reconciliation phase. It converts the records from
their operational source format into a specific data warehouse format. If we deploy a
three-layer architecture, this phase produces the reconciled data layer.

4. Loading

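Loading is the mechanism of writing the data into the target database. In the
loading step, we must ensure that the data is loaded correctly. We can
perform data loading in two ways:
Refresh: We completely rewrite the data warehouse data, which means that the
older data is replaced. Refresh is generally used together with static
extraction to populate the data warehouse initially.
Update: Only the changes made to the source data are added to the data warehouse,
typically without deleting or modifying existing data. Update is generally used
together with incremental extraction to bring the data warehouse up to date regularly.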
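
The four phases above can be sketched in a few lines of Python. This is a minimal
illustration rather than a real ETL tool: the source rows, the `sales` table, the
`CITY_DICTIONARY` lookup, and all function names are assumptions made up for the
example, and only the refresh style of loading is shown.

```python
import sqlite3

# Hypothetical source rows, standing in for an export from a source system.
SOURCE_ROWS = [
    {"id": 1, "city": "N.Y.", "amount": "100.50"},
    {"id": 2, "city": "new york", "amount": "20"},
    {"id": 3, "city": "Bostn", "amount": "7.25"},
]

# Hypothetical cleansing dictionary: maps typos and synonyms to one spelling.
CITY_DICTIONARY = {"n.y.": "New York", "new york": "New York", "bostn": "Boston"}


def extract():
    """Extraction: read the rows from the source system."""
    return list(SOURCE_ROWS)


def cleanse(rows):
    """Cleansing: homogenise city names with the dictionary."""
    return [
        {**row, "city": CITY_DICTIONARY.get(row["city"].lower(), row["city"])}
        for row in rows
    ]


def transform(rows):
    """Transformation: convert rows to the warehouse format (numeric amount)."""
    return [(row["id"], row["city"], float(row["amount"])) for row in rows]


def load_refresh(conn, rows):
    """Loading (refresh): rewrite the warehouse table completely."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, city TEXT, amount REAL)"
    )
    conn.execute("DELETE FROM sales")  # the older data is replaced
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn):
    load_refresh(conn, transform(cleanse(extract())))


conn = sqlite3.connect(":memory:")
run_etl(conn)
totals = conn.execute(
    "SELECT city, SUM(amount) FROM sales GROUP BY city ORDER BY city"
).fetchall()
print(totals)
```

Because the load step deletes the old rows before inserting, running `run_etl` again
leaves the table with the same three rows instead of duplicating them.
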

What is Metadata in a Data Warehouse?

Metadata is data that describes other data. In data warehousing, metadata refers to
information that describes the characteristics and structure of the data held in the
warehouse. Metadata can include information such as column names, data types,
relationships between tables, and any constraints or business rules that apply to the data.

Metadata is vital for managing and maintaining a data warehouse because it provides a clear
understanding of the data, helps ensure data quality and consistency, and improves query
performance. Metadata also supports data lineage, which is the ability to trace where the
data in the warehouse came from and how it was changed along the way. This is important
for compliance, auditing, and troubleshooting.

Examples of Metadata:

1. File metadata: This includes information about a file, such as its name, size, type,
and creation date.
2. Image metadata: This includes information about an image, such as its resolution,
color depth, and camera settings.
3. Music metadata: This includes information about a piece of music, such as its title,
artist, album, and genre.
4. Video metadata: This includes information about a video, such as its length,
resolution, and frame rate.
5. Document metadata: This includes information about a document, such as its author,
title, and creation date.
6. Database metadata: This includes information about a database, such as its structure,
tables, and fields.
7. Web metadata: This includes information about a web page, such as its title,
keywords, and description.
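
Database metadata (item 6) can be read directly from most database systems. The sketch
below uses SQLite, whose `PRAGMA table_info` statement returns one row per column; the
`customer` table and the `column_metadata` variable are assumptions created only for
illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, created only so there is something to describe.
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = conn.execute("PRAGMA table_info(customer)").fetchall()

column_metadata = [
    {
        "name": name,
        "type": col_type,
        "not_null": bool(notnull),
        "primary_key": bool(pk),
    }
    for _cid, name, col_type, notnull, _default, pk in columns
]

for entry in column_metadata:
    print(entry)
```

The printed dictionaries describe the table (column names, data types, constraints)
without containing any customer data, which is exactly the distinction between metadata
and data.
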

Categories of Metadata
Metadata can be broadly categorized into three categories:

• Business Metadata: It has the data ownership information, business
definition, and changing policies.

• Technical Metadata: It includes database system names, table and column
names and sizes, data types and allowed values. Technical metadata also
includes structural information such as primary and foreign key attributes
and indices.

• Operational Metadata: It includes currency of data and data lineage.
Currency of data means whether the data is active, archived, or purged.
Lineage of data means the history of the data as it was migrated and the
transformations applied to it.
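
One way to picture the three categories is as a single catalog record per warehouse
table. The classes and field names below are hypothetical and chosen only to mirror the
descriptions above; real metadata repositories define their own schemas.

```python
from dataclasses import dataclass, field


@dataclass
class BusinessMetadata:
    owner: str        # data ownership information
    definition: str   # business definition


@dataclass
class TechnicalMetadata:
    table: str
    columns: dict     # column name -> data type
    primary_key: str


@dataclass
class OperationalMetadata:
    currency: str                                # "active", "archived", or "purged"
    lineage: list = field(default_factory=list)  # history of migrations and transformations

    def record_step(self, step):
        """Append one migration or transformation step to the lineage."""
        self.lineage.append(step)


@dataclass
class MetadataEntry:
    business: BusinessMetadata
    technical: TechnicalMetadata
    operational: OperationalMetadata


entry = MetadataEntry(
    business=BusinessMetadata(owner="Sales team", definition="One row per order"),
    technical=TechnicalMetadata(
        table="orders",
        columns={"id": "INTEGER", "amount": "REAL"},
        primary_key="id",
    ),
    operational=OperationalMetadata(currency="active"),
)
entry.operational.record_step("extracted from the order system")
entry.operational.record_step("amount converted to a numeric type")
print(entry.operational.lineage)
```
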

Features of Metadata in a Data Warehouse

1. Description: Metadata provides a description of the data it is associated with,
such as title, author, date created, and keywords.

2. Organization: You can use metadata to organize data, such as structuring a
document into chapters or sections.

3. Interoperability: Metadata can be used to ensure interoperability between
different systems by following common metadata standards.

4. Searchability: You can use metadata to improve the searchability of data. This
improvement makes it easier for users to find and access the information they
need.

5. Contextualization: Metadata can provide context for the data it is associated
with, making it easier for users to understand and interpret the information.

Applications of Metadata
In a data warehouse, metadata plays a critical role. Although it has a different function
from the warehouse data itself, metadata nonetheless has a significant impact. Some of its
essential roles are:

• Metadata acts as a directory. The decision support system uses this directory to
locate the contents of the data warehouse.

• Metadata helps decision support systems map data when it is transformed
from the operational environment to the data warehouse environment.

• Metadata is used by query tools.

• Metadata is used by cleansing and extraction tools.

• Metadata plays a crucial role in loading functions.

Limitations of Metadata Management


Metadata management also has some limitations:

1. Quality of Data: Improperly organized or inaccurate metadata can cause
data quality issues, which make the data harder to use and understand.

2. Lack of Standardization: Different systems and organizations use different
conventions or standards for managing metadata, so combining metadata from
different sources can be difficult.

3. Control over Data: In big organizations with many stakeholders, it is hard
to keep policies and standards in place.

4. Data Integrity: When integrating data from different sources, you have to
ensure that the metadata stays consistent.

5. Security over Data: When working with confidential or sensitive data, you
also have to handle privacy and security, which can be difficult.

What is an Operational System?


Operational system is a well-known term in data warehousing. It refers to a system used to
keep records of the everyday business activities of a company. Operational systems are also
called Online Transaction Processing (OLTP) systems. They must deal with real-time data
values, such as payroll, inventory, order, and other operational data.

Features:

• Process oriented, recording day-to-day business transactions
• Data structures optimized for transactions
• Sub-second response times
• Support for read, update, and delete operations on current data values

What is an Informational System?


Informational systems are structured systems that bring together an organization's people,
processes, and technology to improve how they interact. Informational systems are intended
to collect, compile, and extract information from data. They are widely used to improve the
efficiency of enterprises and organizations.

Features:
• The foundation of an information system is long-term planning.
• It gives a comprehensive perspective of the organization's dynamics and structure.
• It functions as a complete system that encompasses all interconnected
organizational subsystems.
• It is designed from the top down, with the decision makers or management providing
clear guidance throughout the information system's development phase.

Difference between Operational Systems and Informational Systems:


S.No | Operational Systems | Informational Systems
---- | ------------------- | ---------------------
1. | Operational systems are designed to deal with the running values of data. | Informational systems deal with collecting, compiling, and deriving information from data.
2. | In operational systems, data structures are optimized for transactions. | In informational systems, data structures are optimized for complex queries.
3. | Operational systems have sub-second response times. | Informational systems have response times of a few seconds to minutes.
4. | Operational systems are generally suited to small volumes of data. | Informational systems are mainly designed for, and convenient to use with, large volumes of data.
5. | Operational systems are process oriented. | Informational systems are subject oriented.
6. | Operational systems support various data access operations such as read, update, and delete. | Informational systems support only read operations for data access.
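
The contrast in rows 2 and 6 can be sketched with SQLite. The `orders` table and its
values are assumptions for illustration, and a real warehouse would keep the analytical
data in a separate system rather than in the same table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)

# Operational (OLTP-style) work: short transactions that insert,
# update, and delete individual rows holding the current values.
with conn:
    conn.executemany(
        "INSERT INTO orders (region, amount) VALUES (?, ?)",
        [("north", 10.0), ("north", 15.0), ("south", 40.0), ("south", 5.0)],
    )
with conn:
    conn.execute("UPDATE orders SET amount = 20.0 WHERE id = 2")
    conn.execute("DELETE FROM orders WHERE id = 4")

# Informational-style work: a read-only query that scans and
# aggregates many rows to answer a business question.
report = conn.execute(
    "SELECT region, COUNT(*), SUM(amount) FROM orders "
    "GROUP BY region ORDER BY region"
).fetchall()
print(report)
```
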

What is OLAP?
OLAP, or online analytical processing, is technology for performing high-speed complex
queries or multidimensional analysis on large volumes of data in a data warehouse, data
lake or other data repository. OLAP is used in business intelligence (BI), decision support,
and a variety of business forecasting and reporting applications.
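
Multidimensional analysis can be illustrated with a tiny in-memory example. The fact
rows, the three dimensions, and the `roll_up` and `slice_rows` helpers are hypothetical;
real OLAP engines use precomputed cubes or optimized storage rather than a Python loop.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, product, sales).
FACTS = [
    (2023, "north", "laptop", 100),
    (2023, "south", "laptop", 80),
    (2023, "north", "phone", 50),
    (2024, "north", "laptop", 120),
    (2024, "south", "phone", 70),
]
DIMENSIONS = ("year", "region", "product")


def roll_up(facts, keep):
    """Sum the sales measure over every dimension not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]
    return dict(totals)


def slice_rows(facts, dimension, value):
    """Keep only the rows where one dimension has a fixed value."""
    position = DIMENSIONS.index(dimension)
    return [row for row in facts if row[position] == value]


by_year = roll_up(FACTS, ["year"])
by_year_region = roll_up(FACTS, ["year", "region"])
north_by_product = roll_up(slice_rows(FACTS, "region", "north"), ["product"])

print(by_year)
print(by_year_region)
print(north_by_product)
```

The same fact rows answer three different questions (sales per year, per year and
region, and per product in one region) simply by choosing which dimensions to keep.
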

What is the difference between OLAP and DSS?


A data-driven decision support system (DSS) is used to access and manipulate data. Used in
conjunction with online analytical processing (OLAP), a data-driven DSS helps analysts
reach conclusions faster.

A DSS (Decision Support System), as the name suggests, helps top executives make decisions.
A DSS emphasizes access to, and time-series manipulation of, an enterprise's internal and
sometimes external data. The manipulation is done with tailor-made, task-specific tools and
operators, along with general tools that provide additional functionality.

OLAP (Online Analytical Processing) provides the highest level of functionality and decision
support, and is geared towards analyzing large collections of historical data. The
functionality of an OLAP tool is based purely on existing (current) data.
