Unit I
ETL
Data warehousing can be defined as the process of collecting data from various sources, storing it, and managing it to provide valuable business insights.
It is electronic storage where businesses keep large amounts of data and information.
It is a critical component of a business intelligence system that involves techniques for data analysis.
Goal: To produce statistical results that may help in business decision-making.
Data warehousing should be done so that the data stored remains secure, reliable, and can be easily
retrieved and managed.
Data analysis offers deeper insight into the performance of an organization by comparing consolidated data from the data warehouse.
A data warehouse runs queries and analysis on the historical data.
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in
support of management's decision making process.
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example,
"sales" can be a particular subject.
Integrated: Data from different source systems is brought into a consistent format (consistent names, units, and codes) before it is stored.
Time-Variant: Data is kept with a historical perspective, so it can be analyzed over time.
Non-volatile (permanent): Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.
For example, a college might want to see how the placement results of MCA students have improved over the last 5 years, in terms of salaries, counts, etc.
Need for Data Warehouse
-- An ordinary database can store only MBs to GBs of data, and that too for a specific purpose.
-- For storing data of TB size, storage shifted to the Data Warehouse.
-- A transactional database does not lend itself to analytics.
-- To perform analytics effectively, an organization keeps a central Data Warehouse to closely study its business, make strategic decisions, and analyze trends.
Steps in Data Warehousing
1. Extraction of data – A large amount of data is gathered from various sources.
2. Cleaning of data – Once the data is compiled, it goes through a cleaning process. The data is scanned
for errors, and any error found is either corrected or excluded.
3. Conversion of data – After being cleaned, the data is converted from the source database format into the warehouse format.
4. Storing in a warehouse – Once converted to the warehouse format, the data stored in a warehouse goes
through processes such as consolidation and summarization to make it easier and more coordinated to
use. As sources get updated over time, more data is added to the warehouse.
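A minimal sketch of the four steps above in Python, using only the standard library; the sales_source.csv file, its columns, and the sales_fact table are illustrative assumptions rather than a prescribed design:

    import csv
    import sqlite3

    SOURCE_FILE = "sales_source.csv"   # assumed source export with columns: date, region, amount

    conn = sqlite3.connect("warehouse.db")
    conn.execute("CREATE TABLE IF NOT EXISTS sales_fact (sale_date TEXT, region TEXT, amount REAL)")

    rows = []
    with open(SOURCE_FILE, newline="") as f:              # 1. Extraction
        for record in csv.DictReader(f):
            try:                                          # 2. Cleaning: skip records with bad amounts
                amount = float(record["amount"])
            except (ValueError, KeyError, TypeError):
                continue
            # 3. Conversion: rename/reorder fields into the warehouse format
            rows.append((record.get("date", ""), record.get("region", "UNKNOWN"), amount))

    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)   # 4. Storing
    conn.commit()
    # Simple summarization of the stored data
    for region, total in conn.execute("SELECT region, SUM(amount) FROM sales_fact GROUP BY region"):
        print(region, total)
    conn.close()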
Data Warehouse Design Process: A data warehouse can be built using a top-down approach or a bottom-up approach.
Top-down approach
External Sources –
An external source is a source from which data is collected, irrespective of the type of data. The data can be structured, semi-structured, or unstructured.
Data Marts –
A data mart contains a subset of the data stored in the data warehouse. It stores the information of a particular department of an organization and is handled by a single authority. There can be as many data marts in an organization as there are functions. We can also say that a data mart is a function-specific slice of the data warehouse.
Data Mining –
Data mining is the practice of analyzing the big data present in the data warehouse. It is used to find the hidden patterns present in the database or data warehouse with the help of data mining algorithms.
This approach is defined by Inmon as – the data warehouse is built first as a central repository for the complete organization, and data marts are created from it after the complete data warehouse has been created.
Bottom-up approach
This approach is defined by Kimball as – data marts are created first for individual departments or business processes and are then integrated to form the complete data warehouse.
Applications of Data Warehouse
Social Media Websites: Social networking websites like Facebook, Twitter, LinkedIn, etc. are based on analyzing large data sets. These sites gather data related to members, groups, locations, etc., and store it in a single central repository. Because of the large volume of data, a data warehouse is needed to implement this.
Banking: Most of the banks these days use warehouses to see the spending patterns of
account/cardholders. They use this to provide them with special offers, deals, etc.
Government: The government uses a data warehouse to store and analyze tax payments, which helps in detecting tax theft.
Advantages of Data Warehousing
Centralized Data Repository: Data warehousing provides a centralized repository for all enterprise data
from various sources, such as transactional databases, operational systems, and external sources. This
enables organizations to have a comprehensive view of their data, which can help in making informed
business decisions.
Data Integration: Data warehousing integrates data from different sources into a single, unified view,
which can help in eliminating data silos and reducing data inconsistencies.
Historical Data Storage: Data warehousing stores historical data, which enables organizations to analyze
data trends over time. This can help in identifying patterns and anomalies in the data, which can be used to
improve business performance.
Query and Analysis: Data warehousing provides powerful query and analysis capabilities that enable users to explore and analyze data in different ways, e.g., analyzing card spending to target credit card offers (a small sketch follows these points).
Data Transformation: This involves cleaning, filtering, and formatting data from various sources to make
it consistent and usable. This can help in improving data quality and reducing data inconsistencies.
Data Mining: Data warehousing provides data mining capabilities, which enable organizations to discover
hidden patterns and relationships in their data. This can help in identifying new opportunities, predicting
future trends, and mitigating risks.
Data Security: Data warehousing provides robust data security features, such as access controls, data
encryption, and data backups, which ensure that the data is secure and protected from unauthorized access.
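As one illustration of the query-and-analysis point above (the sketch referenced in that item), the snippet below uses pandas on a made-up card-spend extract; the cardholder_id, category, and amount columns are assumptions for illustration only:

    import pandas as pd

    # Hypothetical extract of historical card spending pulled from the warehouse.
    spend = pd.DataFrame({
        "cardholder_id": [101, 101, 102, 102, 103],
        "category":      ["travel", "dining", "travel", "grocery", "dining"],
        "amount":        [450.0, 60.0, 1200.0, 90.0, 35.0],
    })

    # Total spend per cardholder and category, the kind of pattern a bank
    # might use when deciding which offers to send to which cardholders.
    summary = (spend.groupby(["cardholder_id", "category"])["amount"]
                    .sum()
                    .reset_index()
                    .sort_values("amount", ascending=False))
    print(summary)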
Disadvantages of Data Warehousing
Complexity: Data warehousing can be complex, and businesses may need to hire specialized personnel to
manage the system.
Time-consuming: Building a data warehouse can take a significant amount of time.
Data integration challenges: Data from different sources can be challenging to integrate, requiring
significant effort to ensure consistency and accuracy.
Data security: Data warehousing can pose data security risks, and businesses must take measures to
protect sensitive data from unauthorized access or breaches.
OLTP (On-Line Transaction Processing) System:
It refers to systems that manage transaction-oriented applications.
Designed to support online transactions and process queries quickly on the Internet.
It is an online database-modifying system, for example, an ATM.
Every industry in today’s world uses OLTP systems to record its transactional data.
It supports simple database queries, so the response time of any user action is very fast.
The data acquired through an OLTP system is stored in a commercial RDBMS, which can be used by an OLAP system for data analytics and other business intelligence operations.
It supports database operations like INSERT, UPDATE, and DELETE on the database.
It does not support complex queries.
Some other examples of OLTP systems include order entry, retail sales, and financial transaction systems.
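A minimal sketch contrasting the two styles of work with Python's built-in sqlite3; the accounts and txns tables are assumptions for illustration, not a real banking schema:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (acct_id INTEGER PRIMARY KEY, balance REAL)")
    conn.execute("CREATE TABLE txns (acct_id INTEGER, amount REAL, txn_date TEXT)")

    # OLTP-style work: short, simple INSERT/UPDATE statements with fast response times.
    conn.execute("INSERT INTO accounts VALUES (1, 500.0)")
    conn.execute("INSERT INTO txns VALUES (1, -40.0, '2024-01-05')")   # e.g., an ATM withdrawal
    conn.execute("UPDATE accounts SET balance = balance - 40.0 WHERE acct_id = 1")
    conn.commit()

    # OLAP-style work (normally run against the warehouse copy of this data):
    # an aggregate query over history rather than a single-row change.
    for acct_id, total in conn.execute("SELECT acct_id, SUM(amount) FROM txns GROUP BY acct_id"):
        print(acct_id, total)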
Data Warehousing vs OLTP
Data Warehousing | OLTP
It is a technique that gathers or collects data from different sources into a central repository. | It is a technique used for detailed day-to-day transaction data, which keeps changing every day.
It is designed for the decision-making process. | It is designed for the business transaction process.
It is used for analyzing the business. | It is used for running the business.
The size of the database is around 100 GB to 2 TB. | The size of the database is around 10 MB to 100 GB.
Denormalized data is present. | Normalized data is present.
It is subject-oriented. | It is application-oriented.
Data redundancy is present. | There is no data redundancy.
Types of Data Warehouse
There are three main types of data warehouse: the Enterprise Data Warehouse (EDW), the Operational Data Store (ODS), and the Data Mart.
An EDW consists of data sources, where data is collected from various operational and transactional systems within the organization, such as ERP systems, CRM platforms, finance applications, IoT devices, and mobile and online systems.
It also consists of a staging area, where data is aggregated, cleaned, and prepared before being loaded into the EDW.
An EDW has a presentation or access layer, which provides an interface for users to access and interact with the data stored in the EDW. It enables analytics, querying, reporting, and data sharing.
A data mart is a subset of a data warehouse built to serve a particular department, region, or business unit.
A data mart is focused only on a particular function of an organization and is maintained by a single authority, e.g., finance or marketing.
Data Marts are small in size and are flexible.
The data from the data mart is stored in the ODS periodically.
The ODS then sends the data to the EDW, where it is stored and used.
The fundamental use of a data mart is Business Intelligence (BI) applications.
3 types of Data Marts: Dependent, Independent and Hybrid
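A small sketch of a dependent data mart as a department-level slice of warehouse data, again using sqlite3; the warehouse_expenses table and finance_mart view are assumed names used only for illustration:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE warehouse_expenses (department TEXT, expense_date TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO warehouse_expenses VALUES (?, ?, ?)",
        [("finance", "2024-01-10", 1200.0),
         ("marketing", "2024-01-11", 800.0),
         ("finance", "2024-02-02", 400.0)])

    # A dependent data mart: a finance-only view derived from the central warehouse table.
    conn.execute("CREATE VIEW finance_mart AS "
                 "SELECT expense_date, amount FROM warehouse_expenses WHERE department = 'finance'")

    for row in conn.execute("SELECT * FROM finance_mart"):
        print(row)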
Data Warehouse Tools
Cloud-Based Data Warehouses
o Amazon Redshift
o Microsoft Azure
o Google BigQuery
o Snowflake
o Micro Focus Vertica
NoSQL Data Stores
o Amazon DynamoDB
Object Storage
o Amazon S3
Enterprise Data Warehouse Solutions
o Teradata
Cloud-Based Relational Databases
o Amazon RDS (Relational Database Service)
o IBM Db2 Warehouse
o Oracle Autonomous Data Warehouse
Open-Source Relational Databases
o MariaDB
o PostgreSQL
Data Warehouse Implementation: planning and project management
1. Requirements analysis and capacity planning: The first process in data warehousing involves defining
enterprise needs, defining architectures, carrying out capacity planning, and selecting the hardware and software
tools.
2. Hardware integration: Once the hardware and software have been selected, they need to be put together by integrating the servers, the storage methods, and the user software tools.
3. Modeling: Modelling is a significant stage that involves designing the warehouse schema and views.
4. Physical modeling: For the data warehouses to perform efficiently, physical modeling is needed. This
contains designing the physical data warehouse organization, data placement, data partitioning, deciding on
access techniques, and indexing.
5. Sources: The information for the data warehouse comes from several data sources. This step involves identifying and connecting the sources using a gateway or ODBC drivers.
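A hedged sketch of connecting to one such source through ODBC from Python with the pyodbc package; the driver name, server, credentials, and the orders table are placeholders, not real values:

    import pyodbc   # third-party ODBC binding; assumes an ODBC driver is installed

    # Placeholder connection string for an assumed ERP source system.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=source-erp.example.com;"
        "DATABASE=erp;"
        "UID=etl_user;PWD=secret"
    )
    cursor = conn.cursor()
    cursor.execute("SELECT order_id, order_date, amount FROM orders")   # assumed source table
    for row in cursor.fetchmany(5):   # pull a small sample to verify the connection
        print(row)
    conn.close()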
6. ETL: The collected data needs to go through an ETL phase. Designing and implementing the ETL phase includes selecting suitable ETL tool vendors and purchasing and implementing the tools.
7. Populate the data warehouses: Once the ETL tools have been finalized, testing the tools will be needed.
Once everything is working adequately, the ETL tools may be used in populating the warehouses.
8. User applications: For the data warehouses to be helpful, there must be end-user applications. This step
contains designing and implementing applications required by the end-users.
9. Roll-out the warehouses and applications: Once the data warehouse has been populated and the end-client
applications tested, the warehouse system and the operations may be rolled out for the user's community to use.
Data Warehouse Development Life Cycle
1) Requirement gathering
It is done by business analysts, the onsite technical lead, and the client.
In this phase, a business analyst prepares the Business Requirement Specification (BRS) document.
About 80% of requirement collection takes place at the client's place, and it takes 3-4 months to collect the requirements.
2) Analysis
After collecting the requirements, the data modeler starts identifying dimensions, facts & aggregations depending on the requirements.
An ETL lead & BA create the ETL specification document, which describes how each target table is to be populated from the source.
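One possible way to capture part of such a specification, shown here as a small Python dictionary; the table names, column mappings, and rules are invented purely for illustration:

    # Sketch of an ETL specification entry for one target table.
    etl_spec = {
        "target_table": "sales_fact",
        "source": "erp.orders",
        "load_type": "incremental",            # full or incremental load
        "column_mappings": {
            "sale_date": "orders.order_date",
            "region":    "orders.sales_region",
            "amount":    "orders.net_amount",
        },
        "transformations": [
            "cast order_date to an ISO-8601 date",
            "default sales_region to 'UNKNOWN' when null",
        ],
    }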
4) Data Modeling
It is the process of designing the database to fulfill the user requirements.
A data modeler is responsible for creating DWH/Data marts with the following kinds of schema
Star Schema
Snowflake Schema
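A minimal star schema sketch, written as SQL DDL executed from Python's sqlite3; the fact and dimension table names are assumptions. In a snowflake schema the dimensions would be normalized further, e.g. product category split out into its own table:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, year INTEGER);
        CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
        CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, city TEXT, region TEXT);

        -- Fact table at the center of the star, referencing each dimension.
        CREATE TABLE sales_fact (
            date_key    INTEGER REFERENCES dim_date(date_key),
            product_key INTEGER REFERENCES dim_product(product_key),
            store_key   INTEGER REFERENCES dim_store(store_key),
            units_sold  INTEGER,
            revenue     REAL
        );
    """)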
5) ETL Development
Designing ETL applications to fulfill the specification documents prepared in the analysis phase
6) ETL Code review
Code review will be done by the developer. The following activities take place
Check the naming standards
Check the business logic
Check the mapping of source to target
7) Peer Review
The code will be reviewed by a team member: validation of the code, but not of the data
8) ETL Testing
Following tests will be carried out for each ETL Application
Unit testing
Business Functionality testing
Performance testing
User acceptance testing
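A self-contained sketch of what the unit-testing step might look like with pytest, using an in-memory SQLite database to stand in for the real source and target systems (the table names are assumptions):

    import sqlite3
    import pytest

    @pytest.fixture
    def etl_result():
        # Stand-in for a completed ETL run: a tiny source table and its loaded target.
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE source_orders (order_id INTEGER, amount REAL)")
        conn.execute("CREATE TABLE target_orders (order_key INTEGER, amount REAL)")
        conn.executemany("INSERT INTO source_orders VALUES (?, ?)", [(1, 10.0), (2, 20.0)])
        conn.executemany("INSERT INTO target_orders VALUES (?, ?)", [(1, 10.0), (2, 20.0)])
        return conn

    def test_row_counts_match(etl_result):
        src = etl_result.execute("SELECT COUNT(*) FROM source_orders").fetchone()[0]
        tgt = etl_result.execute("SELECT COUNT(*) FROM target_orders").fetchone()[0]
        assert src == tgt          # every extracted row should land in the target

    def test_no_null_keys(etl_result):
        nulls = etl_result.execute(
            "SELECT COUNT(*) FROM target_orders WHERE order_key IS NULL").fetchone()[0]
        assert nulls == 0          # surrogate keys in the target must never be null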
10) Deployment
A process of migrating the ETL Code & Reports to a pre-production environment for stabilization
It is also known as pilot phase/stabilization phase