Unit I

Uploaded by

shrutibelekar0
Copyright
© © All Rights Reserved

Unit I: Data Warehouse Fundamentals

Introduction to Data Warehouse

ETL tools:
 Oracle: Oracle Data Integrator
 Microsoft: SSIS (SQL Server Integration Services)

 Data warehousing is the process of collecting data from various sources, storing it, and managing it to provide valuable business insights.
 A data warehouse is an electronic store in which a business keeps a large amount of data and information.
 It is a critical component of a business intelligence system, which involves techniques for data analysis.
 Goal: to produce statistical results that may help in business decision-making.
 Data warehousing should be done so that the stored data remains secure and reliable and can be easily retrieved and managed.
 Data analysis offers deeper information about the performance of an organization by comparing data held in the data warehouse.
 A data warehouse runs queries and analysis on historical data.
 A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.

Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example,
"sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources.

Time-Variant: Historical data is kept in a data warehouse.

Non-volatile (permanent): Once data is in the data warehouse, it will not change. So, historical data in
a data warehouse should never be altered.
For example, a college might want to see how the placement results of its MCA students have improved over the last 5 years in terms of salaries, number of students placed, etc.
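The time-variant and non-volatile properties can be illustrated with a small Python sketch of the college example. The class names, fields, and figures are hypothetical and chosen only for illustration: yearly snapshots are appended and never edited, so a trend over time can be computed from them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a stored snapshot cannot be modified (non-volatile)
class PlacementSnapshot:
    year: int
    students_placed: int


class PlacementHistory:
    """Append-only store of yearly snapshots (hypothetical illustration)."""

    def __init__(self):
        self._snapshots = []

    def append(self, snapshot):
        # Existing years are never overwritten; history is only extended.
        if any(s.year == snapshot.year for s in self._snapshots):
            raise ValueError(f"snapshot for {snapshot.year} already stored")
        self._snapshots.append(snapshot)

    def yearly_change(self):
        """Return (year, change in students placed versus the previous stored year)."""
        ordered = sorted(self._snapshots, key=lambda s: s.year)
        return [
            (later.year, later.students_placed - earlier.students_placed)
            for earlier, later in zip(ordered, ordered[1:])
        ]


history = PlacementHistory()
for year, placed in [(2021, 40), (2022, 55), (2023, 50)]:  # made-up figures
    history.append(PlacementSnapshot(year, placed))

print(history.yearly_change())  # [(2022, 15), (2023, -5)]
```

Rejecting a second snapshot for the same year is one simple way to model "historical data should never be altered"; a real warehouse would enforce this through its load process rather than in application code.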
Need for Data Warehouse
-- An ordinary (operational) database typically holds MBs to GBs of data and serves one specific purpose.
-- Once data grows to TB size, it is stored in a data warehouse instead.
-- A transactional database does not lend itself to analytics.
-- To perform analytics effectively, an organization keeps a central data warehouse so that it can study its business closely, make strategic decisions, and analyze trends.
Steps in Data Warehousing
1. Extraction of data – A large amount of data is gathered from various sources.
2. Cleaning of data – Once the data is compiled, it goes through a cleaning process. The data is scanned
for errors, and any error found is either corrected or excluded.
3. Conversion of data – After cleaning, the data is converted from the source database format to the warehouse format.
4. Storing in a warehouse – Once converted to the warehouse format, the data stored in a warehouse goes
through processes such as consolidation and summarization to make it easier and more coordinated to
use. As sources get updated over time, more data is added to the warehouse.
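The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a real warehouse loader: the source records, field names (`region`, `amount`), and the dictionary used as the "warehouse" are all hypothetical.

```python
def extract(sources):
    """Step 1: gather the raw records from every source into one list."""
    return [record for source in sources for record in source]


def clean(records):
    """Step 2: correct fixable errors and exclude records that cannot be fixed."""
    cleaned = []
    for record in records:
        region = str(record.get("region", "")).strip().title()  # fix spacing and case
        try:
            amount = float(record.get("amount"))
        except (TypeError, ValueError):
            continue  # unusable amount: exclude the record
        if region:
            cleaned.append({"region": region, "amount": amount})
    return cleaned


def convert(records):
    """Step 3: change each record into the warehouse row format (region, amount)."""
    return [(r["region"], r["amount"]) for r in records]


def store(warehouse, rows):
    """Step 4: add the rows and keep a summarized total per region."""
    warehouse["rows"].extend(rows)
    totals = warehouse["total_by_region"]
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
    return warehouse


# Hypothetical source systems with inconsistent formatting and one bad value.
crm_source = [{"region": " north ", "amount": "100.50"}, {"region": "south", "amount": "n/a"}]
billing_source = [{"region": "North", "amount": 49.5}, {"region": "south", "amount": "20"}]

warehouse = {"rows": [], "total_by_region": {}}
store(warehouse, convert(clean(extract([crm_source, billing_source]))))
print(warehouse["total_by_region"])  # {'North': 150.0, 'South': 20.0}
```

Calling `store` again with rows from a later extraction adds to the same totals, which mirrors the point that more data is added to the warehouse as the sources are updated.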
Data Warehouse Design Process: A data warehouse can be built using either a top-down approach or a bottom-up approach.

Top-down approach

External Sources –
An external source is any source from which data is collected, irrespective of the type of data. The data can be structured, semi-structured, or unstructured.

Stage Area – ETL Tool
Since the data extracted from the external sources does not follow a particular format, it must be validated before it is loaded into the data warehouse.
E (Extract): Data is extracted from the external data sources.
T (Transform): Data is transformed into the standard format.
L (Load): Data is loaded into the data warehouse after it has been transformed into the standard format.
Data warehouse –
After cleansing, the data is stored in the data warehouse, which acts as the central repository. It holds the metadata together with the integrated data; department-level subsets of that data are stored in the data marts.

Data Marts –
A data mart contains a subset of the data stored in the data warehouse. It stores the information of a particular department or function of an organization and is handled by a single authority. An organization can have as many data marts as it has functions.

Data Mining –
Data mining is the practice of analyzing the large volumes of data held in the data warehouse. Data mining algorithms are used to find hidden patterns in the database or data warehouse.
This approach is defined by Inmon: the data warehouse is a central repository for the complete organization, and data marts are created from it after the complete data warehouse has been created.
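The top-down order (warehouse first, data marts derived from it) can be sketched with Python's built-in sqlite3 module. The table, view, and column names and the figures are hypothetical; a view is used here as one simple way to model a data mart that depends on the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 1. The central warehouse is built first and holds data for every department.
conn.execute("CREATE TABLE warehouse (department TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse VALUES (?, ?, ?)",
    [
        ("finance", "north", 100.0),
        ("finance", "south", 50.0),
        ("marketing", "north", 30.0),
        ("finance", "north", 25.0),
    ],
)

# 2. A data mart is then derived from the warehouse as a department-level subset.
conn.execute(
    "CREATE VIEW finance_mart AS "
    "SELECT region, amount FROM warehouse WHERE department = 'finance'"
)

finance_totals = conn.execute(
    "SELECT region, SUM(amount) FROM finance_mart GROUP BY region ORDER BY region"
).fetchall()
print(finance_totals)  # [('north', 125.0), ('south', 50.0)]
```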
Bottom-up approach

 First, the data is extracted from external sources.
 Then, the data goes through the staging area and is loaded into data marts instead of the data warehouse.
 The data marts are created first and provide reporting capability. Each data mart addresses a single business area.
 These data marts are then integrated into the data warehouse.
 This approach is given by Kimball: data marts are created first and provide a thin view for analysis, and the data warehouse is created after the data marts are complete.
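The bottom-up order (data marts first, warehouse integrated from them) can be sketched in the same way with sqlite3. All table names and figures are hypothetical, and a plain UNION ALL stands in for the integration step, which in practice also involves agreeing on shared dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 1. Data marts are created first, one per business area.
conn.execute("CREATE TABLE sales_mart (region TEXT, amount REAL)")
conn.execute("CREATE TABLE marketing_mart (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales_mart VALUES (?, ?)", [("north", 100.0), ("south", 40.0)]
)
conn.executemany("INSERT INTO marketing_mart VALUES (?, ?)", [("north", 10.0)])

# 2. The marts are then integrated into one warehouse table.
conn.execute(
    "CREATE TABLE warehouse AS "
    "SELECT 'sales' AS business_area, region, amount FROM sales_mart "
    "UNION ALL "
    "SELECT 'marketing', region, amount FROM marketing_mart"
)

region_totals = conn.execute(
    "SELECT region, SUM(amount) FROM warehouse GROUP BY region ORDER BY region"
).fetchall()
print(region_totals)  # [('north', 110.0), ('south', 40.0)]
```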
Applications of Data Warehousing
Data warehousing can be applied wherever there is a huge amount of data and we want statistical results that help in decision making.

 Social Media Websites: Social networking websites such as Facebook, Twitter, and LinkedIn are based on analyzing large data sets. These sites gather data related to members, groups, locations, etc., and store it in a single central repository. Because the volume of data is so large, a data warehouse is needed to implement this.

 Banking: Most banks these days use data warehouses to see the spending patterns of account holders and cardholders. They use this information to provide customers with special offers, deals, etc.

 Government: Governments use data warehouses to store and analyze tax payments in order to detect tax evasion.

Features/Characteristics/Properties of Data Warehousing

 Centralized Data Repository: Data warehousing provides a centralized repository for all enterprise data
from various sources, such as transactional databases, operational systems, and external sources. This
enables organizations to have a comprehensive view of their data, which can help in making informed
business decisions.

 Data Integration: Data warehousing integrates data from different sources into a single, unified view,
which can help in eliminating data silos and reducing data inconsistencies.
 Historical Data Storage: Data warehousing stores historical data, which enables organizations to analyze
data trends over time. This can help in identifying patterns and anomalies in the data, which can be used to
improve business performance.

 Query and Analysis: Data warehousing provides powerful query and analysis capabilities that enable
users to explore and analyze data in different ways. Ex. Credit card offers.

 Data Transformation: This involves cleaning, filtering, and formatting data from various sources to make
it consistent and usable. This can help in improving data quality and reducing data inconsistencies.

 Data Mining: Data warehousing provides data mining capabilities, which enable organizations to discover
hidden patterns and relationships in their data. This can help in identifying new opportunities, predicting
future trends, and mitigating risks.

 Data Security: Data warehousing provides robust data security features, such as access controls, data
encryption, and data backups, which ensure that the data is secure and protected from unauthorized access.

Advantages of Data Warehousing


 Intelligent Decision-Making: With centralized data in warehouses, decisions may be made more quickly
and intelligently.
 Business Intelligence: Provides strong operational insights through business intelligence.
 Historical Analysis: Predictions and trend analysis are made easier by storing past data.
 Data Quality: Guarantees data quality and consistency for trustworthy reporting.
 Scalability: Capable of managing massive data volumes and expanding to meet changing requirements.
 Effective Queries: Fast and effective data retrieval is made possible by an optimized structure.
 Cost reductions: Data warehousing can result in cost savings over time by reducing data management
procedures and increasing overall efficiency, even when there are setup costs initially.
 Data security: Data warehouses employ security protocols to safeguard confidential information,
guaranteeing that only authorized personnel are granted access to certain data.

Disadvantages of Data Warehousing


 Cost: Building a data warehouse can be expensive, requiring significant investments in hardware,
software, and personnel.

 Complexity: Data warehousing can be complex, and businesses may need to hire specialized personnel to
manage the system.
 Time-consuming: Building a data warehouse can take a significant amount of time.

 Data integration challenges: Data from different sources can be challenging to integrate, requiring
significant effort to ensure consistency and accuracy.

 Data security: Data warehousing can pose data security risks, and businesses must take measures to
protect sensitive data from unauthorized access or breaches.
OLTP (On-Line Transaction Processing) System:
 OLTP refers to systems that manage transaction-oriented applications.
 It is designed to support online transactions and to process queries quickly, often over the Internet.
 It is an online database-modifying system, for example, an ATM.
 Almost every industry today uses OLTP systems to record its transactional data.
 It supports simple database queries, so the response time of any user action is very fast.
 The data acquired through an OLTP system is stored in a commercial RDBMS and can be used by an OLAP system for data analytics and other business intelligence operations.
 It supports database operations such as INSERT, UPDATE, and DELETE.
 It does not support complex queries.

Some other examples of OLTP systems include order entry, retail sales, and financial transaction systems.
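A short sqlite3 sketch of an ATM-style withdrawal shows what a typical OLTP operation looks like: a few simple UPDATE and INSERT statements grouped into one transaction. The schema, the `withdraw` function, and the amounts are hypothetical and only illustrate the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL NOT NULL)")
conn.execute("CREATE TABLE withdrawals (account_id INTEGER, amount REAL)")
conn.execute("INSERT INTO accounts VALUES (1, 100.0)")
conn.commit()


def withdraw(connection, account_id, amount):
    """Debit an account as one transaction; undo everything if it would overdraw."""
    with connection:  # commits on success, rolls back if an exception is raised
        cursor = connection.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, account_id),
        )
        if cursor.rowcount == 0:
            raise ValueError("unknown account")
        (balance,) = connection.execute(
            "SELECT balance FROM accounts WHERE id = ?", (account_id,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        connection.execute(
            "INSERT INTO withdrawals VALUES (?, ?)", (account_id, amount)
        )


def balance_of(connection, account_id):
    """Read the current balance of one account."""
    return connection.execute(
        "SELECT balance FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()[0]


withdraw(conn, 1, 30.0)
print(balance_of(conn, 1))  # 70.0
```

A withdrawal larger than the balance raises `ValueError`, and the rollback leaves both tables exactly as they were.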

Advantages of an OLTP System:


 It is user friendly and can be used by anyone with a basic understanding of the system.
 It allows its users to read, write, and delete data quickly.
 It responds to user actions immediately because it can process queries very quickly.
 These systems are the original source of the data.
 It helps to administer and run fundamental business tasks.
 It helps widen the customer base of an organization by simplifying individual processes.
Difference between Data Warehousing and Online-Transaction processing (OLTP)

Data Warehousing (DWH) | Online Transaction Processing (OLTP)
It is a technique that gathers data from different sources into a central repository. | It is a technique used for detailed day-to-day transaction data, which keeps changing every day.
It is designed for the decision-making process. | It is designed for the business transaction process.
It stores a large amount of data, including historical data. | It holds current data.
It is used for analyzing the business. | It is used for running the business.
The size of the database is around 100GB-2TB. | The size of the database is around 10MB-100GB.
Denormalized data is present. | Normalized data is present.
It uses query processing. | It uses transaction processing.
It is subject-oriented. | It is application-oriented.
Data redundancy is present. | There is no data redundancy.
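The contrast between normalized OLTP data and denormalized, redundant warehouse data can be seen in a small sqlite3 sketch. The tables, customer names, and amounts are hypothetical; the point is only that the warehouse-style table repeats the city on every row so that an analytical query needs no join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- OLTP side: normalized, each customer's city is stored exactly once.
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Asha', 'Pune'), (2, 'Ravi', 'Nagpur');
    INSERT INTO orders VALUES (1, 1, 200.0), (2, 1, 50.0), (3, 2, 75.0);

    -- Warehouse side: denormalized, the city is repeated on every sales row.
    CREATE TABLE sales_wide AS
        SELECT o.id AS order_id, c.name AS customer, c.city AS city, o.amount AS amount
        FROM orders o JOIN customers c ON c.id = o.customer_id;
    """
)

# Analytical query on the denormalized table: no join is needed.
city_totals = conn.execute(
    "SELECT city, SUM(amount) FROM sales_wide GROUP BY city ORDER BY city"
).fetchall()
print(city_totals)  # [('Nagpur', 75.0), ('Pune', 250.0)]

# Redundancy: 'Pune' is stored once in customers but twice in sales_wide.
pune_rows = conn.execute(
    "SELECT COUNT(*) FROM sales_wide WHERE city = 'Pune'"
).fetchone()[0]
print(pune_rows)  # 2
```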
Types of Data Warehouse
There are three main types of data warehouse.

Enterprise Data Warehouse (EDW)


It is a relational data warehouse containing a company's business data, including information about its customers.
E.g. Amazon Redshift, Google BigQuery, Snowflake, Microsoft Azure Synapse Analytics, etc.

It provides access to cross-organizational information, offers a unified approach to data representation, and allows running complex queries.

An EDW consists of data sources: data is collected from various operational and transactional systems within the organization, such as ERP systems, CRM platforms, finance applications, IoT devices, and mobile and online systems.

It also consists of a staging area, where data is aggregated, cleaned, and prepared before being loaded into the EDW.

An EDW has a presentation (access) space, which provides an interface for users to access and interact with the data stored in the EDW. It enables analytics, querying, reporting, and data sharing.

Operational Data Store (ODS)

This type of data warehouse is refreshed in real time.
It is often preferred for routine activities such as storing employee records.
It is required when data warehouse systems do not support the reporting needs of the business.
Data Mart

A data mart is a subset of a data warehouse built to serve a particular department, region, or business unit.
A data mart focuses on one particular function of an organization and is maintained by a single authority, e.g. Finance or Marketing.
Data marts are small in size and flexible.
The data from the data mart is stored in the ODS periodically.
The ODS then sends the data to the EDW, where it is stored and used.
The fundamental use of a data mart is in Business Intelligence (BI) applications.
There are 3 types of data marts: dependent, independent, and hybrid.
Data Warehouse Tools
 Cloud-Based Data Warehouses
o Amazon Redshift
o Microsoft Azure Synapse Analytics
o Google BigQuery
o Snowflake
o Micro Focus Vertica
 NoSQL Data Stores
o Amazon DynamoDB
 Object Storage
o Amazon S3
 Enterprise Data Warehouse Solutions
o Teradata
 Cloud-Based Relational Databases
o Amazon RDS (Relational Database Service)
o IBM Db2 Warehouse
o Oracle Autonomous Data Warehouse
 Open-Source Relational Databases
o MariaDB
o PostgreSQL
Data Warehouse Implementation: planning and project management

1. Requirements analysis and capacity planning: The first step in data warehousing involves defining enterprise needs, defining the architecture, carrying out capacity planning, and selecting the hardware and software tools.

2. Hardware integration: Once the hardware and software have been selected, they need to be put together by integrating the servers, the storage systems, and the user software tools.

3. Modeling: Modeling is a significant stage that involves designing the warehouse schema and views.

4. Physical modeling: For the data warehouse to perform efficiently, physical modeling is needed. This involves designing the physical data warehouse organization, data placement, and data partitioning, and deciding on access methods and indexing.

5. Sources: The information for the data warehouse comes from several data sources. This step involves identifying and connecting the sources using gateways or ODBC drivers.

6. ETL: The collected data needs to go through an ETL phase. Designing and implementing the ETL phase includes identifying suitable ETL tool vendors and purchasing and implementing the tools.

7. Populate the data warehouse: Once the ETL tools have been chosen, they need to be tested. Once everything is working adequately, the ETL tools can be used to populate the warehouse.

8. User applications: For the data warehouse to be useful, there must be end-user applications. This step involves designing and implementing the applications required by the end users.

9. Roll out the warehouse and applications: Once the data warehouse has been populated and the end-user applications tested, the warehouse system and the applications can be rolled out for the user community to use.
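The indexing decision mentioned under physical modeling (step 4) can be illustrated with sqlite3. This is only a sketch under stated assumptions: the `sales` table, the index name, and the generated rows are hypothetical, and SQLite's `EXPLAIN QUERY PLAN` is used simply to show that the same query switches from a full scan to an index search once the index exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, sale_year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(f"region_{i % 10}", 2020 + i % 5, float(i)) for i in range(1000)],
)


def query_plan(connection):
    """Return SQLite's plan description for a typical filtered aggregate query."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT SUM(amount) FROM sales WHERE region = ?",
        ("region_3",),
    ).fetchall()
    return " ".join(row[-1] for row in rows)  # last column holds the plan detail text


plan_before = query_plan(conn)  # no index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
plan_after = query_plan(conn)   # the plan now mentions idx_sales_region

print(plan_before)
print(plan_after)
```

The exact wording of the plan text differs between SQLite versions, so the sketch prints it rather than assuming a fixed string.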
Data Warehouse Development Life Cycle

1) Requirement gathering
 It is done by business analysts, the onsite technical lead, and the client.
 In this phase, a business analyst prepares the Business Requirement Specification (BRS) document.
 About 80% of requirement collection takes place at the client's site, and collecting the requirements takes 3-4 months.
2) Analysis
 After the requirements are collected, the data modeler starts identifying dimensions, facts, and aggregations depending on the requirements.
 An ETL lead and a BA create the ETL specification document, which describes how each target table is to be populated from the source.

3) System Requirement Specification (SRS)
 After the onsite knowledge transfer, an offshore team prepares the SRS.
 An SRS document includes the software, hardware, and operating system requirements.

4) Data Modeling
 It is the process of designing the database to fulfill the user requirements.
 A data modeler is responsible for creating the DWH/data marts with the following kinds of schema:
 Star Schema
 Snowflake Schema

5) ETL Development
 Designing ETL applications to fulfill the specification documents prepared in the analysis phase.
6) ETL Code review
The code review is done by the developer. The following activities take place:
 Check the naming standards
 Check the business logic
 Check the mapping of source to target

7) Peer Review
The code is reviewed by a team member. The review validates the code but not the data.
8) ETL Testing
The following tests are carried out for each ETL application:
 Unit testing
 Business Functionality testing
 Performance testing
 User acceptance testing

9) Report development environment
 Design the reports to fulfill the report requirement templates/Report Data Workbook (RDW).

10) Deployment
 The process of migrating the ETL code and reports to a pre-production environment for stabilization.
 It is also known as the pilot phase or stabilization phase.

11) Production Environment/Go live
 An active, working environment.
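The star schema named in the Data Modeling phase can be sketched with sqlite3: one central fact table holds the measures and refers to the surrounding dimension tables. The table names, columns, and figures are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables describe the "who/what/when" of each fact.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);

    -- The fact table holds the measures and points at the dimensions.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        revenue REAL
    );

    INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
    INSERT INTO dim_date VALUES (1, 2023), (2, 2024);
    INSERT INTO fact_sales VALUES (1, 1, 2000.0), (1, 2, 1000.0), (2, 2, 600.0);
    """
)

# A typical star-schema query: join the fact table to its dimensions and aggregate.
revenue_by_category_year = conn.execute(
    """
    SELECT p.category, d.year, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
    """
).fetchall()
print(revenue_by_category_year)
# [('Electronics', 2023, 2000.0), ('Electronics', 2024, 1000.0), ('Furniture', 2024, 600.0)]
```

In a snowflake schema the dimensions themselves would be normalized further, for example by moving `category` out of `dim_product` into its own table.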
