01 Data Warehouse
Introduction to Data Warehouse, Building a Data Warehouse, Data Pre-processing & Data Cleaning, Data Cleaning Methods, Data Reduction, Descriptive Data Summarization, Data Discretization, Concept Hierarchy Generation
KDD Process
KDD (Knowledge Discovery in Databases) is the process of extracting useful, previously unknown, and potentially valuable information from large datasets. KDD is iterative: the steps below are typically repeated several times before accurate knowledge is extracted from the data.
The KDD process includes the following steps:
Data Cleaning
Data cleaning is defined as the removal of noisy and irrelevant data from the collection. It covers:
1. Handling missing values.
2. Smoothing noisy data, where noise is a random error or variance in a measured variable.
3. Detecting data discrepancies and correcting them with data transformation tools.
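As a concrete (hypothetical) illustration of these three steps, the sketch below uses pandas to fill missing values, smooth an obvious outlier by clipping, and flag rows that violate a simple domain rule; the column names and data are invented for the example.

    import pandas as pd
    import numpy as np

    # Hypothetical raw data with a missing value and an obvious outlier (noise).
    raw = pd.DataFrame({
        "age":    [23, np.nan, 45, 31, 290, 38],      # 290 is noise
        "income": [42000, 51000, np.nan, 39000, 47000, 52000],
    })

    # 1. Missing values: fill numeric gaps with the column mean.
    cleaned = raw.fillna(raw.mean(numeric_only=True))

    # 2. Noisy data: clip values outside a plausible range (smoothing by boundaries).
    cleaned["age"] = cleaned["age"].clip(lower=0, upper=100)

    # 3. Discrepancy detection: flag rows that still violate a domain rule.
    violations = cleaned[(cleaned["income"] < 0) | (cleaned["age"] < 18)]
    print(cleaned)
    print(violations)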
Data Integration
Data integration is defined as combining heterogeneous data from multiple sources into a common store (the data warehouse). It is carried out with data migration tools, data synchronization tools, and the ETL (Extract, Transform, Load) process.
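A minimal, illustrative ETL sketch in Python/pandas; the two in-memory frames stand in for heterogeneous source files, and all column names are assumptions made for the example.

    import pandas as pd

    # Extract: read from two hypothetical heterogeneous sources.
    sales_csv  = pd.DataFrame({"cust_id": [1, 2], "amount": [100.0, 250.0]})   # stands in for a CSV file
    sales_json = pd.DataFrame({"customer": [3], "amt": [75.0]})                # stands in for a JSON feed

    # Transform: map both sources onto a common schema.
    sales_json = sales_json.rename(columns={"customer": "cust_id", "amt": "amount"})
    unified = pd.concat([sales_csv, sales_json], ignore_index=True)

    # Load: append the unified rows into the warehouse table (here, a placeholder DataFrame).
    warehouse_sales = unified
    print(warehouse_sales)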
Data Selection
Data selection is defined as the process of deciding which data is relevant to the analysis and retrieving it from the data collection. Methods such as neural networks, decision trees, Naive Bayes, clustering, and regression can be used at this stage.
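A small sketch of data selection with pandas: only the task-relevant attributes and rows are retrieved from a hypothetical customer collection (the column names and the churn-analysis scenario are invented).

    import pandas as pd

    customers = pd.DataFrame({
        "cust_id":     [1, 2, 3, 4],
        "region":      ["north", "south", "north", "east"],
        "total_spend": [120.0, 80.0, 300.0, 45.0],
        "phone":       ["555-1", "555-2", "555-3", "555-4"],   # irrelevant to the analysis
    })

    # Select only the attributes relevant to the analysis, and only northern customers.
    task_relevant = customers.loc[customers["region"] == "north", ["cust_id", "total_spend"]]
    print(task_relevant)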
Data Transformation
Data transformation is defined as the process of transforming data into the form required by the mining procedure. It is a two-step process:
1. Data mapping: assigning elements from the source base to the destination to capture transformations.
2. Code generation: creation of the actual transformation program.
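A brief sketch of both steps, assuming hypothetical source and destination field names: the mapping is declared as a dictionary, and the "generated" transformation applies it together with a min-max normalization.

    import pandas as pd

    source = pd.DataFrame({"SAL": [30000, 60000, 90000], "DEPT_CODE": ["A", "B", "A"]})

    # 1. Data mapping: declare how source fields map onto destination fields.
    field_map = {"SAL": "salary", "DEPT_CODE": "department"}

    # 2. Code generation: apply the mapping and the required transformation
    #    (min-max normalization of salary into [0, 1] for the mining step).
    dest = source.rename(columns=field_map)
    dest["salary"] = (dest["salary"] - dest["salary"].min()) / (dest["salary"].max() - dest["salary"].min())
    print(dest)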
Data Mining
Data mining is defined as the application of techniques that extract potentially useful patterns. It transforms the task-relevant data into patterns and decides the purpose of the model, such as classification or characterization.
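A toy data mining step using a scikit-learn decision tree classifier; the feature matrix, labels, and the offer-response task are invented for illustration.

    from sklearn.tree import DecisionTreeClassifier

    # Toy task-relevant data: [age, income] -> will the customer respond to an offer?
    X = [[25, 40000], [47, 95000], [35, 60000], [52, 110000], [23, 30000]]
    y = [0, 1, 1, 1, 0]

    model = DecisionTreeClassifier(max_depth=2, random_state=0)
    model.fit(X, y)

    # The fitted tree is the mined pattern; apply it to new, unseen records.
    print(model.predict([[30, 50000], [50, 100000]]))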
Pattern Evaluation
Pattern evaluation is defined as identifying the truly interesting patterns that represent knowledge, based on given interestingness measures. It finds an interestingness score for each pattern and uses summarization and visualization to make the results understandable to the user.
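One common way to score interestingness is the support and confidence of an association rule; the short sketch below computes both for a hypothetical rule {milk} -> {bread} over invented transactions.

    # Hypothetical market-basket transactions.
    transactions = [
        {"milk", "bread"},
        {"milk", "bread", "butter"},
        {"bread"},
        {"milk", "butter"},
    ]

    def support(itemset):
        # Fraction of transactions containing every item in the itemset.
        return sum(itemset <= t for t in transactions) / len(transactions)

    # Evaluate the rule {milk} -> {bread}.
    rule_support = support({"milk", "bread"})
    confidence = rule_support / support({"milk"})
    print(f"support={rule_support:.2f}, confidence={confidence:.2f}")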
Knowledge Representation
This involves presenting the results in a way that is meaningful and can be used to make decisions.
Advantages of KDD
1. Improves decision-making: KDD provides valuable insights and knowledge that can help organizations make better decisions.
2. Increased efficiency: KDD automates repetitive and time-consuming tasks and makes the data ready for analysis, which saves time and money.
3. Better customer service: KDD helps organizations gain a better understanding of their customers' needs and preferences, which can help them provide better customer service.
4. Fraud detection: KDD can be used to detect fraudulent activities by identifying patterns and anomalies in the data that may indicate fraud.
5. Predictive modeling: KDD can be used to build predictive models that can forecast future trends and patterns.
Disadvantages of KDD
1. Privacy concerns: KDD can raise privacy concerns, as it involves collecting and analyzing large amounts of data, which can include sensitive information about individuals.
2. Complexity: KDD can be a complex process that requires specialized skills and knowledge to implement and to interpret the results.
3. Unintended consequences: KDD can lead to unintended consequences, such as bias or discrimination, if the data or models are not properly understood or used.
4. Data quality: the KDD process depends heavily on the quality of the data; if the data is not accurate or consistent, the results can be misleading.
5. High cost: KDD can be an expensive process, requiring significant investments in hardware, software, and personnel.
6. Overfitting: the KDD process can lead to overfitting, a common problem in machine learning where a model learns the detail and noise in the training data to the extent that it negatively impacts its performance on new, unseen data (see the sketch after this list).
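A hedged illustration of point 6: on purely random (invented) data, an unrestricted decision tree memorizes the training set, so a large gap between training and test accuracy signals overfitting.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))          # random features
    y = rng.integers(0, 2, size=200)       # random labels: nothing real to learn

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

    model = DecisionTreeClassifier(random_state=0)   # unrestricted depth memorizes the noise
    model.fit(X_tr, y_tr)

    # A large gap between training and test accuracy signals overfitting.
    print("train accuracy:", model.score(X_tr, y_tr))
    print("test accuracy:",  model.score(X_te, y_te))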
Parameter | KDD | Data Mining
Definition | KDD refers to a process of identifying valid, novel, potentially useful, and ultimately understandable patterns and relationships in data. | Data mining refers to a process of extracting useful and valuable information or patterns from large data sets.
Objective | To find useful knowledge from data. | To extract useful information from data.
Techniques used | Data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation and visualization. | Association rules, classification, clustering, regression, decision trees, neural networks, and dimensionality reduction.
Output | Structured information, such as rules and models, that can be used to make decisions or predictions. | Patterns, associations, or insights that can be used to improve decision-making or understanding.
Focus | Focus is on the discovery of useful knowledge, rather than simply finding patterns in data. | Focus is on the discovery of patterns or relationships in data.
Role of domain expertise | Domain expertise is important in KDD, as it helps in defining the goals of the process, choosing appropriate data, and interpreting the results. | Domain expertise is less critical in data mining, as the algorithms are designed to identify patterns without relying on prior knowledge.
Data Warehouse
According to William H. Inmon, a leading architect in the construction of data warehouse systems, a data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process. A data warehouse stores a huge amount of data, typically collected from multiple heterogeneous sources such as files and DBMSs. The goal is to produce statistical results that can help in decision-making.
For example, a college might want quick results across different measures, such as how the placement of CS students has improved over the last 10 years in terms of salaries, counts, etc.
The four keywords—subject-oriented, integrated, time-variant, and nonvolatile—distinguish data
warehouses from other data repository systems, such as relational database systems, transaction
processing systems, and file systems.
Subject-oriented: A data warehouse is organized around major subjects such as customer, supplier, product, and sales. Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers.
Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and online transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, and attribute measures.
Time-variant: Data are stored to provide information from a historical perspective (e.g., the past 5-10 years). Every key structure in the data warehouse contains, implicitly or explicitly, a time element.
Nonvolatile: A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. Due to this separation, a data warehouse does not require transaction processing, recovery, and concurrency control mechanisms. It usually requires only two operations in data accessing: initial loading of data and access of data.
Data Warehouse Design Process
A data warehouse can be built using a top-down approach, a bottom-up approach, or a combination
of both.
The top-down approach starts with the overall design and planning. It is useful in cases where the
technology is mature and well known, and where the business problems that must be solved are clear
and well understood.
The bottom-up approach starts with experiments and prototypes. This is useful in the early stage of
business modeling and technology development. It allows an organization to move forward at
considerably less expense and to evaluate the benefits of the technology before making significant
commitments.
In the combined approach, an organization can exploit the planned and strategic nature of the top-down
approach while retaining the rapid implementation and opportunistic application of the bottom-up
approach.
The warehouse design process consists of the following steps:
1. Choose a business process to model, for example, orders, invoices, shipments, inventory, account administration, sales, or the general ledger. If the business process is organizational and involves multiple complex object collections, a data warehouse model should be followed. However, if the process is departmental and focuses on the analysis of one kind of business process, a data mart model should be chosen.
2. Choose the grain of the business process. The grain is the fundamental, atomic level of data to be represented in the fact table for this process, for example, individual transactions, individual daily snapshots, and so on.
3. Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item, customer, supplier, warehouse, transaction type, and status.
4. Choose the measures that will populate each fact table record. Typical measures are numeric additive quantities like dollars sold and units sold.
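A minimal sketch of these design choices using pandas: a sales fact table at the grain of one row per transaction, time and item dimension tables, and additive measures rolled up by dimension; every table and column name here is illustrative.

    import pandas as pd

    # Dimension tables (time dimension declared for completeness).
    dim_time = pd.DataFrame({"time_key": [1, 2], "date": ["2024-01-01", "2024-01-02"]})
    dim_item = pd.DataFrame({"item_key": [10, 11], "item_name": ["pen", "notebook"]})

    # Fact table: grain = one row per individual sales transaction.
    fact_sales = pd.DataFrame({
        "time_key":     [1, 1, 2],
        "item_key":     [10, 11, 10],
        "dollars_sold": [5.0, 12.0, 7.5],   # additive measures
        "units_sold":   [5, 3, 6],
    })

    # Typical warehouse query: total measures by item (roll-up over time).
    report = (fact_sales.merge(dim_item, on="item_key")
                        .groupby("item_name")[["dollars_sold", "units_sold"]]
                        .sum())
    print(report)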
Operational Database | Data Warehouse
Operational systems are designed to support high-volume transaction processing. | Data warehousing systems are typically designed to support high-volume analytical processing (i.e., OLAP).
Operational systems are usually concerned with current data. | Data warehousing systems are usually concerned with historical data.
Data within operational systems are updated regularly according to need. | Non-volatile: new data may be added regularly, but once added it is rarely changed.
It is designed for real-time business dealings and processes. | It is designed for analysis of business measures by subject area, categories, and attributes.
It is optimized for a simple set of transactions, generally adding or retrieving a single row at a time per table. | It is optimized for bulk loads and large, complex, unpredictable queries that access many rows per table.
It is optimized for validation of incoming information during transactions and uses validation data tables. | It is loaded with consistent, valid information and requires no real-time validation.
It supports thousands of concurrent clients. | It supports a few concurrent clients relative to OLTP.
Operational systems are widely process-oriented. | Data warehousing systems are widely subject-oriented.
Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of data. | Data warehousing systems are usually optimized to perform fast retrievals of relatively large volumes of data.