01 Data Warehouse
Introduction to Data Warehouse, Building a Data Warehouse, Data Pre-processing & Data Cleaning, Data Cleaning Methods, Data Reduction, Descriptive Data Summarization, Data Discretization, Concept Hierarchy Generation
KDD Process
KDD (Knowledge Discovery in Databases) is the process of extracting useful, previously unknown, and potentially valuable information from large datasets. KDD is iterative: the steps below are typically repeated several times before accurate knowledge is extracted from the data.
The KDD process includes the following steps:
Data Cleaning
Data cleaning is defined as the removal of noisy and irrelevant data from the collection. It covers:
1. Handling missing values.
2. Smoothing noisy data, where noise is a random error or variance in a measured variable.
3. Detecting data discrepancies and correcting them with data transformation tools.
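As a concrete (hypothetical) illustration of these three steps, the sketch below uses pandas to fill missing values, smooth an obvious outlier by clipping, and flag rows that violate a simple domain rule; the column names and data are invented for the example.

    import pandas as pd
    import numpy as np

    # Hypothetical raw data with a missing value and an obvious outlier (noise).
    raw = pd.DataFrame({
        "age":    [23, np.nan, 45, 31, 290, 38],      # 290 is noise
        "income": [42000, 51000, np.nan, 39000, 47000, 52000],
    })

    # 1. Missing values: fill numeric gaps with the column mean.
    cleaned = raw.fillna(raw.mean(numeric_only=True))

    # 2. Noisy data: clip values outside a plausible range (smoothing by boundaries).
    cleaned["age"] = cleaned["age"].clip(lower=0, upper=100)

    # 3. Discrepancy detection: flag rows that still violate a domain rule.
    violations = cleaned[(cleaned["income"] < 0) | (cleaned["age"] < 18)]
    print(cleaned)
    print(violations)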
Data Integration
Data integration is defined as combining heterogeneous data from multiple sources into a common store (the data warehouse). It is carried out with data migration tools, data synchronization tools, and the ETL (Extract, Transform, Load) process.
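A minimal, illustrative ETL sketch in Python/pandas; the two in-memory frames stand in for heterogeneous source files, and all column names are assumptions made for the example.

    import pandas as pd

    # Extract: read from two hypothetical heterogeneous sources.
    sales_csv  = pd.DataFrame({"cust_id": [1, 2], "amount": [100.0, 250.0]})   # stands in for a CSV file
    sales_json = pd.DataFrame({"customer": [3], "amt": [75.0]})                # stands in for a JSON feed

    # Transform: map both sources onto a common schema.
    sales_json = sales_json.rename(columns={"customer": "cust_id", "amt": "amount"})
    unified = pd.concat([sales_csv, sales_json], ignore_index=True)

    # Load: append the unified rows into the warehouse table (here, a placeholder DataFrame).
    warehouse_sales = unified
    print(warehouse_sales)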
Data Selection
Data selection is defined as the process of deciding which data is relevant to the analysis and retrieving it from the data collection. Methods such as neural networks, decision trees, Naive Bayes, clustering, and regression can be used at this stage.
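A small sketch of data selection with pandas: only the task-relevant attributes and rows are retrieved from a hypothetical customer collection (the column names and the churn-analysis scenario are invented).

    import pandas as pd

    customers = pd.DataFrame({
        "cust_id":     [1, 2, 3, 4],
        "region":      ["north", "south", "north", "east"],
        "total_spend": [120.0, 80.0, 300.0, 45.0],
        "phone":       ["555-1", "555-2", "555-3", "555-4"],   # irrelevant to the analysis
    })

    # Select only the attributes relevant to the analysis, and only northern customers.
    task_relevant = customers.loc[customers["region"] == "north", ["cust_id", "total_spend"]]
    print(task_relevant)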
Data Transformation
Data transformation is defined as the process of transforming data into the form required by the mining procedure. It is a two-step process:
1. Data mapping: assigning elements from the source base to the destination to capture transformations.
2. Code generation: creation of the actual transformation program.
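A brief sketch of both steps, assuming hypothetical source and destination field names: the mapping is declared as a dictionary, and the "generated" transformation applies it together with a min-max normalization.

    import pandas as pd

    source = pd.DataFrame({"SAL": [30000, 60000, 90000], "DEPT_CODE": ["A", "B", "A"]})

    # 1. Data mapping: declare how source fields map onto destination fields.
    field_map = {"SAL": "salary", "DEPT_CODE": "department"}

    # 2. Code generation: apply the mapping and the required transformation
    #    (min-max normalization of salary into [0, 1] for the mining step).
    dest = source.rename(columns=field_map)
    dest["salary"] = (dest["salary"] - dest["salary"].min()) / (dest["salary"].max() - dest["salary"].min())
    print(dest)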
Data Mining
Data mining is defined as the application of techniques that extract potentially useful patterns. It transforms the task-relevant data into patterns and decides the purpose of the model, such as classification or characterization.
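A toy data mining step using a scikit-learn decision tree classifier; the feature matrix, labels, and the offer-response task are invented for illustration.

    from sklearn.tree import DecisionTreeClassifier

    # Toy task-relevant data: [age, income] -> will the customer respond to an offer?
    X = [[25, 40000], [47, 95000], [35, 60000], [52, 110000], [23, 30000]]
    y = [0, 1, 1, 1, 0]

    model = DecisionTreeClassifier(max_depth=2, random_state=0)
    model.fit(X, y)

    # The fitted tree is the mined pattern; apply it to new, unseen records.
    print(model.predict([[30, 50000], [50, 100000]]))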
Pattern Evaluation
Pattern evaluation is defined as identifying the truly interesting patterns that represent knowledge, based on given interestingness measures. It finds an interestingness score for each pattern and uses summarization and visualization to make the results understandable to the user.
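One common way to score interestingness is the support and confidence of an association rule; the short sketch below computes both for a hypothetical rule {milk} -> {bread} over invented transactions.

    # Hypothetical market-basket transactions.
    transactions = [
        {"milk", "bread"},
        {"milk", "bread", "butter"},
        {"bread"},
        {"milk", "butter"},
    ]

    def support(itemset):
        # Fraction of transactions containing every item in the itemset.
        return sum(itemset <= t for t in transactions) / len(transactions)

    # Evaluate the rule {milk} -> {bread}.
    rule_support = support({"milk", "bread"})
    confidence = rule_support / support({"milk"})
    print(f"support={rule_support:.2f}, confidence={confidence:.2f}")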
Knowledge Representation
This involves presenting the results in a way that is meaningful and can be used to make decisions.
Advantages of KDD
1. Improves decision-making: KDD provides valuable insights and knowledge that can help organizations make better decisions.
2. Increased efficiency: KDD automates repetitive and time-consuming tasks and makes the data ready for analysis, which saves time and money.
3. Better customer service: KDD helps organizations gain a better understanding of their customers' needs and preferences, which can help them provide better customer service.
4. Fraud detection: KDD can be used to detect fraudulent activities by identifying patterns and anomalies in the data that may indicate fraud.
5. Predictive modeling: KDD can be used to build predictive models that can forecast future trends and patterns.
Disadvantages of KDD
1. Privacy concerns: KDD can raise privacy concerns, as it involves collecting and analyzing large amounts of data, which can include sensitive information about individuals.
2. Complexity: KDD can be a complex process that requires specialized skills and knowledge to implement and to interpret the results.
3. Unintended consequences: KDD can lead to unintended consequences, such as bias or discrimination, if the data or models are not properly understood or used.
4. Data quality: the KDD process depends heavily on the quality of the data; if the data is not accurate or consistent, the results can be misleading.
5. High cost: KDD can be an expensive process, requiring significant investments in hardware, software, and personnel.
6. Overfitting: the KDD process can lead to overfitting, a common problem in machine learning where a model learns the detail and noise in the training data to the extent that it negatively impacts its performance on new, unseen data (see the sketch after this list).
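A hedged illustration of point 6: on purely random (invented) data, an unrestricted decision tree memorizes the training set, so a large gap between training and test accuracy signals overfitting.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))          # random features
    y = rng.integers(0, 2, size=200)       # random labels: nothing real to learn

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

    model = DecisionTreeClassifier(random_state=0)   # unrestricted depth memorizes the noise
    model.fit(X_tr, y_tr)

    # A large gap between training and test accuracy signals overfitting.
    print("train accuracy:", model.score(X_tr, y_tr))
    print("test accuracy:",  model.score(X_te, y_te))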
Parameter | KDD | Data Mining
Definition | KDD refers to a process of identifying valid, novel, potentially useful, and ultimately understandable patterns and relationships in data. | Data mining refers to a process of extracting useful and valuable information or patterns from large data sets.
Objective | To find useful knowledge from data. | To extract useful information from data.
Techniques used | Data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation and visualization. | Association rules, classification, clustering, regression, decision trees, neural networks, and dimensionality reduction.
Output | Structured information, such as rules and models, that can be used to make decisions or predictions. | Patterns, associations, or insights that can be used to improve decision-making or understanding.
Focus | Focus is on the discovery of useful knowledge, rather than simply finding patterns in data. | Focus is on the discovery of patterns or relationships in data.
Role of domain expertise | Domain expertise is important in KDD, as it helps in defining the goals of the process, choosing appropriate data, and interpreting the results. | Domain expertise is less critical in data mining, as the algorithms are designed to identify patterns without relying on prior knowledge.
Data Warehouse
According to William H. Inmon, a leading architect in the construction of data warehouse systems, a data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process. A data warehouse stores a huge amount of data, typically collected from multiple heterogeneous sources such as files and DBMSs. The goal is to produce statistical results that can help in decision-making.
For example, a college might want quick results across different measures, such as how the placement of CS students has improved over the last 10 years in terms of salaries, counts, etc.
The four keywords—subject-oriented, integrated, time-variant, and nonvolatile—distinguish data
warehouses from other data repository systems, such as relational database systems, transaction
processing systems, and file systems.
Subject-oriented: A data warehouse is organized around major subjects such as customer, supplier, product, and sales. Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers.
Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and online transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, and attribute measures.
Time-variant: Data are stored to provide information from a historical perspective (e.g., the past 5-10 years). Every key structure in the data warehouse contains, implicitly or explicitly, a time element.
Nonvolatile: A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. Due to this separation, a data warehouse does not require transaction processing, recovery, and concurrency control mechanisms. It usually requires only two operations in data accessing: initial loading of data and access of data.
Data Warehouse Design Process
A data warehouse can be built using a top-down approach, a bottom-up approach, or a combination
of both.
The top-down approach starts with the overall design and planning. It is useful in cases where the
technology is mature and well known, and where the business problems that must be solved are clear
and well understood.
The bottom-up approach starts with experiments and prototypes. This is useful in the early stage of
business modeling and technology development. It allows an organization to move forward at
considerably less expense and to evaluate the benefits of the technology before making significant
commitments.
In the combined approach, an organization can exploit the planned and strategic nature of the top-down
approach while retaining the rapid implementation and opportunistic application of the bottom-up
approach.
The warehouse design process consists of the following steps:
1. Choose a business process to model, for example, orders, invoices, shipments, inventory, account administration, sales, or the general ledger. If the business process is organizational and involves multiple complex object collections, a data warehouse model should be followed. However, if the process is departmental and focuses on the analysis of one kind of business process, a data mart model should be chosen.
2. Choose the grain of the business process. The grain is the fundamental, atomic level of data to be represented in the fact table for this process, for example, individual transactions, individual daily snapshots, and so on.
3. Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item, customer, supplier, warehouse, transaction type, and status.
4. Choose the measures that will populate each fact table record. Typical measures are numeric additive quantities like dollars sold and units sold.
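A minimal sketch of these design choices using pandas: a sales fact table at the grain of one row per transaction, time and item dimension tables, and additive measures rolled up by dimension; every table and column name here is illustrative.

    import pandas as pd

    # Dimension tables (time dimension declared for completeness).
    dim_time = pd.DataFrame({"time_key": [1, 2], "date": ["2024-01-01", "2024-01-02"]})
    dim_item = pd.DataFrame({"item_key": [10, 11], "item_name": ["pen", "notebook"]})

    # Fact table: grain = one row per individual sales transaction.
    fact_sales = pd.DataFrame({
        "time_key":     [1, 1, 2],
        "item_key":     [10, 11, 10],
        "dollars_sold": [5.0, 12.0, 7.5],   # additive measures
        "units_sold":   [5, 3, 6],
    })

    # Typical warehouse query: total measures by item (roll-up over time).
    report = (fact_sales.merge(dim_item, on="item_key")
                        .groupby("item_name")[["dollars_sold", "units_sold"]]
                        .sum())
    print(report)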
Operational Database | Data Warehouse
Operational systems are designed to support high-volume transaction processing. | Data warehousing systems are typically designed to support high-volume analytical processing (i.e., OLAP).
Operational systems are usually concerned with current data. | Data warehousing systems are usually concerned with historical data.
Data within operational systems are updated regularly according to need. | Non-volatile: new data may be added regularly, but once added it is rarely changed.
It is designed for real-time business dealings and processes. | It is designed for analysis of business measures by subject area, categories, and attributes.
It is optimized for a simple set of transactions, generally adding or retrieving a single row at a time per table. | It is optimized for bulk loads and large, complex, unpredictable queries that access many rows per table.
It is optimized for validation of incoming information during transactions and uses validation data tables. | It is loaded with consistent, valid information and requires no real-time validation.
It supports thousands of concurrent clients. | It supports a few concurrent clients relative to OLTP.
Operational systems are widely process-oriented. | Data warehousing systems are widely subject-oriented.
Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of data. | Data warehousing systems are usually optimized to perform fast retrievals of relatively large volumes of data.