Unit 3 Data Warehousing and Data Mining
Data mining
Definition and functionalities
Data mining functions are used to specify the kinds of patterns, trends, or correlations to be found in data mining tasks. Broadly, data mining activities can be divided into two categories:
1. Descriptive Data Mining:
This category of data mining is concerned with finding patterns and relationships in the data
that can provide insight into the underlying structure of the data. Descriptive data mining is
often used to summarize or explore the data.
Cluster analysis:
This technique is used to identify groups of data points that share similar characteristics.
Clustering can be used for segmentation, anomaly detection, and summarization.
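As a rough illustration of clustering, the sketch below groups a handful of invented customer records with k-means; scikit-learn is assumed here only as a convenient tool, and the data and the number of clusters are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: annual spend vs. number of store visits for a few customers (invented).
X = np.array([[200, 2], [220, 3], [800, 15], [780, 14], [50, 1], [60, 1]])

# Group the customers into 2 clusters of similar behaviour.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster assigned to each customer
print(kmeans.cluster_centers_)  # centre (mean) of each cluster
```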
Association rule mining:
This technique is used to identify relationships between variables in the data. It can be used to
discover co-occurring events or to identify patterns in transaction data.
Visualization:
This technique is used to represent the data in a visual format that can help users to identify
patterns or trends that may not be apparent in the raw data.
2. Predictive Data Mining: This category of data mining is concerned with developing
models that can predict future behaviour or outcomes based on historical data. Predictive data
mining is often used for classification or regression tasks.
Decision trees: This technique is used to create a model that can predict the value of a target
variable based on the values of several input variables. Decision trees are often used for
classification tasks.
Neural networks: This technique is used to create a model that can learn to recognize
patterns in the data. Neural networks are often used for image recognition, speech
recognition, and natural language processing.
Regression analysis: This technique is used to create a model that can predict the value of a
target variable based on the values of several input variables. Regression analysis is often
used for prediction tasks.
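As a small illustration of regression, the sketch below fits a straight line to invented advertising-spend and sales figures; the numbers and the use of scikit-learn are assumptions made only for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: advertising spend (input) and sales (target).
X = np.array([[10], [20], [30], [40], [50]])
y = np.array([25, 45, 62, 88, 105])

# Fit a linear model y = a*x + b to the historical data.
model = LinearRegression().fit(X, y)

# Predict sales for an unseen spend value.
print(model.predict([[35]]))
print(model.coef_, model.intercept_)
```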
Both descriptive and predictive data mining techniques are important for gaining insights
and making better decisions. Descriptive data mining can be used to explore the data and
identify patterns, while predictive data mining can be used to make predictions based on
those patterns. Together, these techniques can help organizations to understand their data and
make informed decisions based on that understanding.
Data Mining Functionality:
1. Class/Concept Descriptions: Data can be associated with classes or concepts. It is often useful to describe individual classes and concepts in summarized, concise, and yet precise terms. Such descriptions of a class or a concept are called class/concept descriptions.
Data Characterization: This refers to the summary of the general characteristics or features of the class under study. The output of data characterization can be presented in various forms, including pie charts, bar charts, curves, and multidimensional data cubes.
Example: to study the characteristics of software products whose sales increased by 10% in the previous year, or to summarize the characteristics of customers who spend more than $5000 a year at All Electronics. The result is a general profile of those customers, such as that they are 40-50 years old, employed, and have excellent credit ratings.
Data Discrimination: This compares the general features of the class under study (the target class) against the general features of objects from one or more contrasting classes.
Example: we may want to compare two groups of customers: those who shop for computer products regularly and those who rarely shop for such products (less than three times a year). The resulting description provides a general comparative profile of those customers, such as: 80% of the customers who frequently purchase computer products are between 20 and 40 years old and have a university degree, while 60% of the customers who infrequently buy such products are either seniors or youths and have no university degree.
2. Mining Frequent Patterns, Associations, and Correlations: Frequent patterns are simply patterns that occur most often in the data. Different kinds of frequent patterns can be observed in a dataset.
Frequent itemset: This refers to a set of items that frequently appear together, e.g., milk and sugar.
Frequent Subsequence: This refers to a sequence of patterns that occurs frequently, such as purchasing a phone followed by a back cover.
Frequent Substructure: It refers to the different kinds of data structures such as trees
and graphs that may be combined with the itemset or subsequence.
Association Analysis: The process involves uncovering the relationship between data and
deciding the rules of the association. It is a way of discovering the relationship between
various items.
Example: Suppose we want to know which items are frequently purchased together. An
example for such a rule mined from a transactional database is,
buys (X, “computer”) ⇒ buys (X, “software”) [support = 1%, confidence = 50%],
where X is a variable representing a customer.
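The support and confidence of such a rule can be computed directly from transaction counts. The sketch below does this for a small invented set of baskets; the transactions are not taken from any real dataset.

```python
# Toy transactions; each set holds the items bought together in one basket.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"software", "printer"},
    {"computer", "software", "printer"},
]

antecedent, consequent = {"computer"}, {"software"}

n_total = len(transactions)
n_antecedent = sum(1 for t in transactions if antecedent <= t)
n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)

support = n_both / n_total          # fraction of all transactions containing both items
confidence = n_both / n_antecedent  # of those buying a computer, fraction also buying software

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```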
Data Preprocessing
Data preprocessing is an important step in the data mining process. It refers to the cleaning,
transforming, and integrating of data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it more suitable for the
specific data mining task.
Steps of Data Preprocessing
Data preprocessing is an important step in the data mining process that involves cleaning and
transforming raw data to make it suitable for analysis. Some common steps in data
preprocessing include:
1. Data Cleaning: This involves identifying and correcting errors or inconsistencies in
the data, such as missing values, outliers, and duplicates. Various techniques can be
used for data cleaning, such as imputation, removal, and transformation.
2. Data Integration: This involves combining data from multiple sources to create a
unified dataset. Data integration can be challenging as it requires handling data with
different formats, structures, and semantics. Techniques such as record linkage and
data fusion can be used for data integration.
3. Data Transformation: This involves converting the data into a suitable format for
analysis. Common techniques used in data transformation include normalization,
standardization, and discretization. Normalization is used to scale the data to a
common range, while standardization is used to transform the data to have zero mean
and unit variance. Discretization is used to convert continuous data into discrete
categories.
4. Data Reduction: This involves reducing the size of the dataset while preserving the
important information. Data reduction can be achieved through techniques such as
feature selection and feature extraction. Feature selection involves selecting a subset
of relevant features from the dataset, while feature extraction involves transforming
the data into a lower-dimensional space while preserving the important information.
5. Data Discretization: This involves dividing continuous data into discrete categories
or intervals. Discretization is often used in data mining and machine learning
algorithms that require categorical data. Discretization can be achieved through
techniques such as equal width binning, equal frequency binning, and clustering.
6. Data Normalization: This involves scaling the data to a common range, such as
between 0 and 1 or -1 and 1. Normalization is often used to handle data with different
units and scales. Common normalization techniques include min-max normalization,
z-score normalization, and decimal scaling.
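As an illustration of the normalization and discretization steps above, the sketch below rescales and bins an invented numeric attribute using scikit-learn's preprocessing utilities; the data and the library choice are assumptions made for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, KBinsDiscretizer

# Invented ages of ten customers, as a single numeric feature.
ages = np.array([[18], [22], [25], [31], [38], [42], [47], [55], [61], [70]])

# Min-max normalization: rescale the values to the range [0, 1].
minmax = MinMaxScaler().fit_transform(ages)

# Z-score standardization: transform to zero mean and unit variance.
zscore = StandardScaler().fit_transform(ages)

# Equal-width discretization into 3 ordinal bins (e.g., young / middle-aged / senior).
bins = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform").fit_transform(ages)

print(minmax.ravel())
print(zscore.ravel())
print(bins.ravel())
```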
Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of the
analysis results. The specific steps involved in data preprocessing may vary depending on the
nature of the data and the analysis goals.
By performing these steps, the data mining process becomes more efficient and the results
become more accurate.
Preprocessing in Data Mining
Data preprocessing is a data mining technique that is used to transform raw data into a useful and efficient format.
1. Data Cleaning: The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It involves handling of missing data, noisy data, etc.
Missing Data: This situation arises when some values are missing from the data. It can be handled in various ways.
Some of them are:
o Ignore the tuples: This approach is suitable only when the dataset we have is
quite large and multiple values are missing within a tuple.
o Fill the Missing values: There are various ways to do this task. One can choose to fill the missing values manually, with the attribute mean, or with the most probable value.
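A minimal pandas sketch of these two options, ignoring tuples versus filling by the attribute mean, might look as follows (the small table is invented):

```python
import pandas as pd

# Invented customer records with some missing values (NaN).
df = pd.DataFrame({
    "age":    [25, None, 41, 35, None],
    "income": [30000, 42000, None, 51000, 28000],
})

# Option 1: ignore (drop) tuples that contain missing values.
dropped = df.dropna()

# Option 2: fill each missing value with that attribute's mean.
filled = df.fillna(df.mean())

print(dropped)
print(filled)
```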
Noisy Data: Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:
o Binning Method: This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and each segment is handled separately. All values in a segment can be replaced by the segment's mean, or the segment's boundary values can be used to smooth them (a short sketch of smoothing by bin means appears after this list).
o Regression: Here data can be made smooth by fitting it to a regression
function. The regression used may be linear (having one independent variable)
or multiple (having multiple independent variables).
o Clustering: This approach groups similar data into clusters. Values that fall outside the clusters can be treated as outliers, although some outliers may still go undetected.
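The sketch below, referred to in the binning method above, smooths a small sorted list of invented values by replacing each equal-size segment with its mean:

```python
import numpy as np

# Sorted, noisy price values (invented).
prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 28])

# Split the sorted data into 3 equal-size segments (bins).
bins = np.array_split(prices, 3)

# Smoothing by bin means: replace every value in a bin with that bin's mean.
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])

print(smoothed)  # approximately [7, 7, 7, 19, 19, 19, 25.67, 25.67, 25.67]
```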
2. Data Transformation: This step is taken in order to transform the data into forms appropriate for the mining process. This involves the following ways:
Normalization: It is done in order to scale the data values into a specified range (such as -1.0 to 1.0 or 0.0 to 1.0).
Attribute Selection: In this strategy, new attributes are constructed from the given set
of attributes to help the mining process.
Discretization: This is done to replace the raw values of a numeric attribute with interval levels or conceptual levels.
Concept Hierarchy Generation: Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute “city” can be converted to “country”.
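A toy sketch of such a concept hierarchy, generalizing the attribute "city" to "country" with a hypothetical lookup table:

```python
# Hypothetical city -> country hierarchy.
city_to_country = {
    "Kathmandu": "Nepal",
    "Pokhara": "Nepal",
    "Delhi": "India",
    "Mumbai": "India",
}

cities = ["Pokhara", "Delhi", "Kathmandu"]

# Generalize each record from the "city" level up to the "country" level.
countries = [city_to_country[c] for c in cities]
print(countries)  # ['Nepal', 'India', 'Nepal']
```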
3. Data Reduction: Data reduction is a crucial step in the data mining process that involves
reducing the size of the dataset while preserving the important information. This is done to
improve the efficiency of data analysis and to avoid overfitting of the model.
Feature Selection: This involves selecting a subset of relevant features from the
dataset. Feature selection is often performed to remove irrelevant or redundant
features from the dataset. It can be done using various techniques such as correlation
analysis, mutual information, and principal component analysis (PCA).
Feature Extraction: This involves transforming the data into a lower-dimensional
space while preserving the important information. Feature extraction is often used
when the original features are high-dimensional and complex. It can be done using
techniques such as PCA, linear discriminant analysis (LDA), and non-negative matrix
factorization (NMF).
Sampling: This involves selecting a subset of data points from the dataset. Sampling
is often used to reduce the size of the dataset while preserving the important
information. It can be done using techniques such as random sampling, stratified
sampling, and systematic sampling.
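The sketch below illustrates two of the reduction techniques above, feature extraction with PCA and simple random sampling, on invented data; scikit-learn and NumPy are assumed only as convenient tools.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Invented dataset: 100 records with 5 numeric features.
X = rng.normal(size=(100, 5))

# Feature extraction: project the data onto 2 principal components.
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)  # (100, 2)

# Simple random sampling: keep 20 of the 100 records, without replacement.
sample_idx = rng.choice(len(X), size=20, replace=False)
X_sample = X[sample_idx]
print(X_sample.shape)   # (20, 5)
```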
Decision Tree
A decision tree is a flowchart-like structure used to make decisions or predictions. It consists
of nodes representing decisions or tests on attributes, branches representing the outcome of
these decisions, and leaf nodes representing final outcomes or predictions. Each internal node
corresponds to a test on an attribute, each branch corresponds to the result of the test, and
each leaf node corresponds to a class label or a continuous value.
Structure of a Decision Tree
1. Root Node: Represents the entire dataset and the initial decision to be made.
2. Internal Nodes: Represent decisions or tests on attributes. Each internal node has one
or more branches.
3. Branches: Represent the outcome of a decision or test, leading to another node.
4. Leaf Nodes: Represent the final decision or prediction. No further splits occur at these
nodes.
How Decision Trees Work
The process of creating a decision tree involves:
1. Selecting the Best Attribute: Using a metric like Gini impurity, entropy, or
information gain, the best attribute to split the data is selected.
2. Splitting the Dataset: The dataset is split into subsets based on the selected attribute.
3. Repeating the Process: The process is repeated recursively for each subset, creating a
new internal node or leaf node until a stopping criterion is met (e.g., all instances in a
node belong to the same class or a predefined depth is reached).
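A minimal sketch of this process using scikit-learn's DecisionTreeClassifier, trained on an invented customer table; the entropy criterion is chosen here only to match the information-gain discussion that follows.

```python
from sklearn.tree import DecisionTreeClassifier

# Invented training data: [age, income in thousands] and whether the customer buys a computer.
X = [[25, 30], [35, 60], [45, 80], [20, 20], [50, 90], [30, 40]]
y = ["no", "yes", "yes", "no", "yes", "no"]

# Build the tree by recursively choosing the split that maximizes information gain.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

# Predict the class of a new, unseen customer.
print(tree.predict([[40, 70]]))
```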
Metrics for Splitting
Gini Impurity: Measures the likelihood of an incorrect classification of a new
instance if it was randomly classified according to the distribution of classes in the
dataset.
o Gini = 1 − Σ_{i=1}^{n} (p_i)^2, where p_i is the probability of an instance being classified into a particular class.
Entropy: Measures the amount of uncertainty or impurity in the dataset.
o Entropy = − Σ_{i=1}^{n} p_i log2(p_i), where p_i is the probability of an instance being classified into a particular class.
Information Gain: Measures the reduction in entropy or Gini impurity after a dataset
is split on an attribute.
o Information Gain = Entropy(parent) − Σ_j (|D_j| / |D|) × Entropy(D_j), i.e., the entropy of the parent node minus the weighted average entropy of the child nodes produced by the split.
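These metrics are simple enough to compute by hand; the sketch below evaluates Gini, entropy, and information gain for a made-up binary split.

```python
import numpy as np

def gini(labels):
    # Gini = 1 - sum(p_i^2) over the class probabilities p_i.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy = -sum(p_i * log2(p_i)) over the class probabilities p_i.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    # Information gain = entropy(parent) - weighted average entropy of the children.
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# Made-up example: a split that separates the two classes fairly well.
parent = ["yes", "yes", "yes", "no", "no", "no"]
left, right = ["yes", "yes", "yes", "no"], ["no", "no"]

print(round(gini(parent), 3))                              # 0.5
print(round(entropy(parent), 3))                           # 1.0
print(round(information_gain(parent, [left, right]), 3))   # ~0.459
```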