DT - Missing Values
The lecture discusses the issue of missing values in data and how they can arise in various scenarios.
Here are the key points extracted and explained:
Definition:
Missing values occur when some data points are not recorded or are unavailable. This can
happen in various situations, such as surveys, medical records, or other data collection
processes.
No Response: For example, in a survey, a respondent may skip certain questions. This results
in missing data for those specific questions.
Noise Removal: Sometimes, data might be intentionally removed or corrected due to errors
or outliers. For example, if a medical record contains an unusually high temperature reading
that is clearly incorrect, it might be removed or corrected, leading to missing values for that
attribute.
Recording Errors: Data might be missing due to recording errors or malfunctions in the data
collection process. For instance, if a nurse forgets to record a patient’s temperature or if the
recording equipment fails, this results in missing values.
Data Incompleteness: Missing values can affect the construction of decision trees. When
building a decision tree, missing values for certain attributes can lead to incomplete data
splits, which might impact the accuracy and reliability of the model.
o Ignoring Missing Values: In some cases, algorithms can handle missing values by
ignoring them during the split evaluations or by using other attributes to make
decisions.
o Using Surrogates: Decision trees can use surrogate splits, which are alternative splits
based on other attributes when the primary attribute is missing.
Practical Considerations:
Quality of Data: Missing values can reduce the quality of the data used for building the
decision tree. Ensuring that missing data is properly handled or minimized is crucial for
building a reliable model.
In a survey, respondents may not answer all questions, leading to missing values. This can
impact the ability to accurately predict outcomes based on the survey data.
In medical records, missing values may result from recording errors or data loss. For instance,
if a patient's temperature is not recorded due to a malfunction, it can affect the analysis of
the patient’s condition.
Malfunctions in data collection equipment or errors in manual recording can also lead to
missing values. For example, if a data collection tool fails or if data entry is incomplete, it
results in missing information.
Summary
1. Missing Values:
Types: Missing values can arise due to no response, noise removal, recording errors, or
malfunctions in data collection.
Impact: Missing values affect the quality and completeness of data used in decision trees,
influencing the accuracy and interpretability of the model.
2. Handling Strategies:
Ignoring Missing Values: Algorithms may ignore missing values during split evaluations.
Using Surrogates: Surrogate splits based on other attributes can stand in when the primary attribute is missing.
3. Practical Considerations:
Data Quality: Proper handling of missing values is crucial for building reliable decision trees.
Interpretability: Missing values can complicate decision trees, affecting their interpretability
and usability.
By understanding and addressing missing values, you can improve the accuracy and effectiveness of
decision trees and other machine learning models.
The lecture covers how to handle missing values in data, particularly in the context of decision trees,
and introduces various techniques for dealing with this issue. Here are the key points explained in
detail:
Sensor Failures: Sensors or data collection tools may fail intermittently, causing missing data
during those periods.
Recording Errors: Human errors or equipment malfunctions can lead to missing data. For
example, a sensor might overheat and stop recording temporarily.
Data Entry Issues: In surveys or data collection forms, respondents might skip questions,
leading to missing values.
If an attribute is missing in a significant portion of data points (e.g., more than 80%), it may
be practical to remove the attribute from the dataset. This is because the attribute provides
limited information and is not useful for building a reliable model.
When missing values are present in a smaller percentage of data points (e.g., 5-10%), you
generally do not want to remove the attribute or the data points entirely. Removing a
substantial amount of data can lead to loss of valuable information and affect the model’s
performance.
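As a quick illustration of these two rules of thumb, here is a minimal pandas sketch; the dataset, column names, and the exact threshold comparison are illustrative assumptions, not from the lecture:

```python
import pandas as pd

# Illustrative data: 'temperature' is missing in 5 of 6 rows (~83%),
# while 'heart_rate' is missing in only 1 of 6 (~17%).
df = pd.DataFrame({
    "temperature": [37.0, None, None, None, None, None],
    "heart_rate":  [72, 80, None, 65, 90, 75],
})

threshold = 0.8
missing_rate = df.isna().mean()  # fraction of missing values per column
df = df.drop(columns=missing_rate[missing_rate > threshold].index)

print(df.columns.tolist())  # ['heart_rate'] -- the sparse attribute is removed
```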
Imputation Techniques
Definition:
Imputation involves filling in missing values with estimated or calculated values based on the
available data. This helps to maintain the dataset’s completeness and usability.
Simple Imputation:
Mean Imputation: Replace missing values with the mean value of the attribute from the
available data. This is straightforward but may not account for variations in different classes
or groups within the data.
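A minimal sketch of mean imputation, assuming pandas and a hypothetical 'temperature' column:

```python
import pandas as pd

# Hypothetical column: replace missing entries with the overall mean.
df = pd.DataFrame({"temperature": [36.5, 38.2, None, 37.0, None]})

df["temperature"] = df["temperature"].fillna(df["temperature"].mean())
print(df["temperature"].tolist())  # gaps filled with ~37.23 (mean of 36.5, 38.2, 37.0)
```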
Conditional Imputation:
Class-Conditioned Imputation: Instead of using the overall mean, you can use class-specific
information to impute missing values. This involves predicting the missing attribute value
based on the class of the data point. For instance:
o Example: If you have a dataset where some attributes are missing for a subset of
data points belonging to a specific class (e.g., class 1), you can use the data points
within that class to predict the missing values.
o Process: Use only the data points from class 1 (where the attribute is not missing) to
estimate the missing attribute values. This approach considers the distribution of the
attribute within the specific class, leading to more accurate imputation.
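A minimal sketch of class-conditioned mean imputation, assuming pandas and hypothetical 'label'/'temperature' columns; groupby(...).transform broadcasts each class mean back to its own rows:

```python
import pandas as pd

# Hypothetical data: impute 'temperature' with the mean within each class,
# so class-specific patterns are preserved.
df = pd.DataFrame({
    "label":       [1, 1, 1, 2, 2, 2],
    "temperature": [38.0, None, 38.4, 36.5, 36.7, None],
})

class_mean = df.groupby("label")["temperature"].transform("mean")
df["temperature"] = df["temperature"].fillna(class_mean)

print(df["temperature"].tolist())
# class 1 gap -> 38.2 (mean of 38.0, 38.4); class 2 gap -> 36.6
```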
Decision trees can handle missing values in specific ways, such as using surrogate splits.
However, proper imputation can improve the model’s performance and accuracy by ensuring
that missing values do not lead to incomplete or biased decision-making.
Practical Tips:
Attribute Removal: If an attribute has a high rate of missing values, consider removing it to
simplify the model and avoid inaccuracies.
Data Point Removal: Avoid removing data points with missing values unless it significantly
affects the dataset. Instead, use imputation to retain valuable information.
Summary
1. High Missing Rate: Remove the attribute if more than 80% of its values are missing.
2. Imputation Techniques: Fill in missing values, for example with the attribute mean or with class-conditioned estimates, rather than discarding data.
3. Practical Considerations:
Removing attributes or data points should be done cautiously to avoid losing important
information.
Imputation helps maintain dataset completeness and enhances model reliability, particularly
in decision trees.
By understanding and applying these techniques, you can effectively manage missing values and
improve the quality of your decision tree models and other machine learning algorithms.
The lecture discusses various methods for handling missing values in datasets, particularly in the
context of decision trees and other machine learning models. Here are the key points explained in
detail:
1. Class-Conditioned Imputation
Concept:
When imputing missing values, it's beneficial to use information from the same class as the
missing data. This approach helps preserve correlations between the missing attribute and
the class, which might otherwise be lost if imputation is done across the entire dataset.
Avoiding Polluted Data: If you use data from all classes, any correlations between the
attribute and specific classes may be diluted, leading to less accurate imputation.
Example:
Suppose you have a dataset with 100,000 data points and a specific attribute is missing for a
subset. If you use only the data points from the same class to predict the missing values, the
imputation will better reflect the relationships within that class.
2. Imputation Techniques
Mean Imputation:
Replace missing values with the mean of the attribute from the available data. This is a
simple method but may not be the most accurate if there are class-specific patterns.
Regression Imputation:
Use regression models to predict missing values based on other attributes. For class-conditioned imputation, you would perform regression within each class separately. This method helps account for variations in different classes and maintains relevant correlations.
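One possible way to implement per-class regression imputation with scikit-learn; the columns 'label', 'x1', 'x2', and the choice of a linear model are illustrative assumptions:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical sketch: predict the missing attribute 'x2' from 'x1',
# fitting one regression per class so class-specific correlations survive.
df = pd.DataFrame({
    "label": [1, 1, 1, 1, 2, 2, 2, 2],
    "x1":    [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    "x2":    [2.1, 4.0, None, 8.1, 9.0, 8.0, None, 6.1],
})

for cls, group in df.groupby("label"):
    known = group.dropna(subset=["x2"])      # rows with x2 observed
    missing = group[group["x2"].isna()]      # rows needing imputation
    if missing.empty:
        continue
    model = LinearRegression().fit(known[["x1"]], known["x2"])
    df.loc[missing.index, "x2"] = model.predict(missing[["x1"]])

print(df["x2"].round(2).tolist())  # each gap filled from its own class's trend
```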
Multiple Imputation:
Concept: Rather than committing to a single imputed value, multiple imputation generates several plausible values for each missing entry and analyzes the resulting datasets together, capturing the uncertainty of the imputation.
Process:
o Fit a model to predict the missing attribute using known data. Use this model to
generate multiple possible values for the missing data based on the distribution.
o Create multiple datasets with different imputed values and analyze them to account
for the uncertainty in the imputation.
Benefits:
o Reduces the variance of the imputation and can provide more accurate estimates
compared to single imputation methods.
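One way to realize this with scikit-learn is IterativeImputer with sample_posterior=True, which draws a different plausible completion per random seed; this is a sketch of the general idea, not necessarily the lecture's exact procedure:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data with missing entries in the second column.
X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.1],
              [4.0, np.nan],
              [5.0, 9.9]])

completed = []
for seed in range(5):  # five imputed datasets
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed.append(imputer.fit_transform(X))

# Example of pooling: average an estimate (here, the column mean) across runs.
estimates = [Xc[:, 1].mean() for Xc in completed]
print(np.mean(estimates), np.std(estimates))  # pooled estimate and its spread
```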
3. Special Value for Missing Data
Concept:
Instead of imputing missing values, you can introduce a new value indicating that data is
missing. This approach acknowledges that the missingness itself might carry useful
information.
Systematic Missingness: The fact that data is missing could be due to a specific, systematic
reason. For example, a survey question might be skipped intentionally, and this behavior
might be predictive of an outcome.
Practical Use: Including a special value for missing data can help in identifying patterns
related to why data might be missing and how it correlates with other attributes.
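A minimal sketch of this idea, assuming pandas, a hypothetical 'income' survey column, and an arbitrary sentinel of -1:

```python
import pandas as pd

# Keep the missingness itself as a feature instead of imputing it away,
# since skipping a question may itself be predictive.
df = pd.DataFrame({"income": [52000, None, 61000, None]})

df["income_missing"] = df["income"].isna().astype(int)  # indicator feature
df["income"] = df["income"].fillna(-1)                   # sentinel value

print(df)
```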
Summary
Mean Imputation: Simple, but can blur class-specific patterns.
Regression Imputation: Predicts missing values based on other attributes, useful for preserving correlations.
Multiple Imputation: Generates several plausible completions to capture the uncertainty of the imputation.
Special Value for Missing Data: Recognizes that missing data might have inherent meaning and can provide insights into systematic issues.
By using these imputation techniques and approaches, you can effectively handle missing values in
your dataset, ensuring that the imputation process enhances the performance of your decision tree
models and other machine learning algorithms.
The lecture discusses advanced techniques for handling missing values in decision trees, focusing on
surrogate splits and how they can be used during both training and testing. Here are the key points
explained in detail:
1. Surrogate Splits
Concept:
Definition: Surrogate splits are a method used to handle missing attribute values by
identifying alternative attributes that split the data in a similar way to the primary attribute
used in the decision tree. This helps in making predictions even when the primary attribute is
missing.
How It Works:
Training Phase:
o During the training phase, when you build a decision tree, you split the data based
on a primary attribute. For each attribute, you identify a surrogate attribute that
provides similar splits to the primary attribute.
o For example, if the primary split is based on attribute 3 and you have another
attribute, attribute 4, that tends to produce similar splits, attribute 4 can serve as a
surrogate for attribute 3.
Testing Phase:
o When a new data point arrives and the primary attribute is missing, the decision tree
uses the surrogate attribute to decide the split. This allows the tree to make a
prediction even when some attribute values are missing.
o Example: Suppose you split your data into two groups based on attribute 3, with
70,000 data points in one group and 30,000 in the other. If attribute 4 splits the data
into groups of 68,000 and 32,000 points, and these splits overlap significantly with
those of attribute 3, then attribute 4 can be used as a surrogate.
Benefits:
Maintains Accuracy: Surrogate splits help maintain the accuracy of the decision tree when
dealing with missing values by leveraging attributes that provide similar information.
Improves Robustness: It makes the model more robust and reliable, as it doesn't rely solely
on the primary attribute for decisions.
Practical Use:
Example Scenario: If you have a dataset where attribute 3 is missing for a certain data point,
you use the surrogate attribute 4 to decide the split, acting as if you had the information
from attribute 3.
Process:
o Determine how well different attributes (surrogates) correlate with the primary
attribute.
o Use these surrogates to handle missing values during the decision-making process.
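A hedged sketch of the surrogate search on synthetic data: for each candidate attribute, it measures how often a threshold split sends points to the same side as the primary split, mirroring the overlap comparison in the example above (all data and thresholds are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x3 = rng.normal(size=n)
x4 = x3 + 0.3 * rng.normal(size=n)   # correlated attribute -> good surrogate
x5 = rng.normal(size=n)              # unrelated attribute  -> poor surrogate

primary = x3 < 0.0                   # primary split on attribute 3

def best_agreement(candidate, primary):
    """Best fraction of points a threshold split on `candidate`
    sends to the same side as the primary split."""
    best = 0.0
    for t in np.quantile(candidate, np.linspace(0.05, 0.95, 19)):
        agree = np.mean((candidate < t) == primary)
        best = max(best, agree, 1.0 - agree)  # allow the flipped direction
    return best

print(best_agreement(x4, primary))  # high agreement, usable as a surrogate
print(best_agreement(x5, primary))  # near 0.5, no better than chance
```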
2. Missing Value as a Separate Category
Concept:
o If an attribute is missing for a data point, you can treat the missing value as a separate category, or distribute the point using probabilities based on the distribution of non-missing values.
3. Probability-Based Splitting:
o For a given query, such as x3 < 5, if x3 is missing, analyze the data points that do
have x3 to determine the probability of falling into each subset.
o For instance, if 60% of data points with known x3 values go to one subset and 40% to
another, then you distribute the missing data point similarly—60% to one subset and
40% to the other.
Example:
Suppose you have a decision tree that splits on x3 < 5. If x3 is missing for a data point, you
look at the data points where x3 is not missing and determine the proportions of those data
points that go left and right. You then apply these proportions to the data point with the
missing value.
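A minimal numeric sketch of this procedure; the training values and leaf probabilities are made up, but chosen so that the observed split reproduces the 60/40 proportions from the text:

```python
import numpy as np

# Training values of x3 observed at this node (made up; 6 of 10 are < 5).
x3_train = np.array([1.2, 4.7, 6.3, 2.9, 8.1, 3.3, 5.5, 4.1, 7.7, 2.0])

p_left = np.mean(x3_train < 5.0)   # 0.6 of known points satisfy x3 < 5
p_right = 1.0 - p_left             # 0.4 go right

# Hypothetical leaf outputs: P(class 1) = 0.9 on the left, 0.3 on the right.
p_class1 = p_left * 0.9 + p_right * 0.3
print(p_left, p_right, p_class1)   # 0.6 0.4 0.66
```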
4. Summary
Surrogate Splits: Use alternative attributes to make splits when the primary attribute is missing, maintaining the decision tree's accuracy and robustness.
Probability-Based Splitting: Distribute a data point with a missing attribute across both branches in proportion to the observed, non-missing training data.
These methods help ensure that decision trees can handle missing values effectively, providing
robust predictions even when some data points have incomplete information.
The lecture covers advanced techniques for handling missing values in decision trees, focusing
on fragmentation and Expectation Maximization (EM). Here are the key points explained in detail:
1. Fragmentation Method
Concept:
Definition: Fragmentation is a technique used in decision trees to handle missing values by
allowing a data point to travel down multiple paths in the tree and then combining the
predictions from these paths.
How It Works:
o When a data point with missing attributes reaches a decision node in the tree, it may
not be able to make a decision based on the missing attribute.
o Instead of making a single decision, the data point is split according to the
probabilities associated with the missing attribute. For example, if the probability of
traveling down one path is 0.6 and another path is 0.4, the data point is split
accordingly.
Combining Predictions:
o Each path leads to a leaf node, which provides a prediction or probability distribution
over the classes.
o The final prediction is a weighted combination of the predictions from all paths. For
instance, if one path predicts class 1 with 0.6 probability and class 2 with 0.4
probability, and another path predicts class 1 with 0.2 probability and class 2 with
0.8 probability, the final output is a weighted combination of these predictions.
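As a quick numeric check of the worked example above (0.6/0.4 path weights, leaf distributions [0.6, 0.4] and [0.2, 0.8]):

```python
import numpy as np

weights = np.array([0.6, 0.4])       # probability of each traversed path
leaf_preds = np.array([[0.6, 0.4],   # leaf reached via the first path
                       [0.2, 0.8]])  # leaf reached via the second path

final = weights @ leaf_preds         # weighted combination of leaf outputs
print(final)                          # [0.44 0.56] -> predict class 2
```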
Purpose:
o Fragmentation allows the decision tree to handle cases where multiple attributes are
missing by allowing a data point to traverse multiple paths and aggregate
predictions.
Maintaining Information:
o By allowing a data point to travel down multiple paths and combining the results,
fragmentation ensures that the information from different possible splits is
considered, making the decision-making process more robust.
Considerations:
Complexity:
o This method can become complex if a data point has multiple missing attributes, as it
might traverse multiple paths and require aggregation from various leaf nodes.
Use Case:
o Fragmentation is particularly useful when a data point is missing several attributes at once, since each missing attribute simply adds more weighted paths to combine.
2. Expectation Maximization (EM)
Definition: Expectation Maximization (EM) is a statistical technique for handling missing data by iteratively estimating missing values and optimizing model parameters.
How It Works:
o E-Step: Estimate the missing data based on the observed data and the current parameter estimates.
o M-Step: Update the parameters by maximizing the likelihood function using the estimated missing data from the E-Step.
Iterative Process:
o Repeat the E-Step and M-Step until convergence, refining the estimates of missing
data and parameters.
Application:
In Decision Trees:
o EM can be applied to estimate missing values and improve the accuracy of the
decision tree model. This involves using EM to fill in missing attribute values and
then building or refining the decision tree based on the completed dataset.
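A simplified EM-style sketch for a bivariate Gaussian with some x2 values missing; all data is synthetic, and note that full EM would also add the conditional variance of the imputed entries to the covariance update, which this sketch omits for brevity:

```python
import numpy as np

# E-step: replace missing x2 by its conditional mean given x1.
# M-step: re-estimate the mean and covariance from the completed data.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(0.0, 1.0, n)
x2 = 2.0 * x1 + rng.normal(0.0, 0.5, n)
missing = rng.random(n) < 0.3                 # 30% of x2 missing at random
x2_obs = np.where(missing, np.nan, x2)

x2_hat = np.where(missing, np.nanmean(x2_obs), x2_obs)  # crude initialization

for _ in range(20):                           # alternate E- and M-steps
    X = np.column_stack([x1, x2_hat])
    mu = X.mean(axis=0)                       # M-step: mean
    cov = np.cov(X, rowvar=False)             # M-step: covariance
    slope = cov[0, 1] / cov[0, 0]             # E-step: conditional mean of x2
    cond_mean = mu[1] + slope * (x1 - mu[0])
    x2_hat = np.where(missing, cond_mean, x2_obs)

print(mu, cov[0, 1])  # estimates approach the true mean (~0, 0) and covariance (~2.0)
```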
Challenges:
o Complexity: EM is an iterative procedure, so it is computationally more demanding than single-pass imputation and must be run until the estimates converge.
Practical Considerations:
o In practice, EM-based imputation is paired with the tree-building step described above: fill in the missing values with EM, then build or refine the tree on the completed dataset.
Summary
Fragmentation: This method allows a data point with missing attributes to travel down multiple paths in the decision tree, combining predictions from these paths to make a final decision. It is particularly useful for handling cases with multiple missing attributes.
Expectation Maximization (EM): Iteratively alternates between estimating the missing values (E-Step) and re-optimizing the model parameters (M-Step) until convergence.
These techniques are designed to improve the robustness and accuracy of decision trees when
dealing with incomplete or missing data.