
Decision Trees – Missing Values, Imputation, Surrogate Splits

The lecture discusses the issue of missing values in data and how they can arise in various scenarios.
Here are the key points extracted and explained:

1. Understanding Missing Values

 Definition:

 Missing values occur when some data points are not recorded or are unavailable. This can
happen in various situations, such as surveys, medical records, or other data collection
processes.

 Types of Missing Values:

 No Response: For example, in a survey, a respondent may skip certain questions. This results
in missing data for those specific questions.

 Noise Removal: Sometimes, data might be intentionally removed or corrected due to errors
or outliers. For example, if a medical record contains an unusually high temperature reading
that is clearly incorrect, it might be removed or corrected, leading to missing values for that
attribute.

 Recording Errors: Data might be missing due to recording errors or malfunctions in the data
collection process. For instance, if a nurse forgets to record a patient’s temperature or if the
recording equipment fails, this results in missing values.

2. Handling Missing Values in Decision Trees

 Impact on Decision Trees:

 Data Incompleteness: Missing values can affect the construction of decision trees. When
building a decision tree, missing values for certain attributes can lead to incomplete data
splits, which might impact the accuracy and reliability of the model.

 Handling Strategies: There are various strategies to handle missing values (a short library-level sketch follows at the end of this section):

o Ignoring Missing Values: In some cases, algorithms can handle missing values by
ignoring them during the split evaluations or by using other attributes to make
decisions.

o Imputation: Filling in missing values with estimated or calculated values based on other available data.

o Using Surrogates: Decision trees can use surrogate splits, which are alternative splits
based on other attributes when the primary attribute is missing.

 Practical Considerations:
 Quality of Data: Missing values can reduce the quality of the data used for building the
decision tree. Ensuring that missing data is properly handled or minimized is crucial for
building a reliable model.

 Interpretability: Handling missing values properly is important for maintaining the interpretability of the decision tree. If missing values lead to complex and unclear branches, the decision tree may become harder to understand.

3. Examples and Scenarios

 Survey Data Example:

 In a survey, respondents may not answer all questions, leading to missing values. This can
impact the ability to accurately predict outcomes based on the survey data.

 Medical Data Example:

 In medical records, missing values may result from recording errors or data loss. For instance,
if a patient's temperature is not recorded due to a malfunction, it can affect the analysis of
the patient’s condition.

 Data Collection Issues:

 Malfunctions in data collection equipment or errors in manual recording can also lead to
missing values. For example, if a data collection tool fails or if data entry is incomplete, it
results in missing information.

Summary

1. Missing Values:

 Types: Missing values can arise due to no response, noise removal, recording errors, or
malfunctions in data collection.

 Impact: Missing values affect the quality and completeness of data used in decision trees,
influencing the accuracy and interpretability of the model.

2. Handling Strategies:

 Ignoring Missing Values: Algorithms may ignore missing values during splits.

 Imputation: Filling in missing values based on other data.

 Using Surrogates: Alternative splits used when primary data is missing.

3. Practical Considerations:

 Data Quality: Proper handling of missing values is crucial for building reliable decision trees.
 Interpretability: Missing values can complicate decision trees, affecting their interpretability
and usability.

By understanding and addressing missing values, you can improve the accuracy and effectiveness of
decision trees and other machine learning models.

The lecture covers how to handle missing values in data, particularly in the context of decision trees,
and introduces various techniques for dealing with this issue. Here are the key points explained in
detail:

1. Reasons for Missing Values

 Sensor Failures: Sensors or data collection tools may fail intermittently, causing missing data
during those periods.

 Recording Errors: Human errors or equipment malfunctions can lead to missing data. For
example, a sensor might overheat and stop recording temporarily.

 Data Entry Issues: In surveys or data collection forms, respondents might skip questions,
leading to missing values.

2. Handling Missing Values in Decision Trees

Removing Data or Attributes

 High Percentage of Missing Values:

 If an attribute is missing in a significant portion of data points (e.g., more than 80%), it may be practical to remove the attribute from the dataset. This is because the attribute provides limited information and is not useful for building a reliable model (see the sketch at the end of this subsection).

 Low Percentage of Missing Values:

 When missing values are present in a smaller percentage of data points (e.g., 5-10%), you
generally do not want to remove the attribute or the data points entirely. Removing a
substantial amount of data can lead to loss of valuable information and affect the model’s
performance.
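
A minimal sketch of the removal rule above, assuming the data sits in a pandas DataFrame and using an illustrative 80% threshold:

import pandas as pd

def drop_sparse_attributes(df: pd.DataFrame, max_missing: float = 0.8) -> pd.DataFrame:
    # Fraction of missing entries per column.
    missing_frac = df.isna().mean()
    # Keep only attributes at or below the threshold; columns above it
    # carry too little information to justify retaining them.
    return df.loc[:, missing_frac <= max_missing]

Data points with only occasional missing entries are left alone here; per the lecture, those are better served by imputation than by deletion.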

Imputation Techniques

 Definition:

 Imputation involves filling in missing values with estimated or calculated values based on the
available data. This helps to maintain the dataset’s completeness and usability.

 Simple Imputation:

 Mean Imputation: Replace missing values with the mean value of the attribute from the
available data. This is straightforward but may not account for variations in different classes
or groups within the data.

 Conditional Imputation:
 Class-Conditioned Imputation: Instead of using the overall mean, you can use class-specific
information to impute missing values. This involves predicting the missing attribute value
based on the class of the data point. For instance:

o Example: If you have a dataset where some attributes are missing for a subset of
data points belonging to a specific class (e.g., class 1), you can use the data points
within that class to predict the missing values.

o Process: Use only the data points from class 1 (where the attribute is not missing) to
estimate the missing attribute values. This approach considers the distribution of the
attribute within the specific class, leading to more accurate imputation.
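
A minimal sketch contrasting the two approaches just described, assuming a numeric attribute temp and a label column label (both names are illustrative):

import pandas as pd

df = pd.DataFrame({
    "temp":  [36.5, None, 39.0, None, 37.0, 38.5],
    "label": [0,    0,    1,    1,    0,    1],
})

# Simple mean imputation: one global estimate for every missing entry.
df["temp_mean"] = df["temp"].fillna(df["temp"].mean())

# Class-conditioned imputation: estimate the mean within each class,
# so missing values in class 1 are filled from class-1 data only.
df["temp_cond"] = df.groupby("label")["temp"].transform(lambda s: s.fillna(s.mean()))
print(df)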

3. Techniques and Considerations

 Impact on Decision Trees:

 Decision trees can handle missing values in specific ways, such as using surrogate splits.
However, proper imputation can improve the model’s performance and accuracy by ensuring
that missing values do not lead to incomplete or biased decision-making.

 Practical Tips:

 Attribute Removal: If an attribute has a high rate of missing values, consider removing it to
simplify the model and avoid inaccuracies.

 Data Point Removal: Avoid removing data points with missing values unless it significantly
affects the dataset. Instead, use imputation to retain valuable information.

Summary

1. Handling Missing Values:

 High Missing Rate: Remove the attribute if more than 80% of values are missing.

 Low Missing Rate: Use imputation techniques to fill in missing values.

2. Imputation Techniques:

 Mean Imputation: Replace missing values with the overall mean.

 Class-Conditioned Imputation: Use class-specific data to estimate missing values, leading to more accurate results.

3. Practical Considerations:

 Removing attributes or data points should be done cautiously to avoid losing important
information.

 Imputation helps maintain dataset completeness and enhances model reliability, particularly
in decision trees.
By understanding and applying these techniques, you can effectively manage missing values and
improve the quality of your decision tree models and other machine learning algorithms.

The lecture discusses various methods for handling missing values in datasets, particularly in the
context of decision trees and other machine learning models. Here are the key points explained in
detail:

1. Class-Conditioned Imputation

 Concept:

 When imputing missing values, it's beneficial to use information from the same class as the
missing data. This approach helps preserve correlations between the missing attribute and
the class, which might otherwise be lost if imputation is done across the entire dataset.

 Why It's Useful:

 Correlation Preservation: Imputing values conditionally based on class-specific data maintains the correlation between the attribute and the class. This is important because correlations can significantly impact the performance of the model.

 Avoiding Polluted Data: If you use data from all classes, any correlations between the
attribute and specific classes may be diluted, leading to less accurate imputation.

 Example:

 Suppose you have a dataset with 100,000 data points and a specific attribute is missing for a
subset. If you use only the data points from the same class to predict the missing values, the
imputation will better reflect the relationships within that class.

2. Imputation Techniques

 Mean Imputation:

 Replace missing values with the mean of the attribute from the available data. This is a
simple method but may not be the most accurate if there are class-specific patterns.

 Regression Imputation:

 Use regression models to predict missing values based on other attributes. For class-conditioned imputation, you would perform regression within each class separately. This method helps account for variations in different classes and maintains relevant correlations.

 Multiple Imputation:

 Concept:

o Multiple imputation involves creating several datasets by filling in missing values with different estimates drawn from a probability distribution.

 Process:

o Fit a model to predict the missing attribute using known data. Use this model to
generate multiple possible values for the missing data based on the distribution.
o Create multiple datasets with different imputed values and analyze them to account
for the uncertainty in the imputation.

 Benefits:

o Reduces the variance of the imputation and can provide more accurate estimates
compared to single imputation methods.

o Although computationally intensive, it is often more robust and informative.
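
One way to experiment with both ideas is scikit-learn's IterativeImputer (an experimental API at the time of writing), which regresses each incomplete attribute on the others; with sample_posterior=True it draws imputed values from a predictive distribution, giving a rough flavour of multiple imputation. A sketch with illustrative toy data:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [np.nan, 8.0]])

# Regression imputation: each column with missing entries is regressed
# on the remaining columns, iterating until the estimates stabilise.
X_single = IterativeImputer(random_state=0).fit_transform(X)

# A flavour of multiple imputation: sampling from the predictive
# distribution with different seeds yields several completed datasets,
# each of which can be analysed to capture imputation uncertainty.
completed = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(3)
]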

3. Introducing a Special Value

 Concept:

 Instead of imputing missing values, you can introduce a new value indicating that data is
missing. This approach acknowledges that the missingness itself might carry useful
information.

 Why It Might Be Useful:

 Systematic Missingness: The fact that data is missing could be due to a specific, systematic
reason. For example, a survey question might be skipped intentionally, and this behavior
might be predictive of an outcome.

 Practical Use: Including a special value for missing data can help in identifying patterns
related to why data might be missing and how it correlates with other attributes.
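
A minimal sketch of this idea, assuming a pandas DataFrame with illustrative column names: the categorical attribute gets an explicit "MISSING" category, and the numeric attribute gets a separate indicator column so the missingness pattern stays visible to the model:

import pandas as pd

df = pd.DataFrame({
    "income":   [52000, None, 61000, None],
    "employer": ["acme", None, "globex", "acme"],
})

df["employer"] = df["employer"].fillna("MISSING")       # missing as its own category
df["income_missing"] = df["income"].isna().astype(int)  # flag systematic missingness
df["income"] = df["income"].fillna(df["income"].mean())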

4. Summary of Imputation Methods

 Class-Conditioned Imputation: Maintains class-specific correlations and is more accurate for imputation within classes.

 Regression Imputation: Predicts missing values based on other attributes, useful for
preserving correlations.

 Multiple Imputation: Provides robust estimates by generating multiple datasets with different imputed values, accounting for uncertainty.

 Special Value for Missing Data: Recognizes that missing data might have inherent meaning
and can provide insights into systematic issues.

By using these imputation techniques and approaches, you can effectively handle missing values in
your dataset, ensuring that the imputation process enhances the performance of your decision tree
models and other machine learning algorithms.

The lecture discusses advanced techniques for handling missing values in decision trees, focusing on
surrogate splits and how they can be used during both training and testing. Here are the key points
explained in detail:
1. Surrogate Splits

 Concept:

 Definition: Surrogate splits are a method used to handle missing attribute values by
identifying alternative attributes that split the data in a similar way to the primary attribute
used in the decision tree. This helps in making predictions even when the primary attribute is
missing.

 How It Works:

 Training Phase:

o During the training phase, when you build a decision tree, you split the data based
on a primary attribute. For each attribute, you identify a surrogate attribute that
provides similar splits to the primary attribute.

o For example, if the primary split is based on attribute 3 and you have another
attribute, attribute 4, that tends to produce similar splits, attribute 4 can serve as a
surrogate for attribute 3.

 Testing Phase:

o When a new data point arrives and the primary attribute is missing, the decision tree
uses the surrogate attribute to decide the split. This allows the tree to make a
prediction even when some attribute values are missing.

o Example: Suppose you split your data into two groups based on attribute 3, with
70,000 data points in one group and 30,000 in the other. If attribute 4 splits the data
into groups of 68,000 and 32,000 points, and these splits overlap significantly with
those of attribute 3, then attribute 4 can be used as a surrogate.

 Benefits:

 Maintains Accuracy: Surrogate splits help maintain the accuracy of the decision tree when
dealing with missing values by leveraging attributes that provide similar information.

 Improves Robustness: It makes the model more robust and reliable, as it doesn't rely solely
on the primary attribute for decisions.

2. Handling Missing Values Using Surrogates

 Practical Use:

 Example Scenario: If you have a dataset where attribute 3 is missing for a certain data point,
you use the surrogate attribute 4 to decide the split, acting as if you had the information
from attribute 3.

 Process:
o Determine how well different attributes (surrogates) correlate with the primary
attribute.

o Use these surrogates to handle missing values during the decision-making process.
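
A rough sketch of how a surrogate might be scored (not any particular library's implementation): for a primary split such as attribute 3 at some threshold, each candidate attribute/threshold pair is rated by the fraction of training points it routes to the same side as the primary split, matching the 70,000/68,000 overlap intuition above. It assumes the training points at this node have the relevant attributes observed:

import numpy as np

def surrogate_agreement(X, primary_attr, primary_thr, cand_attr, cand_thr):
    primary_goes_left = X[:, primary_attr] < primary_thr
    cand_goes_left = X[:, cand_attr] < cand_thr
    # Fraction of points routed to the same side by both splits.
    return np.mean(primary_goes_left == cand_goes_left)

def best_surrogate(X, primary_attr, primary_thr, candidate_attrs):
    best = None
    for attr in candidate_attrs:
        for thr in np.unique(X[:, attr]):  # candidate thresholds from observed values
            score = surrogate_agreement(X, primary_attr, primary_thr, attr, thr)
            if best is None or score > best[2]:
                best = (attr, thr, score)
    return best  # (attribute, threshold, agreement), used when the primary is missing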

3. Splitting Based on Missing Data

 Concept:

 Handling Missing Categorical Attributes:

o If an attribute is missing for a data point, you can handle it by treating the missing
value as a separate category or by using probabilities based on the distribution of
non-missing values.

 Probability-Based Splitting:

o For a given query, such as x3 < 5, if x3 is missing, analyze the data points that do
have x3 to determine the probability of falling into each subset.

o For instance, if 60% of data points with known x3 values go to one subset and 40% to
another, then you distribute the missing data point similarly—60% to one subset and
40% to the other.

 Example:

 Suppose you have a decision tree that splits on x3 < 5. If x3 is missing for a data point, you
look at the data points where x3 is not missing and determine the proportions of those data
points that go left and right. You then apply these proportions to the data point with the
missing value.

4. Summary

 Surrogate Splits: Use alternative attributes to make splits when the primary attribute is
missing, maintaining the decision tree's accuracy and robustness.

 Probability-Based Splitting: Distribute missing data based on observed probabilities from non-missing data points, ensuring that the decision process remains consistent and accurate.

These methods help ensure that decision trees can handle missing values effectively, providing
robust predictions even when some data points have incomplete information.

The lecture covers advanced techniques for handling missing values in decision trees, focusing
on fragmentation and Expectation Maximization (EM). Here are the key points explained in detail:

1. Fragmentation Method

 Concept:
 Definition: Fragmentation is a technique used in decision trees to handle missing values by
allowing a data point to travel down multiple paths in the tree and then combining the
predictions from these paths.

 How It Works:

 Traversal with Missing Values:

o When a data point with missing attributes reaches a decision node in the tree, it may
not be able to make a decision based on the missing attribute.

o Instead of making a single decision, the data point is split according to the
probabilities associated with the missing attribute. For example, if the probability of
traveling down one path is 0.6 and another path is 0.4, the data point is split
accordingly.

 Combining Predictions:

o Each path leads to a leaf node, which provides a prediction or probability distribution
over the classes.

o The final prediction is a weighted combination of the predictions from all paths. For
instance, if one path predicts class 1 with 0.6 probability and class 2 with 0.4
probability, and another path predicts class 1 with 0.2 probability and class 2 with
0.8 probability, the final output is a weighted combination of these predictions.

 Purpose:

 Handling Multiple Missing Attributes:

o Fragmentation allows the decision tree to handle cases where multiple attributes are
missing by allowing a data point to traverse multiple paths and aggregate
predictions.

 Maintaining Information:

o By allowing a data point to travel down multiple paths and combining the results,
fragmentation ensures that the information from different possible splits is
considered, making the decision-making process more robust.

 Considerations:

 Complexity:

o This method can become complex if a data point has multiple missing attributes, as it
might traverse multiple paths and require aggregation from various leaf nodes.

 Use Case:

o Fragmentation is particularly useful in decision trees where handling missing values directly is challenging and requires more sophisticated methods.

2. Expectation Maximization (EM)


 Concept:

 Definition: Expectation Maximization (EM) is a statistical technique used for handling missing
data by iteratively estimating missing values and optimizing parameters.

 How It Works:

 Expectation Step (E-Step):

o Estimate the missing data based on the observed data and current parameter
estimates.

 Maximization Step (M-Step):

o Update the parameters by maximizing the likelihood function using the estimated
missing data from the E-Step.

 Iterative Process:

o Repeat the E-Step and M-Step until convergence, refining the estimates of missing
data and parameters.

 Application:

 In Decision Trees:

o EM can be applied to estimate missing values and improve the accuracy of the
decision tree model. This involves using EM to fill in missing attribute values and
then building or refining the decision tree based on the completed dataset.

 Challenges:

 Complexity:

o EM is a sophisticated technique and can be challenging to implement and understand. It involves iterative computations and may require a deep understanding of the underlying statistical principles.

 Practical Considerations:

o While EM is powerful, it can be computationally intensive and may not always be practical for large datasets or complex models.

Summary

 Fragmentation: This method allows a data point with missing attributes to travel down
multiple paths in the decision tree, combining predictions from these paths to make a final
decision. It is particularly useful for handling cases with multiple missing attributes.

 Expectation Maximization (EM): EM is an advanced technique used to estimate missing data and optimize model parameters through iterative steps. It can be applied to decision trees to handle missing values but is complex and computationally demanding.

These techniques are designed to improve the robustness and accuracy of decision trees when
dealing with incomplete or missing data.
