0% found this document useful (0 votes)
83 views14 pages

Understanding Decision Trees in Machine Learning

Decision Trees are supervised learning techniques used for classification and regression, structured as tree-like models with decision nodes and leaf nodes. The construction involves selecting features to split the dataset recursively until certain stopping criteria are met, and various criteria like Gini Impurity and Entropy are used for splitting. Ensemble learning methods like Bagging and Boosting enhance model accuracy, with Random Forest being a specific ensemble method that builds multiple decision trees using random subsets of data and features.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
83 views14 pages

Understanding Decision Trees in Machine Learning

Decision Trees are supervised learning techniques used for classification and regression, structured as tree-like models with decision nodes and leaf nodes. The construction involves selecting features to split the dataset recursively until certain stopping criteria are met, and various criteria like Gini Impurity and Entropy are used for splitting. Ensemble learning methods like Bagging and Boosting enhance model accuracy, with Random Forest being a specific ensemble method that builds multiple decision trees using random subsets of data and features.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

DECISION TREES

• Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems.
• It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
• In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node.
• Decision nodes are used to make any decision and have multiple branches, whereas Leaf
nodes are the output of those decisions and do not contain any further branches.
• The decisions or the tests are performed on the basis of features of the given dataset.
• It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.

Example:
• One of the reasons that decision trees are popular is that we can turn them into a set of
logical disjunctions (if ... then rules) that then go into program code very simply.

Ex: if there is a party then go to it if there is not a party and you have an urgent deadline
then study

Constructing Decision Trees:


1. Start with the Entire Dataset

Consider all training data as the root.

2. Choose the Best Feature to Split

• Evaluate each feature using a splitting criterion (explained below).

• Select the feature that best separates the data.

3. Split the Dataset

Based on the selected feature, divide the dataset into subsets.

4. Recurse on Each Subset

Repeat the process recursively on each subset, treating it as a new node.

5. Stopping Criteria

Stop if:

o All instances in a node belong to the same class.

o Maximum tree depth is reached.

o Minimum number of samples in a node is reached.

o No further gain from splitting.

6. Assign Output

• For classification: Assign the most common class.

• For regression: Assign the average value.

Main Criteria Used for Splitting Nodes

1. Gini Impurity

Gini Impurity measures how mixed the classes are in a node. A lower value means a purer node.
It is used in the CART algorithm.

2. Entropy / Information Gain


Information Gain tells us how much useful information a feature gives. A good feature split
reduces entropy. It is used in the ID3 algorithm.

3. Chi-Square

Chi-Square checks if the split between data groups is meaningful or just due to chance.

4. Reduction in Variance

This is used in regression trees. It chooses splits that reduce the difference (variance) within the
data.

5. Gain Ratio

Gain Ratio improves Information Gain by adjusting for features that have many different values.
It is used in the C4.5 algorithm.

CLASSIFICATION AND REGRESSION IN TREES


CART is a machine learning algorithm that builds decision trees.
It is used for two types of problems:

• Classification – when we want to predict a category or label

• Regression – when we want to predict a numerical value

CART uses a binary tree structure, meaning each node splits into two branches.

Classification in Trees
This is used when the output we want is a class label (like “yes” or “no”).

Example:

• Predicting whether an email is spam or not spam

• Diagnosing whether a person has disease A or not

Working
• At each node, CART checks all features and chooses the one that best separates the data.
• The goal is to group similar items together in each branch.
• This process continues until each branch (called a leaf node) has mostly one class.

Gini Impurity (for classification)


Gini Impurity is a measure used to find the best feature to split the data.

What it means:
• If all data at a node belongs to one class → Gini = 0 (perfectly pure)

• If data is equally mixed among classes → Gini is higher

• Formula:

Where pi is the probability of a class at that node.


CART chooses the split that has the lowest Gini impurity, meaning it creates purer
groups.

Regression in Trees
This is used when the output is a numerical value (like marks, price, or temperature).

Example:

• Predicting the price of a house based on area, location, etc.

• Estimating the rainfall for next week

Working
• CART splits the data again and again to form smaller and smaller groups.
• It finds the best split by checking how much the error reduces after the split.

Residual Sum of Squares (for regression)


• This is the error measure used for regression trees.
• It calculates how far the predicted values are from the actual values.
• Formula:

Where:
• yi is the actual value
• y^ is the predicted value
CART tries to find a split that gives the lowest error (minimum SSE).

Residual Reduction
Residual Reduction is how much the error decreases after a split.

What it means:

• If a split greatly reduces the error → it is a good split

• If it doesn't help much → CART won’t split that way

CART keeps splitting until:


• The tree is too deep

• Or a node has very few data points

ENSEMBLE LEARNING
• Ensemble learning refers to the approach of combining multiple ML models to produce a
more accurate and robust prediction compared to any individual model.
• The conventional ensemble methods include bagging, boosting, and stacking-based
methods

Boosting:
• Boosting is an ensemble technique that combines multiple weak learners to create a
strong learner.
• The ensemble of weak models are trained in series such that each model that comes
next, tries to correct errors of the previous model until the entire training dataset is
predicted correctly.
• One of the most well-known boosting algorithms is AdaBoost (Adaptive Boosting).

AdaBoost (Adaptive Boosting)


• AdaBoost (Adaptive Boosting) is an ensemble learning method used for classification and
regression tasks.
• It works by training weak classifiers one after another, giving more focus (weight) to the
misclassified data in each round.
• The final model is a weighted combination of all weak classifiers, where more accurate
models get higher weight.

Steps in AdaBoost:

1. Weight Initialization

Start by giving equal importance to all training examples.

2. Model Training

Train a simple model on the data, using the current importance (weights) of each example.

3. Weighted Error Calculation

Check how many examples were classified wrongly, giving more attention to the important
ones.

4. Model Weight Calculation

If the model does well, give it more power (weight) in the final decision; if it does badly, give
it less.

5. Update Instance Weights

Increase the importance (weight) of wrongly classified examples, so the next model tries
harder on them.

6. Repeat

Repeat steps 2 to 5 multiple times, each time focusing more on the hard examples.

7. Final Model Creation

Combine all the models into one strong model, giving more say to the better ones.

8. Classification

For new data, the final model makes a prediction by taking a vote from all the simple
models.
Bagging:
• Bagging (Bootstrap Aggregating) is an ensemble learning technique that builds multiple
models on different random subsets of the training data (with replacement).
• It helps to reduce variance and prevent overfitting, especially in high-variance models
like decision trees.
• Bagging is used for both classification and regression tasks and makes the final prediction
by combining outputs from all base models

Steps:
1. Bootstrap Sampling

Randomly create multiple subsets of the training data with replacement.

2. Train Base Models

Train a separate model (usually the same type) on each subset independently.

3. Aggregate Predictions

Combine the results:

Use majority vote for classification

Use average for regression


4. Out-of-Bag Evaluation

Use the samples not included in the training subset (OOB samples) to estimate model
accuracy.

5. Final Prediction

Output the aggregated prediction from all models for the final result.

Random Forest
Definition:

Random Forest is an ensemble method based on Bagging that builds a collection of Decision
Trees, using random subsets of data and features, to improve prediction and reduce overfitting.

Steps:
1. Bootstrap Sampling

Create random subsets of training data with replacement, like in Bagging.

2. Random Feature Selection

When building each Decision Tree, at each split, use a random subset of features (not all
features).

3. Train Multiple Decision Trees

Build many Decision Trees, each trained on different data and features.

4. Aggregate Predictions

Combine the predictions:

Use majority vote for classification

Use average for regression

5. Final Prediction

Predict the final output by combining the outputs of all the trees.

Different Ways to Combine Classifiers


1. Majority Voting (for Classification)

If the number of classifiers is odd and they are independent, the label chosen by most
classifiers is used as the final output.

2. Mean (for Regression)


For regression tasks, we usually take the average (mean) of all model outputs.

3. Median (for Robust Regression)

Since the mean can be affected by outliers, using the median (middle value) is often
better for more stable predictions.

4. Bagging Uses Median

When the median is used to combine predictions, it leads to a more robust version of
bagging, sometimes called "robust bagging".

BASIC STATISTICS

1. Mean

The mean is the average value of a dataset.

It is calculated by adding all the values in the dataset and then dividing the total by the
number of values.

The mean is a useful measure of central tendency, but it is sensitive to outliers, meaning
extreme values can affect it significantly.

2. Median

The median is the middle value of a dataset when all the values are arranged in order.

If the number of values is odd, the median is the exact middle value.

If the number of values is even, the median is calculated by taking the average of the two
middle values.

The median is a reliable measure of central tendency because it is not affected by outliers or
extreme values.

3. Mode

The mode is the value that occurs most frequently in a dataset.

It is useful for identifying the most common value in the data.

If a dataset has two or more values that occur with the same highest frequency, it is called
bimodal, trimodal, or multimodal respectively.

The mode may not be a helpful measure if all values are different or if the dataset has a wide
range of values.

4. Variance

Variance measures how much the values in a dataset differ from the mean.

A high variance indicates that the data points are spread out widely, while a low variance
means they are closely clustered around the mean.

5. Standard Deviation
The standard deviation is the square root of the variance.

It tells us how much the values in the dataset typically differ from the mean, and it is
expressed in the same units as the data.

A small standard deviation means the data is closer to the mean, while a large one means
the data is more spread out.

6. Covariance

Covariance shows the relationship between two variables, indicating how one variable
changes when the other changes.

A positive covariance means both variables tend to increase or decrease together.

A negative covariance means that as one variable increases, the other tends to decrease.

Covariance is scale-dependent, so the values can vary greatly.

7. Mahalanobis Distance

The Mahalanobis Distance is a statistical measure that shows how far a point is from the
mean of a distribution, considering the correlations between variables.

It is used in applications like outlier detection, clustering, and classification.

The formula is:

GAUSSIAN MIXTURE MODELS (GMM)


Gaussian Mixture Model (GMM) is a probabilistic model that assumes data is generated from a
mixture of several Gaussian (normal) distributions, each with its own mean and variance.
Purpose:

GMM is mainly used for unsupervised learning, especially in clustering, where we don’t have
labelled data. It helps to group similar data points based on the likelihood that they belong to
the same Gaussian distribution.

How GMM Works:

1. Initialization

Start with a guess of how many Gaussians (clusters) there are.

2. Expectation Step (E-Step)

For each data point, calculate the probability that it belongs to each Gaussian.

3. Maximization Step (M-Step)

Update the parameters (mean, variance, and weight) of each Gaussian to maximize the
likelihood of the data.

4. Repeat

Repeat E-Step and M-Step until the model converges, i.e., changes are very small.

Notation:

• K: Number of Gaussian components

• N: Number of data points

• D: Dimensionality of the data

GMM Parameters

Means (μ) → The center of each Gaussian component

Covariance (Σ) → The shape and spread of each component

Weights (π) → The probability (importance) of each component

Model Training

• Training a GMM involves setting the parameters using available data.

• The Expectation-Maximization (EM) technique is often employed, alternating between the


Expectation (E) and Maximization (M) steps until convergence.

Expectation-Maximization:

• During the E step, the model calculates the probability of each data point belonging to each
Gaussian component.

• The M step then adjusts the model’s parameters based on these probabilities.
Clustering and Density Estimation:

• Post-training, GMMs cluster data points based on the highest posterior probability.

• They are also used for density estimation, assessing the probability density at any point in the
feature space.

Applications:
• Customer segmentation
• Anomaly detection
• Image segmentation
• Speaker identification

Nearest Neighbour Methods:


K-Nearest Neighbors Algorithm:

What is KNN? – Simple Points


• K-Nearest Neighbors (KNN) is a supervised learning algorithm used for both classification
and regression.
• It is a lazy learning algorithm, meaning it does not build a model during training — it
simply stores the data.
• KNN makes predictions only when a new input is given.
• The algorithm assumes that similar data points are located close together in the feature
space.

To make a prediction:

• It finds the 'K' nearest neighbors (closest data points) to the input.
• For classification, it uses majority voting of the neighbors' labels.
• For regression, it takes the average of the neighbors' values.

Steps in KNN Algorithm:


Step 1: Choose the value of K

K is the number of nearest neighbors to look at when making a prediction.


A good value of K is chosen by testing for the best accuracy.

Step 2: Calculate Distance

Use Euclidean distance (or other distance measures) to find how close each training point is
to the new (target) point.

Step 3: Find Nearest Neighbors

Pick the K training points that are closest to the target point based on the calculated
distances.

Step 4: Make Prediction

For Classification:

→ Use majority voting — the class that appears the most among the K neighbors is selected.

For Regression:

→ Take the average of the values of the K nearest neighbors to predict the result.

K MEANS ALGORITHM
• K-Means is an unsupervised learning algorithm used to group unlabelled data into K
clusters based on similarity.
• The value of K is pre-defined and represents the number of clusters the algorithm should
form.
• It works by finding centroids (center points) and assigning data points to the nearest
centroid.
• The algorithm is iterative, repeatedly adjusting the centroids and cluster assignments until
the results stabilize.
• K-Means is widely used in tasks like customer segmentation, image compression, and
pattern recognition because it is simple, fast, and effective.

Main Purpose:

The K-Means algorithm tries to group similar data points together by minimizing the distance
between the points and their cluster center (called a centroid).
Working

1. Choose the number of clusters (K)

Decide how many groups (clusters) you want to divide the data into.

2. Initialize K centroids randomly

Select K random points from the dataset to serve as the initial centers of the clusters.

3. Assign each data point to the nearest centroid

Measure the distance (usually Euclidean distance) from each point to each centroid and
assign it to the closest one.

4. Update the centroids

Recalculate the centroid of each cluster by taking the average position of all the points in
that cluster.

5. Repeat steps 3 and 4

Continue reassigning points and updating centroids until the centroids do not change
much, or a maximum number of iterations is reached.

Common questions

Powered by AI

Variance and standard deviation both describe the data's spread; variance gives the average squared deviation from the mean, showing how data points vary around the mean. Standard deviation, as the square root of variance, measures dispersion in original units, offering intuitive insight into average variance. Both metrics help identify data variability; however, standard deviation is directly comparable to mean values, providing clearer context .

Bagging reduces variance by training multiple models on random subsets of the data with replacement and averaging their predictions, thus smoothing fluctuations from any one model. It primarily aims to stabilize high-variance models like decision trees. Conversely, boosting focuses on reducing bias by sequentially training models, each correcting errors of predecessors, which enhances their overall ability to generalize from data patterns .

Gaussian Mixture Models (GMM) use the Expectation-Maximization (EM) algorithm to perform clustering and density estimation by iteratively applying two main steps: the E-step calculates the probability of each point belonging to each Gaussian component, and the M-step updates each Gaussian's parameters (mean, variance, weight) based on these probabilities. This alternation optimizes the likelihood of data fitting the model, efficiently grouping data into clusters .

Random Forest enhances the robustness and accuracy of decision tree models by constructing numerous trees using random feature subsets and data samples, which diversifies the decision paths. This approach mitigates overfitting inherent in single decision trees and ensures a more stable generalization by majority voting in classification or averaging in regression, thereby improving prediction accuracy .

AdaBoost offers advantages such as improved accuracy over individual weak learners by iteratively enhancing misclassified examples' weights, leading to stronger models. However, its drawbacks include sensitivity to noise and outliers due to high emphasis on challenging examples, which may lead to overfitting. Additionally, AdaBoost assumes weak learners are independently trainable, which might not hold for all datasets .

K-Means ensures convergence by initializing centroids and iteratively reassigning data points to the nearest centroid, recalculating centroids by the mean of assignments, and repeating until minimal centroid movement. This simplicity facilitates quick convergence but faces limitations like sensitivity to initial centroid placement, necessitating careful initialization (e.g., k-means++), and struggles with non-linear, varying shaped data clusters .

Decision trees apply splitting criteria to improve classification accuracy by evaluating each feature's ability to separate data into more distinct classes. Key splitting criteria include Gini impurity, which measures the degree of impurity in a split, and entropy/information gain, which evaluates how much a feature reduces uncertainty. The tree selects the feature that yields the most significant information gain or the lowest impurity, resulting in purer subsets and improved classification accuracy .

In a decision tree, decision nodes represent points where the dataset is split based on feature values, directing the branching of the tree. Leaf nodes are the endpoints representing the final output or class labels. The interaction involves decision nodes evaluating information gain or purity to decide splits, while leaf nodes provide the ultimate outcomes of these decisions without further branching .

K-Nearest Neighbors (KNN) handles classification problems by assuming proximity reflects similarity; it predicts a point's class by majority voting among its 'K' closest neighbors. Strengths of KNN include simplicity and effectiveness in capturing complex patterns without explicit model training. Limitations involve high computational cost, sensitivity to irrelevant features, and performance dependency on 'K' choice and distance metric, which can affect robustness .

Decision trees translate into logical disjunctions ('if ... then' rules) by expressing each split as conditional statements corresponding to decision nodes. This capability simplifies model interpretation and implementation in rule-based systems. While this offers transparency and clarity, it may also result in overly complex or lengthy rules, particularly in deep trees, potentially hindering practical usability without pruning or simplification .

You might also like