
Dayananda Sagar College of Engineering

An Autonomous Institute Affiliated to VTU, Belagavi


Approved by AICTE & UGC, Accredited by NAAC with ‘A’ Grade and NBA

Department of Artificial Intelligence and Machine Learning


UNIT 2
Random Forest

Random Forest is a popular machine learning algorithm that belongs to the supervised learning family. It can be used for both classification and regression problems in ML. It is based on the concept of ensemble learning, the process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees built on various subsets of the given dataset and combines their outputs to improve the predictive accuracy on that dataset." Instead of relying on a single decision tree, the random forest takes a prediction from each tree and, based on the majority vote of those predictions, outputs the final class.

A greater number of trees in the forest generally leads to higher accuracy and helps prevent overfitting.

The diagram below illustrates the working of the Random Forest algorithm:


Assumptions for Random Forest

Since the random forest combines multiple trees to predict the class of a sample, it is possible that some decision trees predict the correct output while others do not; taken together, however, the trees tend to predict the correct output. Two assumptions therefore underlie a good Random Forest classifier:

o The feature variables of the dataset should contain some actual (informative) values, so that the classifier can predict accurate results rather than guessed ones.
o The predictions from the individual trees should have very low correlations with one another.

Why use Random Forest?

Below are some points that explain why we should use the Random Forest algorithm:

o It takes less training time than many other algorithms.
o It predicts outputs with high accuracy, and it runs efficiently even on large datasets.
o It can maintain accuracy even when a large proportion of the data is missing.

How does Random Forest algorithm work?

Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make a prediction by aggregating the outputs of the trees created in the first phase.

The working process can be explained in the steps below; a short code sketch follows the list:

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Steps 1 & 2 until N trees have been built.

Step-5: For a new data point, obtain the prediction of each decision tree and assign the new point to the category that wins the majority vote.
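A minimal sketch of these steps is given below, assuming scikit-learn and NumPy are available; X_train, y_train, and x_new are hypothetical arrays standing in for the training set and a new data point.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_predict(X_train, y_train, x_new, n_trees=10, k=100, seed=1):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):                                  # Steps 3-4: repeat for N trees
        idx = rng.choice(len(X_train), size=k, replace=True)  # Step 1: select K random data points
        tree = DecisionTreeClassifier(random_state=0)
        tree.fit(X_train[idx], y_train[idx])                  # Step 2: build a tree on the subset
        trees.append(tree)
    votes = [t.predict(x_new.reshape(1, -1))[0] for t in trees]  # Step 5: prediction of each tree
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]                          # majority vote decides the class

In practice, sklearn.ensemble.RandomForestClassifier performs these steps internally and additionally samples a random subset of features at each split.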

Applications of Random Forest

There are four main sectors in which Random Forest is most commonly used:


1. Banking: The banking sector mostly uses this algorithm to identify loan risk.
2. Medicine: With the help of this algorithm, disease trends and disease risks can be identified.
3. Land Use: Areas of similar land use can be identified with this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest

o Random Forest is capable of performing both classification and regression tasks.
o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and helps prevent overfitting.

Disadvantages of Random Forest

o Although Random Forest can be used for both classification and regression tasks, it is less suitable for regression tasks.

Implementation steps are given below; a code sketch follows the list:

o Data pre-processing step
o Fitting the Random Forest algorithm to the training set
o Predicting the test result
o Test accuracy of the result (creation of a confusion matrix)
o Visualizing the test set result.
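The sketch below walks through these implementation steps with scikit-learn; the Iris dataset is only an assumed stand-in for whatever dataset the lab uses, and the visualization step is omitted.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Data pre-processing step: load, split, and scale the data
X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# Fitting the Random Forest algorithm to the training set
classifier = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0)
classifier.fit(x_train, y_train)

# Predicting the test result
y_pred = classifier.predict(x_test)

# Test accuracy of the result (creation of the confusion matrix)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))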

Ensemble Classifier
An ensemble classifier is a machine learning model that combines the predictions of multiple
individual models to enhance overall performance and generalization. The fundamental idea behind
ensemble methods is to leverage the diversity among different models to improve predictive
accuracy and robustness. There are several types of ensemble classifiers, with two prominent ones
being bagging and boosting.

1. Bagging (Bootstrap Aggregating):

Idea: Build multiple instances of the same base model on different subsets of the training data.

Procedure:


Randomly sample subsets (with replacement) from the training data.

Train a base model on each subset.

Combine predictions through averaging (for regression) or voting (for classification).
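As a concrete sketch (assuming scikit-learn and the x_train/x_test/y_train/y_test arrays from the earlier example), bagging can be applied around a decision tree as follows; note that older scikit-learn versions name the first argument base_estimator instead of estimator.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging: the same base model is trained on bootstrap samples
# (subsets drawn with replacement) and predictions are combined by voting.
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base model
    n_estimators=50,                     # number of bootstrap subsets / models
    max_samples=0.8,                     # fraction of training data per subset
    bootstrap=True,                      # sample with replacement
    random_state=1,
)
bagging.fit(x_train, y_train)
print(bagging.score(x_test, y_test))     # voting-based accuracy on the test set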

2. Boosting:

Idea: Sequentially build a series of weak learners, where each new model corrects the errors made
by the previous ones.

Procedure:

Train a weak learner on the entire dataset.

Give more weight to misclassified instances.

Iterate, focusing on the misclassified instances in each step.

Combine the weak learners into a strong model.
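AdaBoost is one classical instance of this procedure; a hedged sketch using scikit-learn is shown below, again assuming the x_train/x_test/y_train/y_test arrays from earlier (older scikit-learn versions use base_estimator rather than estimator).

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Boosting: shallow weak learners are trained one after another,
# with misclassified instances reweighted at each step, and then
# combined into a single strong model.
boosting = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner (decision stump)
    n_estimators=100,       # number of sequential weak learners
    learning_rate=0.5,      # shrinks each learner's contribution
    random_state=1,
)
boosting.fit(x_train, y_train)
print(boosting.score(x_test, y_test))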

3. Random Forest:

Type: Bagging ensemble method.

Base Model: Decision Trees.

Procedure:

Build multiple decision trees on random subsets of features and samples.

Combine predictions through voting.

4. XGBoost (Extreme Gradient Boosting):

Type: Boosting ensemble method.

Base Model: Decision Trees.

Procedure:

Sequentially build decision trees, giving more weight to misclassified instances.

Regularize the model to avoid overfitting.

Combine predictions using a weighted sum.

Performance Metrics in Classification


Performance metrics are used to evaluate the effectiveness of classification models by comparing
their predicted outputs to the actual ground truth. Common metrics provide insights into different
aspects of a classifier's behavior, such as accuracy, precision, recall, and F1 score.

1. Accuracy:

Formula: (Number of Correct Predictions) / (Total Number of Predictions)

Interpretation: Overall correctness of the classifier.

Consideration: May be misleading in imbalanced datasets.

2. Precision:

Formula: (True Positives) / (True Positives + False Positives)

Interpretation: Proportion of predicted positives that were actually positive.

Consideration: Useful when the cost of false positives is high.

3. Recall (Sensitivity or True Positive Rate):

Formula: (True Positives) / (True Positives + False Negatives)

Interpretation: Proportion of actual positives that were correctly predicted.

Consideration: Important when missing actual positives is costly.

4. F1 Score:

Formula: 2 * (Precision * Recall) / (Precision + Recall)

Interpretation: Harmonic mean of precision and recall.

Consideration: Balances precision and recall.

5. Specificity (True Negative Rate):

Formula: (True Negatives) / (True Negatives + False Positives)

Interpretation: Proportion of actual negatives that were correctly predicted.

Consideration: Important when avoiding false positives is crucial.

6. Receiver Operating Characteristic (ROC) Curve:


Graphical Representation: Plots the true positive rate against the false positive rate.

Area Under the Curve (AUC): Quantifies the classifier's ability to distinguish between classes.

7. Confusion Matrix:

Matrix: Tabulates true positives, true negatives, false positives, and false negatives.

Use: Provides a detailed breakdown of a classifier's performance.

Understanding these metrics helps practitioners choose the appropriate evaluation criteria based on
the specific requirements and characteristics of their classification problem.
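All of these metrics can be computed with scikit-learn. The sketch below assumes a fitted binary classifier, with y_pred = model.predict(x_test) and y_score holding the predicted probability of the positive class.

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

print("Accuracy   :", accuracy_score(y_test, y_pred))
print("Precision  :", precision_score(y_test, y_pred))
print("Recall     :", recall_score(y_test, y_pred))
print("F1 score   :", f1_score(y_test, y_pred))

# Specificity is derived from the confusion matrix (binary case)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("Specificity:", tn / (tn + fp))

# AUC uses the predicted probability of the positive class, e.g.
# y_score = model.predict_proba(x_test)[:, 1]
print("ROC AUC    :", roc_auc_score(y_test, y_score))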

XGBoost

XGBoost (eXtreme Gradient Boosting) is an advanced implementation of the gradient boosting algorithm. XGBoost has proved to be a highly effective ML algorithm, extensively used in machine learning competitions and hackathons. XGBoost has high predictive power and is reported to be almost 10 times faster than other gradient boosting techniques. It also includes a variety of regularization options, which reduce overfitting and improve overall performance; hence it is also known as a 'regularized boosting' technique.

Let us see how XGBoost improves on other techniques:

1. Regularization:

o A standard GBM implementation has no regularization, unlike XGBoost.

o Thus XGBoost also helps to reduce overfitting.


2. Parallel Processing:

o XGBoost implements parallel processing and is faster than GBM.

o XGBoost also supports implementation on Hadoop.

3. High Flexibility:

o XGBoost allows users to define custom optimization objectives and evaluation criteria, adding a whole new dimension to the model.

4. Handling Missing Values:

o XGBoost has an in-built routine to handle missing values.

5. Tree Pruning:

o XGBoost makes splits up to the max_depth specified and then starts pruning the tree backwards, removing splits beyond which there is no positive gain.

6. Built-in Cross-Validation:

o XGBoost allows the user to run cross-validation at each iteration of the boosting process, so it is easy to obtain the exact optimum number of boosting iterations in a single run.


# Importing the XGBoost library
import xgboost as xgb

# Creating an instance of the XGBClassifier with specified parameters
model = xgb.XGBClassifier(random_state=1, learning_rate=0.01)

# Training the model on the training data (assuming x_train and y_train are defined)
model.fit(x_train, y_train)

# Evaluating the model on the test data (assuming x_test and y_test are defined)
accuracy = model.score(x_test, y_test)

# Printing the accuracy score
print(accuracy)

Parameters

 nthread

o This is used for parallel processing; the number of cores in the system should be entered.

o If you wish to run on all cores, do not specify this value; the algorithm will detect the core count automatically.

 eta

o Analogous to learning rate in GBM.

o Makes the model more robust by shrinking the weights on each step.

 min_child_weight

o Defines the minimum sum of weights of all observations required in a child.

o Used to control over-fitting. Higher values prevent the model from learning relations that might be highly specific to the particular sample selected for a tree.

 max_depth

o It is used to define the maximum depth of a tree.

o Higher depth will allow the model to learn relations very specific to a
particular sample.

 max_leaf_nodes

o The maximum number of terminal nodes or leaves in a tree.


o Can be defined in place of max_depth. Since binary trees are created, a depth of 'n' would produce a maximum of 2^n leaves.

o If this is defined, max_depth will be ignored.

 gamma

o A node is split only when the resulting split gives a positive reduction in
the loss function. Gamma specifies the minimum loss reduction required to
make a split.

o Makes the algorithm conservative. The values can vary depending on the
loss function and should be tuned.

 subsample

o Same as the subsample parameter of GBM. Denotes the fraction of observations to be randomly sampled for each tree.

o Lower values make the algorithm more conservative and prevent overfitting, but values that are too small might lead to under-fitting.

 colsample_bytree

o It is similar to max_features in GBM.

o Denotes the fraction of columns to be randomly sampled for each tree.
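The sketch below shows how these parameters map onto the XGBClassifier interface used earlier; the values are purely illustrative and would need to be tuned for a real dataset.

import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=200,       # number of boosting rounds
    learning_rate=0.1,      # eta: shrinks the weights at each step
    max_depth=4,            # maximum depth of each tree
    min_child_weight=1,     # minimum sum of instance weights in a child
    gamma=0.1,              # minimum loss reduction required to split a node
    subsample=0.8,          # fraction of observations sampled per tree
    colsample_bytree=0.8,   # fraction of columns sampled per tree
    n_jobs=-1,              # nthread equivalent: use all available cores
    random_state=1,
)
model.fit(x_train, y_train)   # x_train and y_train assumed from earlier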
