UNIT 2-Part2
Random Forest is a popular machine learning algorithm that belongs to the supervised
learning techniques. It can be used for both Classification and Regression problems in ML. It
is based on the concept of ensemble learning, which is the process of combining multiple
classifiers to solve a complex problem and improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision
trees trained on various subsets of the given dataset and averages their outputs to improve the
predictive accuracy on that dataset." Instead of relying on one decision tree, the random
forest takes the prediction from each tree and, based on the majority vote of those predictions,
produces the final output.
A greater number of trees in the forest generally leads to higher accuracy and reduces the
risk of overfitting.
[Diagram: working of the Random Forest algorithm]
Since the random forest combines multiple trees to predict the class of the dataset, it is
possible that some decision trees predict the correct output while others do not. But taken
together, the majority vote of the trees yields the correct output. Therefore, below are two
assumptions for a better Random Forest classifier:
o The feature variables of the dataset should contain some actual (informative) values, so
that the classifier can predict accurate results rather than guessed ones.
o The predictions from each tree must have very low correlations.
Below are some points that explain why we should use the Random Forest algorithm:
<="" li="">
o It takes less training time as compared to other algorithms.
o It predicts output with high accuracy, and it runs efficiently even on large datasets.
o It can also maintain accuracy when a large proportion of data is missing.
Random Forest works in two phases: the first is to create the random forest by combining N
decision trees, and the second is to make predictions with the trees created in the first phase.
The working process can be explained in the steps below:
Step-1: Select K random data points from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Step-1 and Step-2 until N trees are built.
Step-5: For new data points, find the predictions of each decision tree, and assign the new
data points to the category that wins the majority vote.
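As a concrete illustration of these two phases, here is a minimal sketch using scikit-learn's RandomForestClassifier on a synthetic dataset (scikit-learn is assumed to be installed; the dataset and settings are illustrative only, not a definitive recipe).

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic dataset standing in for "the given dataset" in the steps above.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Phase 1: n_estimators is the number N of trees; each tree is trained on a
# bootstrap sample (random subset) of the training data.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Phase 2: each tree votes, and the majority class is the final prediction.
print("Test accuracy:", forest.score(X_test, y_test))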
There are mainly four sectors where Random Forest is mostly used:
o Banking: identifying loan risk.
o Medicine: identifying disease trends and risks.
o Land Use: identifying areas of similar land use.
o Marketing: identifying marketing trends.
Disadvantages of Random Forest:
o Although random forest can be used for both classification and regression tasks, it is
less suitable for regression tasks.
Ensemble Classifier
An ensemble classifier is a machine learning model that combines the predictions of multiple
individual models to enhance overall performance and generalization. The fundamental idea behind
ensemble methods is to leverage the diversity among different models to improve predictive
accuracy and robustness. There are several types of ensemble classifiers, with two prominent ones
being bagging and boosting.
1. Bagging (Bootstrap Aggregating):
Idea: Build multiple instances of the same base model on different subsets of the training data.
Procedure:
o Draw several bootstrap samples (random samples with replacement) from the training set.
o Train one base model on each bootstrap sample.
o Combine the predictions by majority vote (classification) or by averaging (regression).
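The following is a minimal sketch of this procedure using scikit-learn's BaggingClassifier with decision trees as the base model (scikit-learn assumed installed; dataset and settings are illustrative).

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# 10 decision trees, each trained on a bootstrap sample (random sample with
# replacement) of the training data; predictions are combined by voting.
bagging = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=10,
    bootstrap=True,  # sample with replacement
    random_state=0,
)
bagging.fit(X, y)
print(bagging.predict(X[:5]))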
2. Boosting:
Idea: Sequentially build a series of weak learners, where each new model corrects the errors made
by the previous ones.
Procedure:
o Train a weak learner on the training data.
o Increase the weights of the examples that the learner misclassified.
o Train the next weak learner on the re-weighted data, and repeat.
o Combine all learners into a weighted vote, where more accurate learners receive larger weights.
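AdaBoost is a classic instance of this procedure; below is a minimal sketch using scikit-learn's AdaBoostClassifier (scikit-learn assumed installed; by default it uses decision stumps as the weak learners).

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Sequentially fits 50 weak learners; misclassified samples receive larger
# weights, so each new learner focuses on correcting earlier errors.
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)
boosted.fit(X, y)
print("Training accuracy:", boosted.score(X, y))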
3. Random Forest:
Idea: An ensemble of decision trees that combines bagging with random feature selection, so that the individual trees are decorrelated.
Procedure:
o Train each decision tree on a bootstrap sample of the training data.
o At each split, consider only a random subset of the features.
o Aggregate the predictions of all trees by majority vote (classification) or averaging (regression).
Evaluation Metrics
1. Accuracy: The proportion of all predictions that are correct: (TP + TN) / (TP + TN + FP + FN).
2. Precision: The proportion of predicted positives that are truly positive: TP / (TP + FP).
3. Recall: The proportion of actual positives that are correctly identified: TP / (TP + FN).
4. F1 Score: The harmonic mean of precision and recall: 2 × (Precision × Recall) / (Precision + Recall).
5. ROC Curve: Plots the true positive rate against the false positive rate at different classification thresholds.
6. Area Under the Curve (AUC): Quantifies the classifier's ability to distinguish between classes.
7. Confusion Matrix: Tabulates true positives, true negatives, false positives, and false negatives.
Understanding these metrics helps practitioners choose the appropriate evaluation criteria based on
the specific requirements and characteristics of their classification problem.
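As a quick reference, here is a minimal sketch (scikit-learn assumed installed) that computes the metrics above for a toy set of labels and scores; the values are illustrative only.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]                    # ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                    # hard class predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3]   # predicted probabilities

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))  # uses scores, not labels
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))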
XGBoost
XGBoost (eXtreme Gradient Boosting) is an optimized gradient boosting library that is
extensively used in machine learning competitions and hackathons. XGBoost has high
predictive power and is reported to be almost 10 times faster than other gradient boosting
techniques.
Key advantages of XGBoost:
1. Regularization: XGBoost adds L1 and L2 regularization terms to the objective, which
helps reduce overfitting.
2. Parallel Processing: XGBoost implements parallel processing and is therefore much
faster than standard gradient boosting.
3. High Flexibility: XGBoost allows users to define custom optimization objectives and
evaluation criteria.
4. Handling Missing Values: XGBoost has an in-built routine to handle missing values.
5. Tree Pruning: XGBoost grows the tree up to the specified maximum depth, then starts
pruning the tree backwards and removes splits beyond which there is no
positive gain.
6. Built-in Cross-Validation: XGBoost allows the user to run a cross-validation at each
iteration of the boosting process, and thus it is easy to get the exact optimum number of
boosting iterations in a single run.
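The built-in cross-validation is worth a quick illustration. Below is a minimal sketch, assuming the xgboost Python package is installed: xgb.cv with early stopping evaluates every boosting round and stops once the validation metric stops improving, which yields the optimum number of rounds.

import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "binary:logistic", "eta": 0.1, "max_depth": 4}

# 5-fold cross-validation at each boosting round; early stopping truncates
# the results at the round where validation log-loss stops improving.
cv_results = xgb.cv(params, dtrain, num_boost_round=200, nfold=5,
                    metrics="logloss", early_stopping_rounds=10)

# One row per retained boosting round; the row count is the optimum.
print("Optimal number of boosting rounds:", len(cv_results))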
Parameters
nthread
o This is used for parallel processing; enter the number of cores in the system.
o If you wish to run on all cores, do not set this value; the algorithm will
detect it automatically.
eta
o Analogous to the learning rate in GBM; makes the model more robust by shrinking the weights on each step.
min_child_weight
o Defines the minimum sum of instance weights required in a child node. Higher values
prevent the model from learning relations that are too specific to a particular sample,
so this parameter is used to control overfitting.
max_depth
o The maximum depth of a tree, used to control overfitting: higher depth allows the
model to learn relations very specific to a particular sample.
max_leaf_nodes
o The maximum number of terminal nodes (leaves) in a tree. Can be defined in place of
max_depth; since binary trees are created, a depth of n produces at most 2^n leaves.
gamma
o A node is split only when the resulting split gives a positive reduction in
the loss function. Gamma specifies the minimum loss reduction required to
make a split.
o Larger values make the algorithm more conservative. Suitable values depend on the
loss function and should be tuned.
subsample
o The fraction of observations randomly sampled for each tree. Lower values make the
algorithm more conservative and help prevent overfitting, but values that are too small
may lead to under-fitting.
colsample_bytree
o The fraction of columns (features) randomly sampled when constructing each tree.
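To tie these parameters together, here is a minimal sketch using the scikit-learn style XGBClassifier from the xgboost package (xgboost and scikit-learn assumed installed); the values are illustrative starting points, not tuned settings.

from xgboost import XGBClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=0)

model = XGBClassifier(
    n_estimators=100,      # number of boosting rounds
    learning_rate=0.1,     # eta: shrinks the weights at each step
    max_depth=4,           # limits how specific each tree can get
    min_child_weight=1,    # minimum sum of instance weights in a child
    gamma=0,               # minimum loss reduction required to split
    subsample=0.8,         # fraction of rows sampled per tree
    colsample_bytree=0.8,  # fraction of columns sampled per tree
    n_jobs=-1,             # nthread equivalent: use all available cores
)
model.fit(X, y)
print(model.predict(X[:5]))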