
Improving the Linear Regression Model

Adjusted R² measures the goodness of fit of a regression model: the higher the adjusted R², the better the model. When the adjusted R² is small, it could mean that some of the variables are not significant in predicting the amount of money transacted. One should also note that correlated predictor variables bring down the model's accuracy. A regression model can be further improved by detecting outliers and high-leverage points.
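
To make the penalty concrete, here is a minimal sketch using scikit-learn on synthetic data; the third predictor is deliberately pure noise, and the adjusted R² formula 1 − (1 − R²)(n − 1)/(n − p − 1), for n observations and p predictors, is the standard one.

    # Minimal sketch: R-squared vs adjusted R-squared on synthetic data.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    n, p = 200, 3                                 # n observations, p predictors
    X = rng.normal(size=(n, p))
    y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)  # X[:, 2] is pure noise

    model = LinearRegression().fit(X, y)
    r2 = model.score(X, y)                        # ordinary R-squared

    # Adjusted R-squared penalises predictors that add little explanatory power
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    print(f"R2 = {r2:.3f}, adjusted R2 = {adj_r2:.3f}")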
Decision Trees
A decision tree is a type of supervised learning algorithm (it has a pre-defined target variable) that is mostly used in classification problems. It works for both categorical and continuous input and output variables. In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter/differentiator among the input variables.

Example
Suppose we have 30 individuals with 2 variables, i.e. Gender (Male or Female) and Class (A or B). We want to create a model to predict who will default on, say, Fuliza. Here we need to segregate those who default based on the more significant of the two input variables.

How do we split?
There exist many splitting mechanisms. Among the simplest is the Gini index, which is used to perform a binary split.
Steps to calculate Gini for a split
1. Calculate the Gini score for each sub-node using the formula p² + q²,
where p is the probability of success and q = 1 − p is the probability of failure.
2. Calculate the weighted Gini score for the split as the average of the sub-node scores, weighted by the number of observations in each sub-node.
A higher weighted Gini score means higher homogeneity.
From our example, the weighted Gini score for the split on Gender is higher than for the split on Class, hence the node will be split on Gender.
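
As a worked sketch of the two steps above, the snippet below computes the weighted Gini score for two candidate splits of a 30-person sample; the group sizes and default rates are assumed purely for illustration, and the split with the higher weighted score wins.

    # Weighted Gini score for a candidate binary split (illustrative numbers).
    def gini(p):
        """Gini score p^2 + q^2 of a sub-node; higher means more homogeneous."""
        q = 1 - p
        return p * p + q * q

    def weighted_gini(groups):
        """groups: list of (size, p_success) pairs, one per sub-node."""
        total = sum(size for size, _ in groups)
        return sum(size / total * gini(p) for size, p in groups)

    # Assumed split on Gender: 10 females (20% default), 20 males (65% default)
    score_gender = weighted_gini([(10, 0.20), (20, 0.65)])
    # Assumed split on Class: 14 in A (43% default), 16 in B (56% default)
    score_class = weighted_gini([(14, 0.43), (16, 0.56)])
    print(score_gender, score_class)   # ~0.59 vs ~0.51, so split on Gender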
Random Forest

Random Forest is a versatile machine learning method capable of performing both regression and classification tasks. It also handles dimensionality reduction, missing values, outliers and other essential steps of data exploration, and does a fairly good job of them. It is a type of ensemble learning method, where a group of weak models combine to form a powerful model.

A random forest is built by aggregating trees. Instead of having only one decision tree, we grow several trees and then aggregate the results from all of them to come up with a classification or regression model. So if we have data and want to grow, say, 1000 decision trees, we draw with replacement 1000 random samples of size n from the data, grow a tree on each of the 1000 sub-datasets, and then aggregate the results. For example, in our case we may have 800 trees classifying by gender and 200 trees classifying by class, so the classification will be by gender because it received the 'majority of the votes'. For regression we simply take the average of the predictions from the 1000 trees.
Random forests can be used for both regression and classification: if y (the dependent variable) is categorical we have classification, and if y is continuous we have regression.

So, from the tree above, an individual aged 29, with an Mpesa balance of 33 shillings and living in an urban region, will default on his or her Fuliza loan.
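
As a hedged sketch of the idea in code, scikit-learn's RandomForestClassifier grows the bootstrapped trees and takes the majority vote internally (RandomForestRegressor averages instead); the features (age, Mpesa balance, urban/rural) and labels below are randomly generated stand-ins, so the printed prediction is illustrative only.

    # Sketch: a 1000-tree random forest on made-up data via scikit-learn.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(1)
    X = np.column_stack([
        rng.integers(18, 70, size=500),          # age
        rng.exponential(500.0, size=500),        # Mpesa balance (shillings)
        rng.integers(0, 2, size=500),            # 1 = urban, 0 = rural
    ])
    y = rng.integers(0, 2, size=500)             # 1 = defaults (random stand-in)

    forest = RandomForestClassifier(n_estimators=1000).fit(X, y)
    # Each of the 1000 trees votes; the majority class is returned
    print(forest.predict([[29, 33, 1]]))         # the individual from the text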

Clustering

This is the process of breaking down a large population into smaller groups where each observation is more similar to the observations it shares a group with than to the observations in other groups. The idea is to group similar observations together, breaking the large heterogeneous population down into smaller homogeneous groups. This is useful because tailor-made products and strategies can be devised for each of these smaller homogeneous groups, as opposed to a blanket strategy for the entire heterogeneous population. Consider the figure below of a two-dimensional example.

Essentially, what we have done is to take an imaginary point at the center and draw a circle around it; all observations that fall within that circle are grouped into one cluster. Now assume that instead of 2 variables we have several, say 100: it is difficult to draw a 100-dimensional graph.

The Clustering Algorithm

The most widely used clustering algorithm is k-means. When the algorithm is applied to a dataset, it breaks the dataset into k different clusters. In case it is unable to find k clusters, it breaks the dataset into k − 1 clusters, and so on. So if k = 5 we shall have 5 clusters.
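
A minimal sketch of k-means with scikit-learn on synthetic two-dimensional data; the three blobs (and k = 3) are assumed just so the algorithm has something obvious to find.

    # Sketch: k-means with k = 3 on three synthetic 2-D blobs.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(2)
    X = np.vstack([
        rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
        rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
        rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
    ])

    km = KMeans(n_clusters=3, n_init=10).fit(X)
    print(km.labels_[:10])        # cluster assigned to each observation
    print(km.cluster_centers_)    # final centroids (the converged seeds)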

Example

Suppose we want to split our data into 3 groups.

Step 1

Identify 3 seeds, i.e. pick three observations from the dataset and assign them as seeds.

Step 2

Assign all other observations to one of the seeds based on their proximity to the seeds. How?

Simple way

Draw a straight line joining two seeds and then draw a perpendicular bisector at its mid-point. All the points to the left of the bisector are closer to the left seed, and those to the right of it belong to the right seed; equivalently, each point is assigned to whichever seed is nearest in Euclidean distance.
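
A tiny NumPy sketch of that nearest-seed assignment; the points and seeds below are assumed for illustration.

    # Assign each point to the nearest seed by Euclidean distance.
    import numpy as np

    points = np.array([[1.0, 2.0], [4.0, 0.5], [5.0, 5.0], [0.5, 4.5]])
    seeds = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])   # the 3 seeds

    # Distance from every point to every seed, then pick the closest seed
    dists = np.linalg.norm(points[:, None, :] - seeds[None, :, :], axis=2)
    assignment = dists.argmin(axis=1)
    print(assignment)             # index of the seed each point belongs to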

We have created 3 clusters based on the three random seeds that were chosen. The question now is: how do we know whether these are the optimal clusters? These clusters were chosen based on the seeds, and we can clearly see from how the algorithm works that if the seeds were chosen differently we could get different results. So how do we decide on the optimal clusters?

The next step of the algorithm is to calculate the centroid of each cluster, i.e. the mid-point of each cluster, and assign these centroids as the seeds for the next round.

This is repeated until convergence, i.e. until there is no more movement or shifting of points across clusters.
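
Putting the steps together, here is a minimal from-scratch sketch of the loop (Lloyd's algorithm); empty-cluster handling and repeated restarts, which practical implementations add, are omitted.

    # Minimal k-means loop: pick seeds, assign points, recompute centroids,
    # repeat until no point changes cluster.
    import numpy as np

    def k_means(X, k, rng_seed=0, max_iter=100):
        rng = np.random.default_rng(rng_seed)
        seeds = X[rng.choice(len(X), size=k, replace=False)]   # Step 1: seeds
        labels = None
        for _ in range(max_iter):
            # Step 2: assign every point to its nearest seed
            dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            if labels is not None and np.array_equal(new_labels, labels):
                break                                          # converged
            labels = new_labels
            # Centroids of the current clusters become the next round's seeds
            seeds = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        return labels, seeds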
