Business Analytics MGN801-CA2 KAJAL (11917586) Section - Q1959
The data used for this assignment is taken from the above link. It shows the yield of agricultural crops as a function of the cost of cultivation and the cost of production.
REGRESSION:
Regression is a set of statistical processes used to estimate the relationships among variables: one dependent (output) variable and one or more input variables. In this assignment I have taken three input variables and one output variable.
1) Linear Regression Model
A linear regression is a regression model in which the output is a linear function of the input variables. It can be a single-variable model, with one input and one output, or a multivariable model, with more than one input and one output. In this case we are using a multivariable linear regression model.
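A multivariable linear model of this kind can be fitted in R with `lm()`. As a minimal sketch (the built-in `mtcars` dataset stands in here, since the assignment's actual column names are not shown):

```r
# Multivariable linear regression: one output (mpg) and three inputs.
# mtcars is a built-in dataset, used here only as a stand-in for the
# cost-of-cultivation data.
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
summary(fit)                 # coefficients, R-squared, p-values
head(predict(fit))           # fitted values for the first rows
```

`summary(fit)` reports one coefficient per input plus the intercept, together with the R-squared of the fit.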
Output:
Interpretation:
2) Decision Tree Model
A decision tree builds regression or classification models in the form of a tree structure. It breaks a dataset down into smaller and smaller subsets while an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches, each representing a value of the attribute tested, and a leaf node represents a decision on the numerical target. The topmost decision node, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.
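A regression tree of this kind can be grown with the `rpart` package (shipped with R as a recommended package). A minimal sketch, again using `mtcars` as a stand-in for the crop data:

```r
# Regression tree: method = "anova" for a numerical target.
# mtcars stands in for the cost-of-cultivation data.
library(rpart)
tree <- rpart(mpg ~ wt + hp + disp, data = mtcars, method = "anova")
print(tree)      # decision nodes and leaf nodes as indented text
printcp(tree)    # complexity table: error at each candidate tree size
```

`print(tree)` shows the splits chosen at each decision node; the first split listed is the root node, i.e. the best predictor.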
Output:
Interpretation:
3) Random Forest Model
The random forest is one of the most effective machine learning models for predictive
analytics, making it an industrial workhorse for machine learning. A random forest is an
ensemble model that makes predictions by averaging (or voting over) the decisions of many
decision trees, each grown on a bootstrap sample of the data.
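A sketch of fitting such an ensemble, assuming the CRAN package `randomForest` is installed (`mtcars` again stands in for the crop data):

```r
# Random forest: an ensemble of trees, each grown on a bootstrap sample;
# predictions are the average over all trees. Assumes the CRAN package
# randomForest is installed.
library(randomForest)
set.seed(99)                  # reproducible bootstrap samples
rf <- randomForest(mpg ~ wt + hp + disp, data = mtcars, ntree = 500)
print(rf)                     # out-of-bag error estimate
importance(rf)                # variable importance scores
```

The out-of-bag error printed by `print(rf)` is an internal cross-validation estimate, obtained from the rows each tree did not see during training.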
Output
Interpretation:
Accuracy
Classification
Classification is the process of assigning observations to classes based on their characteristics. It is a useful tool for statistical surveys.
KNN Classification
The KNN (k-nearest neighbours) algorithm is one of the simplest and most widely used
classification algorithms. KNN is a non-parametric, lazy learning algorithm. Its purpose is to
use a database in which the data points are separated into several classes to predict the class of
a new sample point. KNN stores all available cases and classifies new cases by a similarity
measure, e.g. a distance function: a new point is assigned to the class that is most common
among its k nearest neighbours.
Commands:
library(dplyr)
library(class)                          # provides knn()
T1=read.csv(file.choose(),header=TRUE)
View(T1)
str(T1)
set.seed(99)                            # reproducible results
T2=T1[,c(1,3,4,5)]                      # keep Crop plus the three numeric columns
head(T2)
# Min-max normalisation so every feature lies in [0, 1]
normalize=function(x){return((x-min(x))/(max(x)-min(x)))}
T2.new=as.data.frame(lapply(T2[,c(2,3,4)],normalize))
head(T2.new)
T2.train=T2.new[1:35,]                  # first 35 rows for training
T2.train.target=T2[1:35,1]
T2.test=T2.new[36:49,]                  # remaining 14 rows for testing
T2.test.target=T2[36:49,1]
summary(T2.new)
T_model=knn(train = T2.train,test = T2.test,
cl=T2.train.target,k=7)
T_model
plot(T_model)
table(T2.test.target,T_model)           # confusion table: actual vs predicted
Output
Table
Interpretation
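The accuracy of the model can be read off the confusion table: correct predictions sit on its diagonal, so accuracy is the diagonal sum over the table total. A self-contained sketch of the same computation, using the built-in `iris` data in place of the crop data:

```r
# Accuracy from a kNN confusion table: correct predictions lie on the
# diagonal. iris stands in for the crop data here.
library(class)
set.seed(99)
idx  <- sample(nrow(iris), 100)                    # 100 training rows
pred <- knn(train = iris[idx, 1:4], test = iris[-idx, 1:4],
            cl = iris$Species[idx], k = 7)
tab  <- table(actual = iris$Species[-idx], predicted = pred)
accuracy <- sum(diag(tab)) / sum(tab)              # proportion correct
accuracy
```

The same two lines (`table(...)` then `sum(diag(tab))/sum(tab)`) apply directly to the `T2.test.target` vs `T_model` table above.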
Clustering
Cluster analysis, or clustering, is the technique of grouping objects so that objects in the same group (cluster) are more similar to each other than to those in other groups.
K Means clustering:
K-means clustering aims to partition n observations into k clusters, with each observation belonging to the cluster with the nearest mean (centroid). It is an unsupervised learning method, used when the data are unlabelled.
Commands
T3=T2
T3
T3$Crop=NULL                            # drop the label; clustering is unsupervised
head(T3)
results.T=kmeans(T3,3)                  # k = 3 clusters
results.T
results.T$size
results.T$cluster
table(results.T$cluster)
plot(T3$Cost.of.Cultivation_1~T3$Cost.of.Production,col=results.T$cluster)
library(ggplot2)
ggplot(T3,aes(x=Cost.of.Cultivation_1,y=Cost.of.Production))+
geom_point(aes(col=factor(results.T$cluster)))   # factor() gives discrete colours
Output
Interpretation
Hierarchical Clustering:
Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups
similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster is
distinct from each other cluster, and the objects within each cluster are broadly similar to each
other.
Commands
#### Method complete ####
d=dist(T2[,2:4])                        # Euclidean distance matrix, computed once
clust1=hclust(d,method = "complete")
plot(clust1,cex=0.3)                    # dendrogram
rect.hclust(clust1,k=3,border = c("red","blue","green"))  # boxes drawn on the dendrogram
cutting1=cutree(clust1,3)               # cut into 3 clusters
plot(cutting1)
table(T2$Crop,cutting1)
#### Method ward.D ####
clust2=hclust(d,method = "ward.D")
plot(clust2,cex=0.3)
rect.hclust(clust2,k=3,border = c("red","blue","green"))
cutting2=cutree(clust2,3)
plot(cutting2)
table(T2$Crop,cutting2)
#### Method average ####
clust3=hclust(d,method = "average")
plot(clust3,cex=0.3)
rect.hclust(clust3,k=3,border = c("red","blue","green"))
cutting3=cutree(clust3,3)
plot(cutting3)
table(T2$Crop,cutting3)
Output
Method complete
Method Ward.d
Method Average