
BUSINESS ANALYTICS

MGN801-CA2
KAJAL (11917586)
Section – Q1959

Introduction to Machine Learning


Machine learning is a tool for building logic from data. It learns from examples and past experience: instead of writing explicit rules, we feed data to generic algorithms, which produce an informative output that can then be used for forecasting or making predictions. The main goal of machine learning is to understand the nature of the data and convert it into models that people can further understand and use.

 Types of Machine Learning

 Supervised Learning- Supervised learning is a machine learning approach where
we take input variables and an output variable from the data and run an
algorithm to learn the mapping function from input to output. The goal is to
approximate the mapping function so well that, given new input data, we can
predict the output for that data. Classification and regression are the two
kinds of models used for prediction in supervised learning. In supervised
learning all the data is labeled, and algorithms are used to predict outputs
from the input data.

 Unsupervised Learning- Unsupervised learning is the other machine learning
approach, where we take only input data and no corresponding output variables.
The goal of unsupervised learning is to model the underlying structure or
distribution of the data in order to learn more about it. Unsupervised
learning is further grouped into two kinds of models, i.e. clustering and
association. In unsupervised learning the data is unlabeled, and algorithms
are used to learn the inherent structure from the input data.

Data set information


https://www.kaggle.com/srinivas1/agricuture-crops-production-in-india#datafile%20(1).csv

The data I have used for my assignment is taken from the link above. It shows the yield of
agricultural products based on the cost of cultivation and the cost of production.
REGRESSION:

Regression is a set of statistical processes used to estimate the relationships among
variables: one dependent (output) variable and one or more input variables. In this
assignment I have taken three input variables and one output variable.

Input variables – Cost of cultivation and production


Output – yield
1) Linear Regression Model

A linear regression model is simply a regression model made up of linear terms. It can be a
single-variable or a multivariable linear model: in a single-variable model we have one input
and one output, while in a multivariable model we have more than one input and one output.
In this case we are using a multivariable linear regression model.
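The commands for this model are not reproduced in the report, so here is a minimal sketch in R. The column names Cost.of.Cultivation_1 and Cost.of.Production are taken from the KNN commands later in the report; the Yield column and the synthetic data frame standing in for the Kaggle file are assumptions for illustration only.

```r
# Hypothetical stand-in for the Kaggle data (49 rows, as in the KNN split
# later in this report); the real file would be loaded with read.csv().
set.seed(99)
T1 <- data.frame(
  Cost.of.Cultivation_1 = runif(49, 10000, 40000),
  Cost.of.Production    = runif(49, 500, 3000)
)
# Synthetic yield with noise, so the fit has a relationship to recover
T1$Yield <- 2 + 1e-4 * T1$Cost.of.Cultivation_1 +
            1e-3 * T1$Cost.of.Production + rnorm(49, sd = 0.5)

# Multivariable linear regression: yield on both cost variables
lin_model <- lm(Yield ~ Cost.of.Cultivation_1 + Cost.of.Production, data = T1)
summary(lin_model)        # coefficients, residuals, R-squared
head(predict(lin_model))  # fitted yields for the first rows
```

summary() reports one coefficient per input variable plus the intercept, together with the R-squared of the fit.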
Output:
Interpretation:

2) Decision Tree Model

Decision tree learning builds regression or classification models in the form of a tree structure.
It breaks a dataset down into smaller and smaller subsets while, at the same time, an associated
decision tree is incrementally developed. The final result is a tree with decision nodes and leaf
nodes. A decision node has two or more branches, each representing a value of the attribute
tested. A leaf node represents a decision on the numerical target. The topmost decision node,
which corresponds to the best predictor, is called the root node. Decision trees can handle both
categorical and numerical data.
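A regression tree for the same setup can be sketched with the rpart package (shipped with standard R distributions). The data frame below is the same hypothetical stand-in used in the linear regression sketch, not the Kaggle file; minsplit is lowered because the dataset is small.

```r
library(rpart)  # recursive partitioning: regression and classification trees

# Hypothetical stand-in data (column names assumed from the KNN commands)
set.seed(99)
T1 <- data.frame(
  Cost.of.Cultivation_1 = runif(49, 10000, 40000),
  Cost.of.Production    = runif(49, 500, 3000)
)
T1$Yield <- 2 + 1e-4 * T1$Cost.of.Cultivation_1 +
            1e-3 * T1$Cost.of.Production + rnorm(49, sd = 0.5)

# method = "anova" fits a regression tree (numeric target)
tree_model <- rpart(Yield ~ Cost.of.Cultivation_1 + Cost.of.Production,
                    data = T1, method = "anova",
                    control = rpart.control(minsplit = 10))
printcp(tree_model)        # complexity table: number of splits vs. error
head(predict(tree_model))  # predicted yields for the first rows
```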
Output:

Interpretation:
3) Random Forest Model

The random forest is one of the most effective machine learning models for predictive
analytics, making it an industrial workhorse for machine learning. A random forest is an
ensemble model that makes predictions by combining the decisions of many decision trees,
each trained on a random sample of the data; for regression, the trees' predictions are averaged.
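A sketch of this model using the randomForest package (not part of base R; it would need install.packages("randomForest") first). As in the previous sketches, the data frame is a hypothetical stand-in for the Kaggle file, with assumed column names.

```r
library(randomForest)  # bagged ensembles of decision trees

# Hypothetical stand-in data (column names assumed from the KNN commands)
set.seed(99)
T1 <- data.frame(
  Cost.of.Cultivation_1 = runif(49, 10000, 40000),
  Cost.of.Production    = runif(49, 500, 3000)
)
T1$Yield <- 2 + 1e-4 * T1$Cost.of.Cultivation_1 +
            1e-3 * T1$Cost.of.Production + rnorm(49, sd = 0.5)

# 500 trees, each grown on a bootstrap sample; predictions are averaged
rf_model <- randomForest(Yield ~ Cost.of.Cultivation_1 + Cost.of.Production,
                         data = T1, ntree = 500, importance = TRUE)
print(rf_model)        # includes out-of-bag "% Var explained"
importance(rf_model)   # which cost variable contributes more
head(predict(rf_model))
```

The out-of-bag error printed by print(rf_model) gives an accuracy estimate without a separate test set, since each tree is evaluated on the rows it did not see during training.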
Output

Interpretation:
Accuracy
Classification
Classification is a process related to categorization, in which we assign observations to
different classes based on their characteristics. It is a useful tool for statistical surveys.
KNN Classification
The KNN algorithm is one of the simplest classification algorithms and one of the most used
learning algorithms. KNN is a non-parametric, lazy learning algorithm. Its purpose is to use a
database in which the data points are separated into several classes to predict the class of a
new sample point. KNN stores all available cases and classifies new cases based on similarity
measures, e.g. distance functions: a new point is assigned the class that is most common among
its k nearest neighbours.
Commands:
library(dplyr)
library(class)                           # provides knn()
T1=read.csv(file.choose(),header=TRUE)   # choose the downloaded Kaggle CSV
View(T1)
str(T1)
set.seed(99)                             # reproducible (knn() breaks ties at random)
T2=T1[,c(1,3,4,5)]                       # Crop label plus the three numeric columns
head(T2)
normalize=function(x){return((x-min(x))/(max(x)-min(x)))}  # min-max scaling to [0,1]
T2.new=as.data.frame(lapply(T2[,c(2,3,4)],normalize))
head(T2.new)

T2.train=T2.new[1:35,]                   # first 35 rows for training
T2.train.target=T2[1:35,1]               # their Crop labels
T2.test=T2.new[36:49,]                   # remaining 14 rows for testing
T2.test.target=T2[36:49,1]
summary(T2.new)
T_model=knn(train = T2.train,test = T2.test,
            cl=T2.train.target,k=7)      # classify each test row by its 7 nearest neighbours
T_model
plot(T_model)
table(T2.test.target,T_model)            # confusion matrix: actual vs. predicted crop
Output

Table

Interpretation
Clustering
Cluster analysis, or clustering, is a technique for grouping objects with similar
characteristics into the same group, so that objects in a group are more similar to each
other than to those in other groups.

K Means clustering:

K-means clustering aims to partition n observations into k clusters, with each observation
belonging to the cluster with the nearest mean (centroid). It is a form of unsupervised
learning, used when we have unlabeled data.
Commands
T3=T2
T3
T3$Crop=NULL                             # drop the label so clustering is unsupervised
head(T3)
results.T=kmeans(T3,3)                   # partition into k = 3 clusters
results.T
results.T$size                           # observations per cluster
results.T$cluster                        # cluster assignment of each row
table(results.T$cluster)
plot(T3$Cost.of.Cultivation_1~T3$Cost.of.Production,col=results.T$cluster)
library(ggplot2)
ggplot(T3,aes(x=Cost.of.Cultivation_1,y=Cost.of.Production))+
  geom_point(aes(col=factor(results.T$cluster)))  # factor() gives discrete colours
Output

Interpretation
Hierarchical Clustering:

Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that
groups similar objects into clusters. The endpoint is a set of clusters, where each cluster is
distinct from the others, and the objects within each cluster are broadly similar to each
other.
Commands
#### Method complete ####
d=dist(T2[,2:4])                         # Euclidean distances on the numeric columns
clust1=hclust(d,method = "complete")
plot(clust1,cex=0.3)                     # dendrogram
cutting1=cutree(clust1,3)                # cut the tree into 3 clusters
plot(cutting1)
table(T2$Crop,cutting1)                  # crops vs. cluster membership
rect.hclust(clust1,k=3,border = c("red","blue","green"))  # k matches the cut above

#### Method ward.D ####
clust2=hclust(d,method = "ward.D")
plot(clust2,cex=0.3)
cutting2=cutree(clust2,3)
plot(cutting2)
table(T2$Crop,cutting2)
rect.hclust(clust2,k=3,border = c("red","blue","green"))

#### Method average ####
clust3=hclust(d,method = "average")
plot(clust3,cex=0.3)
cutting3=cutree(clust3,3)
plot(cutting3)
table(T2$Crop,cutting3)
rect.hclust(clust3,k=3,border = c("red","blue","green"))
Output
Method complete
Method ward.D
Method Average
