Aditya Garg DMDW

A crop researcher wants to test the effect of three fertilizer mixtures on crop yield using a one-way ANOVA. The researcher will perform a one-way ANOVA on the data to see if there are statistically significant differences in crop yields between the three groups. The one-way ANOVA will calculate test statistics and compare them to an alpha level of 0.05 to determine if the null hypothesis that the group means are identical can be rejected or not. A conclusion will be made and the results will be plotted in a graph.

Dr. B. R. Ambedkar National Institute of Technology
Jalandhar, Punjab

Session : June-Dec 2020

CXS – 425
Data Mining and Data Warehousing Lab

Submitted to: Dr. Kunwar Pal, CSE Department
Submitted by: Kunal Khandelwal (17103045), G2

Assignment 1
1.
a. Find the matrix-matrix multiplication AB.
b. Find (AB)^T and (AB)^-1.
c. Find the mean and standard deviation of each row and column of the matrices A, B, AB and (AB)^-1.
A <- rbind(c(3,-2,1),c(-1,4,-2))
B <- rbind(c(-7,4),c(9,5),c(2,-1))

print("Matrix A : ")
print(A)
print("Matrix B :")
print(B)

#AB
C <-A%*%B
print("Multiplication AB :")
print(C)

#T(AB)
T <-t(C)
print("Transpose of Matrix AB :")
print(T)

#I(AB)
I <- solve(C)
print("Inverse of Matrix AB :")
print(I)

#Mean
print("Mean of Matrix A :")
#Row
mean(A[1,])
mean(A[2,])
#column
mean(A[,1])
mean(A[,2])
mean(A[,3])

print("Mean of Matrix B :")


#Row
mean(B[1,])
mean(B[2,])
mean(B[3,])
#column
mean(B[,1])
mean(B[,2])

print("Mean of Matrix AB :")


#Row
mean(C[1,])
mean(C[2,])
#column
mean(C[,1])
mean(C[,2])

print("Mean of Matrix Inverse of AB :")


#Row
mean(I[1,])
mean(I[2,])
#column
mean(I[,1])
mean(I[,2])

#Standard Deviations (note: sd() on a matrix treats all of its entries as one vector)
print("Standard deviation of matrix A :")
sd(A,na.rm=TRUE)
print("Standard deviation of matrix B :")
sd(B,na.rm=TRUE)
print("Standard deviation of matrix AB :")
sd(C,na.rm=TRUE)
print("Standard deviation of matrix inverse of AB :")
sd(I,na.rm=TRUE)
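The per-row and per-column means above can be collapsed into single calls, and the per-row/per-column standard deviations the question asks for can be obtained with apply(); a minimal sketch reusing the same matrices:

```r
A <- rbind(c(3, -2, 1), c(-1, 4, -2))
B <- rbind(c(-7, 4), c(9, 5), c(2, -1))
C <- A %*% B

rowMeans(A)      # one call instead of mean(A[1,]), mean(A[2,])
colMeans(A)

apply(C, 1, sd)  # standard deviation of each row (MARGIN = 1)
apply(C, 2, sd)  # standard deviation of each column (MARGIN = 2)
```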

OUTPUT

2. Write a "Function" program in R to find n!; hence find 13! and 32!. Do not name the function "Factorial". You can initialize 0! = 1 and 1! = 1.

findfactorial <- function(n){

factorial <- 1
if((n==0||n==1))
factorial <- 1
else{
for(i in 1:n)
factorial <- factorial*i
}
return (factorial)
}

print(findfactorial(13))
print(findfactorial(32))

OUTPUT

3. Write a "Function" program in R to find the maximum and minimum from a set of numbers. Do not name the function "max" or "min". As input, take (4, 44.7, 2, 40, 54, 1, 3, 4).

vector1 <- c(4,44.7,2,40,54,1,3,4)


l <- length(vector1)

min1 = vector1[1]   # initialise from the data instead of arbitrary sentinels
max1 = vector1[1]

for(i in 1:l){
if(min1>vector1[i]){
min1 = vector1[i];
}
if(max1<vector1[i]){
max1 = vector1[i];
}
}

print(paste("Minimum is", min1))


print(paste("Maximum is", max1))

OUTPUT

ASSIGNMENT 2

1. How to read/write data from the dataset in R.

In R, we can easily write data frames to a file using the write.table() command.
write.table(cars1, file = "cars1.txt", quote = FALSE)
The first argument is the data frame to be written to the output file; the second is the name of the output file. By default, R surrounds each entry in the output file with quotes, so we pass quote = FALSE.
The function read.table("/location") can then be used to read the data frame back directly.
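A minimal round trip with these two functions (using the built-in cars data frame as a stand-in for cars1):

```r
cars1 <- cars                                 # built-in data frame: speed, dist
f <- tempfile(fileext = ".txt")
write.table(cars1, file = f, quote = FALSE)   # quote = FALSE suppresses the quotes
cars2 <- read.table(f)                        # read the data frame straight back
head(cars2)
```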

Code:

library(readxl)   # provides read_excel()
data <- read_excel("BEPSxls.xlsx")


View(data)

OUTPUT:

2. Use different functions in R.


a. Read
b. Head
c. Tail
d. Names

CODE:

data <- read_excel("BEPSxls.xlsx")


#data-read
print("*************************");
print(data)
head(data,6)
tail(data,6)

print("********Data Head ***********")


#data-head
print(head(1:50,10))

print("********Data Tail***********")
#data-tail
print(tail(1:5,1))
print("******Names Data ***********")
print(names(data))

OUTPUT:

3. Download the given dataset and perform the following.


a. Mean
b. Median
c. Summary
d. Histogram
e. Plot

Code:
dataset<- read_excel("BEPSxls.xlsx")
mean(dataset$age)
median(dataset$age)
summary(dataset)
hist(dataset$age,main = 'AGE HISTOGRAM')
plot(dataset$Blair)

OUTPUT:

4. Attach and detach the dataset in R.

data <- data.frame(x1 = c(9, 8, 3, 4, 8),


x2 = c(5, 4, 7, 1, 1),
x3 = c(1, 2, 3, 4, 5))
data
x1 #give error
attach(data)
x1 #run
detach(data)
x1 # give error

library(readxl)
dataset=read_excel(file.choose())
#For dataset
attach(dataset)
cat(gender)
detach(dataset)
cat(gender)

ASSIGNMENT 3

1. Demonstration of pre-processing on dataset mtcars(R-studio)

Code:
mtcars
# replace any NA in mpg with the column mean
mtcars$mpg = ifelse(is.na(mtcars$mpg),
                    ave(mtcars$mpg, FUN = function(x) mean(x, na.rm = TRUE)),
                    mtcars$mpg)

OUTPUT:

2. Demonstrate the filter function on dataset mtcars using the dplyr package.

a. Show rows where the gear attribute = 4
b. Show rows where disp = 160
c. Show different logical operations (and, or, not)

CODE:
library(dplyr)

#1 Show where gear attribute = 4,


gear_4 <- filter(mtcars, gear == 4)
head(gear_4)

#2 Show where disp = 160.


disp_160 <- filter(mtcars, disp == 160.0)
head(disp_160)

#3 Show different operations (and, or, not)


#AND
gear4_and_carb4 <- filter(mtcars, gear == 4 & carb == 4)
head(gear4_and_carb4)
#OR

gear4_or_hp110 <- filter(mtcars, gear == 4 | hp == 110)


head(gear4_or_hp110)
#Not
gearNot4 <- filter(mtcars, gear != 4)
head(gearNot4)

OUTPUT:

3. Demonstrate different functions on dataset mtcars/Titanic


a. arrange
b. group_by
c. summarise
d. select
e. intersect
f. setdiff

CODE:
print("Arrange : ")
arrange(mtcars, desc(disp))

print("Group By : ")
group_by(mtcars,drat)

print("Summarise : ")
summarise(mtcars,mean(disp))

print("Select : ")
select(mtcars,qsec)

print("Intersect :")
A <- subset(mtcars, disp == 160)
B <- subset(mtcars, cyl == 6)   # note: use ==, not =, inside subset()
intersect(A,B)

print("SetDiff :")
setdiff(B,A)

4. Remove the columns that are not required from the mtcars dataset

Code:
DATA <- subset(mtcars,select=c(1:9))
print(DATA)

5. Show the attribute containing NA values in a column in dataset

Code:
myData <- data.frame(col1 = c(1:3, NA),
col2 = c("this", NA,"is", "text"),
col3 = c(TRUE, FALSE, TRUE, TRUE),
col4 = c(2.5, NA, 3.2, NA))
is.na(myData)

6. Repeat all the above question on downloaded dataset

Attributes containing NA VALUES

CODE:
is.na(mtcars)

ASSIGNMENT 4

1. As a crop researcher, you want to test the effect of three different fertilizer mixtures on crop yield. You can use a one-way ANOVA to find out if there is a difference in crop yields between the three groups. Using the data, perform a one-way analysis of variance with alpha = 0.05.
a. Perform a one-way analysis of variance
b. Calculate test statistics
c. Interpreting the results
d. State conclusion
e. Plot the graph for the same

The one-way analysis of variance (ANOVA), also known as one-factor ANOVA, is an extension of the independent two-sample t-test for comparing means when there are more than two groups. In one-way ANOVA, the data are organized into several groups based on a single grouping variable (also called a factor variable). One-way ANOVA is used to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups. To check whether the data come from the same population, you can perform a one-way ANOVA. Like any other statistical test, it provides evidence on whether the null hypothesis H0 can be rejected.

Hypothesis in one-way ANOVA test:


• H0: the means of the groups are identical
• H1: the mean of at least one group is different

In other words, H0 implies that there is not enough evidence to conclude that the group (factor) means differ from one another.
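The F statistic that aov() reports can also be computed by hand from these definitions; a short sketch on the built-in PlantGrowth data (group means against the grand mean):

```r
d <- PlantGrowth                          # built-in: weight, group (3 levels)
grand <- mean(d$weight)
gm <- tapply(d$weight, d$group, mean)     # group means
n  <- tapply(d$weight, d$group, length)   # group sizes

ssb <- sum(n * (gm - grand)^2)            # between-group sum of squares
ssw <- sum((d$weight - gm[d$group])^2)    # within-group sum of squares
dfb <- nlevels(d$group) - 1
dfw <- nrow(d) - nlevels(d$group)

Fstat <- (ssb / dfb) / (ssw / dfw)        # the one-way ANOVA test statistic
pval  <- pf(Fstat, dfb, dfw, lower.tail = FALSE)
c(F = Fstat, p = pval)                    # compare p against alpha = 0.05
```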

Code:

library(readxl)
my_data <- read_excel('DMDW_LAB4.xlsx')


View(my_data)

#check and display ordered levels


my_data$group <- ordered(my_data$group, levels = c("Group1", "Group2",
"Group3"))

#compute summary statistics by group


library(dplyr)
group_by(my_data, group) %>%
summarise(
count = n(),mean = mean(values, na.rm = TRUE),
sd = sd(values, na.rm = TRUE)
)

#compute one way ANOVA


#compute analysis of variance
res.aov <- aov(values ~ group, data = my_data)
#summary of analysis
summary(res.aov)

#interpret result of ANOVA


#multiple pairwise comparison
TukeyHSD(res.aov)
#homogeneity
plot(res.aov,1)
#normality
plot(res.aov,2)

OUTPUT:

2. Repeat question 1 and perform a one-way analysis of variance using a built-in dataset in RStudio.

Code:
#build data
library(dplyr)   # needed before sample_n() below
my_data <- PlantGrowth

#check data and display ordered levels
sample_n(my_data, 10)

#show levels
levels(my_data$group)

#compute summary statistics


library(dplyr)
group_by(my_data, group) %>%
summarise(
count = n(),
mean = mean(weight, na.rm = TRUE),
sd = sd(weight, na.rm = TRUE)
)

#compute anova test


# Compute the analysis of variance
res.aov <- aov(weight ~ group, data = my_data)
# Summary of the analysis
summary(res.aov)

#Interpret the result of one-way ANOVA tests


#multiple pairwise comparison
TukeyHSD(res.aov)
#Homogeneity of variances
plot(res.aov, 1)
#Normality
plot(res.aov, 2)

OUTPUT:

ASSIGNMENT 5

1. Consider dataset “Groceries” and apply apriori algorithm on it. What are the
first 5 rules generated when the min support is 0.001 and min confidence is 0.9

Code:

library(arules)
data("Groceries")   # the Groceries transactions dataset ships with arules
rules = apriori(data = Groceries,
                parameter = list(support = 0.001, confidence = 0.9))
inspect(rules[1:5])

OUTPUT:

2. The database has four transaction. What association rule can be found in this set,
if the minimum support is 60% and minimum confidence is 80%.
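Support and confidence can be verified by hand before running apriori; a sketch on four made-up transactions (the items here are illustrative, not the assignment's actual database):

```r
# Four hypothetical transactions as item sets
tx <- list(c("milk", "bread"),
           c("milk", "bread", "butter"),
           c("milk", "butter"),
           c("milk", "bread"))

# support(X) = fraction of transactions containing every item in X
support <- function(items) mean(sapply(tx, function(t) all(items %in% t)))

# Rule {milk} -> {bread}: confidence = support(X and Y) / support(X)
supp_rule <- support(c("milk", "bread"))   # 3/4 = 0.75, passes 60% min support
conf_rule <- supp_rule / support("milk")   # 0.75 / 1.00 = 0.75, fails 80% min confidence
c(support = supp_rule, confidence = conf_rule)
```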

Code:

library(arules)
# read the four transactions as an arules transactions object (one transaction per line)
groceries2 <- read.transactions("LAB5-2.csv", sep = ",")

rules = apriori(data = groceries2, parameter = list(support = 0.6, confidence = 0.8))

rules
inspect(rules)

Output:

3. Demonstration of association rule process on dataset titanic using apriori


algorithm in rstudio.

Code:
library(arules)
library(readr)
titanic <- read_csv("titanic.csv")
# apriori expects transactions, so convert the columns to factors first
titanic <- as(data.frame(lapply(titanic, factor)), "transactions")
rules = apriori(data = titanic, parameter = list(support = 0.6, confidence = 0.8))
rules
inspect(rules[1:5])

OUTPUT:

ASSIGNMENT 6

1. Demonstrate performing linear regression on given data using R/Python.


a. Plot the scattered graph
b. Calculate test statistics
c. Find coefficient and different performance matrix

Code:
library(readxl)
dataset <- read_excel("LAB6.xlsx")
summary(dataset)
hist(dataset$X)
plot(Y~X, data=dataset)
dataset.lm <- lm(Y~X, dataset)
summary(dataset.lm)
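The coefficients and R-squared that summary() reports can be cross-checked with the closed-form formulas for simple regression; a sketch on the built-in cars data (a stand-in, since LAB6.xlsx is not reproduced here):

```r
x <- cars$speed
y <- cars$dist

b1 <- cov(x, y) / var(x)        # slope = Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)    # the fitted line passes through the means
r2 <- cor(x, y)^2               # R-squared for simple linear regression

fit <- lm(dist ~ speed, data = cars)
c(intercept = b0, slope = b1, r.squared = r2)   # matches coef(fit) and summary(fit)$r.squared
```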

Output:

2. Demonstrate performing linear regression on Lung capacity dataset using


R/Python.

Code:

dataset <- read_excel("Lung Capacity.xls")


summary(dataset)
cor(dataset$Height, dataset$LungCapacity)
cor(dataset$Age, dataset$LungCapacity)
plot(LungCapacity ~ Exercise, data = dataset)   # formula interface; default plot() takes no data argument

dataset.lm <- lm(LungCapacity ~ Gender + Height + Smoker + Exercise,
                 data = dataset)
summary(dataset.lm)

Output:

ASSIGNMENT 7

1. To construct Decision tree for weather data and classify it.

Code:

library(rpart.plot)
library(rpart)
library(readr)   # read_csv()
library(dplyr)   # mutate() and %>% used below
dataset <- read_csv("austin_weather.csv")
head(dataset)
shuffle_index<-sample(1:nrow(dataset))
dataset <- dataset[shuffle_index,]
ls(dataset)
sum(is.na(dataset$Events))
dim(dataset)

sum(is.na(dataset$DewPointAvgF))
summary(dataset$TempHighF)

dataset = subset(dataset, select = -c(Date,Events,TempAvgF, DewPointAvgF,


HumidityAvgPercent,SeaLevelPressureAvgInches, VisibilityAvgMiles, WindAvgMPH ))

str(dataset)
dataset[] <- lapply(dataset, as.numeric)

dataset <- dataset %>%


mutate(TempHighF = case_when(
TempHighF < 40 ~ "<40",
TempHighF >= 40 & TempHighF < 60 ~ "40-60",
TempHighF >= 60 & TempHighF < 80 ~ "60-80",
TempHighF >= 80 & TempHighF < 100 ~ "80-100",
TempHighF >= 100 ~ ">100",
TRUE ~ "NA"
))

fit <- rpart(TempHighF~., data = dataset, method = 'class')


rpart.plot(fit, extra = 106)

Output:

2. To construct Decision tree for customer data and classify it.

Code:
library(readr)
library(rpart)
library(rpart.plot)
dataset <- read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

dim(dataset)

ls(dataset)

dataset = subset(dataset, select = -c(customerID ))

fit <- rpart(Churn~., data = dataset, method = 'class')

rpart.plot(fit, extra = 106)

Output:

ASSIGNMENT 8

1. Write a procedure for clustering customer data using Simple KMeans Algorithm

 Step 1: Choose k initial cluster centres at random in the feature space.
 Step 2: Assign each observation to its nearest centre by minimising the distance to the centroid. This produces the initial groups.
 Step 3: Shift each centroid to the mean of the coordinates of its group.
 Step 4: Reassign observations using the new centroids. New boundaries are created, so observations may move from one group to another.
 Repeat until no observation changes group.
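The steps above can be sketched in a few lines of base R (toy one-dimensional data and k = 2; all names are illustrative):

```r
set.seed(1)
x <- c(rnorm(20, mean = 0), rnorm(20, mean = 5))  # toy data with two clear groups
k <- 2
centers <- sample(x, k)                           # Step 1: random initial centres

repeat {
  # Step 2: assign each observation to its nearest centre
  grp <- apply(abs(outer(x, centers, "-")), 1, which.min)
  # Step 3: move each centre to the mean of its group
  new_centers <- as.numeric(tapply(x, grp, mean))
  # Step 4: stop once no centre (and hence no assignment) changes
  if (isTRUE(all.equal(new_centers, centers))) break
  centers <- new_centers
}
centers   # should sit near the two group means
```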

2. Demonstration of clustering rule process on dataset using simple k-means.

Code:

library(readr)
dataset = read_csv("Mall_Customers.csv")
dataset = dataset[4:5]
set.seed(6)
wcss = vector()
for (i in 1:10) wcss[i] = sum(kmeans(dataset, i)$withinss)
plot(1:10,
wcss,
type = 'b',
main = paste('The Elbow Method'),
xlab = 'Number of clusters',
ylab = 'WCSS')
kmeans = kmeans(x = dataset, centers = 5)
y_kmeans = kmeans$cluster

# Visualising the clusters


library(cluster)
clusplot(dataset,
y_kmeans,
lines = 0,
shade = TRUE,
color = TRUE,
labels = 2,
plotchar = FALSE,
span = TRUE,
main = paste('Clusters of customers'),
xlab = 'Annual Income',
ylab = 'Spending Score')

Output:

ASSIGNMENT 9

1. Demonstration of classification rule process on dataset using naïve bayes


algorithm

Code:

# Installing Packages
install.packages("e1071")
install.packages("caTools")
install.packages("caret")

# Loading packages
library(e1071)
library(caTools)
library(caret)
library(dplyr)
library(readr)   # read_csv() used below

dataset = read_csv("Mall_Customers.csv")

dataset$Gender <- factor(dataset$Gender, levels = c("Male", "Female"),


labels = c(0,1))

dataset <- dataset %>%


mutate(Age = case_when(
Age < 30 ~ "<30",
Age >= 30 & Age < 45 ~ "30-45",
Age >= 45 & Age < 60 ~ "45-60",
Age >= 60 ~ ">60",
TRUE ~ "NA"
))

dataset <- dataset %>%


mutate(Income = case_when(
Income <40 ~ "<40",
Income >= 40 & Income < 60 ~ "40-60",
Income >= 60 ~ ">60",
TRUE ~ "NA"
))

dataset <- dataset %>%


mutate(Score = case_when(
Score < 20 ~ "<20",
Score >= 20 & Score < 40 ~ "20-40",
Score >= 40 & Score < 60 ~ "40-60",
Score >= 60 & Score < 80 ~ "60-80",
Score >= 80 ~ ">80",
TRUE ~ "NA"
))

trainIndex <- createDataPartition(dataset$Score, p = .7,


list = FALSE,
times = 1)

Train <- dataset[ trainIndex,]


Valid <- dataset[-trainIndex,]

# Fitting Naive Bayes Model


# to training dataset

classifier_cl <- naiveBayes(Score ~ ., data = Train)


classifier_cl

Output:

2. Demonstration of clustering rule process on dataset using EM algorithm.

Code:

install.packages("mixtools")
library(readr)
dataset = read_csv("Mall_Customers.csv")
summary(dataset$Score)
x <- dataset$Score
plot(density(x))

mem <- kmeans(x,2)$cluster


mu1 <- mean(x[mem==1])
mu2 <- mean(x[mem==2])
sigma1 <- sd(x[mem==1])
sigma2 <- sd(x[mem==2])
pi1 <- sum(mem==1)/length(mem)
pi2 <- sum(mem==2)/length(mem)
# modified sum only considers finite values
sum.finite <- function(x) {
sum(x[is.finite(x)])
}

Q <- 0
# starting value of expected value of the log likelihood
Q[2] <- sum.finite(log(pi1)+log(dnorm(x, mu1, sigma1))) +
sum.finite(log(pi2)+log(dnorm(x, mu2, sigma2)))

k <- 2

while (abs(Q[k]-Q[k-1])>=1e-6) {
# E step
comp1 <- pi1 * dnorm(x, mu1, sigma1)
comp2 <- pi2 * dnorm(x, mu2, sigma2)
comp.sum <- comp1 + comp2

p1 <- comp1/comp.sum
p2 <- comp2/comp.sum

# M step
pi1 <- sum.finite(p1) / length(x)
pi2 <- sum.finite(p2) / length(x)

mu1 <- sum.finite(p1 * x) / sum.finite(p1)


mu2 <- sum.finite(p2 * x) / sum.finite(p2)

sigma1 <- sqrt(sum.finite(p1 * (x-mu1)^2) / sum.finite(p1))


sigma2 <- sqrt(sum.finite(p2 * (x-mu2)^2) / sum.finite(p2))

p1 <- pi1
p2 <- pi2

k <- k + 1
Q[k] <- sum(log(comp.sum))
}

library(mixtools)
gm<-normalmixEM(x,k=2,lambda=c(0.9,0.1),mu=c(0.4,0.3),sigma=c(0.05,0.02))
gm$mu
gm$sigma
gm$lambda
hist(x, prob=T, breaks=32, xlim=c(range(x)[1], range(x)[2]), main='')
lines(density(x), col="green", lwd=2)
x1 <- seq(from=range(x)[1], to=range(x)[2], length.out=1000)
y <- pi1 * dnorm(x1, mean=mu1, sd=sigma1) + pi2 * dnorm(x1, mean=mu2,
sd=sigma2)
lines(x1, y, col="red", lwd=2)
legend('topright', col=c("green", 'red'), lwd=2, legend=c("kernel", "fitted"))

Output:

ASSIGNMENT 10

1. Build Data Warehouse, install and Explore WEKA.

The dataset Mall_Customers.csv was used. WEKA's Explorer visualises each column with a single click, and the data can then be clustered from the Cluster tab.



2. Perform data pre-processing tasks and Demonstrate performing association rule


mining on datasets using WEKA.

Dataset:

Min_Support : 50%

Using WEKA, we obtained the resulting association rules.
