Sahanashree Ex-2 ML (2)

The document outlines a lab exercise focused on data quality assessment and similarity analysis, specifically addressing outlier detection and handling missing values in a dataset. It includes functions for identifying outliers using both the IQR and Z-score methods, as well as techniques for imputing missing values. Additionally, it discusses computing and visualizing a correlation matrix for numeric attributes in the dataset.

Uploaded by MOHANA

LAB 2: DATA QUALITY ASSESSMENT & SIMILARITY ANALYSIS

Q1. Outlier detection.

(i) For your dataset, identify any duplicates and remove them, retaining only their first occurrence.

(ii) Going by each attribute, use the two techniques discussed in class to identify outliers. You may find it useful to write two functions - OutlierFromIQR() and OutlierFromZscore() - to process your dataset. An outlier that lies 3*IQR outside Q1 or Q3 is an extreme outlier - these may also be identified.
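As a small worked illustration of the fences described in (ii), the sketch below computes the 1.5*IQR and 3*IQR limits for a made-up vector. The vector x and the helper name iqr_fences are assumptions for illustration only; they are not part of the lab dataset or of the functions the exercise asks for.

```r
# Hypothetical toy data; not taken from Drug.csv
x <- c(10, 12, 11, 13, 12, 14, 11, 20, 60)

# Illustrative helper: fences at k * IQR beyond the quartiles
iqr_fences <- function(v, k) {
  q <- quantile(v, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

mild    <- iqr_fences(x, 1.5)  # Q1 = 11, Q3 = 14, so fences are 6.5 and 18.5
extreme <- iqr_fences(x, 3)    # fences are 2 and 23

mild_outliers    <- x[x < mild[["lower"]] | x > mild[["upper"]]]
extreme_outliers <- x[x < extreme[["lower"]] | x > extreme[["upper"]]]

print(mild_outliers)     # 20 60
print(extreme_outliers)  # 60
```

Here 20 is an ordinary outlier (beyond 1.5*IQR but within 3*IQR), while 60 is also an extreme outlier.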

# Q1
Drug <- read.csv("Drug.csv")
head(Drug)  # preview the first rows instead of printing the whole dataset

# (i) Remove duplicate rows, keeping the first occurrence
Drug <- Drug[!duplicated(Drug), ]

# Function to detect outliers using the IQR method.
# Returns outliers (beyond 1.5 * IQR) and extreme outliers (beyond 3 * IQR).
OutlierFromIQR <- function(column) {
  Q1 <- quantile(column, 0.25, na.rm = TRUE)
  Q3 <- quantile(column, 0.75, na.rm = TRUE)
  iqr <- Q3 - Q1  # lower-case name avoids masking stats::IQR()
  lower_bound <- Q1 - 1.5 * iqr
  upper_bound <- Q3 + 1.5 * iqr
  extreme_lower <- Q1 - 3 * iqr
  extreme_upper <- Q3 + 3 * iqr
  # which() drops NA comparisons, so missing values are not returned as outliers
  list(
    Outliers = column[which(column < lower_bound | column > upper_bound)],
    Extreme_Outliers = column[which(column < extreme_lower | column > extreme_upper)]
  )
}

# Function to detect outliers using the Z-score method (|z| > 3)
OutlierFromZscore <- function(column) {
  mean_col <- mean(column, na.rm = TRUE)
  sd_col <- sd(column, na.rm = TRUE)
  z_scores <- (column - mean_col) / sd_col
  # which() drops NA z-scores (missing values or a constant column)
  list(Outliers = column[which(abs(z_scores) > 3)])
}

# Apply functions to numeric columns
numeric_columns <- names(Drug)[sapply(Drug, is.numeric)]

# Drug[numeric_columns] stays a data frame even when only one column is numeric
outliers_iqr <- lapply(Drug[numeric_columns], OutlierFromIQR)
outliers_zscore <- lapply(Drug[numeric_columns], OutlierFromZscore)

# Print results
print("Outliers detected using IQR method:")
print(outliers_iqr)

print("Outliers detected using Z-score method:")
print(outliers_zscore)
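
When the two result lists disagree, one possible reason is sample size. An extreme value inflates both the mean and the standard deviation, and with the sample standard deviation |z| can never exceed (n - 1) / sqrt(n). The sketch below uses a made-up vector (an assumption, not data from Drug.csv) to show a value that the IQR rule would flag but the |z| > 3 rule cannot.

```r
# Hypothetical toy data with one obvious extreme value
x <- c(10, 12, 11, 13, 12, 14, 11, 20, 60)

z <- (x - mean(x)) / sd(x)
max_abs_z <- max(abs(z))

# Upper limit on |z| for a sample of size n (8/3 when n = 9)
z_bound <- (length(x) - 1) / sqrt(length(x))

flagged_by_z <- x[which(abs(z) > 3)]

print(max_abs_z)     # below 3, so nothing is flagged
print(z_bound)
print(flagged_by_z)  # numeric(0)
```

For samples of fewer than about 11 observations the bound is below 3, so the Z-score rule cannot flag anything and the IQR rule is the more informative of the two.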

library(ggplot2)

# Boxplot of Na
# Note: this redefines OutlierFromIQR() to return a logical mask (TRUE = outlier)
# instead of the list returned by the version above.
OutlierFromIQR <- function(column) {
  Q1 <- quantile(column, 0.25, na.rm = TRUE)
  Q3 <- quantile(column, 0.75, na.rm = TRUE)
  iqr <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * iqr
  upper_bound <- Q3 + 1.5 * iqr
  column < lower_bound | column > upper_bound
}

# Detect outliers in 'Na' column
Drug$outlier <- as.factor(OutlierFromIQR(Drug$Na))

# Create boxplot for 'Na' with outliers highlighted in red.
# Colour is mapped only in geom_jitter() so the boxplot is not split into two groups.
ggplot(Drug, aes(x = "", y = Na)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 16, outlier.size = 2) +
  geom_jitter(aes(color = outlier), width = 0.2, alpha = 0.5) +  # jitter for better visibility
  scale_color_manual(values = c("FALSE" = "black", "TRUE" = "red")) +  # normal = black, outliers = red
  labs(title = "Boxplot of Na", x = "", y = "Na", color = "Outlier") +
  theme_minimal()

# Scatter plot of Na
# Reuses the logical-mask version of OutlierFromIQR() defined for the boxplot.
Drug$outlier <- as.factor(OutlierFromIQR(Drug$Na))

# Create scatter plot for 'Na' column
ggplot(Drug, aes(x = seq_along(Na), y = Na, color = outlier)) +
  geom_point(size = 2, alpha = 0.7) +
  scale_color_manual(values = c("FALSE" = "black", "TRUE" = "red")) +  # normal = black, outliers = red
  labs(title = "Scatter Plot of Na with Outliers", x = "Index", y = "Na", color = "Outlier") +
  theme_minimal()

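# Q2
# Re-read the data, treating empty strings and "NA" as missing values
Drug <- read.csv("Drug.csv", na.strings = c("", "NA"))
Drug <- Drug[!duplicated(Drug), ]  # repeat the de-duplication from Q1 (i)

# Count missing values per column
colSums(is.na(Drug))

# Impute numeric columns with median and categorical columns with mode
for (col in names(Drug)) {
  missing <- is.na(Drug[[col]])
  if (!any(missing) || all(missing)) next  # nothing to fill, or nothing to fill it with
  if (is.numeric(Drug[[col]])) {
    Drug[[col]][missing] <- median(Drug[[col]], na.rm = TRUE)
  } else {
    mode_value <- names(sort(table(Drug[[col]]), decreasing = TRUE))[1]
    Drug[[col]][missing] <- mode_value
  }
}

# Check missing values after imputation
colSums(is.na(Drug))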
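
To see what the imputation loop does, here is a minimal sketch on a made-up data frame. The column names Age and BP, their values, and the helper mode_of are assumptions chosen for illustration; they are not read from Drug.csv.

```r
# Hypothetical toy data with one missing value per column
toy <- data.frame(
  Age = c(23, NA, 47, 35, 61),
  BP  = c("HIGH", "LOW", NA, "HIGH", "NORMAL"),
  stringsAsFactors = FALSE
)

# Illustrative helper: most frequent non-missing value (first one wins on ties)
mode_of <- function(v) names(which.max(table(v)))

for (col in names(toy)) {
  missing <- is.na(toy[[col]])
  if (is.numeric(toy[[col]])) {
    toy[[col]][missing] <- median(toy[[col]], na.rm = TRUE)
  } else {
    toy[[col]][missing] <- mode_of(toy[[col]])
  }
}

print(toy)  # Age[2] becomes 41 (median of 23, 35, 47, 61); BP[3] becomes "HIGH"
```

The median is used for numeric columns because, unlike the mean, it is not pulled towards the outliers identified in Q1.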

# Q3

# Load necessary libraries, installing them only if they are missing
for (pkg in c("ggcorrplot", "dplyr")) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}
library(ggcorrplot)
library(dplyr)

# Select numeric attributes (where() needs dplyr >= 1.0.0; select_if() is superseded)
numeric_columns <- select(Drug, where(is.numeric))

# (i) Compute correlation matrix for numeric attributes
cor_matrix <- cor(numeric_columns, use = "pairwise.complete.obs")

# (ii) Visualize correlation using a heatmap
ggcorrplot(cor_matrix, lab = TRUE, colors = c("blue", "white", "red"))

# (iii) Identify attribute pairs with high correlation (|r| > 0.5).
# upper.tri() skips the diagonal and lists each pair once.
high_corr <- which(abs(cor_matrix) > 0.5 & upper.tri(cor_matrix), arr.ind = TRUE)
print("Highly Correlated Attributes (|correlation| > 0.5):")
print(high_corr)
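
The index matrix returned by which(..., arr.ind = TRUE) holds row and column positions, which are hard to read on their own. A minimal sketch of turning them into named pairs follows; the 3 x 3 matrix cm and its attribute names A, B, C are made up for illustration and are not results from the Drug data.

```r
# Hypothetical correlation matrix for three attributes
cm <- matrix(c( 1.0,  0.8, -0.2,
                0.8,  1.0, -0.6,
               -0.2, -0.6,  1.0),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Positions of strong correlations above the diagonal
idx <- which(abs(cm) > 0.5 & upper.tri(cm), arr.ind = TRUE)

# Translate positions into attribute names and attach the coefficient
high_pairs <- data.frame(
  attr1 = rownames(cm)[idx[, "row"]],
  attr2 = colnames(cm)[idx[, "col"]],
  r     = cm[idx]
)

print(high_pairs)  # A-B with r = 0.8, B-C with r = -0.6
```

The same three steps should apply to cor_matrix, since cor() keeps the column names of its input as dimnames.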

# (iv) Compute distance matrix over the numeric attributes.
# dist() measures Euclidean distance between rows (observations).
distance_matrix <- dist(numeric_columns, method = "euclidean")
print(as.matrix(distance_matrix))
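
Euclidean distance is sensitive to the units of each attribute, so a column with a large range can dominate the result. The sketch below, on made-up values (the columns Age and Na and their numbers are assumptions, not rows from Drug.csv), compares raw distances with distances computed after standardizing each column with scale(). Whether standardizing is wanted for this exercise is a judgment call; this is only one option.

```r
# Hypothetical toy data: Age spans tens of units, Na spans tenths
toy <- data.frame(Age = c(20, 40, 60), Na = c(0.50, 0.90, 0.70))

# Raw distances are driven almost entirely by Age
d_raw <- as.matrix(dist(toy, method = "euclidean"))

# scale() centers each column and divides by its standard deviation
d_scaled <- as.matrix(dist(scale(toy), method = "euclidean"))

print(d_raw)     # row 1 vs row 2 is about 20.004
print(d_scaled)  # row 1 vs row 2 is sqrt(5), about 2.236
```

After scaling, both attributes contribute on the same footing: rows 2 and 3 are the closest pair, whereas in the raw matrix the gap in Age alone decides the ordering.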
