Sahanashree Ex-2 ML (2)

The document outlines a lab exercise focused on data quality assessment and similarity analysis, specifically addressing outlier detection and handling missing values in a dataset. It includes functions for identifying outliers using both the IQR and Z-score methods, as well as techniques for imputing missing values. Additionally, it discusses computing and visualizing a correlation matrix for numeric attributes in the dataset.

Uploaded by

MOHANA


LAB 2: DATA QUALITY ASSESSMENT & SIMILARITY ANALYSIS

Q1. Outlier detection.

(i) For your dataset, identify any duplicates and remove them, retaining only their first occurrence.

(ii) Going by each attribute, use the two techniques discussed in class to identify outliers. You may find it useful to write two functions - OutlierFromIQR() and OutlierFromZscore() - to process your dataset. An outlier that lies 3*IQR outside Q1 or Q3 is an extreme outlier - these may also be identified.

#Q1
Drug<-read.csv("Drug.csv")
Drug

# Load necessary libraries

library(dplyr)
# (i) Remove duplicate rows, keeping the first occurrence
Drug <- Drug[!duplicated(Drug), ]

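# Function to detect outliers using the IQR method.
# Outliers lie more than 1.5 * IQR below Q1 or above Q3 (this set includes
# the extreme ones); extreme outliers lie more than 3 * IQR away.
OutlierFromIQR <- function(column) {
  q1 <- quantile(column, 0.25, na.rm = TRUE)
  q3 <- quantile(column, 0.75, na.rm = TRUE)
  iqr <- q3 - q1  # lower-case name avoids shadowing stats::IQR()
  lower_bound <- q1 - 1.5 * iqr
  upper_bound <- q3 + 1.5 * iqr
  extreme_lower <- q1 - 3 * iqr
  extreme_upper <- q3 + 3 * iqr
  # which() drops missing values, so no NA entries appear in the result
  list(
    Outliers = column[which(column < lower_bound | column > upper_bound)],
    Extreme_Outliers = column[which(column < extreme_lower | column > extreme_upper)]
  )
}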

# Function to detect outliers using the Z-score method (|z| > 3)
OutlierFromZscore <- function(column) {
  mean_col <- mean(column, na.rm = TRUE)
  sd_col <- sd(column, na.rm = TRUE)
  z_scores <- (column - mean_col) / sd_col
  # which() drops missing values, so no NA entries appear in the result
  outliers <- column[which(abs(z_scores) > 3)]
  list(Outliers = outliers)
}

# Apply functions to numeric columns
numeric_columns <- names(Drug)[sapply(Drug, is.numeric)]

# Drug[numeric_columns] always returns a data frame, so lapply() iterates
# over columns even when only one numeric column exists
outliers_iqr <- lapply(Drug[numeric_columns], OutlierFromIQR)
outliers_zscore <- lapply(Drug[numeric_columns], OutlierFromZscore)

# Print results
print("Outliers detected using IQR method:")
print(outliers_iqr)

print("Outliers detected using Z-score method:")
print(outliers_zscore)
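
The two methods can disagree, particularly on small samples. As a rough illustration on made-up data (the vector `x` below is hypothetical and not taken from Drug.csv), a single very large value inflates the mean and standard deviation enough that its own z-score stays below 3, while the IQR fences still flag it:

```r
# Hypothetical toy vector: nine ordinary values and one very large one
x <- c(10, 11, 12, 11, 10, 12, 11, 10, 12, 100)

# Z-score rule: |z| > 3
z <- (x - mean(x)) / sd(x)
z_flag <- abs(z) > 3

# IQR rule: beyond 1.5 * IQR (mild) or 3 * IQR (extreme) from the quartiles
q <- quantile(x, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
mild_flag <- x < q[[1]] - 1.5 * iqr | x > q[[2]] + 1.5 * iqr
extreme_flag <- x < q[[1]] - 3 * iqr | x > q[[2]] + 3 * iqr

print(data.frame(x, z = round(z, 2), z_flag, mild_flag, extreme_flag))
```

With only ten observations no value can reach |z| > 3, because the largest possible |z| in a sample of size n is (n - 1) / sqrt(n), which is about 2.85 for n = 10. On small attributes it may therefore be worth reading the Z-score results alongside the IQR results rather than on their own.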

library(ggplot2)

# Helper for plotting: returns a logical vector that flags values lying more
# than 1.5 * IQR below Q1 or above Q3. It has its own name so that it does
# not overwrite OutlierFromIQR() defined above.
IsOutlierIQR <- function(column) {
  q1 <- quantile(column, 0.25, na.rm = TRUE)
  q3 <- quantile(column, 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  column < q1 - 1.5 * iqr | column > q3 + 1.5 * iqr
}

# Flag outliers in the 'Na' column (fixed levels keep the colour mapping stable)
Drug$outlier <- factor(IsOutlierIQR(Drug$Na), levels = c(FALSE, TRUE))

# Boxplot of 'Na' with outliers highlighted in red.
# The colour is mapped only in geom_jitter(); mapping it in ggplot() would
# split the data into one box per outlier group.
ggplot(Drug, aes(x = "", y = Na)) +
  geom_boxplot(outlier.shape = NA) +  # jittered points already show the outliers
  geom_jitter(aes(color = outlier), width = 0.2, height = 0, alpha = 0.5) +
  scale_color_manual(values = c("FALSE" = "black", "TRUE" = "red")) +  # normal = black, outlier = red
  labs(title = "Boxplot of Na", x = "", y = "Na", color = "Outlier") +
  theme_minimal()

# Scatter plot of 'Na' against row index, with outliers highlighted in red
ggplot(Drug, aes(x = seq_along(Na), y = Na, color = outlier)) +
  geom_point(size = 2, alpha = 0.7) +
  scale_color_manual(values = c("FALSE" = "black", "TRUE" = "red")) +  # normal = black, outlier = red
  labs(title = "Scatter Plot of Na with Outliers", x = "Index", y = "Na", color = "Outlier") +
  theme_minimal()
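
#Q2
# Re-read the data so that empty strings are treated as missing values
# (this also discards the 'outlier' column added in Q1)
Drug <- read.csv("Drug.csv", na.strings = c("", "NA"))

# Count missing values per column before imputation
colSums(is.na(Drug))

# Impute numeric columns with median and categorical columns with mode
for (col in names(Drug)) {
  if (is.numeric(Drug[[col]])) {
    Drug[[col]][is.na(Drug[[col]])] <- median(Drug[[col]], na.rm = TRUE)
  } else {
    # table() ignores NA, so the mode is the most frequent observed value
    mode_value <- names(sort(table(Drug[[col]]), decreasing = TRUE))[1]
    Drug[[col]][is.na(Drug[[col]])] <- mode_value
  }
}

# Check missing values after imputation
colSums(is.na(Drug))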
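
To check the imputation logic independently of Drug.csv, the same rules can be run on a small made-up data frame. In the sketch below, the `toy` data frame and the `impute_column()` helper are hypothetical names introduced only for illustration; the helper wraps the median/mode rule used in the loop above.

```r
# Hypothetical helper: median for numeric columns, most frequent value otherwise
impute_column <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- median(x, na.rm = TRUE)
  } else {
    # table() ignores NA by default, so the mode is taken over observed values
    x[is.na(x)] <- names(sort(table(x), decreasing = TRUE))[1]
  }
  x
}

# Made-up data with one missing value per column
toy <- data.frame(
  Age = c(20, 30, NA, 50),
  BP  = c("HIGH", NA, "HIGH", "LOW"),
  stringsAsFactors = FALSE
)

# toy[] keeps the data frame structure while replacing every column
toy[] <- lapply(toy, impute_column)
print(toy)
print(colSums(is.na(toy)))
```

Here the missing Age becomes the median of the observed ages and the missing BP becomes the most frequent observed level. When two categories tie for the mode, the value chosen is whichever one sort() places first, so the result is somewhat arbitrary in that case.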

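#Q3

# Load necessary libraries (install them first only if they are missing)
for (pkg in c("ggcorrplot", "dplyr")) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}
library(ggcorrplot)
library(dplyr)

# Select numeric attributes
numeric_columns <- select(Drug, where(is.numeric))

# (i) Compute correlation matrix for numeric attributes
cor_matrix <- cor(numeric_columns, use = "pairwise.complete.obs")

# (ii) Visualize correlation using a heatmap
ggcorrplot(cor_matrix, lab = TRUE, colors = c("blue", "white", "red"))

# (iii) Identify attribute pairs with high correlation (|correlation| > 0.5).
# Only the upper triangle is scanned, so the diagonal is skipped and each
# pair is reported once.
high_idx <- which(abs(cor_matrix) > 0.5 & upper.tri(cor_matrix), arr.ind = TRUE)
high_corr <- data.frame(
  attribute_1 = rownames(cor_matrix)[high_idx[, "row"]],
  attribute_2 = colnames(cor_matrix)[high_idx[, "col"]],
  correlation = cor_matrix[high_idx]
)
print("Highly Correlated Attributes (|correlation| > 0.5):")
print(high_corr)

# (iv) Compute the Euclidean distance matrix between observations (rows),
# using the numeric attributes
distance_matrix <- dist(numeric_columns, method = "euclidean")
print(as.matrix(distance_matrix))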
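
Euclidean distance is sensitive to the units of each attribute, so an attribute with a large numeric range can dominate the result. One common option, sketched below on made-up data (the `toy` data frame and its column names are hypothetical), is to standardize the columns with scale() before calling dist():

```r
# Made-up data: 'Age' spans tens of units, 'Ratio' spans fractions of a unit
toy <- data.frame(
  Age   = c(20, 40, 60),
  Ratio = c(0.1, 0.9, 0.2)
)

# Raw distances are driven almost entirely by Age
raw_dist <- as.matrix(dist(toy, method = "euclidean"))

# After scale(), each column has mean 0 and standard deviation 1
scaled_dist <- as.matrix(dist(scale(toy), method = "euclidean"))

print(round(raw_dist, 2))
print(round(scaled_dist, 2))
```

On this toy data, row 1 is closer to row 2 than to row 3 in raw units, but the order reverses after scaling, because the difference in Ratio then carries as much weight as the difference in Age. Whether scaling is appropriate depends on what the lab expects the distances to represent.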
