Final Project Implementation

The document describes a project to analyse healthcare data and classify malicious events using machine learning algorithms. It provides instructions to clean the data, select algorithms, tune models, and evaluate their performance on testing data to determine which is best for classification. Random Forest and logistic regression are to be tested on balanced and unbalanced training data.

Project 2 (Main project)

A CSV file will be sent with this document.


Objectives
You are the data scientist who has been hired by Q.LG to examine the data and
provide insights. Your goals will be to:
• Clean the data file and prepare it for Machine Learning (ML)
• Recommend an ML algorithm that will provide the most accurate detection of
malicious events.
• Create a brief report on your findings
Your job
Your job is to develop the detection algorithms that will provide the most accurate
incident detection. You do not need to concern yourself about the specifics of the SIEM
plugin or software integration, i.e., your task is to focus on accurate classification of
malicious events using R.
You are to test and evaluate two machine learning algorithms (each in two scenarios) to
determine which supervised learning model is best for the task as described.
Task
You are to import and clean the same HealthCareData_2024.csv that was used in the
previous assignment. Then run, tune and evaluate two supervised ML algorithms (each
with two types of training data) to identify the most accurate way of classifying
malicious events.
Part 1 – General data preparation and cleaning
a) Import the HealthCareData_2024.csv into RStudio. This version of the file is the
same as in Assignment 1.
b) Write the appropriate code in RStudio to prepare and clean the
HealthCareData_2024 dataset as follows (a minimal sketch of these steps is
given after this list):
i. Clean the whole dataset based on the feedback received for Assignment 1.
ii. For the feature NetworkInteractionType, merge the 'Regular' and
'Unknown' categories together to form the category 'Others'. Hint: use the
forcats::fct_collapse(.) function.
iii. Select only the complete cases using the na.omit(.) function, and name the
dataset dat.cleaned.
Briefly outline the preparation and cleaning process in your report and explain
why you believe the above steps were necessary.
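
The sketch below covers the import and steps (ii) and (iii); the file path is an
assumption, and step (i) is left as a comment because it depends on your own
Assignment 1 feedback.

library(dplyr)
library(forcats)

# Import the supplied CSV (adjust the path to where you saved the file)
dat <- read.csv("HealthCareData_2024.csv", stringsAsFactors = TRUE)

# (i) Apply your Assignment 1 cleaning here, e.g. recoding invalid entries to NA

# (ii) Merge the 'Regular' and 'Unknown' levels into 'Others'
dat <- dat %>%
  mutate(NetworkInteractionType = fct_collapse(NetworkInteractionType,
                                               Others = c("Regular", "Unknown")))

# (iii) Keep the complete cases only
dat.cleaned <- na.omit(dat)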
c) Use the code below to generate two training datasets (one unbalanced
mydata.ub.train and one balanced mydata.b.train) along with the testing set
(mydata.test). Make sure you enter your student ID into the command
set.seed(.).
# The code below uses the pipe and filter() from dplyr
library(dplyr)

# Separate samples of normal and malicious events
dat.class0 <- dat.cleaned %>% filter(Classification == "Normal")    # normal
dat.class1 <- dat.cleaned %>% filter(Classification == "Malicious") # malicious

# Randomly select 9600 non-malicious and 400 malicious samples using your
# student ID, then combine them to form a working data set
set.seed(Enter your Student ID)
rows.train0 <- sample(1:nrow(dat.class0), size = 9600, replace = FALSE)
rows.train1 <- sample(1:nrow(dat.class1), size = 400, replace = FALSE)

# Your 10000 'unbalanced' training samples
train.class0 <- dat.class0[rows.train0, ] # Non-malicious samples
train.class1 <- dat.class1[rows.train1, ] # Malicious samples
mydata.ub.train <- rbind(train.class0, train.class1)

# Your 19200 'balanced' training samples, i.e. 9600 normal and malicious
# samples each
set.seed(Enter your Student ID)
train.class1_2 <- train.class1[sample(1:nrow(train.class1), size = 9600,
                                      replace = TRUE), ]
mydata.b.train <- rbind(train.class0, train.class1_2)

# Your testing samples
test.class0 <- dat.class0[-rows.train0, ]
test.class1 <- dat.class1[-rows.train1, ]
mydata.test <- rbind(test.class0, test.class1)
Note that in the master data set, the percentage of malicious events is
approximately 4%. This distribution is roughly preserved in the unbalanced
training data. The balanced training data is generated by up-sampling the
minority class using bootstrapping. The idea here is to ensure the trained model
is not biased towards the majority class, i.e. normal events.
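
As an optional sanity check (not part of the supplied code), the class
proportions in each set can be confirmed directly:

# Expect roughly 96/4 in the unbalanced set and 50/50 in the balanced set
prop.table(table(mydata.ub.train$Classification))
prop.table(table(mydata.b.train$Classification))
prop.table(table(mydata.test$Classification))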
Part 2 – Compare the performances of different ML algorithms
a) Randomly select two supervised learning modelling algorithms to test against
one another by running the following code. Make sure you enter your student ID
into the command set.seed(.). Your two ML approaches are given by myModels.
set.seed(Enter your student ID)
models.list1 <- c("Logistic Ridge Regression",
                  "Logistic LASSO Regression",
                  "Logistic Elastic-Net Regression")
models.list2 <- c("Classification Tree",
                  "Bagging Tree",
                  "Random Forest")
myModels <- c(sample(models.list1, size = 1),
              sample(models.list2, size = 1))
myModels %>% data.frame
For each of your two ML modelling approaches, you will need to:
b) Run the ML algorithm in R on the two training sets, with Classification as the
outcome variable (a minimal sketch follows this item).
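
The sketch below assumes Logistic Ridge Regression was one of your drawn
models and uses caret with glmnet; the lambda grid, CV settings and the seed
value are illustrative assumptions only.

library(caret)
library(glmnet)

set.seed(12345678) # placeholder; use your own student ID
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2)
fit.ub <- train(Classification ~ ., data = mydata.ub.train,
                method = "glmnet",
                trControl = ctrl,
                tuneGrid = expand.grid(alpha = 0, # alpha = 0 gives the ridge penalty
                                       lambda = 10^seq(-4, 1, length = 20)))
# Repeat the same call with data = mydata.b.train for the balanced-set model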
c) Perform hyperparameter tuning to optimise each model (a tuning sketch for
the Random Forest case is given after this list):
• Outline your hyperparameter tuning/searching strategy for each of the ML
modelling approaches. Report the search range(s) for hyperparameter
tuning, which k-fold CV was used, the number of repeated CVs (if
applicable), and the final optimal tuning parameter values and relevant CV
statistics (i.e. CV results, tables and plots), where appropriate. If you are
using repeated CVs, a minimum of 2 repeats is required.
• If your selected tree model is Bagging, you must tune the nbagg, cp and
minsplit hyperparameters, with at least 3 values for each.
• If your selected tree model is Random Forest, you must tune the num.trees
and mtry hyperparameters, with at least 3 values for each.
• Be sure to set the randomisation seed using your student ID.
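
A hedged sketch for the Random Forest case, assuming caret with the ranger
engine: caret's grid for ranger tunes mtry (along with splitrule and
min.node.size), so num.trees is varied in an outer loop; every grid value below
is illustrative only.

library(caret)
library(ranger)

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2)
rf.grid <- expand.grid(mtry = c(2, 4, 6), # at least 3 values of mtry
                       splitrule = "gini",
                       min.node.size = 1)
rf.fits <- list()
for (nt in c(250, 500, 750)) { # at least 3 values of num.trees
  set.seed(12345678) # placeholder; use your own student ID
  rf.fits[[as.character(nt)]] <- train(Classification ~ .,
                                       data = mydata.b.train,
                                       method = "ranger",
                                       num.trees = nt,
                                       trControl = ctrl,
                                       tuneGrid = rf.grid)
}
# Compare CV accuracy across rf.fits to pick num.trees and mtry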
d) Evaluate the predictive performance of your two ML models, derived from the
balanced and unbalanced training sets, on the testing set. Provide the confusion
matrices and report and interpret the following measures in the context of the
project:
• Overall Accuracy
• Precision
• Recall
• F1-score
Make sure you define each of the above metrics in the context of the study. Hint:
use the help menu in RStudio on the confusionMatrix(.) function to see how one
can obtain the precision, recall and F1-score metrics; a minimal evaluation
sketch follows below.
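
The sketch below assumes a fitted caret model named fit.ub (a hypothetical
name from the earlier sketch) and treats "Malicious" as the positive class.

library(caret)

preds <- predict(fit.ub, newdata = mydata.test)
confusionMatrix(data = preds,
                reference = mydata.test$Classification,
                positive = "Malicious",
                mode = "prec_recall") # reports Precision, Recall and F1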
e) Provide a brief statement on your final recommended model and why you have
chosen it. This includes explaining which metric(s) you have used in making this
decision and why. Parsimony and, to a lesser extent, interpretability may be
taken into account if the decision is close. You may outline your penalised model
estimates in the Appendix if it helps with your argument (a sketch for extracting
them is given below).
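
If the penalised model was trained with caret's glmnet method (as in the earlier
hypothetical sketch), its estimates at the CV-optimal lambda can be extracted as
follows:

# Penalised coefficient estimates at the optimal lambda
coef(fit.ub$finalModel, s = fit.ub$bestTune$lambda)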
What to submit
Gather your findings into a report (maximum of 5 pages), citing relevant sources
where necessary.
Present how and why the data was cleaned and prepared, how the ML models were
tuned, and provide the relevant CV results. Lastly, present how the models
performed relative to each other in both the unbalanced and balanced scenarios.
You may use graphs, tables and images where appropriate to help your reader
understand your findings. All tables and figures should be appropriately
captioned and referenced in-text.
Make a final recommendation on which ML modelling approach is the best for this task.
Your final report should look professional, include appropriate headings and
subheadings, and cite facts and reference source materials in APA 7th format.
Your submission must include the following:
• Your report (5 pages or less, excluding cover/contents/reference/appendix
pages).
• A copy of your R code, which is to be submitted separately.
Make sure you keep a copy of each of the two training sets and the testing set
(in .csv format) in case you are asked for them later; a minimal saving sketch
is given below.
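
A minimal saving sketch (the file names are assumptions):

write.csv(mydata.ub.train, "mydata_ub_train.csv", row.names = FALSE)
write.csv(mydata.b.train, "mydata_b_train.csv", row.names = FALSE)
write.csv(mydata.test, "mydata_test.csv", row.names = FALSE)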
