Final Project Implementation
The document describes a project to analyze healthcare data and classify malicious events using machine learning algorithms. It provides instructions to clean the data, select algorithms, tune the models, and evaluate their performance on testing data to determine which model is best for the classification task. The two selected models (one penalised logistic regression variant and one tree-based method, e.g. Random Forest) are to be tested on both balanced and unbalanced training data.
Project 2 (Main project)
The CSV file will be sent with this document.
Objectives

You are the data scientist who has been hired by Q.LG to examine the data and provide insights. Your goals are to:
• Clean the data file and prepare it for machine learning (ML).
• Recommend an ML algorithm that will provide the most accurate detection of malicious events.
• Create a brief report on your findings.

Your job

Your job is to develop the detection algorithms that provide the most accurate incident detection. You do not need to concern yourself with the specifics of the SIEM plugin or software integration; your task is to focus on accurate classification of malicious events using R. You are to test and evaluate two machine learning algorithms (each in two scenarios) to determine which supervised learning model is best for the task as described.

Task

You are to import and clean the same HealthCareData_2024.csv that was used in the previous assignment. Then run, tune and evaluate two supervised ML algorithms (each with two types of training data) to identify the most accurate way of classifying malicious events.

Part 1 – General data preparation and cleaning

a) Import the HealthCareData_2024.csv into RStudio. This version is the same as Assignment 1.

b) Write the appropriate code in RStudio to prepare and clean the HealthCareData_2024 dataset as follows:
   i. Clean the whole dataset based on the feedback received for Assignment 1.
   ii. For the feature NetworkInteractionType, merge the 'Regular' and 'Unknown' categories together to form the category 'Others'. Hint: use the forcats::fct_collapse(.) function.
   iii. Select only the complete cases using the na.omit(.) function, and name the dataset dat.cleaned.
   Briefly outline the preparation and cleaning process in your report and explain why you believe the above steps were necessary. (An illustrative sketch of these steps is given at the end of this part.)

c) Use the code below to generate two training datasets (one unbalanced, mydata.ub.train, and one balanced, mydata.b.train) along with the testing set (mydata.test). Make sure you enter your student ID into the command set.seed(.).

   # Separate samples of normal and malicious events
   dat.class0 <- dat.cleaned %>% filter(Classification == "Normal")     # normal
   dat.class1 <- dat.cleaned %>% filter(Classification == "Malicious")  # malicious

   # Randomly select 9600 non-malicious and 400 malicious samples using your
   # student ID, then combine them to form a working data set
   set.seed(Enter your Student ID)
   rows.train0 <- sample(1:nrow(dat.class0), size = 9600, replace = FALSE)
   rows.train1 <- sample(1:nrow(dat.class1), size = 400, replace = FALSE)

   # Your 10000 'unbalanced' training samples
   train.class0 <- dat.class0[rows.train0, ]  # Non-malicious samples
   train.class1 <- dat.class1[rows.train1, ]  # Malicious samples
   mydata.ub.train <- rbind(train.class0, train.class1)

   # Your 19200 'balanced' training samples, i.e. 9600 normal and malicious samples each
   set.seed(Enter your Student ID)
   train.class1_2 <- train.class1[sample(1:nrow(train.class1), size = 9600, replace = TRUE), ]
   mydata.b.train <- rbind(train.class0, train.class1_2)

   # Your testing samples
   test.class0 <- dat.class0[-rows.train0, ]
   test.class1 <- dat.class1[-rows.train1, ]
   mydata.test <- rbind(test.class0, test.class1)

Note that in the master data set the percentage of malicious events is approximately 4%, and this distribution is roughly represented by the unbalanced data. The balanced data is generated by up-sampling the minority class using bootstrapping. The idea is to ensure the trained model is not biased towards the majority class, i.e. normal events.
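The brief leaves the exact Assignment 1 fixes open, so the following is only a minimal sketch of steps b) i–iii. The treatment of empty strings as missing values is a hypothetical stand-in for your own Assignment 1 feedback; the fct_collapse(.) and na.omit(.) calls follow the hints above.

   library(dplyr)    # pipes, mutate(.), across(.)
   library(forcats)  # fct_collapse(.)

   # Part 1 a) Import the raw data
   dat <- read.csv("HealthCareData_2024.csv")

   dat.cleaned <- dat %>%
     # i. Hypothetical Assignment 1 fix: treat empty strings as missing,
     #    then convert the character features to factors
     mutate(across(where(is.character), ~ na_if(., ""))) %>%
     mutate(across(where(is.character), as.factor)) %>%
     # ii. Merge 'Regular' and 'Unknown' into the single category 'Others'
     mutate(NetworkInteractionType =
              fct_collapse(NetworkInteractionType,
                           Others = c("Regular", "Unknown"))) %>%
     # iii. Keep complete cases only
     na.omit()

Your own cleaning will likely involve additional recoding (e.g. invalid or out-of-range entries identified in Assignment 1) before the na.omit(.) step.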
Part 2 – Compare the performances of different ML algorithms

a) Randomly select two supervised learning modelling algorithms to test against one another by running the following code. Make sure you enter your student ID into the command set.seed(.). Your two ML approaches are given by myModels.

   set.seed(Enter your Student ID)
   models.list1 <- c("Logistic Ridge Regression",
                     "Logistic LASSO Regression",
                     "Logistic Elastic-Net Regression")
   models.list2 <- c("Classification Tree",
                     "Bagging Tree",
                     "Random Forest")
   myModels <- c(sample(models.list1, size = 1),
                 sample(models.list2, size = 1))
   myModels %>% data.frame()

For each of your two ML modelling approaches, you will need to:

b) Run the ML algorithm in R on the two training sets with Classification as the outcome variable.

c) Perform hyperparameter tuning to optimise the model:
   • Outline your hyperparameter tuning/searching strategy for each of the ML modelling approaches. Report the search range(s) for hyperparameter tuning, the k-fold CV used, the number of repeated CVs (if applicable), and the final optimal tuning parameter values and relevant CV statistics (i.e. CV results, tables and plots), where appropriate. If you are using repeated CVs, a minimum of 2 repeats is required.
   • If your selected tree model is Bagging, you must tune the nbagg, cp and minsplit hyperparameters, with at least 3 values for each.
   • If your selected tree model is Random Forest, you must tune the num.trees and mtry hyperparameters, with at least 3 values for each.
   • Be sure to set the randomisation seed using your student ID.

d) Evaluate the predictive performance of your two ML models, derived from the balanced and unbalanced training sets, on the testing set. Provide the confusion matrices, and report and interpret the following measures in the context of the project:
   • Overall accuracy
   • Precision
   • Recall
   • F1-score
   Make sure you define each of the above metrics in the context of the study. Hint: use the help menu in RStudio on the confusionMatrix(.) function to see how to obtain the precision, recall and F1-score metrics.

e) Provide a brief statement on your final recommended model and why you have chosen it, including which metric(s) you have used in making this decision and why. Parsimony, and to a lesser extent interpretability, may be taken into account if the decision is close. You may outline your penalised model estimates in the Appendix if it helps your argument. (An illustrative tuning and evaluation sketch follows this part.)

What to submit

Gather your findings into a report (maximum of 5 pages), citing relevant sources where necessary. Present how and why the data was cleaned and prepared, how the ML models were tuned, and the relevant CV results. Lastly, present how the models performed against each other in both the unbalanced and balanced scenarios. You may use graphs, tables and images where appropriate to help your reader understand your findings. All tables and figures should be appropriately captioned and referenced in-text. Make a final recommendation on which ML modelling approach is best for this task.

Your final report should look professional, include appropriate headings and subheadings, and cite facts and reference source materials in APA 7th format. Your submission must include the following:
• Your report (5 pages or less, excluding cover/contents/reference/appendix pages).
• A copy of your R code, which is to be submitted separately.
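Since myModels is drawn randomly per student, the sketch below is illustrative only: it assumes "Random Forest" was one of the two models drawn and uses the caret package with the ranger backend. The student ID 12345678, the 10-fold/2-repeat CV setup, and the grid values are all placeholders; note that num.trees is a fixed argument to ranger rather than a caret tuning parameter, so it is looped over manually while caret cross-validates mtry. The final confusionMatrix(.) call follows the hint in d).

   library(caret)    # train(.), trainControl(.), confusionMatrix(.)
   library(ranger)   # Random Forest backend for caret's method = "ranger"

   ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 2)

   # Tune mtry via repeated CV at each of (at least) 3 num.trees values
   fits <- list()
   for (nt in c(200, 400, 600)) {
     set.seed(12345678)                    # placeholder: use your student ID
     fits[[as.character(nt)]] <- train(
       Classification ~ ., data = mydata.b.train,
       method    = "ranger",
       num.trees = nt,                     # passed through to ranger(.)
       trControl = ctrl,
       tuneGrid  = expand.grid(mtry = c(2, 4, 6),
                               splitrule = "gini",
                               min.node.size = 1))
   }
   sapply(fits, function(f) max(f$results$Accuracy))  # best CV accuracy per num.trees

   # Part 2 d) Evaluate the chosen model on the testing set
   best.fit <- fits[["400"]]               # placeholder choice from the CV results
   pred <- predict(best.fit, newdata = mydata.test)
   confusionMatrix(pred, mydata.test$Classification,
                   positive = "Malicious", mode = "prec_recall")

The same pattern (repeatedcv control, a tuneGrid over the required hyperparameters, a seed set to your student ID) applies to the penalised logistic models via method = "glmnet", and the whole procedure is repeated on mydata.ub.train for the unbalanced scenario.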
Make sure you keep a copy of each of the two training sets and a testing set (in .csv format) in case you are asked for them later.
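For the record-keeping note above, a write.csv(.) call per dataset is sufficient; the file names below are only suggestions.

   # Keep copies of the training and testing sets for later reference
   write.csv(mydata.ub.train, "mydata_ub_train.csv", row.names = FALSE)
   write.csv(mydata.b.train,  "mydata_b_train.csv",  row.names = FALSE)
   write.csv(mydata.test,     "mydata_test.csv",     row.names = FALSE)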