Introduction to Machine Learning
Version 2023-08
Licence
This manual is © 2023, Simon Andrews, Laura Biggins.
This manual is distributed under the creative commons Attribution-Non-Commercial-Share Alike 2.0
licence. This means that you are free to copy, distribute, display and perform the work, and to make
derivative works, under the following conditions:
• Attribution. You must give the original author credit.
• Non-Commercial. You may not use this work for commercial purposes.
• Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work
only under a licence identical to this one.
Please note that:
• For any reuse or distribution, you must make clear to others the licence terms of this work.
• Any of these conditions can be waived if you get permission from the copyright holder.
• Nothing in this licence impairs or restricts the author's moral rights.
The aim of the model is to try to predict which genes are involved in development. This is defined based
on the “Developmental Process” Gene Ontology category (GO:0032502).
Running Models
To let you try out some of these models you can go to:
https://www.bioinformatics.babraham.ac.uk/shiny/machinelearning/
where we have built a simple interface which lets you run a variety of different model types on this data.
Just select the model you want to run from the drop-down box and press the “Run Model” button.
After the model has run you will see some information on the left summarising the parameters which
were used to run it – you should be able to match these to the theory we talked about before.
On the right you will see a summary of some predictions made by the model. We have run two sets of
data through the model.
1. We re-ran the data used to train the model back through it to see how well it is able to predict
data it has seen before.
2. We set aside a portion of the original data before training the model and then ran this through
the model after it was trained to see how well it works against data it hasn’t seen before.
Results
This table shows a summary of the predictions the model made and how they matched against the known
correct values in the data. It’s important to validate a model against data where you know the answer,
before using it to make predictions on data where the answer isn’t known.
In this table you can see the total number of correct (TRUE) and incorrect (FALSE) predictions the model
made.
Questions:
Run the different models, look at their output and the summary of the predictions they make, and then
answer the questions below.
1. Do all of the models perform similarly well, or are some better than others?
2. Do the models perform similarly well on the data they have seen before and the data they haven’t
seen before?
3. Do the more complex models perform better than the simpler ones?
4. If you run each model a couple of times, do the results change? If they only change for some of
the models, why is this?
5. If you hadn’t run a model, but had simply assigned the most frequent category (Not
Development) to every prediction, how many correct answers would you expect to have seen in
the test data of 249 samples? Do any of the models do substantially better than this?
We have also built a version of the interface which lets you adjust the parameters of the random forest
model:
https://www.bioinformatics.babraham.ac.uk/shiny/optimising_model/
In this version you can change the total number of trees constructed, the number of random predictors
selected at each branch point, and the minimum number of measures which must appear in a node at
the bottom of the tree, so that the tree doesn’t become too complex.
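For reference, the same three parameters exist in the tidymodels framework used in the later exercises,
where they are called trees, mtry and min_n; a minimal sketch, with placeholder values, might look like:

# Sketch only - the values shown here are placeholders, not recommendations
rand_forest(
  trees = 100,   # total number of trees constructed
  mtry = 5,      # random predictors considered at each branch point
  min_n = 10     # minimum number of measures allowed in a node
) %>%
  set_engine("ranger") %>%
  set_mode("classification")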
Try running the model a few times and see what effect changing these parameters has on the results.
What settings would you have to use to mimic a conventional decision tree?
What do you think the effect of changing the different parameters would be? Do you see this in the
results?
1. Are the models actually identifying developmental genes at a rate which is significantly higher
than you’d get by guessing?
2. What is the balance in the models between sensitivity (the ability to say that a developmental
gene is a developmental gene) and specificity (the ability to identify non-developmental genes)?
3. Are there differences in the sensitivity / specificity trade-off between the models?
4. Which models appear to be most strongly overfitted to the training data (do well on training, and
poorly on testing)?
The aim of your model is to predict which of these proteins contains one or more transmembrane
segments, such that the protein is normally found embedded within a membrane.
To do this you are going to build and train a random forest model. The steps in the modelling procedure
will be to read in and clean the data, split it into training and testing sets, define and fit the model, make
predictions for the test data, and then evaluate those predictions against the known answers.
Below we will talk you through how to construct a script in RStudio to perform all of these steps. In an
actual modelling experiment we would include more evaluation of the data before starting on the
modelling, so this is a somewhat truncated version of the full procedure you’d use.
To get started you need to open a new R script, save it, then set the location of the data you’re going to
use.
Once the script has opened, go to File > Save As and save it into the MachineLearningData folder in a
file called model.R
In the RStudio menu select Session > Set Working Directory > To Source File Location
# Load the packages needed for data manipulation and modelling
library(tidyverse)
library(tidymodels)
tidymodels_prefer()
The last line here simply says that we should always use functions from tidymodels, even if another
function with the same name exists in a different package.
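The command which reads the data in isn’t shown on this page; a minimal sketch, assuming the data is
a tab-delimited text file sitting in your MachineLearningData folder (the file name below is only a
placeholder for whichever file you were given), would be:

# Placeholder file name - use whichever data file is in your MachineLearningData folder
read_tsv("transmembrane_proteins.txt") -> data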
You can then click on the data in the Environment tab (top right) and have a look at what the data looks
like.
data %>%
mutate(
transmembrane = factor(transmembrane)
) -> data
After you’ve run this, hold your mouse over the transmembrane column header when looking at the data.
It should now say that it is a factor.
data %>%
select(-gene_id) -> data
You should now see that the gene_id column has gone, and that the transmembrane column is now the
first one.
data %>%
sample_frac() -> data
This won’t change the structure of your data, but where the original data had all proteins from the same
chromosome grouped together, you should now see that they are all mixed up.
data %>%
na.omit() -> data
After running this you should see that the number of rows in the data goes down from 19,701 to 18,352.
Because we are going to run a random forest model this is all of the preparation we need to do. Later
we may try other model types where we would need to make the data more quantitatively well behaved,
but tree-based models really don’t care.
data %>%
initial_split(prop=0.8, strata=transmembrane) -> split_data
This will split off 80% of our data to be used for training and 20% for testing.
You can look at the two portions of the data with:

training(split_data)
..or..
testing(split_data)
You should see about 14,600 rows in the training data and about 3,600 in the testing.
# A random forest classifier built from 100 trees, fitted using the ranger package
rand_forest(trees=100) %>%
  set_engine("ranger") %>%
  set_mode("classification") -> forest_model
Note that a lot of the options in the model fit template are set to “missing_arg()” which means that
they are values we will need to supply later in the process.
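One way to see that template is to pass the model specification through parsnip’s translate() function,
which shows the underlying ranger call with the missing_arg() placeholders still in place:

# Show the underlying call template, including the arguments not yet filled in
forest_model %>%
  translate()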
forest_model %>%
fit(transmembrane ~ ., data=training(split_data)) -> forest_fit
forest_fit
We should see all of the variables for the model in place, and see some of the details of the data and the
fit (number of variables and cases etc).
We can now use the fitted model to make predictions for the test data:

forest_fit %>%
predict(testing(split_data))
# A tibble: 3,671 × 1
.pred_class
<fct>
1 Soluble
2 Soluble
3 Transmembrane
4 Soluble
5 Soluble
6 Soluble
7 Soluble
8 Soluble
9 Soluble
10 Soluble
The problem with this is that it only outputs the predictions; we don’t see the rest of the data, including
the column which says what the answer should have been, so we need to join those predictions to the
testing data.
forest_fit %>%
predict(testing(split_data)) %>%
bind_cols(testing(split_data)) -> prediction_results
You can now click on prediction_results in the environment window to see the predictions (in
the .pred_class column) alongside the known correct answers (in the transmembrane column).
We can start by simply counting the number of times we see different combinations of predictions and
true values in the data.
prediction_results %>%
group_by(transmembrane, .pred_class) %>%
count()
From this you can see how many times a correct and incorrect prediction was made, and the breakdown
of the mistakes which were made.
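If you prefer, the conf_mat() function from yardstick (one of the tidymodels packages) will lay the same
counts out as a conventional confusion matrix:

# The same counts laid out as a confusion matrix
prediction_results %>%
  conf_mat(transmembrane, .pred_class)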
We can also get more specific values for sensitivity and specificity
prediction_results %>%
sens(transmembrane, .pred_class)
..and..
prediction_results %>%
spec(transmembrane, .pred_class)
Finally we can get an overall accuracy value, and we can also get Cohen’s kappa value to say
whether we’re actually performing better than chance on the data.
prediction_results %>%
metrics(transmembrane, .pred_class)
What is your evaluation of how well the model has performed? Feel free to try playing with the setup
parameters for the model to see if you can improve on the initial performance. Remember though that
there is a random component, so just because a model works better once doesn’t mean that those
settings will always be better.
Because neural networks place more constraints on the data which goes into the model, we’re going to
have to do more pre-processing. We will have to apply this to both the training and testing data (and
we’d have to do it to any unknown proteins in future), so we’re going to automate it with a recipe and
integrate that into a workflow to run everything.
We can follow the same steps as before, or we can use the same split_data variable as for the
random forest model.
Building a Recipe
Firstly we’re going to build a recipe which will combine the formula for prediction and the training data.
Once we have it we can then add steps to it to complete the pre-processing.
recipe(
transmembrane ~ . ,
data=training(split_data)
) -> neural_recipe
neural_recipe
Now we have a recipe we can add processing steps to it. The steps will be to log transform the gene and
transcript lengths, to normalise all of the numeric predictors, and to turn any nominal predictors into
dummy variables:
neural_recipe %>%
step_log(gene_length, transcript_length) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors()) -> neural_recipe
Look at the recipe again to see the new steps have been added.
mlp(
  epochs = 1000,      # the number of passes through the training data
  hidden_units = 10,  # the number of neurons in the hidden layer
  penalty = 0.01,     # the amount of weight regularisation applied
  learn_rate = 0.01   # how quickly the weights are updated at each step
) %>%
  set_engine("brulee", validation = 0) %>%
  set_mode("classification") -> nnet_model
Again, these values could be modified after generating an initial model, but these will give us something
to work from.
Building a Workflow
A workflow will combine the recipe and the model together and will allow us to run everything at once.
workflow() %>%
add_recipe(neural_recipe) %>%
add_model(nnet_model) -> neural_workflow
neural_workflow
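The command which actually fits the workflow isn’t shown on this page; based on the workflow above
and the variable name used below, it would look something like this, fitting against the training data from
the earlier split:

# Fit the whole workflow (pre-processing recipe plus model) to the training data
neural_workflow %>%
  fit(training(split_data)) -> neural_fit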
This will take a couple of minutes to complete. Once complete we can see the fitted model with
neural_fit
You should see that a load more parameters have now been set because the model and the
pre-processing have been finalised.
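The step which creates the neural_predictions variable referred to below isn’t shown on this page;
following the same pattern as for the random forest model, it would be something like:

# Predict the test data and join the predictions to the known answers
neural_fit %>%
  predict(testing(split_data)) %>%
  bind_cols(testing(split_data)) -> neural_predictions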
You can look at the contents of the neural_predictions variable to get an idea of how well it did.
Now we can calculate some of the standard metrics from this. We can start with a simple confusion table.
neural_predictions %>%
group_by(.pred_class,transmembrane) %>%
count()
To make this easier to read we can spread the predicted classes into columns to give a more
conventional confusion matrix layout.

neural_predictions %>%
group_by(.pred_class,transmembrane) %>%
count() %>%
pivot_wider(
names_from=.pred_class,
values_from=n,
names_prefix = "predicted_"
) %>%
rename(true_transmembrane=transmembrane)
Finally we can calculate the overall accuracy and kappa, along with the sensitivity and specificity, in the
same way as we did before.

neural_predictions %>%
metrics(transmembrane, .pred_class)
neural_predictions %>%
sens(transmembrane, .pred_class)
neural_predictions %>%
spec(transmembrane, .pred_class)
The preparation of the data will initially be the same as before, but then we will hit some changes.
For the model you are going to use a k-nearest neighbours (knn) model, and let the number of neighbours
be a tuneable parameter, as sketched below.
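A minimal sketch of what that model specification might look like, assuming the kknn engine, and saving
it to a variable (here called knn_model) so it can be added to a workflow:

# A knn classifier where the number of neighbours is left as a tuneable parameter
nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification") -> knn_model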
For the data you need to build a 10-fold cross-validation split of the full dataset, rather than a single 80%
split.
vfold_cv(
data,
v=10
) -> vdata
You can then build a workflow from the model and data, using the same formula as before (see the
sketch below).
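One way this might look, assuming the knn_model variable from the sketch above; add_formula() is used
here in place of a recipe, and the result is assigned to a variable called workflow so that it matches the
code which follows:

# Combine the prediction formula and the knn model into a single workflow
workflow() %>%
  add_formula(transmembrane ~ .) %>%
  add_model(knn_model) -> workflow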
Once you have the workflow you can look at the tuneable parameters.
workflow %>%
extract_parameter_set_dials()
…and from these we want to change the neighbors parameter to run from 1 to 50
workflow %>%
extract_parameter_set_dials() %>%
update(
neighbors = neighbors(c(1,50))
) -> tune_parameters
We’re then going to run the workflow, generating a regular grid of 20 values over the 1-50 range, and
measuring both the sensitivity and specificity of the model.
workflow %>%
tune_grid(
vdata,
grid = grid_regular(tune_parameters, levels=20),
metrics = metric_set(sens,spec)
) -> tune_results
Finally we can plot out the tuned results to see which value for k we think is best.
autoplot(tune_results)
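If you would rather pull out the best value of k programmatically than read it from the plot, the tune
package also has helpers for this; for example, ranking by sensitivity (choosing that metric is just one
option):

# Show the top performing values of k, ranked by sensitivity
show_best(tune_results, metric = "sens")

# ...or extract just the single best value
select_best(tune_results, metric = "sens")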