
Exercises:

Introduction to Machine
Learning

Version 2023-08

Licence
This manual is © 2023, Simon Andrews, Laura Biggins.

This manual is distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.0
licence. This means that you are free:

• to copy, distribute, display, and perform the work

• to make derivative works

Under the following conditions:

• Attribution. You must give the original author credit.

• Non-Commercial. You may not use this work for commercial purposes.

• Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work
only under a licence identical to this one.

Please note that:

• For any reuse or distribution, you must make clear to others the licence terms of this work.
• Any of these conditions can be waived if you get permission from the copyright holder.
• Nothing in this license impairs or restricts the author's moral rights.

Full details of this licence can be found at


http://creativecommons.org/licenses/by-nc-sa/2.0/uk/legalcode

Exercise 1: Running machine learning models


In this exercise we have given you a filtered subset of the data in GSE1133, which is a microarray study
measuring gene expression across a panel of around 90 different tissues.

The aim of the model is to try to predict which genes are involved in development. This is defined based
on the “Developmental Process” Gene Ontology category (GO:0032502).

A snapshot of the first part of the data looks like this:

The stats for the dataset are:


• There are 1241 measured genes
• 522 of the genes are development genes, 719 are not
• There are 92 variables (tissues) we can use for prediction. All of them are quantitative so are
compatible with all model types
• The variable to predict is Categorical (Development or Not Development) so we can only use
models with a Categorical output.
• Although there is a gene column we aren’t going to use it for prediction: it holds a categorical
value which is different for every gene, so it can’t carry any predictive information.

Running Models
To let you try out some of these models you can go to:

https://www.bioinformatics.babraham.ac.uk/shiny/machinelearning/

Here we have built a simple interface which lets you run a variety of different model types on this data.
Just select the model you want to run from the drop-down box and press the “Run Model” button.

After the model has run you will see some information about the model on the left which summarises the
parameters which were used to run it – you should be able to match these to the theory we talked about
before.

On the right you will see a summary of some predictions made by the model. We have run two sets of
data through the model.

1. We re-ran the data used to train the model back through it to see how well it is able to predict
data it has seen before.

2. We set aside a portion of the original data before training the model and then ran this through
the model after it was trained to see how well it works against data it hasn’t seen before.

Results

This table shows a summary of the predictions the model made and how they matched against the known
correct values in the data. It’s important to validate a model against data where you know the answer,
before using it to make predictions on data where the answer isn’t known.

In this table you can see the total number of correct (TRUE) and incorrect (FALSE) predictions the model
made.

Questions:
Run the different models, look at their output and the summary of the predictions they make, then
answer the questions below.

1. Do all of the models perform similarly well, or are some better than others?

2. Do the models perform similarly well on the data they have seen before and the data they haven’t
seen before?

3. Do the more complex models perform better than the simpler ones?

4. If you run each model a couple of times, do the results change? If they only change for some of
the models, why is this?

5. If you hadn’t run a model, but had simply assigned the most frequent category (Not
Development) to every prediction, how many correct answers would you expect to have seen in
the test data of 249 samples? Do any of the models do substantially better than this?

Changing Model Parameters


We have a second interface which lets you rerun the random forest model whilst changing the
parameters used to construct it:

https://www.bioinformatics.babraham.ac.uk/shiny/optimising_model/

In this version you can change the total number of trees constructed, the number of random predictors
selected at each branch point, and the minimum number of measures which must appear in a node at
the bottom of a tree, which stops the trees from becoming too complex.

Try running the model a few times and seeing what effect changing these parameters has on the results.

What settings would you have to use to mimic a conventional decision tree?

What do you think the effect of changing the different parameters would be? Do you see this in the
results?

Exercise 2: Evaluating Models


Now that you have learned about the ways that models can be tested, start by looking back at the
results you found in Exercise 1 and at the additional metrics which were supplied alongside the raw
results.

1. Are the models actually identifying developmental genes at a rate which is significantly higher
than you’d get by guessing?

2. What is the balance in the models between sensitivity (the ability to say that a developmental
gene is a developmental gene) and specificity (the ability to identify non-developmental genes)?

3. Are there differences in the sensitivity / specificity trade-off between the models?

4. Which models appear to be most strongly overfitted to the training data (do well on training, and
poorly on testing)?

Exercise 3: Building your first model


In this exercise you will build your first tidymodels model. You are going to use a dataset comprising all
of the canonical proteins in the mouse genome. For each of these you have some basic information
about the gene, transcript and protein, plus the compositional breakdown of the protein into its
component amino acids.

The aim of your model is to predict which of these proteins contain one or more transmembrane
segments, meaning the protein is normally found embedded within a membrane.

To do this you are going to build and train a random forest model. The steps in the modelling procedure
will be:

1. Load the R packages we’re going to need for this analysis

2. Load in the original data

3. Prepare the data for modelling


a. Convert the variable to predict to be a factor
b. Remove the gene_id column
c. Shuffle the rows
d. Remove proteins with missing data

4. Split the data into a training and testing subset

5. Build the model

6. Train the model using the training data

7. Predict the transmembrane proteins from the testing data

8. Check how good the predictions are

Below we will talk you through how to construct a script in RStudio to perform all of these steps. In an
actual modelling experiment we would include more evaluation of the data before starting on the
modelling, so this is a somewhat truncated version of the full procedure you’d use.

To get started you need to open a new R script, save it, then set the location of the data you’re going to
use.

Setting up your environment


Inside RStudio select

File > New File > R Script

Once the script has opened go to File > Save As and save it into the MachineLearningData folder in a
file called model.R

In the RStudio menu select Session > Set Working Directory > To Source File Location

Loading the R packages we need


We will be using two packages in this script: the tidyverse package, which will do the general data
manipulation for us, and the tidymodels package, which will do the modelling.

We can load these with

library(tidyverse)
library(tidymodels)
tidymodels_prefer()

The last line here simply says that we should always use functions from tidymodels, even if another
function with the same name exists in a different package.

Loading the input data


To load the data from the TSV file it’s saved in we need to do

read_delim("transmembrane_data.txt") -> data

You can then click on the data in the Environment tab (top right) and have a look at what the data looks
like.

Preparing the data for modelling


Turning transmembrane into a factor
If a column is going to be used as the value to predict then it must have a data type of “factor” which is
a data type specifically used to represent data which can hold one of a defined set of values. Our
transmembrane predictions are currently just in a text column so we need to change that.

data %>%
mutate(
transmembrane = factor(transmembrane)
) -> data

After you’ve run this, hold your mouse over the transmembrane column header when looking at the data.
It should now say that it is a factor.

Removing the gene_id column


In our data the gene_id column just holds the name of the gene. This isn’t useful in the model and will
just slow things down or cause the model to overfit, so we need to remove it.

data %>%
select(-gene_id) -> data

You should now see that the gene_id column has gone, and that the transmembrane column is now the
first one.

Shuffling the rows


For some types of model there may be information contained in the order the rows appear (for example
if all of the transmembrane proteins were next to each other). To prevent this information from having
any effect we can just shuffle all of the rows.

data %>%
sample_frac() -> data

This won’t change the structure of your data, but where the original data put all proteins from the same
chromosome together you should now see that they are mixed up.
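One thing to note is that the shuffle (along with the data split and the model fitting later on) uses R’s
random number generator, so the exact rows and results will differ slightly every time you run the script.
If you want a run to be fully reproducible you can set the random seed near the top of your script; a
minimal sketch (the value 42 is an arbitrary choice):

# Optional: make the shuffling, splitting and model fitting reproducible between runs
set.seed(42)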

Removing missing values


We will remove any rows in which any of the columns have missing values.

data %>%
na.omit() -> data

After running this you should see that the number of rows in the data goes down from 19,701 to 18,352.

Because we are going to run a random forest model this is all of the preparation we need to do. Later
we may try other model types where we would need to normalise or rescale the data into a better-behaved
quantitative form, but tree-based models don’t care about the scale or distribution of the predictors.

Splitting the data


Before we construct the model we must set aside some of the data for testing, so that we aren’t using the
same data to test the model as we used to train it.

data %>%
initial_split(prop=0.8, strata=transmembrane) -> split_data

This will split off 80% of our data to be used for training and 20% for testing. The strata argument makes
sure that the proportion of transmembrane and non-transmembrane proteins is roughly the same in both
subsets.

We can see the data in the two subsets by running:

training(split_data)

..or..

testing(split_data)

You should see about 14,600 rows in the training data and about 3,600 in the testing.
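If you want to check the split, and confirm that the strata argument has kept the proportion of
transmembrane proteins similar in the two subsets, you can count the classes in each; a quick sketch:

# Count the transmembrane / soluble proteins in each subset
training(split_data) %>%
  count(transmembrane)

testing(split_data) %>%
  count(transmembrane)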

Building the model


Now all the data is prepared we can go on and build a model. We’re going to build a random forest model
using the ranger engine. We also need to tell it that it’s going to make a classification prediction.

rand_forest(trees=100) %>%
set_engine("ranger") %>%
set_mode("classification") -> forest_model

To see the model you can run

forest_model %>% translate()

Note that a lot of the options in the model fit template are set to “missing_arg()” which means that
they are values we will need to supply later in the process.
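Only the number of trees is being set explicitly here; everything else takes the engine’s defaults. If you
wanted to also control the number of predictors sampled at each branch point and the minimum node
size (the other two parameters you explored in Exercise 1), rand_forest() accepts mtry and min_n
arguments. A sketch with arbitrary example values, which would replace the simpler definition above:

# mtry = number of predictors sampled at each split, min_n = minimum node size
# (the values here are arbitrary examples, not recommendations)
rand_forest(trees = 100, mtry = 10, min_n = 5) %>%
  set_engine("ranger") %>%
  set_mode("classification") -> forest_model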

Training the model


We now need to train the model. We are going to give it the training data from our split data, and we’re
going to tell it that it should try to predict the transmembrane values using all of the rest of the columns.

forest_model %>%
fit(transmembrane ~ ., data=training(split_data)) -> forest_fit

Once the model is fit we can see it by running

forest_fit

We should see all of the variables for the model in place, and see some of the details of the data and the
fit (number of variables and cases etc).

Testing the model


To test the model we need to use it to make predictions about data where we know the answer, which is
what our testing data is for. We are going to use the predict function for this; to make a prediction we
pass in a new dataset with the same variables as the training data.

forest_fit %>%
predict(testing(split_data))

Which will give us something like:



# A tibble: 3,671 × 1
.pred_class
<fct>
1 Soluble
2 Soluble
3 Transmembrane
4 Soluble
5 Soluble
6 Soluble
7 Soluble
8 Soluble
9 Soluble
10 Soluble

The problem with this is that it only outputs the predictions; we don’t see the rest of the data, including
the column which says what the answer should have been, so we need to join those predictions back to
the testing data.

forest_fit %>%
predict(testing(split_data)) %>%
bind_cols(testing(split_data)) -> prediction_results

You can now click on the prediction_results in the environment window to see the predictions (in
the .pred_class column) alongside the known correct answers (in the transmembrane column)

Evaluating the predictions


From the set of predictions we can now see how well the model actually did by comparing the predictions
to the known true values.

We can start by simply counting the number of times we see different combinations of predictions and
true values in the data.

prediction_results %>%
group_by(transmembrane, .pred_class) %>%
count()

From this you can see how many times correct and incorrect predictions were made, and the breakdown
of the mistakes which were made.
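The yardstick package (loaded as part of tidymodels) also provides a conf_mat() helper which lays the
same counts out as a conventional confusion matrix, if you prefer that view; a minimal sketch:

prediction_results %>%
  conf_mat(truth = transmembrane, estimate = .pred_class)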

We can also get more specific values for sensitivity and specificity

prediction_results %>%
sens(transmembrane, .pred_class)

..and..

prediction_results %>%
spec(transmembrane, .pred_class)

Finally we can get an overall accuracy value, and we can also get Cohen’s kappa, which compares the
observed accuracy with the accuracy you would expect by chance: a kappa near 0 means the model is
doing no better than guessing, while a value near 1 means almost perfect agreement.

prediction_results %>%
metrics(transmembrane, .pred_class)

What is your evaluation of how well the model has performed? Feel free to try playing with the setup
parameters for the model to see if you can improve on the initial performance. Remember though that
there is a random component, so just because a model works better once doesn’t mean that those
settings will always be better.

Exercise 4: Using Recipes and Workflows


We’re going to build another model from the same transmembrane data as before, but this time we’re
constructing a neural net.

Because neural networks place more constraints on the data which goes into the model we’re going to
have to do more pre-processing, and we’ll have to apply it to both the training and testing data (and to
any unknown proteins in future). To automate this we’re going to use a recipe, and we’re going to
integrate it into a workflow to run everything.

For the first part of the model where we:

1. Loaded the required packages


2. Loaded the data
3. Prepared the data
4. Split the data into training and testing

We can follow the same steps as before, or we can use the same split_data variable as for the
random forest model.

Building a Recipe
Firstly we’re going to build a recipe which will combine the formula for prediction and the training data.
Once we have it we can then add steps to it to complete the pre-processing.

recipe(
transmembrane ~ . ,
data=training(split_data)
) -> neural_recipe

We can then view the recipe with

neural_recipe

Now we have a recipe we can add processing steps to it. The steps will be:

1. Log transform the gene_length and transcript_length columns


2. Z-Score normalise all of the numeric columns
3. Turn all of the text columns into dummy number columns

neural_recipe %>%
step_log(gene_length, transcript_length) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors()) -> neural_recipe

Look at the recipe again to see the new steps have been added.
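At this point the recipe is only a description of the pre-processing; the workflow we build below will
estimate and apply it automatically when the model is trained. If you would like to preview what the
processed training data will look like, you can prep() the recipe and then bake() it; a small sketch for
inspection only:

# Estimate the pre-processing steps from the training data and apply them to it
neural_recipe %>%
  prep() %>%
  bake(new_data = NULL)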

Building the model


We can now create the neural network model. We’re going to use a single hidden layer with 10 nodes
in it. You could play around with the settings made here once you had the basic model in place.

mlp(
epochs = 1000,
hidden_units = 10,
penalty = 0.01,
learn_rate = 0.01
) %>%
set_engine("brulee", validation = 0) %>%
set_mode("classification") -> nnet_model

The arguments here are as follows:


• epochs = how many rounds of refinement (back propagation) the model goes through
• hidden_units = how many nodes we want in the hidden layer.
• penalty = a value which penalises complexity in the model to try to prevent overfitting
• learn_rate = how far the model’s weights are moved at each update step when optimising the model

Again, these values could be modified after generating an initial model, but these will give us something
to work from.

We can see the model with

nnet_model %>% translate()

Building a workflow
A workflow will combine the recipe and the model together and will allow us to run everything at once.

workflow() %>%
add_recipe(neural_recipe) %>%
add_model(nnet_model) -> neural_workflow

We can view the workflow with

neural_workflow

Training the model via the workflow


To train the model we run the fit function and pass in our training data. This will preprocess the data
then feed it to the model.

fit(neural_workflow,data=training(split_data)) -> neural_fit

This will take a couple of minutes to complete. Once complete we can see the fitted model with

neural_fit

You should see that many more parameters have now been set because the model and the
pre-processing have been finalised.

Evaluating the Model


We can now use the model to make predictions on our testing data to see how well it is performing. As
before, the predict function only returns the predictions, so we need to bind the results to the testing
data itself so we can see the predictions alongside the known correct values.

predict(neural_fit, new_data=testing(split_data)) %>%
bind_cols(testing(split_data)) %>%
select(.pred_class, transmembrane) -> neural_predictions

You can look at the contents of the neural_predictions variable to get an idea of how well it did.

Now we can calculate some of the standard metrics from this. We can start with a simple confusion table.

neural_predictions %>%
group_by(.pred_class,transmembrane) %>%
count()

..or if we want to be fancier…

neural_predictions %>%
group_by(.pred_class,transmembrane) %>%
count() %>%
pivot_wider(
names_from=.pred_class,
values_from=n,
names_prefix = "predicted_"
) %>%
rename(true_transmembrane=transmembrane)

We can also calculate the specific metrics

neural_predictions %>%
metrics(transmembrane, .pred_class)

neural_predictions %>%
sens(transmembrane, .pred_class)

neural_predictions %>%
spec(transmembrane, .pred_class)
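So far we have only looked at the hard class predictions. Classification models can also return class
probabilities (by calling predict with type="prob"), which allows threshold-free metrics such as the area
under the ROC curve. A sketch, assuming the probability columns are named after the factor levels
Soluble and Transmembrane as in the predictions we saw earlier:

# Soluble is the first factor level, so .pred_Soluble is the default "event" column for roc_auc
predict(neural_fit, new_data=testing(split_data), type="prob") %>%
  bind_cols(testing(split_data)) %>%
  roc_auc(transmembrane, .pred_Soluble)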

Additional Exercise: Tuning models


For a final, more challenging exercise we are going to rerun the transmembrane analysis, but this time
using a k-nearest neighbour model. As well as changing the model type we will also try to optimise the
number of nearest neighbours to use.

The preparation of the data will be the same as before initially, but then we will hit some changes.

For the model you are going to use a k-nearest neighbour (knn) model, and let the number of neighbours
be a tuneable parameter:

nearest_neighbor(neighbors = tune(), weight_func = "triangular") %>%
set_mode("classification") %>%
set_engine("kknn") -> model

For the data you need to build a 10-fold cross-validation split of the full dataset, rather than a single
80%/20% split.

vfold_cv(
data,
v=10
) -> vdata

You can then build a workflow from the model and data using the same formula as before.
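A minimal sketch of that step is shown below; it assumes the result is saved under the name workflow,
which is what the code that follows uses:

workflow() %>%
  add_model(model) %>%
  add_formula(transmembrane ~ .) -> workflow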

Once you have the workflow you can look at the tuneable parameters.

workflow %>%
extract_parameter_set_dials()

…and from these we want to change the neighbors parameter to run from 1 to 50

workflow %>%
extract_parameter_set_dials() %>%
update(
neighbors = neighbors(c(1,50))
) -> tune_parameters

We’re then going to tune the workflow over a regular grid of 20 values spanning the 1 to 50 range. We
are going to measure both the sensitivity and specificity of the model.

workflow %>%
tune_grid(
vdata,
grid = grid_regular(tune_parameters, levels=20),
metrics = metric_set(sens,spec)
) -> tune_results

Finally we can plot out the tuned results to see which value for k we think is best.

autoplot(tune_results)
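Once you have decided which number of neighbours looks best, you can fix it in the workflow and fit a
final model on the full dataset. A sketch using the standard tune helpers, assuming you select the value
with the best sensitivity (you could equally select by specificity, or choose a compromise value by eye
from the plot):

# Pick the best number of neighbours by sensitivity and lock it into the workflow
select_best(tune_results, metric = "sens") -> best_neighbours

workflow %>%
  finalize_workflow(best_neighbours) %>%
  fit(data = data) -> final_knn_fit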
