
Worksheet 6 - Classification

Lecture and Tutorial Learning Goals:


After completing this week's lecture and tutorial work, you will be able to:

• Recognize situations where a simple classifier would be appropriate for making predictions.
• Explain the k-nearest neighbour classification algorithm.
• Interpret the output of a classifier.
• Compute, by hand, the distance between points when there are two explanatory variables/predictors.
• Describe what a training data set is and how it is used in classification.
• Given a dataset with two explanatory variables/predictors, use k-nearest neighbour classification in R using the tidymodels framework to predict the class of a single new observation.

This worksheet covers parts of the Classification I chapter of the online textbook. You should
read this chapter before attempting the worksheet.

### Run this cell before continuing.


library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
source('tests.R')
source('cleanup.R')

Question 0.1 Multiple Choice: {points: 1}

Which of the following statements is NOT true of a training data set (in the context of
classification)?

A. A training data set is a collection of observations for which we know the true classes.

B. We can use a training set to explore and build our classifier.

C. The training data set is the underlying collection of observations for which we don't know the
true classes.

Assign your answer to an object called answer0.1. Make sure the correct answer is an
uppercase letter. Remember to surround your answer with quotation marks (e.g. "D").

# Replace the fail() with your answer.

# your code here


fail() # No Answer - remove if you provide an answer
test_0.1()

Question 0.2 Multiple Choice {points: 1}

(Adapted from James et al., "An introduction to statistical learning" (page 53))

Consider the scenario below:

We collect data on 20 similar products. For each product we have recorded whether it was a
success or failure (labelled as such by the Sales team), price charged for the product, marketing
budget, competition price, customer data, and ten other variables.

Which of the following is a classification problem?

A. We are interested in comparing the profit margins for products that are a success and
products that are a failure.

B. We are considering launching a new product and wish to know whether it will be a success or
a failure.

C. We wish to group customers based on their preferences and use that knowledge to develop
targeted marketing programs.

Assign your answer to an object called answer0.2. Make sure the correct answer is an
uppercase letter. Remember to surround your answer with quotation marks (e.g. "F").

# Replace the fail() with your answer.

# your code here


fail() # No Answer - remove if you provide an answer

test_0.2()

1. Breast Cancer Data Set


We will work with the breast cancer data from this week's pre-reading.

Note that the breast cancer data in this worksheet have been standardized (centred
and scaled) for you already. We will implement these steps ourselves in future
worksheets/tutorials; for now, just know that the data has been standardized.
Because of this, the variables are unitless, which is why variables like Radius
can take zero and negative values.
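As a minimal sketch of what standardization does (using made-up numbers, not the actual preprocessing code, which is covered in a later chapter):

```r
# Standardization (centring and scaling) of a numeric vector:
# subtract the mean, then divide by the standard deviation.
x <- c(10, 12, 15, 18, 20)            # made-up raw measurements
x_standardized <- (x - mean(x)) / sd(x)
x_standardized                         # unitless; mean 0, sd 1
```

After this transformation the values are centred around zero, which is why standardized variables can be zero or negative even when the raw measurements cannot.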

Question 1.0 {points: 1}

Read the clean-wdbc-data.csv file (found in the data directory) using the read_csv()
function into the notebook and store it as a data frame. Name it cancer.

# your code here


fail() # No Answer - remove if you provide an answer
cancer
test_1.0()

Question 1.1 True or False: {points: 1}

After looking at the first six rows of the cancer data frame, suppose we asked you to predict the
variable "area" for a new observation. Is this a classification problem?

Assign your answer to an object called answer1.1. Make sure the correct answer is written in
lower-case. Remember to surround your answer with quotation marks (e.g. "true" / "false").

# Replace the fail() with your answer.

# your code here


fail() # No Answer - remove if you provide an answer

test_1.1()

We will be treating Class as a categorical variable, so we should convert it into a factor using
the as_factor() function.

# run this cell


cancer <- cancer |>
mutate(Class = as_factor(Class))

Question 1.2 {points: 1}

Create a scatterplot of the data with Symmetry on the x-axis and Radius on the y-axis. Modify
your aesthetics by colouring for Class. As you create this plot, ensure you follow the guidelines
for creating effective visualizations. In particular, note on the plot axes whether the data is
standardized or not.

Assign your plot to an object called cancer_plot.

# your code here


fail() # No Answer - remove if you provide an answer
cancer_plot

test_1.2()

Question 1.3 {points: 1}

Just by looking at the scatterplot above, how would you classify an observation with Symmetry
= 1 and Radius = 1 (Benign or Malignant)?

Assign your answer to an object called answer1.3. Make sure the correct answer is written
fully. Remember to surround your answer with quotation marks (e.g. "Benign" / "Malignant").

# Replace the fail() with your answer.


# your code here
fail() # No Answer - remove if you provide an answer

test_1.3()

We will now compute the distance between the first and second observation in the breast
cancer dataset using the explanatory variables/predictors Symmetry and Radius. Recall we can
calculate the distance between two points using the following formula:

Distance = sqrt((ax - bx)^2 + (ay - by)^2)
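The formula translates directly into R. As a quick sketch with made-up coordinates (not values from the cancer data):

```r
# Two hypothetical points a = (ax, ay) and b = (bx, by)
ax <- 1.2
ay <- -0.5
bx <- 0.3
by <- 0.8

# Euclidean distance between a and b
distance <- sqrt((ax - bx)^2 + (ay - by)^2)
distance
```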

Question 1.4 {points: 1}

First, extract the coordinates for the two observations and assign them to objects called:

• ax (Symmetry value for the first row)


• ay (Radius value for the first row)
• bx (Symmetry value for the second row)
• by (Radius value for the second row).

Scaffolding for ax is given.


Note that we are using the pull() function because we want a numeric value as our
output, rather than a tibble-type object, so that we can do calculations with it
later on. You can verify an object's type in R with the class() function. Check
the class of ax with and without pull() and see what you get!

#ax <- slice(cancer, 1) |>


# pull(Symmetry)

# your code here


fail() # No Answer - remove if you provide an answer

test_1.4()

Question 1.5 {points: 1}

Plug the coordinates into the distance equation.

Assign your answer to an object called answer1.5.


Fill in the ... in the cell below. Copy and paste your finished answer into the fail().

#... <- sqrt((... - ...)^... + (... - ...)^...)

# your code here


fail() # No Answer - remove if you provide an answer
answer1.5

test_1.5()
Question 1.6 {points: 1}

Now we'll do the same thing with 3 explanatory variables/predictors: Symmetry, Radius and
Concavity. Again, use the first two rows in the data set as the points you are calculating the
distance between (point a is row 1, and point b is row 2).

Find the coordinates for the third variable (Concavity) and assign them to objects called az and
bz. Use the scaffolding given in Question 1.4 as a guide.

# your code here


fail() # No Answer - remove if you provide an answer

test_1.6()

Question 1.7 {points: 1}

Again, calculate the distance between the first and second observation in the breast cancer
dataset using 3 explanatory variables/predictors: Symmetry, Radius and Concavity.

Assign your answer to an object called answer1.7. Use the scaffolding given to calculate
answer1.5 as a guide.

# your code here


fail() # No Answer - remove if you provide an answer
answer1.7

test_1.7()

Question 1.8 {points: 1}

Let's do this without explicitly making coordinate variables!

Create a vector of the coordinates for each point. Name one vector point_a and the other
vector point_b. Within the vector, the order of coordinates should be: Symmetry, Radius,
Concavity.

Fill in the ... in the cell below. Copy and paste your finished answer into the fail().

Here we will use select and as.numeric instead of pull because we need the
numeric values of 3 columns. pull, which we used previously, only extracts the
values from a single column.

# This is only the scaffolding for one vector


# You need to make another one for row number 2!

#... <- slice(cancer, 1) |>


# select(..., Radius, ...) |>
# as.numeric()

# your code here


fail() # No Answer - remove if you provide an answer
point_a
point_b

test_1.8()

Question 1.9 {points: 1}

Compute the squared differences between the two vectors, point_a and point_b. The result
should be a vector of length 3 named dif_square. Hint: ^ is the exponent symbol in R.

# your code here


fail() # No Answer - remove if you provide an answer
dif_square

test_1.09()

Question 1.09.1 {points: 1}

Sum the squared differences between the two vectors, point_a and point_b. The result
should be a single number named dif_sum.

Hint: the sum function in R returns the sum of the elements of a vector

# your code here


fail() # No Answer - remove if you provide an answer
dif_sum

test_1.09.1()

Question 1.09.2 {points: 1}

Square root the sum of your squared differences. The result should be a double named
root_dif_sum.

# your code here


fail() # No Answer - remove if you provide an answer
root_dif_sum

test_1.09.2()

Question 1.09.3 {points: 1}

If we have more than a few points, calculating distances as we did in the previous questions is
VERY tedious. Let's use the dist() function to find the distance between the first and second
observation in the breast cancer dataset using Symmetry, Radius and Concavity.
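To see how dist() behaves before applying it to the cancer data, here is a sketch on a small made-up data frame (not part of the question):

```r
library(tibble)

# Toy data frame with made-up coordinates
toy <- tibble(x = c(0, 3, 0),
              y = c(0, 4, 1))

# dist() returns all pairwise Euclidean distances between the rows;
# e.g., the distance between rows 1 and 2 is sqrt(3^2 + 4^2) = 5.
dist(toy)
```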

Fill in the ... in the cell below. Copy and paste your finished answer into the fail().

Assign your answer to an object called dist_cancer_two_rows.


# ... <- cancer |>
# slice(1,2) |>
# select(..., ..., Concavity) |>
# dist()

# your code here


fail() # No Answer - remove if you provide an answer
dist_cancer_two_rows

test_1.09.3()

Question 1.09.4 True or False: {points: 1}

Compare answer1.7, root_dif_sum, and dist_cancer_two_rows.

Are they all the same value?

Assign your answer to an object called answer1.09.4. Make sure the correct answer is written
in lower-case. Remember to surround your answer with quotation marks (e.g. "true" / "false").

# Replace the fail() with your answer.

# your code here


fail() # No Answer - remove if you provide an answer

test_1.09.4()

2. Classification - A Simple Example Done Manually


Question 2.0.0 {points: 1}

Let's take a random sample of 5 observations from the breast cancer dataset using the
sample_n function. To make this random sample reproducible, we will use set.seed(20).
This starts the random number generator at the same point each time the code is run,
so we always get back the same random sample.
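A quick illustration of why set.seed makes sampling reproducible (a sketch using made-up numbers, separate from the question itself):

```r
# Drawing twice with the same seed gives identical samples.
set.seed(20)
first_draw <- sample(1:100, 3)

set.seed(20)
second_draw <- sample(1:100, 3)

identical(first_draw, second_draw)   # TRUE: same seed, same sample
```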

We will focus on the predictors Symmetry and Radius only. Thus, we will need to select the
columns Symmetry and Radius and Class. Save these 5 rows and 3 columns to a data frame
named small_sample.

Fill in the ... in the scaffolding provided below.

#set.seed(20)
#... <- sample_n(cancer, 5) |>
# select(...)

# your code here


fail() # No Answer - remove if you provide an answer

test_2.0.0()
Question 2.0.1 {points: 1}

Finally, create a scatter plot where Symmetry is on the x-axis, and Radius is on the y-axis.
Color the points by Class. Name your plot small_sample_plot.

Fill in the ... in the scaffolding provided below.

As you create this plot, ensure you follow the guidelines for creating effective visualizations. In
particular, note on the plot axes whether the data is standardized or not.

#... <- ...|>


# ggplot(...) +
# geom_...() +
# ...

# your code here


fail() # No Answer - remove if you provide an answer
small_sample_plot

test_2.0.1()

Question 2.1 {points: 1}

Suppose we are interested in classifying a new observation with Symmetry = 0.5 and Radius
= 0, but unknown Class. Using the small_sample data frame, add another row with
Symmetry = 0.5, Radius = 0, and Class = "unknown".

Fill in the ... in the scaffolding provided below.

Assign your answer to an object called newData.

#... <- ... |>


# add_row(Symmetry = ..., ... = 0, Class = ...)

# your code here


fail() # No Answer - remove if you provide an answer
newData

test_2.1()

Question 2.2 {points: 1}

Compute the distance between each pair of the 6 observations in the newData dataframe using
the dist() function based on two variables: Symmetry and Radius. Fill in the ... in the
scaffolding provided below.

Assign your answer to an object called dist_matrix.

# ... <- newData |>


# select(..., ...) |>
# ...() |>
# as.matrix()

# your code here


fail() # No Answer - remove if you provide an answer
dist_matrix

test_2.2()

Question 2.3 Multiple Choice: {points: 1}

In the table above, the row and column numbers reflect the row numbers from the data frame the
dist function was applied to. Thus numbers 1 to 5 are the points/observations from rows 1 to 5
in the small_sample data frame, and row 6 is the new observation whose diagnosis class we do
not know. The values in dist_matrix are the distances between the points of the corresponding
row and column numbers. For example, the distance between point 2 and point 4 is 4.196683,
and the distance between point 3 and point 3 (the same point) is 0.

Which observation is the nearest to our new point?

Assign your answer to an object called answer2.3. Make sure your answer is a number.
Remember to surround your answer with quotation marks (e.g. "8").

# Replace the fail() with your answer.

# your code here


fail() # No Answer - remove if you provide an answer

test_2.3()

Question 2.4 Multiple Choice: {points: 1}

Use the K-nearest neighbour classification algorithm with K = 1 to classify the new observation
using your answers to Questions 2.2 & 2.3. Is the new data point predicted to be benign or
malignant?

Assign your answer to an object called answer2.4. Make sure the correct answer is written fully
as either "Benign" or "Malignant". Remember to surround your answer with quotation marks.

# Replace the fail() with your answer.

# your code here


fail() # No Answer - remove if you provide an answer

test_2.4()

Question 2.5 Multiple Choice: {points: 1}

Using your answers to Questions 2.2 & 2.3, what are the three closest observations to your new
point?

A. 1, 3, 2
B. 5, 1, 4

C. 5, 2, 4

D. 3, 4, 2

Assign your answer to an object called answer2.5. Make sure the correct answer is an
uppercase letter. Remember to surround your answer with quotation marks (e.g. "F").

# Replace the fail() with your answer.

# your code here


fail() # No Answer - remove if you provide an answer

test_2.5()

Question 2.6 Multiple Choice: {points: 1}

We will now use the K-nearest neighbour classification algorithm with K = 3 to classify the new
observation using your answers to Questions 2.2 & 2.3. Is the new data point predicted to be
benign or malignant?

Assign your answer to an object called answer2.6. Make sure the correct answer is written
fully. Remember to surround your answer with quotation marks (e.g. "Benign" / "Malignant").

# Replace the fail() with your answer.

# your code here


fail() # No Answer - remove if you provide an answer

test_2.6()

Question 2.7 {points: 1}

Compare your answers in 2.4 and 2.6. Are they the same?

Assign your answer to an object called answer2.7. Make sure the correct answer is written in
lower-case. Remember to surround your answer with quotation marks (e.g. "yes" / "no").

# Replace the fail() with your answer.

# your code here


fail() # No Answer - remove if you provide an answer

test_2.7()

3. Using tidymodels to perform k-nearest neighbours


Now that we understand how K-nearest neighbours (k-nn) classification works, let's get familiar
with the tidymodels R package. The benefit of using tidymodels is that it keeps our code
simple, readable and accurate. Writing less code, in a tidier format, means there is less chance
for errors to occur.

We'll again focus on Radius and Symmetry as the two predictors. This time, we would like to
predict the class of a new observation with Symmetry = 1 and Radius = 0. This one is a bit
tricky to do visually from the plot below, and so is a motivating example for us to compute the
prediction using k-nn with the tidymodels package. Let's use K = 7.

# you can change the size of the plot if it is too small or too large
options(repr.plot.width = 8, repr.plot.height = 6)
cancer_plot
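Before filling in the questions below, here is a hedged sketch of the overall tidymodels k-nn pattern on a small made-up data frame (not the cancer data, and with K = 3 rather than the K = 7 asked for below; assumes the kknn package is installed):

```r
library(tidymodels)

# Made-up training data: two numeric predictors and a two-level class
toy_train <- tibble(x1 = c(0.0, 0.2, 1.8, 2.0, 0.1, 1.9),
                    x2 = c(0.0, 0.1, 2.1, 1.9, 0.3, 2.2),
                    label = as_factor(c("A", "A", "B", "B", "A", "B")))

# 1. Specify the model: straight-line distance, 3 neighbours
toy_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
  set_engine("kknn") |>
  set_mode("classification")

# 2. Fit it to the training data
toy_fit <- toy_spec |>
  fit(label ~ x1 + x2, data = toy_train)

# 3. Predict the class of a new observation near the "A" cluster
predict(toy_fit, tibble(x1 = 0.1, x2 = 0.2))
```

The same three steps (specify, fit, predict) structure the questions that follow.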

Question 3.1 {points: 1}

Create a model specification for K-nearest neighbours classification by using the


nearest_neighbor() function. Specify that we want to set k = 7 and use the straight-line
distance. Furthermore, specify the computational engine to be "kknn" for training the model
with the set_engine() function. Finally, identify that this is a classification problem with the
set_mode() function.

Name your model specification knn_spec.

#... <- nearest_neighbor(weight_func = ..., neighbors = ...) |>


# set_engine(...) |>
# set_mode(...)

# your code here


fail() # No Answer - remove if you provide an answer
knn_spec

test_3.1()

Question 3.2 {points: 1}

To train the model on the breast cancer dataset, pass knn_spec and the cancer dataset to the
fit() function. Specify Class as your target variable and the Symmetry and Radius variables
as your predictors. Name your fitted model as knn_fit.

# ... <- ... |>


# fit(... ~ Symmetry + ..., data = ...)

# your code here


fail() # No Answer - remove if you provide an answer
knn_fit

test_3.2()

Question 3.3 {points: 1}


Now we will make our prediction on the Class of a new observation with a Symmetry of 1 and a
Radius of 0. First, create a tibble with these variables and values and call it new_obs. Next, use
the predict() function to obtain our prediction by passing knn_fit and new_obs to it. Name
your predicted class as class_prediction.

#... <- tibble(... = 1,... = 0)


#... <- predict(..., ...)

# your code here


fail() # No Answer - remove if you provide an answer
class_prediction

test_3.3()

Question 3.4 {points: 1}

Let's perform K-nearest neighbour classification again, but with three predictors. Use the
tidymodels package and K = 7 to classify a new observation where we measure Symmetry
= 1, Radius = 0 and Concavity = 1. Use the scaffolding from Questions 3.2 and 3.3 to
help you.

• Pass the same knn_spec from before to fit, but this time specify Symmetry, Radius,
and Concavity as the predictors. Store the output in knn_fit_2.
• store the new observation values in an object called new_obs_2
• store the output of predict in an object called class_prediction_2
# your code here
fail() # No Answer - remove if you provide an answer

test_3.4()

Question 3.5 {points: 1}

Finally, we will perform K-nearest neighbour classification again, using the tidymodels
package and K = 7 to classify a new observation where we use all the predictors in our data set
(we give you the values in the code below).

But we first have to do one important thing: we need to remove the ID variable from the analysis
(it's not a numerical measurement that we should use for classification). Thankfully,
tidymodels provides a nice way of combining data preprocessing and training into a single
consistent workflow.

We will first create a recipe to remove the ID variable using the step_rm preprocessing step.
Do so below using the provided scaffolding. Name the recipe object knn_recipe.

Hint: If you want to find out about the available preprocessing steps that you can include in a
recipe, visit the tidymodels page.

#... <- recipe(... ~ ., data = ...) |>


# step_rm(...)
# your code here
fail() # No Answer - remove if you provide an answer
knn_recipe

test_3.5()

You can examine the output of a recipe by using the prep and bake functions. For example, let's
see if step_rm above actually removed the ID variable. Run the below code to see!

Note: you have to pass the cancer data to bake() again, even though we already specified it in
the recipe above. Why? tidymodels is flexible enough to let you compute preprocessing steps
using one dataset (prep) and apply those steps to another (bake). For example, if we apply the
step_center preprocessing step (which shifts a variable to have mean 0), we need to compute
the shift value (prep) before subtracting it from each observation (bake). This will be very
useful in the next chapter, when we have to split our dataset into two subsets and prep using
only one of them.

preprocessed_data <- knn_recipe |>
    prep() |>
    bake(cancer)
preprocessed_data

Question 3.6 {points: 1}

Create a workflow that includes the new recipe (knn_recipe) and the model specification
(knn_spec) using the scaffolding below. Name the workflow object knn_workflow.

#... <- workflow() |>


# add_recipe(...) |>
# add_model(...)

# your code here


fail() # No Answer - remove if you provide an answer
knn_workflow

test_3.6()

Question 3.7 {points: 1}

Finally, fit the workflow and predict the class label for the new observation named
new_obs_all. Name the fit object knn_fit_all, and the class prediction
class_prediction_all.

new_obs_all <- tibble(ID = NA,
                      Radius = 0,
                      Texture = 0,
                      Perimeter = 0,
                      Area = 0,
                      Smoothness = 0.5,
                      Compactness = 0,
                      Concavity = 1,
                      Concave_points = 0,
                      Symmetry = 1,
                      Fractal_dimension = 0)

#... <- knn_workflow |>


# fit(data = ...)
#... <- ...(knn_fit_all, ...)

# your code here


fail() # No Answer - remove if you provide an answer
class_prediction_all

test_3.7()

4. Reviewing Some Concepts


We will conclude with two multiple choice questions to reinforce some key concepts when doing
classification with K-nearest neighbours.

Question 4.0 {points: 1}

In the K-nearest neighbours classification algorithm, we calculate the distance between the new
observation (for which we are trying to predict the class/label/outcome) and each of the
observations in the training data set so that we can:

A. Find the K nearest neighbours of the new observation

B. Assess how well our model fits the data

C. Find outliers

D. Assign the new observation to a cluster

Assign your answer (e.g. "E") to an object called: answer4.0. Make sure your answer is an
uppercase letter and is surrounded with quotation marks.

# Replace the fail() with your answer.

# your code here


fail() # No Answer - remove if you provide an answer

test_4.0()

Question 4.1 {points: 1}

In the K-nearest neighbours classification algorithm, we choose the label/class for a new
observation by:

A. Taking the mean (average value) label/class of the K nearest neighbours

B. Taking the median (middle value) label/class of the K nearest neighbours


C. Taking the mode (value that appears most often, i.e., the majority vote) label/class of the K
nearest neighbours

Assign your answer (e.g., "E") to an object called answer4.1. Make sure your answer is an
uppercase letter and is surrounded with quotation marks.

# Replace the fail() with your answer.

# your code here


fail() # No Answer - remove if you provide an answer

test_4.1()

source('cleanup.R')
