The document is a transcript of a lecture on the k-nearest neighbors (KNN) algorithm. The lecturer:
1) Explains the KNN algorithm through a 4-step process - choose k neighbors, compute distances to neighbors, count neighbors in each category, assign data point to most common neighbor category.
2) Provides an example applying the KNN algorithm to classify a new data point based on its nearest neighbors.
3) Discusses using KNN in MATLAB to build a model to predict if users will purchase an SUV based on their profile data from a social network. Standardization preprocessing is applied before building the KNN classification model.
0:00 Hello and welcome back. In our next lecture we will be talking about the k-nearest neighbor algorithm, so without any further delay, let's start.
0:08 Let's assume a scenario: our data set contains data points for two categories, which are shown in different colors in the scatter plot. The points of the first category are orange, on the right side, and the points of the second category are blue, on the left side of the plot.
0:28 For now, the only thing we need to consider is the variables x and y, which represent two columns of our data set. For instance, they may be age and salary, as in the example that we considered in the data pre-processing section of the course. However, to keep things extremely simple, we are not going to add any scale to these variables.
1:01 Now let's assume that we encounter a new data point, and we are asked to assign it a category based on the available information. The key issue is whether it should fall in the orange category or in the blue category, that is, category 1 or category 2. In other words, how do we classify this data point?
1:26 This is the problem that the k-nearest neighbor algorithm is going to resolve for us. With this algorithm we will be able to determine the category of the new data point, which in this case turns out to be category 2, shown in blue.
1:45 So let's see how the KNN algorithm is going to do that for us. In order to understand the KNN algorithm, we are going to introduce a four-step procedure.
2:03 You will notice that it is a very small, simple algorithm. The very first step is to choose the
number of neighbors that you are going to have in your algorithm. This means that you need to decide whether k is equal to 1, 2, 3, or some other number. One of the most commonly used values for k is 5.
2:29 Step two is to compute the k nearest neighbors of the new data point according to some distance measure, such as the Euclidean distance. You don't have to use Euclidean distance all the time; you can use other distance measures such as the Manhattan (city block) distance or the Hamming distance. Since Euclidean distance is used in most cases, in this example we will just stick to that.
3:04 Once you have computed the k nearest neighbors, the next step is to count the number of data points from each category among the neighbors computed in the second step. How many neighboring data points happen to be in category 1, and how many happen to be in category 2? We need to determine that in this step, which is step number three. If you have more than two categories in the data set, then you simply need to count how many neighboring data points fall in each of the categories.
3:43 Finally, in step number four, we assign the new data point to the category with the most neighbors. It is as simple as that. After these four steps you are done, and your model is ready to predict any new data point.
4:06 So let's do a manual exercise to solidify our knowledge and see the KNN algorithm in action. Remember the issue: we need to classify the new data point based on the available data points in the two categories. So let's start the four
steps of the algorithm.
4:34 Step one is to choose the number of neighbors, so we keep it at five. The next step, step number two, is to determine the five nearest neighbors of this new data point according to some distance measure. As pointed out earlier, we use the Euclidean distance, which we also talked about in the data preprocessing section. Euclidean distance is a basic measure that we study in geometry.
5:12 If we have two points, P1 and P2, the Euclidean distance between them is measured according to the formula, which means that we take the difference of the x-coordinates of the two points and the difference of the y-coordinates, square each difference, take the sum, and finally take the square root.
5:39 Coming back to our example, we were in step two of the algorithm, in which we need to determine the neighbors based on the Euclidean distance. Here we just look at the points and judge the distances by eye; we can see that these are the closest ones. If we were given the actual values of these points, we could easily verify that they are the five nearest neighbors.
6:09 Step three is to count the number of data points in each category. In this case we see that, among the neighbors, which are the data points inside the circle, two belong to category 1 and three belong to category 2.
6:28 Finally, step four is to assign the new data point to the category with the most neighbors, which in this case happens to
be category 2. It was as simple as that.
6:39 Now we have classified our data point, and we already have a model to classify any further data points. In conclusion, it is one of the oldest algorithms in machine learning and one of the simplest ones too. I believe that is enough to get you started with the KNN algorithm, so now we will apply this algorithm in MATLAB.
7:13 This is our first machine learning model, k-nearest neighbors, and I can't wait to show you the first results, to show how KNN can take the data of some categories and predict the categories of unseen data points. So let's start making the model right now.
7:28 The first thing we need to do is load the data set. The data that we will be using is related to a social network. As you can see on the screen, the data set contains information about users of a social network, and the information includes the user's gender, age, and estimated salary. The social network has several business clients which can put their data on the social network, and the client here is a car company that has launched its brand new SUV. We are trying to see which of the users of the social network are going to buy this brand new SUV.
8:08 The last column tells us whether a certain user of the social network has bought the SUV or not. So we are going to build a model that predicts whether a user is going to buy the SUV or not, based on the variables given in the table.
8:34 There are 400 instances in this particular data set. Let's load the data set into MATLAB.
8:44 We will need the pre-processing template that we built in the last section of the course. We will
just copy the template and paste it over here in order to load the dataset.
9:00 We will ignore this, and we will get rid of this variable. We will build up our model, and you will see that this is very easy in MATLAB; in fact, we will do it without writing many extra statements of code.
9:20 The question is: do we need to apply any pre-processing technique that we learned in the earlier pre-processing section? The answer is yes. We are going to apply the standardization technique to preprocess our data because, as you know, we will use Euclidean distance, and without standardization the variable with the larger scale would dominate the distance.
9:58 We will need a few functions in MATLAB, so let's see how we can do this. The function we need to use in order to build the classification model is fitcknn. As the first input we need to provide the variable which contains the data, and as the second input the function expects the name of the variable that we will use as the response variable, in other words the variable for which we will make a prediction.
10:28 In this case, that is the purchased variable in the table. Remember the point made in the last lecture: we will be building our model based on two variables, age and estimated salary. I told you that this is going to be very simple in MATLAB.
10:49 In order to specify that, we need to write the names of the variables that we use for building our model, and we also need to insert a plus sign between the variable names. Our model is then ready to use.
11:07 We need to store this model in some variable, so we use the variable name classification_model in
this case. This means that classification_model contains our k-nearest neighbor classification model.
11:26 So now we are all done: we have built the classification model, and that model is stored in the variable.
11:34 Please note that if you don't mention anything here, that is, if we delete the variable names age and estimated salary, then by default MATLAB is going to build the model based on all the variables within our data.
11:55 Another very important point is that the KNN model we built, which is stored in the variable classification_model, is built with some default options. We can check some of its properties by typing in the command window, for instance.
12:50 Please note that the sum of all four values is the total number of predictions. The classifier did respond reasonably well in this case. So we did successfully implement our first machine learning model and tested its performance on the testing data, and surprisingly we did this using only five lines of code. That indicates the strength of MATLAB: the analysis is done with very few lines of code. So enjoy your first machine learning algorithm.
13:35 So now let's have a look at the results again. We will be looking at the evaluation measures later; that will be a part of the course. For now, we are just looking at this: these values correspond to the correct predictions, and these two off-diagonal values correspond to the incorrect predictions.
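
The four-step procedure and the standardization step described in the lecture can be sketched in a few lines. This is a minimal illustration in plain Python, not the lecturer's MATLAB code: the function names (`standardize`, `euclidean`, `knn_predict`) and the seven (age, salary) data points are hypothetical, chosen only to make the steps concrete, and the standardization uses the population standard deviation as an assumption.

```python
import math
from collections import Counter


def standardize(rows):
    """Rescale every column to zero mean and unit standard deviation.

    Assumption: the population standard deviation (divide by n) is used;
    other tools may use the sample version (divide by n - 1).
    """
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    scaled = [
        tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows
    ]
    return scaled, means, stds


def euclidean(p, q):
    """Square the coordinate differences, sum them, take the square root."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(points, labels, new_point, k=5):
    # Step 1: the number of neighbors k is chosen by the caller (default 5).
    # Step 2: find the k points closest to new_point by Euclidean distance.
    ranked = sorted(zip(points, labels), key=lambda pl: euclidean(pl[0], new_point))
    nearest = ranked[:k]
    # Step 3: count how many of those neighbors fall in each category.
    votes = Counter(label for _, label in nearest)
    # Step 4: assign the new point to the category with the most neighbors.
    # An odd k with two categories avoids ties.
    return votes.most_common(1)[0][0]


# Hypothetical (age, salary) data, labeled category 1 or category 2.
points = [(25, 30000), (30, 40000), (28, 35000), (45, 90000),
          (50, 100000), (48, 95000), (35, 60000)]
labels = [1, 1, 1, 2, 2, 2, 2]
new_point = (44, 85000)

# Standardize the known points, then scale the new point the same way.
scaled, means, stds = standardize(points)
new_scaled = tuple((v - m) / s for v, m, s in zip(new_point, means, stds))

predicted = knn_predict(scaled, labels, new_scaled, k=5)
print(predicted)
```

With this made-up data, four of the five nearest neighbors belong to category 2, so the new point is assigned to category 2. Without the standardization step, the salary column, whose values are in the tens of thousands, would dominate the distance and age would barely matter.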