Machine Learning: Dr. Windhya Rankothge (PhD, UPF Barcelona)
Machine Learning
• In data science, an algorithm is a sequence of statistical processing steps.
• In machine learning, algorithms are 'trained' to find patterns and features in massive amounts of data in order to make decisions and predictions based on new data.
• The better the algorithm, the more accurate the decisions and predictions will become as it processes more data.
Machine Learning
[Figure: Past: training data is used to learn a model/predictor. Future: the model/predictor is used to predict on testing data.]
Machine Learning
Using and improving the model
Supervised Learning Algorithms
• Supervised learning is where you have input variables (X) and an output variable (Y), and you use an algorithm to learn the mapping function from the input to the output:
Y = f(X)
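A minimal Python sketch of learning such a mapping, assuming scikit-learn is available; the iris dataset and logistic regression model are arbitrary illustrative choices, not part of the lecture material:

```python
# Fit an approximation of f on labelled (X, Y) pairs, then predict Y for new inputs.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Hold out part of the data to stand in for "future" (unseen) inputs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)        # learn the mapping f: X -> Y from training data
y_pred = model.predict(X_test)     # apply the learned f to new inputs
print("Test accuracy:", model.score(X_test, y_test))
```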
Supervised Learning Algorithms
Classification: accurately assign test data into specific categories.
Supervised Learning Algorithms
Naïve Bayes
• A statistical classification technique based on Bayes' Theorem.
• One of the simplest and fastest supervised learning algorithms.
• The Naive Bayes classifier assumes that the effect of a particular feature in a class is independent of the other features.
• For example, a loan applicant is desirable or not depending on his/her income, previous loan and transaction history, age, and location.
• Even if these features are interdependent, they are still considered independently.
• This assumption simplifies computation, and that is why it is considered naive.
Naïve Bayes
• P(h): the probability of hypothesis h being true (regardless of the data). This is known as the prior probability of h.
• P(D): the probability of the data (regardless of the hypothesis). This is known as the prior probability of D (the evidence).
• P(h|D): the probability of hypothesis h given the data D. This is known as the posterior probability.
• P(D|h): the probability of data D given that hypothesis h is true. This is known as the likelihood.
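These quantities are tied together by Bayes' Theorem, which gives the posterior in terms of the likelihood and the priors:

$$ P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)} $$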
Naïve Bayes
• Assume we have a bunch of emails that we want to classify as spam or not spam.
• Our dataset has 15 Not Spam emails and 10 Spam emails. Some analysis was done, and the frequency of each word in each class was recorded.
Naïve Bayes
Exploring some probabilities:
• P(Dear|Not Spam) = 8/34
• P(Visit|Not Spam) = 2/34
• P(Dear|Spam) = 3/47
• P(Visit|Spam) = 6/47
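As a concrete illustration, here is a minimal Python sketch of the Naive Bayes scoring step. The message "Dear Visit" is a hypothetical example chosen because both word counts appear above; the class priors 15/25 and 10/25 come from the 15 Not Spam and 10 Spam emails, and the word totals 34 and 47 are the denominators of the probabilities above:

```python
# word -> (count in Not Spam, count in Spam), taken from the probabilities above
word_counts = {"Dear": (8, 3), "Visit": (2, 6)}
total_not_spam_words, total_spam_words = 34, 47

p_not_spam = 15 / 25   # prior P(Not Spam)
p_spam = 10 / 25       # prior P(Spam)

def score(message_words, class_index, total_words, prior):
    """Multiply the prior by P(word | class) for every word (naive independence)."""
    p = prior
    for word in message_words:
        p *= word_counts[word][class_index] / total_words
    return p

message = ["Dear", "Visit"]          # hypothetical message, for illustration only
print("Score(Not Spam):", score(message, 0, total_not_spam_words, p_not_spam))
print("Score(Spam):    ", score(message, 1, total_spam_words, p_spam))
# The class with the larger score is the prediction.
```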
Naïve Bayes
So, using Bayes' Theorem, suppose we want to classify the message "Hello friend". But P(Hello friend | Not Spam) = 0, because the exact phrase "Hello friend" never occurs in our dataset (the recorded frequencies are for single words, not whole sentences). Likewise, P(Hello friend | Spam) = 0, which would make the scores for both spam and not spam zero, and that has no meaning!
Naïve Bayes
But wait! We said that Naive Bayes assumes that `the features we use to predict the target are independent`, so we can treat each word on its own, as sketched below.
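Under that independence assumption, the probability of the whole message factorizes into per-word probabilities:

$$ P(\text{Hello friend} \mid \text{Spam}) = P(\text{Hello} \mid \text{Spam}) \times P(\text{friend} \mid \text{Spam}) $$

and, in general, for a message with words x_1, ..., x_n and a class c:

$$ P(x_1, \dots, x_n \mid c) = \prod_{i=1}^{n} P(x_i \mid c) $$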
Naïve Bayes
Now let's calculate the probability of being spam using the same procedure, multiplying the prior P(Spam) by the per-word probabilities.
Supervised Learning Algorithms
Support Vector Machines (SVM)
• Typically leveraged for classification problems (it can be used for regression too), SVM constructs a hyperplane where the distance between the two classes of data points is at its maximum.
• This hyperplane is known as the decision boundary, separating the classes of data points (e.g., oranges vs. apples) on either side of the plane.
Support Vector Machines (SVM)
• Plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate.
• Perform classification by finding the hyperplane that differentiates the two classes very well.
Which hyperplane?
Support Vector Machines (SVM)
• The vector points closest to the hyperplane are known as the support vectors, because only these points contribute to the result of the algorithm; the other points do not.
• The distance between the hyperplane and the closest support vector points is called the margin.
• We would like to choose the hyperplane that maximizes the margin between the classes, as in the sketch below.
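A minimal Python sketch of fitting such a maximum-margin classifier, assuming scikit-learn; the two synthetic blobs of points are stand-ins for the two classes:

```python
# Fit a linear SVM and inspect the hyperplane and its support vectors.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two synthetic clusters stand in for the two classes (e.g., oranges vs. apples).
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("Support vectors per class:", clf.n_support_)
print("Hyperplane coefficients (w):", clf.coef_)
print("Intercept (b):", clf.intercept_)
```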
Support Vector Machines (SVM)
• Maximizing the margin is equivalent to minimizing the loss (minimizing misclassification).
• The loss function that SVM uses is known as the hinge loss, shown below.
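With labels encoded as y ∈ {−1, +1} and decision function f(x) = w·x + b, the hinge loss and the soft-margin objective it appears in are usually written as (standard formulation, stated here for reference):

$$ \ell\bigl(y, f(x)\bigr) = \max\bigl(0,\; 1 - y\, f(x)\bigr) $$

$$ \min_{w,\,b}\;\; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\,(w \cdot x_i + b)\bigr) $$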
Support Vector Machines (SVM)
• SVM has a technique called the kernel trick.
• Kernels are functions that take a low-dimensional input space and transform it into a higher-dimensional space.
• This converts a non-separable problem into a separable problem, as in the sketch below.
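A minimal Python sketch of this, assuming scikit-learn: the concentric-circles data are not linearly separable in two dimensions, but an RBF kernel SVM separates them by implicitly working in a higher-dimensional space:

```python
# Compare a linear-kernel SVM with an RBF-kernel SVM on non-linearly-separable data.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)

print("Linear kernel training accuracy:", linear_svm.score(X, y))  # typically near chance
print("RBF kernel training accuracy:   ", rbf_svm.score(X, y))     # typically near 1.0
```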
Supervised Learning Algorithms
K-Nearest Neighbors (K-NN)
• Can be used for regression as well as for classification, but it is mostly used for classification problems.
• A non-parametric algorithm, which means it does not make any assumptions about the underlying data.
K-Nearest Neighbors (K-NN)
• Research has shown that no single optimal number of neighbors (k) suits all kinds of data sets.
• Each dataset has its own requirements.
• With a small number of neighbors, noise has a higher influence on the result, while a large number of neighbors makes the algorithm computationally expensive (see the sketch below).
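A minimal Python sketch comparing a few values of k, assuming scikit-learn; the iris dataset and the particular k values are arbitrary illustrative choices:

```python
# Train K-NN classifiers with different k and compare their held-out accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)   # fit() essentially memorizes the training data (non-parametric)
    print(f"k={k:2d}  test accuracy: {knn.score(X_test, y_test):.3f}")
```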
Supervised Learning Algorithms
Linear Regression
• Performs the task of predicting a dependent variable value (y) based on a given independent variable (x).
• So, this regression technique finds a linear relationship between x (input) and y (output), as in the sketch below.
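A minimal Python sketch, assuming scikit-learn and NumPy; the data are synthetic, generated from a known line plus noise, purely for illustration:

```python
# Fit a linear regression model and recover the slope and intercept of the underlying line.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(50, 1))             # independent variable
y = 3.0 * x[:, 0] + 2.0 + rng.normal(0, 1, 50)   # dependent variable: roughly y = 3x + 2

model = LinearRegression().fit(x, y)
print("Learned slope:    ", model.coef_[0])      # close to 3
print("Learned intercept:", model.intercept_)    # close to 2
print("Prediction at x=4:", model.predict([[4.0]])[0])
```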