IT 802 ML Unit-1 Notes
INTRODUCTION
Software engineering combines human-created rules with data to create answers to a
problem. Machine learning, instead, uses data and answers to discover the rules behind a
problem. To learn the rules governing a phenomenon, machines have to go through a
learning process, trying different rules and learning from how well they perform. This is
why it is known as Machine Learning.
Dataset: A set of data examples that contain features important to solving the
problem.
Features: Important pieces of data that help us understand a problem. These are fed
into a Machine Learning algorithm to help it learn.
Data Collection: Collect the data that the algorithm will learn from.
Data Preparation: Format and engineer the data into the optimal format, extracting
important features and performing dimensionality reduction.
Training: This is where the Machine Learning algorithm learns from the data that has been
collected and prepared (a minimal sketch of these stages follows this list).
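A minimal Python sketch of these three stages, assuming scikit-learn is available; the dataset is synthetic, standing in for real collected data:

# A minimal sketch of the collect -> prepare -> train pipeline,
# assuming scikit-learn is available. The dataset here is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Data Collection: synthesize a dataset in place of real collection.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Data Preparation: scale the features and reduce dimensionality.
X = StandardScaler().fit_transform(X)
X = PCA(n_components=5).fit_transform(X)

# Training: show the prepared data to a learning algorithm.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))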
There are many approaches that can be taken when conducting Machine Learning.
Supervised and unsupervised learning are well-established approaches and the most widely
used. Semi-supervised and reinforcement learning are newer and more complex but have
shown impressive results.
There are three basic types of learning paradigms widely associated with machine learning,
namely
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
Supervised Learning
Supervised learning is a machine learning task in which a function maps the input to output
data using the provided input-output pairs.
Figure 1.2: Supervised Learning
In this type of learning, you need to give both the input and the output (usually in the form
of labels) to the computer for it to learn from. The computer generates a function based on
this data, which can be anything from a simple line to a complex convex function, depending
on the data provided.
This is the most basic type of learning paradigm, and most algorithms we learn today are
based on this type of learning pattern. For example:
Regression: The machine is trained to predict some value like price, weight, or height.
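A toy regression sketch, assuming scikit-learn; the size and price numbers below are made up for illustration:

# A toy supervised regression example: learn price from size using
# input-output pairs. The data values are invented for illustration.
from sklearn.linear_model import LinearRegression
import numpy as np

sizes = np.array([[50], [80], [110], [140]])   # inputs (features)
prices = np.array([150, 240, 330, 420])        # outputs (labels)

model = LinearRegression().fit(sizes, prices)  # learn the input-output mapping
print(model.predict([[100]]))                  # predict for an unseen input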
Unsupervised Learning
In this type of learning paradigm, the computer is provided with just the input to develop a
learning pattern; it is essentially learning without any answers.
This means that the computer has to recognize a pattern in the given input and develop a
learning algorithm accordingly. So we conclude that "the machine learns through
observation and finds structures in data". This is still a relatively unexplored field of machine
learning, and big tech companies like Google and Microsoft are currently researching
developments in it.
Clustering: A clustering problem is where you want to discover the inherent groupings in the
data (a clustering sketch follows this list).
Association: An association rule learning problem is where you want to discover rules that
describe large portions of your data.
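A minimal clustering sketch, assuming scikit-learn; the points and the choice of two clusters are illustrative only:

# A minimal clustering sketch: only inputs are given (no labels),
# and the algorithm discovers the groupings on its own.
from sklearn.cluster import KMeans
import numpy as np

# Two loose blobs of 2-D points, with no labels attached.
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 10], [8, 9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # cluster assignment discovered for each point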
Reinforcement Learning
Reinforcement Learning is a type of Machine Learning, and thereby also a branch of Artificial
Intelligence. It allows machines and software agents to automatically determine the ideal
behavior within a specific context in order to maximize their performance.
There is an excellent analogy for this type of learning paradigm: training a dog.
The learning paradigm is like a dog trainer, who teaches the dog how to respond to
specific signs, like a whistle, a clap, or anything else. Whenever the dog responds correctly,
the trainer gives it a reward, which can be a bone or a biscuit.
Game playing — determining the best move to make in a game often depends on a number
of different factors; hence the number of possible states that can exist in a particular game
is usually very large.
Control problems — such as elevator scheduling. Again, it is not obvious what strategies
would provide the best, most timely elevator service. For control problems such as this, RL
agents can be left to learn in a simulated environment, and eventually they will come up
with good control policies.
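As a rough illustration (not from the notes), the sketch below applies tabular Q-learning to a made-up five-state corridor: the agent starts at state 0, is rewarded only on reaching state 4, and gradually learns that moving right is the ideal behavior. The environment and all parameter values are invented for this sketch:

# A minimal Q-learning sketch on a 5-state corridor. The environment,
# states, and parameters are all illustrative assumptions.
import random

n_states, actions = 5, [-1, +1]          # move left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.1    # learning rate, discount, exploration

for episode in range(500):
    s = 0
    while s != n_states - 1:             # an episode ends at the goal state
        # epsilon-greedy: mostly exploit the best known action, sometimes explore
        a = random.choice(actions) if random.random() < epsilon \
            else max(actions, key=lambda a: Q[(s, a)])
        s_next = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0   # reward only at the goal
        # Q-learning update: move Q(s, a) toward reward + discounted best future value
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

# After training, the learned policy should prefer +1 (right) in every state.
print({s: max(actions, key=lambda a: Q[(s, a)]) for s in range(n_states - 1)})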
Perspective:
Machine learning involves searching a very large space of possible hypotheses to determine
the one that best fits the observed data. Machine perception is the capability of a computer
system to interpret data in a manner similar to the way humans use their senses to relate to
the world around them. The basic way computers take in and respond to their environment
is through attached hardware. Until recently, input was limited to a keyboard or a mouse,
but advances in technology, both in hardware and software, have allowed computers to take
in sensory input in a way similar to humans. Machine perception allows the computer to use
this sensory input, as well as conventional computational means of gathering information,
to gather information with greater accuracy and to present it in a way that is more
comfortable for the user.
The end goal of machine perception is to give machines the ability to see, feel, and perceive
the world as humans do, and therefore to be able to explain, in a human way, why they are
making their decisions, to warn us when they are failing and, more importantly, why they
are failing. This purpose is very similar to the proposed purposes of artificial intelligence in
general, except that machine perception would grant machines only limited sentience,
rather than bestow upon them full consciousness, self-awareness, and intentionality.
Issues:
Some of the issues that the science of machine perception still has to overcome include:
Embodied Cognition - The theory that cognition is a full-body experience, and
therefore can only exist, and be measured and analyzed, in fullness if all
required human abilities and processes are working together through a mutually
aware and supportive systems network.
The Principle of Similarity - The ability young children develop to determine what
family a newly introduced stimulus falls under, even when said stimulus is
different from the members with which the child usually associates that family.
(For example, a child figuring out that a Chihuahua is a dog and a house pet rather
than vermin.)
The Likelihood Principle - The innate human ability to follow this principle in order
to learn from circumstances and from others over time.
The Free Energy Principle - Determining, long beforehand, how much energy one can
safely devote to being aware of things outside oneself without losing the energy
one requires for sustaining life and functioning satisfactorily. This allows one to
become optimally aware of the surrounding world without depleting one's energy
so much that one experiences damaging stress, decision fatigue, and/or exhaustion.
CONCEPT LEARNING
Concept learning, also known as category learning, is the search for and listing of attributes
that can be used to distinguish exemplars from non-exemplars of various categories. More
simply put, concepts are the mental categories that help us classify objects, events, or ideas,
building on the understanding that each object, event, or idea has a set of common, relevant
features. Thus, concept learning is a strategy that requires a learner to compare and
contrast groups or categories that contain concept-relevant features with groups or
categories that do not contain concept-relevant features.
Let’s design the problem formally with TPE (Task, Performance, Experience):
Task T: Learn to predict the value of EnjoySport for an arbitrary day, based on the values of
the attributes of the day.
Performance measure P: The fraction of days for which the value of EnjoySport is predicted
correctly.
Training experience E: A set of days with their attribute values and the observed value of
EnjoySport:

Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
Sunny  Warm     Normal    Strong  Warm   Same      Yes
Sunny  Warm     High      Strong  Warm   Same      Yes
Rainy  Cold     High      Strong  Warm   Change    No
Sunny  Warm     High      Strong  Cool   Change    Yes

Hence the first training example x1 (the first row of the table above) looks like:
x1 = <Sunny, Warm, Normal, Strong, Warm, Same>, EnjoySport = Yes (a positive example)
We want to find the most suitable hypothesis to represent the concept. For example, "the
person enjoys his favorite sport only on cold days with high humidity" would be represented
as <?, Cold, High, ?, ?, ?>.
Here '?' indicates that any value of the attribute is acceptable. The most general hypothesis
is < ?, ?, ?, ?, ?, ? >, under which every day is a positive example, and the most specific
hypothesis is < ∅, ∅, ∅, ∅, ∅, ∅ >, under which no day is a positive example. The two most
popular approaches for finding a suitable hypothesis are:
1. Find-S Algorithm
2. List-Then-Eliminate Algorithm
Find-S Algorithm:
1. Initialize h to the most specific hypothesis in H.
2. For each positive training instance x:
   For each attribute constraint ai in h:
      If the constraint ai is satisfied by x, then do nothing;
      Else replace ai in h by the next more general constraint that is satisfied by x.
3. Output hypothesis h.
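A Python sketch of Find-S, run on the EnjoySport training data above; the attribute ordering follows that table:

# A sketch of Find-S on the EnjoySport data. Positive examples generalize
# the hypothesis attribute by attribute; negative examples are ignored.
def find_s(examples):
    h = None  # start from the most specific hypothesis
    for x, label in examples:
        if label != "Yes":
            continue                      # Find-S ignores negative examples
        if h is None:
            h = list(x)                   # first positive example taken as-is
        else:
            # Replace any attribute that disagrees with the example by '?'
            h = [hi if hi == xi else "?" for hi, xi in zip(h, x)]
    return h

data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   "Yes"),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),   "Yes"),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), "Yes"),
]

print(find_s(data))  # -> ['Sunny', 'Warm', '?', 'Strong', '?', '?']

The output <Sunny, Warm, ?, Strong, ?, ?> is the most specific hypothesis consistent with all three positive examples in the table.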
A version space is a hierarchical representation of knowledge that enables you to keep track
of all the useful information supplied by a sequence of learning examples without
remembering any of the examples.
The version space method is a concept learning process accomplished by managing multiple
models within a version space.
A plausible description is one that is applicable to all known positive examples and
no known negative example.
A hypothesis is a function on the sample space, giving a value for each point in the sample
space. If the possible values are {0, 1} then we can identify a hypothesis with the subset of
those points that are given value 1. The error of a hypothesis is the probability of that
subset where the hypothesis disagrees with the true hypothesis. Learning from examples is
the process of making independent random observations and eliminating those hypotheses
that disagree with observations.
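A sketch of this learning-by-elimination idea (essentially List-Then-Eliminate) on a deliberately tiny, artificial hypothesis space of threshold functions on a single number; everything here is illustrative:

# Learning by elimination over a tiny hypothesis space: thresholds on a
# number. The version space is whatever survives after all hypotheses
# that disagree with the observations are removed.
hypotheses = {t: (lambda x, t=t: 1 if x >= t else 0) for t in range(11)}

observations = [(3, 0), (7, 1), (5, 0)]   # (input, true label) pairs

version_space = {
    t: h for t, h in hypotheses.items()
    if all(h(x) == y for x, y in observations)
}
print(sorted(version_space))  # thresholds still consistent: [6, 7]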
The hypothesis space is the set of all possible hypotheses (i.e., functions from inputs to
outputs) that can be returned by a model. The hypothesis space is important because it
specifies what types of functions you can model and what types you cannot. The absolute
best error you can achieve on a dataset is lower bounded by the error of the “best” function
in your hypothesis space.
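As a concrete illustration, the size of the EnjoySport hypothesis space can be counted directly; the counts below assume the standard version of that example, where Sky has three possible values and the other five attributes have two each:

# Counting the EnjoySport hypothesis space. Each attribute can take one of
# its own values, '?' (any value), or the empty constraint; but any
# hypothesis containing an empty constraint classifies every instance as
# negative, so semantically they all collapse into one.
values = [3, 2, 2, 2, 2, 2]   # distinct values of Sky, AirTemp, ..., Forecast

syntactic = 1
semantic = 1
for v in values:
    syntactic *= v + 2        # values + '?' + empty constraint
    semantic *= v + 1         # values + '?'
semantic += 1                 # plus the single all-negative hypothesis

print(syntactic, semantic)    # 5120 syntactically, 973 semantically distinct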
PAC LEARNING
Probably Approximately Correct (PAC) learning is a framework for the mathematical analysis
of machine learning. In this framework, the learner receives samples and must select a
generalization function
(called the hypothesis) from a certain class of possible functions. The goal is that, with high
probability (the "probably" part), the selected function will have low generalization error
(the "approximately correct" part). The learner must be able to learn the concept given any
arbitrary approximation ratio, probability of success, or distribution of the samples.
Probably approximately correct (PAC) learning theory helps analyze whether and under
what conditions a learner L will probably output an approximately correct classifier.
Approximate: A hypothesis h ∈ H is approximately correct if its error over the distribution D
of inputs is bounded by some ε, 0 ≤ ε ≤ 1/2; i.e., error_D(h) < ε.
Probably: If L outputs such a classifier with probability at least 1 − δ, with 0 ≤ δ ≤ 1/2, we
call that classifier probably approximately correct.
Knowing that a target concept is PAC-learnable allows one to bound the sample size m
necessary to probably learn an approximately correct classifier. For a consistent learner over
a finite hypothesis space H, the standard bound is:

m ≥ (1/ε)(ln|H| + ln(1/δ))
To gain some intuition about this, note the effects on m when you alter the variables on the
right-hand side. As the allowable error ε decreases, the necessary sample size grows.
Likewise, it grows as the required confidence 1 − δ increases, and with the size of the
hypothesis space H. (Loosely, a hypothesis space is the set of classifiers the algorithm
considers.) More plainly, as we consider more possible classifiers, or desire a lower error or
a higher probability of correctness, we need more data to distinguish between them.
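A quick evaluation of the bound, using the 973 semantically distinct EnjoySport hypotheses counted earlier; the ε and δ settings are arbitrary choices for illustration:

# Evaluating the PAC sample-size bound m >= (1/eps) * (ln|H| + ln(1/delta)).
from math import ceil, log

def pac_bound(H_size, eps, delta):
    return ceil((log(H_size) + log(1 / delta)) / eps)

print(pac_bound(973, eps=0.1, delta=0.05))   # 99 examples suffice
print(pac_bound(973, eps=0.01, delta=0.05))  # ~10x more as eps shrinks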
VC DIMENSION
The Vapnik–Chervonenkis (VC) dimension is a measure of the capacity (complexity,
expressive power, richness, or flexibility) of a set of functions that can be learned by a
statistical binary classification algorithm. It is defined as the cardinality of the largest set of
points that the algorithm can shatter.
The capacity of a classification model is related to how complicated it can be. For example,
consider the threshold of a high-degree polynomial: if the polynomial evaluates above zero,
that point is classified as positive, otherwise as negative. A high-degree polynomial can be
wiggly, so it can fit a given set of training points well. But one can expect that the classifier
will make errors on other points, because it is too wiggly. Such a polynomial has a high
capacity. A much simpler alternative is to threshold a linear function.
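As a concrete illustration of shattering (not from the notes): a line in the plane can realize all eight labelings of three non-collinear points, but no line separates the XOR labeling of four points, so the VC dimension of linear classifiers in 2-D is 3. The brute-force check below is a rough sketch; the grid search over candidate separators is approximate but sufficient for these tiny point sets:

# Checking whether a set of 2-D points can be shattered by lines.
import itertools
import numpy as np

def linearly_separable(points, labels, n_grid=25):
    # Approximate check: brute-force over a coarse grid of lines w1*x + w2*y + b.
    for w1, w2, b in itertools.product(np.linspace(-1, 1, n_grid), repeat=3):
        if w1 == 0 and w2 == 0:
            continue
        preds = [np.sign(w1 * x + w2 * y + b) for x, y in points]
        if all(p == l for p, l in zip(preds, labels)):
            return True
    return False

def can_shatter(points):
    # A set is shattered if every +/-1 labeling is realized by some line.
    return all(
        linearly_separable(points, labeling)
        for labeling in itertools.product([-1, 1], repeat=len(points))
    )

three = [(0, 0), (1, 0), (0, 1)]          # non-collinear triple
four  = [(0, 0), (1, 1), (1, 0), (0, 1)]  # XOR-style quadruple

print(can_shatter(three))  # True  -> 3 points can be shattered
print(can_shatter(four))   # False -> this quadruple cannot (in fact no 4 points can)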
Suppose we want a model (e.g., some classifier) that generalizes well on unseen data, and
we are limited to a specific amount of sample data.
The following figure shows models S1 up to Sk of differing complexity (VC dimension),
shown on the x-axis and denoted h.
The diagram shows that a higher VC dimension allows for a lower empirical risk (the error a
model makes on the sample data), but also introduces a wider confidence interval. This
interval can be seen as the confidence in the model's ability to generalize.
Low VC dimension (high bias)
If we use a model of low complexity, we introduce assumptions (bias) about the dataset;
e.g., when using a linear classifier, we assume the data can be described by a linear model.
If this is not the case, for example because the problem is nonlinear in nature, the problem
cannot be solved by a linear model, and we will end up with a badly performing model that
is unable to learn the data's structure. We should therefore try to avoid introducing too
strong a bias.
High VC dimension (high variance)
On the other side of the x-axis, we see models of higher complexity, which might have such
great capacity that they memorize the data rather than learning its general underlying
structure, i.e., the model overfits. Having realized this problem, it seems we should avoid
complex models.
This may seem contradictory: we should not introduce a strong bias (i.e., we should not
have too low a VC dimension), but we should also not have too high a VC dimension. This
problem has deep roots in statistical learning theory and is known as the bias-variance
tradeoff. What we should do in this situation is be as complex as necessary and as simple as
possible: when comparing two models that end up with the same empirical error, we should
use the less complex one (a sketch of this rule follows below).
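A sketch of that model-selection rule: among polynomial fits whose validation errors are comparable, prefer the lowest degree. The data, the 10% tolerance, and the degree range are all arbitrary choices for illustration:

# Prefer the simpler of comparable models: fit polynomials of increasing
# degree and keep the lowest degree whose validation error is within
# tolerance of the best error seen.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)  # noisy data
x_tr, y_tr, x_va, y_va = x[::2], y[::2], x[1::2], y[1::2]       # train/validation split

errors = {}
for degree in range(1, 10):
    coeffs = np.polyfit(x_tr, y_tr, degree)
    errors[degree] = np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)

best = min(errors.values())
chosen = min(d for d, e in errors.items() if e <= 1.1 * best)
print("chosen degree:", chosen)   # the simpler model with comparable error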