Chapter 1: ML Intro
• Key enablers:
– Neural networks and deep learning
– New hardware: GPUs
– Cloud computing
– Availability of big data
• Milestones:
– 2009: Google builds a self-driving car
– 2011: Watson wins Jeopardy!
– 2014: ML systems surpass human-level performance on some vision benchmarks
Programs vs. learning algorithms
Traditional Programming refers to any manually created program that takes input data and runs on a
computer to produce the output.
In Machine Learning, by contrast, the input data and the corresponding outputs are fed to an
algorithm to create the program (a model). The resulting model can then be used to predict future
outcomes on new data.
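The contrast can be sketched in a few lines of Python. This is only an illustration: the temperature-conversion rule and the toy data are assumptions made for the sketch, not part of the text above.

```python
# Traditional programming: the rule is written by hand.
def fahrenheit_by_rule(celsius):
    return celsius * 9 / 5 + 32

# Machine learning: the rule is estimated from (input, output) pairs.
# Here we fit y = a + b*x by ordinary least squares on example data.
def fit_line(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

celsius = [0, 10, 20, 30, 40]
fahrenheit = [c * 9 / 5 + 32 for c in celsius]   # observed outputs
a, b = fit_line(celsius, fahrenheit)             # the "program" is learned from data
```

Because the example data are exactly linear, the learned coefficients recover the hand-written rule (a = 32, b = 1.8).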
Machine learning definition
Arthur Samuel, a pioneer in the fields of artificial intelligence and computer gaming, coined
the term "Machine Learning". Computers that learn from data are said to perform machine learning:
the "field of study that gives computers the capability to learn without being explicitly
programmed".
In layman's terms, Machine Learning (ML) can be explained as automating and
improving the learning process of computers based on their experience, without their being
explicitly programmed, i.e. without human assistance.
"Learning denotes changes in a system that ... enable a system to do the same task more
efficiently the next time." - Herbert Simon
“Learning is constructing or modifying representations of what is being experienced.” -
Ryszard Michalski
According to SAS, “Machine learning is a method of data analysis that automates analytical
model building. It is a branch of artificial intelligence based on the idea that systems can learn
from data, identify patterns and make decisions with minimal human intervention”.
“Learning is making useful changes in our minds.” - Marvin Minsky
Definition of Machine Learning (Mitchell 1997) — "A computer program is said to learn
from experience E with respect to some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with experience E".
“Machine learning refers to a system capable of the autonomous acquisition and
integration of knowledge.”
Machine Learning can be defined as the practice of using algorithms to analyze data, learn
from it, and then forecast future trends for that topic.
In short: any computer program that improves its performance at some task through experience.
Components of a Learning Problem
Task: the behaviour or task being improved.
Performance Measure: how the improvement on the task is measured.
Data or Experience: the experience (typically training data) from which the system learns.
Fruit Prediction Problem
Task – recognizing different kinds of fruit
Performance Measure – ability to predict the maximum variety of fruits correctly
Data or Experience – training the machine with a large dataset of fruit images
Face Recognition Problem
Task – recognizing different faces
Performance Measure – ability to predict the maximum number of face types correctly
Data or Experience – training the machine with a large dataset of different face
images
Automatic Translation of Documents
Task – translating the text of a document from one language into another
Performance Measure – ability to convert one language to another accurately and efficiently
Data or Experience – training the machine with a large corpus of documents in different languages
Supervised Learning
[Figure: labeled training pairs (Input1, Output1), (Input2, Output2), (Input3, Output3) are fed to a Learning Algorithm, which produces a Model; the Model then maps a new input x to a predicted output y.]
• Predictive Modeling (Supervised Learning): Building a model for the target variable as a
function of the explanatory variable.
– Classification: used for discrete target variables.
Ex: predicting whether a web user will make a purchase at an online book store
(the target variable is binary-valued).
– Regression: used for continuous target variables.
Ex: forecasting the future price of a stock (price is a continuous-valued attribute).
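As a minimal sketch of a discrete-target (classification) problem, here is a 1-nearest-neighbour classifier; the toy purchase data and the squared-Euclidean distance are assumptions of this sketch, not from the text.

```python
def nearest_neighbour(train, query):
    """train: list of ((features...), label); return the label of the closest point."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return min(train, key=lambda ex: dist2(ex[0], query))[1]

# Toy data: (pages viewed, minutes on site) -> did the user buy?
train = [((1, 2), "no"), ((2, 1), "no"), ((8, 9), "yes"), ((9, 7), "yes")]
label = nearest_neighbour(train, (7, 8))   # a new, unlabeled visitor
```

Predicting a number (say, the amount spent) instead of a label would make this a regression problem.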
Unsupervised Learning:
Correct responses are not provided, but instead the algorithm tries to identify similarities between the
inputs so that inputs that have something in common are categorized together. The statistical
approach to unsupervised learning is known as density estimation.
Unsupervised learning is a type of machine learning algorithm used to draw inferences from
datasets consisting of input data without labeled responses. In unsupervised learning algorithms, a
classification or categorization is not included in the observations. There are no output values and so
there is no estimation of functions. Since the examples given to the learner are unlabeled, the
accuracy of the structure that is output by the algorithm cannot be evaluated. The most common
unsupervised learning method is cluster analysis, which is used for exploratory data analysis to find
hidden patterns or grouping in data.
Unsupervised Learning
[Figure: unlabeled inputs Input1, Input2, Input3, …, Input-n are fed to a Learning Algorithm, which groups them into clusters.]
Cluster Analysis:
• A grouping of similar objects is called a cluster.
• Objects are clustered or grouped based on the principle of maximizing the intra-class
similarity (within a cluster) and minimizing the inter-class similarity (cluster to cluster).
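The clustering principle above can be sketched with a tiny k-means implementation. Note the assumptions of this sketch: k-means is only one of many clustering methods, and the 2-D toy points are invented for illustration.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means sketch: repeatedly assign each point to its nearest
    centroid (maximizing within-cluster similarity), then move centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # k distinct starting centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: (p[0] - centroids[j][0]) ** 2
                                + (p[1] - centroids[j][1]) ** 2)
            clusters[i].append(p)
        centroids = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                     if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
clusters = kmeans(points, 2)
```

On these well-separated points the two recovered clusters match the two obvious groups.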
Example: Document Clustering
[Table: six articles, each represented by its word-frequency pairs.]
• Each article is represented as a set of word-frequency pairs (w, c), where w is a word and
c is the number of times the word appears in the article.
• There are 2 natural clusters in the dataset:
• The first cluster consists of the first 3 articles (news about the Economy).
• The second cluster contains the last 3 articles (news about Health Care).
Semi-supervised Learning
Semi-supervised learning is an approach to machine learning that combines a small amount of labeled
data with a large amount of unlabeled data during training. Semi-supervised learning falls between
unsupervised learning and supervised learning. It is a special instance of weak supervision.
• Reinforcement Learning
Reinforcement learning sits somewhere between supervised and unsupervised learning. The algorithm
is told when its answer is wrong, but is not told how to correct it. It has to explore and try out
different possibilities until it works out how to get the answer right. Reinforcement learning is
sometimes called learning with a critic, because the critic scores the answer but does not suggest
improvements.
Reinforcement learning is the problem of getting an agent to act in the world so as to maximize its
rewards. A learner (the program) is not told what actions to take as in most forms of machine
learning, but instead must discover which actions yield the most reward by trying them. In the
most interesting and challenging cases, actions may affect not only the immediate reward but also
the next situations and, through that, all subsequent rewards.
Example
Consider teaching a dog a new trick: we cannot tell it what to do, but we can reward/punish it if it
does the right/wrong thing. It has to find out what it did that made it get the reward/punishment.
We can use a similar method to train computers to do many tasks, such as playing backgammon
or chess, scheduling jobs, and controlling robot limbs. Reinforcement learning differs from
supervised learning, which learns from examples provided by a knowledgeable expert.
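The reward-driven, trial-and-error loop described above can be sketched with tabular Q-learning on a toy corridor. The environment, learning rate, and episode count are all assumptions of this sketch, not from the text.

```python
import random

def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    """Toy corridor: states 0..n-1, actions 0=left and 1=right.
    The only reward (1.0) is for reaching the rightmost state; the agent
    must discover this by trying actions, as in the text above."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy: mostly exploit, sometimes explore
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
            r = 1.0 if s2 == n_states - 1 else 0.0
            # Q-value update toward reward plus discounted future value
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q

Q = q_learning()
policy = [0 if q[0] > q[1] else 1 for q in Q]   # greedy action per state
```

After training, the greedy policy in every non-terminal state is "move right", i.e. the agent has discovered the reward-yielding behaviour without ever being told it.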
Reinforcement Learning
[Figure: the RL learner, in state s_t, takes action a_t in the Environment and receives reward r_t and next state s_{t+1}; a second panel shows Q-value updates feeding a Policy that selects the best action for each state.]
Well-Posed Learning Problems
Definition: A computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves
with experience E.
Examples
1. Checkers game: a computer program that learns to play checkers might improve its
performance P, as measured by its ability to win games in the class of tasks T (playing
checkers), through experience E obtained by playing games against itself.
Perspectives and Issues in Machine Learning
The field of machine learning, and much of this book, is concerned with answering questions such
as the following
• What algorithms exist for learning general target functions from specific training
examples? In what settings will particular algorithms converge to the desired function,
given sufficient training data? Which algorithms perform best for which types of problems
and representations?
• How much training data is sufficient? What general bounds can be found to relate the
confidence in learned hypotheses to the amount of training experience and the character of
the learner's hypothesis space?
• When and how can prior knowledge held by the learner guide the process of generalizing from
examples? Can prior knowledge be helpful even when it is only approximately correct?
• What is the best strategy for choosing a useful next training experience, and how does the
choice of this strategy alter the complexity of the learning problem?
• What is the best way to reduce the learning task to one or more function approximation
problems? Put another way, what specific functions should the system attempt to learn? Can
this process itself be automated?
• How can the learner automatically alter its representation to improve its ability to represent
and learn the target function?
CONCEPT LEARNING
Learning involves acquiring general concepts from specific training examples. Example: People
continually learn general concepts or categories such as "bird," "car," "situations in which I should
study more in order to pass the exam," etc.
Each such concept can be viewed as describing some subset of objects or events defined over a
larger set
Alternatively, each concept can be thought of as a Boolean-valued function defined over this larger
set. (Example: A function defined over all animals, whose value is true for birds and false for other
animals).
Definition: Concept learning - Inferring a Boolean-valued function from training examples of its
input and output
Consider the example task of learning the target concept "Days on which Aldo enjoys his favorite
water sport”
Table: positive and negative training examples for the target concept EnjoySport.
The task is to learn to predict the value of EnjoySport for an arbitrary day, based on the
values of its other attributes.
For each attribute, the hypothesis will either:
• indicate by a "?" that any value is acceptable for this attribute,
• specify a single required value (e.g., Warm) for the attribute, or
• indicate by a "Φ" that no value is acceptable.
If some instance x satisfies all the constraints of hypothesis h, then h classifies x as a positive
example (h(x) = 1).
The hypothesis that PERSON enjoys his favorite sport only on cold days with high humidity is
represented by the expression
(?, Cold, High, ?, ?, ?)
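The hypothesis representation can be sketched directly in Python. Encoding each hypothesis and instance as a tuple of attribute strings is an assumption of this sketch.

```python
def satisfies(hypothesis, instance):
    """h(x) = 1 iff every attribute constraint accepts the instance's value.
    '?' accepts any value; a 'Φ' placeholder (any token that matches no
    legal value) accepts nothing."""
    return all(c == "?" or c == v for c, v in zip(hypothesis, instance))

h = ("?", "Cold", "High", "?", "?", "?")                      # the hypothesis above
x1 = ("Sunny", "Cold", "High", "Strong", "Cool", "Change")    # a cold, humid day
x2 = ("Sunny", "Warm", "High", "Strong", "Cool", "Change")    # AirTemp differs
```

Here h classifies x1 as positive (every constraint accepts it) and x2 as negative (the AirTemp constraint fails).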
The set of items over which the concept is defined is called the set of instances, which is
denoted by X.
Example: X is the set of all possible days, each represented by the attributes: Sky, AirTemp,
Humidity, Wind, Water, and Forecast
The concept or function to be learned is called the target concept, which is denoted by c. c can
be any Boolean valued function defined over the instances X
c: X → {0, 1}
Example: The target concept corresponds to the value of the attribute EnjoySport
(i.e., c(x) = 1 if EnjoySport = Yes, and c(x) = 0 if EnjoySport = No).
Instances for which c(x) = 1 are called positive examples, or members of the target concept.
Instances for which c(x) = 0 are called negative examples, or non-members of the target
concept.
The ordered pair (x, c(x)) describes the training example consisting of the instance x and its
target concept value c(x).
D denotes the set of available training examples.
The symbol H denotes the set of all possible hypotheses that the learner may consider regarding
the identity of the target concept. Each hypothesis h in H represents a Boolean-valued function
defined over X:
h: X → {0, 1}
The goal of the learner is to find a hypothesis h such that h(x) = c(x) for all x in X.
Given:
Instances X: Possible days, each described by the attributes
Sky (with possible values Sunny, Cloudy, and Rainy),
AirTemp (with values Warm and Cold),
Humidity (with values Normal and High),
Wind (with values Strong and Weak),
Water (with values Warm and Cool),
Forecast (with values Same and Change).
Determine:
A hypothesis h in H such that h(x) = c(x) for all x in X.
The inductive learning hypothesis: any hypothesis found to approximate the target function well
over a sufficiently large set of training examples will also approximate the target function well
over other unobserved examples.
CONCEPT LEARNING AS SEARCH
Concept learning can be viewed as the task of searching through a large space of
hypotheses implicitly defined by the hypothesis representation.
The goal of this search is to find the hypothesis that best fits the training examples.
Example:
Consider the instances X and hypotheses H in the EnjoySport learning task. The attribute Sky has
three possible values, and AirTemp, Humidity, Wind, Water, and Forecast each have two possible
values, so the instance space X contains exactly
3 × 2 × 2 × 2 × 2 × 2 = 96 distinct instances.
Counting "?" and "Φ" as two additional choices for each attribute gives
5 × 4 × 4 × 4 × 4 × 4 = 5120 syntactically distinct hypotheses within H.
Every hypothesis containing one or more "Φ" symbols represents the empty set of instances; that is,
it classifies every instance as negative. Treating all such hypotheses as one, and otherwise
allowing "?" but not "Φ", gives
1 + (4 × 3 × 3 × 3 × 3 × 3) = 973 semantically distinct hypotheses.
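These counts can be checked with a few lines of Python:

```python
values = [3, 2, 2, 2, 2, 2]   # Sky has 3 values; the other five attributes have 2

instances = 1
for v in values:
    instances *= v            # 3 * 2 * 2 * 2 * 2 * 2 = 96 distinct instances

syntactic = 1
for v in values:
    syntactic *= v + 2        # each attribute may also be '?' or 'Φ'

semantic = 1                  # all hypotheses containing 'Φ' collapse to one
for v in values:
    semantic *= v + 1         # values plus '?', with no 'Φ'
semantic += 1
```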
Consider the sets of instances that are classified positive by h1 and by h2. Because h2 imposes
fewer constraints on the instance, it classifies more instances as positive: any instance
classified positive by h1 will also be classified positive by h2. Therefore, h2 is more general
than h1.
Given hypotheses hj and hk, hj is more-general-than-or-equal-to hk if and only if any instance
that satisfies hk also satisfies hj.
Definition: Let hj and hk be Boolean-valued functions defined over X. Then hj is more-general-
than-or-equal-to hk (written hj ≥ hk) if and only if
(∀x ∈ X) [(hk(x) = 1) → (hj(x) = 1)]
In the figure, the box on the left represents the set X of all instances, and the box on the
right the set H of all hypotheses.
Each hypothesis corresponds to some subset of X: the subset of instances that it classifies
positive.
The arrows connecting hypotheses represent the more-general-than relation, with the arrow
pointing toward the less general hypothesis.
Note that the subset of instances characterized by h2 subsumes the subset characterized by h1;
hence h2 is more-general-than h1.
FIND-S Algorithm
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
   For each attribute constraint ai in h
      If the constraint ai is satisfied by x, then do nothing
      Else replace ai in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
To illustrate this algorithm, assume the learner is given the sequence of training examples from
the EnjoySport task. FIND-S first initializes h to the most specific hypothesis,
h0 = <Ø, Ø, Ø, Ø, Ø, Ø>.
Observing the first training example, it is clear that this hypothesis is too specific. None of
the "Ø" constraints in h are satisfied by the example, so each is replaced by the next more
general constraint that fits the example:
h1 = <Sunny Warm Normal Strong Warm Same>
The second training example forces the algorithm to further generalize h, this time
substituting a "?" in place of any attribute value in h that is not satisfied by the new
example
h2 = <Sunny Warm ? Strong Warm Same>
Upon encountering the third training example, the algorithm makes no change to h: the FIND-S
algorithm simply ignores every negative example.
h3 = <Sunny Warm ? Strong Warm Same>
The fourth example leads to a further generalization of h
h4 = < Sunny Warm ? Strong ? ? >
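The FIND-S trace above can be reproduced with a short Python sketch. The tuple-of-strings encoding, with the string "0" standing in for Ø, is an assumption of this sketch.

```python
def find_s(examples):
    """examples: list of (instance_tuple, label). Start from the most
    specific hypothesis and generalize it over the positive examples."""
    n = len(examples[0][0])
    h = ["0"] * n                     # most specific hypothesis: all 'Ø'
    for x, label in examples:
        if label != "Yes":
            continue                  # FIND-S ignores every negative example
        for i, v in enumerate(x):
            if h[i] == "0":
                h[i] = v              # first positive example: copy its values
            elif h[i] != v:
                h[i] = "?"            # conflicting value: generalize to '?'
    return tuple(h)

examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), "Yes"),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), "Yes"),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), "Yes"),
]
h = find_s(examples)
```

The result matches h4 in the trace: <Sunny Warm ? Strong ? ?>.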
VERSION SPACES AND THE CANDIDATE-ELIMINATION ALGORITHM
The key idea in the CANDIDATE-ELIMINATION algorithm is to output a description of the set of all
hypotheses consistent with the training examples.
Representation
Definition: version space. The version space with respect to hypothesis space H and training
examples D, denoted VS_{H,D}, is the subset of hypotheses from H consistent with the training
examples in D:
VS_{H,D} = {h ∈ H | Consistent(h, D)}
The LIST-THEN-ELIMINATE algorithm first initializes the version space to contain all
hypotheses in H and then eliminates any hypothesis found inconsistent with any training
example.
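A minimal sketch of LIST-THEN-ELIMINATE for the EnjoySport task follows. The tuple encoding, with "?" for any-value and "0" for Φ, is an assumption of this sketch; brute-force enumeration is only feasible because H is tiny (973 hypotheses).

```python
from itertools import product

DOMAINS = [("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
           ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change")]

def consistent(h, x, label):
    """h is consistent with (x, label) iff its prediction matches the label."""
    predicts_positive = all(c == "?" or c == v for c, v in zip(h, x))
    return predicts_positive == (label == "Yes")

def list_then_eliminate(examples):
    # the single all-'Φ' hypothesis plus every combination of value-or-'?'
    space = [("0",) * 6] + list(product(*[d + ("?",) for d in DOMAINS]))
    for x, label in examples:
        space = [h for h in space if consistent(h, x, label)]
    return space

examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), "Yes"),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), "Yes"),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), "Yes"),
]
version_space = list_then_eliminate(examples)
```

For these four training examples the surviving version space contains exactly six hypotheses, lying between the most specific member (Sunny, Warm, ?, Strong, ?, ?) and the most general members.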
A More Compact Representation for Version Spaces
The version space is represented by its most general and least general members. These members form
general and specific boundary sets that delimit the version space within the partially ordered
hypothesis space.
Definition: The general boundary G, with respect to hypothesis space H and training data D,
is the set of maximally general members of H consistent with D
Definition: The specific boundary S, with respect to hypothesis space H and training data D,
is the set of minimally general (i.e., maximally specific) members of H consistent with D.
The CANDIDATE-ELIMINATION algorithm computes the version space containing all hypotheses
from H that are consistent with an observed sequence of training examples. For each training
example d:
• If d is a positive example
   • Remove from G any hypothesis inconsistent with d
   • For each hypothesis s in S that is not consistent with d
      • Remove s from S
      • Add to S all minimal generalizations h of s such that h is consistent with d, and some member of G is more general than h
      • Remove from S any hypothesis that is more general than another hypothesis in S
• If d is a negative example
   • Remove from S any hypothesis inconsistent with d
   • For each hypothesis g in G that is not consistent with d
      • Remove g from G
      • Add to G all minimal specializations h of g such that h is consistent with d, and some member of S is more specific than h
      • Remove from G any hypothesis that is less general than another hypothesis in G
The CANDIDATE-ELIMINATION algorithm begins by initializing the version space to the set of all
hypotheses in H: G0 is initialized to the most general hypothesis, G0 = {(?, ?, ?, ?, ?, ?)},
and S0 to the most specific, S0 = {(Φ, Φ, Φ, Φ, Φ, Φ)}.
When the second training example is observed, it has a similar effect to the first: it
generalizes S further to S2, again leaving G unchanged, i.e., G2 = G1 = G0.
Consider the third training example. This negative example reveals that the G boundary of the
version space is overly general, that is, the hypothesis in G incorrectly predicts that this new
example is a positive example.
The hypothesis in the G boundary must therefore be specialized until it correctly classifies
this new negative example
Given that there are six attributes that could be specified to specialize G2, why are there only
three new hypotheses in G3?
For example, the hypothesis h = (?, ?, Normal, ?, ?, ?) is a minimal specialization of G2 that
correctly labels the new example as a negative example, but it is not included in G3. The reason
this hypothesis is excluded is that it is inconsistent with the previously encountered positive
examples.
Consider the fourth training example.
This positive example further generalizes the S boundary of the version space. It also results
in removing one member of the G boundary, because this member fails to cover the new
positive example
After processing these four examples, the boundary sets S4 and G4 delimit the version space of all
hypotheses consistent with the set of incrementally observed training examples.
Linear Regression:
• Linear regression is a statistical regression method used for predictive analysis.
• It is one of the simplest regression algorithms, modeling the relationship between
continuous variables.
• It is used for solving regression problems in machine learning.
• Linear regression shows the linear relationship between the independent variable (X-axis) and
the dependent variable (Y-axis), hence the name linear regression.
• If there is only one input variable (x), it is called simple linear regression; if there is
more than one input variable, it is called multiple linear regression.
• The relationship between the variables in a linear regression model can be illustrated by
predicting the salary of an employee on the basis of years of experience.
Or
Linear Regression:
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent variable (y) and
one or more independent variables (x), hence the name linear regression. Since linear regression
shows a linear relationship, it finds how the value of the dependent variable changes according
to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship
between the variables. Consider the below image:
Types of Linear Regression
Linear regression can be further divided into two types of the algorithm:
• Simple Linear Regression: If a single independent variable is used to predict the value of a
numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear
Regression.
• Multiple Linear Regression: If more than one independent variable is used to predict the value
of a numerical dependent variable, then such a Linear Regression algorithm is called Multiple
Linear Regression.
Linear Regression Line: A linear line showing the relationship between the dependent and
independent variables is called a regression line.
A regression line can show two types of relationship:
• Positive linear relationship: if the dependent variable increases on the Y-axis as the
independent variable increases on the X-axis, the relationship is termed a positive
linear relationship.
• Negative linear relationship: if the dependent variable decreases on the Y-axis as the
independent variable increases on the X-axis, the relationship is called a negative
linear relationship.
Finding the best-fit line:
When working with linear regression, the main goal is to find the best-fit line, which means the
error between predicted values and actual values should be minimized. The best-fit line has the
least error.
Different values for the weights or line coefficients (a0, a1) give different regression lines,
so we need to calculate the best values for a0 and a1 to find the best-fit line; to do this we
use a cost function.
Cost function
• Different values for the weights or line coefficients (a0, a1) give different
regression lines; the cost function is used to estimate the coefficient values
for the best-fit line.
• The cost function optimizes the regression coefficients or weights. It measures how
well a linear regression model is performing.
• We can use the cost function to assess the accuracy of the mapping function, which
maps the input variable to the output variable. This mapping function is also
known as the hypothesis function.
For linear regression, we use the Mean Squared Error (MSE) cost function, which is
the average of the squared errors between the predicted values and the actual values.
For the linear equation y' = a0 + a1x, MSE can be calculated as:
MSE = (1/N) Σᵢ (yᵢ − (a0 + a1xᵢ))²
Residuals: the distance between an actual value and the predicted value is called a residual.
If the observed points are far from the regression line, the residuals are high, and so the cost
function will be high. If the scatter points are close to the regression line, the residuals are
small, and hence the cost function is small.
Gradient Descent:
• Gradient descent is used to minimize the MSE by calculating the gradient of the
cost function.
• A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
• It starts from randomly selected coefficient values and then iteratively
updates them to reach the minimum of the cost function.
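The update loop can be sketched as follows; the learning rate, iteration count, and toy data are assumptions of this sketch.

```python
def gradient_descent(xs, ys, lr=0.01, steps=5000):
    """Minimize MSE for y' = a0 + a1*x by following the negative gradient."""
    a0, a1 = 0.0, 0.0                 # starting coefficients
    n = len(xs)
    for _ in range(steps):
        preds = [a0 + a1 * x for x in xs]
        # partial derivatives of MSE with respect to a0 and a1
        d0 = (2 / n) * sum(p - y for p, y in zip(preds, ys))
        d1 = (2 / n) * sum((p - y) * x for p, y, x in zip(preds, ys, xs))
        a0 -= lr * d0                 # step each coefficient downhill
        a1 -= lr * d1
    return a0, a1

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]                 # data generated from y = 1 + 2x
a0, a1 = gradient_descent(xs, ys)
```

On this exactly-linear toy data the loop converges to a0 ≈ 1 and a1 ≈ 2, the coefficients that make the MSE zero.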
Model Performance: the goodness of fit determines how well the regression line fits the
set of observations. The process of finding the best model among various candidate models is
called optimization.
Example:
[Table: subjects with columns SUBJECT, AGE (X), GLUCOSE LEVEL (Y), XY, X², and Y².]
Σx = 247, Σy = 486, Σxy = 20485, Σx² = 11409, Σy² = 40022. n is the sample
size (6, in our case).
Find a:
a = (Σy·Σx² − Σx·Σxy) / (n·Σx² − (Σx)²)
= (486 × 11409 − 247 × 20485) / (6 × 11409 − 247²)
= 484979 / 7445
= 65.14
Find b:
b = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²)
= (6 × 20485 − 247 × 486) / (6 × 11409 − 247²)
= (122910 − 120042) / (68454 − 61009)
= 2868 / 7445
= 0.385225
Step 3: Insert the values into the equation.
y' = a + bx
y' = 65.14 + 0.385225x
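The arithmetic above can be verified in a few lines, using only the sums stated in the example:

```python
# Re-computing the worked example from the stated sums (n = 6).
n, sx, sy, sxy, sxx = 6, 247, 486, 20485, 11409

a = (sy * sxx - sx * sxy) / (n * sxx - sx ** 2)   # intercept, as in "Find a"
b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)     # slope, as in "Find b"

def predict(age):
    """Predicted glucose level y' = a + b * x."""
    return a + b * age
```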