Chapter 1: Machine Learning Introduction

INTRODUCTION

History of Machine Learning, Programs vs. learning algorithms, Machine Learning definition,
Components of a learning system, Different Types of Learning, FIND-S and Candidate-Elimination
algorithm, Linear Regression, Logistic Regression.

History of Machine Learning


• 1950s:
– Samuel's checker-playing program
• 1960s:
– Neural networks: Rosenblatt's perceptron
– Minsky & Papert prove limitations of the perceptron
• 1970s:
– Symbolic concept induction
– Expert systems and the knowledge-acquisition bottleneck
– Quinlan's ID3
– Natural language processing (symbolic)
• 1980s:
– Advanced decision tree and rule learning
– Learning, planning and problem solving
– Resurgence of neural networks
– Valiant's PAC learning theory
– Focus on experimental methodology
• 1990s: ML and statistics
– Data mining
– Adaptive agents and web applications
– Text learning
– Reinforcement learning
– Ensembles
– Bayes net learning
• 1994: Self-driving car road test
• 1997: Deep Blue beats Garry Kasparov

Popularity of this field in recent times and the reasons behind it:

– New software / algorithms
• Neural networks
• Deep learning
– New hardware
• GPUs
– Cloud enabled
– Availability of Big Data
Milestones: 2009: Google builds a self-driving car; 2011: Watson wins Jeopardy!; 2014: ML systems surpass human-level performance on some vision benchmarks.

(History of Machine Learning) https://www.youtube.com/watch?v=T3PsRW6wZSY

Programs vs. learning algorithms

Traditional programming refers to any manually created program that takes input data, runs on a
computer, and produces the output.

In machine learning, by contrast, the input data and the corresponding outputs are fed to an
algorithm, which creates the program (a model). The resulting model yields insights that can be
used to predict future outcomes.

Fig: Programs vs. learning algorithms
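To make the contrast concrete, here is a small illustrative sketch (the conversion example and the function names are our own, not from the notes): the traditional program hard-codes the rule, while the learning step recovers an equivalent rule from input-output pairs.

```python
import numpy as np

# Traditional programming: the rule (Celsius -> Fahrenheit) is written by hand.
def fahrenheit_program(celsius):
    return 1.8 * celsius + 32.0

# Machine learning: the rule is estimated from (input, output) example pairs.
celsius = np.array([-10.0, 0.0, 8.0, 15.0, 22.0, 38.0])
fahrenheit = np.array([14.0, 32.0, 46.4, 59.0, 71.6, 100.4])

# Fit a degree-1 polynomial (a straight line) to the examples.
slope, intercept = np.polyfit(celsius, fahrenheit, deg=1)

print(fahrenheit_program(100))   # 212.0 (hand-written rule)
print(slope * 100 + intercept)   # ~212.0 (rule learned from the data)
```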

Machine learning definition

 Arthur Samuel, a pioneer in the fields of artificial intelligence and computer gaming, coined the term "Machine Learning".
 In essence, machine learning is the study of computers learning from data.
 Samuel described it as the "field of study that gives computers the capability to learn without being explicitly programmed".
 In layman's terms, Machine Learning (ML) can be explained as automating and improving the learning process of computers based on their experiences, without them being explicitly programmed, i.e., without human assistance for each new task.
 "Learning denotes changes in a system that ... enables a system to do the same task ... more efficiently the next time." - Herbert Simon
 "Learning is constructing or modifying representations of what is being experienced." - Ryszard Michalski
 According to SAS, "Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention".
 "Learning is making useful changes in our minds." - Marvin Minsky
 Definition of Machine Learning (Mitchell, 1997): "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
 "Machine learning refers to a system capable of the autonomous acquisition and integration of knowledge."
 Machine Learning can also be defined as the practice of using algorithms to parse data, learn from it, and then forecast future trends for that topic.
 In short: programs that perform better with experience; that is, any computer program that improves its performance at some task through experience.
Components of Learning Problem
Task: the behaviour or task being improved.
– For example: classification, acting in an environment.
Data: the experiences that are being used to improve performance in the task.
Measure of improvement:
– For example: increasing accuracy in prediction, acquiring new skills to act in an environment, improved speed and efficiency.

Fruit Prediction Problem
 Task – recognizing and classifying different varieties of fruit
 Performance Measure – the ability to correctly predict the maximum variety of fruits
 Data or Experience – training the machine with a large dataset of fruit images
Face Recognition Problem
 Task – recognizing and classifying different types of faces
 Performance Measure – the ability to correctly recognize the maximum number of face types
 Data or Experience – training the machine with a large dataset of different face images
Automatic Translation of Documents
 Task – translating a document from one language into another
 Performance Measure – the ability to convert text from one language to the other accurately and efficiently
 Data or Experience – training the machine with a large dataset of documents in different languages

Components of learning System

1) Feature Extraction + Domain knowledge


First and foremost, we need to understand what type of data we are dealing with and what we
eventually want to get out of it. Essentially, we need to understand how and which features should
be extracted from the data.
For instance, assume we want to build software that distinguishes between male and female names.
All the names in the text can be thought of as our raw data, while our features could be the number
of vowels in the name, its length, its first and last characters, and so on.
2) Feature Selection
In many scenarios we end up with a lot of features at our disposal. We might want to select a subset
of those based on the resources and computational power we have. In this step we select a few
influential features and separate them from the not-so-influential features. There are many ways to
do this: information gain, gain ratio, correlation, etc.
3) Choice of Algorithm
There is a wide range of algorithms from which we can choose, based on whether we are trying to do
prediction, classification or clustering. We can also choose between linear and non-linear algorithms.
Naive Bayes, Support Vector Machines, Decision Trees, k-Means Clustering are some common
algorithms used.
4) Training
In this step we tune our algorithm based on the data we already have. This data is called the training
set, as it is used to train our algorithm. This is the part where our machine or software learns and
improves with experience.
5) Choice of Metrics/Evaluation Criteria
Here we decide our evaluation criteria for our algorithm. Essentially we come up with metrics to
evaluate our results. Commonly used measures of performance are precision, recall, f1-measure,
robustness, specificity-sensitivity, error rate etc.
6) Testing
Lastly, we test how our machine learning algorithm performs on an unseen set of test cases. One way
to do this is to partition the data into a training set and a test set. The training set is used in step 4,
while the test set is used in this step. Techniques such as cross-validation and leave-one-out can be
used to deal with scenarios where we do not have enough data. A short end-to-end sketch of these
steps, using the name-classification example from step 1, is given below.
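The sketch below ties the six steps together on the male/female name example from step 1. It is illustrative only: the tiny name list, the chosen features, and the use of scikit-learn's DecisionTreeClassifier, train_test_split and cross_val_score are our assumptions, not prescriptions from the notes.

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical raw data: names and their gender labels.
names = ["john", "maria", "peter", "anna", "david", "sofia",
         "james", "laura", "robert", "emma", "mark", "olivia"]
labels = ["M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F"]

# 1) Feature extraction: vowel count, length, first and last character codes.
def extract_features(name):
    vowels = sum(ch in "aeiou" for ch in name)
    return [vowels, len(name), ord(name[0]), ord(name[-1])]

X = [extract_features(n) for n in names]
# 2) Feature selection is skipped here; with only four features we keep them all.

# 3) Choice of algorithm: a small decision tree.
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# 4) Training and 6) Testing: hold out part of the data as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, random_state=0)
model.fit(X_train, y_train)

# 5) Metrics: accuracy on the held-out test set, plus cross-validation.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("3-fold CV accuracy:", cross_val_score(model, X, labels, cv=3).mean())
```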

Different Types of learning ( https://www.youtube.com/watch?v=T3PsRW6wZSY )


• Supervised Learning
– X,y (pre-classified training examples)
– Given an observation x, what is the best label for y?
• Unsupervised learning
– X
– Given a set of x’s, cluster or summarize them
• Semi-supervised Learning
• Reinforcement Learning
– Determine what to do based on rewards and punishments.
Supervised Learning
A training set of examples with the correct responses (targets) is provided and, based on this training
set, the algorithm generalises to respond correctly to all possible inputs. This is also called learning
from exemplars. Supervised learning is the machine learning task of learning a function that maps an
input to an output based on example input-output pairs.
In supervised learning, each example in the training set is a pair consisting of an input object
(typically a vector) and an output value. A supervised learning algorithm analyzes the training data
and produces a function, which can be used for mapping new examples. In the optimal case, the
function will correctly determine the class labels for unseen instances. Both classification and
regression problems are supervised learning problems. A wide range of supervised learning
algorithms are available, each with its strengths and weaknesses. There is no single learning
algorithm that works best on all supervised learning problems.

Fig: Supervised learning. Labeled training pairs (inputs X with outputs y) are fed to a learning
algorithm, which produces a model; given a new input x, the model outputs a predicted y.

• Predictive Modeling (Supervised Learning): building a model for the target variable as a
function of the explanatory variables.
– Classification: used for discrete target variables.
Ex: predicting whether a web user will make a purchase at an online book store
(the target variable is binary-valued).
– Regression: used for continuous target variables.
Ex: forecasting the future price of a stock (price is a continuous-valued attribute).
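A minimal sketch of the two supervised settings; the choice of scikit-learn's built-in toy datasets and of these particular models is our assumption rather than anything prescribed in the notes.

```python
from sklearn.datasets import load_iris, load_diabetes
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split

# Classification: predict a discrete class label (iris species).
X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
print("classification accuracy:", clf.score(Xte, yte))

# Regression: predict a continuous value (a disease-progression score).
X, y = load_diabetes(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(Xtr, ytr)
print("regression R^2:", reg.score(Xte, yte))
```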
Unsupervised Learning:
Correct responses are not provided, but instead the algorithm tries to identify similarities between the
inputs so that inputs that have something in common are categorized together. The statistical
approach to unsupervised learning is known as density estimation.
Unsupervised learning is a type of machine learning algorithm used to draw inferences from
datasets consisting of input data without labeled responses. In unsupervised learning algorithms, a
classification or categorization is not included in the observations. There are no output values and so
there is no estimation of functions. Since the examples given to the learner are unlabeled, the
accuracy of the structure that is output by the algorithm cannot be evaluated. The most common
unsupervised learning method is cluster analysis, which is used for exploratory data analysis to find
hidden patterns or grouping in data.

Fig: Unsupervised learning. Unlabeled inputs X are fed to a learning algorithm, which groups them
into clusters.

Cluster Analysis:
• A grouping of similar objects is called a cluster.
• The objects are clustered or grouped based on the principle of maximizing the intra-class
similarity (within a cluster) and minimizing the inter-class similarity (cluster to cluster).
Example:

Article   Words (word : frequency)
1         Dollar: 1, Industry: 4, Country: 2, Loan: 3, Deal: 2, Government: 2
2         Machinery: 2, Labor: 3, Market: 4, Industry: 2, Work: 3, Country: 1
3         Domestic: 4, Forecast: 2, Gain: 1, Market: 3, Country: 2, Index: 3
4         Patient: 4, Symptom: 2, Drug: 3, Health: 2, Clinic: 2, Doctor: 2
5         Death: 2, Cancer: 4, Drug: 3, Public: 4, Health: 3, Director: 2
6         Medical: 2, Cost: 3, Increase: 2, Patient: 2, Health: 3, Care: 1

Document Clustering
• Each article is represented as a set of word-frequency pairs (w, c), where w is a word and
c is the number of times the word appears in the article.
• There are 2 natural clusters in the above dataset.
• The first cluster consists of the first 3 articles (news about the economy).
• The second cluster contains the last 3 articles (news about health care).
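A minimal sketch of this document-clustering example: it builds word-count vectors for the six articles above and groups them with k-means. The vocabulary construction and the use of scikit-learn's KMeans are our own choices.

```python
import numpy as np
from sklearn.cluster import KMeans

articles = [
    {"Dollar": 1, "Industry": 4, "Country": 2, "Loan": 3, "Deal": 2, "Government": 2},
    {"Machinery": 2, "Labor": 3, "Market": 4, "Industry": 2, "Work": 3, "Country": 1},
    {"Domestic": 4, "Forecast": 2, "Gain": 1, "Market": 3, "Country": 2, "Index": 3},
    {"Patient": 4, "Symptom": 2, "Drug": 3, "Health": 2, "Clinic": 2, "Doctor": 2},
    {"Death": 2, "Cancer": 4, "Drug": 3, "Public": 4, "Health": 3, "Director": 2},
    {"Medical": 2, "Cost": 3, "Increase": 2, "Patient": 2, "Health": 3, "Care": 1},
]

# Build a shared vocabulary and a dense article-by-word count matrix.
vocab = sorted({word for article in articles for word in article})
X = np.array([[article.get(word, 0) for word in vocab] for article in articles])

# Ask for the two natural clusters (economy vs. health care).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # e.g. [0 0 0 1 1 1]: articles 1-3 vs. articles 4-6
```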

Semi-supervised Learning
Semi-supervised learning is an approach to machine learning that combines a small amount of labeled
data with a large amount of unlabeled data during training. Semi-supervised learning falls between
unsupervised learning and supervised learning. It is a special instance of weak supervision.

• Reinforcement Learning
This is somewhere between supervised and unsupervised learning. The algorithm gets told when
the answer is wrong, but does not get told how to correct it. It has to explore and try out different
possibilities until it works out how to get the answer right. Reinforcement learning is sometimes
called learning with a critic because of this monitor that scores the answer but does not suggest
improvements.
Reinforcement learning is the problem of getting an agent to act in the world so as to maximize its
rewards. A learner (the program) is not told what actions to take as in most forms of machine
learning, but instead must discover which actions yield the most reward by trying them. In the
most interesting and challenging cases, actions may affect not only the immediate reward but also
the next situations and, through that, all subsequent rewards.
Example
Consider teaching a dog a new trick: we cannot tell it what to do, but we can reward/punish it if it
does the right/wrong thing. It has to find out what it did that made it get the reward/punishment.
We can use a similar method to train computers to do many tasks, such as playing backgammon

or chess, scheduling jobs, and controlling robot limbs. Reinforcement learning is different from
supervised learning. Supervised learning is learning from examples provided by a knowledgeable
expert.

Fig: Reinforcement learning. At each step the RL learner observes the current state, chooses an
action according to its policy, receives a reward from the environment, moves to the next state,
and updates its Q-values; the learned policy then selects the best action for each state.
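To make the reward-driven update concrete, here is a tiny tabular Q-learning sketch. The corridor environment, the reward scheme and the hyper-parameter values are all illustrative assumptions, not part of the notes.

```python
# Tabular Q-learning on a toy 6-state corridor: the agent starts in state 0,
# can move left or right, and gets reward +1 only when it reaches state 5.
import random

N_STATES, ACTIONS = 6, (-1, +1)          # actions: move left or move right
alpha, gamma, epsilon = 0.5, 0.9, 0.1    # learning rate, discount, exploration
Q = [[0.0, 0.0] for _ in range(N_STATES)]

random.seed(0)
for _ in range(200):                     # episodes
    state = 0
    while state != N_STATES - 1:         # until the goal state is reached
        # Epsilon-greedy action selection (ties broken at random).
        if random.random() < epsilon or Q[state][0] == Q[state][1]:
            a = random.randrange(2)
        else:
            a = 0 if Q[state][0] > Q[state][1] else 1
        next_state = min(max(state + ACTIONS[a], 0), N_STATES - 1)
        reward = 1.0 if next_state == N_STATES - 1 else 0.0
        # Q-learning update: move Q(s,a) toward reward + gamma * max_a' Q(s',a').
        Q[state][a] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][a])
        state = next_state

# The greedy policy should now be "move right" in every non-terminal state.
print([("left", "right")[Q[s].index(max(Q[s]))] for s in range(N_STATES - 1)])
```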

Why is Machine Learning Important?


 Some tasks cannot be defined well, except by examples (e.g., recognizing people).
 Relationships and correlations can be hidden within large amounts of data. Machine
Learning/Data Mining may be able to find these relationships.
 Human designers often produce machines that do not work as well as desired in the
environments in which they are used.
 The amount of knowledge available about certain tasks might be too large for explicit
encoding by humans (e.g., medical diagnosis).
 Environments change over time.
 New knowledge about tasks is constantly being discovered by humans. It may be
difficult to continuously re-design systems “by hand”.

Well-Posed Learning Problems

Definition: A computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves
with experience E.

To have a well-defined learning problem, three features need to be identified:


1. The class of tasks
2. The measure of performance to be improved
3. The source of experience

Examples
1. Checkers game: A computer program that learns to play checkers might improve its
performance as measured by its ability to win at the class of tasks involving playing
checkers games, through experience obtained by playing games against itself.

Fig: Checker game board


A checkers learning problem:
 Task T: playing checkers
 Performance measure P: percent of games won against opponents
 Training experience E: playing practice games against itself

2. A handwriting recognition learning problem:


 Task T: recognizing and classifying handwritten words within images
 Performance measure P: percent of words correctly classified
 Training experience E: a database of handwritten words with given classifications
3. A robot driving learning problem:
 Task T: driving on public four-lane highways using vision sensors
 Performance measure P: average distance travelled before an error (as judged by
human overseer)
 Training experience E: a sequence of images and steering commands recorded
while observing a human driver

Perspectives and Issues in Machine Learning

Issues in Machine Learning

The field of machine learning is concerned with answering questions such as the following:
 What algorithms exist for learning general target functions from specific training
examples? In what settings will particular algorithms converge to the desired function,
given sufficient training data? Which algorithms perform best for which types of problems
and representations?
 How much training data is sufficient? What general bounds can be found to relate the
confidence in learned hypotheses to the amount of training experience and the character of
the learner's hypothesis space?
 When and how can prior knowledge held by the learner guide the process of generalizing from
examples? Can prior knowledge be helpful even when it is only approximately correct?
 What is the best strategy for choosing a useful next training experience, and how does the
choice of this strategy alter the complexity of the learning problem?
 What is the best way to reduce the learning task to one or more function approximation
problems? Put another way, what specific functions should the system attempt to learn? Can
this process itself be automated?
 How can the learner automatically alter its representation to improve its ability to represent
and learn the target function?

CONCEPT LEARNING

 Learning involves acquiring general concepts from specific training examples. Example: People
continually learn general concepts or categories such as "bird," "car," "situations in which I should
study more in order to pass the exam," etc.
 Each such concept can be viewed as describing some subset of objects or events defined over a
larger set
 Alternatively, each concept can be thought of as a Boolean-valued function defined over this larger
set. (Example: A function defined over all animals, whose value is true for birds and false for other
animals).

Definition: Concept learning - Inferring a Boolean-valued function from training examples of its
input and output

A CONCEPT LEARNING TASK

Consider the example task of learning the target concept "Days on which Aldo enjoys his favorite
water sport”

Example   Sky     AirTemp   Humidity   Wind     Water   Forecast   EnjoySport
1         Sunny   Warm      Normal     Strong   Warm    Same       Yes
2         Sunny   Warm      High       Strong   Warm    Same       Yes
3         Rainy   Cold      High       Strong   Warm    Change     No
4         Sunny   Warm      High       Strong   Cool    Change     Yes

Table: Positive and negative training examples for the target concept EnjoySport.

The task is to learn to predict the value of EnjoySport for an arbitrary day, based on the values
of its other attributes.

What hypothesis representation is provided to the learner?

 Let's consider a simple representation in which each hypothesis consists of a conjunction
of constraints on the instance attributes.
 Let each hypothesis be a vector of six constraints, specifying the values of the six
attributes Sky, AirTemp, Humidity, Wind, Water, and Forecast.

For each attribute, the hypothesis will either
 Indicate by a "?' that any value is acceptable for this attribute,
 Specify a single required value (e.g., Warm) for the attribute, or
 Indicate by a "Φ" that no value is acceptable

If some instance x satisfies all the constraints of hypothesis h, then h classifies x as a positive
example (h(x) = 1).

The hypothesis that Aldo enjoys his favorite sport only on cold days with high humidity is
represented by the expression
(?, Cold, High, ?, ?, ?)

The most general hypothesis-that every day is a positive example-is represented by


(?, ?, ?, ?, ?, ?)

The most specific possible hypothesis-that no day is a positive example-is represented by


(Φ, Φ, Φ, Φ, Φ, Φ)
Notation

 The set of items over which the concept is defined is called the set of instances, which is
denoted by X.

Example: X is the set of all possible days, each represented by the attributes: Sky, AirTemp,
Humidity, Wind, Water, and Forecast

 The concept or function to be learned is called the target concept, which is denoted by c. c can
be any Boolean valued function defined over the instances X

c: X → {0, 1}

Example: The target concept corresponds to the value of the attribute EnjoySport
(i.e., c(x) = 1 if EnjoySport = Yes, and c(x) = 0 if EnjoySport = No).

 Instances for which c(x) = 1 are called positive examples, or members of the target concept.
 Instances for which c(x) = 0 are called negative examples, or non-members of the target
concept.
 The ordered pair (x, c(x)) describes a training example consisting of the instance x and its
target concept value c(x).
 The symbol D denotes the set of available training examples.

 The symbol H denotes the set of all possible hypotheses that the learner may consider regarding
the identity of the target concept. Each hypothesis h in H represents a Boolean-valued function
defined over X:
h: X → {0, 1}

The goal of the learner is to find a hypothesis h such that h(x) = c(x) for all x in X.

 Given:
 Instances X: Possible days, each described by the attributes
 Sky (with possible values Sunny, Cloudy, and Rainy),
 AirTemp (with values Warm and Cold),
 Humidity (with values Normal and High),
 Wind (with values Strong and Weak),
 Water (with values Warm and Cool),
 Forecast (with values Same and Change).

 Hypotheses H: Each hypothesis is described by a conjunction of constraints on the attributes


Sky, AirTemp, Humidity, Wind, Water, and Forecast. The constraints may be "?" (any value is
acceptable), “Φ” (no value is acceptable), or a specific value.

 Target concept c: EnjoySport : X → {0, 1}


 Training examples D: Positive and negative examples of the target function

 Determine:
 A hypothesis h in H such that h(x) = c(x) for all x in X.

Table: The EnjoySport concept learning task.

The inductive learning hypothesis

Any hypothesis found to approximate the target function well over a sufficiently large set of training
examples will also approximate the target function well over other unobserved examples.

CONCEPT LEARNING AS SEARCH

 Concept learning can be viewed as the task of searching through a large space of
hypotheses implicitly defined by the hypothesis representation.
 The goal of this search is to find the hypothesis that best fits the training examples.

Example:
Consider the instances X and hypotheses H in the EnjoySport learning task. The attribute Sky has
three possible values, while AirTemp, Humidity, Wind, Water and Forecast each have two possible
values, so the instance space X contains exactly
3 · 2 · 2 · 2 · 2 · 2 = 96 distinct instances.
Each attribute in a hypothesis can additionally take the values "?" and "Φ" (5 choices for Sky and
4 for each of the other attributes), giving
5 · 4 · 4 · 4 · 4 · 4 = 5120 syntactically distinct hypotheses within H.

Every hypothesis containing one or more "Φ" symbols represents the empty set of instances; that is, it
classifies every instance as negative. Counting all such hypotheses as one, and allowing "?" in
addition to the attribute values elsewhere, there are
1 + (4 · 3 · 3 · 3 · 3 · 3) = 973 semantically distinct hypotheses.

General-to-Specific Ordering of Hypotheses

Consider the two hypotheses


h1 = (Sunny, ?, ?, Strong, ?, ?)
h2 = (Sunny, ?, ?, ?, ?, ?)

 Consider the sets of instances that are classified positive by h1 and by h2.
 Because h2 imposes fewer constraints on the instance, it classifies more instances as positive. So
any instance classified positive by h1 will also be classified positive by h2. Therefore, h2 is more
general than h1.

Given hypotheses hj and hk, hj is more-general-than-or-equal-to hk if and only if any instance that
satisfies hk also satisfies hj.

Definition: Let hj and hk be Boolean-valued functions defined over X. Then hj is more general-than-
or-equal-to hk (written hj ≥ hk) if and only if

( xX ) [(hk (x) = 1) → (hj (x) = 1)]

 In the figure, the box on the left represents the set X of all instances, the box on the right the set
H of all hypotheses.
 Each hypothesis corresponds to some subset of X-the subset of instances that it classifies
positive.
 The arrows connecting hypotheses represent the more-general-than relation, with the arrow
pointing toward the less general hypothesis.
 Note that the subset of instances characterized by h2 subsumes the subset characterized by h1;
hence h2 is more-general-than h1.

FIND-S: FINDING A MAXIMALLY SPECIFIC HYPOTHESIS

FIND-S Algorithm

1. Initialize h to the most specific hypothesis in H

2. For each positive training instance x
       For each attribute constraint aᵢ in h
           If the constraint aᵢ is satisfied by x
           Then do nothing
           Else replace aᵢ in h by the next more general constraint that is satisfied by x

3. Output hypothesis h

To illustrate this algorithm, assume the learner is given the sequence of training examples from
the EnjoySport task

Example Sky AirTemp Humidity Wind Water Forecast EnjoySport


1 Sunny Warm Normal Strong Warm Same Yes
2 Sunny Warm High Strong Warm Same Yes
3 Rainy Cold High Strong Warm Change No
4 Sunny Warm High Strong Cool Change Yes

 The first step of FIND-S is to initialize h to the most specific hypothesis in H


h0 = (Ø, Ø, Ø, Ø, Ø, Ø)

 Consider the first training example


x1 = <Sunny Warm Normal Strong Warm Same>, +

Observing the first training example, it is clear that hypothesis h is too specific. None of the
"Ø" constraints in h are satisfied by this example, so each is replaced by the next more general
constraint that fits the example
h1 = <Sunny Warm Normal Strong Warm Same>

 Consider the second training example


x2 = <Sunny, Warm, High, Strong, Warm, Same>, +

The second training example forces the algorithm to further generalize h, this time
substituting a "?" in place of any attribute value in h that is not satisfied by the new
example
h2 = <Sunny Warm ? Strong Warm Same>

 Consider the third training example


x3 = <Rainy, Cold, High, Strong, Warm, Change>, -

Upon encountering the third training example, the algorithm makes no change to h. The FIND-S
algorithm simply ignores every negative example.
h3 = < Sunny Warm ? Strong Warm Same>

 Consider the fourth training example


x4 = <Sunny Warm High Strong Cool Change>, +

The fourth example leads to a further generalization of h
h4 = < Sunny Warm ? Strong ? ? >
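A minimal Python sketch of FIND-S on the EnjoySport data above. The tuple-based hypothesis encoding, with '0' standing for Ø and '?' for "any value", is our own representational choice.

```python
# FIND-S: keep the most specific hypothesis consistent with the positive examples.
def find_s(examples):
    n = len(examples[0][0])
    h = ('0',) * n                                 # most specific hypothesis (all Ø)
    for x, label in examples:
        if label != 'Yes':                         # FIND-S ignores negative examples
            continue
        # Minimally generalize h so that it covers the positive example x.
        h = tuple(xv if hv == '0' else (hv if hv == xv else '?')
                  for hv, xv in zip(h, x))
    return h

examples = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   'Yes'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   'Yes'),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 'No'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 'Yes'),
]

print(find_s(examples))    # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
```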

The key property of the FIND-S algorithm


 FIND-S is guaranteed to output the most specific hypothesis within H that is consistent with
the positive training examples
 FIND-S algorithm’s final hypothesis will also be consistent with the negative examples
provided the correct target concept is contained in H, and provided the training examples are
correct.
Unanswered by FIND-S

1. Has the learner converged to the correct target concept?


2. Why prefer the most specific hypothesis?
3. Are the training examples consistent?
4. What if there are several maximally specific consistent hypotheses?

VERSION SPACES AND THE CANDIDATE-ELIMINATION ALGORITHM

The key idea in the CANDIDATE-ELIMINATION algorithm is to output a description of the set of all
hypotheses consistent with the training examples.

Representation

Definition: consistent- A hypothesis h is consistent with a set of training examples D if and


only if h(x) = c(x) for each example (x, c(x)) in D.

Consistent(h, D) ≡ (∀ (x, c(x)) ∈ D) h(x) = c(x)

Note difference between definitions of consistent and satisfies


 An example x is said to satisfy hypothesis h when h(x) = 1, regardless of whether x is
a positive or negative example of the target concept.
 An example x is said to be consistent with hypothesis h iff h(x) = c(x).

Definition: version space - The version space, denoted VS_H,D, with respect to hypothesis space
H and training examples D, is the subset of hypotheses from H consistent with the training
examples in D:

VS_H,D ≡ { h ∈ H | Consistent(h, D) }

The LIST-THEN-ELIMINATION algorithm

The LIST-THEN-ELIMINATE algorithm first initializes the version space to contain all
hypotheses in H and then eliminates any hypothesis found inconsistent with any training
example.

1. VersionSpace ← a list containing every hypothesis in H


2. For each training example, (x, c(x))
remove from VersionSpace any hypothesis h for which h(x) ≠ c(x)
3. Output the list of hypotheses in VersionSpace

The LIST-THEN-ELIMINATE Algorithm

 List-Then-Eliminate works in principle, as long as version space is finite.


 However, since it requires an exhaustive enumeration of all hypotheses in H, it is not
feasible in practice.

A More Compact Representation for Version Spaces

The version space is represented by its most general and least general members. These members form
general and specific boundary sets that delimit the version space within the partially ordered
hypothesis space.

Definition: The general boundary G, with respect to hypothesis space H and training data D,
is the set of maximally general members of H consistent with D:

G ≡ { g ∈ H | Consistent(g, D) ∧ ¬(∃ g' ∈ H)[ (g' >_g g) ∧ Consistent(g', D) ] }

(here h' >_g h means that h' is strictly more general than h)

Definition: The specific boundary S, with respect to hypothesis space H and training data D,
is the set of minimally general (i.e., maximally specific) members of H consistent with D:

S ≡ { s ∈ H | Consistent(s, D) ∧ ¬(∃ s' ∈ H)[ (s >_g s') ∧ Consistent(s', D) ] }

CANDIDATE-ELIMINATION Learning Algorithm

The CANDIDATE-ELIMINATION algorithm computes the version space containing all hypotheses
from H that are consistent with an observed sequence of training examples.

Initialize G to the set of maximally general hypotheses in H
Initialize S to the set of maximally specific hypotheses in H
For each training example d, do
• If d is a positive example
• Remove from G any hypothesis inconsistent with d
• For each hypothesis s in S that is not consistent with d
• Remove s from S
• Add to S all minimal generalizations h of s such that
• h is consistent with d, and some member of G is more general than h
• Remove from S any hypothesis that is more general than another hypothesis in S

• If d is a negative example
• Remove from S any hypothesis inconsistent with d
• For each hypothesis g in G that is not consistent with d
• Remove g from G
• Add to G all minimal specializations h of g such that
• h is consistent with d, and some member of S is more specific than h
• Remove from G any hypothesis that is less general than another hypothesis in G

The CANDIDATE-ELIMINATION algorithm using version spaces


An Illustrative Example
Example Sky AirTemp Humidity Wind Water Forecast EnjoySport
1 Sunny Warm Normal Strong Warm Same Yes
2 Sunny Warm High Strong Warm Same Yes
3 Rainy Cold High Strong Warm Change No
4 Sunny Warm High Strong Cool Change Yes

The CANDIDATE-ELIMINATION algorithm begins by initializing the version space to the set of all
hypotheses in H:

Initializing the G boundary set to contain the most general hypothesis in H


G0 ?, ?, ?, ?, ?, ?

Initializing the S boundary set to contain the most specific (least general) hypothesis
S0 , , , , , 

 When the first training example is presented, the CANDIDATE-ELIMINATION algorithm checks
the S boundary and finds that it is overly specific: it fails to cover the positive example.
 The boundary is therefore revised by moving it to the least more general hypothesis that covers
this new example.
 No update of the G boundary is needed in response to this training example, because G0 correctly
covers this example.

 When the second training example is observed, it has a similar effect of generalizing S further
to S2, leaving G again unchanged i.e., G2 = G1 = G0

 Consider the third training example. This negative example reveals that the G boundary of the
version space is overly general, that is, the hypothesis in G incorrectly predicts that this new
example is a positive example.
 The hypothesis in the G boundary must therefore be specialized until it correctly classifies
this new negative example

Given that there are six attributes that could be specified to specialize G2, why are there only three
new hypotheses in G3?
For example, the hypothesis h = (?, ?, Normal, ?, ?, ?) is a minimal specialization of G2 that
correctly labels the new example as a negative example, but it is not included in G3. The reason
this hypothesis is excluded is that it is inconsistent with the previously encountered positive
examples.

 Consider the fourth training example.

 This positive example further generalizes the S boundary of the version space. It also results
in removing one member of the G boundary, because this member fails to cover the new
positive example

After processing these four examples, the boundary sets S4 and G4 delimit the version space of all
hypotheses consistent with the set of incrementally observed training examples.
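A minimal sketch of the CANDIDATE-ELIMINATION algorithm for this conjunctive hypothesis language, reusing the tuple encoding from the FIND-S sketch ('0' for Ø, '?' for any value). The helper names and the explicit attribute domains are our own choices; running it on the four EnjoySport examples should reproduce the S4 and G4 boundaries described above.

```python
def covers(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

def more_general_or_equal(h1, h2):
    """True if every instance satisfied by h2 is also satisfied by h1."""
    if '0' in h2:                      # h2 covers no instance at all
        return True
    return all(a == '?' or a == b for a, b in zip(h1, h2))

def min_generalize(s, x):
    """Minimal generalization of s that covers the positive example x."""
    return tuple(xv if sv == '0' else (sv if sv == xv else '?')
                 for sv, xv in zip(s, x))

def min_specializations(g, domains, x):
    """Minimal specializations of g that exclude the negative example x."""
    out = []
    for i, gv in enumerate(g):
        if gv == '?':
            for v in domains[i]:
                if v != x[i]:
                    out.append(g[:i] + (v,) + g[i + 1:])
    return out

def candidate_elimination(examples, domains):
    n = len(domains)
    S, G = {('0',) * n}, {('?',) * n}
    for x, label in examples:
        if label == 'Yes':
            G = {g for g in G if covers(g, x)}
            S = {min_generalize(s, x) if not covers(s, x) else s for s in S}
            S = {s for s in S if any(more_general_or_equal(g, s) for g in G)}
        else:
            S = {s for s in S if not covers(s, x)}
            new_G = set()
            for g in G:
                if not covers(g, x):
                    new_G.add(g)                   # already excludes the negative
                else:
                    for h in min_specializations(g, domains, x):
                        if any(more_general_or_equal(h, s) for s in S):
                            new_G.add(h)
            # Drop G members that are strictly less general than another member.
            G = {g for g in new_G
                 if not any(g2 != g and more_general_or_equal(g2, g) for g2 in new_G)}
    return S, G

domains = [('Sunny', 'Cloudy', 'Rainy'), ('Warm', 'Cold'), ('Normal', 'High'),
           ('Strong', 'Weak'), ('Warm', 'Cool'), ('Same', 'Change')]

examples = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   'Yes'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   'Yes'),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 'No'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 'Yes'),
]

S, G = candidate_elimination(examples, domains)
print(S)   # S4: {('Sunny', 'Warm', '?', 'Strong', '?', '?')}
print(G)   # G4: {('Sunny', '?', ...), ('?', 'Warm', ...)} (set order may vary)
```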

Linear Regression:
• Linear regression is a statistical regression method which is used for predictive analysis.
• It is one of the simplest regression algorithms; it models the relationship between continuous variables.
• It is used for solving regression problems in machine learning.
• Linear regression shows the linear relationship between the independent variable (X-axis) and the
dependent variable (Y-axis), hence the name linear regression.
• If there is only one input variable (x), it is called simple linear regression; if there is more than one
input variable, it is called multiple linear regression.
• The relationship between the variables in a linear regression model can be illustrated with the
image below, where the salary of an employee is predicted from years of experience.

Below is the mathematical equation for linear regression:

Y = aX + b

Here, Y is the dependent variable (target variable), X is the independent variable (predictor
variable), and a and b are the linear coefficients.

Some popular applications of linear regression are:

 Analyzing trends and sales estimates


 Salary forecasting
 Real estate prediction
 Arriving at ETAs in traffic

Alternatively:
Linear Regression:
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent variable (y) and
one or more independent variables (x), hence the name linear regression. Because it models a
linear relationship, linear regression describes how the value of the dependent variable changes
with the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship
between the variables. Consider the below image:

Mathematically, we can represent a linear regression as: y = a0 + a1·x + ε

Here,
y = dependent variable (target variable)
x = independent variable (predictor variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error
The values for x and y variables are training datasets for Linear Regression model
representation.
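A minimal sketch of fitting y = a0 + a1·x from data, using the salary-versus-experience setting mentioned above; the numbers are made-up illustrative values and the use of scikit-learn is our choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: years of experience (x) and salary (y).
x = np.array([[1.0], [2.0], [3.0], [5.0], [7.0], [10.0]])   # shape (n_samples, 1)
y = np.array([30000, 35000, 41000, 52000, 63000, 80000])

model = LinearRegression().fit(x, y)
print("a0 (intercept):", model.intercept_)
print("a1 (coefficient):", model.coef_[0])
print("predicted salary at 4 years:", model.predict([[4.0]])[0])
```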

Types of Linear Regression
Linear regression can be further divided into two types of the algorithm:
• Simple Linear Regression: If a single independent variable is used to predict the value of a
numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear
Regression.
• Multiple Linear Regression: If more than one independent variable is used to predict the value
of a numerical dependent variable, then such a Linear Regression algorithm is called Multiple
Linear Regression.
Linear Regression Line: A linear line showing the relationship between the dependent and
independent variables is called a regression line.
A regression line can show two types of relationship:
• Positive Linear Relationship: If the dependent variable increases on the Y-axis and
independent variable increases on X-axis, then such a relationship is termed as a Positive
linear relationship.

• Negative Linear Relationship: If the dependent variable decreases on the Y-axis and
independent variable increases on the X-axis, then such a relationship is called a negative
linear relationship.

Finding the best fit line:
When working with linear regression, our main goal is to find the best-fit line, which means the
error between the predicted values and the actual values should be minimized. The best-fit line has
the least error.
Different values of the weights or line coefficients (a0, a1) give different regression lines, so we
need to calculate the best values for a0 and a1 to find the best-fit line; to do so, we use a cost
function.
Cost function
• Different values of the weights or line coefficients (a0, a1) give different regression
lines; the cost function is used to estimate the values of the coefficients for the
best-fit line.
• The cost function optimizes the regression coefficients or weights and measures how
well a linear regression model is performing.
• We can use the cost function to find the accuracy of the mapping function, which
maps the input variable to the output variable. This mapping function is also
known as the hypothesis function.
For linear regression, we use the Mean Squared Error (MSE) cost function, which is the average of
the squared errors between the predicted values and the actual values. For the linear equation
above, the MSE can be written as:

MSE = (1/N) Σ (Yi − (a1·xi + a0))²  (summing over i = 1 .. N)

where N = total number of observations, Yi = actual value, and (a1·xi + a0) = predicted value.

Residuals: the distance between an actual value and the corresponding predicted value is called a
residual. If the observed points are far from the regression line, the residuals will be high, and so
will the cost function. If the scatter points are close to the regression line, the residuals will be
small and hence the cost function will be small as well.
Gradient Descent:
• Gradient descent is used to minimize the MSE by calculating the gradient of the
cost function.
• A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
• This is done by starting from randomly selected coefficient values and then
iteratively updating them to reach the minimum of the cost function (see the sketch below).
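A minimal gradient-descent sketch for simple linear regression; the data, learning rate and iteration count below are illustrative choices, not values from the notes.

```python
import numpy as np

# Hypothetical data roughly following y = 3x + 4 with a little noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.1, 6.9, 10.2, 12.8, 16.1, 19.0])

a0, a1 = 0.0, 0.0          # start from arbitrary coefficient values
lr, n = 0.01, len(x)       # learning rate and number of observations

for _ in range(5000):
    y_pred = a1 * x + a0
    error = y_pred - y
    # Gradients of MSE = (1/n) * sum((y_pred - y)^2) with respect to a0 and a1.
    grad_a0 = (2.0 / n) * error.sum()
    grad_a1 = (2.0 / n) * (error * x).sum()
    a0 -= lr * grad_a0
    a1 -= lr * grad_a1

print(a0, a1)              # should approach the least-squares fit (~4, ~3)
```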

Model Performance: the goodness of fit determines how well the regression line fits the set of
observations. The process of finding the best model out of various models is called optimization.
It can be achieved by the R-squared method, which measures the proportion of the variance in the
dependent variable that is explained by the model; values closer to 1 indicate a better fit.

Example:

SUBJECT   AGE (X)   GLUCOSE LEVEL (Y)   XY      X²      Y²
1         43        99                  4257    1849    9801
2         21        65                  1365    441     4225
3         25        79                  1975    625     6241
4         42        75                  3150    1764    5625
5         57        87                  4959    3249    7569
6         59        81                  4779    3481    6561
Σ         247       486                 20485   11409   40022

Σx = 247, Σy = 486, Σxy = 20485, Σx² = 11409, Σy² = 40022. n is the sample size (6, in our case).


Step 1: Find a (the intercept):
a = (Σy · Σx² − Σx · Σxy) / (n · Σx² − (Σx)²)
  = (486 × 11,409 − 247 × 20,485) / (6 × 11,409 − 247²)
  = 484,979 / 7,445
  = 65.14

Step 2: Find b (the slope):
b = (n · Σxy − Σx · Σy) / (n · Σx² − (Σx)²)
  = (6 × 20,485 − 247 × 486) / (6 × 11,409 − 247²)
  = (122,910 − 120,042) / 7,445
  = 2,868 / 7,445
  = 0.385225

Step 3: Insert the values into the regression equation:
y' = a + bx
y' = 65.14 + 0.385225x
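The same least-squares calculation can be checked in a few lines of NumPy; np.polyfit is our choice of tool here, and it fits the same line.

```python
import numpy as np

age = np.array([43, 21, 25, 42, 57, 59])          # X
glucose = np.array([99, 65, 79, 75, 87, 81])      # Y

# A degree-1 polynomial fit returns (slope, intercept) = (b, a).
b, a = np.polyfit(age, glucose, deg=1)
print(a, b)          # ~65.14 and ~0.385225, matching the hand calculation
print(a + b * 43)    # predicted glucose level for a 43-year-old subject
```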

