
MACHINE LEARNING

OBJECTIVES:

• Familiarity with a set of well-known supervised, unsupervised and semi-supervised learning algorithms.
• The ability to implement some basic machine learning algorithms.
• Understanding of how machine learning algorithms are evaluated.

UNIT -I

The ingredients of machine learning, Tasks: the problems that can be solved with
machine learning, Models: the output of machine learning, Features, the workhorses of
machine learning. Binary classification and related tasks: Classification, Scoring and
ranking, Class probability estimation

UNIT- II

Beyond binary classification: Handling more than two classes, Regression, Unsupervised and descriptive learning.

Concept learning: The hypothesis space, Paths through the hypothesis space, Beyond
conjunctive concepts
UNIT- III

Tree models: Decision trees, Ranking and probability estimation trees, Tree learning as
variance reduction.
Rule models: Learning ordered rule lists, Learning unordered rule sets, Descriptive rule
learning, First-order rule learning
UNIT –IV

Linear models: The least-squares method, The perceptron: a heuristic learning algorithm for linear classifiers, Support vector machines, Obtaining probabilities from linear classifiers, Going beyond linearity with kernel methods.
Distance Based Models: Introduction, Neighbours and exemplars, Nearest Neighbours
classification, Distance Based Clustering, Hierarchical Clustering.
UNIT- V

Probabilistic models: The normal distribution and its geometric interpretations, Probabilistic models for categorical data, Discriminative learning by optimizing conditional likelihood, Probabilistic models with hidden variables.
Features: Kinds of feature, Feature transformations, Feature construction and selection. Model
ensembles: Bagging and random forests, Boosting
UNIT- VI

Dimensionality Reduction: Principal Component Analysis (PCA), Implementation and demonstration.

Artificial Neural Networks: Introduction, Neural network representation, Appropriate problems for neural network learning, Multilayer networks and the back-propagation algorithm.
OUTCOMES:

• Recognize the characteristics of machine learning that make it useful to real-world problems.
• Characterize machine learning algorithms as supervised, semi-supervised, and unsupervised.
• Have heard of a few machine learning toolboxes.
• Be able to use support vector machines.
• Be able to use regularized regression algorithms.
• Understand the concept behind neural networks for learning non-linear functions.

TEXT BOOKS:

1. Machine Learning: The art and science of algorithms that make sense of data, Peter Flach,
Cambridge.
2. Machine Learning, Tom M. Mitchell, MGH.

REFERENCE BOOKS:

1. Understanding Machine Learning: From Theory to Algorithms, Shai Shalev-Shwartz, Shai Ben-David, Cambridge.
2. Machine Learning in Action, Peter Harrington, 2012, Cengage.

Machine Learning

• Machine learning is the study of algorithms that have the ability to learn from past experience.
• Machine learning combines data with statistical tools to predict an output. This output is then used by organizations to derive actionable insights.
• Machine learning is closely related to data mining and Bayesian predictive modeling. The machine receives data as input and uses an algorithm to formulate answers.
• A typical machine learning task is to provide a recommendation. For those who have a Netflix account, all recommendations of movies or series are based on the user's historical data.
• Machine learning is also used for a variety of tasks such as fraud detection, predictive maintenance, portfolio optimization and so on.
• Machine learning is only one part of an application; its output can be consumed by many different programs.

Machine learning VS Traditional programming


Traditional Programming: In traditional programming, a programmer codes all the rules in consultation with
an expert in the industry for which software is being developed. Each rule is based on a logical foundation; the
machine will execute an output following the logical statement. When the system grows complex, more rules
need to be written. It can quickly become unsustainable to maintain.

Machine Learning: Machine learning is supposed to overcome this issue. The machine learns how the input
and output data are correlated and it writes a rule. The programmers do not need to write new rules each time
there is new data. The algorithms adapt in response to new data and experiences to improve efficacy over
time.

Types of Machine Learning

Supervised learning
A type of machine learning in which machines are trained using well-labeled training data and then predict the output. Labeled data means the input data is already tagged with the correct output.

Types of Supervised learning

Classification
• Classification is a supervised learning task.
• Classification predicts a categorical variable.
• It helps you divide your data into different classes; the algorithm that implements classification on a dataset is known as a classifier.
• There are two types of classification:

1) Binary classification: if the classification problem has only two possible classes, it is called binary classification (T/F, Y/N, 0/1).
2) Multi-class classification: if the classification problem has more than two classes, it is called multi-class classification (e.g., Movies, Music).

Types of Classification Algorithms

• KNN
• Naïve Bayes
• Decision tree
• Logistic regression
• Support vector machine
Regression

• A regression algorithm is used when there is a relationship between a dependent and an independent variable (an input and an output variable).
• Regression is used for the prediction of continuous variables such as weather forecasting, market trends, etc.

Types of Regression Algorithms

• Linear Regression
• Logistic Regression
• Polynomial Regression
Unsupervised Learning

Unsupervised learning is a type of algorithm that learns patterns from untagged data. It mainly deals with unlabelled data. Unsupervised learning algorithms allow users to perform more complex processing tasks compared to supervised learning.

Clustering

Clustering is an unsupervised learning task. There is no label for each instance of data. Clustering is alternatively called grouping. Clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups.

Types of clustering algorithms

• Exclusive clustering
• Overlapping clustering
• Hierarchical clustering

Reinforcement Learning

Reinforcement learning is an important type of machine learning in which an agent learns how to behave in an environment by performing actions and observing the results. In its early stages, reinforcement learning proceeds largely by learning from mistakes. Reinforcement learning lies between supervised and unsupervised learning.

Applications of machine learning

1. Image Recognition

Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons, places, digital images, etc. A popular use case of image recognition and face detection is automatic friend tagging suggestion: Facebook provides a feature of auto friend tagging suggestion. Whenever we upload a photo with our Facebook friends, we automatically get a tagging suggestion with names, and the technology behind this is machine learning's face detection and recognition algorithm. It is based on the Facebook project named "DeepFace," which is responsible for face recognition and person identification in the picture.

2. Speech Recognition

While using Google, we get an option of "Search by voice"; it comes under speech recognition and is a popular application of machine learning. Speech recognition is the process of converting voice instructions into text, and it is also known as "Speech to text" or "Computer speech recognition."

At present, machine learning algorithms are widely used by various applications of speech recognition. Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow voice instructions.

3. Traffic prediction

If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions. It predicts traffic conditions such as whether traffic is clear, slow-moving, or heavily congested, with the help of two sources:

• Real-time location of the vehicle from the Google Maps app and sensors
• Average time taken on past days at the same time

Everyone who is using Google Maps is helping this app to become better. It takes information from the user and sends it back to its database to improve the performance.

4. Product recommendations

Machine learning is widely used by various e-commerce and entertainment companies such as Amazon, Netflix, etc., for product recommendation to the user. Whenever we search for some product on Amazon, we start getting advertisements for the same product while surfing the internet in the same browser, and this is because of machine learning. Google understands the user's interests using various machine learning algorithms and suggests products as per customer interest. Similarly, when we use Netflix, we find recommendations for entertainment series, movies, etc., and this is also done with the help of machine learning.

5. Email Spam

Whenever we receive a new email, it is filtered automatically as important, normal, and spam. We always
receive an important mail in our inbox with the important symbol and spam emails in our spam box, and the
technology behind this is Machine learning. Below are some spam filters used by Gmail:

• Content filter
• Header filter

• General blacklists filter
• Rules-based filters
• Permission filters

Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve Bayes
classifier are used for email spam filtering and malware detection.

6. Online Fraud Detection

Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform an online transaction, there are various ways a fraudulent transaction can take place, such as fake accounts, fake IDs, and stealing money in the middle of a transaction. To detect this, a feed-forward neural network helps us by checking whether a transaction is genuine or fraudulent. For each genuine transaction, the output is converted into hash values, and these values become the input for the next round. Each genuine transaction follows a specific pattern which changes for a fraudulent transaction; hence the network detects it and makes our online transactions more secure.

7. Stock Market trading

Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs in share prices, so machine learning's long short-term memory (LSTM) neural network is used for the prediction of stock market trends.

8. Medical Diagnosis

In medical science, machine learning is used for disease diagnosis. With this, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain. It helps in finding brain tumors and other brain-related diseases easily.

The ingredients of machine learning Tasks: The problems that can be
solved with machine learning

PROBLEMS SOLVED BY MACHINE LEARNING

• MANUAL DATA ENTRY: ML programs use the discovered data to improve the process as more calculations are made. Thus machines can learn to perform time-intensive documentation and data entry tasks.
• DETECTING SPAM: Spam detection was one of the earliest problems solved by machine learning. Previously, email service providers used pre-existing rule-based techniques to remove spam.
• PRODUCT RECOMMENDATION: Unsupervised learning enables a product-based recommendation system. The algorithm identifies hidden patterns among items and focuses on grouping similar products into clusters. E-commerce businesses such as Amazon have this capability.
• MEDICAL DIAGNOSIS: Machine learning in the medical field can improve a patient's health at minimum cost. Predictions are based on datasets of anonymized patient records and the symptoms exhibited by a patient.
• FINANCIAL ANALYSIS: Due to the large volume of data, its quantitative nature and accurate historical records, machine learning can be used in financial analysis. Future applications of ML in finance include chatbots and conversational interfaces for customer service, security and sentiment analysis.
• PREDICTIVE MAINTENANCE: Predictive maintenance minimizes the risk of unexpected failures and reduces the amount of unnecessary preventive maintenance activities. For predictive maintenance, an ML architecture can be built which consists of historical device data, a flexible analysis environment, a workflow visualization tool and an operations feedback loop. The Azure ML platform provides an example of simulated aircraft engine run-to-failure events to demonstrate the predictive maintenance modeling process.
• IMAGE RECOGNITION (COMPUTER VISION): Computer vision produces numerical or symbolic information from images and high-dimensional data. It involves machine learning, data mining, database knowledge discovery and pattern recognition. This customization requires highly qualified data scientists or ML consultants.

Models: The output of machine learning

Models form the central concept in machine learning, as they are what is being learned from the data in order to solve a given task. There is a considerable, not to say bewildering, range of machine learning models to choose from.

1) Geometric models
2) Probabilistic models
3) Logical models
4) Grouping and grading

Geometric models: A geometric model is constructed directly in instance space, using geometric concepts such
as lines, planes and distances. One main advantage of geometric classifiers is that they are easy to visualize, as
long as we keep to two or three dimensions.

Probabilistic models: A probabilistic classifier is a classifier that is able to predict, given an observation of an input, a probability distribution over a set of classes, rather than only outputting the most likely class that the observation should belong to. Probabilistic classifiers provide classifications that can be useful in their own right or when combining classifiers into ensembles.

Logical models: Logic models are hypothesized descriptions of the chain of causes and effects leading to an outcome of interest (e.g. prevalence of cardiovascular diseases, annual traffic collisions, etc.). While they can be in narrative form, logic models usually take the form of a graphical depiction of the "if-then" (causal) relationships between the various elements leading to the outcome. However, the logic model is more than the graphical depiction: it is also the theories, scientific evidence, assumptions and beliefs that support it and the various processes behind it.

Grouping and Grading: Grouping models work by breaking up the instance space into groups or segments, the number of which is determined at training time. One could say that grouping models have a fixed and finite 'resolution' and cannot distinguish between individual instances beyond this resolution. Grading models, by contrast, can distinguish between arbitrary instances, for example by computing a real-valued score for each instance.

Features: the workhorses of machine learning.

• Univariate model: In mathematics, univariate refers to an expression, equation, function or polynomial of only one variable. Objects of any of these types involving more than one variable may be called multivariate. In some cases the distinction between the univariate and multivariate cases is fundamental; for example, the fundamental theorem of algebra and Euclid's algorithm for polynomials are fundamental properties of univariate polynomials that cannot be generalized to multivariate polynomials.
• Binary splitting is a technique for speeding up numerical evaluation of many types of series with rational terms. In particular, it can be used to evaluate hypergeometric series at rational points.

Binary classification and related tasks: Classification

Classification

• Classification is a supervised learning task.
• Classification predicts a categorical variable.
• It helps you divide your data into different classes; the algorithm that implements classification on a dataset is known as a classifier.
• There are two types of classification:
1) Binary classification: if the classification problem has only two possible classes, it is called binary classification (T/F, Y/N, 0/1).
2) Multi-class classification: if the classification problem has more than two classes, it is called multi-class classification (e.g., Movies, Music).

Types of classification techniques:

• Logistic Regression
• Naïve Bayes
• K-Nearest Neighbors
• Decision Tree
• Random Forest
• Support Vector Machines

Logistic Regression

• Logistic regression is one of the most popular machine learning algorithms; it comes under supervised learning.
• Logistic regression predicts a categorical variable.
• Logistic regression is used for solving classification problems.
• In logistic regression, instead of fitting a regression line we fit an S-shaped curve which predicts two maximum values (0 or 1).

Naive Bayes Classifier

• Naive Bayes is one of the simplest and most effective classification algorithms.
• It is used for solving classification problems.
• It is mainly used for text classification.

Bayes' theorem:

P(A|B) = P(B|A) * P(A) / P(B)
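As a rough illustration (my addition, not from the original notes), the sketch below applies Bayes' theorem to a tiny spam-filtering example; the word counts and priors are made-up assumptions.

import math

# Hypothetical word counts per class and hypothetical class priors.
spam_counts = {"free": 20, "win": 15, "hello": 5}
ham_counts = {"free": 2, "win": 1, "hello": 30}
priors = {"spam": 0.4, "ham": 0.6}

def class_score(words, counts, prior):
    # P(class | words) is proportional to P(class) * product of P(word | class).
    total = sum(counts.values())
    score = prior
    for w in words:
        # Laplace smoothing so an unseen word does not zero out the product.
        score *= (counts.get(w, 0) + 1) / (total + len(counts))
    return score

message = ["free", "win"]
scores = {
    "spam": class_score(message, spam_counts, priors["spam"]),
    "ham": class_score(message, ham_counts, priors["ham"]),
}
print(max(scores, key=scores.get))   # prints 'spam' for this toy data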
K-Nearest Neighbors

• KNN is one of the simplest machine learning algorithms, based on supervised learning.
• The KNN algorithm can be used for classification problems and regression problems.
• It is mostly used for classification problems.
• The KNN algorithm stores the data and classifies new data points based on similarity measures (distance functions).

Euclidean Distance:

D = sqrt( (XH − H1)² + (XW − W1)² )

Where

XH = Observed Value

H1= Actual value

XW = Observed Value

W1= Actual value

Decision Tree

• Decision tree is a supervised learning technique that can be used for both classification problems and regression problems.
• It is mostly used for classification problems.
• It is a tree-structured classifier.

Random Forest

• Random forest is a popular machine learning algorithm that belongs to the supervised learning techniques.
• It can be used for both classification problems and regression problems.
• It is mostly used for classification problems.
• It is based on the concept of ensemble learning.
• It is a collection of decision trees.

Support Vector Machines

• SVM is a supervised learning algorithm.
• It can be used for classification problems and regression problems.
• It is mostly used for classification problems.
• Each data point is plotted in N-dimensional space.
• Support vectors are simply the coordinates of individual observations.
• We perform classification by finding the hyperplane that separates the two classes.

The following are important concepts in SVM:

• Support vectors: Data points that are closest to the hyperplane are called support vectors. The separating line is defined with the help of these data points.

• Hyperplane: It is a decision plane or boundary that divides a set of objects having different classes.

• Margin: It may be defined as the gap between the two lines drawn through the closest data points of different classes. It can be calculated as the perpendicular distance from the line to the support vectors. A large margin is considered a good margin and a small margin is considered a bad margin.

Scoring and ranking

Variable ranking is the process of ordering the features by the value of some scoring function, which usually measures feature relevance. The score S(fi) is computed from the training data, measuring some criterion of feature fi. By convention, a high score is indicative of a valuable (relevant) feature.
List of scoring modules

Machine Learning Studio (classic) provides many different scoring modules. You select one
depending on the type of model you are using, or the type of scoring task you are performing:

• Apply Transformation: Applies a well-specified data transformation to a dataset. Use this module to apply a saved process to a set of data.
• Assign Data to Clusters: Assigns data to clusters by using an existing trained clustering model. Use this module if you want to cluster new data based on an existing K-Means clustering model. This module replaces the Assign to Clusters (deprecated) module, which has been deprecated but is still available for use in existing experiments.
• Score Matchbox Recommender: Scores predictions for a dataset by using the Matchbox recommender. Use this module if you want to generate recommendations, find related items or users, or predict ratings.
• Score Model: Scores predictions for a trained classification or regression model.

Use this module for all other regression and classification models, as well as some anomaly
detection models.

Coverage curve

A coverage curve visualizes the quality of a ranking: as we move down the ranking, it plots the number of positive examples covered against the number of negative examples covered; normalizing both axes turns it into an ROC curve.

Class probability estimation

A probabilistic classifier assigns a probability to each class, where the probability of a particular class corresponds to the probability of the instance belonging to that class. This is called class probability estimation.

Turning rankers into class probability estimators

Concavity relates to the rate of change of a function's derivative. A function f is concave up (or upwards) where the derivative f′ is increasing. This is equivalent to the second derivative f′′ being positive.

Beyond binary classification: Handling more than two classes

How to evaluate multi-class performance and how to build multi-class models out of binary models:

1. Multi-class classification
2. Multi-class scores and probabilities

1. Multiclass Classification: A classification task with more than two classes; e.g., classify
a set of images of fruits which may be oranges, apples, or pears. Multi-class classification makes
the assumption that each sample is assigned to one and only one label: a fruit can be either an
apple or a pear but not both at the same time.
Problem: Given a dataset of m training examples, each of which contains information in the form of various features and a label. Each label corresponds to a class, to which the training example belongs. In multiclass classification, we have a finite set of classes. Each training example also has n features.
For example, in the case of identification of different types of fruits, "Shape", "Color" and "Radius" can be features, and "Apple", "Orange" and "Banana" can be different class labels.

In a multiclass classification, we train a classifier using our training data, and use this classifier
for classifying new examples.
Aim: We will use different multiclass classification methods such as KNN, Decision trees, SVM, etc., and compare their accuracy on test data.

Approach:

1. Load the dataset from source.
2. Split the dataset into "training" and "test" data.
3. Train Decision tree, SVM, and KNN classifiers on the training data.
4. Use the above classifiers to predict labels for the test data.
5. Measure accuracy and visualize the classification.
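The sketch below (my addition, not part of the original notes) follows these steps on scikit-learn's built-in iris dataset, assuming scikit-learn is installed; the visualization step is omitted.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                      # step 1: load dataset
X_train, X_test, y_train, y_test = train_test_split(   # step 2: train/test split
    X, y, test_size=0.3, random_state=0)

models = {                                              # step 3: train classifiers
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)                        # step 4: predict test labels
    print(name, accuracy_score(y_test, pred))           # step 5: measure accuracy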
Multi-class scores and probabilities
Extension from binary: This section discusses strategies for extending existing binary classifiers to solve multi-class classification problems. Several algorithms have been developed based on neural networks, decision trees, k-nearest neighbours, naive Bayes, support vector machines and Extreme Learning Machines to address multi-class classification problems. These types of techniques can also be called algorithm adaptation techniques.
Neural networks

Multiclass perceptrons provide a natural extension to the multi-class problem. Instead of just
having one neuron in the output layer, with binary output, one could have N binary neurons
leading to multi-class classification. In practice, the last layer of a neural network is usually a
softmax function layer, which is the algebraic simplification of N logistic classifiers, normalized
per class by the sum of the N-1 other logistic classifiers.
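As a small illustration (my addition), the softmax layer mentioned above can be written as follows; the raw class scores are hypothetical.

import math

def softmax(scores):
    # Exponentiate each score and normalize so the outputs sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, 1.0, 0.1]))   # roughly [0.66, 0.24, 0.10] over three classes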
Decision tree classifier – Decision tree classifier is a systematic approach for multiclass
classification. It poses a set of questions to the dataset (related to its attributes/features). The
decision tree classification algorithm can be visualized on a binary tree. On the root and each of
the internal nodes, a question is posed and the data on that node is further split into separate
records that have different characteristics. The leaves of the tree refer to the classes in which the
dataset is split.
SVM (Support vector machine) classifier

SVM (Support vector machine) is an efficient classification method when the feature vector is
high dimensional.

KNN (k-nearest neighbours) classifier – KNN or k-nearest neighbours is the simplest classification algorithm. This classification algorithm does not depend on the structure of the data. Whenever a new example is encountered, its k nearest neighbours from the training data are examined. The distance between two examples can be the Euclidean distance between their feature vectors.
Naive Bayes classifier – The Naive Bayes classification method is based on Bayes' theorem. It is termed 'Naive' because it assumes independence between every pair of features in the data.


Regression


• A regression algorithm is used when there is a relationship between a dependent and an independent variable (an input and an output variable).
• Regression is used for the prediction of continuous variables such as weather forecasting, market trends, etc.
• Regression is an important tool for modeling and analyzing data. Here, we fit a curve/line to the data points in such a manner that the distances of the data points from the curve or line are minimized.

Types of Regression

• Linear Regression
• Polynomial Regression
• Support Vector Regression
• Decision Tree Regression
• Random Forest Regression
Linear Regression

• Linear regression is a simple and easy algorithm.
• Linear regression is a statistical approach used for predictive analysis.
• Linear regression is used to solve regression problems.
• Linear regression predicts a continuous variable.
• It models the relationship between a dependent variable and an independent variable.
• The relationship can be either positive or negative; the best fit line is a straight line:

Y = b0 + b1*X

Where:

Y = dependent variable

X = independent variable

b0 = intercept

b1 = coefficient of the relationship between X and Y

Linear Regression Line

1. Positive Regression
• If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, then such a relationship is termed a POSITIVE regression.

Y = b0 + b1*X

2. Negative Regression
• If the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis, then such a relationship is called a NEGATIVE regression.


Y = b0 − b1*X
Example: Linear regression using the least squares method

X     Y     X−X̄        Y−Ȳ        (X−X̄)²    (X−X̄)(Y−Ȳ)
1     2     1−3 = −2    2−4 = −2    4          4
2     4     2−3 = −1    4−4 = 0     1          0
3     5     3−3 = 0     5−4 = 1     0          0
4     4     4−3 = 1     4−4 = 0     1          0
5     5     5−3 = 2     5−4 = 1     4          2
X̄=3   Ȳ=4                           Σ = 10     Σ = 6


b1 = Σ(X−X̄)(Y−Ȳ) / Σ(X−X̄)² = 6 / 10 = 0.6

Ȳ = b0 + b1·X̄
4 = b0 + 0.6(3)
4 = b0 + 1.8
b0 = 4 − 1.8 = 2.2

So the fitted line is Y = 2.2 + 0.6X.
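The same coefficients can be checked numerically; this small sketch (my addition, assuming NumPy is available) reproduces b1 = 0.6 and b0 = 2.2 for the table above.

import numpy as np

X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 4, 5, 4, 5])

# Slope from the least-squares formula, intercept from the means.
b1 = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
b0 = Y.mean() - b1 * X.mean()
print(round(b1, 2), round(b0, 2))   # 0.6 2.2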

Polynomial Regression

A regression equation is a polynomial regression equation if the power of the independent variable is more than 1. The equation below represents a polynomial equation:

Y = a + b*x^2


In this regression technique, the best fit line is not a straight line. It is rather a curve that fits into
the data points.
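A hedged sketch of fitting such a curve with scikit-learn (my addition; the sample data is invented and roughly quadratic):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Invented data following a rough quadratic trend.
x = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.2, 4.1, 9.3, 15.8, 25.4])

# Expand x into polynomial features of degree 2, then fit a linear model on them.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.predict([[6]]))   # predicted value for x = 6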

Support Vector Regression

Support Vector Machine (SVM) is a supervised machine learning algorithm which can be used for both classification and regression challenges. However, it is mostly used in classification problems. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyperplane that differentiates the two classes very well.

Support vectors are simply the coordinates of individual observations. The Support Vector Machine is a frontier which best segregates the two classes (a hyperplane/line).


Decision Tree Regression

Similar to decision tree classification; however, it uses mean squared error or similar metrics instead of cross-entropy or Gini impurity to determine splits.

In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of
a procedure for estimating an unobserved quantity) measures the average of the squares of the
errors—that is, the average squared difference between the estimated values and what is actually
estimated. The MSE is a measure of the quality of an estimator—it is always non-negative, and
values closer to zero are better.

The mean squared error is given by MSE = (1/n) * Σ (Yi − Ŷi)², where Yi is an observed value, Ŷi the corresponding predicted value, and n the number of observations.

Random Forest Regression


Random forest is an ensemble approach where we take into account the predictions of several decision regression trees.
• Select K random data points.
• Identify n, where n is the number of decision tree regressors to be created. Repeat steps 1 and 2 to create several regression trees.
• The average of each branch is assigned to the leaf node in each decision tree.


• To predict the output for a new instance, the average of the predictions of all decision trees is taken into consideration.
Random forest prevents overfitting (which is common in decision trees) by creating random subsets of the features and building smaller trees using these subsets.
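A minimal scikit-learn sketch of random forest regression (my addition; the toy data is invented):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data: predict y from a single feature x.
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([1.1, 1.9, 3.2, 4.1, 5.0, 5.9, 7.2, 8.1, 8.9, 10.2])

# n_estimators is the number of decision trees whose predictions are averaged.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.predict([[6.5]]))   # average of all the trees' predictions for x = 6.5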

Unsupervised Learning

• Unsupervised learning is a type of algorithm that learns patterns from untagged data.
• It mainly deals with unlabelled data.
• Unsupervised learning algorithms allow users to perform more complex processing tasks compared to supervised learning.

Clustering

• Clustering is an unsupervised learning task.
• There is no label for each instance of data.
• Clustering is alternatively called grouping.
• Clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups.


Types of clustering algorithms

• Exclusive clustering
• Overlapping clustering
• Hierarchical clustering

Exclusive (partitioning)

In this clustering method, data points are grouped in such a way that each data point can belong to only one cluster.
Example: K-means

Agglomerative

In this clustering technique, every data point starts as its own cluster. Iterative unions between the two nearest clusters reduce the number of clusters.

Example: Hierarchical clustering


Overlapping

In this technique, fuzzy sets are used to cluster data. Each point may belong to two or more
clusters with separate degrees of membership.

Descriptive learning

Descriptive Learning: Using descriptive analysis, you come up with the idea that two products A (burger) and B (french fries) are bought together with very high frequency. Now you want the machine to automatically suggest B whenever a user buys A. Looking at past data and deducing the possible factors influencing this situation can be achieved using ML.

Predictive Learning: We want to increase our sales; using descriptive learning we came to know the possible factors influencing sales. We then tune those parameters so that sales are maximized in the next quarter, predicting what sales we could generate and making investments accordingly. This task can also be handled using ML.


Concept learning: The hypothesis space

The most studied task in machine learning is inferring a function that classifies examples, represented in some language, as members or non-members of a concept from pre-classified training examples. This is called concept learning or classification.

An algorithm that supports concept learning requires:

• Training data (past experiences to train our models)
• Target concept (hypothesis to identify data objects)
• Actual data objects (for testing the models)

Hypothesis Testing
A multi-step procedure that leads the researcher from the hypothesis statement to the decision regarding the hypothesis.


Hypothesis Space (H):

The hypothesis space is the set of all possible legal hypotheses. This is the set from which the machine learning algorithm determines the single best hypothesis that describes the target function or the outputs.
Hypothesis (h):

A hypothesis is a function that best describes the target in supervised machine learning. The hypothesis that an algorithm comes up with depends upon the data and also upon the restrictions and bias that we have imposed on the data. To better understand the hypothesis space and hypothesis, consider a coordinate plane that shows the distribution of some data:

Say suppose we have test data for which we have to determine the outputs or results. The test data is as shown
below:

We can predict the outcomes by dividing the coordinate as shown below:


So the test data would yield the following result:

But note here that we could have divided the coordinate plane as:

The way in which the coordinate would be divided depends on the data, algorithm and constraints.

• All the legal possible ways in which we can divide the coordinate plane to predict the outcome of the test data together compose the hypothesis space.
• Each individual possible way is known as a hypothesis.
Hence, in this example the hypothesis space would be like:


Paths through the hypothesis space

We have not one but two most general hypotheses. What we can also notice is that every concept
between the least general one and one of the most general ones is also a possible hypothesis, i.e.,
covers all the positives and none of the negatives. Mathematically speaking we say that the set of
Algorithm LGG-Conj-ID(x, y) – find least general conjunctive generalization of two
conjunctions, employing internal disjunction.

Input : conjunctions x, y.

Output : conjunction z.

1. z ←true;

2. for each feature f do

3. if f = vx is a conjunct in x and f = vy is a conjunct in y then

4. add f = Combine-ID(vx , vy ) to z; // Combine-ID: see text

5. end

6. end

7. return z
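A rough Python rendering of this pseudocode (my addition). It assumes each conjunction is represented as a dict mapping features to sets of allowed values, and that Combine-ID simply forms the internal disjunction (set union) of the two value sets; the example conjunctions are invented.

def combine_id(vx, vy):
    # Internal disjunction of the two value sets (an assumption about Combine-ID).
    return vx | vy

def lgg_conj_id(x, y):
    """Least general conjunctive generalization of conjunctions x and y."""
    z = {}                                   # z <- true (the empty conjunction)
    for f in x:                              # for each feature f
        if f in y:                           # f = vx in x and f = vy in y
            z[f] = combine_id(x[f], y[f])    # add f = Combine-ID(vx, vy) to z
    return z

# Hypothetical example: two conjunctions over features Length and Gills.
x = {"Length": {3}, "Gills": {"no"}}
y = {"Length": {4}, "Gills": {"no"}}
print(lgg_conj_id(x, y))   # {'Length': {3, 4}, 'Gills': {'no'}}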


Tree models

Tree-based machine learning methods are among the most commonly used
supervised learning methods. Tree-based ML methods are built by recursively splitting a training
sample, using different features from a dataset at each node that splits the data most effectively.

Decision Tree
• Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but mostly it is preferred for solving classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.
• In a decision tree, there are two kinds of nodes, the Decision Node and the Leaf Node. Decision nodes are used to make a decision and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
• The decisions or tests are performed on the basis of features of the given dataset.
• It is a graphical representation for getting all the possible solutions to a problem/decision based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which expands on further branches and constructs a tree-like structure.
• In order to build a tree, we use the CART algorithm, which stands for Classification and Regression Tree algorithm.
• A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.


Algorithm
1. Start with a training data set, which we'll call S. It should have attributes and a classification.

2. Determine the best attribute in the dataset (the definition of "best attribute" is given below).

3. Split S into subsets that contain the possible values of the best attribute.

4. Make a decision tree node that contains the best attribute.

5. Recursively generate new decision trees by using the subsets of data created in step 3, until a stage is reached where you cannot classify the data further. Represent the class as a leaf node.


Example: the "play tennis" dataset with 14 training examples, attributes Outlook, Temperature, Humidity and Wind, and class Play (9 Yes, 5 No).

Formulas

Information gain of the class distribution (p positives, n negatives, s = p + n):

I(p, n) = −(p/s)·log2(p/s) − (n/s)·log2(n/s)

Entropy of attribute A:

E(A) = Σi ((pi + ni) / (p + n)) · I(pi, ni)

Gain:

Gain(A) = infogain − entropy = I(p, n) − E(A)


Infogain of the whole dataset (9 Yes, 5 No):

I(9, 5) = −(9/14)·log2(9/14) − (5/14)·log2(5/14)
        = 0.409 + 0.530 = 0.940

Calculate the entropy of Outlook

Sunny (5 examples) = 2 Yes, 3 No

I(Outlook = Sunny) = −(2/5)·log2(2/5) − (3/5)·log2(3/5) = 0.971

I(Outlook = Overcast) = 0

I(Outlook = Rain) = 0.971

E(Outlook) = (5/14)·0.971 + (4/14)·0 + (5/14)·0.971 = 0.694

Gain(Outlook) = infogain − entropy = 0.940 − 0.694 = 0.246

Gain of each attribute

Attribute      Gain
Outlook        0.246   (first splitting point)
Temperature    0.029
Humidity       0.151
Wind           0.048

(i) Repeat the entire process for Outlook = Sunny


Outlook   Temperature   Humidity   Wind     Play?
Sunny     Hot           High       Weak     No
Sunny     Hot           High       Strong   No
Sunny     Mild          High       Weak     No
Sunny     Cool          Normal     Weak     Yes
Sunny     Mild          Normal     Strong   Yes

Gain: Temperature 0.571, Humidity 0.971 (second splitting point), Wind 0.020

Outlook = Rain

Outlook   Temperature   Humidity   Wind     Play?
Rain      Mild          High       Weak     Yes
Rain      Cool          Normal     Weak     Yes
Rain      Cool          Normal     Strong   No
Rain      Mild          Normal     Weak     Yes
Rain      Mild          High       Strong   No

Gain: Temperature 0.019, Humidity 0.020, Wind 0.971 (third splitting point)

Rain (5 examples) = 3 Yes, 2 No

I(3, 2) = −(3/5)·log2(3/5) − (2/5)·log2(2/5) = 0.97
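The numbers above can be checked with a short script (my addition), using the 9 Yes / 5 No class counts and the per-value counts for Outlook from the play-tennis example:

import math

def info(p, n):
    # I(p, n) = -(p/s)log2(p/s) - (n/s)log2(n/s), with s = p + n and 0*log0 taken as 0.
    s = p + n
    total = 0.0
    for k in (p, n):
        if k:
            total -= (k / s) * math.log2(k / s)
    return total

# Class counts (positives, negatives) per value of Outlook.
outlook = {"Sunny": (2, 3), "Overcast": (4, 0), "Rain": (3, 2)}
p, n = 9, 5

entropy = sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in outlook.values())
print(round(info(p, n), 3))            # 0.94
print(round(entropy, 3))               # 0.694
print(round(info(p, n) - entropy, 3))  # 0.247 (the 0.246 above, up to rounding)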


Advantages

• Easy to use and understand.
• Can handle both categorical and numerical data.
• Resistant to outliers, hence requires little data preprocessing.

Disadvantages

• The decision tree can contain many layers, which makes it complex.
• It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
• For more class labels, the computational complexity of the decision tree may increase.

Applications

Decision trees have been used to develop models for prediction and classification in different domains, some of which are:

• Business management
• Customer relationship management
• Fraudulent statement detection
• Engineering, energy consumption
• Fault diagnosis
• Healthcare management
• Agriculture

Ranking and probability estimation trees

Decision tree learning algorithms are commonly used in machine learning for
classification problems. A tree is defined as a set of logical conditions on attributes; a leaf represents
the subset of instances corresponding to the conjunction of conditions along its branch, or path
back to the root.


An instance being classified is passed along the tree to a leaf and is assigned the majority class label of that leaf. For many applications, it is useful to order instances, which involves assigning them a rank, rather than a class label. For example, a web-page recommender may want to order web pages by the likelihood of their being of interest to the user, instead of classifying them as "of interest" and "not of interest." A simple approach to ranking is to estimate the probability of an instance's membership in a class, and assign that probability as the instance's rank. A decision tree can easily be used to estimate these probabilities. If a leaf node contains class frequencies n1, n2, ..., nc, the probability that an instance falling in that leaf belongs to class i can be defined as ni / Σj nj. Decision trees acting as probability estimators, however, are often observed to produce bad probability estimates. Specifically, every instance in a node is assigned the same probability, resulting in a proliferation of ties, which reduces the results' usefulness for ranking. Sparse training sets may also lead to skewed estimates. Most developments in decision tree learning algorithms have aimed at improving classification accuracy rather than probability estimates. We explore two techniques for producing improved probability estimates.

Tree learning as variance reduction.

Reduction in Variance

Reduction in variance is an algorithm used for continuous target variables (regression problems). This algorithm uses the standard formula of variance to choose the best split. The split with the lower variance is selected as the criterion to split the population:

Variance = Σ(X − X̄)² / n


Above, X̄ (X-bar) is the mean of the values, X is an actual value and n is the number of values.
Steps to calculate variance:
1. Calculate the variance for each node.
2. Calculate the variance for each split as the weighted average of the node variances.
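A tiny sketch of these two steps (my addition; the target values and the candidate split are invented):

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Invented continuous target values and a candidate split into two child nodes.
left = [10.0, 11.0, 12.0]
right = [20.0, 22.0, 21.0, 19.0]
parent = left + right

n = len(parent)
# Weighted average of the child-node variances; the split with the lower value wins.
weighted = len(left) / n * variance(left) + len(right) / n * variance(right)
print(variance(parent), weighted)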

Chi-Square
The acronym CHAID stands for Chi-squared Automatic Interaction Detector. It is one of the oldest tree classification methods. It finds the statistical significance of the differences between sub-nodes and the parent node. We measure it by the sum of squares of the standardized differences between the observed and expected frequencies of the target variable. It works with a categorical target variable such as "Success" or "Failure". It can perform two or more splits. The higher the value of chi-square, the higher the statistical significance of the differences between the sub-node and the parent node.
Mathematically, the chi-square value for a single class in a node is represented as:

Chi-square = sqrt( (Actual − Expected)² / Expected )

Steps to calculate chi-square for a split:

1. Calculate the chi-square for an individual node by calculating the deviation for both Success and Failure.
2. Calculate the chi-square of the split as the sum of all chi-square values of Success and Failure of each node of the split.
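A small sketch of this calculation (my addition; the observed counts and the 50/50 expected distribution are assumptions made for illustration):

import math

def node_chi_square(actual_success, actual_failure):
    # Expected counts under an assumed 50/50 Success/Failure parent distribution.
    total = actual_success + actual_failure
    expected = total / 2
    chi_success = math.sqrt((actual_success - expected) ** 2 / expected)
    chi_failure = math.sqrt((actual_failure - expected) ** 2 / expected)
    return chi_success + chi_failure

# Chi-square of a split = sum over its child nodes (two invented child nodes here).
split_chi = node_chi_square(8, 2) + node_chi_square(3, 7)
print(split_chi)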


Rule models
Rule-based methods are a popular class of techniques in machine learning and data mining. They share the goal of finding regularities in data that can be expressed in the form of IF-THEN rules. Depending on the type of rule that should be found, we can discriminate between descriptive rule discovery, which aims at describing significant patterns in the given dataset in terms of rules, and predictive rule learning. In the latter case, one is often also interested in learning a collection of rules that collectively cover the instance space, in the sense that they can make a prediction for every possible instance. In the following, we will briefly introduce both tasks and point out some key works in this area. While in some application areas rule learning algorithms have been superseded by statistical approaches such as Support Vector Machines (SVMs), an emerging use case for rule learning is the Semantic Web, whose representation is built on rule-based formalisms.

Learning ordered rule lists


An ordered rule set is known as a decision list. Rules are rank-ordered according to their priority. For example, when a test record is presented to the classifier, it is assigned to the class label of the highest-ranked rule it triggers. If none of the rules fire, it is assigned to the default class.
That is, if more than one rule is triggered, we need conflict resolution:
• Size ordering - assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., with the most attribute tests).
• Class-based ordering - decreasing order of prevalence or misclassification cost per class.
• Rule-based ordering (decision list) - rules are organized into one long priority list, according to some measure of rule quality or by experts.

Learning unordered rule sets


This approach allows a test record to trigger multiple classifications rules and consider the
consequent of each rule as a vote for a particular class. The votes are tallied and the class that
receives the highest number of votes will be assigned to that test record

Descriptive rule learning


• Looks back to the past.
• Used to extract compact and easily understood information from large, sometimes gigantic databases.
• Related tools: OLAP (Online Analytical Processing), SQL (Structured Query Language).


• Descriptive learning produces new, non-trivial information based on the available data set.
• Descriptive analysis is used to learn about and understand the data.
• Example: identify and describe groups of customers with common buying behaviour.

First-order rule learning

We discussed algorithms for learning sets of propositional (i.e., variable-free) rules. In this section, we consider learning rules that contain variables, in particular, learning first-order Horn theories. Our motivation for considering such rules is that they are much more expressive than propositional rules. Inductive learning of first-order rules or theories is often referred to as inductive logic programming (or ILP for short), because this process can be viewed as automatically inferring PROLOG programs from examples. PROLOG is a general-purpose, Turing-equivalent programming language in which programs are expressed as collections of Horn clauses.


Linear models
Linear models generate a formula to create a best-fit line to predict unknown values. Linear models are considered "old school" and often not as predictive as newer algorithm classes, but they can be trained relatively quickly and are generally more straightforward to interpret.

Two types of linear models:

1. Linear regression, which is used for regression (continuous predictions).

2. Logistic regression, which is used for classification (categorical predictions).

1. Linear Regression
Linear regression is one of the most basic types of regression in machine learning. The linear regression model consists of a predictor (independent) variable and a dependent variable related linearly to each other. In case the data involves more than one independent variable, the model is called multiple linear regression.
y = mx + c + e
where m is the slope of the line, c is the intercept, and e represents the error in the model.
where m is the slope of the line, c is an intercept, and e represents the error in the model.

The best fit line is determined by varying the values of m and c. The prediction error is the difference between the observed values and the predicted values.

The values of m and c are selected in such a way that they give the minimum prediction error. It is important to note that a simple linear regression model is susceptible to outliers. Therefore, it should not be used in the case of big-sized data.
2. Logistic Regression

• Logistic regression is one of the most popular machine learning algorithms; it comes under supervised learning.
• Logistic regression predicts a categorical variable.
• Logistic regression is used for solving classification problems.
• In logistic regression, instead of fitting a regression line we fit an S-shaped curve which predicts two maximum values (0 or 1).

Least Squares Regression


Line of Best Fit

Imagine you have some points, and want to have a line that best fits them like this:


We can place the line "by eye": try to have the line as close as possible to all points, and a similar number of points
above and below the line.

But for better accuracy let's see how to calculate the line using Least Squares Regression.

The Line

Our aim is to calculate the values m (slope) and b (y-intercept) in the equation of a line :

y = mx + b
Where:

y = how far up
x = how far along
m = Slope or Gradient (how steep the line is)
b = the Y Intercept (where the line crosses the Y axis)

Steps

To find the line of best fit for N points:

Step 1: For each (x,y) point calculate x² and xy

Step 2: Sum all x, y, x² and xy, which gives us Σx, Σy, Σx² and Σxy (Σ means "sum up")

Step 3: Calculate Slope m:  m = ( N Σ(xy) − Σx Σy ) / ( N Σ(x²) − (Σx)² )

(N is the number of points.)

Step 4: Calculate Intercept b:  b = ( Σy − m Σx ) / N

Step 5: Assemble the equation of a line y = mx + b


Example

Sam found how many hours of sunshine vs how many ice creams were sold at the shop from Monday to Friday:

"x" "y"
Hours of Sunshine Ice Creams Sold

2 4
3 5
5 7
7 10
9 15

Let us find the best m (slope) and b (y-intercept) that suits that data

y = mx + b

Step 1: For each (x,y) calculate x2 and xy:

x    y    x²    xy

2 4 4 8

3 5 9 15

5 7 25 35

7 10 49 70

9 15 81 135

Step 2: Sum x, y, x² and xy (gives us Σx, Σy, Σx² and Σxy):


x    y    x²    xy

2 4 4 8

3 5 9 15

5 7 25 35

7 10 49 70

9 15 81 135

Σx: 26   Σy: 41   Σx²: 168   Σxy: 263

Also N (number of data values) = 5

Step 3: Calculate Slope m:

m = ( N Σ(xy) − Σx Σy ) / ( N Σ(x²) − (Σx)² )

  = ( 5 × 263 − 26 × 41 ) / ( 5 × 168 − 26² )

  = ( 1315 − 1066 ) / ( 840 − 676 )

  = 249 / 164 = 1.5183...

Step 4: Calculate Intercept b:

b = ( Σy − m Σx ) / N

  = ( 41 − 1.5183 × 26 ) / 5

  = 0.3049...


Step 5: Assemble the equation of a line:

y = mx + b

y = 1.518x + 0.305

Let's see how it works out:

x y y = 1.518x + 0.305 error

2 4 3.34 −0.66

3 5 4.86 −0.14

5 7 7.89 0.89

7 10 10.93 0.93

9 15 13.97 −1.03

Here are the (x,y) points and the line y = 1.518x + 0.305 on a graph:

Sam hears the weather forecast which says "we expect 8 hours of sun tomorrow", so he uses the above equation to
estimate that he will sell

y = 1.518 x 8 + 0.305 = 12.45 Ice Creams

Sam makes fresh waffle cone mixture for 14 ice creams just in case.
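The same slope and intercept can be checked in a few lines (my addition, assuming NumPy is available):

import numpy as np

x = np.array([2, 3, 5, 7, 9])        # hours of sunshine
y = np.array([4, 5, 7, 10, 15])      # ice creams sold

N = len(x)
# Least squares formulas from the steps above.
m = (N * (x * y).sum() - x.sum() * y.sum()) / (N * (x ** 2).sum() - x.sum() ** 2)
b = (y.sum() - m * x.sum()) / N
print(m, b)        # ~1.5183 and ~0.3049
print(m * 8 + b)   # ~12.45 ice creams predicted for 8 hours of sun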


The least-squares method


The "least squares" method is a form of mathematical regression analysis used to determine the line of best
fit for a set of data, providing a visual demonstration of the relationship between the data points. Each point of
data represents the relationship between a known independent variable and an unknown dependent variable.

• The least squares method is a statistical procedure to find the best fit for a set of data points by minimizing the sum of the offsets or residuals of points from the plotted curve.
• Least squares regression is used to predict the behavior of dependent variables.

Example of the Least Squares Method


An example of the least squares method is an analyst who wishes to test the relationship between a company's stock returns, and the returns of the index for which the stock is a component. In this example, the
analyst seeks to test the dependence of the stock returns on the index returns. To achieve this, all of the returns
are plotted on a chart. The index returns are then designated as the independent variable, and the stock returns
are the dependent variable. The line of best fit provides the analyst with coefficients explaining the level of
dependence.


The Line of Best Fit Equation


The line of best fit determined from the least squares method has an equation that tells the story of the
relationship between the data points. Line of best fit equations may be determined by computer software models,
which include a summary of outputs for analysis, where the coefficients and summary outputs explain the
dependence of the variables being tested.

Least Squares Regression Line


If the data shows a linear relationship between two variables, the line that best fits this linear relationship is known as the least squares regression line, which minimizes the vertical distance from the data points to the regression line. The term "least squares" is used because it is the smallest sum of squared errors, which is also called the "variance".


The perceptron: a heuristic learning algorithm for linear classifiers

Perceptron: A perceptron is a single-layer neural network, while a multi-layer perceptron is called a neural network. The perceptron is a (binary) linear classifier. Also, it is used in supervised learning. It helps to classify the given input data.

The perceptron consists of 4 parts.

1. Input values or One input layer

2. Weights and Bias

3. Net sum

4. Activation Function
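To make the four parts concrete, here is a minimal perceptron sketch (my addition; the training data encodes the logical AND function as an assumption):

# Perceptron: input values -> weights and bias -> net sum -> step activation.
def predict(x, weights, bias):
    net = sum(w * xi for w, xi in zip(weights, x)) + bias   # net sum
    return 1 if net >= 0 else 0                             # step activation function

# Toy data: logical AND of two binary inputs.
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias, lr = [0.0, 0.0], 0.0, 0.1

for _ in range(10):                       # a few passes over the data
    for x, target in data:
        error = target - predict(x, weights, bias)
        # Perceptron update rule: nudge weights and bias toward the correct output.
        weights = [w + lr * error * xi for w, xi in zip(weights, x)]
        bias += lr * error

print([predict(x, weights, bias) for x, _ in data])   # expected: [0, 0, 0, 1]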


A heuristic learning algorithm for linear classifiers


Support vector machines

• Support vector machine is a supervised learning algorithm.
• It can be used for classification problems and regression problems.
• It is mostly used in classification problems.
• Each data point is plotted in n-dimensional space.
• Support vectors are simply the coordinates of individual observations.
• We perform classification by finding the hyperplane that differentiates the two classes.
• The hyperplane is the frontier that best segregates the two classes.

The following are important concepts in SVM:

1) Support vectors: Data points that are closest to the hyperplane are called support vectors.
2) Hyperplane: It is a decision plane or boundary that divides a set of objects having different classes.
3) Margin: It may be defined as the gap between the lines drawn through the closest data points of different classes, and it can be calculated as the perpendicular distance from the line to the support vectors. A large (maximum) margin is considered a GOOD margin; a small margin is a BAD margin.
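A brief scikit-learn sketch (my addition; the 2-D points are made up) that fits a linear SVM and reports its support vectors:

from sklearn.svm import SVC

# Made-up 2-D points for two classes.
X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear")       # find the separating hyperplane with maximum margin
clf.fit(X, y)
print(clf.support_vectors_)      # the points closest to the hyperplane
print(clf.predict([[3, 2], [7, 6]]))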


Advantages

• Works very well with limited datasets.
• Good accuracy.

Disadvantages

• Doesn't work well with very large datasets.

Applications

• Image classification
• Face detection
• Handwriting recognition
• Text categorization

Obtaining probabilities from linear classifiers


Scores from a linear classifier


Logistic calibration

The Logistic function
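Since the figure of the logistic function is not reproduced here, a short sketch (my addition) shows how a raw score from a linear classifier can be squashed into a class probability; the weights and the example point are hypothetical.

import math

def logistic(score):
    # Maps any real-valued score to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical linear classifier: score = w . x + b
w, b = [0.8, -0.5], 0.1
x = [2.0, 1.0]
score = sum(wi * xi for wi, xi in zip(w, x)) + b
print(logistic(score))   # estimated probability of the positive class (about 0.77 here)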


Distance Based Models


Distance-based models are the second class of Geometric models. Like Linear models, distance-based models
are based on the geometry of data. As the name implies, distance-based models work on the concept of
distance. In the context of Machine learning, the concept of distance is not based on merely the physical
distance between two points. Instead, we could think of the distance between two points considering the mode
of transport between two points. Travelling between two cities by plane covers less distance physically than by
train because a plane is unrestricted. Similarly, in chess, the concept of distance depends on the piece used – for
example, a Bishop can move diagonally. Thus, depending on the entity and the mode of travel, the concept of
distance can be experienced differently. The distance metrics commonly used
are Euclidean, Minkowski, Manhattan, and Mahalanobis.

Distance is applied through the concept of neighbours and exemplars. Neighbours are points in proximity with
respect to the distance measure expressed through exemplars. Exemplars are either centroids that find a centre
of mass according to a chosen distance metric or medoids that find the most centrally located data point. The
most commonly used centroid is the arithmetic mean, which minimises squared Euclidean distance to all other
points.


K-Nearest Neighbor Algorithm (K-NN)

 K-NN is one of the simplest machine learning algorithms, based on the supervised learning technique.
 The K-NN algorithm can be used for both regression and classification problems, but it is mostly used for classification.
 The K-NN algorithm stores the available data and classifies new data points based on a similarity measure (distance function).

Euclidean Distance:

d = sqrt( (XH - H1)^2 + (XW - W1)^2 )

Where

XH, XW = observed values (the coordinates of the new point)

H1, W1 = actual values (the coordinates of a stored data point)


K-NN Algorithm working Step by Step process

1) Select the number K of neighbors
2) Calculate the Euclidean distance from the new point to the existing data points
3) Take the K nearest neighbors as per the Euclidean distance
4) Among these K neighbors, count the number of data points in each category
5) Assign the new data point to the category for which the number of neighbors is maximum
6) Finally, our K-NN model is ready

Example:
Perform KNN classification algorithm on following dataset and predict the class for x
(P1=3 and P2=7) K=3.

      P1   P2   CLASS
i      7    7   FALSE
ii     7    4   FALSE
iii    3    4   TRUE
iv     1    4   TRUE
x      3    7   ?

D(x, i)   = sqrt( (3 - 7)^2 + (7 - 7)^2 ) = sqrt( (-4)^2 + 0^2 ) = sqrt(16) = 4
D(x, ii)  = sqrt( (3 - 7)^2 + (7 - 4)^2 ) = sqrt( (-4)^2 + 3^2 ) = sqrt(16 + 9) = sqrt(25) = 5
D(x, iii) = sqrt( (3 - 3)^2 + (7 - 4)^2 ) = sqrt( 0^2 + 3^2 ) = sqrt(9) = 3
D(x, iv)  = sqrt( (3 - 1)^2 + (7 - 4)^2 ) = sqrt( 2^2 + 3^2 ) = sqrt(13) = 3.60

K = 3

The three nearest neighbours are iii (distance 3, TRUE), iv (distance 3.60, TRUE) and i (distance 4, FALSE).

Since the majority of the three nearest neighbours are TRUE, x (P1 = 3, P2 = 7) will belong to class TRUE.

Application

 Used in classification
 Used to impute missing values
 Used in pattern recognition
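The worked example above can be reproduced with a few lines of NumPy (a minimal sketch; no special K-NN library is needed for such a small dataset):

import numpy as np

# Training points (P1, P2) and their classes, from the table above
X = np.array([[7, 7], [7, 4], [3, 4], [1, 4]])
y = np.array(["FALSE", "FALSE", "TRUE", "TRUE"])
query = np.array([3, 7])
k = 3

dists = np.sqrt(((X - query) ** 2).sum(axis=1))     # Euclidean distances: [4, 5, 3, 3.61]
nearest = np.argsort(dists)[:k]                     # indices of the 3 closest points
labels, counts = np.unique(y[nearest], return_counts=True)
print(labels[counts.argmax()])                      # majority vote -> TRUE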


K-Means Algorithm

 The K-Means algorithm is an unsupervised learning algorithm. It is used to solve clustering problems.
 It is a centroid-based algorithm, where each cluster is associated with a centroid.
 The algorithm takes the unlabeled dataset as input, divides the dataset into K clusters, and repeats the process until it finds the best clusters.


Example:

Divide the given sample data in two clusters using K-Means Algorithm [Euclidean Distance]

Height(H) Weight(W)
185 72
170 56
168 60
179 68
182 72
188 77
180 71
180 70
183 84
180 88
180 67
177 76

Euclidean Distance:

d = sqrt( (XH - H1)^2 + (XW - W1)^2 )

Where

XH, XW = observed values (height and weight of the data point)

H1, W1 = actual values (height and weight of the centroid)


Initialize two clusters, taking the first two data points as centroids:

     H     W    Centroid
c1   185   72   (185, 72)
c2   170   56   (170, 56)
Euclidean Distance: Row 3 (168, 60)

C1: sqrt( (168 - 185)^2 + (60 - 72)^2 ) = sqrt( (-17)^2 + (-12)^2 ) = sqrt(289 + 144) = sqrt(433) = 20.80

C2: sqrt( (168 - 170)^2 + (60 - 56)^2 ) = sqrt( 2^2 + 4^2 ) = sqrt(4 + 16) = sqrt(20) = 4.47

Row 3 is closer to C2, so C2 is updated:

C2 = ( (170 + 168)/2 , (56 + 60)/2 ) = ( 338/2 , 116/2 ) = (169, 58)


Euclidean Distance: Row 4 (179, 68)

C1: sqrt( (179 - 185)^2 + (68 - 72)^2 ) = sqrt( 6^2 + 4^2 ) = sqrt(36 + 16) = sqrt(52) = 7.21

C2: sqrt( (179 - 169)^2 + (68 - 58)^2 ) = sqrt( 10^2 + 10^2 ) = sqrt(100 + 100) = sqrt(200) = 14.14

Row 4 is closer to C1, so C1 is updated:

C1 = ( (185 + 179)/2 , (72 + 68)/2 ) = ( 364/2 , 140/2 ) = (182, 70)
Euclidean Distance: Row 5 (182, 72)

C1: sqrt( (182 - 182)^2 + (72 - 70)^2 ) = sqrt( 0^2 + 2^2 ) = sqrt(4) = 2

C2: sqrt( (182 - 169)^2 + (72 - 58)^2 ) = sqrt( 13^2 + 14^2 ) = sqrt(169 + 196) = sqrt(365) = 19.10

Row 5 is closer to C1, so C1 is updated:

C1 = ( (182 + 182)/2 , (72 + 70)/2 ) = ( 364/2 , 142/2 ) = (182, 71)
Euclidean Distance: Row 6 (188, 77)

C1: sqrt( (188 - 182)^2 + (77 - 71)^2 ) = sqrt( 6^2 + 6^2 ) = sqrt(36 + 36) = sqrt(72) = 8.48

C2: sqrt( (188 - 169)^2 + (77 - 58)^2 ) = sqrt( 19^2 + 19^2 ) = sqrt(361 + 361) = sqrt(722) = 26.87

Row 6 is closer to C1, so C1 is updated:

C1 = ( (188 + 182)/2 , (77 + 71)/2 ) = ( 370/2 , 148/2 ) = (185, 74)

Continuing in the same way for the remaining rows gives the final clusters:

C1 → {1, 4, 5, 6, 7, 8, 9, 10, 11, 12}

C2 → {2, 3}
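A minimal sketch of the same exercise with scikit-learn's KMeans, seeded with the first two samples as initial centroids. Note that scikit-learn applies batch (Lloyd's) updates rather than the one-point-at-a-time updates of the hand calculation, so the intermediate centroids may differ slightly:

import numpy as np
from sklearn.cluster import KMeans

# Height/weight samples from the table above
data = np.array([[185, 72], [170, 56], [168, 60], [179, 68], [182, 72], [188, 77],
                 [180, 71], [180, 70], [183, 84], [180, 88], [180, 67], [177, 76]], dtype=float)

km = KMeans(n_clusters=2, init=data[:2], n_init=1).fit(data)
print(km.labels_)            # cluster index assigned to each sample
print(km.cluster_centers_)   # final centroids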


Distance Based Clustering

The density-based clustering method connects the highly-dense areas into clusters, and the arbitrarily shaped
distributions are formed as long as the dense region can be connected. This algorithm does it by identifying
different clusters in the dataset and connects the areas of high densities into clusters. The dense areas in data
space are divided from each other by sparser areas. These algorithms can face difficulty in clustering the data
points if the dataset has varying densities and high dimensions.

Distribution Model-Based Clustering

In the distribution model-based clustering method, the data is divided based on the probability of how a
dataset belongs to a particular distribution. The grouping is done by assuming some distributions
commonly Gaussian distribution. The example of this type is the Expectation-Maximization Clustering
algorithm that uses Gaussian Mixture Models (GMM).


Hierarchical Clustering.

Hierarchical clustering can be used as an alternative to partitional clustering, as there is no requirement to pre-specify the number of clusters to be created. In this technique, the dataset is successively merged (or divided) into clusters to create a tree-like structure, which is also called a dendrogram. Any number of clusters can then be selected by cutting the tree at the appropriate level. The most common example of this method is the Agglomerative Hierarchical algorithm.
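A minimal sketch of agglomerative clustering with SciPy (the toy points are illustrative); the linkage matrix encodes the dendrogram, which can then be cut into any desired number of clusters:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points (illustrative only)
X = np.array([[1, 2], [2, 2], [2, 3], [8, 8], [8, 9], [9, 8]])

Z = linkage(X, method="ward")                      # agglomerative merges, i.e. the dendrogram
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into 2 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) can be plotted to choose the cut level visually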


Probabilistic models
The third family of machine learning algorithms is the probabilistic models. We have seen before that the k-nearest neighbour algorithm uses the idea of distance (e.g., Euclidean distance) to classify entities, and logical models use a logical expression to partition the instance space. Here we see how probabilistic models use the idea of probability to classify new entities.
Probabilistic models see features and target variables as random variables. The process of modeling
represents and manipulates the level of uncertainty with respect to these variables. There are two types
of probabilistic models: Predictive and Generative. Predictive probability models use the idea of a
conditional probability distribution P (Y |X) from which Y can be predicted from X. Generative
models estimate the joint distribution P (Y, X). Once we know the joint distribution for the generative
models, we can derive any conditional or marginal distribution involving the same variables. Thus, the
generative model is capable of creating new data points and their labels, knowing the joint probability
distribution. The joint distribution looks for a relationship between two variables. Once this relationship
is inferred, it is possible to infer new data points.
Naïve Bayes Algorithm

 Naïve Bayes is a supervised learning algorithm, based on Bayes' theorem, that is used for solving classification problems.
 It is mainly used in text classification.
 The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it helps in building fast machine learning models that can make quick predictions.

Bayes' theorem:

P(A|B) = P(B|A) * P(A) / P(B)
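A minimal sketch with scikit-learn's GaussianNB (the toy numeric data is illustrative). The model estimates P(class) and P(x | class) from the training data and applies Bayes' theorem to predict the most probable class:

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy numeric features and binary labels (illustrative only)
X = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.0], [4.1, 3.9]])
y = np.array([0, 0, 1, 1])

model = GaussianNB().fit(X, y)
print(model.predict([[1.1, 2.0]]))        # most probable class for a new point
print(model.predict_proba([[1.1, 2.0]]))  # posterior probabilities P(class | x)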


The normal distribution and its geometric interpretation


Normal distribution, also known as the Gaussian distribution, is a probability distribution that is
symmetric about the mean, showing that data near the mean are more frequent in occurrence than data
far from the mean. In graph form, normal distribution will appear as a bell curve.

Normal distributions have the following features:

 Symmetric bell shape
 Mean and median are equal; both are located at the center of the distribution
 68% of the data falls within 1 standard deviation of the mean
 95% of the data falls within 2 standard deviations of the mean
 99.7% of the data falls within 3 standard deviations of the mean

Geometric interpretation
 Row vectors as points or arrows in n-dimension space
 Very intuitive, good for visualization
 Use techniques from geometry and linear algebra
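The 68/95/99.7 percentages quoted above can be verified numerically with SciPy:

from scipy.stats import norm

# Probability mass within 1, 2 and 3 standard deviations of the mean
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(k, round(p, 4))   # 0.6827, 0.9545, 0.9973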


Discriminative learning by optimizing conditional likelihood; Probabilistic models with hidden variables
In many supervised learning tasks, the entities to be labeled are related to each other in complex ways
and their labels are not independent. For example, in hypertext classification, the labels of linked pages
are highly correlated. A standard approach is to classify each entity independently, ignoring the
correlations between them. Recently, Probabilistic Relational Models, a relational version of Bayesian
networks, were used to define a joint probabilistic model for a collection of related entities. An alternative framework has been proposed that builds on (conditional) Markov networks and addresses two limitations of the
previous approach. First, undirected models do not impose the acyclicity constraint that hinders
representation of many important relational dependencies in directed models. Second, undirected models
are well suited for discriminative training, where we optimize the conditional likelihood of the labels
given the features, which generally improves classification accuracy. We show how to train these
models effectively, and how to use approximate probabilistic inference over the learned model for
collective classification of multiple related entities. We provide experimental results on a webpage
classification task, showing that accuracy can be significantly improved by modeling relational
dependencies.

Feature transformations
Feature transformation (FT) refers to family of algorithms that create new features using the existing
features. These new features may not have the same interpretation as the original features, but they may
have more discriminatory power in a different space than the original space. This can also be used for
feature reduction. FT may happen in many ways, by simple/linear combinations of original features or
using non-linear functions. Some common techniques for FT are:

 Scaling or normalizing features within a range, say between 0 to 1.


 Principal Component Analysis and its variants.
 Random Projection.
 Neural Networks.
 SVM also transforms features internally.
 Transforming categorical features to numerical.
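A minimal sketch of two common feature transformations with scikit-learn (the toy feature matrix is illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature matrix: two features on very different scales (illustrative only)
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

print(MinMaxScaler().fit_transform(X))    # scales every feature into the range [0, 1]
print(StandardScaler().fit_transform(X))  # zero mean and unit variance per feature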


Feature construction and selection.


Feature construction
Feature construction involves transforming a given set of input features to generate a new set of more powerful features, which can then be used for prediction. Engineering a good feature space is a prerequisite for achieving high performance in any machine learning task.
Feature Selection
In machine learning and statistics, feature selection, also known as variable selection,
attribute selection or variable subset selection, is the process of selecting a subset of
relevant features (variables, predictors) for use in model construction.


Model ensembles

Ensemble learning helps improve machine learning results by combining several models. This approach allows the production of better predictive performance compared to a single model, which is why ensemble methods have placed first in many prestigious machine learning competitions, such as the Netflix Competition, KDD 2009, and Kaggle. Ensemble methods are meta-algorithms that combine several machine learning techniques into one predictive model in order to decrease variance (bagging), decrease bias (boosting), or improve predictions (stacking).

Ensemble methods can be divided into two groups:

 Sequential ensemble methods, where the base learners are generated sequentially (e.g. AdaBoost). The basic motivation of sequential methods is to exploit the dependence between the base learners; the overall performance can be boosted by weighing previously mislabeled examples with higher weight.
 Parallel ensemble methods, where the base learners are generated in parallel (e.g. Random Forest). The basic motivation of parallel methods is to exploit independence between the base learners, since the error can be reduced dramatically by averaging.

Most ensemble methods use a single base learning algorithm to produce homogeneous base learners, i.e.
learners of the same type, leading to homogeneous ensembles.

There are also some methods that use heterogeneous learners, i.e. learners of different types, leading
to heterogeneous ensembles. In order for ensemble methods to be more accurate than any of its individual
members, the base learners have to be as accurate as possible and as diverse as possible.


Bagging and random forests

Bootstrap Aggregation (or Bagging for short), is a simple and very powerful ensemble method. Bagging
is the application of the Bootstrap procedure to a high-variance machine learning algorithm, typically
decision trees.

1. Suppose there are N observations and M features. A sample from observation is selected randomly
with replacement(Bootstrapping).

2. A subset of features is selected to create a model with the sample of observations and the subset of features.

3. The feature from the subset which gives the best split on the training data is selected.

4. This is repeated to create many models and every model is trained in parallel

5. Prediction is given based on the aggregation of predictions from all the models.

When bagging with decision trees, we are less concerned about individual trees overfitting the training data. For this reason, and for efficiency, the individual decision trees are grown deep (e.g. few training samples at each leaf node of the tree) and the trees are not pruned. These trees will have high variance and low bias, which are the important characteristics of sub-models when combining predictions using bagging. The only parameter when bagging decision trees is the number of samples and hence the number of trees to include. This can be chosen by increasing the number of trees run after run until the accuracy stops showing improvement.


Figure: Bagging

Algorithm
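As a sketch of the algorithm in practice, bagging can be run with scikit-learn's BaggingClassifier, whose default base learner is a decision tree (the Iris dataset and parameter values are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 50 decision trees, each trained on a bootstrap sample; predictions combined by voting
bag = BaggingClassifier(n_estimators=50, random_state=0)
print(cross_val_score(bag, X, y, cv=5).mean())   # cross-validated accuracy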

Advantages
 Efficient on large datasets
 More accurate than decision trees
 Averaging results of many trees reduces variance
Disadvantages
 More difficult to interpret than decision trees
 Less clear which variables are of greatest importance for predicting the response
 More computationally intensive than forming a single decision tree


Random forests

Random forest is a supervised learning algorithm which is used for both classification and regression. However, it is mainly used for classification problems. As we know, a forest is made up of trees, and more trees means a more robust forest. Similarly, the random forest algorithm creates decision trees on data samples, then gets the prediction from each of them, and finally selects the best solution by means of voting. It is an ensemble method which is better than a single decision tree because it reduces over-fitting by averaging the results.

Working of Random Forest Algorithm

We can understand the working of Random Forest algorithm with the help of following steps −

 Step 1 − First, start with the selection of random samples from a given dataset.

 Step 2 − Next, this algorithm will construct a decision tree for every sample. Then it will get the
prediction result from every decision tree.

 Step 3 − In this step, voting will be performed for every predicted result.

 Step 4 − At last, select the most voted prediction result as the final prediction result.
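A minimal sketch of these steps with scikit-learn's RandomForestClassifier (the dataset and parameter values are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 trees built on random samples and random feature subsets; prediction by majority voting
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(rf.predict(X[:3]))         # predicted classes for the first three samples
print(rf.feature_importances_)   # relative importance of each feature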



Advantages

1. Powerful and accurate


2. Good performance on many problems, including non-linear ones.
Disadvantages

1. No interpretability
2. Overfitting can easily occur
3. Need to choose the number of trees
Random forest is therefore used when you are mainly looking for high predictive performance, with less need for interpretation.

Boosting

Boosting is a technique to combine weak learners and convert them into strong ones with the help of
Machine Learning algorithms. It uses ensemble learning to boost the accuracy of a model. Ensemble learning
is a technique to improve the accuracy of Machine Learning models. There are two types of ensemble
learning:


1. Sequential Ensemble Learning

It is a boosting technique where the outputs from individual weak learners are combined sequentially during the training phase. The performance of the model is boosted by assigning higher weights to the samples that are incorrectly classified. The AdaBoost algorithm is an example of sequential ensemble learning and is discussed later in this unit.

2. Parallel Ensemble Learning

It is a bagging technique where the outputs from the weak learners are generated parallelly. It reduces errors
by averaging the outputs from all weak learners. The random forest algorithm is an example of parallel
ensemble learning.


Mechanism of Boosting Algorithms

Boosting is creating a generic algorithm by considering the prediction of the majority of weak learners. It
helps in increasing the prediction power of the Machine Learning model. This is done by training a series of
weak models.

Below are the steps that show the mechanism of the boosting algorithm:

1. Reading data

2. Assigning weights to observations

3. Identification of misclassified data points (false predictions)

4. Assigning the false prediction, along with a higher weightage, to the next learner

5. Finally, iterating Step 2 until we get the correctly classified output

Now, we will explore various interpretations of weakness and their corresponding algorithms.

Types of Boosting Algorithms

Basically, there are three types of boosting algorithms discussed as below:

1. Adaptive Boosting (AdaBoost)

Adaptive boosting is a technique used for binary classification. For implementing AdaBoost, we use short
decision trees as weak learners.

Steps for implementing AdaBoost:

1. Train the base model using the weighted training data

2. Then, add weak learners sequentially to make it a strong learner

3. Each weak learner consists of a decision tree; analyze the output of each decision tree and assign higher
weights to the misclassified results. This gives more significance to the prediction with higher weights.

4. Continue the process until the model becomes capable of predicting the accurate result
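A minimal sketch of AdaBoost with scikit-learn, whose default weak learner is a one-level decision tree (a stump); the dataset and parameter values are illustrative assumptions:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Decision stumps added sequentially; misclassified samples get higher weight each round
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
print(cross_val_score(ada, X, y, cv=5).mean())   # cross-validated accuracy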

2. Gradient Boosting

In Machine Learning, we use gradient boosting to solve classification and regression problems. It is a
sequential ensemble learning technique where the performance of the model improves over iterations. This
method creates the model in a stage-wise fashion. It infers the model by enabling the optimization of an
absolute differentiable loss function. As we add each weak learner, a new model is created that gives a more
precise estimation of the response variable.

The gradient boosting algorithm requires the below components to function:

1. Loss function: To reduce errors in prediction, we need to optimize the loss function. Unlike in AdaBoost,
the incorrect result is not given a higher weightage in gradient boosting. It tries to reduce the loss function by
averaging the outputs from weak learners.

2. Weak learner: In gradient boosting, we require weak learners to make predictions. To get real values as output, we use regression trees. To get the most suitable split point, we create trees in a greedy manner, which can cause the model to overfit the dataset.

3. Additive model: In gradient boosting, we try to reduce the loss by adding decision trees. Also, we can
minimize the error rate by cutting down the parameters. So, in this case, we design the model in such a way
that the addition of a tree does not change the existing tree.

Finally, we update the weights to minimize the error that is being calculated.

3. XGBoost

XGBoost algorithm is an extended version of the gradient boosting algorithm. It is basically designed to
enhance the performance and speed of a Machine Learning model.

Additionally, we have an XGBoosting library, which gives us frameworks of gradient boosting for various
languages such as R, Python, Java, etc.


What features make XGBoost unique?


XGBoost is much faster than the gradient boosting algorithm. It improves and enhances the execution
process of the gradient boosting algorithm. There are more features that make XGBoost algorithm unique
and they are:
1. Fast: The execution speed of the XGBoost algorithm is high. We get a fast and efficient output due to its
parallel computation.
2. Cache optimization: To manage and utilize resources, it uses cache optimization.
3. Distributed computing: If we are employing large datasets for training the Machine Learning model, then
XGBoost provides us distributed computing, which helps combine multiple machines to enhance
performance.

Advantages
 The training and test error rates are both theoretically bounded
 Less "overfitting" in practice
 Many algorithms can be boosted
 Easy to implement
Disadvantages
 Learning is slow


Difference between Bagging, Boosting and Stacking


                                        Bagging             Boosting                                Stacking
Partitioning of the data into subsets   Random              Misclassified samples are given         Various
                                                            higher preference
Goal to achieve                         Minimize variance   Increase predictive force               Both
Methods where this is used              Random              Gradient descent                        Blending
Function to combine single models       Weighted average    Weighted majority                       Logistic Regression


Dimensionality Reduction

In machine learning classification problems, there are often too many factors on the basis of which the
final classification is done. These factors are basically variables called features. The higher the number
of features, the harder it gets to visualize the training set and then work on it. Sometimes, most of these
features are correlated, and hence redundant. This is where dimensionality reduction algorithms come
into play. Dimensionality reduction is the process of reducing the number of random variables under
consideration, by obtaining a set of principal variables. It can be divided into feature selection and
feature extraction.

Why is Dimensionality Reduction important in Machine Learning and Predictive Modeling?


An intuitive example of dimensionality reduction can be discussed through a simple e-mail classification
problem, where we need to classify whether the e-mail is spam or not. This can involve a large number
of features, such as whether or not the e-mail has a generic title, the content of the e-mail, whether the e-
mail uses a template, etc. However, some of these features may overlap. In another condition, a
classification problem that relies on both humidity and rainfall can be collapsed into just one underlying
feature, since both of the aforementioned are correlated to a high degree. Hence, we can reduce the
number of features in such problems. A 3-D classification problem can be hard to visualize, whereas a
2-D one can be mapped to a simple 2 dimensional space, and a 1-D problem to a simple line. The below
figure illustrates this concept, where a 3-D feature space is split into two 1-D feature spaces, and later, if
found to be correlated, the number of features can be reduced even further.


Components of Dimensionality Reduction


There are two components of dimensionality reduction:

 Feature selection: In this, we try to find a subset of the original set of variables, or features, to get
a smaller subset which can be used to model the problem. It usually involves three ways:
1. Filter
2. Wrapper
3. Embedded
 Feature extraction: This reduces the data in a high dimensional space to a lower dimension space,
i.e. a space with lesser no. of dimensions.
Methods of Dimensionality Reduction
The various methods used for dimensionality reduction include:

 Principal Component Analysis (PCA)


 Linear Discriminant Analysis (LDA)
 Generalized Discriminant Analysis (GDA)
Dimensionality reduction may be both linear or non-linear, depending upon the method used. The prime
linear method, called Principal Component Analysis, or PCA, is discussed below.

Principal Component Analysis


This method was introduced by Karl Pearson. It works on a condition that while the data in a higher
dimensional space is mapped to data in a lower dimension space, the variance of the data in the lower
dimensional space should be maximum.


It involves the following steps:

 Construct the covariance matrix of the data.


 Compute the eigenvectors of this matrix.
 Eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large fraction of
variance of the original data.
Hence, we are left with a lesser number of eigenvectors, and there might have been some data loss in the
process. But, the most important variances should be retained by the remaining eigenvectors.
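These steps can be carried out directly with NumPy. A minimal sketch (the small data matrix is an illustrative assumption):

import numpy as np

# Toy data matrix: rows are samples, columns are features (illustrative only)
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

Xc = X - X.mean(axis=0)                   # centre the data
cov = np.cov(Xc, rowvar=False)            # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues/eigenvectors (ascending order)
order = np.argsort(eigvals)[::-1]         # sort by decreasing variance explained
pc1 = eigvecs[:, order[0]]                # first principal component
projected = Xc @ pc1                      # 1-D representation of the data
print(eigvals[order])
print(projected)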

Linear Discriminant Analysis


Linear Discriminant Analysis or Normal Discriminant Analysis or Discriminant Function
Analysis is a dimensionality reduction technique which is commonly used for the supervised
classification problems. It is used for modeling differences in groups i.e. separating two or more classes.
It is used to project the features in higher dimension space into a lower dimension space.
For example, we have two classes and we need to separate them efficiently. Classes can have multiple
features. Using only a single feature to classify them may result in some overlapping as shown in the
below figure. So, we will keep on increasing the number of features for proper classification.


Example:
Suppose we have two sets of data points belonging to two different classes that we want to classify. As
shown in the given 2D graph, when the data points are plotted on the 2D plane, there's no straight line
that can separate the two classes of the data points completely. Hence, in this case, LDA (Linear
Discriminant Analysis) is used which reduces the 2D graph into a 1D graph in order to maximize the
separability between the two classes.

Here, Linear Discriminant Analysis uses both the axes (X and Y) to create a new axis and projects data
onto a new axis in a way to maximize the separation of the two categories and hence, reducing the 2D
graph into a 1D graph.
Two criteria are used by LDA to create a new axis:
1. Maximize the distance between means of the two classes.
2. Minimize the variation within each class.


In the above graph, it can be seen that a new axis (in red) is generated and plotted in the 2D graph such that it maximizes the distance between the means of the two classes and minimizes the variation within each class. In simple terms, this newly generated axis increases the separation between the data points of the two classes. After generating this new axis using the above-mentioned criteria, all the data points of the classes are plotted on this new axis, as shown in the figure given below.

But Linear Discriminant Analysis fails when the means of the distributions are shared, as it becomes impossible for LDA to find a new axis that makes both classes linearly separable. In such cases, we use non-linear discriminant analysis.
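A minimal sketch of such a 2-D to 1-D projection with scikit-learn's LinearDiscriminantAnalysis (the toy points are illustrative assumptions):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy 2-D points from two classes (illustrative only)
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [6.0, 5.0], [7.0, 8.0], [8.0, 8.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Project onto a single discriminant axis (at most n_classes - 1 axes exist)
lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)
print(X_1d.ravel())               # 1-D coordinates that best separate the two classes
print(lda.predict([[4.0, 4.0]]))  # class predicted for a new point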

Extensions to LDA:
1. Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or
covariance when there are multiple input variables).
2. Flexible Discriminant Analysis (FDA): Where non-linear combinations of inputs is used such as
splines.
3. Regularized Discriminant Analysis (RDA): Introduces regularization into the estimate of the
variance (actually covariance), moderating the influence of different variables on LDA.
Applications:
1. Face Recognition: In the field of Computer Vision, face recognition is a very popular application
in which each face is represented by a very large number of pixel values. Linear discriminant
analysis (LDA) is used here to reduce the number of features to a more manageable number before
the process of classification. Each of the new dimensions generated is a linear combination of pixel
values, which form a template. The linear combinations obtained using Fisher's linear discriminant
are called Fisher faces.
2. Medical: In this field, Linear discriminant analysis (LDA) is used to classify the patient disease
state as mild, moderate or severe based upon the patient various parameters and the medical
treatment he is going through. This helps the doctors to intensify or reduce the pace of their
treatment.

3. Customer Identification: Suppose we want to identify the type of customers which are most
likely to buy a particular product in a shopping mall. By doing a simple question and answers
survey, we can gather all the features of the customers. Here, Linear discriminant analysis will
help us to identify and select the features which can describe the characteristics of the group of
customers that are most likely to buy that particular product in the shopping mall.
General Discriminant Analysis (GDA)
General Discriminant Analysis (GDA) is called a "general" discriminant analysis because it applies the
methods of the general linear model (see also General Linear Models (GLM)) to the discriminant
function analysis problem. A general overview of discriminant function analysis, and the traditional
methods for fitting linear models with categorical dependent variables and continuous predictors, is
provided in the context of Discriminant Analysis. In GDA, the discriminant function analysis problem is
"recast" as a general multivariate linear model, where the dependent variables of interest are (dummy-)
coded vectors that reflect the group membership of each case. The remainder of the analysis is then
performed as described in the context of General Regression Models (GRM), with a few additional
features noted below.
Advantages of Dimensionality Reduction
 It helps in data compression, and hence reduced storage space.
 It reduces computation time.
 It also helps remove redundant features, if any.
Disadvantages of Dimensionality Reduction
 It may lead to some amount of data loss.
 PCA tends to find linear correlations between variables, which is sometimes undesirable.
 PCA fails in cases where mean and covariance are not enough to define datasets.
 We may not know how many principal components to keep; in practice, some rules of thumb are applied.


Principal Component Analysis (PCA)

The main idea of principal component analysis (PCA) is to reduce the dimensionality of a data set
consisting of many variables correlated with each other, either heavily or lightly, while retaining the
variation present in the dataset, up to the maximum extent. The same is done by transforming the
variables to a new set of variables, which are known as the principal components (or simply, the PCs)
and are orthogonal, ordered such that the retention of variation present in the original variables decreases
as we move down in the order. So, in this way, the 1st principal component retains maximum variation
that was present in the original components. The principal components are the eigenvectors of a
covariance matrix, and hence they are orthogonal.

Importantly, the dataset on which the PCA technique is to be used must be scaled, and the results are sensitive to the relative scaling. In layman's terms, PCA is a method of summarizing data. Imagine some wine bottles on a dining table. Each wine is described by its attributes like colour, strength, age, etc. But redundancy will arise because many of them measure related properties. So what PCA does in this case is summarize each wine in the stock with fewer characteristics.

Intuitively, Principal Component Analysis can supply the user with a lower-dimensional picture, a
projection or "shadow" of this object when viewed from its most informative viewpoint.


 Dimensionality: The number of random variables in a dataset, or simply the number of features (i.e., the number of columns present in your dataset).
 Correlation: It shows how strongly two variables are related to each other. Its value ranges from -1 to +1. Positive indicates that when one variable increases, the other increases as well, while negative indicates the other decreases on increasing the former. The modulus (absolute value) indicates the strength of the relation.
 Orthogonal: Uncorrelated to each other, i.e., the correlation between any pair of variables is 0.
 Eigenvectors: Consider a non-zero vector v. It is an eigenvector of a square matrix A if Av is a scalar multiple of v, i.e.
Av = λv
Here, v is the eigenvector and λ is the eigenvalue associated with it.
 Covariance Matrix: This matrix consists of the covariances between pairs of variables. The (i,j)-th element is the covariance between the i-th and j-th variables.


Applications
PCA is predominantly used as a dimensionality reduction technique in domains like facial recognition,
computer vision and image compression. It is also used for finding patterns in data of high dimension in
the field of finance, data mining, bioinformatics, psychology, etc.
 Image processing
 Speech recognition
 Recommendation engines
 Text processing

Back propagation

Backpropagation is the essence of neural network training. It is the method of fine-tuning the weights of
a neural network based on the error rate obtained in the previous epoch (i.e., iteration). Proper tuning of
the weights allows you to reduce error rates and make the model reliable by increasing its
generalization. Backpropagation in neural network is a short form for "backward propagation of errors."
It is a standard method of training artificial neural networks. This method helps calculate the gradient of
a loss function with respect to all the weights in the network.


Types of Backpropagation Networks

Two Types of Backpropagation Networks are:

 Static Back-propagation
 Recurrent Backpropagation

Static back-propagation:

It is one kind of backpropagation network which produces a mapping of a static input for static output. It
is useful to solve static classification issues like optical character recognition.

Recurrent Backpropagation:

Recurrent Back propagation in data mining is fed forward until a fixed value is achieved. After that, the
error is computed and propagated backward.

The main difference between these two methods is that the mapping is rapid in static back-propagation, while it is non-static in recurrent backpropagation.

Artificial Neural Networks

An Artificial Neural Network (ANN), popularly known as a neural network, is a computational model based on the structure and functions of biological neural networks. In terms of Computer Science, it is like an artificial human nervous system for receiving, processing, and transmitting information.

Basically, there are 3 different layers in a neural network :-

1. Input Layer (All the inputs are fed in the model through this layer)

2. Hidden Layers (There can be more than one hidden layers which are used for processing the inputs
received from the input layers)

3. Output Layer (The data after processing is made available at the output layer)

Following is the manner in which these layers are laid

Figure depicting the different layers of a neural network


Input Layer

The Input layer communicates with the external environment that presents a pattern to the neural
network. Its job is to deal with all the inputs only. This input gets transferred to the hidden layers which
are explained below. The input layer should represent the condition for which we are training the neural
network. Every input neuron should represent some independent variable that has an influence over the
output of the neural network

Hidden Layer

The hidden layer is the collection of neurons which has an activation function applied to it, and it is an intermediate layer found between the input layer and the output layer. Its job is to process the inputs obtained from the previous layer, so it is the layer responsible for extracting the required features from the input data. Much research has been carried out on evaluating the number of neurons in the hidden layer, but none of it has produced an exact rule. There can also be multiple hidden layers in a neural network, so how many hidden layers should be used for which kind of problem? If we have data which can be separated linearly, then there is no need to use a hidden layer, as the activation function can be applied to the input layer to solve the problem. But for problems that involve complex decisions, we can use 3 to 5 hidden layers, based on the degree of complexity of the problem or the degree of accuracy required. That certainly does not mean that if we keep on increasing the number of layers, the neural network will give higher accuracy; a stage comes when the accuracy becomes constant or falls if we add an extra layer. We should also consider the number of neurons in each layer. If the number of neurons is small compared to the complexity of the problem data, there will be too few neurons in the hidden layers to adequately detect the signals in a complicated data set. If unnecessarily many neurons are present in the network, then overfitting may occur. None of the methods used so far provides an exact formula for calculating the number of hidden layers or the number of neurons in each hidden layer.


Output Layer

The output layer of the neural network collects and transmits the information accordingly in way it has
been designed to give. The pattern presented by the output layer can be directly traced back to the input
layer. The number of neurons in output layer should be directly related to the type of work that the neural
network was performing. To determine the number of neurons in the output layer, first consider the
intended use of the neural network.

Figure depicting the Activation function for ANN

Flowchart


Advantages
 Easy to conceptualize
 Capable of detecting complex relationships
 Large amount of academic research
 Used extensively in industry for many years
 Provide high-speed calculations
 Can handle a large number of features
 Can solve a wide range of machine learning problems
Disadvantages
 Neural networks are very much a black box, which makes them difficult to interpret and train
 There are alternatives that are simpler, faster, easier to train and that perform better on some problems
 Cannot solve every machine learning problem
 Neural networks are not probabilistic
 Neural networks are not a substitute for understanding your problem
Application
 Handwriting recognition
 Image compression
 Signal processing
 Pattern recognition
 Traveling salesman problem
 Stock exchange prediction


Appropriate problems for neural network learning


ANN learning is well-suited to problems in which the training data corresponds to noisy, complex sensor data, such as inputs from cameras and microphones.

In the ALVINN example, the image of a forward-mounted camera is mapped to 960 neural network inputs, which are fed forward to 4 hidden units, connected to 30 output units. The associated figure shows weight values for one of the hidden units in this network: the 30 x 32 weights into the hidden unit are displayed in a large matrix, with white blocks indicating positive and black indicating negative weights, and the weights from this hidden unit to the 30 output units are depicted by the smaller rectangular block directly above the large block. As can be seen from these output weights, activation of this particular hidden unit encourages a turn toward the left. ANN learning is appropriate for problems with the following characteristics:

 Instances are represented by many attribute-value pairs. The target function to be learned is
defined over instances that can be described by a vector of predefined features, such as the pixel


values in the ALVINN example. These input attributes may be highly correlated or independent
of one another. Input values can be any real values.
 The target function output may be discrete-valued, real-valued, or a vector of several real- or
discrete-valued attributes. For example, in the ALVINN system the output is a vector of 30
attributes, each corresponding to a recommendation regarding the steering direction. The value
of each output is some real number between 0 and 1, which in this case corresponds to the
confidence in predicting the corresponding steering direction. We can also train a single network
to output both the steering command and suggested acceleration, simply by concatenating the
vectors that encode these two output predictions.
 The training examples may contain errors. ANN learning methods are quite robust to noise in the
training data.
 Long training times are acceptable. Network training algorithms typically require longer training
times than, say, decision tree learning algorithms. Training times can range from a few seconds
to many hours, depending on factors such as the number of weights in the network, the number
of training examples considered, and the settings of various learning algorithm parameters.
 Fast evaluation of the learned target function may be required. Although ANN learning times are
relatively long, evaluating the learned network, in order to apply it to a subsequent instance, is
typically very fast. For example, ALVINN applies its neural network several times per second to
continually update its steering command as the vehicle drives forward.
 The ability of humans to understand the learned target function is not important. The weights
learned by neural networks are often difficult for humans to interpret. Learned neural networks
are less easily communicated to humans than learned rules.
We first consider several alternative designs for the primitive units that make up artificial neural networks (perceptrons, linear units, and sigmoid units), along with learning algorithms for training single units. We then present the Backpropagation algorithm for training multilayer networks of such units and consider several general issues, such as the representational capabilities of ANNs, the nature of the hypothesis space search, overfitting problems, and alternatives to the Backpropagation algorithm. A detailed example is also presented applying the Backpropagation algorithm to face recognition, and directions are provided for the reader to obtain the data and code to experiment further with this application.


Multilayer networks and the back propagation algorithm.


A single perceptron can only express linear decision surfaces. In contrast, the kind of multilayer networks learned by the Backpropagation algorithm are capable of expressing a rich variety of nonlinear decision surfaces. For example, consider a typical multilayer network and decision surface for a speech recognition task that involves distinguishing among 10 possible vowels, all spoken in the context of "h_d" (i.e., "hid", "had", "head", "hood", etc.). The input speech signal is represented by two numerical parameters obtained from a spectral analysis of the sound, allowing us to easily visualize the decision surface over the two-dimensional instance space. Such a multilayer network can represent highly nonlinear decision surfaces that are much more expressive than the linear decision surfaces of single units.

Decision regions of a multilayer feed-forward network: the network described here was trained to recognize 1 of 10 vowel sounds occurring in the context "h_d" (e.g., "had", "hid"). The network input consists of two parameters, F1 and F2, obtained from a spectral analysis of the sound. The 10 network outputs correspond to the 10 possible vowel sounds, and the network prediction is the output whose value is highest. The corresponding plot illustrates the highly nonlinear decision surface represented by the learned network; points shown on the plot are test examples distinct from the examples used to train the network.

Earlier we derived a gradient descent learning rule for linear units. However, multiple layers of cascaded linear units still produce only linear functions, and we prefer networks capable of representing highly nonlinear functions. The perceptron unit is another possible choice, but its discontinuous threshold makes it undifferentiable and hence unsuitable for gradient descent. What we need is a unit whose output is a nonlinear function of its inputs, but whose output is also a differentiable function of its inputs. One solution is the sigmoid unit: a unit very much like a perceptron, but based on a smoothed, differentiable threshold function. Like the perceptron, the sigmoid unit first computes a linear combination of its inputs, then applies a threshold to the result. In the case of the sigmoid unit, however, the threshold output is a continuous function of its input.

More precisely, the sigmoid unit computes its output o as

o = σ(w · x), where σ(y) = 1 / (1 + e^(−y))

σ is often called the sigmoid function or, alternatively, the logistic function. Note that its output ranges between 0 and 1, increasing monotonically with its input.

A differentiable threshold unit

What type of unit should be the basis for multilayer networks?

Perceptron: not differentiable → can't use gradient descent

Linear unit: multiple layers of linear units still produce only linear functions

Sigmoid unit: smoothed, differentiable threshold function
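A minimal NumPy sketch of a single sigmoid unit (the weights, bias and input vector are illustrative assumptions):

import numpy as np

def sigmoid(y):
    # Smoothed, differentiable threshold: output in (0, 1), derivative = s * (1 - s)
    return 1.0 / (1.0 + np.exp(-y))

w = np.array([0.5, -0.3])    # weights (illustrative)
b = 0.1                      # bias
x = np.array([1.0, 2.0])     # one input vector

o = sigmoid(np.dot(w, x) + b)   # linear combination followed by the logistic squashing
print(o)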


Illustration of the back-propagation algorithm for learning multilayer perceptrons
The Backpropagation algorithm learns the weights for a multilayer network, given a network with a fixed set of units and interconnections. It employs gradient descent to attempt to minimize the squared error between the network output values and the target values for these outputs. Because we are considering networks with multiple output units rather than single units as before, we begin by redefining E to sum the errors over all of the network output units:

E(w) = (1/2) Σ_{d ∈ D} Σ_{k ∈ outputs} (t_kd − o_kd)^2

where outputs is the set of output units in the network, and t_kd and o_kd are the target and output values associated with the k-th output unit and training example d. The learning problem faced by Backpropagation is to search a large hypothesis space defined by all possible weight values for all the units in the network. The situation can be visualized in terms of an error surface similar to that shown earlier for linear units. The error in that diagram is replaced by our new definition of E, and the other dimensions of the space now correspond to all of the weights associated with all of the units in the network. As in the case of training a single unit, gradient descent can be used to attempt to find a hypothesis to minimize E.


For a simple linear unit with two weights w0 and w1, the axes of the error surface represent possible values for those weights; the w0, w1 plane therefore represents the entire hypothesis space. The vertical axis indicates the error E relative to some fixed set of training examples. The error surface thus summarizes the desirability of every weight vector in the hypothesis space (we desire a hypothesis with minimum error). Given the way in which we chose to define E, for linear units this error surface must always be parabolic with a single global minimum. The specific parabola will depend, of course, on the particular set of training examples.
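To make the weight updates concrete, the following is a minimal NumPy sketch of backpropagation for a tiny multilayer network trained on the XOR problem (the network size, learning rate and iteration count are illustrative assumptions, and convergence can vary with the random initialisation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny 2-4-1 network trained with backpropagation on XOR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))   # input -> hidden weights and biases
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))   # hidden -> output weights and biases
eta = 0.5                                            # learning rate

for _ in range(20000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    o = sigmoid(h @ W2 + b2)
    # Backward pass: error terms for output and hidden units (squared-error loss)
    delta_o = (o - t) * o * (1 - o)
    delta_h = (delta_o @ W2.T) * h * (1 - h)
    # Gradient descent weight updates
    W2 -= eta * h.T @ delta_o
    b2 -= eta * delta_o.sum(axis=0, keepdims=True)
    W1 -= eta * X.T @ delta_h
    b1 -= eta * delta_h.sum(axis=0, keepdims=True)

print(o.round(2))   # outputs should approach [0, 1, 1, 0]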
