statistic inference unit 2 notes

The document discusses the two main categories of machine learning: supervised and unsupervised learning. Supervised learning involves using labeled datasets to train algorithms to predict outcomes, while unsupervised learning focuses on discovering hidden patterns in unlabeled data. Additionally, it covers techniques such as k-fold cross-validation, classification, data scaling, and the k-nearest neighbors algorithm, along with their applications and metrics.

Uploaded by

sixokic135

Supervised & Unsupervised Learning

Unit-2
Machine learning incorporates several hundred statistically based
algorithms, and choosing the right algorithm or combination of algorithms
for the job is a constant challenge for anyone working in this field.

The three overarching categories of machine learning are:

1. Supervised,
2. Unsupervised, and
3. Reinforcement.
Supervised Learning

As the first branch of machine learning, supervised learning concentrates on
learning patterns by relating variables to known outcomes, working with
labeled datasets.

Supervised learning works by feeding the machine sample data with
various features (represented as “X”) and the correct output value of
the data (represented as “y”). The fact that the output and feature values
are known qualifies the dataset as “labeled.” The algorithm then
deciphers patterns that exist in the data and creates a model that can
reproduce the same underlying rules with new data.
Unsupervised Learning

In the case of unsupervised learning, not all variables and data
patterns are classified. Instead, the machine must uncover hidden
patterns and create labels through the use of unsupervised learning
algorithms.
The advantage of unsupervised learning is that it enables you to discover
patterns in the data that you were unaware existed, such as the presence
of two major customer types.

Clustering techniques such as k-means clustering can also provide the
springboard for conducting further analysis after discrete groups have
been discovered.
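
As a minimal sketch of this idea, the snippet below runs scikit-learn's KMeans on a small made-up customer dataset. The data values, the feature names and the choice of two clusters are illustrative assumptions, not part of these notes.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [annual spend, visits per month]
customers = np.array([
    [200.0, 1.0], [220.0, 2.0], [210.0, 1.0],     # low-spend customers
    [900.0, 9.0], [950.0, 10.0], [920.0, 8.0],    # high-spend customers
])

# Ask k-means for two groups; no labels are supplied to the algorithm
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)

print(labels)                   # cluster index assigned to each customer
print(kmeans.cluster_centers_)  # centre of each discovered group
```

The cluster indices themselves are arbitrary (0 or 1); what matters is which customers end up grouped together.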

In industry, unsupervised learning is particularly powerful in fraud
detection, where the most dangerous attacks are often those yet to be
classified.
SUPERVISED LEARNING
Supervised learning (SL) is the machine learning task of learning a function
that maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples. In
supervised learning, each example is a pair consisting of an input object (typically a
vector) and a desired output value (also called the supervisory signal). A supervised
learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples.

An optimal scenario will allow for the algorithm to correctly determine the class
labels for unseen instances. This requires the learning algorithm to generalize from
the training data to unseen situations in a "reasonable" way.
To solve a given problem of supervised learning, one has to perform the following
steps:
1. Determine the type of training examples. First, the user should decide what
kind of data is to be used as a training set.

2. Gather a training set. The training set needs to be representative of the real-world
use of the function. Thus, a set of input objects is gathered and corresponding
outputs are also gathered.

3. Determine the input feature representation of the learned function. The accuracy
of the learned function depends strongly on how the input object is represented.
Typically, the input object is transformed into a feature vector, which contains a
number of features that are descriptive of the object. The number of features
should not be too large, because of the curse of dimensionality, but should contain
enough information to accurately predict the output.
4. Determine the structure of the learned function and corresponding learning
algorithm.

5. Run the learning algorithm on the gathered training set. Some supervised
learning algorithms require the user to determine certain control parameters.
These parameters may be adjusted by optimizing performance on a subset
(called a validation set) of the training set, or via cross-validation.

6. Evaluate the accuracy of the learned function. After parameter adjustment
and learning, the performance of the resulting function should be measured
on a test set that is separate from the training set.
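
The workflow above can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the synthetic dataset from make_classification, the choice of a k-nearest-neighbours classifier and the candidate values for its control parameter are all made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# A synthetic labeled dataset stands in for the gathered training examples
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Keep a test set that is separate from the training set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Choose a learning algorithm and tune its control parameter
# (the number of neighbours) by cross-validation on the training set
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": [1, 3, 5, 7]}, cv=5)
search.fit(X_train, y_train)

# Evaluate the tuned model on the held-out test set
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

The test set is touched only once, after parameter tuning, so the reported accuracy is not biased by the tuning itself.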
k-Fold Cross-Validation
Cross-validation is a re-sampling procedure used to evaluate machine learning
models on a limited data sample.

The procedure has a single parameter called k that refers to the number of
groups that a given data sample is to be split into. As such, the procedure is often
called k-fold cross-validation.

When a specific value for k is chosen, it may be used in place of k in the reference
to the model, such as k=10 becoming 10-fold cross-validation.
The general procedure is as follows:
1. Shuffle the dataset randomly.
2. Split the dataset into k groups.
3. For each unique group:
   a. Take the group as a hold-out or test data set.
   b. Take the remaining groups as a training data set.
   c. Fit a model on the training set and evaluate it on the test set.
   d. Retain the evaluation score and discard the model.
4. Summarize the skill of the model using the sample of model evaluation
scores.
# scikit-learn k-fold cross-validation
from numpy import array
from sklearn.model_selection import KFold

# data sample
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

# prepare cross-validation: split the dataset into 3 folds, shuffle before
# splitting, and seed the pseudorandom number generator with 1
# (shuffle and random_state must be passed as keyword arguments)
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

# enumerate splits: train and test are arrays of indices into data
for train, test in kfold.split(data):
    print('train: %s, test: %s' % (data[train], data[test]))
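
To go from enumerating splits to the full procedure (fit, evaluate, retain the score, summarize), scikit-learn's cross_val_score can run the loop for us. The sketch below is an illustration only: the synthetic dataset, the k-nearest-neighbours model and the choice of 5 folds are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic labeled data standing in for a limited data sample
X, y = make_classification(n_samples=120, n_features=4, random_state=1)

# 5-fold cross-validation with shuffling
kfold = KFold(n_splits=5, shuffle=True, random_state=1)

# Fits a fresh model on each training split and scores it on the held-out fold
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=kfold)

# Summarize the skill of the model using the sample of evaluation scores
print(scores)
print("mean accuracy: %.3f, std: %.3f" % (scores.mean(), scores.std()))
```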


Classification
In machine learning, classification is a supervised learning task that
categorizes a set of data into classes. Classification refers to a
predictive modeling problem where a class label is predicted for a given
example of input data.

The most common classification problems include speech recognition, face
detection, handwriting recognition, and document classification.
Classification is defined as the process of recognizing, understanding, and
grouping objects and ideas into preset categories, a.k.a. “sub-populations.”

With the help of pre-categorized training datasets, classification
programs in machine learning leverage a wide range of algorithms to classify
future datasets into the relevant categories.

In short, classification is a form of “pattern recognition.” Classification
algorithms applied to the training data find the same patterns (similar
number sequences, words or sentiments, and the like) in future datasets.
DATA SCALING

Machine learning models assign weights to the input variables according to
their values and their influence on the output. If the values of different
features differ greatly in magnitude, the model may need to assign very
large weights to some of them, and a model with large weight values is
often unstable. This means the model can perform poorly during learning
or produce poor results.

Normalization and standardization are the two main methods for scaling
data, and they are widely used with algorithms that require scaling.
Both can be implemented with scikit-learn's preprocessing module
(sklearn.preprocessing).
Data Normalization

In statistics, normalization is the method of rescaling data so that all the
data points fit in the range 0 to 1, which brings the data points closer to
each other.

Python Implementation

from sklearn.preprocessing import MinMaxScaler
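
A short usage sketch follows. The feature matrix is made up for illustration; MinMaxScaler rescales each column as (x - min) / (max - min), using that column's own minimum and maximum.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features on very different scales: [age in years, income]
X = np.array([[20.0, 20000.0],
              [30.0, 50000.0],
              [40.0, 80000.0]])

scaler = MinMaxScaler()              # default feature_range is (0, 1)
X_scaled = scaler.fit_transform(X)   # (x - min) / (max - min), per column

print(X_scaled)   # every column now runs from 0 to 1
```

In practice the scaler would usually be fitted on the training set only and then applied to the test set with scaler.transform.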


Data Standardization

The basic concept behind standardization is to centre the data points of a
feature about their mean and rescale them to unit standard deviation. This
means the mean of the rescaled data points will be zero and the standard
deviation will be 1. Standardization only shifts and rescales the values; it
does not change the shape of their distribution.

Python Implementation

from sklearn.preprocessing import StandardScaler
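
A short usage sketch follows, again on a made-up feature matrix. StandardScaler rescales each column as (x - mean) / std, using that column's own mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales: [age in years, income]
X = np.array([[20.0, 20000.0],
              [30.0, 50000.0],
              [40.0, 80000.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)   # (x - mean) / std, per column

print(X_std.mean(axis=0))   # approximately 0 for each column
print(X_std.std(axis=0))    # approximately 1 for each column
```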


K NEAREST NEIGHBOURS
K-Nearest Neighbor is a classification and prediction algorithm that
assigns data to classes based on the distance between the data
points. K-Nearest Neighbor assumes that data points which are close
to one another must be similar, and hence the data point to be
classified is given the class most common among its closest neighbours.
The following two properties define KNN well:

Lazy learning algorithm: KNN is a lazy learning algorithm
because it does not have a specialized training phase and uses all the
training data at classification time.

Non-parametric learning algorithm: KNN is also a non-parametric
learning algorithm because it doesn’t assume anything
about the underlying data.
k-Nearest Neighbors
Given a query item:
1. Find the k closest matches in a labeled dataset.
2. Return the most frequent label among them.
Examples from the slide figures:
• k = 3: votes for “cat”.
• 2 votes for Cat, 1 each for Buffalo, Deer and Lion: Cat wins.
The kNN algorithm below classifies one piece of data called my_X.

Pseudo code for this function would look like this:

1. for every point in our dataset, calculate the distance between my_X and that point
2. sort the distances in increasing order
3. take the k items with the lowest distances to my_X
4. find the majority class among these items
5. return the majority class as our prediction for the class of my_X
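
The pseudo code can be turned into a small runnable sketch in plain Python. The function name knn_classify, the use of Euclidean distance and the toy dataset are assumptions made for illustration; only my_X and k come from the notes.

```python
import math
from collections import Counter

def knn_classify(my_X, points, labels, k=3):
    """Predict the class of my_X by majority vote among its k nearest points."""
    # Distance from my_X to every point in the dataset (Euclidean here)
    distances = [(math.dist(my_X, p), label) for p, label in zip(points, labels)]
    # Sort by increasing distance and keep the k closest items
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority class among the k nearest items
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled dataset
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
labels = ["cat", "cat", "cat", "lion", "lion"]

print(knn_classify((1.1, 1.0), points, labels, k=3))  # cat
```

When two classes tie, this sketch returns the tied class whose vote was seen first, that is, the one with the nearer neighbour; other tie-breaking rules are equally valid.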
K-NN metrics
• Euclidean Distance: Simplest, fast to compute
  $d(x, y) = \lVert x - y \rVert$
• Cosine Distance: Good for documents, images, etc.
  $d(x, y) = 1 - \dfrac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$
• Jaccard Distance: For set data
  $d(X, Y) = 1 - \dfrac{|X \cap Y|}{|X \cup Y|}$
• Hamming Distance: For string data
  $d(x, y) = \sum_{i=1}^{n} \mathbf{1}[x_i \neq y_i]$
• Manhattan Distance: Coordinate-wise distance
  $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
• Edit Distance: For strings, especially genetic data.
• Mahalanobis Distance: Normalized by the sample covariance matrix;
  unaffected by coordinate transformations.
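
Several of these metrics are short enough to write out directly. The plain-Python functions below are a sketch for illustration (the function names are made up here); libraries such as SciPy and scikit-learn provide tested implementations.

```python
import math

def euclidean(x, y):
    # Square root of the summed squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # Sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    # 1 minus the cosine of the angle between x and y (both must be non-zero)
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

def jaccard_distance(X, Y):
    # 1 minus the shared fraction of the union of two sets (union must be non-empty)
    X, Y = set(X), set(Y)
    return 1.0 - len(X & Y) / len(X | Y)

def hamming(x, y):
    # Number of positions at which two equal-length sequences differ
    return sum(a != b for a, b in zip(x, y))

print(euclidean((0, 0), (3, 4)))         # 5.0
print(manhattan((0, 0), (3, 4)))         # 7
print(cosine_distance((1, 0), (0, 1)))   # 1.0
print(jaccard_distance("abc", "bcd"))    # 0.5
print(hamming("karolin", "kathrin"))     # 3
```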
Applications of KNN

The following are some of the areas in which KNN can be applied successfully:

Banking System
KNN can be used in a banking system to predict whether an individual is fit for
loan approval: does that individual have characteristics similar to those of
defaulters?

Calculating Credit Ratings
KNN algorithms can be used to find an individual’s credit rating by comparing
them with persons having similar traits.

Politics
With the help of KNN algorithms, we can classify a potential voter into various
classes like “Will Vote”, “Will Not Vote”, “Will Vote for Party ‘Congress’”, and
“Will Vote for Party ‘BJP’”.

Other areas in which the KNN algorithm can be used are Speech Recognition,
Handwriting Detection, Image Recognition and Video Recognition.
k-NN Issues
The Data is the Model
• No training needed.
• Accuracy generally improves with more data.
• Matching is simple and fast (and single pass).
• Usually need data in memory, but can be run off disk.
Minimal Configuration:
• Only parameter is k (number of neighbors)
• Two other choices are important:
– Weighting of neighbors (e.g. inverse distance)
– Similarity metric
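
These three choices map directly onto the n_neighbors, weights and metric parameters of scikit-learn's KNeighborsClassifier. The sketch below uses a made-up one-dimensional dataset to show how inverse-distance weighting can change a prediction.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical dataset: one "a" point close to the query, two "b" points far away
X = [[0.0], [5.0], [6.0]]
y = ["a", "b", "b"]

# Same k and metric; only the neighbour weighting differs
uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform",
                               metric="manhattan").fit(X, y)
weighted = KNeighborsClassifier(n_neighbors=3, weights="distance",
                                metric="manhattan").fit(X, y)

query = [[1.0]]
print(uniform.predict(query))    # 'b': two of the three neighbours are "b"
print(weighted.predict(query))   # 'a': the single close neighbour outweighs them
```

With inverse-distance weights the "a" neighbour at distance 1 contributes 1, while the two "b" neighbours at distances 4 and 5 contribute 0.25 + 0.2 = 0.45 in total.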
