Statistical Inference Unit 2 Notes
Unit 2: Learning
Machine learning incorporates several hundred statistics-based
algorithms, and choosing the right algorithm or combination of algorithms
for the job is a constant challenge for anyone working in this field.
In supervised learning, an optimal scenario is one in which the algorithm correctly determines
the class labels for unseen instances. This requires the learning algorithm to generalize from
the training data to unseen situations in a "reasonable" way.
To solve a given problem of supervised learning, one has to perform the following
steps:
1. Determine the type of training examples. First of all, the user should decide what
kind of data is to be used as the training set.
2. Gather a training set. The training set needs to be representative of the real-world
use of the function. Thus, a set of input objects is gathered and corresponding
outputs are also gathered.
3. Determine the input feature representation of the learned function. The accuracy
of the learned function depends strongly on how the input object is represented.
Typically, the input object is transformed into a feature vector, which contains a
number of features that are descriptive of the object. The number of features
should not be too large, because of the curse of dimensionality; but the vector
should contain enough information to accurately predict the output.
4. Determine the structure of the learned function and corresponding learning
algorithm.
5. Run the learning algorithm on the gathered training set. Some supervised
learning algorithms require the user to determine certain control parameters.
These parameters may be adjusted by optimizing performance on a subset
(called a validation set) of the training set, or via cross-validation.
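As a sketch of step 5, the snippet below tunes one such control parameter on a validation set carved out of the training data. It assumes scikit-learn and its bundled iris dataset; the candidate parameter values are illustrative.

```python
# Tune a control parameter (here k of a k-NN classifier) on a validation set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# hold out part of the training data as a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

best_k, best_score = None, -1.0
for k in (1, 3, 5, 7, 9):               # candidate values of the control parameter
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    score = model.score(X_val, y_val)   # accuracy on the validation set
    if score > best_score:
        best_k, best_score = k, score
print('best k:', best_k)
```

The test-set data is never touched during this tuning loop; only training data is split further.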
The procedure has a single parameter called k that refers to the number of
groups that a given data sample is to be split into. As such, the procedure is often
called k-fold cross-validation.
When a specific value for k is chosen, it may be used in place of k in the reference
to the model, such as k=10 becoming 10-fold cross-validation.
The general procedure is as follows:
1. Shuffle the dataset randomly.
2. Split the dataset into k groups
3. For each unique group:
a. Take the group as a hold out or test data set
b. Take the remaining groups as a training data set
c. Fit a model on the training set and evaluate it on the test set
d. Retain the evaluation score and discard the model
4. Summarize the skill of the model using the sample of model evaluation
scores
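The four steps above can be written out directly in plain Python. The toy dataset and the "model" (just the training mean, scored by negative squared error) are illustrative stand-ins:

```python
import random

data = list(range(20))                   # toy dataset
k = 5
random.seed(1)
random.shuffle(data)                     # 1. shuffle the dataset randomly
folds = [data[i::k] for i in range(k)]   # 2. split the dataset into k groups

scores = []
for i in range(k):                       # 3. for each unique group:
    test = folds[i]                      #    a. take the group as the hold-out set
    train = [x for j, fold in enumerate(folds) if j != i for x in fold]  # b. the rest is training data
    model = sum(train) / len(train)      #    c. "fit" a trivial model (the mean)
    score = -sum((x - model) ** 2 for x in test) / len(test)  # evaluate on the test set
    scores.append(score)                 #    d. retain the score, discard the model
print('mean score:', sum(scores) / len(scores))  # 4. summarize the skill
```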
# scikit-learn k-fold cross-validation
from numpy import array
from sklearn.model_selection import KFold
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])  # data sample
# splits a dataset into 3 folds, shuffles prior to the split, and uses a value of 1 for the seed
kfold = KFold(n_splits=3, shuffle=True, random_state=1)
# enumerate splits
for train, test in kfold.split(data):
    print('train: %s, test: %s' % (data[train], data[test]))
Normalization and standardization are the two main methods for
scaling data, and both are widely used in algorithms where
scaling is required. Both can be implemented with the scikit-learn
library's preprocessing package (sklearn.preprocessing).
Data Normalization
Normalization (min-max scaling) rescales each feature so that its values lie in a fixed range, typically [0, 1].
Python Implementation
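A minimal sketch using scikit-learn's MinMaxScaler; the data values are made up:

```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

data = np.array([[100.0, 0.001],
                 [8.0,   0.05],
                 [50.0,  0.005],
                 [88.0,  0.07],
                 [4.0,   0.1]])
scaler = MinMaxScaler()                  # rescales each feature to the [0, 1] range
normalized = scaler.fit_transform(data)
print(normalized)
```

After scaling, the minimum of each column is 0 and the maximum is 1, so features measured on very different scales become comparable.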
The basic concept behind the standardization function is to centre the data points
about the mean of all the data points present in a feature, with a unit
standard deviation. This means the mean of the rescaled data points will be zero and the
standard deviation will be 1. So in standardization, the data points are rescaled
so that each feature has zero mean and unit variance after scaling.
Python Implementation
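A minimal sketch using scikit-learn's StandardScaler; the data values are made up:

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[1.0, -1.0],
                 [2.0,  0.0],
                 [3.0,  1.0],
                 [4.0,  2.0]])
scaler = StandardScaler()                  # centres each feature and scales to unit std
standardized = scaler.fit_transform(data)
print(standardized.mean(axis=0))           # ~[0, 0]
print(standardized.std(axis=0))            # ~[1, 1]
```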
K-NN metrics
• Hamming Distance: the number of coordinates in which x and y differ (for categorical or binary features)

d(x, y) = \sum_{i=1}^{n} [x_i \neq y_i]

• Manhattan Distance: coordinate-wise distance

d(x, y) = \sum_{i=1}^{n} |x_i - y_i|
• Edit Distance: for strings, especially genetic data.
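The distance metrics above (Hamming, Manhattan, and edit distance) can be sketched in plain Python; the helper names and toy inputs are illustrative, and edit distance here is the Levenshtein variant:

```python
def hamming(x, y):
    # number of positions where the coordinates differ
    return sum(xi != yi for xi, yi in zip(x, y))

def manhattan(x, y):
    # sum of coordinate-wise absolute differences
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def edit_distance(a, b):
    # Levenshtein distance via dynamic programming, one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

print(hamming([1, 0, 1], [1, 1, 1]))      # 1
print(manhattan([1, 2], [4, 6]))          # 7
print(edit_distance('kitten', 'sitting')) # 3
```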
The following are some of the areas in which KNN can be applied successfully −
Banking System
KNN can be used in a banking system to predict whether an individual is fit for
loan approval, i.e. whether that individual has characteristics similar to those
of known defaulters.
Politics
With the help of KNN algorithms, we can classify a potential voter into various
classes such as “Will Vote”, “Will Not Vote”, “Will Vote for Party ‘Congress’”, and “Will
Vote for Party ‘BJP’”.
Other areas in which KNN algorithm can be used are Speech Recognition,
Handwriting Detection, Image Recognition and Video Recognition.
k-NN Issues
The Data is the Model
• No training needed.
• Accuracy generally improves with more data.
• Matching is simple and fast (and single pass).
• Usually need data in memory, but can be run off disk.
Minimal Configuration:
• Only parameter is k (number of neighbors)
• Two other choices are important:
– Weighting of neighbors (e.g. inverse distance)
– Similarity metric
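The three choices above (k, the neighbor weighting, and the similarity metric) map directly onto parameters of scikit-learn's KNeighborsClassifier; a sketch on the bundled iris data:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(
    n_neighbors=5,          # k, the number of neighbors
    weights='distance',     # weight neighbors by inverse distance
    metric='manhattan',     # similarity metric
)
model.fit(X, y)             # "training" just stores the data
print(model.predict(X[:3]))
print(model.score(X, y))
```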