Similarity-Based Learning (Part 2)
Handling Noisy Data:
The nearest neighbor algorithm is a set of local models, each defined using a single instance.
The k nearest neighbor model predicts the target level with the majority vote from the set of the k nearest neighbors to the query q.
A plain majority vote assigns equal importance to very near and very far neighbors in the final decision, which motivates weighting each neighbor's vote by its distance.
With a distance-weighted vote, k can now be set to the size of the entire dataset (or to a best value of k selected experimentally).
By giving all the instances in the dataset a distance-weighted vote, the impact of a noisy instance is reduced (a code sketch follows below).
If the dataset is very large, however, the computations using all the training instances can become too expensive to be feasible.
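As a rough illustration (not from the slides), a minimal Python sketch of the distance-weighted vote; the toy data, the function names, and the inverse-squared-distance weighting are assumptions made here:

import math
from collections import defaultdict

def euclidean(a, b):
    # straight-line distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def weighted_knn_classify(dataset, query, k=None):
    """Predict a target level by a distance-weighted vote.

    dataset: list of (feature_vector, target_level) pairs.
    k: number of neighbors to use; None means use the whole dataset.
    """
    # sort the training instances by distance to the query
    neighbors = sorted(dataset, key=lambda inst: euclidean(inst[0], query))
    if k is not None:
        neighbors = neighbors[:k]

    votes = defaultdict(float)
    for features, level in neighbors:
        d = euclidean(features, query)
        # weight each vote by the reciprocal of the squared distance;
        # an exact match simply wins outright
        if d == 0:
            return level
        votes[level] += 1.0 / (d ** 2)
    return max(votes, key=votes.get)

# tiny made-up example: two features, two target levels
train = [((1.0, 1.0), "yes"), ((1.2, 0.9), "yes"), ((5.0, 5.0), "no"), ((0.9, 4.8), "yes")]
print(weighted_knn_classify(train, (1.1, 1.0)))        # expected "yes"
print(weighted_knn_classify(train, (4.5, 4.9), k=3))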
Data Normalization:
When the values of one feature are much larger than the values of another feature, the first one dominates the computation of the Euclidean distance.
The solution for this is normalization; the equation for range normalization of a value a into a new interval [low, high] is (sketched in code after the points below):
a' = ((a - min(a)) / (max(a) - min(a))) * (high - low) + low
Rankings based on distances computed over all features can (sometimes) differ from those based on a single feature.
Normalization prevents bias – It ensures that features with larger values don’t
dominate the distance metric.
Normalization is widely used – It’s necessary not just for k-nearest neighbors
(KNN) but for many machine learning algorithms.
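A minimal sketch of range normalization into [low, high]; the feature values below are made up:

def range_normalize(values, low=0.0, high=1.0):
    # map each value a to low + ((a - min) / (max - min)) * (high - low)
    lo, hi = min(values), max(values)
    return [low + ((a - lo) / (hi - lo)) * (high - low) for a in values]

salaries = [30000, 45000, 52000, 110000]   # hypothetical feature with large values
ages = [23, 35, 41, 58]                    # hypothetical feature with small values
print(range_normalize(salaries))           # both features now lie in [0, 1]
print(range_normalize(ages))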
For a continuous target feature, return the average value in the neighborhood rather than the majority target level.
For datasets that only sparsely populate the feature space, kNN models usually make more accurate predictions.
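A minimal sketch of predicting a continuous target as the neighborhood average; the rental data and function name are hypothetical:

import math

def knn_regress(dataset, query, k=3):
    """Predict a continuous target as the average of the k nearest neighbors' values."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = sorted(dataset, key=lambda inst: dist(inst[0], query))[:k]
    return sum(target for _, target in neighbors) / len(neighbors)

# made-up data: (features, rental price)
rentals = [((50, 1), 700), ((55, 1), 760), ((80, 2), 1100), ((120, 3), 1600)]
print(knn_regress(rentals, (60, 1), k=2))   # average of the two closest prices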
Some models use measures of similarity that do not meet all the criteria of a metric; these are called similarity indexes.
For binary features, these indexes are defined in terms of:
co-presence (CP): how often a true value occurred for the same feature in both the query data q and the data for the comparison user (d1 or d2)
co-absence (CA): how often a false value occurred for the same feature in both the query data q and the data for the comparison user (d1 or d2)
⇒ In simple terms, CP measures shared yes values, while CA measures shared no values.
Russell-Rao:
One way of judging similarity is to focus solely on co-presence.
The Russell-Rao similarity index is measured in terms of the ratio between the number of co-presences and the total number of binary features considered:
Sim_RR(q, d) = CP(q, d) / (total number of binary features)
Sokal-Michener:
In some domains co-absence is also important; the Sokal-Michener index counts both co-presences and co-absences against the total number of binary features:
Sim_SM(q, d) = (CP(q, d) + CA(q, d)) / (total number of binary features)
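A minimal sketch of computing CP, CA and the two indexes above for binary feature vectors; the example vectors are made up:

def co_presence(q, d):
    # number of features that are true (1) in both instances
    return sum(1 for a, b in zip(q, d) if a == 1 and b == 1)

def co_absence(q, d):
    # number of features that are false (0) in both instances
    return sum(1 for a, b in zip(q, d) if a == 0 and b == 0)

def russell_rao(q, d):
    # co-presences divided by the total number of binary features
    return co_presence(q, d) / len(q)

def sokal_michener(q, d):
    # co-presences plus co-absences divided by the total number of features
    return (co_presence(q, d) + co_absence(q, d)) / len(q)

q  = [1, 0, 1, 1, 0]
d1 = [1, 0, 0, 1, 0]
print(russell_rao(q, d1))      # 2 / 5 = 0.4
print(sokal_michener(q, d1))   # (2 + 2) / 5 = 0.8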
Cosine similarity:
The cosine similarity between two instances is the cosine of the inner angle between the two vectors that extend from the origin to each instance:
cos(q, d) = (q · d) / (|q| × |d|)
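A minimal sketch of the cosine similarity; the example vectors are made up:

import math

def cosine_similarity(a, b):
    # cosine of the angle between the two vectors from the origin
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity((3.0, 4.0), (6.0, 8.0)))   # 1.0: same direction
print(cosine_similarity((1.0, 0.0), (0.0, 1.0)))   # 0.0: orthogonal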
Mahalanobis distance:
The Mahalanobis distance uses covariance to scale distances.
This ensures that distances along directions where the dataset is spread out are scaled down:
Mahalanobis(a, b) = sqrt((a - b)^T × Σ^{-1} × (a - b)), where Σ is the covariance matrix of the dataset.
The question in the slide asks whether the Mahalanobis distance is the same as normalizing the features and then using the Euclidean distance. In general it is not: per-feature normalization only rescales each feature independently, while the Mahalanobis distance also accounts for correlations between features through the covariance matrix (the two coincide only when the features are uncorrelated).
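A minimal sketch of the Mahalanobis distance, assuming NumPy is available; the correlated dataset below is made up:

import numpy as np

def mahalanobis(a, b, data):
    """Distance between a and b scaled by the covariance of the dataset."""
    cov = np.cov(data, rowvar=False)     # feature covariance matrix
    cov_inv = np.linalg.inv(cov)         # assumes the covariance is invertible
    diff = np.asarray(a) - np.asarray(b)
    return float(np.sqrt(diff @ cov_inv @ diff))

# made-up dataset with two correlated features
X = np.array([[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 8.1], [5.0, 9.8]])
print(mahalanobis(X[0], X[4], X))
print(np.linalg.norm(X[0] - X[4]))       # plain Euclidean distance, for comparison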
Subset selection:
Subset selection picks the best set of features from the generated options.
One method is a filter, which selects the most predictive features using an evaluation measure computed independently of any model. A more common method is a wrapper, which tests how well a model actually performs with each candidate feature set before choosing the best one.
The goal of any feature selection approach is to identify the smallest subset of descriptive features that maintains overall model performance.
Feature selection can be framed as a greedy local search problem, where each state in the search space specifies a subset of the possible features.
⇒ The search can move through the search space in a number of ways and stops when a termination condition is met (a sketch of greedy forward search with a wrapper follows below).
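As a rough sketch (an assumption-laden illustration, not from the slides) of greedy forward search with a wrapper: evaluate(subset) is a caller-supplied stand-in that is assumed to train and validate a model on that feature subset and return its performance score:

def greedy_forward_selection(all_features, evaluate, tolerance=0.0):
    """Greedy local search over feature subsets using a wrapper evaluation."""
    selected = set()
    best_score = evaluate(selected)
    while True:
        # try every single-feature extension of the current subset (the "moves")
        candidates = [
            (evaluate(selected | {f}), f)
            for f in all_features if f not in selected
        ]
        if not candidates:
            break
        score, feature = max(candidates)
        # termination condition: stop when no extension improves performance
        if score <= best_score + tolerance:
            break
        selected.add(feature)
        best_score = score
    return selected, best_score

# toy stand-in evaluator: pretends "size" and "rooms" are the predictive features
fake_scores = {frozenset(): 0.5, frozenset({"size"}): 0.7,
               frozenset({"rooms"}): 0.65, frozenset({"size", "rooms"}): 0.8}
evaluate = lambda s: fake_scores.get(frozenset(s), 0.6)
print(greedy_forward_selection(["size", "rooms", "colour"], evaluate))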
The k-d tree is one of the best known of these indices: a balanced binary tree in which each node indexes one of the instances in the training dataset.
During retrieval, check whether a closer instance could still exist on the other side of a splitting boundary, i.e. whether a hypersphere around the query with a radius equal to the current best distance crosses that boundary.
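A minimal sketch of building and querying a k-d tree index with SciPy (assuming SciPy is installed; the random data and query are made up):

import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
X = rng.random((1000, 3))             # made-up training instances, 3 features each
tree = KDTree(X)                      # build the index once

query = np.array([0.5, 0.5, 0.5])
dist, idx = tree.query(query, k=3)    # 3 nearest neighbors of the query
print(dist, X[idx])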
Ball tree:
A data structure for speeding up nearest neighbor searches.
How it works: it divides the space using hyperspheres (balls) rather than axis-aligned splits, with each ball holding a subset of the instances.
Benefits: whole balls that lie farther away than the current best distance can be pruned from the search.
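A minimal sketch of the same kind of query with scikit-learn's BallTree (assuming scikit-learn is installed; the data is made up):

import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
X = rng.random((1000, 10))                 # higher-dimensional made-up data
tree = BallTree(X)                         # instances grouped into nested hyperspheres
dist, ind = tree.query(X[:1], k=3)         # nearest neighbors of the first instance
print(dist, ind)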
Summary:
Nearest neighbor models are very sensitive to noise in the target feature; the easiest way to solve this problem is to employ a k nearest neighbor model with k set greater than 1 (ideally with a distance-weighted vote).