
Similarity Based learning (part 2):
Handling Noisy Data:
The NN algorithm is a set of local models, each defined using a single instance.

Consequently, the algorithm is sensitive to noise:

any errors in the features or labeling of the training data produce erroneous local models and incorrect predictions.

The k-nearest neighbor model predicts the target level with the majority vote from the set of the k nearest neighbors to the query q:
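Presumably the standard majority-vote rule, with δ(tᵢ, l) = 1 when neighbor dᵢ has target level l and 0 otherwise:

$$
\mathbb{M}_k(q) = \arg\max_{l \in levels(t)} \sum_{i=1}^{k} \delta(t_i, l)
$$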

K-Nearest Neighbor assigns equal importance to both very far and near
examples in the final decision

The contribution of each neighbor is determined by the inverse of the squared distance between the neighbor d and the query q:
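That is, each neighbor dᵢ gets the weight

$$
w_i = \frac{1}{dist(q, d_i)^2}
$$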



The weighted k nearest neighbor model is defined as:
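Presumably the distance-weighted majority vote:

$$
\mathbb{M}_k(q) = \arg\max_{l \in levels(t)} \sum_{i=1}^{k} \frac{1}{dist(q, d_i)^2} \times \delta(t_i, l)
$$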

K can now be set to the size of the entire dataset (or to a best value of k selected experimentally)

By giving all the instances in the dataset a weighted vote, the impact of noisy instances is reduced

the decision boundary is much smoother with the weighting mechanism

There are two situations where the weighted k-NN approach can be problematic:

Imbalanced dataset: even with a weighting applied to the contribution of the training instances, the majority target level may dominate

The dataset is very large: the computations using all the training instances can become too expensive to be feasible

⇒ Imbalanced dataset: an imbalanced dataset is a dataset that contains significantly more instances of one target level than another

IMPORTANT NOTE: CHECK THE TEXTBOOK PAGE 196 FOR EFFICIENT MEMORY SEARCH (SOMETHING LOOKS USEFUL AND GOOD)

Data Normalization:



⇒ check example slides 15 ~ 22

The odd wrong prediction is caused by features taking different ranges of values; this is equivalent to features having different variances.

When the values of one feature are much larger than the values of another feature, the first one dominates the computation of the Euclidean distance.

The solution for this is normalization; the equation for range normalization into a new interval [low, high] is:
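Presumably the standard range-normalization formula:

$$
a'_i = \frac{a_i - \min(a)}{\max(a) - \min(a)} \times (high - low) + low
$$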

Normalization impact on rankings:

Both the data instances and the query undergo normalization

Post-normalization, the relative contributions of the features change significantly, altering the rankings

Rankings based on the distance over all features can (sometimes) differ from those based on a single feature

Data normalization is crucial for various ML algorithms, not just nearest neighbor

Distance computations depend on feature values – If features have different ranges, the distance calculation can be biased.

Normalization prevents bias – It ensures that features with larger values don’t dominate the distance metric.

Equal contribution of features – Normalization makes sure all features are treated fairly when computing distances.

Normalization is widely used – It’s necessary not just for k-nearest neighbors (KNN) but for many machine learning algorithms.



Predicting Continuous Targets:
It is relatively easy to adapt the KNN approach to handle continuous target features:

return the average value in the neighborhood rather than the majority target level:
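Presumably:

$$
\mathbb{M}_k(q) = \frac{1}{k} \sum_{i=1}^{k} t_i
$$

where tᵢ is the target value of the i-th nearest neighbor.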

⇒ check example slide 25 ~ 28

In a weighted k-nearest neighbor model the prediction equation is changed to:
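Presumably the distance-weighted average:

$$
\mathbb{M}_k(q) = \frac{\sum_{i=1}^{k} \frac{1}{dist(q, d_i)^2} \times t_i}{\sum_{i=1}^{k} \frac{1}{dist(q, d_i)^2}}
$$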

In general, standard k-nearest neighbor models and weighted k-nearest neighbor models will produce very similar results when the feature space is well populated

For datasets that only sparsely populate the feature space, weighted k-NN models usually make more accurate predictions



⇒ from chatty :

Why is this important for k-NN?


In k-nearest neighbors (k-NN), predictions are made based on the closest data
points. If the feature space is sparsely populated:
✅ The nearest neighbors are farther apart, reducing the chance of overfitting.
✅ k-NN can work better because the distances between points are more
meaningful.
On the other hand, if the feature space were densely populated, the model might
struggle to generalize due to too many close data points.

⇒ check this about the k-d tree :https://www.youtube.com/watch?v=Y4ZgLlDfKDg

Other Measures of similarity:

Metrics vs indexes:
Metrics must satisfy four criteria: non-negativity, identity, symmetry, and the triangular inequality

Some models use measures of similarity that don’t meet all of these criteria; these are called indexes

Certain techniques, like k-d trees, strictly require similarity measures to be metrics

⇒ Check example slides 34 ~

Similarity indexes for binary descriptive features:

If the descriptive features in a dataset are binary, it is often a good idea to use a similarity index that defines similarity between instances specifically in terms of the co-presence or co-absence of features, rather than an index based on distance,

such that:



co-presence (CP): how often a true value occurred for the same feature in both the query data q and the data for the comparison user (d1 or d2)

co-absence (CA): how often a false value occurred for the same feature in both the query data q and the data for the comparison user (d1 or d2)

presence-absence (PA): how often a true value occurred in the query data q when a false value occurred in the data for the comparison user (d1 or d2) for the same feature

absence-presence (AP): how often a false value occurred in the query data q when a true value occurred in the data for the comparison user (d1 or d2) for the same feature

In simple terms, CP measures shared yes values, while CA measures shared no values.

⇒ recheck example slide 36 (IMPORTANT)

Russel-Rao:
One way of judging similarity is to focus solely on co-presence.

The Russel-Rao similarity index is measured in terms of the ratio between the number of co-presences and the total number of binary features considered:
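In terms of the four counts above (whose sum is the total number of binary features), this is presumably:

$$
sim_{RR}(q, d) = \frac{CP(q, d)}{CP(q, d) + CA(q, d) + PA(q, d) + AP(q, d)}
$$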

Sokal-Michener:
In some domains co-absence is important.

Sokal-Michener is defined as the ratio between the total number of co-presences and co-absences, and the total number of binary features considered:
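Presumably:

$$
sim_{SM}(q, d) = \frac{CP(q, d) + CA(q, d)}{CP(q, d) + CA(q, d) + PA(q, d) + AP(q, d)}
$$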



Jaccard index:
The Jaccard index ignores co-absences:
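Presumably:

$$
sim_{J}(q, d) = \frac{CP(q, d)}{CP(q, d) + PA(q, d) + AP(q, d)}
$$

A minimal Python sketch computing the four counts and these three indexes for two binary feature vectors (the function name and example vectors are made up for illustration):

```python
def binary_similarity_indexes(q, d):
    """Russel-Rao, Sokal-Michener and Jaccard indexes for two
    equal-length binary (0/1) feature vectors."""
    cp = sum(1 for a, b in zip(q, d) if a == 1 and b == 1)  # co-presence
    ca = sum(1 for a, b in zip(q, d) if a == 0 and b == 0)  # co-absence
    pa = sum(1 for a, b in zip(q, d) if a == 1 and b == 0)  # presence-absence
    ap = sum(1 for a, b in zip(q, d) if a == 0 and b == 1)  # absence-presence
    total = cp + ca + pa + ap
    return {
        "russel_rao": cp / total,
        "sokal_michener": (cp + ca) / total,
        "jaccard": cp / (cp + pa + ap) if (cp + pa + ap) else 0.0,
    }

# query vs one comparison user
print(binary_similarity_indexes([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))
# {'russel_rao': 0.4, 'sokal_michener': 0.6, 'jaccard': 0.5}
```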

Cosine similarity: (index)

Can be used as a measure of the similarity between instances with continuous descriptive features

Cosine similarity is an especially useful measure of similarity when the descriptive features describing the instances in a dataset are related to each other

The cosine similarity between two instances is the cosine of the inner angle between the two vectors that extend from the origin to each instance:

We compute the cosine similarity between two instances as the normalized dot product of the descriptive feature values of the instances; the dot product is normalized by the product of the lengths of the descriptive feature value vectors. The dot product of two instances:
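In standard notation, for instances with m descriptive features:

$$
\mathbf{q} \cdot \mathbf{d} = \sum_{i=1}^{m} q_i \times d_i, \qquad
sim_{cos}(q, d) = \frac{\mathbf{q} \cdot \mathbf{d}}{\|\mathbf{q}\| \, \|\mathbf{d}\|}
$$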



Geometrically, the dot product can be interpreted as the cosine of the angle between the two vectors multiplied by the lengths of the two vectors:
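I.e.:

$$
\mathbf{q} \cdot \mathbf{d} = \|\mathbf{q}\| \, \|\mathbf{d}\| \, \cos(\theta)
$$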

Mahalanobis distance:
The Mahalanobis distance utilizes covariance to scale distances.

This ensures that distances along directions where the dataset is spread out are scaled down,



while distances along directions where the dataset is tightly packed are scaled
up
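With Σ the covariance matrix of the dataset, the standard definition is:

$$
Mahalanobis(q, d) = \sqrt{(q - d)\, \Sigma^{-1} \,(q - d)^{T}}
$$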

⇒ check textbook page 219 for clarification


The Mahalanobis distance is similar to the Euclidean distance, but it improves on
it in an important way.

Euclidean distance just measures straight-line distance, treating all features equally.

Mahalanobis distance adjusts for the relationships between features (correlations) and their different scales using a special matrix (the inverse covariance matrix).

The question in the slide asks if Mahalanobis distance is the same as normalizing
features and then using Euclidean distance.

Answer: Not exactly. While normalizing helps by scaling features, Mahalanobis distance does more: it also removes the effects of correlations between features, making it better for cases where features are dependent on each other.



Feature selection:
⇒ Different classes of descriptive features:

Predictive: a predictive descriptive feature provides information that is useful in estimating the correct value of a target feature

Interacting: by itself, an interacting descriptive feature is not informative about the value of the target feature; in conjunction with one or more other features, however, it becomes informative

Redundant: a descriptive feature is redundant if it has a strong correlation with another descriptive feature

Irrelevant: an irrelevant descriptive feature does not provide information that is useful in estimating the value of the target feature

⇒ When framed as a local search problem, feature selection is defined in terms of an iterative process consisting of the following stages:



Subset Generation: this component generates a set of candidate feature subsets that are successors of the current best feature subset

Subset Selection: this component picks the best set of features from the generated options. One method is using a filter, which selects the most predictive features based on an evaluation measure. A more common method is using a wrapper, which tests how well a model performs with each feature set before choosing the best one.

termination condition

⇒ The search can move through the search space in a number of ways:

Forward sequential selection (see the sketch below)

Backward sequential selection
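A minimal sketch of forward sequential selection using a wrapper (cross-validated k-NN accuracy) as the subset-selection step; the function name, the k_neighbors parameter, and the use of scikit-learn here are illustrative assumptions, not from the slides:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def forward_sequential_selection(X, y, k_neighbors=3):
    """Greedy forward selection: repeatedly add the single feature that most
    improves the wrapper score, stopping when no addition helps.
    X: NumPy array of shape (n_samples, n_features); y: target labels."""
    n_features = X.shape[1]
    selected, best_score = [], -np.inf
    improved = True
    while improved and len(selected) < n_features:
        improved = False
        for f in range(n_features):
            if f in selected:
                continue
            candidate = selected + [f]                      # subset generation
            model = KNeighborsClassifier(n_neighbors=k_neighbors)
            score = cross_val_score(model, X[:, candidate], y, cv=5).mean()
            if score > best_score:                          # subset selection (wrapper)
                best_score, best_candidate = score, candidate
                improved = True
        if improved:
            selected = best_candidate
    return selected, best_score                             # termination: no improvement
```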

The goal of any feature selection approach is to identify the smallest subset of
descriptive features that maintains overall model performance

The most popular and straightforward approach to feature selection is to rank and prune

Feature selection can also be framed as a greedy local search problem, where each state in the search space specifies a subset of the possible features



Efficient Memory search :
To speed up predictions in a nearest neighbor model, you can precompute an
index (a structured way to store data) instead of searching through all data points
every time.
One effective way to do this is using a k-d tree (short for k-dimensional tree). This
is a special data structure that helps quickly find the closest points, making
predictions much faster. It works best when the training data doesn’t change often.

The k-d tree retrieval algorithm:


Start at the root

Traverse the tree to the section containing the new point

Find the leaf; store it as the best

Traverse upward, and for each node:

if it is closer, it becomes the best (update the best distance)

check whether there could be a closer point on the other side of the splitting plane, within a radius equal to the current best distance:

if that sphere crosses the splitting plane associated with the considered branch point, go down again on the other side (descend)

otherwise, go up another level (ascend)

The k-d tree is one of the best known of these indices: a balanced binary tree in which each of the nodes in the tree indexes one of the instances in the training dataset
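A simplified from-scratch sketch of this retrieval procedure (the class and function names are made up for illustration; real implementations such as scipy.spatial.KDTree are more sophisticated):

```python
import math

class KDNode:
    """One node of a k-d tree: a training instance plus its splitting axis."""
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build_kdtree(points, depth=0):
    """Build a roughly balanced k-d tree by splitting on the median,
    cycling through the dimensions."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return KDNode(points[mid], axis,
                  build_kdtree(points[:mid], depth + 1),
                  build_kdtree(points[mid + 1:], depth + 1))

def nn_search(node, query, best=None, best_dist=math.inf):
    """Descend to the query's region, then back up, pruning branches that the
    hypersphere of radius best_dist around the query cannot reach."""
    if node is None:
        return best, best_dist
    dist = math.dist(node.point, query)
    if dist < best_dist:                       # this node is closer: it becomes the best
        best, best_dist = node.point, dist
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best, best_dist = nn_search(near, query, best, best_dist)      # descend the near side
    if abs(diff) < best_dist:                  # sphere crosses the splitting plane
        best, best_dist = nn_search(far, query, best, best_dist)   # descend the other side
    return best, best_dist

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
print(nn_search(tree, (9, 2)))   # -> ((8, 1), 1.414...)
```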

⇒ check example slides 70 ~ 76

Ball Tree :
A data structure for speeding up nearest neighbor searches

How it works:



organizes data points into hierarchical spheres (balls)

Efficiently narrows down the search space, avoiding unnecessary distance computations

Benefits:

accelerates KNN searches, especially in high-dimensional spaces

works better for high dimensionality

Ball tree construction algorithm :

K-d tree vs ball tree:

k-d tree:
Divides space along one dimension at a time

well-suited for lower-dimensional data

Ball tree:
divides space using hyperspheres (balls)

effective in high-dimensional spaces



Considerations:
Choose based on data dimensionality

Experiment with both for optimal performance (see the sketch below)
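One way to experiment with both in scikit-learn is to switch the algorithm backing the same k-NN search; a small sketch (the data shapes here are made up for illustration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))        # 1000 training instances, 10 features
query = rng.normal(size=(1, 10))

# The same 5-nearest-neighbor query backed by two different index structures
for algorithm in ("kd_tree", "ball_tree"):
    nn = NearestNeighbors(n_neighbors=5, algorithm=algorithm).fit(X)
    distances, indices = nn.kneighbors(query)
    print(algorithm, indices[0], np.round(distances[0], 3))
```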

⇒ How the ball tree partitions the data:

Summary:
Nearest neighbor models are very sensitive to noise in the target feature; the easiest way to solve this problem is to employ a k-nearest neighbor model

Normalization techniques should almost always be applied when NN models are used

It is easy to adapt a nearest neighbor model to continuous targets

There are many different measures of similarity

Feature selection is a particularly important process for NN algorithms; it alleviates the curse of dimensionality

As the number of instances becomes large, a NN model will become slower; techniques like the k-d tree can help with this issue

