DSB- Unit3
Similarity
Similarity refers to the degree to which two objects or entities are alike or
resemble each other in some aspect.
Euclidean Distance: Measures the straight-line distance between two points in Euclidean
space. It is commonly used for numerical data and continuous features.
Cosine Similarity: Measures the cosine of the angle between two vectors in a high-
dimensional space. It is commonly used for text data, document similarity, and collaborative
filtering.
Jaccard Similarity: Measures the size of the intersection of two sets divided by the size of
the union of the sets. It is commonly used for categorical data and binary features.
Hamming Distance: Measures the number of positions at which corresponding symbols are
different between two strings of equal length. It is commonly used for comparing sequences,
such as DNA sequences.
Correlation: Measures the linear relationship between two variables. It is commonly used for
analyzing relationships between continuous variables.
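The measures above can be computed directly; below is a minimal sketch using NumPy and SciPy, where the vectors, sets, and strings are made-up illustrations rather than data from this unit.

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

# Euclidean distance: straight-line distance between two numeric vectors.
print(distance.euclidean(a, b))               # ~3.74

# Cosine similarity = 1 - cosine distance; measures the angle, not the magnitude.
print(1 - distance.cosine(a, b))              # 1.0, since b is a scaled copy of a

# Jaccard similarity between two sets of categorical values.
s1, s2 = {"red", "blue", "green"}, {"blue", "green", "yellow"}
print(len(s1 & s2) / len(s1 | s2))            # 0.5

# Hamming distance: number of positions where two equal-length strings differ.
x, y = "ACGTAC", "ACGTTC"
print(sum(c1 != c2 for c1, c2 in zip(x, y)))  # 1

# Pearson correlation between two numeric variables.
print(np.corrcoef(a, b)[0, 1])                # 1.0
```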
Euclidean Distance
• The raw distance value has no inherent meaning by itself; it is only really useful for comparing the similarity of one pair of instances to that of another pair.
For example, consider describing single-malt Scotch whiskeys by five general whiskey attributes:
1. Color: yellow, very pale, pale, pale gold, gold, old gold, full gold, amber, etc. (14 values)
2. Nose: aromatic, peaty, sweet, light, fresh, dry, grassy, etc. (12 values)
3. Body: soft, medium, full, round, smooth, light, firm, oily. (8 values)
4. Palate: full, dry, sherry, big, fruity, grassy, smoky, salty, etc. (15 values)
5. Finish: full, dry, warm, light, smooth, clean, fruity, grassy, smoky, etc. (19 values)
Nearest Neighbors for Predictive Modeling
Nearest neighbor classification. The point to be
classified, labeled with a question mark, would
be classified + because the majority of its
nearest (three) neighbors are +.
The prediction is produced by a combining function (such as voting or averaging) operating on the neighbors' known target values.
There are different types of combining functions used in nearest neighbors algorithms,
depending on the specific task and problem domain. Some common combining functions
include
• Majority Voting: In classification tasks, the majority voting combining function assigns the class label that
occurs most frequently among the nearest neighbors to the new data point. For example, in a binary
classification problem, if the majority of the nearest neighbors belong to class 1, the combining function
predicts class 1 for the new data point.
• Weighted Voting: Weighted voting assigns weights to each nearest neighbor based on their distance or
similarity to the new data point. Neighbors that are more similar or closer to the new data point may be
given higher weights, while neighbors that are less similar or farther away may be given lower weights.
The final prediction is then made by considering the weighted contributions of all neighbors.
• Regression: In regression tasks, the combining function can be a simple average or weighted average of the
target values of the nearest neighbors. For example, in a regression problem, the combining function may
calculate the average of the target values of the k nearest neighbors and use this average as the predicted value
for the new data point.
• Distance-based Weighting: Similar to weighted voting, distance-based weighting assigns weights to each
nearest neighbor based on their distance to the new data point. However, instead of using the weights
directly in the prediction, the combining function may use them to scale the contribution of each neighbor, such
as by inversely weighting the contribution based on distance (i.e., closer neighbors have higher weights).
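Below is a minimal sketch (not from the slides) of majority voting, distance-weighted voting, and an averaging combiner over the k nearest neighbors; the toy training data, labels, and query point are assumptions for illustration.

```python
import numpy as np
from collections import Counter

# Toy training data: 2-D feature vectors with binary class labels.
X = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0], [0.9, 1.1]])
y = np.array([1, 1, 0, 0, 1])
query = np.array([1.1, 1.0])
k = 3

# Euclidean distances from the query to every training point.
dists = np.linalg.norm(X - query, axis=1)
nn = np.argsort(dists)[:k]                    # indices of the k nearest neighbors

# Majority voting: predict the most frequent class among the neighbors.
majority = Counter(y[nn]).most_common(1)[0][0]

# Distance-weighted voting: closer neighbors get larger weights (1 / distance).
weights = 1.0 / (dists[nn] + 1e-9)            # small epsilon avoids division by zero
scores = {c: weights[y[nn] == c].sum() for c in np.unique(y[nn])}
weighted = max(scores, key=scores.get)

# Regression-style combining: average of the neighbors' target values.
avg = y[nn].mean()

print(majority, weighted, avg)
```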
Classification – (by voting-combining function)
The nearest neighbors (in this example, three of them) are retrieved and their known target variables (classes) are consulted.
• What is the majority vote here: Yes or No?
• Second, should we treat all neighbors the same? Though all are called “nearest”
neighbors, some are nearer than others, and shouldn’t this influence how they’re used?
• With k = 1, a single point can carve out its own region, for example a small negative island inside a positive region.
• With a larger k, the decision boundaries become much less jagged.
Intelligibility
Intelligibility in the context of k-nearest neighbors (k-NN) refers to the ability to
understand and interpret the decisions or predictions made by the algorithm.
Unlike some other machine learning models, such as decision trees or linear
regression, k-NN does not provide explicit rules or coefficients that explain how
predictions are made. Instead, predictions are based on the similarity of the
query point to its nearest neighbors in the feature space.
Example:
Amazon presents recommendations with phrases like: “Customers
with similar searches purchased…” and “Related to Items You’ve
Viewed.”
Issues with Nearest-Neighbor Methods
• High Computational Complexity: Nearest-neighbor methods can be computationally expensive, especially when dealing with large datasets. For each prediction, the algorithm needs to calculate the distances between the query point and all data points in the dataset, which can become impractical for datasets with millions of samples or high-dimensional feature spaces.
• Memory Requirements: Storing the entire dataset in memory is necessary for efficient
nearest-neighbor search. As the dataset grows in size, memory requirements increase
proportionally, potentially leading to memory constraints on resource-limited systems.
• Sensitive to Noise and Outliers: Nearest-neighbor methods are sensitive to noisy or
irrelevant features in the dataset. Outliers or mislabeled data points can significantly affect
the performance of the algorithm, leading to suboptimal predictions.
• Curse of Dimensionality: In high-dimensional feature spaces, the notion of distance
becomes less meaningful, and the density of data points decreases exponentially with the
number of dimensions. This phenomenon, known as the curse of dimensionality, can
adversely affect the performance of nearest-neighbor methods by causing sparsity and
reducing the effectiveness of distance-based similarity measures.
• Need for Feature Scaling: Nearest-neighbor algorithms rely on distance-based
metrics to measure similarity between data points. Therefore, it's essential to scale
features to the same range or normalize them to ensure that features with larger
scales do not dominate the distance calculations.
• Inefficient for Large Datasets: While KNN is simple to implement and understand,
its prediction time grows linearly with the size of the dataset. This makes it
inefficient for real-time applications or scenarios where fast predictions are required.
• Class Imbalance: In classification tasks with imbalanced class distributions, KNN
may favor the majority class due to its reliance on local neighborhood information.
This can lead to biased predictions and poor performance on minority class samples.
• Optimal Choice of k: The choice of the hyperparameter k (number of nearest neighbors) can significantly impact the performance of KNN. Selecting an appropriate value for k requires experimentation and domain knowledge, and an improper choice can lead to underfitting or overfitting (see the sketch after this list).
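As a sketch of the last two points (feature scaling and the choice of k), the pipeline below uses scikit-learn with a synthetic dataset; the dataset and the candidate values of k are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Scale features so no single attribute dominates the distance calculation,
# then fit a k-NN classifier.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Choose k (and the voting scheme) by cross-validated grid search.
grid = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": [1, 3, 5, 11, 21],
                "knn__weights": ["uniform", "distance"]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```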
Some Important Technical Details Relating to Similarities and Neighbors
Heterogeneous Attributes
Real datasets typically mix numeric attributes with a few more categorical attributes, so the distance calculation must handle both kinds and put them on a comparable scale.
Euclidean distance is probably the most widely used distance metric in data science. It is general, intuitive, and computationally very fast. Because it sums the squares of the differences along each individual dimension, it is sometimes called the L2 norm and represented by ‖·‖₂.
The Manhattan distance, or L1 norm, is the sum of the (unsquared) absolute differences along each dimension.
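Written out for two instances x and y with d attributes, the two metrics are:

```latex
d_{\mathrm{Euclidean}}(\mathbf{x}, \mathbf{y}) = \lVert \mathbf{x} - \mathbf{y} \rVert_2
  = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}
\qquad
d_{\mathrm{Manhattan}}(\mathbf{x}, \mathbf{y}) = \lVert \mathbf{x} - \mathbf{y} \rVert_1
  = \sum_{i=1}^{d} \lvert x_i - y_i \rvert
```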
Jaccard distance treats the two objects as sets of characteristics: it equals 1 minus the Jaccard similarity (the size of the intersection divided by the size of the union).
Cosine distance
Cosine distance is often used in text classification to measure the similarity of two documents.
Document A contains seven occurrences of the word performance, three occurrences of transition, and two occurrences of monetary: A = <7, 3, 2>.
Document B contains two occurrences of performance, three occurrences of transition, and no occurrences of monetary: B = <2, 3, 0>.
The cosine distance of the two documents is 1 − (A · B)/(‖A‖ ‖B‖) = 1 − 23/(7.87 × 3.61) ≈ 0.19.
Note that cosine distance ignores the magnitude of the vectors: a document C = <70, 30, 20>, with ten times the word counts of A = <7, 3, 2>, has a cosine distance of zero from A.
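The calculation can be checked with SciPy, whose cosine function returns the cosine distance directly (a quick illustrative sketch):

```python
import numpy as np
from scipy.spatial.distance import cosine

A = np.array([7, 3, 2])
B = np.array([2, 3, 0])
C = np.array([70, 30, 20])       # ten times the word counts of A

print(round(cosine(A, B), 2))    # ~0.19: A and B point in somewhat different directions
print(round(cosine(A, C), 2))    # ~0.0: scaling the counts does not change the angle
```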
Clustering is another application of our fundamental notion of similarity. The basic idea is
that we want to find groups of objects (consumers, businesses, whiskeys, etc.), where the
objects within groups are similar, but the objects in different groups are not so similar.
Supervised Modeling vs. Unsupervised Modeling
Hierarchical Clustering
Groups the points by their similarity
Hierarchical clustering is a method used in data analysis and data mining to group similar data points into clusters based on their characteristics or features. This technique builds a hierarchy of clusters: points that are merged lower in the hierarchy are more similar to each other than points that are only joined together at higher levels.
Initialization: Each data point is initially treated as a separate cluster.
Similarity Measurement: A similarity or dissimilarity metric is calculated between pairs of
data points. Common distance metrics include Euclidean distance, Manhattan distance, and
cosine similarity.
Cluster Fusion: The two most similar clusters are merged together to form a larger cluster.
This process is repeated iteratively until all data points belong to a single cluster, forming a
hierarchy of clusters.
Hierarchical Structure: The result of hierarchical clustering is represented as a dendrogram,
which is a tree-like structure that illustrates the hierarchical relationships between clusters.
The root of the dendrogram represents the single cluster containing all data points, while the
leaves represent individual data points.
• Agglomerative Clustering: This bottom-up approach starts with each data point as a
separate cluster and iteratively merges the most similar clusters until a single cluster
containing all data points is formed.
• Divisive Clustering: This top-down approach starts with all data points in a single
cluster and recursively splits the cluster into smaller clusters until each data point is in
its own cluster.
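A minimal agglomerative clustering sketch using SciPy is shown below; the random two-dimensional points stand in for real instances and are purely illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Random 2-D points standing in for real instances (e.g., whiskeys described by features).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])

# Agglomerative clustering: 'single' linkage merges the two clusters whose
# closest points are nearest (other options: 'complete', 'average', 'ward').
Z = linkage(X, method="single", metric="euclidean")

# The linkage matrix encodes the dendrogram; no_plot returns its structure
# without drawing (drawing requires a plotting backend).
tree = dendrogram(Z, no_plot=True)

# Cut the dendrogram to obtain a chosen number of clusters (here, 2).
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```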
An advantage of hierarchical clustering is that it allows the data analyst to see the groupings (the "landscape" of data similarity) before deciding on the number of clusters to extract.
• Agglomerative clustering starts with each point as its own cluster; clusters are then merged iteratively until only a single cluster remains. The clusters are merged based on the chosen similarity or distance (linkage) function.
• So, for example, the linkage function could be "the Euclidean distance between the closest points in each of the clusters," which would apply to any two clusters.
The phylogenetic Tree of
Life, a huge hierarchical
clustering of species,
displayed radially.
A portion of the Tree of Life.
Hierarchical clustering of Scotch whiskeys
Clustering Around Centroids
The most common method for focusing on the clusters themselves is to represent
each cluster by its “cluster center,” or centroid
Three clusters, whose instances are
represented by the circles. Each cluster
has a centroid, represented by the
solid-lined star
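A minimal sketch of centroid-based clustering using scikit-learn's KMeans; the blob data below is synthetic and stands in for the circles in the figure.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three groups of points.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# K-means requires choosing the number of clusters (k) up front.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Each cluster is represented by its centroid (the "star" in the figure).
print(km.cluster_centers_)
```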
K-means vs. Hierarchical Clustering
1. Number of Clusters:
1. K-means: Requires specifying the number of clusters (K) a priori. The algorithm aims to
minimize the within-cluster variance, but the quality of clustering may depend on the initial
choice of centroids.
2. Hierarchical: Does not require specifying the number of clusters in advance. The dendrogram
allows for exploring different levels of granularity in the clustering hierarchy, enabling the
selection of an appropriate number of clusters based on the data.
2. Scalability:
1. K-means: Generally more scalable and efficient for large datasets compared to hierarchical
clustering, particularly when using optimized algorithms like mini-batch K-means.
3. Cluster Shape:
1. K-means: Assumes that clusters are spherical and of equal size, making it less suitable for
data with non-linear or irregular cluster shapes.
2. Hierarchical: Can handle clusters of arbitrary shapes and sizes, as it does not make any
assumptions about the shape of the clusters.
4. Interpretability:
1. K-means: Provides easily interpretable results, as each data point is assigned to a single
cluster. However, the quality of clustering may depend on the initial choice of centroids.
2. Hierarchical: The dendrogram makes the cluster structure easy to inspect, but the method is less scalable, as it needs to know the distances between all pairs of clusters on each iteration, which at the start is all pairs of data points.
False positives
• negative instances classified as positive
• Classifier A often falsely predicts that customers
will churn when they will not
False negatives
• positive instances classified as negative
• Classifier B makes many errors in the opposite direction, predicting that customers will not churn when in fact they will
Two churn models, A and B, can make an equal number of errors on a balanced population used for training, yet a very different number of errors when tested against the true population.
Problems with Unequal Costs and Benefits
For classification: the confusion matrix.
For regression problems: mean squared error and the R² value.
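A short sketch of computing these with scikit-learn; the toy labels and predictions are assumptions.

```python
from sklearn.metrics import confusion_matrix, mean_squared_error, r2_score

# Classification: a confusion matrix of (actual, predicted) counts.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))

# Regression: mean squared error and the R^2 value.
y_true_r = [3.0, 2.5, 4.0, 5.1]
y_pred_r = [2.8, 2.7, 3.9, 5.0]
print(mean_squared_error(y_true_r, y_pred_r), r2_score(y_true_r, y_pred_r))
```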
A Key Analytical Framework: Expected Value
E(X) = Σᵢ xᵢ · P(X = xᵢ)
where:
• xᵢ represents each possible outcome of the random variable X.
• P(X = xᵢ) represents the probability of occurrence of the outcome xᵢ.
• The sum is taken over all possible outcomes of X.
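For instance (an illustrative mini-example, not from the slides), an offer with a 1% chance of a $99 gain and a 99% chance of a $1 loss has expected value:

```latex
E(X) = 0.01 \times 99 + 0.99 \times (-1) = 0.99 - 0.99 = 0
```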
• Mean Outcome: The expected value represents the average outcome or value that one
would expect to occur in the long run, considering all possible outcomes and their
respective probabilities.
• Decision Making: Expected value calculations are used to make decisions under
uncertainty. For example, in business analytics, decision-makers may use expected value
analysis to evaluate the potential outcomes and risks associated with different strategies
or investments.
• Risk Assessment: Expected value calculations help quantify the risk associated with
uncertain events or variables. By considering the probabilities of different outcomes,
analysts can assess the potential impact of risk and uncertainty on business objectives.
• Comparison Metric: Expected value serves as a useful metric for comparing different
alternatives or scenarios. Decision-makers can compare the expected values of different
options to identify the most favorable or optimal course of action.
• Application in Machine Learning: In machine learning and predictive modeling,
expected value calculations are used to evaluate the performance of models and assess
their predictive accuracy. For example, in regression analysis, the expected value of the
predicted outcome is compared to the actual observed values to measure model
performance.
For example, suppose the value to us if the consumer responds, VR, is $99: the $200 revenue from the sale minus the $100 product cost and the $1 cost of mailing the marketing materials.
Now, what about VNR, the value to us if the consumer does not respond? We still mailed the marketing materials, incurring a cost of $1, or equivalently a benefit of −$1.
Now we are ready to say precisely whether we want to target this consumer:
do we expect to make a profit?
Technically, is the expected value (profit) of targeting greater than zero?
Mathematically, this is:
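Filling in the inequality referenced above with the values used in this example (a benefit of $99 if the consumer responds and −$1 if not), the targeting condition becomes:

```latex
p_R(\mathbf{x}) \cdot v_R + \bigl[1 - p_R(\mathbf{x})\bigr] \cdot v_{NR} > 0
\;\Longrightarrow\;
p_R(\mathbf{x}) \cdot 99 - \bigl[1 - p_R(\mathbf{x})\bigr] \cdot 1 > 0
\;\Longrightarrow\;
p_R(\mathbf{x}) > 0.01
```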
https://towardsdatascience.com/a-look-at-precision-recall-and-f1-score-36b5fd0dd3ec
Using Expected Value to Frame Classifier Evaluation
We need to evaluate the set of decisions made by a model when applied to a set of examples. For example:
1. Does our data-driven model perform better than the hand-crafted model
suggested by the marketing group?
2. Does a classification tree work better than a linear discriminant model for a
particular problem?
3. Do any of the models do substantially better than a baseline “model,” such as
randomly choosing consumers to target?
• What we care about is, in aggregate, how well each model does: what is its expected value?
A training portion of a
dataset is taken as input
by an induction algorithm,
which produces the model
that we will evaluate
We can use the expected value framework just described to determine the best
decisions for each particular model, and then use the expected value in a different
way to compare the models
For example, what is the probability associated with the particular combination of a consumer being predicted to churn but actually not churning?
That would be estimated by the number of test-set consumers who fell into the
confusion matrix cell (Y,n), divided by the total number of test-set consumers
Each cell of the confusion matrix contains a count of the number of decisions corresponding to each combination of (predicted, actual), which we will express as count(h, a).
For the expected value calculation we reduce these counts to rates or estimated
probabilities, p(h,a). We do this by dividing each count by the total number of
instances
A false negative is a consumer who was predicted not to be a likely responder (so
was not offered the product), but would have bought it if offered. In this case, no
money was spent and nothing was gained, so b(N, p) = 0.
• A true positive is a consumer who is offered the product and buys it. The benefit in
this case is the profit from the revenue ($200) minus the product-related costs
($100) and the mailing costs ($1), so b(Y, p) = 99.
• A true negative is a consumer who was not offered a deal and who would not have
bought it even if it had been offered. The benefit in this case is zero (no profit but
no cost), so b(N, n) = 0.
A cost-benefit matrix for the targeted marketing example
The general form of an expected value calculation is the sum, over all cells of the confusion matrix, of the estimated probability of the cell times its benefit: expected value = Σ p(h, a) · b(h, a).
All we need is to be able to compute the confusion matrices over a set of test instances, and to generate the cost-benefit matrix.
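A sketch of that computation in NumPy; the confusion-matrix counts below are made up for illustration, while the cost-benefit values follow the targeted-marketing example above.

```python
import numpy as np

# Confusion matrix of counts, rows = predicted (Y, N), columns = actual (p, n).
# These counts are illustrative, not from the text.
counts = np.array([[50, 200],     # count(Y, p), count(Y, n)
                   [10, 740]])    # count(N, p), count(N, n)

# Cost-benefit matrix b(h, a) from the targeted-marketing example:
# b(Y, p) = 99, b(Y, n) = -1, b(N, p) = 0, b(N, n) = 0.
benefits = np.array([[99.0, -1.0],
                     [0.0,   0.0]])

# Reduce counts to estimated probabilities p(h, a) by dividing by the total.
p = counts / counts.sum()

# Expected value = sum over all cells of p(h, a) * b(h, a).
expected_value = (p * benefits).sum()
print(round(expected_value, 2))   # expected profit per consumer
```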
A rule of basic probability is:
p(A, B) = p(A) · p(B | A)
This says that the probability of two different events both occurring is equal to the probability of one of them occurring times the probability of the other occurring given that the first occurs.
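Using this rule to factor p(h, a) as p(a) · p(h | a), the expected profit can be rewritten with the class priors separated out (one common way to express it):

```latex
E[\text{profit}] = \sum_{h,a} p(h,a)\, b(h,a)
= p(\mathbf{p}) \bigl[ p(Y \mid \mathbf{p})\, b(Y,\mathbf{p}) + p(N \mid \mathbf{p})\, b(N,\mathbf{p}) \bigr]
+ p(\mathbf{n}) \bigl[ p(Y \mid \mathbf{n})\, b(Y,\mathbf{n}) + p(N \mid \mathbf{n})\, b(N,\mathbf{n}) \bigr]
```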
This expected value means that if we apply this model to a population of
prospective customers and mail offers to those it classifies as positive, we can
expect to make an average of about $50 profit per consumer