DSB- Unit3

The document discusses fundamental concepts of similarity, neighbors, and clustering in data science, emphasizing various similarity measures such as Euclidean distance, cosine similarity, and Jaccard similarity. It explains the application of these measures in clustering, recommendation systems, and predictive modeling using nearest neighbors algorithms. Additionally, it highlights challenges associated with nearest-neighbor methods, including computational complexity, sensitivity to noise, and the importance of feature scaling.


Unit-III

Similarity, Neighbors, and Clusters; Decision Analytic Thinking: What Is a Good Model?

Fundamental concepts: Calculating similarity of objects described by data; using similarity for prediction; clustering as similarity-based segmentation.

Similarity
Similarity refers to the degree to which two objects or entities are alike or
resemble each other in some aspect.

In data science, similarity is often quantified using similarity measures or


metrics, which compute a numerical value representing the similarity between two data points, samples, or entities.
There are various similarity measures used in data science, depending on the nature of
the data and the specific problem domain. Some common similarity measures include:

Euclidean Distance: Measures the straight-line distance between two points in Euclidean
space. It is commonly used for numerical data and continuous features.
Cosine Similarity: Measures the cosine of the angle between two vectors in a high-
dimensional space. It is commonly used for text data, document similarity, and collaborative
filtering.
Jaccard Similarity: Measures the size of the intersection of two sets divided by the size of
the union of the sets. It is commonly used for categorical data and binary features.
Hamming Distance: Measures the number of positions at which corresponding symbols are
different between two strings of equal length. It is commonly used for comparing sequences,
such as DNA sequences.
Correlation: Measures the linear relationship between two variables. It is commonly used for
analyzing relationships between continuous variables.
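As a rough illustration of how the measures listed above can be computed, here is a minimal Python sketch using NumPy and SciPy; the example vectors, sets, and strings are invented for illustration:

import numpy as np
from scipy.spatial.distance import euclidean, cosine

# Two numeric feature vectors (invented values)
p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 1.0, 4.0])

print("Euclidean distance:", euclidean(p, q))           # straight-line distance
print("Cosine similarity:", 1 - cosine(p, q))           # SciPy's cosine() returns the *distance*
print("Pearson correlation:", np.corrcoef(p, q)[0, 1])  # strength of linear relationship

# Jaccard similarity of two sets of categorical values
A = {"peaty", "sweet", "fruity"}
B = {"peaty", "dry", "fruity"}
print("Jaccard similarity:", len(A & B) / len(A | B))

# Hamming distance: count of positions where two equal-length strings differ
s1, s2 = "GATTACA", "GACTACA"
print("Hamming distance:", sum(c1 != c2 for c1, c2 in zip(s1, s2)))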
Euclidean Distance

For two points p = (p1, p2, …, pn) and q = (q1, q2, …, qn) in an n-dimensional space, the Euclidean distance d(p, q) is calculated as:

d(p, q) = √[ (p1 − q1)² + (p2 − q2)² + … + (pn − qn)² ]
Cosine Similarity
Cosine similarity is a measure used to determine how similar two vectors are in a high-
dimensional space. Cosine similarity is commonly used in text mining, information retrieval,
and recommendation systems, where documents or items are represented as high-dimensional
vectors
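The formula on the original slide appears only as an image; for reference, the standard definition for two vectors A and B is:

cos(A, B) = (A · B) / (‖A‖ ‖B‖) = Σi Ai·Bi / ( √(Σi Ai²) · √(Σi Bi²) )

Cosine distance is then commonly taken as 1 − cos(A, B), so two vectors pointing in the same direction have distance 0.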
Jaccard Similarity
Given two sets A and B, the Jaccard similarity J(A, B) is calculated as the size of the intersection of the sets divided by the size of their union:

J(A, B) = |A ∩ B| / |A ∪ B|
Hamming distance is a measure used to quantify the difference between two strings of
equal length. It calculates the number of positions at which corresponding symbols are
different between the two strings.
Given two strings a and b of equal length n, the Hamming distance dH(a, b) is calculated as:

dH(a, b) = number of positions i such that ai ≠ bi

In other words, the Hamming distance is the count of positions where the symbols (characters) of the two strings differ.
Correlation
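The formula for this slide also appears only as an image; the measure intended here is presumably the Pearson correlation coefficient, whose standard definition for paired observations (xi, yi) with means x̄ and ȳ is:

r(x, y) = Σi (xi − x̄)(yi − ȳ) / ( √(Σi (xi − x̄)²) · √(Σi (yi − ȳ)²) )

Values range from −1 (perfect negative linear relationship) through 0 (no linear relationship) to +1 (perfect positive linear relationship).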
Similarity measures are used in various ways in data science for business
problems:
Clustering: Similarity measures are used to group similar data points together in clustering
algorithms such as k-means clustering or hierarchical clustering. Clustering helps in identifying
natural groupings or patterns in the data.
Recommendation Systems: Similarity measures are used to identify similar items or users in
recommendation systems. For example, in collaborative filtering-based recommendation
systems, cosine similarity is often used to measure the similarity between users or items.
Search and Retrieval: Similarity measures are used in information retrieval systems to rank
documents or search results based on their similarity to a query. Cosine similarity is commonly
used in text retrieval systems.
Anomaly Detection: Similarity measures are used to detect outliers or anomalies in the data by
measuring how dissimilar a data point is from the rest of the data. Anomalies are often detected
as data points that are significantly different from the majority of the data.
Classification and Regression: Similarity measures are used in classification and regression
tasks to measure the similarity between a new data point and the training data. Similarity-based
algorithms such as k-nearest neighbors (KNN) use the similarity between data points to make
predictions.
Similarity and distance

How two different model types divide up an instance space into regions, based on the closeness of instances with similar class labels:

They have in common the view that instances sharing a common region in space should be similar; what differs between the methods is how the regions are represented and discovered.
• So the distance between these examples is about 19.
• This distance is just a number—it has no units, and no meaningful
interpretation.

• It is only really useful for comparing the similarity of one pair of instances
to that of another pair.

• It turns out that comparing similarities is extremely useful.
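The two instances being compared appear only in the slide image. As a hedged reconstruction, the sketch below uses two hypothetical customers described by (age, years at current address), with values chosen so the Euclidean distance comes out to roughly 19, matching the statement above:

import math

# Hypothetical instances -- the slide's actual attribute values are not in the text
person_A = (23, 2)    # age 23, 2 years at current address
person_B = (40, 10)   # age 40, 10 years at current address

distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(person_A, person_B)))
print(round(distance, 1))  # -> 18.8, i.e. "about 19"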


Whiskey Analytics
• Let’s take a data science approach. we first should think about the exact question we
would like to answer, and what are the appropriate data to answer it.

• How can we describe single malt Scotch whiskeys as feature vectors, in


such a way that we think similar whiskeys will have similar taste?
Tasting notes are published for many whiskeys, for example in Michael Jackson's Malt Whisky Companion: A Connoisseur's Guide to the Malt Whiskies of Scotland (Jackson, 1989), which describes 109 different single malt Scotches of Scotland.

Here is their description of Bunnahabhain:
• Appetizing aroma of peat smoke
• Almost incense-like
• Heather honey with a fruity softness
Define five general whiskey attributes, each with many possible values:

1. Color: yellow, very pale, pale, pale gold, gold, old gold, full gold, amber, etc. (14 values)

2. Nose: aromatic, peaty, sweet, light, fresh, dry, grassy, etc. (12 values)

3. Body: soft, medium, full, round, smooth, light, firm, oily. (8 values)

4. Palate: full, dry, sherry, big, fruity, grassy, smoky, salty, etc. (15 values)

5. Finish: full, dry, warm, light, smooth, clean, fruity, grassy, smoky, etc. (19 values)
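One hedged way to turn tasting-note attributes like these into feature vectors is a binary (multi-hot) encoding over the descriptor vocabulary, so that whiskeys sharing descriptors come out as similar under a Jaccard- or distance-based measure. The descriptor sets below are invented for illustration, not taken from the book's data:

# Illustrative multi-hot encoding of tasting-note descriptors (invented, not the book's data)
vocabulary = ["peaty", "sweet", "fresh", "smoky", "fruity", "dry", "full", "light"]

def encode(descriptors):
    """Return a binary feature vector over the fixed descriptor vocabulary."""
    return [1 if term in descriptors else 0 for term in vocabulary]

bunnahabhain = encode({"sweet", "fresh", "fruity", "light"})
laphroaig = encode({"peaty", "smoky", "dry", "full"})
glenlivet = encode({"sweet", "fruity", "light", "dry"})

def jaccard(u, v):
    """Jaccard similarity of two binary vectors."""
    both = sum(a and b for a, b in zip(u, v))
    either = sum(a or b for a, b in zip(u, v))
    return both / either if either else 0.0

print(jaccard(bunnahabhain, glenlivet))  # similar profiles -> 0.6
print(jaccard(bunnahabhain, laphroaig))  # very different profiles -> 0.0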
Nearest Neighbors for Predictive Modeling
Nearest neighbor classification. The point to be
classified, labeled with a question mark, would
be classified + because the majority of its
nearest (three) neighbors are +.
A combining function (like voting or averaging) operates on the neighbors' known target values; the combining function gives us a prediction.
There are different types of combining functions used in nearest neighbors algorithms,
depending on the specific task and problem domain. Some common combining functions
include
• Majority Voting: In classification tasks, the majority voting combining function assigns the class label that
occurs most frequently among the nearest neighbors to the new data point. For example, in a binary
classification problem, if the majority of the nearest neighbors belong to class 1, the combining function
predicts class 1 for the new data point.
• Weighted Voting: Weighted voting assigns weights to each nearest neighbor based on their distance or
similarity to the new data point. Neighbors that are more similar or closer to the new data point may be
given higher weights, while neighbors that are less similar or farther away may be given lower weights.
The final prediction is then made by considering the weighted contributions of all neighbors.
• Regression: In regression tasks, the combining function can be a simple average or weighted average of the
target values of the nearest neighbors. For example, in a regression problem, the combining function may
calculate the average of the target values of the k nearest neighbors and use this average as the predicted value
for the new data point.
• Distance-based Weighting: Similar to weighted voting, distance-based weighting assigns weights to each
nearest neighbor based on their distance to the new data point. However, instead of using the weights
directly in the prediction, the combining function may use them to scale the contribution of each neighbor, such
as by inversely weighting the contribution based on distance (i.e., closer neighbors have higher weights).
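A compact sketch of two of the combining functions described above, simple majority voting and distance-based (1/d²) weighting; the retrieved neighbors are invented:

from collections import Counter

# (distance_to_query, class_label) for the k retrieved neighbors -- invented values
neighbors = [(1.0, "+"), (2.0, "+"), (4.0, "-")]

# Majority voting: the most frequent class among the neighbors wins
majority = Counter(label for _, label in neighbors).most_common(1)[0][0]
print("majority vote:", majority)

# Distance-based weighting: each neighbor contributes weight 1 / distance**2
weights = Counter()
for dist, label in neighbors:
    weights[label] += 1.0 / dist ** 2
total = sum(weights.values())
print({label: round(w / total, 2) for label, w in weights.items()})  # class "probabilities"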
Classification – (by voting-combining function)
The nearest neighbors (in this example,
Three of them) are retrieved and their
known target variables (classes) are
consulted.

In this case, two examples are positive and one is negative. What should be our combining function?

A simple combining function in this case

A simple combining function in this case


would be majority vote, so the predicted
class would be positive.
Classification –credit card example

• What is majority
voting here? Yes or No

• how many neighbors


should we use?
• Should they have
equal weights in the
combining function?
How Many Neighbors and How Much Influence?
• First, why three neighbors, instead of just one, or five, or one hundred?

• Second, should we treat all neighbors the same? Though all are called “nearest”
neighbors, some are nearer than others, and shouldn’t this influence how they’re used?

Example k-NN – k nearest neighbors

3-NN – 3 nearest neighbors


Using as the scaling weight the reciprocal of the square of the distance:

Similarity Weight = 1 / Distance²

The final probability estimates for David are 0.65 for Yes and 0.35 for No.
Geometric interpretation, overfitting, and complexity control
Boundaries created by a 1-NN classifier

negative island

• More generally, irregular concept


boundaries are characteristic of all
nearest-neighbor classifiers, because
they do not impose any particular
geometric form on the classifier.

• Instead, they form boundaries in


instance space tailored to the specific
data used for training
How should one choose k?

We can conduct cross-validation or other nested holdout testing on the training set, for a variety of different values of k, searching for one that gives the best performance on the training data. Then, when we have chosen a value of k, we build a k-NN model from the entire training set.
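A minimal sketch of that procedure with scikit-learn, assuming a training feature matrix X and label vector y (generated synthetically here):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training set
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Cross-validate a range of k values on the training data only
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 10, 30]},
    cv=5,
)
search.fit(X, y)

print("best k:", search.best_params_["n_neighbors"])
# Refitting on the entire training set happens automatically (refit=True by default),
# so search.best_estimator_ is the final k-NN model built from all training data.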
Classification
boundaries created
on a three-class
problem created by 1-
NN (single nearest
neighbor).
• Classification boundaries created on a three-class problem by 30-NN (averaging 30 nearest neighbors).

• The 30 nearest neighbours are averaged to form a classification.

• The "negative island" effect is removed.

• The boundaries are much less jagged.

• The boundaries for k-NN are more strongly defined by the data.
Issues with Nearest-Neighbor Methods

Intelligibility
Intelligibility in the context of k-nearest neighbors (k-NN) refers to the ability to
understand and interpret the decisions or predictions made by the algorithm.
Unlike some other machine learning models, such as decision trees or linear
regression, k-NN does not provide explicit rules or coefficients that explain how
predictions are made. Instead, predictions are based on the similarity of the
query point to its nearest neighbors in the feature space.

Example:
Amazon presents recommendations with phrases like: “Customers
with similar searches purchased…” and “Related to Items You’ve
Viewed.”
• High Computational Complexity: Nearest-neighbor methods can be computationally
expensive, especially when dealing with large datasets. For each prediction, the algorithm
needs to calculate the distances between the query point and all data points in the dataset,
which can become impractical for datasets with millions of samples or high-dimensional
feature spaces.
• Memory Requirements: Storing the entire dataset in memory is necessary for efficient
nearest-neighbor search. As the dataset grows in size, memory requirements increase
proportionally, potentially leading to memory constraints on resource-limited systems.
• Sensitive to Noise and Outliers: Nearest-neighbor methods are sensitive to noisy or
irrelevant features in the dataset. Outliers or mislabeled data points can significantly affect
the performance of the algorithm, leading to suboptimal predictions.
• Curse of Dimensionality: In high-dimensional feature spaces, the notion of distance
becomes less meaningful, and the density of data points decreases exponentially with the
number of dimensions. This phenomenon, known as the curse of dimensionality, can
adversely affect the performance of nearest-neighbor methods by causing sparsity and
reducing the effectiveness of distance-based similarity measures.
• Need for Feature Scaling: Nearest-neighbor algorithms rely on distance-based
metrics to measure similarity between data points. Therefore, it's essential to scale
features to the same range or normalize them to ensure that features with larger
scales do not dominate the distance calculations.
• Inefficient for Large Datasets: While KNN is simple to implement and understand,
its prediction time grows linearly with the size of the dataset. This makes it
inefficient for real-time applications or scenarios where fast predictions are required.
• Class Imbalance: In classification tasks with imbalanced class distributions, KNN
may favor the majority class due to its reliance on local neighborhood information.
This can lead to biased predictions and poor performance on minority class samples.
• Optimal Choice of k: The choice of the hyperparameter k (number of nearest neighbors) can significantly impact the performance of KNN. Selecting an appropriate value for k requires experimentation and domain knowledge, and an improper choice can lead to underfitting or overfitting.
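Several of these issues, feature scaling in particular, have routine remedies; here is a hedged scikit-learn sketch that standardizes features before k-NN inside a single pipeline (synthetic data for illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Standardize features so no single large-scale feature dominates the distance
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

scores = cross_val_score(model, X, y, cv=5)
print("mean cross-validated accuracy:", scores.mean().round(3))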
Some Important Technical Details Relating to Similarities and Neighbors
Heterogeneous Attributes

Some attributes are numeric; a few more attributes may be categorical.

A categorical attribute must be encoded numerically. For binary variables, a simple encoding like M = 0, F = 1 may be sufficient, but if there are multiple values for a categorical attribute this will not be good enough.

The general principle at work is that care must be taken that the similarity/distance computation is meaningful for the application.
Euclidean distance is probably the most widely used distance metric in data science. It is general, intuitive, and computationally very fast. Because it employs the squares of the distances along each individual dimension, it is sometimes called the L2 norm and sometimes represented by ‖·‖₂.

The Manhattan distance or L1 norm is the sum of the (unsquared) pairwise distances.
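Written out for points p and q in n dimensions:

L2 (Euclidean): ‖p − q‖₂ = √( Σi (pi − qi)² )
L1 (Manhattan): ‖p − q‖₁ = Σi |pi − qi|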

Jaccard distance: one minus the Jaccard similarity defined earlier.

Cosine distance

Cosine distance is often used in text classification to measure the similarity of two documents.
Document A contains seven occurrences of the word performance, three occurrences of transition, and two occurrences of monetary → A = <7, 3, 2>

Document B contains two occurrences of performance, three occurrences of transition, and no occurrences of monetary → B = <2, 3, 0>
The cosine distance of the two documents works out to about 0.19 (a cosine similarity of about 0.81).

Now compare A = <7, 3, 2> with C = <70, 30, 20>, where C has the same proportions as A but ten times the counts (a much longer document). The cosine distance between A and C is 0: cosine distance ignores differences in overall document length.
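These numbers can be checked with SciPy, whose cosine() function reports cosine distance directly (1 minus the cosine similarity):

from scipy.spatial.distance import cosine

A = [7, 3, 2]     # performance = 7, transition = 3, monetary = 2
B = [2, 3, 0]
C = [70, 30, 20]  # same proportions as A, ten times the counts

print(round(cosine(A, B), 2))  # about 0.19
print(round(cosine(A, C), 2))  # 0.0 -- overall document length does not matter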


Clustering
Supervised segmentation—finding groups of objects that differ with respect to
some target characteristic of interest. For example, find groups of customers that
differ with respect to their propensity to leave the company when their contracts
expire.

Idea of finding natural groupings in the data may be called unsupervised


segmentation, or more simply clustering

Clustering is another application of our fundamental notion of similarity. The basic idea is
that we want to find groups of objects (consumers, businesses, whiskeys, etc.), where the
objects within groups are similar, but the objects in different groups are not so similar.
Supervised modeling Vs Unsupervised modeling

• Supervised modeling involves discovering patterns to predict the value


of a specified target variable, based on data where we know the values of
the target variable.

• Unsupervised modeling does not focus on a target variable. Instead it


looks for other sorts of regularities in a set of data.

Hierarchical Clustering
Groups the points by their similarity
Hierarchical clustering is a method used in data analysis and data mining to
group similar data points into clusters based on their characteristics or features.
This clustering technique builds a hierarchy of clusters, where clusters at the
same level of the hierarchy are more similar to each other than clusters at
different levels
Initialization: Each data point is initially treated as a separate cluster.
Similarity Measurement: A similarity or dissimilarity metric is calculated between pairs of
data points. Common distance metrics include Euclidean distance, Manhattan distance, and
cosine similarity.
Cluster Fusion: The two most similar clusters are merged together to form a larger cluster.
This process is repeated iteratively until all data points belong to a single cluster, forming a
hierarchy of clusters.
Hierarchical Structure: The result of hierarchical clustering is represented as a dendrogram,
which is a tree-like structure that illustrates the hierarchical relationships between clusters.
The root of the dendrogram represents the single cluster containing all data points, while the
leaves represent individual data points.
• Agglomerative Clustering: This bottom-up approach starts with each data point as a
separate cluster and iteratively merges the most similar clusters until a single cluster
containing all data points is formed.

• Divisive Clustering: This top-down approach starts with all data points in a single
cluster and recursively splits the cluster into smaller clusters until each data point is in
its own cluster.

Hierarchical clustering offers several advantages, including:


• Intuitive visualization of cluster relationships through dendrograms.
• No need to specify the number of clusters in advance.
• Flexibility to explore different levels of granularity in the clustering
hierarchy
implicit hierarchy

The most general (highest-


level) clustering is just the
single cluster that contains
everything

The lowest-level clustering


is when we remove all the
circles
Dendrogram (makes the hierarchy explicit)

An advantage of
hierarchical clustering is
that
it allows the data analyst to
see the groupings—the
“landscape” of data
similarity—before deciding
on the number of clusters to
extract.

Whenever a single point merges high up in a dendrogram, this is an indication that it seems different from the rest; we might call it an "outlier."
• Hierarchical clusterings generally are formed by starting with each node as its
own cluster.

• Then clusters are merged iteratively until only a single cluster remains. The
clusters are merged based on the similarity or distance function that is chosen.

• So far we have discussed distance between instances. For hierarchical


clustering, we need a distance function between clusters, considering
individual instances to be the smallest clusters.

• This is sometimes called the linkage function.

• So, for example, the linkage function could be “the Euclidean distance
between the closest points in each of the clusters,” which would apply to any
two clusters.
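As a hedged illustration, SciPy's hierarchical-clustering routines implement this scheme; the "single" linkage method corresponds to the "distance between the closest points in each cluster" mentioned above (the points here are random stand-ins for real instances):

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# Random 2-D points standing in for instances (each point starts as its own cluster)
rng = np.random.default_rng(0)
points = rng.normal(size=(20, 2))

# Agglomerative clustering with single linkage (closest points between clusters)
merges = linkage(points, method="single", metric="euclidean")

# The dendrogram makes the hierarchy explicit; cutting it at a chosen height
# yields any desired number of clusters.
dendrogram(merges)
plt.show()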
The phylogenetic Tree of
Life, a huge hierarchical
clustering of species,
displayed radially.
A portion of the Tree of Life.
Hierarchical clustering of Scotch whiskeys
Clustering Around Centroids
The most common method for focusing on the clusters themselves is to represent
each cluster by its “cluster center,” or centroid
Three clusters, whose instances are
represented by the circles. Each cluster
has a centroid, represented by the
solid-lined star

The star is not necessarily one of the


instances; it is the geometric center
of a group of instances

The most popular centroid-based


clustering algorithm is called k-means
clustering
In k-means (here, k = 3) the "means" are the centroids, represented by the arithmetic means (averages) of the values along each dimension for the instances in the cluster.

Generally, the centroid is the


average of the values for each
feature of each example in
the cluster
k-means clustering
example using 90
points on a plane
k-means clustering
example using 90 points on
a plane and k=3 centroids.
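A hedged sketch of the same kind of example with scikit-learn: 90 synthetic points on a plane (not the slide's points) clustered with k = 3:

import numpy as np
from sklearn.cluster import KMeans

# 90 synthetic points on a plane, drawn around three loose centers
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal(loc=center, scale=0.8, size=(30, 2))
    for center in [(0, 0), (5, 5), (0, 5)]
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print("centroids:\n", kmeans.cluster_centers_)   # arithmetic mean of each cluster
print("first few assignments:", kmeans.labels_[:10])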
1.Approach:
1. K-means: K-means clustering is a partitioning method that divides data into a pre-defined
number of clusters (K). It iteratively assigns data points to the nearest cluster centroid based
on a distance metric (typically Euclidean distance) and updates the centroids until
convergence.

2. Hierarchical: Hierarchical clustering builds a hierarchy of clusters using a bottom-up


(agglomerative) or top-down (divisive) approach. It does not require specifying the number of
clusters in advance and generates a dendrogram to visualize the hierarchical relationships
between clusters.

2.Number of Clusters:
1. K-means: Requires specifying the number of clusters (K) a priori. The algorithm aims to
minimize the within-cluster variance, but the quality of clustering may depend on the initial
choice of centroids.

2. Hierarchical: Does not require specifying the number of clusters in advance. The dendrogram
allows for exploring different levels of granularity in the clustering hierarchy, enabling the
selection of an appropriate number of clusters based on the data.
3. Scalability:
1. K-means: Generally more scalable and efficient for large datasets compared to hierarchical
clustering, particularly when using optimized algorithms like mini-batch K-means.

2. Hierarchical: Can be computationally intensive, especially for large datasets, as the


algorithm's time complexity is quadratic or worse in the number of data points.

4. Cluster Shape:
1. K-means: Assumes that clusters are spherical and of equal size, making it less suitable for
data with non-linear or irregular cluster shapes.

2. Hierarchical: Can handle clusters of arbitrary shapes and sizes, as it does not make any
assumptions about the shape of the clusters.

5. Interpretability:
1. K-means: Provides easily interpretable results, as each data point is assigned to a single
cluster. However, the quality of clustering may depend on the initial choice of centroids.

2. Hierarchical: Offers intuitive visualization through dendrograms, allowing for the


exploration of cluster relationships and substructures. However, determining the optimal
number of clusters can be subjective.
In terms of run time, the k-means algorithm is efficient. Even with multiple runs it is
generally relatively fast, because it only computes the distances between each data point
and the cluster centers on each iteration.

Hierarchical clustering is generally slower,

as it needs to know the distances between all pairs of clusters on each iteration, which
at the start is all pairs of data points.

A common concern with centroid algorithms such as k-means is how to determine a good value for k. One answer is simply to experiment with different k values and see which ones generate good results, but the minimum k where the results stabilize is often a good choice.
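One hedged way to run that experiment is to track the within-cluster sum of squares (scikit-learn exposes it as inertia_) over a range of k values and look for the point where the curve stops dropping sharply (the "elbow"); the data here are synthetic:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
points = np.vstack([rng.normal(loc=c, scale=0.7, size=(30, 2))
                    for c in [(0, 0), (6, 0), (3, 5)]])

# Within-cluster sum of squares for k = 1..8; the "elbow" where the curve
# flattens out suggests a reasonable k (here it should be around 3).
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    print(k, round(model.inertia_, 1))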
Decision Analytic Thinking I:
What Is a Good Model?
Often it is not possible to measure perfectly one’s ultimate goal, for example
because the systems are inadequate, or because it is too costly to gather the right
data, or because it is difficult to assess causality. So, we might conclude that we
need to measure some surrogate for what we’d really like to measure. It is
nonetheless crucial to think carefully about what we’d really like to measure.

If we have to choose a surrogate, we should do it via careful, data-analytic


thinking.
We cannot offer a single evaluation metric that is “right” for any
classification problem, or regression problem, or whatever problem you may
encounter.
Evaluating Classifiers
Consider binary classification, for which the classes often are simply called
“positive” and “negative.” How shall we evaluate how well such a model
performs?

Plain Accuracy and Its Problems

Accuracy = (number of correct decisions made) / (total number of decisions made)

This is equal to 1 − error rate.


The Confusion Matrix
• To evaluate a classifier properly it is important to understand the notion of
class confusion and the confusion matrix, which is one sort of contingency
table.
• A confusion matrix for a problem involving n classes is an n × n matrix with the columns labeled with actual classes and the rows labeled with predicted classes. Correct predictions fall on the main diagonal and errors in the off-diagonal entries.
The errors of the classifier are the

false positives (negative instances


classified as positive) and
false negatives (positives classified as
negative).
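A minimal sketch of building such a matrix with scikit-learn; note that confusion_matrix puts actual classes on the rows and predicted classes on the columns, the transpose of the convention described above (the labels below are invented):

from sklearn.metrics import confusion_matrix

# Invented actual and predicted class labels for ten instances
actual    = ["p", "p", "p", "p", "n", "n", "n", "n", "n", "n"]
predicted = ["p", "p", "p", "n", "p", "n", "n", "n", "n", "n"]

# Rows/columns follow label order ["p", "n"]: row 0 = actual p, row 1 = actual n
cm = confusion_matrix(actual, predicted, labels=["p", "n"])
print(cm)
# [[3 1]    3 true positives, 1 false negative
#  [1 5]]   1 false positive, 5 true negatives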
Problems with Unbalanced Classes
Bias Towards Majority Class:
The classifier tends to be biased towards the majority class, as it achieves high accuracy by
simply predicting the majority class for most instances. As a result, the minority class(es) may be
overlooked or misclassified, leading to poor predictive performance.
Difficulty in Learning Patterns of Minority Class:
With imbalanced data, the classifier may struggle to learn the patterns and characteristics of the
minority class, especially if the number of instances is significantly smaller than that of the
majority class. This is because the model has fewer examples to learn from and may not
adequately capture the variability within the minority class.
Evaluation Metrics Misleading:
Traditional evaluation metrics such as accuracy can be misleading in the presence of imbalanced
classes. A model that predicts the majority class for all instances may achieve high
accuracy, but it fails to capture the performance on the minority class, which is often the
class of interest. Evaluation metrics like precision, recall, F1-score, and ROC AUC (Receiver
Operating Characteristic Area Under the Curve) are more informative for imbalanced classes.
However, no single metric on its own provides a complete picture of the classifier's performance.
In a training population of 1,000 customers

False positives
• negative instances classified as positive
• Classifier A often falsely predicts that customers
will churn when they will not

False negatives
• positives classified as negative
• classifier B makes many opposite errors of
predicting that customers will not churn
when in fact they will
Two churn models, A and
B, can make an equal
number of errors on a
balanced population used
for training
Very different number of errors
when tested against the true
population
Problems with Unequal Costs and Benefits

Consider a medical diagnosis domain where a patient is wrongly informed he has


cancer when he does not. This is a false positive error. The result would likely be
that the patient would be given further tests or a biopsy, which would eventually
disconfirm the initial diagnosis of cancer. This mistake might be expensive,
inconvenient, and stressful for the patient, but it would not be life threatening.
Compare this with the opposite error: a patient who has cancer is wrongly told she does not. This is a false negative. This second type of error would mean a
person with cancer would miss early detection, which could have far more serious
consequences. These two errors are very different, should be counted separately,
and should have different costs.
Once aggregated, these will produce an expected profit (or expected benefit
or expected cost) estimate for the classifier.
Generalizing Beyond Classification

For classification: confusion matrix.
For regression problems: mean squared error, R² value.
A Key Analytical Framework: Expected Value

The expected value computation provides a framework that is extremely useful


in organizing thinking about data-analytic problems.
Specifically, it decomposes data-analytic thinking into
(i) the structure of the problem,
(ii) the elements of the analysis that can be extracted from the data, and
(iii) the elements of the analysis that need to be acquired from other sources
The expected value (EV) of a random variable X, denoted as E(X), is calculated by
summing the product of each possible outcome of X and its corresponding
probability of occurrence

Mathematically, it is expressed as:

E(X) = Σi xi · P(X = xi)

• xi represents each possible outcome of the random variable X.
• P(X = xi) represents the probability of occurrence of the outcome xi.
• The sum is taken over all possible outcomes of X.
• Mean Outcome: The expected value represents the average outcome or value that one
would expect to occur in the long run, considering all possible outcomes and their
respective probabilities.
• Decision Making: Expected value calculations are used to make decisions under
uncertainty. For example, in business analytics, decision-makers may use expected value
analysis to evaluate the potential outcomes and risks associated with different strategies
or investments.
• Risk Assessment: Expected value calculations help quantify the risk associated with
uncertain events or variables. By considering the probabilities of different outcomes,
analysts can assess the potential impact of risk and uncertainty on business objectives.
• Comparison Metric: Expected value serves as a useful metric for comparing different
alternatives or scenarios. Decision-makers can compare the expected values of different
options to identify the most favorable or optimal course of action.
• Application in Machine Learning: In machine learning and predictive modeling,
expected value calculations are used to evaluate the performance of models and assess
their predictive accuracy. For example, in regression analysis, the expected value of the
predicted outcome is compared to the actual observed values to measure model
performance.
For example,

if the outcomes represent different possible levels of profit,


• an expected profit calculation weights heavily the highly likely levels of profit,
while
• unlikely levels of profit are given little weight.
Consider that we have an offer for a product that, for simplicity, is only
available via this offer.
If the offer is not made to a consumer, the consumer will not buy the product.

We have a model, mined from historical data, that gives an estimated


probability of response PR(X)for any consumer whose feature vector description
x is given as input.
Expected benefit of targeting = PR(X) · VR + [1 − PR(X)] · VNR

• where VR is the value we get from a response,
• VNR is the value we get from no response, and
• [1 − PR(X)] is the estimate of the probability of not responding.


To be concrete, let’s say that a consumer buys the product for $200 and our
product-related costs are $100.
To target the consumer with the offer, we also incur a cost. Let’s
say that we mail some flashy marketing materials, and the overall cost
including postage is $1,

yielding a value (profit) of VR = $99 if the consumer responds (buys the


product).

Now, what about VNR, the value to us if the consumer does not respond? We
still mailed the marketing materials, incurring a cost of $1 or equivalently a
benefit of -$1.
Now we are ready to say precisely whether we want to target this consumer:
do we expect to make a profit?
Technically, is the expected value (profit) of targeting greater than zero? Mathematically, this is:

PR(X) · $99 + [1 − PR(X)] · (−$1) > 0

Solving this inequality, we should target the consumer as long as the estimated probability of responding is greater than 1%.
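The arithmetic can be checked with a tiny sketch using the values above (VR = $99, VNR = −$1):

V_R = 99.0    # profit if the consumer responds
V_NR = -1.0   # cost of mailing if the consumer does not respond

def expected_value(p_response):
    """Expected profit of targeting a consumer with response probability p."""
    return p_response * V_R + (1 - p_response) * V_NR

# Target whenever the expected profit is positive
for p in [0.005, 0.01, 0.02]:
    print(p, round(expected_value(p), 2), "target" if expected_value(p) > 0 else "skip")
# The break-even point is p = 1/100: 0.01 * 99 - 0.99 * 1 = 0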
Suppose we have the following confusion matrices (rows = predicted class, columns = actual class, following the convention above):

Matrix 1:
90    5
10    95

Matrix 2:
900   0
0     95

Matrix 3:
90    500
1000  95

For a discussion of precision, recall, and F1-score on matrices like these, see:
https://towardsdatascience.com/a-look-at-precision-recall-and-f1-score-36b5fd0dd3ec
Using Expected Value to Frame Classifier Evaluation
we need to evaluate the set of decisions made by a model when applied to a set
of examples.
1. Does our data-driven model perform better than the hand-crafted model
suggested by the marketing group?
2. Does a classification tree work better than a linear discriminant model for a
particular problem?
3. Do any of the models do substantially better than a baseline “model,” such as
randomly choosing consumers to target?

• What we care about is, in aggregate, how well does each model do: what is its
expected value.
A training portion of a
dataset is taken as input
by an induction algorithm,
which produces the model
that we will evaluate
We can use the expected value framework just described to determine the best
decisions for each particular model, and then use the expected value in a different
way to compare the models

For example, what is the probability associated with the particular combination of
a consumer being predicted to churn and actually does not churn?

That would be estimated by the number of test-set consumers who fell into the
confusion matrix cell (Y,n), divided by the total number of test-set consumers
Each cell of the confusion matrix contains a count of the number of decisions
corresponding to the corresponding combination of (predicted, actual), which
we will express as count(h,a)
For the expected value calculation we reduce these counts to rates or estimated
probabilities, p(h,a). We do this by dividing each count by the total number of
instances

A sample confusion matrix with counts.


A cost-benefit matrix
A false positive occurs when we classify a consumer as a likely responder and therefore
target her, but she does not respond. We’ve said that the cost of preparing and mailing the
marketing materials is a fixed cost of $1 per consumer. The benefit in this case is negative:
b(Y, n) = –1

A false negative is a consumer who was predicted not to be a likely responder (so
was not offered the product), but would have bought it if offered. In this case, no
money was spent and nothing was gained, so b(N, p) = 0.

• A true positive is a consumer who is offered the product and buys it. The benefit in
this case is the profit from the revenue ($200) minus the product-related costs
($100) and the mailing costs ($1), so b(Y, p) = 99.

• A true negative is a consumer who was not offered a deal and who would not have
bought it even if it had been offered. The benefit in this case is zero (no profit but
no cost), so b(N, n) = 0.
A cost-benefit matrix for the targeted marketing example
The general form of an expected value calculation:

Expected profit = Σ over all cells (h, a) of p(h, a) · b(h, a)

where h is the predicted class, a is the actual class, p(h, a) is the estimated probability of that (predicted, actual) combination, and b(h, a) is its benefit or cost.

All we need is to be able to compute the confusion matrices over a set of test instances, and to generate the cost-benefit matrix.

A rule of basic probability is:

p(x, y) = p(y) · p(x | y)

This says that the probability of two different events both occurring is equal to the probability of one of them occurring times the probability of the other occurring given that we know the first occurs.
This expected value means that if we apply this model to a population of
prospective customers and mail offers to those it classifies as positive, we can
expect to make an average of about $50 profit per consumer
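Putting the pieces together: a hedged sketch that turns a confusion matrix of counts into rates and aggregates them against the cost-benefit matrix. The slide's actual counts appear only as an image, so the counts below are stand-ins chosen to land near the $50-per-consumer figure quoted above:

import numpy as np

# Hypothetical confusion-matrix counts (rows = predicted Y/N, columns = actual p/n);
# the slide's actual counts are not in the text, so these are stand-ins.
counts = np.array([[56,  7],
                   [ 5, 42]])

# Cost-benefit matrix from the text: b(Y,p)=99, b(Y,n)=-1, b(N,p)=0, b(N,n)=0
benefits = np.array([[99, -1],
                     [ 0,  0]])

rates = counts / counts.sum()               # p(h, a): counts reduced to probabilities
expected_profit = (rates * benefits).sum()  # sum over cells of p(h,a) * b(h,a)
print(round(expected_profit, 2))            # expected profit per targeted consumer, about $50 here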
