Charu C. Aggarwal
IBM T. J. Watson Research Center
Yorktown Heights, NY
[email protected]
6.1 Introduction
Most classification methods are based on building a model in the training phase, and then using
this model for specific test instances, during the actual classification phase. Thus, the classification
process is usually a two-phase approach that is cleanly separated between processing training and
test instances. As discussed in the introduction chapter of this book, these two phases are as follows:
• Training Phase: In this phase, a model is constructed from the training instances.
• Testing Phase: In this phase, the model is used to assign a label to an unlabeled test instance.
Examples of models that are created during the first phase of training are decision trees, rule-based
methods, neural networks, and support vector machines. Thus, the first phase creates pre-compiled
abstractions or models for learning tasks. This is also referred to as eager learning, because the
models are constructed in an eager way, without waiting for the test instance. In instance-based
learning, this clean separation between the training and testing phases is usually not present. The
specific instance that needs to be classified is used to create a model that is local to that test
instance. The classical example of an instance-based learning algorithm is the k-nearest neighbor
classification algorithm, in which the k nearest neighbors of the test instance are used in order to
create a local model for it. An example of a local model using the k nearest neighbors could
be that the majority class in this set of k instances is reported as the corresponding label, though
more complex models are also possible. Instance-based learning is also sometimes referred to as
lazy learning, since most of the computational work is not done upfront, and one waits to obtain the
test instance, before creating a model for it [9]. Clearly, instance-based learning has a different set
of tradeoffs, in that it requires very little or no processing for creating a global abstraction of the
training data, but can sometimes be expensive at classification time. This is because instance-based
learning typically has to determine the relevant local instances, and create a local model from these
instances at classification time. While the obvious way to create a local model is to use a k-nearest
neighbor classifier, numerous other kinds of lazy solutions are possible, which combine the power
of lazy learning with other models such as locally-weighted regression, decision trees, rule-based
methods, and SVM classifiers [15, 36, 40, 77]. This chapter will discuss all these different scenarios.
It is possible to use the traditional “eager” learning methods such as Bayes methods [38], SVM
methods [40], decision trees [62], or neural networks [64] in order to improve the effectiveness of
local learning algorithms, by applying them only on the local neighborhood of the test instance at
classification time.
It should also be pointed out that many instance-based algorithms may require a pre-processing
phase in order to improve the efficiency of the approach. For example, the efficiency of a nearest
neighbor classifier can be improved by building a similarity index on the training instances. In spite
of this pre-processing phase, such an approach is still considered lazy learning or instance-based
learning, since the pre-processing phase does not create a classification model, but only a data structure that
enables efficient implementation of the run-time modeling for a given test instance.
Instance-based learning is related to but not quite the same as case-based reasoning [1, 60, 67],
in which previous examples may be used in order to make predictions about specific test instances.
Such systems can modify cases or use parts of cases in order to make predictions. Instance-based
methods can be viewed as a particular kind of case-based approach, which uses specific kinds of
algorithms for instance-based classification. The framework of instance-based algorithms is more
amenable to reducing computational and storage requirements, and to handling noise and irrelevant attributes.
However, these terminologies are not clearly distinct from one another, because many authors use
the term “case-based learning” in order to refer to instance-based learning algorithms. Instance-
specific learning can even be extended to distance function learning, where instance-specific dis-
tance functions are learned, which are local to the query instance [76].
Instance-based learning methods have several advantages and disadvantages over traditional
learning methods. The lazy aspect of instance-based learning is its greatest advantage. The global
pre-processing approach of eager learning algorithms is inherently myopic to the characteristics
of specific test instances, and may create a model that is not optimized towards any particular test
instance. The advantage of instance-based learning methods is that they can be used in order to
create models that are optimized to specific test instances. On the other hand, this can come at a
cost, since the computational load of performing the classification can be high. As a result, it may
often not be possible to create complex models because of the computational requirements. In some
cases, this may lead to oversimplification. Clearly, the usefulness of instance-based learning (as with
all other classes of methods) depends highly upon the data domain, the size of the data, data noisiness, and
dimensionality. These aspects will be covered in some detail in this chapter.
This chapter will provide an overview of the basic framework for instance-based learning, and
the many algorithms that are commonly used in this domain. Some of the important methods such as
nearest neighbor classification will be discussed in more detail, whereas others will be covered at a
much higher level. This chapter is organized as follows. Section 6.2 introduces the basic framework
for instance-based learning. The most well-known instance-based method is the nearest neighbor
classifier. This is discussed in Section 6.3. Lazy SVM classifiers are discussed in Section 6.4. Lo-
cally weighted methods for regression are discussed in Section 6.5. Locally weighted naive Bayes
methods are introduced in Section 6.6. Methods for constructing lazy decision trees are discussed
in Section 6.7. Lazy rule-based classifiers are discussed in Section 6.8. Methods for using neural
networks in the form of radial basis functions are discussed in Section 6.9. The advantages of lazy
learning for diagnostic classification are discussed in Section 6.10. The conclusions and summary
are discussed in Section 6.11.
However, a broader and more powerful principle to characterize such methods would be:
Similar instances are easier to model with a learning algorithm, because of the simplification of
the class distribution within the locality of a test instance.
Note that the latter principle is a bit more general than the former, in that the former principle
seems to advocate the use of a nearest neighbor classifier, whereas the latter principle seems to sug-
gest that locally optimized models to the test instance are usually more effective. Thus, according
to the latter philosophy, a vanilla nearest neighbor approach may not always obtain the most accu-
rate results, but a locally optimized regression classifier, Bayes method, SVM or decision tree may
sometimes obtain better results because of the simplified modeling process [18, 28, 38, 77, 79]. This
class of methods is often referred to as lazy learning, and often treated differently from traditional
instance-based learning methods, which correspond to nearest neighbor classifiers. Nevertheless,
the two classes of methods are closely related enough to merit a unified treatment. Therefore, this
chapter will study both the traditional instance-based learning methods and lazy learning methods
within a single generalized umbrella of instance-based learning methods.
The primary output of an instance-based algorithm is a concept description. As in the case of
a classification model, this is a function that maps instances to category values. However, unlike
traditional classifiers, which use intensional concept descriptions (i.e., explicit abstractions), instance-based
concept descriptions typically contain a set of stored instances, and optionally some information about how the
stored instances may have performed in the past during classification. The set of stored instances
can change as more instances are classified over time. This, however, is dependent upon the under-
lying classification scenario being temporal in nature. There are three primary components in all
instance-based learning algorithms.
1. Similarity or Distance Function: This computes the similarities between the training in-
stances, or between the test instance and the training instances. This is used to identify a
locality around the test instance.
2. Classification Function: This yields a classification for a particular test instance using the
locality identified by the distance function. In the earliest descriptions
of instance-based learning, a nearest neighbor classifier was assumed, though this was later
expanded to the use of any kind of locally optimized model.
3. Concept Description Updater: This typically tracks the classification performance, and makes
decisions on the choice of instances to include in the concept description.
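To make these components concrete, the following minimal sketch wires them together for a simple nearest neighbor learner. The Euclidean similarity function, the majority-vote classification function, and the IB2-style updater (which stores an instance only when the current concept description misclassifies it) are illustrative assumptions rather than prescribed choices.

```python
import numpy as np

class InstanceBasedLearner:
    """Minimal sketch of the three components of an instance-based learner."""

    def __init__(self, k=3):
        self.k = k
        self.X = None   # stored instances (the concept description)
        self.y = None   # their labels

    # 1. Similarity / distance function
    def _distances(self, x):
        return np.linalg.norm(self.X - x, axis=1)

    # 2. Classification function: majority vote over the k nearest stored instances
    def classify(self, x):
        d = self._distances(x)
        nearest = np.argsort(d)[:self.k]
        labels, counts = np.unique(self.y[nearest], return_counts=True)
        return labels[np.argmax(counts)]

    # 3. Concept description updater: here, an IB2-style policy that stores an
    #    instance only if the current concept description misclassifies it.
    def update(self, x, label):
        if self.X is None:
            self.X, self.y = np.array([x]), np.array([label])
        elif self.classify(x) != label:
            self.X = np.vstack([self.X, x])
            self.y = np.append(self.y, label)

# Usage: feed labeled instances one by one; classify test instances lazily.
learner = InstanceBasedLearner(k=3)
rng = np.random.default_rng(0)
for _ in range(200):
    x = rng.normal(size=2)
    learner.update(x, int(x[0] + x[1] > 0))
print(learner.classify(np.array([0.5, 0.5])))
```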
Traditional classification algorithms build explicit abstractions and generalizations (e.g., de-
cision trees or rules), which are constructed in an eager way in a pre-processing phase, and are
independent of the choice of the test instance. These models are then used in order to classify test
instances. This is different from instance-based learning algorithms, where the test instance is used along
with the training data to construct the concept description. Thus, the approach is lazy in the sense
that knowledge of the test instance is required before model construction. Clearly, the tradeoffs
are different in the sense that “eager” algorithms avoid too much work at classification time, but
are myopic in their ability to create a specific model for a test instance in the most accurate way.
Instance-based algorithms face many challenges involving efficiency, attribute noise, and signifi-
cant storage requirements. A work that analyzes the last aspect of storage requirements is discussed
in [72].
While nearest neighbor methods are almost always used as an intermediate step for identifying
data locality, a variety of techniques have been explored in the literature beyond a majority vote
on the identified locality. Traditional modeling techniques such as decision trees, regression model-
ing, Bayes, or rule-based methods are commonly used to create an optimized classification model
around the test instance. It is the optimization inherent in this localization that provides the great-
est advantages of instance-based learning. In some cases, these methods are also combined with
some level of global pre-processing so as to create a combination of instance-based and model-
based algorithms [55]. In any case, many instance-based methods combine typical classification
generalizations such as regression-based methods [15], SVMs [54, 77], rule-based methods [36], or
decision trees [40] with instance-based methods. Even in the case of pure distance-based methods,
some amount of model building may be required at an early phase for learning the underlying dis-
tance functions [75]. This chapter will also discuss such techniques within the broader category of
instance-based methods.
point of view. It has been shown in [31] that the nearest neighbor rule provides at most twice the
error of the optimal Bayes classifier.
Such an approach may sometimes not be appropriate for imbalanced data sets, in which the rare
class may not be present to a significant degree among the nearest neighbors, even when the test
instance belongs to the rare class. In the case of cost-sensitive classification or rare-class learning, the
majority class is determined after weighting the instances with the relevant costs. These methods
will be discussed in detail in Chapter 17 on rare class learning. In cases where the class label is
continuous (regression modeling problem), one may use the weighted average numeric values of
the target class. Numerous variations on this broad approach are possible, both in terms of the
distance function used or the local model used for the classification process.
• The choice of the distance function clearly affects the behavior of the underlying classifier. In
fact, the problem of distance function learning [75] is closely related to that of instance-based
learning since nearest neighbor classifiers are often used to validate distance-function learning
algorithms. For example, for numerical data, the use of the Euclidean distance assumes a
spherical shape of the clusters created by different classes. On the other hand, the true clusters
may be ellipsoidal and arbitrarily oriented with respect to the axis system. Different distance
functions may work better in different scenarios. The use of feature-weighting [69] can also
change the distance function, since the weighting can change the contour of the distance
function to match the patterns in the underlying data more closely.
• The final step of selecting the model from the local test instances may vary with the appli-
cation. For example, one may use the majority class as the relevant one for classification, a
cost-weighted majority vote, or a more complex classifier within the locality such as a Bayes
technique [38, 78].
One of the nice characteristics of the nearest neighbor classification approach is that it can be used
for practically any data type, as long as a distance function is available to quantify the distances
between objects. Distance functions are often designed with a specific focus on the classification
task [21]. Distance function design is a widely studied topic in many domains such as time-series
data [42], categorical data [22], text data [56], multimedia data [58], and biological data [14].
Entropy-based measures [29] are more appropriate for domains such as strings, in which the dis-
tances are measured in terms of the amount of effort required to transform one instance to the other.
Therefore, the simple nearest neighbor approach can be easily adapted to virtually every data do-
main. This is a clear advantage in terms of usability. A detailed discussion of different aspects of
distance function design may be found in [75].
A key issue with the use of nearest neighbor classifiers is the efficiency of the approach in the
classification process. This is because the retrieval of the k nearest neighbors may require a running
time that is linear in the size of the data set. With the increase in typical data sizes over the last few
years, this continues to be a significant problem [13]. Therefore, it is useful to create indexes, which
can efficiently retrieve the k nearest neighbors of the underlying data. This is generally possible
for many data domains, but may not be true of all data domains in general. Therefore, scalability
is often a challenge in the use of such algorithms. A common strategy is to use either indexing of
the underlying instances [57], sampling of the data, or aggregation of some of the data points into
smaller clustered pseudo-points in order to improve efficiency. While the indexing strategy seems
to be the most natural, it rarely works well in the high dimensional case, because of the curse of
dimensionality. Many data sets are also very high dimensional, in which case a nearest neighbor
index fails to prune out a significant fraction of the data points, and may in fact do worse than a
sequential scan, because of the additional overhead of indexing computations.
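As an illustration of the indexing strategy, the sketch below builds a k-d tree (using SciPy's cKDTree as one assumed implementation) over the training instances, so that the k nearest neighbors can be retrieved without a full sequential scan; as noted above, this benefit tends to disappear in high-dimensional data.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10000, 8))          # low-dimensional data, where k-d trees help
y_train = (X_train[:, 0] > 0).astype(int)

tree = cKDTree(X_train)                        # one-time pre-processing step

def knn_classify(x, k=5):
    # Retrieve the k nearest neighbors through the index instead of scanning all points.
    _, idx = tree.query(x, k=k)
    votes = np.bincount(y_train[idx])
    return int(np.argmax(votes))

print(knn_classify(np.array([0.3, 0, 0, 0, 0, 0, 0, 0])))
```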
Such issues are particularly challenging in the streaming scenario. A common strategy is to use
very fine grained clustering [5, 7] in order to replace multiple local instances within a small cluster
(belonging to the same class) with a pseudo-point of that class. Typically, this pseudo-point is the
centroid of a small cluster. Then, it is possible to apply a nearest neighbor method on these pseudo-
points in order to obtain the results more efficiently. Such a method is desirable when scalability is
of great concern and the data has very high volume. Such a method may also reduce the noise that is
associated with the use of individual instances for classification. An example of such an approach is
provided in [7], where classification is performed on a fast data stream, by summarizing the stream
into micro-clusters. Each micro-cluster is constrained to contain data points only belonging to a
particular class. The class label of the closest micro-cluster to a particular instance is reported as the
relevant label. Typically, the clustering is performed with respect to different time-horizons, and a
cross-validation approach is used in order to determine the time-horizon that is most relevant at a
given time. Thus, the model is instance-based in a dual sense, since it is not only a nearest neighbor
classifier, but it also determines the relevant time horizon in a lazy way, which is specific to the
time-stamp of the instance. Picking a smaller time horizon for selecting the training data may often
be desirable when the data evolves significantly over time. The streaming scenario also benefits
from laziness in the temporal dimension, since the most appropriate model to use for the same test
instance may vary with time, as the data evolves. It has been shown in [7] that such an “on demand”
approach to modeling provides more effective results than eager classifiers, because of its ability to
optimize for the test instance from a temporal perspective. Another method that is based on the
nearest neighbor approach is proposed in [17]. This approach detects the changes in the distribution
of the data stream on the past window of instances and accordingly re-adjusts the classifier. The
approach can handle symbolic attributes, and it uses the Value Difference Metric (VDM) [60] in order
to measure distances. This metric will be discussed in some detail in Section 6.3.1 on symbolic
attributes.
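A simplified, non-streaming sketch of the pseudo-point idea is shown below: each class is summarized by a small number of class-pure cluster centroids, and a test instance receives the label of its closest centroid. The use of k-means and the number of clusters per class are assumptions made for illustration; this is not the exact micro-clustering procedure of [7].

```python
import numpy as np
from sklearn.cluster import KMeans

def build_pseudo_points(X, y, clusters_per_class=10):
    """Summarize each class by centroids of small, class-pure clusters."""
    centers, labels = [], []
    for c in np.unique(y):
        km = KMeans(n_clusters=clusters_per_class, n_init=10, random_state=0)
        km.fit(X[y == c])
        centers.append(km.cluster_centers_)
        labels.extend([c] * clusters_per_class)
    return np.vstack(centers), np.array(labels)

def classify(x, centers, center_labels):
    # The nearest pseudo-point determines the label.
    d = np.linalg.norm(centers - x, axis=1)
    return center_labels[np.argmin(d)]

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
centers, center_labels = build_pseudo_points(X, y)
print(classify(np.array([0.4, 0.2]), centers, center_labels))
```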
A second approach that is commonly used to speed up the approach is the concept of instance
selection or prototype selection [27, 41, 72, 73, 81]. In these methods, a subset of instances may
be pre-selected from the data, and the model is constructed with the use of these pre-selected in-
stances. It has been shown that a good choice of pre-selected instances can often lead to improvements
in accuracy, in addition to better efficiency [72, 81]. This is because a careful pre-selection of
instances reduces the noise from the underlying training data, and therefore results in better clas-
sification. The pre-selection issue is an important research issue in its own right, and we refer the
reader to [41] for a detailed discussion of this important aspect of instance-based classification. An
empirical comparison of the different instance selection algorithms may be found in [47].
In many rare class or cost-sensitive applications, the instances may need to be weighted differ-
ently corresponding to their importance. For example, consider an application in which it is desirable
to use medical data in order to diagnose a specific condition. The vast majority of results may be nor-
mal, and yet it may be costly to miss a case where an example is abnormal. Furthermore, a nearest
neighbor classifier (which does not weight instances) will be naturally biased towards identifying
instances as normal, especially when they lie at the border of the decision region. In such cases,
costs are associated with instances, where the weight of an abnormal instance is set to the relative
cost of misclassifying it (a false negative), as compared with the cost of misclassifying
a normal instance (a false positive). The weights on the instances are then used for the classification
process.
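A minimal sketch of such cost-weighted voting is shown below, in which each neighbor's vote is multiplied by the misclassification cost of its class, so that a few rare-class neighbors can outweigh many normal ones; the cost values are purely illustrative.

```python
import numpy as np

def cost_weighted_knn(X_train, y_train, x, k=5, class_costs=None):
    """Nearest neighbor vote in which each neighbor's vote is weighted by the
    misclassification cost of its class (here class 1 is the rare/abnormal one)."""
    if class_costs is None:
        class_costs = {0: 1.0, 1: 20.0}    # illustrative costs
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]
    scores = {}
    for idx in nearest:
        c = y_train[idx]
        scores[c] = scores.get(c, 0.0) + class_costs[c]
    return max(scores, key=scores.get)
```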
Another issue with the use of nearest neighbor methods is that they do not work very well when
the dimensionality of the underlying data increases. This is because the quality of the nearest neigh-
bors found decreases with an increasing number of irrelevant attributes [45]. The noise effects associated
with the irrelevant attributes can clearly degrade the quality of the nearest neighbors found, espe-
cially when the dimensionality of the underlying data is high. This is because the cumulative effect
of irrelevant attributes often becomes more pronounced with increasing dimensionality. For the
case of numeric attributes, it has been shown [2] that the use of fractional norms (i.e., Lp-norms for
p < 1) provides superior quality results for nearest neighbor classifiers, whereas the L∞-norm provides
the poorest behavior. Greater improvements may be obtained by designing the distance function
more carefully, and by weighting the more relevant features more heavily. This is an issue that will be discussed in detail in
later subsections.
In this context, the issue of distance-function design is an important one [50]. In fact, an entire
area of machine learning has been focussed on distance function design. Chapter 18 of this book
has been devoted entirely to distance function design, and an excellent survey on the topic may be
found in [75]. A discussion of the applications of different similarity methods for instance-based
classification may be found in [32]. In this section, we will discuss some of the key aspects of
distance-function design, which are important in the context of nearest neighbor classification.
Here, the parameter q can be chosen either on an ad hoc basis, or in a data-driven manner. This
choice of metric has been shown to be quite effective in a variety of instance-centered scenarios
[36, 60]. Detailed discussions of different kinds of similarity functions for symbolic attributes may
be found in [22, 30].
V(j) = \sum_{i: i \in S_k, c_i = j} f(d_i).    (6.2)
Here f(·) is either an increasing or a decreasing function of its argument, depending upon whether di
represents a similarity or a distance, respectively. It should be pointed out that if an appropriate weight
is used, then it is not necessary to restrict the sum to the k nearest neighbors; it may simply be computed
over the entire collection.
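A small sketch of the weighted vote of Equation 6.2 is given below, using the (assumed) decreasing function f(d) = 1/(1 + d) of the distance; any other monotone function could be substituted.

```python
import numpy as np

def weighted_vote(X_train, y_train, x, k=5):
    """Compute V(j) = sum over the k nearest neighbors with class j of f(d_i),
    using f(d) = 1 / (1 + d) as an illustrative decreasing function of distance."""
    d = np.linalg.norm(X_train - x, axis=1)
    S_k = np.argsort(d)[:k]                     # indices of the k nearest neighbors
    V = {}
    for i in S_k:
        V[y_train[i]] = V.get(y_train[i], 0.0) + 1.0 / (1.0 + d[i])
    return max(V, key=V.get)                    # class with the largest weighted vote
```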
[FIGURE 6.1: Illustration of the importance of feature weighting for nearest neighbor classification. (a) Axis-parallel; (b) Arbitrarily oriented.]
The nearest neighbors are computed on this set of distances, and the majority vote among the k
nearest neighbors is reported as the class label. This approach tends to work well because it picks
the k nearest neighbor group in a noise-resistant way. For test instances that lie on the decision
boundaries, it tends to discount the instances that lie on the noisy parts of a decision boundary, and
instead picks the k nearest neighbors that lie away from the noisy boundary. For example, consider a
data point Xi that lies reasonably close to the test instance, but even closer to the decision boundary.
Furthermore, since Xi lies almost on the decision boundary, it lies extremely close to another training
example in a different class. In such a case the example Xi should not be included in the k-nearest
neighbors, because it is not very informative. The small value of ri will often ensure that such an
example is not picked among the k-nearest neighbors. Furthermore, such an approach also ensures
that the distances are scaled and normalized by the varying nature of the patterns of the different
classes in different regions. Because of these factors, it has been shown in [65] that this modification
often yields more robust results for the quality of classification.
to be zero, that attribute is eliminated completely. This can be considered an implicit form of fea-
ture selection. Thus, for two d-dimensional records X = (x1 . . . xd ) and Y = (y1 . . . yd ), the feature
weighted distance d(X,Y ,W ) with respect to a d-dimensional vector of weights W = (w1 . . . wd ) is
defined as follows:
d(X, Y, W) = \sum_{i=1}^{d} w_i \cdot (x_i - y_i)^2.    (6.4)
For example, consider the data distribution, illustrated in Figure 6.1(a). In this case, it is evident
that the feature X is much more discriminative than feature Y , and should therefore be weighted to
a greater degree in the final classification. In this case, almost perfect classification can be obtained
with the use of feature X , though this will usually not be the case, since the decision boundary is
noisy in nature. It should be pointed out that the Euclidean metric has a spherical decision boundary,
whereas the decision boundary in this case is linear. This results in a bias in the classification pro-
cess because of the difference between the shape of the model boundary and the true boundaries in
the data. The importance of weighting the features becomes significantly greater, when the classes
are not cleanly separated by a decision boundary. In such cases, the natural noise at the decision
boundary may combine with the significant bias introduced by an unweighted Euclidean distance,
and result in even more inaccurate classification. By weighting the features in the Euclidean dis-
tance, it is possible to elongate the model boundaries to a shape that aligns more closely with the
class-separation boundaries in the data. The simplest possible weight to use would be to normalize
each dimension by its standard deviation, though in practice, the class label is used in order to deter-
mine the best feature weighting [48]. A detailed discussion of different aspects of feature weighting
schemes is provided in [70].
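The sketch below implements the feature-weighted distance of Equation 6.4 with the simplest weighting mentioned above, namely normalizing each dimension by its standard deviation (w_i = 1/σ_i²); supervised, class-driven weights such as those of [48] could be plugged in instead.

```python
import numpy as np

def weighted_distance(x, y, w):
    """Feature-weighted distance of Equation 6.4 (a square root may be applied;
    it does not change the ranking of neighbors)."""
    return np.sum(w * (x - y) ** 2)

def knn_weighted(X_train, y_train, x, k=5):
    # Simple data-driven weights: normalize each dimension by its standard
    # deviation, i.e., w_i = 1 / sigma_i^2.  Supervised weights could be used instead.
    w = 1.0 / (np.var(X_train, axis=0) + 1e-12)
    d = np.array([weighted_distance(x, z, w) for z in X_train])
    nearest = np.argsort(d)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```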
In some cases, the class distribution may not be aligned neatly along the axis system, but may
be arbitrarily oriented along different directions in the data, as in Figure 6.1(b). A more general
distance metric is defined with respect to a d × d matrix A rather than a vector of weights W .
d(X, Y, A) = (X - Y)^T \cdot A \cdot (X - Y).    (6.5)
The matrix A is also sometimes referred to as a metric. The value of A is assumed to be the inverse
of the covariance matrix of the data in the standard definition of the Mahalanobis distance for un-
supervised applications. Generally, the Mahalanobis distance is more sensitive to the global data
distribution and provides more effective results.
The Mahalanobis distance does not, however, take the class distribution into account. In su-
pervised applications, it makes much more sense to pick A based on the class distribution of the
underlying data. The core idea is to “elongate” the neighborhoods along less discriminative direc-
tions, and to shrink the neighborhoods along more discriminative dimensions. Thus, in the modified
metric, a small (unweighted) step along a discriminative direction would result in a relatively greater
distance. This naturally provides greater importance to more discriminative directions. Numerous
methods such as the linear discriminant [51] can be used in order to determine the most discrim-
inative dimensions in the underlying data. However, the key here is to use a soft weighting of the
different directions, rather than selecting specific dimensions in a hard way. The goal of the matrix A
is to accomplish this. How can A be determined by using the distribution of the classes? Clearly, the
matrix A should somehow depend on the within-class variance and between-class variance, in the
context of linear discriminant analysis. The matrix A defines the shape of the neighborhood within a
threshold distance of a given test instance. The neighborhood directions with a low ratio of inter-class
variance to intra-class variance should be elongated, whereas the directions with high ratio of the
inter-class to intra-class variance should be shrunk. Note that the “elongation” of a neighborhood
direction is achieved by scaling that component of the distance by a smaller factor, and therefore
de-emphasizing that direction.
Let D be the full database, and Di be the portion of the data set belonging to class i. Let µ
represent the mean of the entire data set. Let pi = |Di |/|D | be the fraction of records belonging to
class i, µi be the d-dimensional row vector of means of Di , and Σi be the d × d covariance matrix of
Di. Then, the scaled¹ within-class scatter matrix Sw is defined as follows:
S_w = \sum_{i=1}^{k} p_i \cdot \Sigma_i.    (6.6)
The between-class scatter matrix S_b is defined analogously in terms of the class means:

S_b = \sum_{i=1}^{k} p_i \cdot (\mu_i - \mu)^T \cdot (\mu_i - \mu).    (6.7)

Note that the matrix S_b is a d × d matrix, since it results from the product of a d × 1 matrix with
a 1 × d matrix. Then, the matrix A (of Equation 6.5), which provides the desired distortion of the
distances on the basis of the class distribution, can be shown to be the following:

A = S_w^{-1} \cdot S_b \cdot S_w^{-1}.    (6.8)
It can be shown that this choice of the metric A provides an excellent discrimination between the dif-
ferent classes, where the elongation in each direction depends inversely on the ratio of the between-class
variance to within-class variance along the different directions. The aforementioned description is
based on the discussion in [44]. The reader may find more details of implementing the approach in
an effective way in that work.
A few special cases of the metric of Equation 6.5 are noteworthy. Setting A to the identity
matrix corresponds to the use of the Euclidean distance. Setting the non-diagonal entries of A
to zero results in a situation similar to using a d-dimensional vector of weights for the individual dimensions.
Therefore, the non-diagonal entries contribute to a rotation of the axis-system before the stretching
process. For example, in Figure 6.1(b), the optimal choice of the matrix A will result in greater
importance being given to the direction illustrated by the arrow in the figure in the resulting metric.
In order to avoid ill-conditioned matrices, especially in the case when the number of training data
points is small, a parameter ε can be used in order to perform smoothing.
A = S_w^{-1/2} \cdot (S_w^{-1/2} \cdot S_b \cdot S_w^{-1/2} + \epsilon \cdot I) \cdot S_w^{-1/2}.    (6.9)
Here ε is a small parameter that can be tuned, and the identity matrix is represented by I. The
use of this modification ensures that no particular direction receives infinite weight, which is otherwise
quite possible when the number of data points is small. The use of this parameter ε is analogous to
Laplacian smoothing methods, and is designed to avoid overfitting.
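The following sketch computes the smoothed metric of Equation 6.9 from a labeled (possibly local) data sample. It assumes the standard between-class scatter matrix built from the class means, as described above, and uses a symmetric eigendecomposition to obtain S_w^{-1/2}.

```python
import numpy as np

def inverse_sqrt(M, tol=1e-10):
    """Symmetric inverse square root via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    vals = np.maximum(vals, tol)                 # guard against tiny/negative eigenvalues
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def smoothed_metric(X, y, eps=0.1):
    """Compute A = Sw^(-1/2) (Sw^(-1/2) Sb Sw^(-1/2) + eps*I) Sw^(-1/2)  (Equation 6.9)."""
    classes = np.unique(y)
    n, d = X.shape
    mu = X.mean(axis=0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        p = len(Xc) / n
        Sw += p * np.cov(Xc, rowvar=False, bias=True)   # scaled within-class scatter (Eq. 6.6)
        diff = (Xc.mean(axis=0) - mu).reshape(-1, 1)
        Sb += p * diff @ diff.T                         # assumed between-class scatter
    W = inverse_sqrt(Sw)
    I = np.eye(d)
    return W @ (W @ Sb @ W + eps * I) @ W

def metric_distance(x, z, A):
    """Generalized distance of Equation 6.5 under the metric A."""
    diff = x - z
    return float(diff @ A @ diff)
```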
Other heuristic methodologies are also used in the literature for learning feature relevance. One
common methodology is to use cross-validation in which the weights are trained using the original
instances in order to minimize the error rate. Details of such a methodology are provided in [10,48].
It is also possible to use different kinds of search methods, such as Tabu search [61], in order to
improve the process of learning weights. This kind of approach has also been used in the context
of text classification, by learning the relative importance of the different words, for computing the
similarity function [43]. Feature relevance has also been shown to be important for other domains
such as image retrieval [53]. In cases where domain knowledge can be used, some features can
be eliminated very easily with tremendous performance gains. The importance of using domain
knowledge for feature weighting has been discussed in [26].
¹The unscaled version may be obtained by multiplying Sw with the number of data points. There is no difference in
the final result, whether the scaled or the unscaled version is used, within a constant of proportionality.
[FIGURE 6.2: Illustration of the importance of local adaptivity for nearest neighbor classification. (a) Axis-parallel (localities A and B); (b) Arbitrarily oriented.]
The most effective distance function design may be performed at query-time, by using a weight-
ing that is specific to the test instance [34, 44]. This can be done by learning the weights only on the
basis of the instances that are near a specific test instance. While such an approach is more expen-
sive than global weighting, it is likely to be more effective because of local optimization specific to
the test instance, and is better aligned with the spirit and advantages of lazy learning. This approach
will be discussed in detail in later sections of this survey.
It should be pointed out that many of these algorithms can be considered rudimentary forms of
distance function learning. The problem of distance function learning [3, 21, 75] is fundamental to
the design of a wide variety of data mining algorithms including nearest neighbor classification [68],
and a significant amount of research has been performed in the literature in the context of the
classification task. A detailed survey on this topic may be found in [75]. The nearest neighbor
classifier is often used as the prototypical task in order to evaluate the quality of distance functions
[2, 3, 21, 45, 68, 75], which are constructed using distance function learning techniques.
in the form of different labels in order to make inferences about the relevance of the different di-
mensions. A recent method for distance function learning [76] also constructs instance-specific
distances with supervision, and shows that the use of locality provides superior results. However,
the supervision in this case is not specifically focussed on the traditional classification problem,
since it is defined in terms of similarity or dissimilarity constraints between instances, rather than
labels attached to instances. Nevertheless, such an approach can also be used in order to construct
instance-specific distances for classification, by transforming the class labels into similarity or dis-
similarity constraints. This general principle is used frequently in many works such as those dis-
cussed by [34, 39, 44] for locally adaptive nearest neighbor classification.
Labels provide a rich level of information about the relative importance of different dimensions
in different localities. For example, consider the illustration of Figure 6.2(a). In this case, the feature
Y is more important in locality A of the data, whereas feature X is more important in locality B
of the data. Correspondingly, the feature Y should be weighted more, when using test instances in
locality A of the data, whereas feature X should be weighted more in locality B of the data. In the
case of Figure 6.2(b), we have shown a similar scenario as in Figure 6.2(a), except that different
directions in the data should be considered more or less important. Thus, the case in Figure 6.2(b)
is simply an arbitrarily oriented generalization of the challenges faced in Figure 6.2(a).
One of the earliest methods for performing locally adaptive nearest neighbor classification was
proposed in [44], in which the matrix A (of Equation 6.5) and the neighborhood of the test data point
X are learned iteratively from one another in a local way. The major difference here is that the matrix
A will depend upon the choice of test instance X , since the metric is designed to be locally adaptive.
The algorithm starts off by setting the d × d matrix A to the identity matrix, and then determines the
k-nearest neighbors Nk , using the generalized distance described by Equation 6.5. Then, the value of
A is updated using Equation 6.8 only on the neighborhood points Nk found in the last iteration. This
procedure is repeated till convergence. Thus, the overall iterative procedure may be summarized as
follows:
1. Determine Nk as the set of the k nearest neighbors of the test instance, using Equation 6.5 in
conjunction with the current value of A.
2. Update A using Equation 6.8 on the between-class and within-class scatter matrices of Nk .
At completion, the matrix A is used in Equation 6.5 for k-nearest neighbor classification. In practice,
Equation 6.9 is used in order to avoid giving infinite weight to any particular direction. It should
be pointed out that the approach in [44] is quite similar in principle to an unpublished approach
proposed previously in [39], though it varies somewhat on the specific details of the implementation,
and is also somewhat more robust. Subsequently, a similar approach to [44] was proposed in [34],
which works well with limited data. It should be pointed out that linear discriminant analysis is
not the only method that may be used for “deforming” the shape of the metric in an instance-
specific way to conform better to the decision boundary. Any other model such as an SVM or neural
network that can find important directions of discrimination in the data may be used in order to
determine the most relevant metric. For example, an interesting work discussed in [63] determines
a small number of k-nearest neighbors for each class, in order to span a linear subspace for that
class. The classification is done not on the basis of distances to prototypes, but on the distances to subspaces.
The intuition is that the linear subspaces essentially generate pseudo training examples for that
class. A number of methods that use support-vector machines will be discussed in the section on
Support Vector Machines. A review and discussion of different kinds of feature selection and feature
weighting schemes are provided in [8]. It has also been suggested in [48] that feature weighting
may be superior to feature selection, in cases where the different features have different levels of
relevance.
the label i, and E^0(T, i) be the event that the test instance T does not contain the label i. Then, in
order to determine whether or not the label i is included in test instance T, the maximum posterior
principle is used:

b = \text{argmax}_{b \in \{0,1\}} \{ P(E^b(T, i) | C) \}.    (6.10)
In other words, we pick whichever of the two events (label i being included or not) has the greater
posterior probability. Therefore, the Bayes rule can be used in order to obtain the following:
b = \text{argmax}_{b \in \{0,1\}} \frac{P(E^b(T, i)) \cdot P(C | E^b(T, i))}{P(C)}.    (6.11)
Since the value of P(C) is independent of b, it is possible to remove it from the denominator, without
affecting the maximum argument. This is a standard approach used in all Bayes methods. Therefore,
the best matching label may be expressed as follows:
b = \text{argmax}_{b \in \{0,1\}} P(E^b(T, i)) \cdot P(C | E^b(T, i)).    (6.12)
The prior probability P(E^b(T, i)) can be estimated as the fraction of training instances that contain (for
b = 1) or do not contain (for b = 0) the label i. The value of P(C|E^b(T, i)) can be estimated by using the naive Bayes rule.
P(C | E^b(T, i)) = \prod_{j=1}^{n} P(C(j) | E^b(T, i)).    (6.13)
Each of the terms P(C(j)|E^b(T, i)) can be estimated in a data-driven manner by examining, among
the training instances satisfying the value b for label i, the fraction whose neighborhoods contain exactly the count C(j)
for the label j. Laplacian smoothing is also performed in order to avoid ill-conditioned probabilities.
Thus, the correlation between the labels is accounted for by the use of this approach, since each of
the terms P(C( j)|E b (T, i)) indirectly measures the correlation between the labels i and j.
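The following is a rough sketch, and not the exact estimation procedure of [78]: the vector C of label counts among the k nearest neighbors is computed for each training instance, Laplace-smoothed frequency tables for P(C(j) | E^b(·, i)) are built from these counts, and the label memberships of a test instance are then decided by Equation 6.12 in log-space.

```python
import numpy as np

def neighbor_label_counts(X, Y, x, k):
    """C(j): number of the k nearest neighbors of x that carry label j.
    Y is a binary (num_instances, num_labels) indicator matrix."""
    d = np.linalg.norm(X - x, axis=1)
    nn = np.argsort(d)[:k]
    return Y[nn].sum(axis=0).astype(int)

def fit_tables(X, Y, k):
    """Laplace-smoothed estimates of P(C(j) = c | E^b(., i)) and the priors P(E^b(., i))."""
    n, L = Y.shape
    table = np.ones((2, L, L, k + 1))            # start at 1 for Laplacian smoothing
    for t in range(n):
        X_rest, Y_rest = np.delete(X, t, axis=0), np.delete(Y, t, axis=0)
        C = neighbor_label_counts(X_rest, Y_rest, X[t], k)
        for i in range(L):
            b = int(Y[t, i])
            # Increment the count-cell of every label j at once.
            table[b, i, np.arange(L), C] += 1.0
    table /= table.sum(axis=3, keepdims=True)
    prior = (Y.sum(axis=0) + 1.0) / (n + 2.0)    # smoothed P(E^1(., i))
    return table, prior

def predict_labels(X, Y, x, k, table, prior):
    C = neighbor_label_counts(X, Y, x, k)
    L = Y.shape[1]
    pred = np.zeros(L, dtype=int)
    cols = np.arange(L)
    for i in range(L):
        s1 = np.log(prior[i]) + np.log(table[1, i, cols, C]).sum()        # Equation 6.12, b = 1
        s0 = np.log(1.0 - prior[i]) + np.log(table[0, i, cols, C]).sum()  # Equation 6.12, b = 0
        pred[i] = int(s1 > s0)
    return pred
```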
This approach is often popularly understood in the literature as a nearest neighbor approach,
and has therefore been discussed in the section on nearest neighbor methods. However, it is more
similar to a local naive Bayes approach (discussed in Section 6.6) rather than a distance-based
approach. This is because the statistical frequencies of the neighborhood labels are used for local
Bayes modeling. Such an approach can also be used for the standard version of the classification
problem (when each instance is associated with exactly one label) by using the statistical behavior
of the neighborhood features (rather than label frequencies). This yields a lazy Bayes approach for
classification [38]. However, the work in [38] also estimates the Bayes probabilities locally only over
the neighborhood in a data driven manner. Thus, the approach in [38] sharpens the use of locality
even further for classification. This is of course a tradeoff, depending upon the amount of training
data available. If more training data is available, then local sharpening is likely to be effective. On
the other hand, if less training data is available, then local sharpening is not advisable, because it
will lead to difficulties in robust estimations of conditional probabilities from a small amount of
data. This approach will be discussed in some detail in Section 6.6.
If desired, it is possible to combine the two methods discussed in [38] and [78] for multi-label
learning in order to learn the information in both the features and labels. This can be done by
using both feature and label frequencies for the modeling process, and the product of the label-
based and feature-based Bayes probabilities may be used for classification. The extension to that
case is straightforward, since it requires the multiplication of Bayes probabilities derived form two
different methods. An experimental study of several variations of nearest neighbor algorithms for
classification in the multi-label scenario is provided in [59].
A number of optimizations such as caching have been proposed in [77] in order to improve the
efficiency of the approach. Local SVM classifiers have been used quite successfully for a variety of
applications such as spam filtering [20].
[Figure: two panels illustrating the KNN-defined locality (regions A and B) over attributes X and Y.]
Therefore, the weight decreases linearly with the distance, but cannot decrease beyond 0, once the
k-nearest neighbor has been reached. The naive Bayes method is applied in a standard way, except
that the instance weights are used in estimating all the probabilities for the Bayes classifier. Higher
values of k will result in models that do not fluctuate much with variations in the data, whereas very
small values of k will result in models that fit the noise in the data. It has been shown in [38] that
the approach is not too sensitive to the choice of k within a reasonably modest range of values of
k. Other schemes have been proposed in the literature, which use the advantages of lazy learning in
conjunction with the naive Bayes classifier. The work in [79] fuses a standard rule-based learner with
naive Bayes models. A technique discussed in [74] lazily learns multiple Naive Bayes classifiers,
and uses the classifier with the highest estimated accuracy in order to decide the label for the test
instance.
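A compact sketch of a locally weighted naive Bayes classifier in this spirit is shown below. It uses the linearly decreasing weight w_i = max(0, 1 − d_i/d_k), an assumed form consistent with the description above, together with Gaussian per-feature class-conditional estimates; the details differ from the exact procedure of [38].

```python
import numpy as np

def locally_weighted_nb(X_train, y_train, x, k=30):
    """Locally weighted naive Bayes: neighbors receive linearly decreasing
    weights, which are then used in all probability estimates."""
    d = np.linalg.norm(X_train - x, axis=1)
    d_k = np.sort(d)[min(k, len(d)) - 1] + 1e-12
    w = np.maximum(0.0, 1.0 - d / d_k)          # assumed linear weighting scheme
    classes = np.unique(y_train)
    log_post = []
    for c in classes:
        wc = w[y_train == c]
        Xc = X_train[y_train == c]
        wsum = wc.sum() + 1e-12
        prior = (wsum + 1.0) / (w.sum() + len(classes))   # weighted, smoothed prior
        mean = (wc[:, None] * Xc).sum(axis=0) / wsum      # weighted feature means
        var = (wc[:, None] * (Xc - mean) ** 2).sum(axis=0) / wsum + 1e-6
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
        log_post.append(np.log(prior) + log_lik)
    return classes[int(np.argmax(log_post))]
```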
The work discussed in [78] can also be viewed as an unweighted local Bayes classifier, where
the weights used for all instances are 1, rather than the weighted approach discussed above. Further-
more, the work in [78] uses the frequencies of the labels for Bayes modeling, rather than the actual
features themselves. The idea in [78] is that the other labels themselves serve as features, since the
same instance may contain multiple labels, and sufficient correlation information is available for
learning. This approach is discussed in detail in Section 6.3.7.
many relevant features, only a small subset of them may be used for splitting. When a data set
contains N data points, a decision tree is allowed only O(log(N)) (approximately balanced) splits,
and this may be too small in order to use the best set of features for a particular test instance. Clearly,
the knowledge of the test instance allows the use of more relevant features for construction of the
appropriate decision path at a higher level of the tree construction.
The additional knowledge of the test instance helps in the recursive construction of a path in
which only relevant features are used. One method proposed in [40] is to use a split criterion, which
successively reduces the size of the training data associated with the test instance, until either all in-
stances have the same class label, or the same set of features. In both cases, the majority class
label is reported as the relevant class. In order to discard a set of irrelevant instances in a particular
iteration, any standard decision tree split criterion is used, and only those training instances that
satisfy the predicate in the same way as the test instance will be selected for the next level of the
decision path. The split criterion is decided using any standard decision tree methodology such as
the normalized entropy or the gini-index. The main difference from the split process in the tradi-
tional decision tree is that only the node containing the test instance is relevant in the split, and the
information gain or gini-index is computed on the basis of this node. One challenge with the use of
such an approach is that the information gain in a single node can actually be negative if the original
data is imbalanced. In order to avoid this problem, the training examples are re-weighted, so that the
aggregate weight of each class is the same. It is also relatively easy to deal with missing attributes
in test instances, since the split only needs to be performed on attributes that are present in the test
instance. It has been shown [40] that such an approach yields better classification results, because
of the additional knowledge associated with the test instance during the decision path construction
process.
A particular observation here is noteworthy, since such decision paths can also be used to con-
struct a robust any-time classifier, with the use of principles associated with a random forest ap-
proach [25]. It should be pointed out that a random forest translates to a random path created by
a random set of splits from the instance-centered perspective. Therefore a natural way to imple-
ment the instance-centered random forest approach would be to discretize the data into ranges. A
test instance will be relevant to exactly one range from each attribute. A random set of attributes is
selected, and the intersection of the ranges provides one possible classification of the test instance.
This approach can be repeated in order to provide a very efficient lazy ensemble, and the number
of samples provides the tradeoff between running time and accuracy. Such an approach can be used
in the context of an any-time approach in resource-constrained scenarios. It has been shown in [4]
how such an approach can be used for efficient any-time classification of data streams.
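A sketch of this lazy, instance-centered ensemble is given below: each attribute is discretized into equal-width ranges, each ensemble member selects a random subset of attributes, and the training instances that fall into the same range as the test instance on all selected attributes vote for its label. The bin count, subset size, and number of members are illustrative parameters, not those of [4].

```python
import numpy as np

def lazy_range_ensemble(X_train, y_train, x, n_members=25, n_attrs=3, n_bins=5, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    width = (hi - lo) / n_bins + 1e-12
    bins_train = np.clip(((X_train - lo) / width).astype(int), 0, n_bins - 1)
    bins_test = np.clip(((x - lo) / width).astype(int), 0, n_bins - 1)
    votes = {}
    d = X_train.shape[1]
    for _ in range(n_members):
        attrs = rng.choice(d, size=min(n_attrs, d), replace=False)
        # Training instances lying in the same range as x on every selected attribute.
        mask = np.all(bins_train[:, attrs] == bins_test[attrs], axis=1)
        if mask.any():
            labels, counts = np.unique(y_train[mask], return_counts=True)
            winner = labels[np.argmax(counts)]
            votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get) if votes else None
```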
4 ≤ xi ≤ 4. If x1 is symbolic and its value is a, then the corresponding condition in the antecedent is
x1 = a.
As in the case of instance-centered methods, a distance is defined between a test instance and a
rule. Let R = (A1 . . . Am ,C) be a rule with the m conditions A1 . . . Am in the antecedent, and the class
C in the consequent. Let X = (x1 . . . xd ) be a d-dimensional example. Then, the distance Δ(X , R)
between the instance X and the rule R is defined as follows.
\Delta(X, R) = \sum_{i=1}^{m} \delta(i)^s.    (6.16)
Here s is a real-valued parameter such as 1, 2, 3, etc., and δ(i) represents the distance on the ith
condition. The value of δ(i) is equal to the distance of the instance to the nearest end of the range
for the case of a numeric attribute, and to the value difference metric (VDM) of Equation 6.1 for the
case of a symbolic attribute. The value of δ(i) is zero if the corresponding attribute value is a match
for the antecedent condition. The class label for a test instance is defined by the label of the nearest
rule to the test instance. If two or more rules are at the same distance from the test instance, then the
one with the greatest accuracy on the training data is used.
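The following sketch computes the rule-to-instance distance of Equation 6.16 for rules with numeric range conditions (symbolic conditions would use the VDM instead, which is omitted here), and classifies a test instance by its nearest rule.

```python
import numpy as np

# A rule is represented as (conditions, label), where conditions maps an
# attribute index to a (low, high) range on that attribute.
def rule_distance(x, conditions, s=2):
    """Delta(X, R) = sum_i delta(i)^s over the antecedent conditions (Equation 6.16).
    delta(i) is 0 inside the range, else the distance to the nearest end of the range."""
    total = 0.0
    for attr, (low, high) in conditions.items():
        v = x[attr]
        delta = 0.0 if low <= v <= high else min(abs(v - low), abs(v - high))
        total += delta ** s
    return total

def classify_by_rules(x, rules, s=2):
    dists = [rule_distance(x, cond, s) for cond, _ in rules]
    return rules[int(np.argmin(dists))][1]

# Usage with two illustrative rules.
rules = [({0: (0.0, 1.0), 1: (0.0, 0.5)}, "A"),
         ({0: (1.0, 2.0)}, "B")]
print(classify_by_rules(np.array([0.9, 0.2]), rules))
```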
The set of rules in the RISE system is constructed as follows. RISE constructs good rules by
using successive generalizations on the original set of instances in the data. Thus, the algorithm starts
off with the training set of examples. RISE examines each rule one by one, and finds the nearest
example of the same class that is not already covered by the rule. The rule is then generalized in order
to cover this example, by expanding the corresponding antecedent condition. For the case of numeric
attributes, the ranges of the attributes are increased minimally so as to include the new example, and
for the case of symbolic attributes, a corresponding condition on the symbolic attribute is included.
If the effect of this generalization on the global accuracy of the rule set is non-negative, then the generalized rule
is retained. Otherwise, the generalization is not used and the original rule is retained. It should be
pointed out that even when generalization does not improve accuracy, it is desirable to retain the
more general rule because of the desirable bias towards simplicity of the model. The procedure is
repeated until no rule can be generalized in a given iteration. It should be pointed out that some
instances may not be generalized at all, and may remain in their original state in the rule set. In the
worst case, no instance is generalized, and the resulting model is a nearest neighbor classifier.
A system called DeEPs has been proposed in [49], which combines the power of rules and lazy
learning for classification purposes. This approach examines how the frequency of an instance’s sub-
set of features varies among the training classes. In other words, patterns that sharply differentiate
between the different classes for a particular test instance are leveraged and used for classifica-
tion. Thus, the specificity to the instance plays an important role in this discovery process. Another
system, HARMONY, has been proposed in [66], which determines rules that are optimized to the
different training instances. Strictly speaking, this is not a lazy learning approach, since the rules are
optimized to training instances (rather than test instances) in a pre-processing phase. Nevertheless,
the effectiveness of the approach relies on the same general principle, and it can also be generalized
for lazy learning if required.
function is computed at the instance using these centers. The combination of the functions computed
from each of these centers is learned with the use of a neural network. Radial basis functions can
be considered three-layer feed-forward networks, in which each hidden unit computes a function of
the form:
f_i(x) = e^{-||x - x_i||^2 / (2 \cdot \sigma_i^2)}.    (6.17)
Here σ_i^2 represents the local variance at center x_i. Note that the function f_i(x) has a very similar form
to that commonly used in kernel density estimation. For ease in discussion, we assume that this is
a binary classification problem, with labels drawn from {+1, −1}, though this general approach
extends much further, even to the extent of regression modeling. The final function is a weighted
combination of these values with weights w_i.
f^*(x) = \sum_{i=1}^{N} w_i \cdot f_i(x).    (6.18)
Here x1 . . . xN represent the N different centers, and wi denotes the weight of center i, which is
learned in the neural network. In classical instance-based methods, each data point xi is an individual
training instance, and the weight wi is set to +1 or −1, depending upon its label. However, in RBF
methods, the weights are learned with a neural network approach, since the centers are derived from
the underlying training data, and do not have a label directly attached to them.
The N centers x1 . . . xN are typically constructed with the use of an unsupervised approach
[19, 37, 46], though some recent methods also use supervised techniques for constructing the cen-
ters [71]. The unsupervised methods [19, 37, 46] typically use a clustering algorithm in order to
generate the different centers. A smaller number of centers typically results in smaller complexity,
and greater efficiency of the classification process. Radial-basis function networks are related to
sigmoidal function networks (SGF), which have one unit for each instance in the training data. In
this sense sigmoidal networks are somewhat closer to classical instance-based methods, since they
do not have a first phase of cluster-based summarization. While radial-basis function networks are
generally more efficient, the points at the different cluster boundaries may often be misclassified.
It has been shown in [52] that RBF networks may sometimes require ten times as much training
data as SGF in order to achieve the same level of accuracy. Some recent work [71] has shown how
supervised methods even at the stage of center determination can significantly improve the accu-
racy of these classifiers. Radial-basis function networks can therefore be considered an evolution of
nearest neighbor classifiers, where more sophisticated (clustering) methods are used for prototype
selection (or re-construction in the form of cluster centers), and neural network methods are used
for combining the density values obtained from each of the centers.
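A minimal sketch of such an RBF network is shown below: the centers are obtained in an unsupervised way with k-means, the local variance at each center is estimated from its assigned points, and the combination weights w_i of Equation 6.18 are fit by least squares on ±1 labels. A trained output layer or a supervised center-selection step could be substituted for these simplifying choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def rbf_features(X, centers, sigma2):
    """Gaussian activations of Equation 6.17, one column per center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma2))

def train_rbf(X, y, n_centers=10):
    """y is assumed to take values in {+1, -1}."""
    km = KMeans(n_clusters=n_centers, n_init=10, random_state=0).fit(X)
    centers = km.cluster_centers_
    # Local variance at each center, estimated from the points assigned to it.
    sigma2 = np.array([X[km.labels_ == i].var() + 1e-6 for i in range(n_centers)])
    Phi = rbf_features(X, centers, sigma2)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)     # combination weights of Equation 6.18
    return centers, sigma2, w

def predict_rbf(X, centers, sigma2, w):
    return np.sign(rbf_features(X, centers, sigma2) @ w)
```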
[FIGURE 6.4: Accuracy density profiles in a 2-dimensional subspace (Attribute X vs. Attribute Y), with the test instance marked in each panel.]
by X1 . . . XN . Let us further assume that the k classes in the data are denoted by C1 . . . Ck . The number
of points belonging to the class C_i is n_i, so that \sum_{i=1}^{k} n_i = N. The data set associated with the class i is
denoted by D_i. This means that \cup_{i=1}^{k} D_i = D. The probability density at a given point is determined
by the sum of the smoothed values of the kernel functions Kh (·) associated with each point in the
data set. Thus, the density estimate of the data set D at the point x is defined as follows:
f(x, D) = \frac{1}{n} \cdot \sum_{X_i \in D} K_h(x - X_i).    (6.19)
The kernel function is a smooth unimodal distribution such as the Gaussian function:
K_h(x - X_i) = \frac{1}{\sqrt{2\pi} \cdot h} \cdot e^{-\frac{||x - X_i||^2}{2 h^2}}.    (6.20)
The kernel function is dependent on the use of a parameter h, which is the level of smoothing. The
accuracy of the density estimate depends upon this width h, and several heuristic rules have been
proposed for estimating the bandwidth [6].
The value of the density f (x, D ) may differ considerably from f (x, Di ) because of the difference
in distributions of the different classes. Correspondingly, the accuracy density A (x, Ci , D ) for the
class Ci is defined as follows:
A(x, C_i, D) = \frac{n_i \cdot f(x, D_i)}{\sum_{j=1}^{k} n_j \cdot f(x, D_j)}.    (6.21)
The above expression always lies between 0 and 1, and is simply an estimation of the Bayesian
posterior probability of the test instance belonging to class C_i. It is assumed that the a-priori proba-
bility of the test instance belonging to class C_i (without knowing any of its feature values) is simply
equal to the fractional presence n_i/N of class C_i in the data. The conditional probability density after knowing
the feature values x is equal to f(x, D_i). Then, by applying the Bayesian formula for posterior prob-
abilities, the expression of Equation 6.21 is obtained. The higher the value of the accuracy density,
the greater the relative density of Ci compared to the other classes.
Another related measure is the interest density I (x, Ci , D ), which is the ratio of the density of
the class Ci to the overall density of the data.
I(x, C_i, D) = \frac{f(x, D_i)}{f(x, D)}.    (6.22)
The class Ci is over-represented at x, when the interest density is larger than one. The dominant
class at the coordinate x is denoted by C^M(x, D), and is equal to \text{argmax}_{i \in \{1,...,k\}} I(x, C_i, D). Cor-
respondingly, the maximum interest density at x is denoted by I^M(x, D) = \max_{i \in \{1,...,k\}} I(x, C_i, D).
Both the interest and accuracy density are valuable quantifications of the level of dominance of the
different classes. The interest density is more effective at comparing among the different classes at
a given point, whereas the accuracy density is more effective at providing an idea of the absolute
accuracy at a given point.
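The sketch below computes the class-wise kernel densities and the accuracy and interest densities of Equations 6.19–6.22 at a given point, using a Gaussian kernel with a fixed, assumed bandwidth h.

```python
import numpy as np

def kernel_density(x, X, h):
    """f(x, D) of Equation 6.19 with the Gaussian kernel of Equation 6.20."""
    sq = ((X - x) ** 2).sum(axis=1)
    return np.mean(np.exp(-sq / (2 * h * h)) / (np.sqrt(2 * np.pi) * h))

def accuracy_and_interest(x, X, y, h=0.5):
    classes, counts = np.unique(y, return_counts=True)
    f_class = np.array([kernel_density(x, X[y == c], h) for c in classes])
    f_all = kernel_density(x, X, h)
    accuracy = counts * f_class / np.sum(counts * f_class)      # Equation 6.21
    interest = f_class / (f_all + 1e-12)                        # Equation 6.22
    dominant = classes[int(np.argmax(interest))]                # dominant class C^M(x, D)
    return accuracy, interest, dominant
```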
So far, it has been assumed that all of the above computations are performed in the full dimen-
sional space. However, it is also possible to project the data onto the subspace E in order to perform
this computation. Such a calculation would quantify the discriminatory power of the subspace E at
x. In order to denote the use of the subspace E in any computation, the corresponding expression
will be superscripted with E. Thus, the density in a given subspace E is denoted by f^E(·, ·), the
accuracy density by A^E(·, ·, ·), and the interest density by I^E(·, ·, ·). Similarly, the dominant class
is defined using the subspace-specific interest density at that point, and the accuracy density profile
is defined for that particular subspace. An example of the accuracy density profile (of the dominant
class) in a 2-dimensional subspace is illustrated in Figure 6.4(a). The test instance is also labeled in
the same figure in order to illustrate the relationship between the density profile and test instance.
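Assuming the helper functions from the earlier sketches, a subspace-restricted computation amounts to projecting both the test point and the data onto the selected attributes before applying the same estimators; E below is a hypothetical list of attribute indices:

def subspace_interest_density(x, Xs_by_class, h, E):
    """Interest densities I^E(x, C_i, D) and the dominant class in the subspace E
    (E is a list of attribute indices)."""
    x_E = x[E]
    Xs_E = {c: Xc[:, E] for c, Xc in Xs_by_class.items()}
    _, interest_E, dominant_E = accuracy_and_interest_density(x_E, Xs_E, h)
    return interest_E, dominant_E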
It is desired to find those projections E of the data in which the subspace-specific maximum interest density I^{E,M}(t, D) is as large as possible. It is quite possible that in some cases, different subspaces may provide different
information about the class behavior of the data; these are the difficult cases in which a test instance
may be difficult to classify accurately. In such cases, the user may need to isolate particular data
localities in which the class distribution is further examined by a hierarchical exploratory process.
While the density values are naturally defined over the continuous space of quantitative attributes, it
has been shown in [6] that intuitively analogous values can be defined for the interest and accuracy
densities even when categorical attributes are present.
For a given test example, the end user is provided with several options for exploring the characteristics that are indicative of its classification. A subspace determination process is applied on the basis of the highest interest densities at the given test instance. Thus, the subspace determination
process finds the appropriate local discriminative subspaces for a given test example. These are the
various possibilities (or branches) of the decision path, which can be utilized in order to explore the
regions in the locality of the test instance. In each of these subspaces, the user is provided with a
visual profile of the accuracy density. This profile provides the user with an idea of which branch is
likely to lead to a region of high accuracy for that test instance. This visual profile can also be utilized
in order to determine which of the various branches are most suitable for further exploration. Once such a branch has been chosen, the user has the option to explore further into a particular region of the data that has a high accuracy density. This process of data localization can quickly isolate an
arbitrarily shaped region in the data containing the test instance. This sequence of data localizations
creates a path (and a locally discriminatory combination of dimensions) that reveals the underlying
classification causality to the user.
In the event that a decision path is chosen that is not strongly indicative of any class, the user has
the option to backtrack to a higher level node and explore a different path of the tree. In some cases,
different branches may be indicative of the test example belonging to different classes. These are
the “ambiguous cases” in which a test example could share characteristics from multiple classes.
Many standard modeling methods may classify such an example incorrectly, whereas the subspace decision path method is much more effective at providing the user with intensional knowledge of the test example because of its exploratory approach. This can be used in order to understand the
causality behind the ambiguous classification behavior of that instance.
The overall algorithm for decision path construction is illustrated in Figure 6.5. The details of
the subroutines in the procedure are described in [6], though a summary description is provided
here. The input to the system is the data set D , the test instance t for which one wishes to find the
diagnostic characteristics, a maximum branch factor bmax , and a minimum interest density irmin . In
addition, the maximum dimensionality l of any subspace used in data exploration is provided as user input. The value of l = 2 is especially interesting because it allows for the use of a visual profile of the accuracy density. Even though it is natural to use 2-dimensional projections because of their
visual interpretability, the data exploration process along a given path reveals a higher dimensional
combination of dimensions, which is most suitable for the test instance. The branch factor bmax is
the maximum number of possibilities presented to the user, whereas the value of irmin is the corre-
sponding minimum interest density of the test instance in any subspace presented to the user. The
value of irmin is chosen to be 1, which is the break-even value for the interest density. At this break-even value, the corresponding class is neither over-represented nor under-represented in the locality of the test instance. The variable PATH consists of the pointers to the
sequence of successively reduced training data sets, which are obtained in the process of interactive
decision tree construction. The list PATH is initialized to a single element, which is the pointer to the
original data set D . At each point in the decision path construction process, the subspaces E1 . . . Eq
are determined, which have the greatest interest density (of the dominant class) in the locality of
the test instance t. This process is accomplished by the procedure ComputeClassificationSubspaces.
Once these subspaces have been determined, the density profile is constructed for each of them by
the procedure ConstructDensityProfile. Even though one subspace may have higher interest density
at the test instance than another, the true value of a subspace in separating the data locality around
the test instance is often a subjective judgement that depends both upon the interest density of the
test instance and the spatial separation of the classes. Such a judgement requires human intuition,
which can be harnessed with the use of the visual accuracy density profiles of the various possibilities. These profiles provide the user with an intuitive idea of the class behavior of the
data set in various projections. If the class behavior across different projections is not very consis-
tent (different projections are indicative of different classes), then such a node is not very revealing
of valuable information. In such a case, the user may choose to backtrack by specifying an earlier
node on PATH from which to start further exploration.
On the other hand, if the different projections provide a consistent idea of the class behavior, then
the user utilizes the density profile in order to isolate a small region of the data in which the accuracy
density of the data in the locality of the test instance is significantly higher for a particular class.
This is achieved by the procedure IsolateData. This isolated region may be a cluster of arbitrary
shape depending upon the region covered by the dominating class. However, the use of the visual
profile helps to maintain the interpretability of the isolation process in spite of the arbitrary contour
of separation. An example of such an isolation is illustrated in Figure 6.4(b), and is facilitated by
the construction of the visual density profile. The procedure returns the isolated data set L along
with a number called the accuracy significance p(L, Ci) of the class Ci. The pointer to this new
data set L is added to the end of PATH. At that point, the user decides whether further exploration
into that isolated data set is necessary. If so, the same process of subspace analysis is repeated on
this node. Otherwise, the process is terminated and the most relevant class label is reported. The
overall exploration process also provides the user with a good diagnostic idea of how the local data
distributions along different dimensions relate to the class label.
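The following Python sketch gives a simplified, non-interactive caricature of this exploration process. It is not the procedure of Figure 6.5, whose subroutines ComputeClassificationSubspaces, ConstructDensityProfile, and IsolateData are detailed in [6]; rather, it is a greedy stand-in built on the density helpers sketched earlier, with hypothetical names throughout. At each step it picks the 2-dimensional subspace with the largest dominant interest density at the test instance and crudely isolates the nearest half of each class in that subspace.

from itertools import combinations
import numpy as np

def greedy_decision_path(Xs_by_class, t, h, ir_min=1.0, max_steps=3):
    """A greedy, non-interactive stand-in for the decision path exploration.

    Xs_by_class : dict mapping class label -> (n_i, d) array of points
    t           : test instance, shape (d,)
    h           : kernel bandwidth
    ir_min      : break-even interest density (1 by default)
    max_steps   : maximum depth of the localization path
    """
    d = t.shape[0]
    PATH = [Xs_by_class]                      # successively reduced data sets
    current = Xs_by_class
    for _ in range(max_steps):
        # choose the 2-d subspace with the highest dominant interest density at t
        best_E, best_I = None, ir_min
        for E in combinations(range(d), 2):
            interest_E, dom = subspace_interest_density(t, current, h, list(E))
            if interest_E[dom] > best_I:
                best_E, best_I = list(E), interest_E[dom]
        if best_E is None:                    # no subspace exceeds the break-even value
            break
        # crude isolation: keep, per class, the half of the points nearest to t in E
        reduced = {}
        for c, Xc in current.items():
            dists = np.linalg.norm(Xc[:, best_E] - t[best_E], axis=1)
            keep = dists.argsort()[: max(1, len(Xc) // 2)]
            reduced[c] = Xc[keep]
        current = reduced
        PATH.append(current)
    # report the dominant class in the final locality
    _, _, dominant = accuracy_and_interest_density(t, current, h)
    return dominant, PATH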
Bibliography
[1] A. Aamodt and E. Plaza. Case-based reasoning: Foundational issues, methodological varia-
tions, and system approaches, AI communications, 7(1):39–59, 1994.
[2] C. Aggarwal, A. Hinneburg, and D. Keim. On the surprising behavior of distance metrics in
high dimensional space, Database Theory—ICDT 2001, Lecture Notes in Computer Science, 1973:420–434, 2001.
[3] C. Aggarwal. Towards systematic design of distance functions for data mining applications,
Proceedings of the KDD Conference, pages 9–18, 2003.
[4] C. Aggarwal and P. Yu. Locust: An online analytical processing framework for high dimen-
sional classification of data streams, Proceedings of the IEEE International Conference on
Data Engineering, pages 426–435, 2008.
[5] C. Aggarwal and C. Reddy. Data Clustering: Algorithms and Applications, CRC Press, 2013.
[7] C. Aggarwal, J. Han, J. Wang, and P. Yu. A framework for on demand classification of evolving
data streams, IEEE Transactions on Knowledge and Data Engineering, 18(5):577–589, 2006.
[8] D. Aha, P. Clark, S. Salzberg, and G. Blix. Incremental constructive induction: An instance-
based approach, Proceedings of the Eighth International Workshop on Machine Learning,
pages 117–121, 1991.
[10] D. Aha. Tolerating noisy, irrelevant and novel attributes in instance-based learning algorithms,
International Journal of Man-Machine Studies, 36(2):267–287, 1992.
[11] D. Aha, D. Kibler, and M. Albert. Instance-based learning algorithms, Machine Learning,
6(1):37–66, 1991.
[12] D. Aha. Lazy learning: Special issue editorial, Artificial Intelligence Review, 11:7–10, 1997.
[13] F. Angiulli. Fast nearest neighbor condensation for large data sets classification, IEEE Trans-
actions on Knowledge and Data Engineering, 19(11):1450–1464, 2007.
[14] M. Ankerst, G. Kastenmuller, H.-P. Kriegel, and T. Seidl. Nearest Neighbor Classification in
3D Protein Databases, ISMB-99 Proceedings, pages 34–43, 1999.
[15] C. Atkeson, A. Moore, and S. Schaal. Locally weighted learning, Artificial Intelligence Re-
view, 11(1–5):11–73, 1997.
[16] S. D. Bay. Combining nearest neighbor classifiers through multiple feature subsets. Proceed-
ings of ICML Conference, pages 37–45, 1998.
[17] J. Beringer and E. Hullermeier. Efficient instance-based learning on data streams, Intelligent
Data Analysis, 11(6):627–650, 2007.
[18] M. Birattari, G. Bontempi, and H. Bersini. Lazy learning meets the recursive least squares algorithm, Advances in Neural Information Processing Systems, 11:375–381, 1999.
[19] C. M. Bishop. Improving the generalization properties of radial basis function neural networks,
Neural Computation, 3(4):579–588, 1991.
[20] E. Blanzieri and A. Bryl. Instance-based spam filtering using SVM nearest neighbor classifier,
FLAIRS Conference, pages 441–442, 2007.
[21] J. Blitzer, K. Weinberger, and L. Saul. Distance metric learning for large margin nearest neigh-
bor classification, Advances in Neural Information Processing Systems, pages 1473–1480, 2005.
[22] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative
evaluation. Proceedings of the SIAM Conference on Data Mining, pages 243–254, 2008.
[23] L. Bottou and V. Vapnik. Local learning algorithms, Neural Computation, 4(6):888–900,
1992.
[24] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees,
Wadsworth, 1984.
[25] L. Breiman. Random forests, Machine Learning, 48(1):5–32, 2001.
[26] T. Cain, M. Pazzani, and G. Silverstein. Using domain knowledge to influence similarity judge-
ments, Proceedings of the Case-Based Reasoning Workshop, pages 191–198, 1991.
[27] C. L. Chang. Finding prototypes for nearest neighbor classifiers, IEEE Transactions on Computers, C-23(11):1179–1184, 1974.
[28] W. Cheng and E. Hullermeier. Combining instance-based learning and logistic regression for
multilabel classification, Machine Learning, 76 (2–3):211–225, 2009.
[29] J. G. Cleary and L. E. Trigg. K*: An instance-based learner using an entropic distance measure,
Proceedings of ICML Conference, pages 108–114, 1995.
[30] S. Cost and S. Salzberg. A weighted nearest neighbor algorithm for learning with symbolic
features, Machine Learning, 10(1):57–78, 1993.
[31] T. Cover and P. Hart. Nearest neighbor pattern classification, IEEE Transactions on Informa-
tion Theory, 13(1):21–27, 1967.
[32] P. Cunningham. A taxonomy of similarity mechanisms for case-based reasoning, IEEE Transactions on Knowledge and Data Engineering, 21(11):1532–1543, 2009.
[33] B. V. Dasarathy. Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques, IEEE
Computer Society Press, 1990.
[34] C. Domeniconi, J. Peng, and D. Gunopulos. Locally adaptive metric nearest-neighbor classi-
fication, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9):1281–1285,
2002.
[35] C. Domeniconi and D. Gunopulos. Adaptive nearest neighbor classification using support vec-
tor machines, Advances in Neural Information Processing Systems, 1: 665–672, 2002.
[37] G. W. Flake. Square unit augmented, radially extended, multilayer perceptrons, In Neural
Networks: Tricks of the Trade, pages 145–163, 1998.
[38] E. Frank, M. Hall, and B. Pfahringer. Locally weighted Naive Bayes, Proceedings of the Nine-
teenth Conference on Uncertainty in Artificial Intelligence, pages 249–256, 2002.
[39] J. Friedman. Flexible Nearest Neighbor Classification, Technical Report, Stanford University,
1994.
[40] J. Friedman, R. Kohavi, and Y. Yun. Lazy decision trees, Proceedings of the National Confer-
ence on Artificial Intelligence, pages 717–724, 1996.
[41] S. Garcia, J. Derrac, J. Cano, and F. Herrera. Prototype selection for nearest neighbor classi-
fication: Taxonomy and empirical study, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 34(3):417–436, 2012.
[42] D. Gunopulos and G. Das. Time series similarity measures. In Tutorial notes of the sixth ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 243–
307, 2000.
[43] E. Han, G. Karypis, and V. Kumar. Text categorization using weight adjusted k-nearest neigh-
bor classification, Proceedings of the Pacific-Asia Conference on Knowledge Discovery and
Data Mining, pages 53–65, 2001.
[44] T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification, IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 18(6):607–616, 1996.
[45] A. Hinneburg, C. Aggarwal, and D. Keim. What is the nearest neighbor in high dimensional
spaces? Proceedings of VLDB Conference, pages 506–515, 2000.
[46] Y. S. Hwang and S. Y. Bang. An efficient method to construct a radial basis function neural
network classifier, Neural Networks, 10(8):1495–1503, 1997.
[47] N. Jankowski and M. Grochowski. Comparison of instance selection algorithms I. Algorithms survey, Lecture Notes in Computer Science, 3070:598–603, 2004.
[48] R. Kohavi, P. Langley, and Y. Yun. The utility of feature weighting in nearest-neighbor algo-
rithms, Proceedings of the Ninth European Conference on Machine Learning, pages 85–92,
1997.
[49] J. Li, G. Dong, K. Ramamohanarao, and L. Wong. DeEPs: A new instance-based lazy discovery
and classification system, Machine Learning, 54(2):99–124, 2004.
[50] D. G. Lowe. Similarity metric learning for a variable-kernel classifier, Neural computation,
7(1):72–85, 1995.
[51] G. J. McLachlan. Discriminant analysis and statistical pattern recognition, Wiley-
Interscience, Vol. 544, 2004.
[52] J. Moody and C. Darken. Fast learning in networks of locally-tuned processing units, Neural
computation, 1(2), 281–294, 1989.
[53] J. Peng, B. Bhanu, and S. Qing. Probabilistic feature relevance learning for content-based
image retrieval, Computer Vision and Image Understanding, 75(1):150–164, 1999.
[54] J. Peng, D. Heisterkamp, and H. Dai. LDA/SVM driven nearest neighbor classification, Com-
puter Vision and Pattern Recognition Conference, 1–58, 2001.
[55] J. R. Quinlan. Combining instance-based and model-based learning. Proceedings of ICML
Conference, pages 236–243, 1993.
[56] G. Salton and M. J. McGill. Introduction to modern information retrieval. McGraw-Hill, 1983.
[57] H. Samet. Foundations of multidimensional and metric data structures, Morgan Kaufmann,
2006.
[58] M. Sonka, V. Hlavac, and R. Boyle. Image Processing, Analysis, and Machine Vision. Thomson Learning, 1999.
[59] E. Spyromitros, G. Tsoumakas, and I. Vlahavas. An empirical study of lazy multilabel classi-
fication algorithms, In Artificial Intelligence: Theories, Models and Applications, pages 401–
406, 2008.
[60] C. Stanfill and D. Waltz. Toward memory-based reasoning, Communications of the ACM,
29(12):1213–1228, 1986.
[61] M. Tahir, A. Bouridane, and F. Kurugollu. Simultaneous feature selection and feature weighting using Hybrid Tabu Search in K-nearest neighbor classifier, Pattern Recognition Letters, 28(4):438–446, 2007.
[62] P. Utgoff. Incremental induction of decision trees, Machine Learning, 4(2):161–186, 1989.
[63] P. Vincent and Y. Bengio. K-local hyperplane and convex distance nearest neighbor algorithms, Neural
Information Processing Systems, pages 985–992, 2001.
[64] D. Volper and S. Hampson. Learning and using specific instances, Biological Cybernetics,
57(1–2):57–71, 1987.
[65] J. Wang, P. Neskovic, and L. Cooper. Improving nearest neighbor rule with a simple adaptive
distance measure, Pattern Recognition Letters, 28(2):207–213, 2007.
[66] J. Wang and G. Karypis. On mining instance-centric classification rules, IEEE Transactions
on Knowledge and Data Engineering, 18(11):1497–1511, 2006.
[67] I. Watson and F. Marir. Case-based reasoning: A review, Knowledge Engineering Review,
9(4):327–354, 1994.
[68] K. Weinberger, J. Blitzer, and L. Saul. Distance metric learning for large margin nearest neigh-
bor classification, NIPS Conference, MIT Press, 2006.
[69] D. Wettschereck, D. Aha, and T. Mohri. A review and empirical evaluation of feature weight-
ing methods for a class of lazy learning algorithms, Artificial Intelligence Review, 11(1–
5):273–314, 1997.
[70] D. Wettschereck and D. Aha. Weighting features. Case-Based Reasoning Research and Development, pages 347–358, Springer, 1995.
[71] D. Wettschereck and T. Dietterich. Improving the performance of radial basis function net-
works by learning center locations, NIPS, Vol. 4:1133–1140, 1991.
[72] D. Wilson and T. Martinez. Reduction techniques for instance-based learning algorithms, Ma-
chine Learning 38(3):257–286, 2000.
[73] D. Wilson and T. Martinez. An integrated instance-based learning algorithm, Computational Intelligence, 16(1):28–48, 2000.
[74] Z. Xie, W. Hsu, Z. Liu, and M. L. Lee. SNNB: A selective neighborhood based naive Bayes for
lazy learning, Advances in Knowledge Discovery and Data Mining, pages 104–114, Springer,
2002.
[76] D.-C. Zhan, M. Li, Y.-F. Li, and Z.-H. Zhou. Learning instance specific distances using metric
propagation, Proceedings of ICML Conference, pages 1225–1232, 2009.
[77] H. Zhang, A. Berg, M. Maire, and J. Malik. SVM-KNN: Discriminative nearest neighbor
classification for visual category recognition, Computer Vision and Pattern Recognition, pages
2126–2136, 2006.
[78] M. Zhang and Z. H. Zhou. ML-kNN: A lazy learning approach to multi-label learning, Pattern
Recognition, 40(7): 2038–2045, 2007.
[79] Z. Zheng and G. Webb. Lazy learning of Bayesian rules. Machine Learning, 41(1):53–84,
2000.
[80] Z. H. Zhou and Y. Yu. Ensembling local learners through multimodal perturbation, IEEE
Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 35(4):725–735, 2005.
[81] X. Zhu and X. Wu. Scalable representative instance selection and ranking, Proceedings of
IEEE International Conference on Pattern Recognition, Vol 3:352–355, 2006.