Algorithms and Applications For Spatial Data Mining
1 Introduction
Due to computerization and the advances in scientific data collection, we are faced with a large and continuously growing amount of data which makes it impossible to interpret all this data manually. Therefore, the development of new techniques and tools that support humans in transforming data into useful knowledge has been the focus of the relatively new and interdisciplinary research area of knowledge discovery in databases.
Knowledge discovery in databases (KDD) has been defined as the non-trivial process of discovering valid, novel, potentially useful and ultimately understandable patterns from data, where a pattern is an expression in some language describing a subset of the data or a model applicable to that subset (Fayyad et al., 1996). The process of KDD is interactive and iterative, involving several steps such as data selection, data reduction, data mining, and the evaluation of the data mining results. The heart of the process, however, is the data mining step, which consists of the application of data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data (Fayyad et al., 1996).
While a lot of research has been conducted on knowledge discovery and data mining in relational databases (see e.g. (Chen et al., 1996) or (Fayyad, 1997) for an overview), only a few works
deal with knowledge discovery in spatial databases (see (Gueting, 1994) for an introduction to spatial databases, (Koperski et al., 1996) for an overview of spatial data mining). Finding implicit regularities, rules or patterns hidden in spatial databases is an important task, e.g. for geo-marketing,
traffic control or environmental studies.
A spatial database contains objects which are characterized by a spatial location and/or extension as well as by several non-spatial attributes. Figure 7.1 illustrates a spatial database on Bavaria as an example. Depicted is the relation Communities containing polygons which represent communities in a geographic information system. This spatial database on Bavaria - referred to as the BAVARIA database - is used in some of the following sections as a test database for our algorithms. The database contains the ATKIS 500 data and the Bavarian part of the statistical data.
(Figure 7.1: the relation Communities of the BAVARIA database, with a spatial attribute (polygon) and non-spatial attributes such as population, unemployment and rate of foreigners; e.g. Munich: population 1,300,000, unemployment 0.06, foreigners 0.15)
Furthermore, example applications are discussed for these algorithms. The last section gives a short summary and shows some directions for future research.
Figure 7.2. Illustration of some topological (A disjoint B, A overlap B), some distance (A distance=c B, A distance<c B) and some direction relations (C north A, B northeast A)
The topological relations between two objects A and B are A disjoint B, A meets B, A overlaps B, A equals B, A covers B, A covered-by B, A contains B and A inside B. A formal definition has been given by Egenhofer (1991).
Distance relations compare the distance of two objects with a given constant using one of the arithmetic comparison operators. If dist is a distance function, θ is one of the arithmetic predicates <, > or =, and c is a real number, then a distance relation O1 distance θ c O2 between the two spatial objects O1 and O2 holds iff dist(O1, O2) θ c.
To define the direction relations, e.g. O2 south O1, we consider one representative point of the
object O1 as the origin of a virtual coordinate system whose quadrants and half-planes define the
directions. To fulfil the direction predicate, all points of O2 have to be located in the respective area
of the plane. Figure 7.2 illustrates the definition of some direction relations using 2D polygons.
Obviously, the directions are not uniquely defined but there is always a smallest direction relation for two objects A and B, called the exact direction relation of A and B, which is uniquely determined. In figure 7.2, for instance, A and B satisfy the direction relations northeast and east but
the exact direction relation of A and B is northeast.
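To make the basic relations concrete, the following sketch evaluates a direction and a distance relation for simple point objects; the coordinate tuples, the function names and the example values are assumptions of this illustration and not part of the framework.

import math

# Toy point objects: (x, y) tuples. Polygons would additionally need a
# representative point and containment tests, which are omitted here.
def dist(p, q):
    """Euclidean distance between two point objects."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def northeast(a, b):
    """a northeast b: a lies in the upper-right quadrant of the virtual
    coordinate system anchored at the representative point of b."""
    return a[0] > b[0] and a[1] > b[1]

def within(a, b, c):
    """Distance relation a distance<c b."""
    return dist(a, b) < c

# Example: B = (2, 3) is northeast of A = (0, 0) and less than 5 units away.
A, B = (0.0, 0.0), (2.0, 3.0)
print(northeast(B, A), within(A, B, 5.0))   # True True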
By combining basic spatial relations via logical operators, it is possible to define more complex spatial relations, e.g. O1 being north of O2 and no more than 5 km away. Each such spatial relation induces a spatial neighbourhood graph, defined as follows.
A neighbourhood graph G^DB_neighbour = (N, E) for a database DB and a spatial relation neighbour is a graph whose nodes N correspond to the objects of DB and in which an edge e = (n1, n2) exists iff neighbour(n1, n2) holds. A neighbourhood path of length k is defined as a sequence of nodes [n1, n2, ..., nk], where neighbour(ni, ni+1) holds for all ni, 1 ≤ i < k.
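A minimal sketch of these two notions for point objects follows; the adjacency-dictionary representation, the cycle-free path enumeration and the distance<5 example relation are assumptions of the illustration, not the database primitives themselves.

import math

def build_neighbourhood_graph(objects, neighbour):
    """Neighbourhood graph: one node per object of the database and an
    edge (n1, n2) iff neighbour(n1, n2) holds."""
    return {o1: [o2 for o2 in objects if o2 != o1 and neighbour(o1, o2)]
            for o1 in objects}

def neighbourhood_paths(graph, source, k):
    """All (cycle-free) neighbourhood paths [n1, ..., nk] of length at most k
    starting at source; neighbour(ni, ni+1) holds along each path by construction."""
    paths, stack = [], [[source]]
    while stack:
        path = stack.pop()
        paths.append(path)
        if len(path) < k:
            stack.extend(path + [nxt] for nxt in graph[path[-1]] if nxt not in path)
    return paths

# Example: the graph induced by the relation distance<5 on a few point objects.
points = [(0.0, 0.0), (2.0, 3.0), (3.0, 4.0), (9.0, 9.0)]
near = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1]) < 5.0
graph = build_neighbourhood_graph(points, near)
print(neighbourhood_paths(graph, (0.0, 0.0), 3))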
We assume the standard operations from relational algebra, like selection, union, intersection and difference, to be available for sets of objects and sets of neighbourhood paths (e.g., the operation selection(set, predicate) returns the set of all elements of set satisfying the predicate). In addition, we introduce some operations which are specific to neighbourhood graphs and paths and which are designed to support spatial data mining, in particular the neighbours operation for retrieving the neighbours of an object and the extensions operation for extending neighbourhood paths.
(Figure: illustration of the neighbourhood-path filters starlike, variable starlike and vertical starlike)
(Figure 7.4: a neighbourhood index stored as a relation with attributes Object-ID, Neighbour, Distance, Direction and Topology - e.g. the tuples (o1, o2, 2.7, southwest, disjoint) and (o1, o3, ..., northwest, overlap) - accessed via a B+-tree on Object-ID)
A simple implementation of a neighbourhood index using a B+-tree on the key attribute ObjectID is illustrated in figure 7.4.
A neighbourhood index supports not only one but a set of neighbourhood graphs. We call a
neighbourhood index applicable for a given neighbourhood graph if the index contains an entry for
each of the edges of the graph. To find the neighbourhood indices applicable for some neighbourhood graph, we introduce the notion of the critical distance of a neighbourhood relation r. Intuitively, the critical distance of a neighbourhood relation r is the maximum possible distance for a pair
of objects O1 and O2 satisfying O1 r O2. The following lemma allows us to calculate the critical
distance for any neighbourhood relation. The critical distance is calculated recursively along the
composition of a neighbourhood relation.
Lemma 1: The following equation holds for the critical distance of a neighbourhood relation r:
c-distance(r) =
  0 if r is a topological relation other than disjoint,
  c if r is the relation distance<c or distance=c,
  infinite if r is a direction relation, the relation distance>c or disjoint,
  min(c-distance(r1), c-distance(r2)) if r = r1 and r2,
  max(c-distance(r1), c-distance(r2)) if r = r1 or r2.
If two neighbourhood indices I^DB_c1 and I^DB_c2 with c1 < c2 are both available and applicable, using I^DB_c1 is more efficient because in general it has fewer entries than I^DB_c2. The smallest applicable neighbourhood index for some neighbourhood graph is the applicable neighbourhood index with the smallest critical distance.
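A small sketch of the recursion behind Lemma 1 is given below; the tuple encoding of relations and the relation names are assumptions chosen only to make the cases explicit.

import math

# Relations encoded as tuples, e.g. ('distance<', 3.0), ('north',), ('overlap',),
# ('and', r1, r2) or ('or', r1, r2); this encoding exists only for the example.
DIRECTIONS = {'north', 'south', 'east', 'west',
              'northeast', 'northwest', 'southeast', 'southwest'}

def c_distance(r):
    """Critical distance of a relation, following the recursion of Lemma 1."""
    kind = r[0]
    if kind in ('and', 'or'):
        d1, d2 = c_distance(r[1]), c_distance(r[2])
        return min(d1, d2) if kind == 'and' else max(d1, d2)
    if kind in DIRECTIONS or kind in ('disjoint', 'distance>'):
        return math.inf                       # distance is unbounded
    if kind in ('distance<', 'distance='):
        return r[1]                           # bounded by the constant c
    return 0.0                                # topological relation other than disjoint

# Example: "north of and no more than 5 km away" has critical distance 5.
print(c_distance(('and', ('north',), ('distance<', 5.0))))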
In figure 7.5, we sketch the algorithm for processing the neighbours operation, which makes use of the smallest applicable neighbourhood index. The first step, the index selection, selects a neighbourhood index. If there is no applicable neighbourhood index, then the standard approach of using some spatial index structure is followed.
some spatial index structure is followed. The filter step returns a set of candidate objects (which
may satisfy the specified neighbourhood relation) with a cardinality significantly smaller than the
database size. In the last step, the refinement step, for all these candidates the neighbourhood relation as well as the additional predicate pred are evaluated and all objects passing this test are returned as the resulting neighbours.
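The three steps could look roughly as follows; the list-of-indices layout, the parameter names and the full-scan fallback are assumptions of this sketch (a real implementation would fall back to an R-tree query instead).

def neighbours(db, indices, holds, c_dist, obj, pred=lambda o: True):
    """Sketch of the neighbours operation.
    db      : all spatial objects (used only when no index is applicable)
    indices : list of (max_dist, entries); entries maps an object to its
              [(neighbour, distance), ...] pairs within max_dist
    holds   : holds(o1, o2) evaluates the neighbourhood relation exactly
    c_dist  : critical distance of the relation (cf. Lemma 1)
    pred    : additional predicate on the neighbours"""
    # 1. index selection: the smallest applicable neighbourhood index, if any
    applicable = [(m, e) for (m, e) in indices if m >= c_dist]
    if applicable:
        _, entries = min(applicable, key=lambda p: p[0])
        candidates = [nb for (nb, d) in entries.get(obj, [])]   # 2. filter step
    else:
        candidates = [o for o in db if o is not obj]            # fallback: scan / spatial index
    # 3. refinement step: exact test of the relation and the extra predicate
    return [nb for nb in candidates if holds(obj, nb) and pred(nb)]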
To implement the extensions operation, we perform a depth-first search. Thus, a path buffer of size O(max-length) is sufficient to store the intermediate results. On the other hand, a breadth-first search would require a considerably larger buffer to hold all intermediate paths.
To create a neighbourhood index I^DB_max, a spatial join on DB with respect to the neighbourhood relation (O1 distance=dist O2 with dist ≤ max) is performed. A spatial join can be efficiently processed by using a spatial index structure. For each pair of objects returned by the spatial join, we then have to determine the exact distance, the direction relation and the topological relation. The resulting tuples of the form (O1, O2, Distance, Direction, Topology) are stored in a relation which is indexed by a B-tree on the attribute O1.
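A sketch of this materialization for point objects is given below; the nested-loop pairing stands in for the spatial join, the direction helper is simplified, and point objects that do not coincide are simply recorded as disjoint.

import math
from collections import defaultdict

def direction(o1, o2):
    """Exact direction of o2 relative to the representative point of o1 (points only)."""
    dx, dy = o2[0] - o1[0], o2[1] - o1[1]
    ns = 'north' if dy > 0 else ('south' if dy < 0 else '')
    ew = 'east' if dx > 0 else ('west' if dx < 0 else '')
    return (ns + ew) or 'same position'

def create_neighbourhood_index(db, max_dist):
    """Materialize the index: for every pair within max_dist store
    (neighbour, distance, direction, topology), keyed by the first object;
    a real implementation would use an R-tree based spatial join and a B-tree."""
    index = defaultdict(list)
    for o1 in db:
        for o2 in db:
            if o1 is o2:
                continue
            d = math.hypot(o1[0] - o2[0], o1[1] - o2[1])
            if d <= max_dist:
                index[o1].append((o2, round(d, 2), direction(o1, o2), 'disjoint'))
    return dict(index)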
Updates of a database, i.e. insertions or deletions of spatial objects, require updates of the derived neighbourhood indices. Fortunately, the update of a neighbourhood index I^DB_max is restricted to the neighbourhood of the respective object defined by the neighbourhood relation A distance<max B. This neighbourhood can be efficiently retrieved by using either a neighbourhood index (in the case of a deletion) or a spatial index structure (in the case of an insertion). That means it is not required to re-build a neighbourhood index from scratch when an update occurs. Instead, entries can be incrementally inserted into or deleted from a neighbourhood index.
The database primitives were implemented on top of the commercial DBMS Illustra, Version
3.3, using its 2D spatial data blade which provides R-trees. Neighbourhood indices were realized
in the DBMS by tables and B-trees as described above (see figure 7.4) and the geographic database
of Bavaria was used for an experimental performance evaluation. The results of our performance evaluation (Ester et al., 2000) demonstrate a significant speed-up for the neighbours operation with a neighbourhood index compared to without one. The neighbours operation is crucial to the efficient DBMS support of our database primitives since the implementation of the operations for extending neighbourhood paths is based on the neighbours operation.
In particular, a neighbourhood index is very efficient for complex spatial objects (i.e. large numbers of vertices per spatial object) and an average number of neighbours per spatial object which is not too large. This is a typical setting, e.g., for geographic information systems when using intersects as the neighbourhood predicate. If the average number of neighbours is very large for a spatial neighbourhood relation, we also have to consider the size of the neighbourhood index. In the worst case, depending on the critical distance c-distance(r) and the distribution of the spatial objects, the storage requirements for the materialization of a neighbourhood index can be O(n^2).
Ester et al. (2000) present a cost model for the neighbours operation. Based on system parameters (e.g. average execution time for a page access) and information about the neighbourhood relation (e.g. average number of neighbours), the model can be used to predict the performance gain of a neighbourhood index and whether this time gain is worth the additional storage requirements.
3 Spatial Clustering
Clustering is the task of grouping the objects of a database into meaningful subclasses (that is,
clusters) so that the members of a cluster are as similar as possible whereas the members of different clusters differ as much as possible from each other. Applications of clustering in spatial databases are, e.g., the detection of seismic faults by grouping the entries of an earthquake catalog or
the creation of thematic maps in geographic information systems by clustering feature vectors. We
can support clustering algorithms by our database primitives if the clustering algorithm is based on
a local cluster condition, i.e. if it constructs clusters by analyzing a restricted neighbourhood of
the objects. Examples are the density-based clustering algorithm DBSCAN (Ester et al., 1996) as
well as its generalized version GDBSCAN (Sander et al., 1998) which is discussed in the following.
(Figure: illustration of density-based clusters and a clustering of polygons)
GDBSCAN operates on a neighbourhood graph G^DB_NPred and uses the neighbours operation introduced above in our framework for spatial data mining. To find a density-connected set, GDBSCAN starts with an arbitrary object p and retrieves all objects density-reachable from p with respect to NPred and MinWeight. Density-reachable objects are retrieved by performing successive NPred-neighbourhood queries and checking the minimum weight of the respective results. If p is a core object, this procedure yields a density-connected set with respect to NPred and MinWeight. If p is not a core object, no objects are density-reachable from p and p is assigned to NOISE. This procedure is iteratively applied to each object p which has not yet been classified. Thus, a density-based decomposition of a dataset is constructed.
In figure 7.7, we present the basic version of the algorithm GDBSCAN based on a neighbourhood graph with respect to the neighbourhood predicate NPred.
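The control flow just described can be sketched roughly as follows; this is a simplified illustration rather than the pseudo code of figure 7.7, and the label handling, the parameter names and the representation of the neighbourhood as a callable are assumptions of the sketch.

NOISE, UNCLASSIFIED = -1, 0

def gdbscan(objects, neighbourhood, min_weight):
    """Sketch of GDBSCAN.
    neighbourhood(o) : NPred-neighbourhood of o (including o itself)
    min_weight(s)    : True iff the set s satisfies MinWeight (e.g. wCard(s) >= MinCard)
    Returns a dict object -> cluster id (NOISE = -1)."""
    cluster_id = 0
    label = {o: UNCLASSIFIED for o in objects}
    for o in objects:
        if label[o] != UNCLASSIFIED:
            continue
        seeds = neighbourhood(o)
        if not min_weight(seeds):            # o is not a core object
            label[o] = NOISE
            continue
        cluster_id += 1                      # expand a new density-connected set
        for s in seeds:
            label[s] = cluster_id
        queue = [s for s in seeds if s != o]
        while queue:
            current = queue.pop()
            current_seeds = neighbourhood(current)
            if min_weight(current_seeds):    # current is a core object as well
                for nb in current_seeds:
                    if label[nb] == UNCLASSIFIED:
                        queue.append(nb)
                        label[nb] = cluster_id
                    elif label[nb] == NOISE: # border object: assign, do not expand
                        label[nb] = cluster_id
    return label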
(Figure: raster images of the surface of the earth in different spectral channels and the corresponding feature vectors, e.g. (12, 17.5) and (8.5, 18.7), in a feature space spanned by Channel 1 and Channel 2 with clusters 1, 2 and 3)
The result is shown in figure 7.10. Each cluster was coded by a different colour. Then each point in the image of the surface of the earth was coloured according to the identifier of the cluster containing the corresponding 5-dimensional vector. A high degree of correspondence between the obtained image and a physical map of California can easily be seen. A detailed discussion of this correspondence is beyond the scope of this chapter.
This is a simplified description of the application. In practice, to obtain high quality results, there are several pre- and post-processing steps to the clustering of the feature vectors. In particular, the clustering of the feature vectors should be followed by a spatial smoothing because feature vectors which are close to each other in space are also likely to belong to the same class.
Application 2: Geography (2D polygons)
In the following, we present a simple method for detecting influence regions in a geographic database. GDBSCAN is used to extract density-connected sets of neighbouring objects having a similar value of the non-spatial attribute(s). To define the similarity on an attribute, its domain is partitioned into a number of disjoint classes and values in the same class are considered similar to each
other. The sets with the highest or lowest attribute value(s) are most interesting and are called influence regions, i.e. the maximal neighbourhood of a centre having a similar value in the non-spatial
attribute(s) as the centre itself. For economic geography, the resulting influence regions may be further analyzed by comparing them to a circular region representing the expected theoretical shape to
obtain a possible deviation. Different methods may be used for this comparison. A difference-based method calculates the difference of the observed influence region and the theoretical circular region, thus returning a region indicating the location of a possible deviation. An approximation-based method calculates the optimal approximating ellipsoid of the observed influence region. If the two main axes of the ellipsoid differ in length significantly, then the longer one is returned, indicating the direction of a deviation. Ester et al. (1997) present a detailed description of this application.
Figure 7.10. Visualization of the clustering result for the SEQUOIA 2000 raster data
GDBSCAN can be used to extract the influence regions from a spatial database. We define NPred(X, Y) as intersect(X, Y) ∧ attr-class(X) = attr-class(Y) and use cardinality as the wCard function. Furthermore, we set MinCard to 2 in order to exclude sets of fewer than 2 objects. Some results of this approach for the BAVARIA database are illustrated in figure 7.11.
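Reusing the gdbscan sketch above, the influence-region setting could be expressed roughly as follows; the Community record, the precomputed lists of intersecting neighbours and the attribute classes are assumptions of the example.

from collections import namedtuple

# neighbours: tuple of names of intersecting communities (e.g. precomputed with
# neighbours calls for the intersect relation); attr_class: the class of the
# non-spatial attribute value, e.g. 'high', 'medium', 'low'.
Community = namedtuple('Community', 'name attr_class neighbours')

def influence_regions(communities):
    """NPred(X, Y) = intersect(X, Y) and attr-class(X) = attr-class(Y);
    wCard = cardinality and MinCard = 2, so sets of fewer than 2 objects are excluded."""
    by_name = {c.name: c for c in communities}

    def neighbourhood(c):
        # NPred-neighbourhood of c, including c itself
        return [c] + [by_name[n] for n in c.neighbours
                      if by_name[n].attr_class == c.attr_class]

    return gdbscan(communities, neighbourhood, lambda s: len(s) >= 2)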
4 Spatial Characterization
The task of characterization is to find a compact description for a selected subset of the database. In this section, we discuss the task of characterization in the context of spatial databases and
present two relevant methods.
4.1 Algorithms
The task of mining association rules has been introduced by Agrawal and Srikant (1994). An association rule is a rule I1 ⇒ I2 where I1 and I2 are disjoint sets of items. The support of the rule is the fraction of database tuples containing all elements of both I1 and I2, and the confidence is the fraction of the tuples containing I1 that also contain all elements of I2. For a database DB of transactions (i.e. records containing the sets of items bought by some customer in one transaction), all association rules should be discovered having a support of at least minsupp and a confidence of at least minconf in DB.
Figure 7.11. Illustration of the influence regions of Ingolstadt (left) and Munich (right)
and their deviation from the expected shape
Extending the general concept of association rules, Koperski and Han (1995) introduce spatial
association rules which describe associations between objects based on spatial neighbourhood relations. For instance, a user may want to discover the spatial associations of towns in British Columbia with roads, waters or boundaries having some specified support and confidence.
Figure 7.12 depicts the specification of this spatial data mining task.
discover spatial association rules
inside British_Columbia
from road R, water W, mines M, boundary B
in relevance to town T
where close-to(T.geo, X.geo) and X in {R, W, M, B}
having minsupp = 5 % and minconf = 80 %
Figure 7.12. Example specification for mining spatial association rules
Then, the following spatial association rule may be discovered:
∀ X ∈ DB ∃ Y ∈ DB: is-a(X, town) ⇒ close-to(X, Y) ∧ is-a(Y, water) (80%)
This rule states that 80% of the selected towns are close to water, i.e. the rule characterizes
towns in British Columbia as generally being close to some lake, river etc.
The input for mining spatial association rules specifies a relation of n tuples with a spatial attribute, a spatial neighbourhood relation, a concept hierarchy for each of the attributes, a selection of relevant object types, the minimum support, and the minimum confidence.
The proposed algorithm consists of five steps. Step 2 (coarse spatial computation) and step
4 (refined spatial computation) involve spatial aspects of the objects and thus are examined in
the following. Step 2 computes spatial joins of the object type to be characterized (such as
town) with each of the other specified object types (such as water, road, boundary or mine) using the neighbourhood relation (such as close-to). For each of the candidates obtained from
step 2 which passed step 3, in step 4 the exact spatial relation, for example overlap, is determined. Finally, a relation such as the one depicted in figure 7.13 results which is the input for the
final, non-spatial, Apriori-like step of rule generation. To implement this algorithm using our database primitives it is sufficient to replace step 2 by the following procedure. The spatial join can
be replaced by calling a neighbours operation for each target object selected in step 1. The underlying neighbourhood graph in this case is defined by the user-specified neighbourhood relation (e.g. close-to).
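The counting that follows the neighbours calls could be sketched as below for the rule about towns and water; the function parameters (the list of towns, the neighbours call bound to the close-to relation and the is-water test) are assumptions of this illustration.

def town_close_to_water_confidence(towns, close_to_neighbours, is_water):
    """Fraction of towns that are close to at least one water object, i.e. the
    confidence of  is-a(X, town) => close-to(X, Y) and is-a(Y, water).
    close_to_neighbours(t) is the result of a neighbours call for town t
    with the close-to relation (replacing the spatial join of step 2)."""
    if not towns:
        return 0.0
    satisfied = sum(1 for t in towns
                    if any(is_water(o) for o in close_to_neighbours(t)))
    return satisfied / len(towns)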
(Figure 7.13, excerpt)
Town           Water                   Road                                         Boundary
Saanich        <meet, J.FucaStrait>    <overlap, highway1>, <close-to, highway17>   <close-to, US>
PrinceGeorge   ...                     <overlap, highway97>                         ...
Penticton      <meet, OkanaganLake>    <overlap, highway97>                         <close-to, US>
...            ...                     ...                                          ...
For a set of objects targets and a property prop, the frequency factor relates the relative frequency of prop among the targets to its relative frequency in the whole database:
f_targets(prop) = (freq_targets(prop) / card(targets)) / (freq_DB(prop) / card(DB))
Let significance and proportion be real numbers and let max-neighbours be a natural number. Let neighbours_G^i(s) denote the set of all objects reachable from one of the elements of s by traversing at most i of the edges of the neighbourhood graph G. Then, the task of spatial characterization is to discover each property prop and each natural number n ≤ max-neighbours such that (1) the set objects = neighbours_G^n(targets) as well as (2) the sets objects = neighbours_G^n({t}) for at least proportion many of the targets t satisfy the condition
f_objects(prop) ≥ significance.
In point (1) the union of the neighbourhoods of all target objects is considered simultaneously, whereas in point (2) the neighbourhood of each target is considered separately. The parameter proportion specifies the minimum confidence required for the characterization rules, and the frequency factors of the properties provide a measure of their interestingness with respect to the target objects.
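For condition (1), the computation could be sketched as follows; the dictionary of property predicates and the neighbours_n callable are assumptions, and the frequency factor is written as the ratio of relative frequencies discussed above.

def frequency_factor(prop, objects, db):
    """Relative frequency of property prop in 'objects' compared to its relative
    frequency in the whole database; values well above 1 mark properties that
    are unusually frequent around the target objects."""
    freq_objects = sum(1 for o in objects if prop(o)) / len(objects)
    freq_db = sum(1 for o in db if prop(o)) / len(db)
    return freq_objects / freq_db if freq_db > 0 else float('inf')

def significant_properties(targets, db, neighbours_n, properties, significance):
    """Properties whose frequency factor in the n-neighbourhood of the whole
    target set reaches the significance threshold (condition (1) of the task).
    neighbours_n(t) returns all objects reachable from t within n edges."""
    objects = set(targets)
    for t in targets:
        objects.update(neighbours_n(t))
    return [name for name, prop in properties.items()
            if frequency_factor(prop, objects, db) >= significance]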
Figure 7.15 presents the algorithm for discovering spatial characterizations. The parameter proportion is relevant only for the last step of the algorithm, i.e. for the generation of a rule.
5 Spatial Trend Detection
(Figure: scatter plots of the difference of attribute values against the distance to a source object; panel (c) shows the case of no trend)
In a naive approach, one observation could be made for each element o2 ∈ DB and a regression analysis would be conducted once for this whole set of observations. This approach, however, would fail to discover a spatial trend in the following situations:
a trend is present only locally in the neighbourhood but not globally in the database
a trend is present not in all directions but only in some directions.
We argue that such situations are very common in real databases. We use the concept of neighbourhood paths to overcome these problems and consider only such objects o2 which are located
on one of the neighbourhood paths starting from o1. We stop extending a neighbourhood path as
soon as no more significant trend can be found and, thus, provide a means to restrict the search to
the neighbourhood of the source object. Furthermore, it is enough that some, not all, neighbourhood paths show a spatial trend of the specified type.
Definition 4 (spatial trend detection): Let g be a neighbourhood graph, o an object (node) in g and a a subset of all non-spatial attributes. Let t be a type of function, e.g. linear or exponential, used for the regression and let filter be one of the filters for neighbourhood paths. Let min-conf be a real number and let min-length as well as max-length be natural numbers. The task of spatial trend detection is to discover the set of all neighbourhood paths in g starting from o and having a trend of type t in attributes a with a correlation of at least min-conf. The paths have to satisfy the filter and their length must be between min-length and max-length.
Definition 4 allows different specializations: either the set of all discovered neighbourhood paths or each of its elements must have a trend of the specified type. For each of these specializations, we present an algorithm to discover such spatial trends. Both algorithms require the same input parameters but they use different methods to search the set of all relevant neighbourhood paths.
The first algorithm discovers global trends, i.e. trends for the whole set of all neighbourhood paths with source o having a length in the specified interval. Algorithm detect-global-trends performs a breadth-first search of the set of all neighbourhood paths starting from o and is outlined in figure 7.19. Beginning from o, the algorithm creates all neighbourhood paths of the same length simultaneously - starting with min-length and continuing until max-length. The regression is then performed on the whole set of paths of the current length.
The second algorithm discovers local trends, i.e. trends for a single neighbourhood path with source o having a length in the specified interval. Algorithm detect-local-trends creates the neighbourhood paths starting from o in a depth-first manner, as presented in figure 7.20. A regression is performed once for each of the neighbourhood paths with min-length ≤ length ≤ max-length, and a path is only extended further if it has a significant trend (i.e. abs(correlation) of the resulting regression function ≥ min-conf). The algorithm returns two sets of paths showing a significant spatial trend: a first set of positive trends and a second set of negative trends.
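A sketch of the local variant follows; it reuses the adjacency-dictionary graph from the earlier sketches, performs an ordinary least-squares regression of the attribute difference against the distance to the source, and the attribute and distance callables are assumptions of the example.

import numpy as np

def path_trend(path, attribute, distance):
    """Regression of the attribute difference against the distance to the source
    object for all objects on the path; returns (slope, correlation)."""
    src = path[0]
    x = np.array([distance(src, o) for o in path[1:]])
    y = np.array([attribute(o) - attribute(src) for o in path[1:]])
    slope, _ = np.polyfit(x, y, 1)
    corr = np.corrcoef(x, y)[0, 1]
    return slope, corr

def detect_local_trends(graph, source, attribute, distance,
                        min_length, max_length, min_conf):
    """Depth-first sketch of detect-local-trends: a path is reported (and only
    extended further) while abs(correlation) >= min_conf."""
    positive, negative = [], []
    stack = [[source]]
    while stack:
        path = stack.pop()
        if len(path) >= max(min_length, 3):          # need at least 2 observations
            slope, corr = path_trend(path, attribute, distance)
            if not np.isfinite(corr) or abs(corr) < min_conf:
                continue                             # no significant trend: stop here
            (positive if slope > 0 else negative).append(path)
        if len(path) < max_length:
            stack.extend(path + [nxt] for nxt in graph[path[-1]] if nxt not in path)
    return positive, negative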
The existence of a global trend for a start object o indicates that, considering all objects on all paths starting from o, the values of the specified attribute(s) in general tend to increase (or decrease) with increasing distance. Figure 7.21 (a) depicts the result of algorithm detect-global-trends for the attribute average rent and the city of Regensburg as a start object.
(Figure 7.21: (a) global trends and (b) local trends of the attribute average rent for the start object Regensburg; arrows indicate the direction of decreasing values of the trend attribute)
First, detect centres for a given attribute, i.e. communities with a significantly decreasing spatial trend of this attribute. Algorithm detect-global-trends can be used for this task.
Second, apply the spatial characterization to these centres to find their common properties.
Ester et al. (1998) report the results of the following experiment. In a first step, centres for the attribute average rent were detected: the minimum correlation was set to 0.7 and only those communities were selected where the slope of the trend was less than -10^-4 and the path length was not smaller than 5, i.e. only linear trends that are noticeably decreasing were detected.
With this definition, 24 centres out of the 2043 communities were found. The characterization
rule discovered for these centres contains the following properties:
community is a centre
rate of academics = high (n = 1, f(prop) = 9.1)
average number of persons per household = low (n = 1, f(prop) = 2.5)
rate of foreigners = low (n = 1, f(prop) = 2.8)
Note that none of the attributes was significant for n = 0, i.e. without considering the neighbourhood of the target objects. Only if we extend the target regions by one neighbour can we see characteristic properties. Thus, this result could not be found by a non-spatial characterization algorithm.
6 Spatial Classification
The task of classification is to assign an object to a class from a given set of classes based on the attribute values of the object. In spatial classification the attribute values of neighbouring objects may also be relevant for the class membership of an object and therefore have to be considered as well.
Ester et al. (1997) proposed a spatial classification algorithm based on the well-known ID3 algorithm (Quinlan, 1986) which was designed for relational databases. The extension to the spatial case is to also consider the attributes of objects on a neighbourhood path starting from the current object. Thus, we define generalized attributes for a neighbourhood path p = [o1, . . ., ok] as tuples (attribute-name, index) where index is a valid position in p, representing the attribute with this attribute-name of object o_index. The generalized attribute (economic-power, 2), e.g., represents the attribute economic-power of some (direct) neighbour of object o1.
Figure 7.22. Sample decision tree and rules discovered by the classification algorithm (splits on generalized attributes such as population of city, amount of taxes of city, type of neighbour of city and type of neighbour of neighbour of city; e.g. economic power of city = high with 87% and 95% confidence)
Because it is reasonable to assume that the influence of neighbouring objects and their attributes
decreases with increasing distance, we can limit the length of the relevant neighbourhood paths by
an input parameter max-length. Furthermore, the classification algorithm allows the input of a
predicate to focus the search for classification rules on the objects of the database fulfilling this
predicate. Figure 7.22 depicts a sample decision tree and two rules derived from it. Economic
power has been chosen as the class attribute and the focus is on all objects of type city.
In figure 7.23, we present the pseudo code of the algorithm for spatial classification, which finds all paths from the root to one of the leaves of a decision tree where all attributes yield an information gain of at least a given threshold min-info-gain. Note that the algorithm uses a predicate larger_distance, i.e. distance(n1, ni+1) > distance(n1, ni), to restrict the creation of neighbourhood paths.
spatial_classification_rules(db:Set_of_Objects; class_attr:Attribute;
                             NPred:NeighbourhoodRelation; max_length:Int)
  NPredGraph := create the neighbourhood graph for db w.r.t. NPred;
  paths := extensions(NPredGraph, {db}, max_length, larger_distance);
  classify(class_attr, EMPTY_RULE, paths, max_length);

classify(class_attr:Attribute; rule:ClassificationRule; paths:set_of_paths; max_length:Int)
  max_info_gain := 0.0;
  max_attr := NULL;
  for i from 1 to max_length do
    for each generalized attribute (Aj, i) not used in rule do
      info_gain := calculate_information_gain(Aj, class_attr, i, paths);
      if info_gain > max_info_gain then
        max_attr := Aj;
        max_neighbours := i;
        max_info_gain := info_gain;
  if max_attr ≠ NULL and max_info_gain > min_info_gain then
    for each value of max_attr do
      extended_rule := rule + (max_attr, max_neighbours, value);
      classify(class_attr, extended_rule, paths, max_length);

calculate_information_gain(attr, class_attr:Attribute; index:Int; paths:set_of_paths)
  for each path in paths do
    consider attr of the index-th object of path and class_attr of the first object of path
    for the calculation of the information gain
Figure 7.23. Algorithm spatial_classification_rules
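The information-gain step of the pseudo code could be filled in roughly as follows; the entropy-based gain, the 1-based path position and the attribute accessors are assumptions of this sketch.

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(paths, attr, index, class_attr):
    """Information gain of the generalized attribute (attr, index) with respect to
    the class attribute of the first object of each path, as in
    calculate_information_gain; index is the 1-based position within the path."""
    usable = [p for p in paths if len(p) >= index]
    if not usable:
        return 0.0
    classes = [class_attr(p[0]) for p in usable]
    groups = {}
    for p in usable:
        groups.setdefault(attr(p[index - 1]), []).append(class_attr(p[0]))
    split = sum(len(g) / len(usable) * entropy(g) for g in groups.values())
    return entropy(classes) - split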
Another algorithm for spatial classification is presented by Koperski et al. (1998). It works as
follows: The relevant attributes are extracted by comparing the attribute values of the target objects
with the attribute values of their nearest neighbours. The determination of relevant attributes is
based on the concept of the nearest hit (the nearest neighbour belonging to the same class) and the
nearest miss (the nearest neighbour belonging to a different class). In the construction of the decision tree, the neighbours of target objects are not considered individually. Instead, so-called buffers
are created around the target objects and the non-spatial attribute values are aggregated over all objects contained in the buffer. For instance, in the case of shopping malls a buffer may represent the
area where its customers live or work. The size of the buffer yielding the maximum information
gain is chosen and this size is applied to compute the aggregates for all relevant attributes.
Whereas the property of being a nearest neighbour cannot be directly expressed by our neighbourhood relations, it is possible to extend our set of neighbourhood relations accordingly. The proposed database primitives are, however, sufficient to express the creation of buffers for spatial classification by using a distance-based neighbourhood predicate.
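A buffer of a given radius and the aggregation over it can be expressed with such a predicate as sketched below; the point-object distance, the averaging aggregate and the parameter names are assumptions of the example.

import math

def buffer_aggregate(target, db, radius, attribute):
    """Average of a non-spatial attribute over all objects inside a distance-based
    buffer around the target, i.e. over the neighbourhood defined by the
    predicate distance < radius (point objects assumed)."""
    inside = [o for o in db
              if o is not target
              and math.hypot(o[0] - target[0], o[1] - target[1]) < radius]
    values = [attribute(o) for o in inside]
    return sum(values) / len(values) if values else None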
7 Conclusions
In this chapter, we introduced a database-oriented framework for spatial data mining which is
based on the concepts of neighbourhood graphs and paths. A small set of basic operations on these
graphs and paths were defined as database primitives for spatial data mining. Furthermore, techniques to efficiently support the database primitives by a commercial DBMS were presented. In
the following sections, we covered the main tasks of spatial data mining: spatial clustering, spatial characterization, spatial trend detection and spatial classification. For each of these tasks, we presented algorithms as well as prototypical applications in domains such as the earth sciences and geography. Thus, we demonstrated the practical impact of these algorithms for spatial data mining.
The following issues indicate interesting directions for future research. The database primitives were implemented on top of the commercial DBMS Illustra. Since the system overhead imposed by this DBMS is rather large, techniques for improving the efficiency should be investigated. For example, techniques for processing sets of related neighbours operations, which provide more information to the DBMS, can be used to improve the overall efficiency of mining algorithms using the database primitives.
In some spatial databases the dimension of time plays an important role: the history of the relevant part of the world is stored for the purpose of analysis, for example raster images of the same
area of the surface of the earth taken at different times. Data mining in such spatio-temporal databases is a promising area of future research. For example, geographers may be interested in learning spatio-temporal rules describing the process of growth of urban landuse.
References
Agrawal R., Imielinski T., and Swami A. 1993 Database Mining: A Performance
Perspective, IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 6, pp. 914-925.
Agrawal R., and Srikant R. 1994 Fast Algorithms for Mining Association Rules in Large
Databases, Proc. 20th Int. Conf. on Very Large Data Bases (VLDB 94), pp. 478-499.
Chen M.-S., Han J., and Yu P. S. 1996 Data Mining: An Overview from a Database
Perspective, IEEE Trans. on Knowledge and Data Engineering, Vol. 8, No. 6, pp. 866-883.
Egenhofer M. J. 1991 Reasoning about Binary Topological Relations, Proc. 2nd Int. Symp.
on Large Spatial Databases, Zurich, Switzerland, pp. 143-160.
Ester M., Frommelt A., Kriegel H.-P., and Sander J. 1998 Algorithms for Characterization and
Trend Detection in Spatial Databases, Proc. 4th Int. Conf. on Knowledge Discovery and Data
Mining, New York City, NY, pp. 44-50.
Ester M., Frommelt A., Kriegel H.-P., and Sander J. 2000 Spatial Data Mining: Database
Primitives, Algorithms and Efficient DBMS Support, Data Mining and Knowledge
Discovery, an International Journal, Kluwer Academic Publishers, Vol. 4, No. 2/3.
Ester M., Kriegel H.-P., and Sander J. 1997 Spatial Data Mining: A Database Approach,
Proc. 5th Int. Symp. on Large Spatial Databases, Berlin, Germany, pp. 47-66.
Ester M., Kriegel H.-P., Sander J., and Xu X. 1996 A Density-Based Algorithm for
Discovering Clusters in Large Spatial Databases with Noise. Proc. 2nd Int. Conf. on
Knowledge Discovery and Data Mining. Portland, Oregon, AAAI Press, Menlo Park,
California, pp. 226-231.
Ester M., Kriegel H.-P., Sander J., and Xu X. 1997 Density-Connected Sets and their
Application for Trend Detection in Spatial Databases, Proc. 3rd Int. Conf. on Knowledge
Discovery and Data Mining, Newport Beach, California, AAAI Press, Menlo Park, California.
Fayyad U. M. 1997 Knowledge Discovery in Databases: An Overview, ILP 1997, pp. 3-16.
Fayyad U. M., Piatetsky-Shapiro G., and Smyth P. 1996 Knowledge Discovery and Data
Mining: Towards a Unifying Framework, Proc. 2nd Int. Conf. on Knowledge Discovery and
Data Mining. Portland, Oregon, AAAI Press, Menlo Park, California, pp. 82 - 88.
Gueting R. H. 1994 An Introduction to Spatial Database Systems, Special Issue on Spatial
Database Systems of the VLDB Journal, Vol. 3, No. 4.
Guttman A. 1984 R-trees: A Dynamic Index Structure for Spatial Searching, Proc. ACM
SIGMOD Int. Conf. on Management of Data, pp. 47-54.
Han J., Koperski K., and Stefanovic N. 1997 GeoMiner: A System Prototype for Spatial Data
Mining, Proc. ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD '97), Tucson,
AZ.
Koperski K., Adhikary J., Han J. 1996 Knowledge Discovery in Spatial Databases: Progress
and Challenges, Proc. SIGMOD Workshop on Research Issues in Data Mining and
Knowledge Discovery, Technical Report 96-08, University of British Columbia, Vancouver,
Canada.
Koperski K., and Han J. 1995 Discovery of Spatial Association Rules in Geographic
Information Databases, Proc. 4th Int. Symp. on Large Spatial Databases (SSD 95), Portland,
ME, pp. 47-66.
Koperski K., Han J., and Stefanovic N. 1998 An Efficient Two-Step Method for Classification
of Spatial Data, Proc. Symposium on Spatial Data Handling (SDH 98), Vancouver, Canada.
Lu W., and Han J. 1992 Distance-Associated Join Indices for Spatial Range Search, Proc. 8th
Int. Conf. on Data Engineering, Phoenix, AZ, pp. 284-292.
Ng R. T., and Han J. 1994 Efficient and Effective Clustering Methods for Spatial Data
Mining, Proc. 20th Int. Conf. on Very Large Data Bases, Santiago, Chile, 1994, pp. 144-155.
Niemann H. 1990 Pattern Analysis and Understanding. Springer-Verlag, Berlin.
Quinlan J.R. 1986 Induction of Decision Trees, Machine Learning, Vol.1, pp. 81-106.
Richards J.A. 1983 Remote Sensing Digital Image Analysis: An Introduction, Springer-Verlag,
Berlin.
Rotem D. 1991 Spatial Join Indices, Proc. 7th Int. Conf. on Data Engineering, Kobe, Japan,
pp. 500-509.
Sander J., Ester M., Kriegel H.-P., and Xu X. 1998 Density-Based Clustering in Spatial
Databases: A New Algorithm and its Applications, Data Mining and Knowledge Discovery,
an International Journal, Kluwer Academic Publishers, Vol.2, No. 2.
Valduriez P. 1987 Join Indices, ACM Transactions on Database Systems, Vol. 12, No. 2,
pp. 218-246.