R-TREES
JÜRGEN TREML
2005-06-08
Table of Contents
1 Introduction
2 R-tree Index Structure
  2.1 General Structure of an R-tree
  2.2 Structure of R-tree Nodes
  2.3 Parameters m and M
3 Algorithms on R-trees
  3.1 Motivation
  3.2 Searching
    3.2.1 Algorithm Description
    3.2.2 Example
  3.3 Insertion
    3.3.1 Algorithm Description
    3.3.2 Example
  3.4 Deletion
    3.4.1 Algorithm Description
    3.4.2 Example
  3.5 Node Splitting
    3.5.1 Exhaustive Approach
    3.5.2 Quadratic-Cost Algorithm
    3.5.3 Linear-Cost Version
  3.6 Updates and Further Operations
4 Performance Tests and Benchmarking
5 R-tree Modifications
6 Conclusion
7 Glossary
8 Figures
9 References
10 Index
1 Introduction
With the speed of computers growing rapidly over the last 20 years, they have become indispensable in fields such as geo-analysis, cartography, image processing and many more subjects that have one thing in common: the processing of spatial data. With spatial data being not only important in itself but even the foundation of applications such as computer aided design or virtual reality, one thing has become clear: existing index structures for one- or two-dimensional data are neither sufficient nor in any way efficient for storing and processing spatial, multi-dimensional data.
2 R-tree Index Structure
2.1 General Structure of an R-tree
An R-tree satisfies the following properties:
(1) Every leaf node contains between m and M index records unless it is
the root.
(2) For each index record (I, tuple-identifier) in a leaf node, I is the smallest rectangle that spatially contains the n-dimensional data object represented by the indicated tuple.
(3) Every non-leaf node has between m and M children unless it is the
root.
(4) For each entry (I, child-pointer) in a non-leaf node, I is the smallest rectangle that spatially contains the rectangles in the child node.
(5) The root node has at least two children unless it is a leaf.
(6) All leaves appear on the same level.
2.2 Structure of R-tree Nodes
Non-leaf nodes contain entries of the form

(I_n, CP)

pointing to child nodes, with CP standing for child pointer and thus being a reference to a child node on a lower level of the tree structure. I_n is the n-dimensional rectangle forming the MBR of all the child node's rectangles.
Leaf nodes, in turn, contain entries of the form

(I_n, TID)

where TID stands for tuple identifier and refers to a tuple of data objects in the database. I_n again is an n-dimensional rectangle, this time making up the MBR of the referenced data objects.
In both cases, I_n = (I_0, I_1, ..., I_(n-1)), where each I_i is a closed bounded interval describing the extent of the indexed object along dimension i.
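To make the two entry formats a little more concrete, they could be modelled roughly as follows. This is only a minimal sketch for illustration; the names Rect, Entry and Node are assumptions and not taken from the original paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# An n-dimensional rectangle: one (low, high) interval per dimension.
Rect = Tuple[Tuple[float, float], ...]

@dataclass
class Entry:
    """One R-tree entry: an MBR plus either a child pointer (non-leaf node)
    or a tuple identifier (leaf node)."""
    mbr: Rect
    child: Optional["Node"] = None   # CP: set in non-leaf entries
    tid: Optional[int] = None        # TID: set in leaf entries

@dataclass
class Node:
    leaf: bool
    entries: List[Entry] = field(default_factory=list)
```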
2.3 Parameters m and M
The parameter M denotes the maximum number of entries that fit into one node, while

m ≤ M/2

specifies the minimum number of entries for a single node. Thus the height of an R-tree containing N index records is at most

⌈log_m N⌉ − 1,

and worst-case space utilization is m/M per node.
The latter especially shows that a bigger m will decrease overall space usage. This is quite plausible, as almost all of the space is consumed by leaf nodes containing MBRs of the indexed data objects, and only a small share by inner nodes containing MBRs of their child nodes. Furthermore, m decisively influences the number of underflows occurring when indexed data is deleted.
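To get a feeling for this bound, consider a purely illustrative example (the numbers are assumptions, not values from the paper): with m = 25 and N = 1,000,000 indexed records the height is at most

⌈log_25 1,000,000⌉ − 1 = 5 − 1 = 4,

so any index record can be reached within a handful of node accesses.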
3 Algorithms on R-trees
3.1 Motivation
Just as with B-trees and all other kinds of trees, the most important idea behind R-trees is not to store data but to structure it. In fact, this is why we call it a structure and not a storage. If the goal were merely to store data, there would be far more efficient ways to do so. What we want when we use R-trees is a structured storage that allows us to search and modify data efficiently. And this is why we need a good set of algorithms to search an R-tree as well as to insert, delete and update the indexed data.
Step by step we will now get to know basic implementations of a few quite useful algorithms for search, insert and delete operations on R-trees. Furthermore, we will concentrate on node splitting as an answer to the overflows and underflows that may occur when inserting data into or deleting data from the index. As there are several more or less simple ways to deal with this, and as they differ considerably in cost, we will explain and compare three different implementations for this case. Last but not least, a few words will also be said about update and other useful operations, which are basically just modified versions of the algorithms mentioned above. To round all this off, the formal description of each algorithm is followed by a concrete application of that algorithm to sample data at the end of each chapter.
3.2 Searching
3.2.1 Algorithm Description
The basic search algorithm on R-trees, similar to search operations on B-trees, traverses the tree from the root to its leaf nodes. However, as there is no rule that prohibits overlapping rectangles within the same node of an R-tree, the search algorithm may need to descend into more than one subtree of a visited node. This is the reason why no guarantee of good worst-case performance can be given, although bad cases should be rare, since the update algorithms are designed to keep the overlap between regions small. But now let's take a closer look at the algorithm itself.
Algorithm SEARCH
Given an R-tree whose root node is T, find all index records whose rectangles overlap a search rectangle S.
(1) [Search subtrees] If T is not a leaf, check each entry E to determine whether E.I overlaps S. For all overlapping entries, invoke SEARCH on the tree whose root node is pointed to by E.P.
(2) [Search leaf node] If T is a leaf, check all entries E to determine whether E.I overlaps S. If so, E is a qualifying record.
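As an illustration, the search might be sketched in Python as follows; the node and entry structures and the overlaps helper are assumptions made for this sketch, not part of the original description.

```python
from typing import List, Tuple

Rect = Tuple[Tuple[float, float], ...]  # one (low, high) interval per dimension

def overlaps(a: Rect, b: Rect) -> bool:
    """Two rectangles overlap iff their intervals overlap in every dimension."""
    return all(al <= bh and bl <= ah for (al, ah), (bl, bh) in zip(a, b))

def search(node, s: Rect) -> List:
    """Return all qualifying leaf entries whose MBR overlaps the search rectangle s."""
    results = []
    if node.leaf:                                   # step (2): check leaf entries directly
        results.extend(e for e in node.entries if overlaps(e.mbr, s))
    else:                                           # step (1): descend into every overlapping subtree
        for e in node.entries:
            if overlaps(e.mbr, s):
                results.extend(search(e.child, s))
    return results
```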
3.2.2 Example
In order to make things a little clearer, we will now go through the algorithm step by step, applying it to 2-dimensional sample data represented by the following spatial structure, where the innermost rectangles are the MBRs of objects in the database and all other rectangles are the MBRs of the rectangles they contain.
[Figure: the sample spatial structure, consisting of the object MBRs R111–R133 and R211–R223, grouped into the rectangles R11, R12, R13, R21 and R22, which in turn are grouped into R1 and R2.]
[Figure: the corresponding R-tree, with the root pointing to R1 and R2, R1 pointing to R11, R12 and R13, R2 pointing to R21 and R22, and the leaves holding R111–R133 and R211–R223.]
We will now use the search algorithm presented above to find all data objects overlapping an object described by the hatched MBR in the following diagram.
[Figure: the same spatial structure with the hatched MBR of the search object, which overlaps parts of R1 and R2.]
The algorithm of course starts at the root node. As the object's MBR overlaps both R1 and R2, the search continues in the subtrees of both. Let's first take a look at the subtree of R2. The algorithm checks R21 and R22 for overlap with the search rectangle, finds only R22 overlapping and therefore continues to search its subtree. Being on the leaf level now, it finds none of the referenced rectangles overlapping the search object, which means that the search in the subtree of R2 has come to an end without producing any results. The algorithm then continues with R1's child node. Rectangles R11 and R13 are found to overlap the search object, so their subtrees are searched as well. Finally, the search returns R112, R113, R131, R132 and R133 as overlapping the search rectangle, and the data objects referenced by the tuples containing those rectangles make up the result of our search. The following graph visualizes the path of this search through the tree structure.
[Figure: the search path through the R-tree, descending into R1 and R2 at the root, into R11 and R13 below R1, and into R22 below R2.]
3.3 Insertion
3.3.1 Algorithm Description
Being able to search an R-tree, let's now take a closer look at ways to alter the indexed data. A good starting point is an algorithm that can be used for inserting new data into the tree.
In detail the algorithm for inserting a new entry E into a given R-tree looks
like this:
Algorithm INSERT
(1) [Find position for a new record] Invoke CHOOSELEAF to select a leaf
node L in which to place E .
(2) [Add record to leaf node] If L has room for another entry, install E .
Otherwise invoke SPLITNODE to obtain L and LL containing E and
all the old entries of L .
(3) [Propagate changes upward] Invoke ADJUSTTREE on L , also
passing LL if a split was performed.
(4) [Grow tree taller] If node split propagation caused the root to split,
create a new root whose children are the two resulting nodes.
Algorithm CHOOSELEAF
(1) [Initialize] Set N to be the root node.
(2) [Leaf check] If N is a leaf, return N .
(3) [Choose subtree] If N is not a leaf, let F be the entry in N whose
rectangle F.I needs least enlargement to include E.I . Resolve ties by
choosing the entry with the rectangle of smallest area.
(4) [Descend until a leaf is reached] Set N to be the child node pointed to
by F.P and repeat from (2).
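The subtree choice in step (3) essentially compares area enlargements. A minimal sketch, assuming the node and entry structures introduced above:

```python
from math import prod
from typing import Tuple

Rect = Tuple[Tuple[float, float], ...]

def area(r: Rect) -> float:
    return prod(high - low for low, high in r)

def enlarged(r: Rect, other: Rect) -> Rect:
    """Smallest rectangle containing both r and other."""
    return tuple((min(al, bl), max(ah, bh)) for (al, ah), (bl, bh) in zip(r, other))

def choose_leaf(root, new_mbr: Rect):
    """Descend from the root to the leaf whose entry needs least enlargement."""
    node = root
    while not node.leaf:
        best = min(
            node.entries,
            key=lambda e: (area(enlarged(e.mbr, new_mbr)) - area(e.mbr),  # least enlargement first
                           area(e.mbr)),                                  # tie-break: smaller area
        )
        node = best.child
    return node
```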
Algorithm ADJUSTTREE
(1) [Initialize] Set N = L . If L was split previously, set NN to be the
resulting second node.
(2) [Check if done] If N is the root, stop.
(3) [Adjust covering rectangle in parent entry] Let P be the parent node of N, and let E_N be N's entry in P. Adjust E_N.I so that it tightly encloses all entry rectangles in N.
(4) [Propagate node split upward] If N has a partner NN resulting from an earlier split, create a new entry E_NN with E_NN.P pointing to NN and E_NN.I enclosing all rectangles in NN. Add E_NN to P if there is room. Otherwise, invoke SPLITNODE to produce P and PP containing E_NN and all of P's old entries.
(5) [Move up to next level] Set N = P and set NN = PP if a split occurred.
Repeat from (2).
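Step (3) of ADJUSTTREE amounts to recomputing the parent entry's MBR as the union of the child node's entry rectangles. A small sketch under the same assumptions as above:

```python
from typing import Iterable, Tuple

Rect = Tuple[Tuple[float, float], ...]

def union_mbr(rects: Iterable[Rect]) -> Rect:
    """Smallest rectangle enclosing all given rectangles, dimension by dimension."""
    rects = list(rects)
    return tuple(
        (min(r[d][0] for r in rects), max(r[d][1] for r in rects))
        for d in range(len(rects[0]))
    )

def tighten_parent_entry(parent_entry, child_node) -> None:
    """ADJUSTTREE step (3): make the parent entry's MBR tightly enclose the child node."""
    parent_entry.mbr = union_mbr(e.mbr for e in child_node.entries)
```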
3.3.2 Example
As we already did when introducing a search algorithm on R-trees, let’s again
walk through a concrete example of an insert operation step by step and see
what happens to our sample R-tree structure already used in chapter 3.2.2.
[Figure: the sample spatial structure from chapter 3.2.2 together with the MBR of the new object to be inserted.]
Calling the INSERT algorithm will first of all lead to a call to CHOOSELEAF. This algorithm traverses the tree from the root to its leaves and finally identifies R11 as the rectangle which would need the least enlargement to contain the object we would like to insert. The algorithm's path through the tree is shown in the following figure.
[Figure: CHOOSELEAF's path through the R-tree, descending via R1 to the node R11.]
Having identified the place to insert the new object into the tree, the main part of the insert algorithm will find that there is not enough space left to add the new object as an entry of R11. Thus SPLITNODE is called on R11 to split it into two nodes R11 and R11', with the resulting nodes' MBRs being minimal. This process of splitting the node is illustrated in the figure below.
[Figure: node R11 before the split and the two nodes R11 and R11' after the split, each with a minimal MBR.]
Now that the node has been split and the new object has been inserted into the tree, a call to ADJUSTTREE is made to ensure that all ancestors of the changed node are updated accordingly. In this case this means that the algorithm tries to insert the new node R11' into R1, which, as you can see, is already at its limit. So SPLITNODE is called again and R1 is split into R1 and R1', with R1 containing R11' and R1' containing R11. As there is still space left in the root, the resulting call to ADJUSTTREE this time simply inserts R1' into the root. No further split is required and therefore no new root has to be created. Finally, the resulting tree containing the new object looks as shown below.
[Figure: the R-tree after the insertion, with the root now pointing to R1', R1 and R2.]
Last but not least the following figure shows the geometric structure
represented by the altered tree.
[Figure: the geometric structure after the insertion, now including the new rectangles R1' and R11'.]
3.4 Deletion
3.4.1 Algorithm Description
Having discussed search and insert operations on R-trees, we will now focus on the third and last basic function a simple R-tree implementation should provide: the deletion of indexed objects.
Although the last chapters all started by pointing out that R-tree operations are quite similar to their corresponding operations on B-trees, this time the similarities are, at least to some extent, missing. The difference concerns the treatment of under-full nodes, which may occur when deleting objects from the index. In B-trees this is dealt with by simply merging two or more adjacent nodes, and at first glance nothing would technically keep you from doing the same when deleting entries in an R-tree. Nevertheless, Guttman gives two good reasons why this might not be the best approach to deal with underflows and why deleting under-full nodes and re-inserting their entries may be considered the better solution. The first is that the delete-and-reinsert approach allows us to re-use the insert routine introduced in the previous chapter while accomplishing the same thing as merging under-full nodes with sibling nodes. This argument of course only matters if the performance of both implementations is about the same; but, as Guttman points out, this is the case, since the pages visited during re-insertion are basically the same as those visited during the preceding search and should therefore already be in memory. As a second reason for this alternative implementation, Guttman notes that re-insertions incrementally refine the spatial structure of the tree.
Algorithm DELETE
(1) [Find node containing record] Invoke FINDLEAF to locate the leaf
node L containing E . Stop if the record was not found.
(2) [Delete record] Remove E from L .
(3) [Propagate changes] Invoke CONDENSETREE, passing L .
(4) [Shorten tree] If the root node has only one child after the tree has
been adjusted, make the child the new root.
Herein the following algorithm is used to identify the leaf node containing
index entry E in an R-tree with root T .
Algorithm FINDLEAF
(1) [Search subtrees] If T is not a leaf, check each entry F in T to
determine if F.I overlaps E.I . For each such entry invoke FINDLEAF
on the tree whose root is pointed to by F.P until E is found or all
entries have been checked.
(2) [Search leaf node for record] If T is a leaf, check each entry to see if it
matches E . If E is found return T .
Algorithm CONDENSETREE
Given a leaf node L from which an entry has been deleted, eliminate the node if it has too few entries and relocate its entries; propagate node elimination upward as necessary and adjust all covering rectangles on the path to the root.
(1) [Initialize] Set N = L. Set Q, the set of eliminated nodes, to be empty.
(2) [Find parent entry] If N is the root, go to (6). Otherwise let P be the parent of N, and let E_N be N's entry in P.
(3) [Eliminate under-full node] If N has fewer than m entries, delete E_N from P and add N to the set Q.
(4) [Adjust covering rectangle] If N has not been eliminated, adjust E_N.I so that it tightly encloses all entry rectangles in N.
(5) [Move up one level in the tree] Set N = P and repeat from (2).
(6) [Re-insert orphaned entries] Re-insert all entries of nodes in Q. Entries from eliminated leaf nodes are re-inserted at the leaf level as described in algorithm INSERT, but entries from higher-level nodes must be placed higher in the tree, so that the leaves of their dependent subtrees end up on the same level as the leaves of the main tree.
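The elimination-and-collection part of CONDENSETREE could be sketched as follows; the tree helpers (parent_of, reinsert) and the reuse of the union_mbr helper from the ADJUSTTREE sketch are assumptions for illustration only.

```python
def condense_tree(tree, leaf, m: int) -> None:
    """Simplified CONDENSETREE: eliminate under-full nodes on the way up
    and re-insert their orphaned entries afterwards."""
    orphans = []                                         # Q, the set of eliminated nodes
    node = leaf
    while node is not tree.root:
        parent, entry_in_parent = tree.parent_of(node)   # assumed helper
        if len(node.entries) < m:                        # step (3): eliminate under-full node
            parent.entries.remove(entry_in_parent)
            orphans.append((node.leaf, list(node.entries)))
        else:                                            # step (4): tighten covering rectangle
            entry_in_parent.mbr = union_mbr(e.mbr for e in node.entries)
        node = parent                                    # step (5): move up one level
    for was_leaf, entries in orphans:                    # step (6): re-insert orphaned entries
        for e in entries:
            tree.reinsert(e, at_leaf_level=was_leaf)     # assumed helper
```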
3.4.2 Example
As with the algorithms before we will now again take a look at an exemplary
delete operation on our sample tree. Nevertheless, to make this example a
non-trivial one, we will modify our sample structure’s parameters and set
M = 4 and m = 2 . The resulting tree will look as shown below with the
hatched entry R212 being the one to be deleted.
[Figure: the sample R-tree with M = 4 and m = 2; the root points to R1 and R2, R2 points to R21 and R22, and the hatched leaf entry R212 is the one to be deleted.]
Our initial call to algorithm DELETE on entry R212 will first of all lead to a
call to FINDLEAF in order to locate the node containing the entry pointing to
R212. Returning R21 as a result, the main algorithm will delete R212's entry in R21 and call CONDENSETREE on this node. Given the new m = 2, R21 is now facing an underflow. Therefore it is added to the empty set Q and its entry R21 in R2 is deleted. Calling CONDENSETREE again on R2, which is now facing an underflow itself, R2 is added to Q as well and its entry in the parent node is deleted. The last call to CONDENSETREE now finds itself at the root level of the tree and, according to the algorithm's definition, does not apply any changes, so the root node remains unchanged.
Now all entries in Q are re-inserted into the tree, starting with those entries that were records of non-leaf nodes before their removal. These entries are inserted at a level such that their leaves end up on the general leaf level of the tree; in our case this applies to the entry R22 only. Once all former records of non-leaf nodes have been inserted, the algorithm continues with the former records of leaf nodes, inserting them at the leaf level. Therefore R211 is inserted into node R22 and CONDENSETREE finishes. The following figures illustrate these operations.
[Figure 12: Saving nodes whose references have been removed from the tree.]
[Figure: re-inserting the orphaned entries from Q; R22 is re-inserted below the root and R211 is re-inserted into R22.]
Eventually, a final check finds that the root node contains only a single entry, whose child therefore becomes the new root of the tree, and we obtain the new R-tree structure without the deleted entry R212 as shown below.
[Figure: the resulting R-tree after the deletion of R212.]
3.5 Node Splitting
3.5.1 Exhaustive Approach
The first algorithm is a straightforward and naïve yet simple approach to this problem. The basic and quite comprehensible idea behind this implementation is to simply try all possible split-ups and then select the best one. This brings up two questions, the first of which is what we mean by "the best one". According to Guttman, a good split minimizes the total area of the MBRs generated by the split, as illustrated by the following figure.
The second question concerns the efficiency of the above-mentioned algorithm. It is easy to see that there are 2^(M−1) different possibilities to split up a node containing M entries. Common sense tells us that this is not an option for an efficient implementation of an R-tree structure, which is why we are going to look at two more advanced algorithms to deal with the splitting of R-tree nodes.
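To put a number on this, take M = 50, the capacity of a 1024-byte page in the tests of chapter 4 (used here purely for illustration):

2^(M−1) = 2^49 ≈ 5.6 · 10^14

candidate splits would have to be evaluated for a single overflowing node, which is clearly out of the question.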
3.5.2 Quadratic-Cost Algorithm
The idea behind this implementation is to pick, out of the M entries in the original node together with the new entry, the two entries that would consume the most space if put together in one node (subtracting their own area), that is, the two that would waste the most space if they were part of the same node. One of these two entries is then put into the first node, whereas the other one is put into the second one. For all the remaining entries, the algorithm step by step picks the one which creates the biggest difference in area when added to either of the new nodes and finally assigns it to the node which gains less area by adding this entry. This is repeated until all of the remaining entries are assigned to one of the nodes and we finally get two new nodes containing all the entries of the original node plus the new entry that was to be inserted (assuming we were performing an insert operation). In case we were performing a delete operation, things are only slightly different.
Given this short and intuitive explanation, we can now put it into a more formal and exact description of a quadratic-cost algorithm dividing M + 1 index records into two groups, which looks like the following.
Algorithm QUADRATICSPLIT
(1) [Pick first entry for each group] Apply algorithm PICKSEEDS to choose
two entries to be the first elements of the groups. Assign each to a
group.
(2) [Check if done] If all entries have been assigned, stop. If one group
has so few entries that all the rest must be assigned to it in order for it
to have the minimum number m , assign them and stop.
(3) [Select entry to assign] Invoke algorithm PICKNEXT to choose the
next entry to assign. Add it to the group whose covering rectangle will
have to be enlarged least to accommodate it. Resolve ties by adding
the entry to the group with smaller area, then to the one with fewer
entries, then to either. Repeat from (2).
Algorithm PICKSEEDS
(1) [Calculate inefficiency of grouping entries together] For each pair of entries E_1 and E_2, compose a rectangle J including E_1.I and E_2.I. Calculate d = area(J) − [area(E_1.I) + area(E_2.I)].
(2) [Choose the most wasteful pair] Choose the pair with the largest d.
Last but not least, a second helper algorithm is used to select the next of the remaining entries for assignment to a group.
Algorithm PICKNEXT
(1) [Determine cost of putting each entry in each group] Let G_1 and G_2 be the two new groups to which the entries are to be assigned. For each entry E not yet in a group, compose a rectangle J_1 including the covering rectangle of G_1 and E.I and calculate d_1 = area(J_1) − area(covering rectangle of G_1). Calculate d_2 similarly for the second group.
(2) [Find entry with greatest preference for one group] Choose any entry with the maximum difference between d_1 and d_2.
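For illustration, the quadratic PICKSEEDS and PICKNEXT steps might look roughly like this in Python; the entry objects are assumed to carry an mbr attribute, and the helpers are assumptions made for this sketch.

```python
from itertools import combinations
from math import prod
from typing import List, Tuple

Rect = Tuple[Tuple[float, float], ...]

def area(r: Rect) -> float:
    return prod(high - low for low, high in r)

def union2(a: Rect, b: Rect) -> Rect:
    return tuple((min(al, bl), max(ah, bh)) for (al, ah), (bl, bh) in zip(a, b))

def pick_seeds(entries: List) -> Tuple[int, int]:
    """Quadratic PICKSEEDS: the pair that would waste the most area if grouped together."""
    def waste(i: int, j: int) -> float:
        j_rect = union2(entries[i].mbr, entries[j].mbr)
        return area(j_rect) - area(entries[i].mbr) - area(entries[j].mbr)
    return max(combinations(range(len(entries)), 2), key=lambda p: waste(*p))

def pick_next(remaining: List, cover1: Rect, cover2: Rect):
    """PICKNEXT: the entry with the strongest preference for one of the two groups."""
    def preference(e) -> float:
        d1 = area(union2(cover1, e.mbr)) - area(cover1)   # enlargement needed by group 1
        d2 = area(union2(cover2, e.mbr)) - area(cover2)   # enlargement needed by group 2
        return abs(d1 - d2)
    return max(remaining, key=preference)
```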
3.5.3 Linear-Cost Version
The linear-cost version works like the quadratic algorithm but uses a simpler, linear-time PICKSEEDS; PICKNEXT simply takes any of the remaining entries.
Algorithm PICKSEEDS
(1) [Find extreme rectangles along all dimensions] Along each dimension, find the entry whose rectangle has the highest low side, and the one with the lowest high side. Record the separation.
(2) [Adjust for shape of the rectangle cluster] Normalize the separations by dividing by the width of the entire set along the corresponding dimension.
(3) [Select the most extreme pair] Choose the pair with the greatest normalized separation along any dimension.
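The linear PICKSEEDS can be sketched in one pass per dimension, again purely as an illustrative sketch under the same assumptions as before:

```python
from typing import List, Tuple

Rect = Tuple[Tuple[float, float], ...]

def linear_pick_seeds(entries: List) -> Tuple[int, int]:
    """Linear PICKSEEDS: the two entries with the greatest normalized
    separation along any single dimension."""
    dims = len(entries[0].mbr)
    best, best_sep = (0, 1), float("-inf")
    for d in range(dims):
        lows = [e.mbr[d][0] for e in entries]
        highs = [e.mbr[d][1] for e in entries]
        hi_low = max(range(len(entries)), key=lambda i: lows[i])    # highest low side
        lo_high = min(range(len(entries)), key=lambda i: highs[i])  # lowest high side
        width = (max(highs) - min(lows)) or 1.0                     # width of the whole set
        sep = (lows[hi_low] - highs[lo_high]) / width               # normalized separation
        if hi_low != lo_high and sep > best_sep:
            best, best_sep = (hi_low, lo_high), sep
    return best
```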
In the end one could say that, as far as the supported algorithms are concerned, R-trees are just as extensible and efficient for spatial data as B-trees are for one-dimensional data.
4 Performance Tests and Benchmarking
Guttman implemented the structure and ran his tests with the following page sizes and resulting maximum numbers of entries per page:
Page size (bytes)   Max. entries per page (M)
 128                  6
 256                 12
 512                 25
1024                 50
2048                102
As discussing Guttman's testing methods in detail would lead too far at this point, we will only have a look at a few diagrams resulting from these tests and point out the most important aspects they reveal. We start by looking at insert and delete operations. The two figures below show the time needed to perform insert and delete operations depending on the page size, using three different versions of the SPLITNODE algorithm. As expected, the linear implementation is the fastest. You will also notice that the CPU cost of the linear algorithm depends only to a very small extent on the page size and the parameter m, in strong contrast to the quadratic and exhaustive implementations when performing an insert operation. Concerning deletions, you can see that they, too, are greatly influenced by the chosen m.
[Figure 17: CPU cost of inserting records]
[Figure 18: CPU cost of deleting records]
[Figure 19: Search performance – pages touched]
[Figure 20: Search performance – CPU cost]
[Figure 22: CPU cost of inserts and deletes vs. amount of data]
[Figure 23: Search performance vs. amount of data – pages touched]
[Figure 24: Search performance vs. amount of data – CPU cost]
Last but not least, we take a quick look at the space efficiency of an R-tree. To do so, we plot the space required for the tree structure as a function of the amount of data indexed, which is exactly what the diagram below shows. As you can see, the results are fairly straight lines without jumps. The reason for this is that most of the space in an R-tree is consumed by its leaves. For the linear benchmark the structure used up 40 bytes per data item, compared to 33 bytes with the quadratic-cost implementation; 20 bytes thereof were consumed by the index entry itself.
5 R-tree Modifications
Although we have learned that R-trees are a very efficient and convenient way to structure and work with spatial data, there is no reason to believe they are the ultimate way to do so. Since its introduction, Guttman's original concept has been reviewed and reconsidered by quite a few computer scientists. Some of them have made only small modifications to the original R-tree, whereas others have even tried to combine it with other data structures to gain a performance advantage. Before presenting a final conclusion on the topic of R-trees, at least a few of these modified R-tree structures shall be mentioned.
As a detailed discussion of those would lead too far here, interested readers may refer to H. Samet's comments on the design and analysis of spatial data structures [7].
6 Conclusion
Summing up, one could say that, when it comes to the processing and indexing of spatial objects, the R-tree is to spatial data what the B-tree is to one-dimensional data indexes.
7 Glossary
CP, child-pointer
Reference stored in a node, pointing to another node on a lower level of
the index structure.
MBR, minimum bounding rectangle
The smallest n-dimensional rectangle that can be laid around an n-
dimensional object so that it completely covers the object along each
dimension.
Overflow
The insertion of data into a node exceeds the maximum number of entries
for nodes of the given R-tree.
TID, tuple-identifier
Reference stored in leaf nodes pointing to tuples of objects in the
database.
Underflow
The deletion of data from a node under-runs the minimum number of
entries for nodes of the given R-tree.
8 Figures
9 References
10 Index
Antonin Guttman
Cell methods
computer aided design (CAD)
minimum bounding rectangle (MBR)
Tree: AVL-tree, B-tree, k-d tree, K-D-B tree
virtual reality