R*-tree


In data processing R*-trees are a variant of R-trees used for indexing spatial information. R*-trees have slightly higher construction cost than standard R-trees, as the data may need to be reinserted; but the resulting tree will usually have a better query performance. Like the standard R-tree, it can store both point and spatial data. It was proposed by Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger in 1990.

Difference between R*-trees and R-trees

Minimization of both coverage and overlap is crucial to the performance of R-trees. Overlap means that, on data query or insertion, more than one branch of the tree needs to be expanded. A minimized coverage improves pruning performance, allowing exclusion of whole pages from search more often, in particular for negative range queries. The R*-tree attempts to reduce both, using a combination of a revised node split algorithm and the concept of forced reinsertion at node overflow. This is based on the observation that R-tree structures are highly susceptible
to the order in which their entries are inserted, so an insertion-built structure
is likely to be sub-optimal. Deletion and reinsertion of entries allows them to "find" a place in the tree
that may be more appropriate than their original location.
When a node overflows, a portion of its entries are removed from the node and reinserted into the tree. This has the
effect of producing more well-clustered groups of entries in nodes, reducing node coverage. Furthermore,
actual node splits are often postponed, causing average node occupancy to rise. Re-insertion can be seen as a method of incremental tree optimization triggered on node overflow.
The R*-tree describes three metrics by which the quality of a split can be quantified. These being overlap, defined as the intersection area of the bounding boxes of two clusters; Area-value, being the sum of the area of two cluster bounding boxes and Margin-value being the sum of the perimeters of two cluster bounding boxes.

Performance

  • Improved split heuristic produces pages that are more rectangular and thus better for many applications.
  • Reinsertion method optimizes the existing tree but increases complexity.
  • Efficiently supports point and spatial data at the same time.

Algorithm and complexity

  • The R*-tree uses the same algorithm as the regular R-tree for query and delete operations.
  • When inserting, the R*-tree uses a combined strategy. For leaf nodes, overlap is minimized, while for inner nodes, enlargement and area are minimized.
  • When splitting, the R*-tree uses a topological split that chooses a split axis based on perimeter, then minimizes overlap.
  • In addition to an improved split strategy, the R*-tree also tries to avoid splits by reinserting objects and subtrees into the tree, inspired by the concept of balancing a B-tree.
Worst case query and delete complexity are thus identical to the R-Tree. The insertion strategy to the R*-tree is with more complex than the linear split strategy of the R-tree, but less complex than the quadratic split strategy for a page size of objects and has little impact on the total complexity. The total insert complexity is still comparable to the R-tree: reinsertions affect at most one branch of the tree and thus reinsertions, comparable to performing a split on a regular R-tree. So, on overall, the complexity of the R*-tree is the same as that of a regular R-tree.
An implementation of the full algorithm must address many corner cases and tie situations not discussed here.