R*-tree
In data processing R*-trees are a variant of R-trees used for indexing spatial information. R*-trees have slightly higher construction cost than standard R-trees, as the data may need to be reinserted; but the resulting tree will usually have a better query performance. Like the standard R-tree, it can store both point and spatial data. It was proposed by Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger in 1990.
Difference between R*-trees and R-trees
Minimization of both coverage and overlap is crucial to the performance of R-trees. Overlap means that, on data query or insertion, more than one branch of the tree needs to be expanded. A minimized coverage improves pruning performance, allowing exclusion of whole pages from search more often, in particular for negative range queries. The R*-tree attempts to reduce both, using a combination of a revised node split algorithm and the concept of forced reinsertion at node overflow. This is based on the observation that R-tree structures are highly susceptibleto the order in which their entries are inserted, so an insertion-built structure
is likely to be sub-optimal. Deletion and reinsertion of entries allows them to "find" a place in the tree
that may be more appropriate than their original location.
When a node overflows, a portion of its entries are removed from the node and reinserted into the tree. This has the
effect of producing more well-clustered groups of entries in nodes, reducing node coverage. Furthermore,
actual node splits are often postponed, causing average node occupancy to rise. Re-insertion can be seen as a method of incremental tree optimization triggered on node overflow.
The R*-tree describes three metrics by which the quality of a split can be quantified. These being overlap, defined as the intersection area of the bounding boxes of two clusters; Area-value, being the sum of the area of two cluster bounding boxes and Margin-value being the sum of the perimeters of two cluster bounding boxes.
Performance
- Improved split heuristic produces pages that are more rectangular and thus better for many applications.
- Reinsertion method optimizes the existing tree but increases complexity.
- Efficiently supports point and spatial data at the same time.
Algorithm and complexity
- The R*-tree uses the same algorithm as the regular R-tree for query and delete operations.
- When inserting, the R*-tree uses a combined strategy. For leaf nodes, overlap is minimized, while for inner nodes, enlargement and area are minimized.
- When splitting, the R*-tree uses a topological split that chooses a split axis based on perimeter, then minimizes overlap.
- In addition to an improved split strategy, the R*-tree also tries to avoid splits by reinserting objects and subtrees into the tree, inspired by the concept of balancing a B-tree.
An implementation of the full algorithm must address many corner cases and tie situations not discussed here.