Tracing garbage collection


In computer programming, tracing garbage collection is a form of automatic memory management that consists of determining which objects should be deallocated by tracing which objects are reachable by a chain of references from certain "root" objects, and considering the rest as "garbage" and collecting them. Tracing is the most common type of garbage collection – so much so that "garbage collection" often refers to the tracing method, rather than others such as reference counting – and there are a large number of algorithms used in implementation.

Reachability of an object

Informally, an object is reachable if it is referenced by at least one variable in the program, either directly or through references from other reachable objects. More precisely, objects can be reachable in only two ways:
  1. A distinguished set of roots: objects that are assumed to be reachable. Typically, these include all the objects referenced from anywhere in the call stack, and any global variables.
  2. Anything referenced from a reachable object is itself reachable; more formally, reachability is a transitive closure.
The reachability definition of "garbage" is not optimal, insofar as the last time a program uses an object could be long before that object falls out of the environment scope. A distinction is sometimes drawn between syntactic garbage, those objects the program cannot possibly reach, and semantic garbage, those objects the program will in fact never again use. For example:

Object x = new Foo;
Object y = new Bar;
x = new Quux;
/* At this point, we know that the Foo object
* originally assigned to x will never be
* accessed: it is syntactic garbage.
*/
/* In the following block, y *could* be semantic garbage;
* but we won't know until x.check_something returns
* some value -- if it returns at all.
*/
if )
System.exit;

The problem of precisely identifying semantic garbage can easily be shown to be partially decidable: a program that allocates an object, runs an arbitrary input program, and uses if and only if finishes would require a semantic garbage collector to solve the halting problem. Although conservative heuristic methods for semantic garbage detection remain an active research area, essentially all practical garbage collectors focus on syntactic garbage.
Another complication with this approach is that, in languages with both reference types and unboxed value types, the garbage collector needs to somehow be able to distinguish which variables on the stack or fields in an object are regular values and which are references: in memory, an integer and a reference might look alike. The garbage collector then needs to know whether to treat the element as a reference and follow it, or whether it is a primitive value. One common solution is the use of tagged pointers.

Strong and weak references

The garbage collector can reclaim only objects that have no references pointing to them either directly or indirectly from the root set. However, some programs require weak references, which should be usable for as long as the object exists but should not prolong its lifetime. In discussions about weak references, ordinary references are sometimes called strong references. An object is eligible for garbage collection if there are no strong references to it, even though there still might be some weak references to it.
A weak reference is not merely just any pointer to the object that a garbage collector does not care about. The term is usually reserved for a properly managed category of special reference objects which are safe to use even after the object disappears because they lapse to a safe value. An unsafe reference that is not known to the garbage collector will simply remain dangling by continuing to refer to the address where the object previously resided. This is not a weak reference.
In some implementations, weak references are divided into subcategories. For example, the Java Virtual Machine provides three forms of weak references, namely soft references, phantom references, and regular weak references. A softly referenced object is only eligible for reclamation if the garbage collector decides that the program is low on memory. Unlike a soft reference or a regular weak reference, a phantom reference does not provide access to the object that it references. Instead, a phantom reference is a mechanism that allows the garbage collector to notify the program when the referenced object has become phantom reachable. An object is phantom reachable, if it still resides in memory and it is referenced by a phantom reference, but its finalizer has already executed. Similarly, Microsoft.NET provides two subcategories of weak references, namely long weak references and short weak references.

Weak collections

s can also be devised which have weak tracking features. For instance, weak hash tables are useful. Like a regular hash table, a weak hash table maintains an association between pairs of objects, where each pair is understood to be a key and value. However, the hash table does not actually maintain a strong reference on these objects. Special behavior takes place when either the key or value or both become garbage: the hash table entry is spontaneously deleted. There exist further refinements such as hash tables which have only weak keys or only weak values.
Weak hash tables are important for maintaining associations between objects, such that the objects engaged in the association can still become garbage if nothing in the program refers to them any longer.
The use of a regular hash table for such a purpose could lead to a "logical memory leak": the accumulation of reachable data which the program does not need and will not use.

Basic algorithm

Tracing collectors are so called because they trace through the working set of memory. These garbage collectors perform collection in cycles. It is common for cycles to be triggered when there is not enough free memory for the memory manager to satisfy an allocation request. But cycles can often be requested by the mutator directly or run on a time schedule. The original method involves a naïve mark-and-sweep in which the entire memory set is touched several times.

Naïve mark-and-sweep

In the naive mark-and-sweep method, each object in memory has a flag reserved for garbage collection use only. This flag is always cleared, except during the collection cycle.
The first stage is the mark stage which does a tree traversal of the entire 'root set' and marks each object that is pointed to by a root as being 'in-use'. All objects that those objects point to, and so on, are marked as well, so that every object that is reachable via the root set is marked.
In the second stage, the sweep stage, all memory is scanned from start to finish, examining all free or used blocks; those not marked as being 'in-use' are not reachable by any roots, and their memory is freed. For objects which were marked in-use, the in-use flag is cleared, preparing for the next cycle.
This method has several disadvantages, the most notable being that the entire system must be suspended during collection; no mutation of the working set can be allowed. This can cause programs to 'freeze' periodically, making some real-time and time-critical applications impossible. In addition, the entire working memory must be examined, much of it twice, potentially causing problems in paged memory systems.

Tri-color marking

Because of these performance problems, most modern tracing garbage collectors implement some variant of the tri-color marking abstraction, but simple collectors often do not make this abstraction explicit. Tri-color marking works as described below.
Three sets are created white, black and grey:
  • The white set, or condemned set, is the set of objects that are candidates for having their memory recycled.
  • The black set is the set of objects that can be shown to have no outgoing references to objects in the white set, and to be reachable from the roots. Objects in the black set are not candidates for collection.
  • The grey set contains all objects reachable from the roots but yet to be scanned for references to "white" objects. Since they are known to be reachable from the roots, they cannot be garbage-collected and will end up in the black set after being scanned.
In many algorithms, initially the black set starts as empty, the grey set is the set of objects which are directly referenced from roots and the white set includes all other objects. Every object in memory is at all times in exactly one of the three sets. The algorithm proceeds as following:
  1. Pick an object o from the grey set
  2. Move each white object that o references to the grey set. This ensures that neither this object nor any object it references can be garbage-collected.
  3. Move o to the black set
  4. Repeat the last three steps until the grey set is empty.
When the grey set is empty, the scan is complete; the black objects are reachable from the roots, while the white objects are not and can be garbage-collected.
Since all objects not immediately reachable from the roots are added to the white set, and objects can only move from white to grey and from grey to black, the algorithm preserves an important invariant no black objects reference white objects. This ensures that the white objects can be freed once the grey set is empty. This is called the tri-color invariant. Some variations on the algorithm do not preserve this invariant but use a modified form for which all the important properties hold.
The tri-color method has an important advantage it can be performed "on-the-fly", without halting the system for significant periods of time. This is accomplished by marking objects as they are allocated and during mutation, maintaining the various sets. By monitoring the size of the sets, the system can perform garbage collection periodically, rather than as needed. Also, the need to touch the entire working set on each cycle is avoided.