Persistent data structure

In computing, a persistent data structure or not ephemeral data structure is a data structure that always preserves the previous version of itself when it is modified. Such data structures are effectively immutable, as their operations do not update the structure in-place, but instead always yield a new updated structure. The term was introduced in Driscoll, Sarnak, Sleator, and Tarjan's 1986 article.
A data structure is partially persistent if all versions can be accessed but only the newest version can be modified. The data structure is fully persistent if every version can be both accessed and modified. If there is also a meld or merge operation that can create a new version from two previous versions, the data structure is called confluently persistent. Structures that are not persistent are called ephemeral.
These types of data structures are particularly common in logical and functional programming, as languages in those paradigms discourage the use of mutable data.

Partial versus full persistence

In the partial persistence model, a programmer may query any previous version of a data structure, but may only update the latest version. This implies a linear ordering among each version of the data structure. In the fully persistent model, both updates and queries are allowed on any version of the data structure. In some cases the performance characteristics of querying or updating older versions of a data structure may be allowed to degrade, as is true with the rope data structure. In addition, a data structure can be referred to as confluently persistent if, in addition to being fully persistent, two versions of the same data structure can be combined to form a new version which is still fully persistent.

Techniques for preserving previous versions

Copy-on-write

One method for creating a persistent data structure is to use a platform provided ephemeral data structure such as an array to store the data in the data structure and copy the entirety of that data structure. This is an inefficient technique because the entire backing data structure must be copied for each write, leading to worst case performance characteristics for m modifications of an array of size n.
Copy-on-write memory management can reduce the price for an update from to, where B is the memory block size and u the number of pages updated in an operation.

Fat node

The fat node method is to record all changes made to node fields in the nodes themselves, without erasing old values of the fields. This requires that nodes be allowed to become arbitrarily “fat”. In other words, each fat node contains the same information and pointer fields as an ephemeral node, along with space for an arbitrary number of extra field values. Each extra field value has an associated field name and a version stamp which indicates the version in which the named field was changed to have the specified value. Besides, each fat node has its own version stamp, indicating the version in which the node was created. The only purpose of nodes having version stamps is to make sure that each node only contains one value per field name per version. In order to navigate through the structure, each original field value in a node has a version stamp of zero.

Complexity of fat node

With using fat node method, it requires O space for every modification: just store the new data. Each modification takes O additional time to store the modification at the end of the modification history. This is an amortized time bound, assuming modification history is stored in a growable array. At access time, the right version at each node must be found as the structure is traversed. If m modifications were to be made, then each access operation would have slowdown resulting from the cost of finding the nearest modification in the array. Alternatively, one can employ the van Emde Boas tree at each node to reduce the time for an access to at the cost of increasing update time to. If only partial persistence is required, the time for an update can be kept at its original order of magnitude, modulo randomization and amortization.

Path copying

This method assumes that the data structure is a linked graph of nodes.
On update, a copy is made of all nodes on the path to any node which is about to be modified. These changes must then be cascaded back through the data structure: all nodes that pointed to the old node must be modified to point to the new node instead. These modifications cause more cascading changes, and so on, until the root node is reached.

Complexity of path copying

With m modifications, this costs O additive lookup time. Modification time and space are bounded by the maximal number of ancestors for any node in the data structure times the cost of the update in the ephemeral data structure. In a Balanced Binary Search Tree without parent pointers the worst case modification time complexity is O. However, in a linked list the worst case modification time complexity is O.

A combination

Driscoll, Sarnak, Sleator, Tarjan came up with a way to combine the techniques of fat nodes and path copying, achieving O access slowdown and O amortized overhead in space and time per modification. Their method assumes a linked data structure with at most d incoming pointers to each node, where d is a known constant.
In each node, one modification box is stored. This box can hold one modification to the node—either a modification to one of the pointers, or to the node's key, or to some other piece of node-specific data—and a timestamp for when that modification was applied. Initially, every node's modification box is empty.
Whenever a node is accessed, the modification box is checked, and its timestamp is compared against the access time. If the modification box is empty, or the access time is before the modification time, then the modification box is ignored and only the normal part of the node is considered. On the other hand, if the access time is after the modification time, then the value in the modification box is used, overriding that value in the node.
Modifying a node works like this. If the node's modification box is empty, then it is filled with the modification. Otherwise, the modification box is full. A copy of the node is made, but using only the latest values. The modification is performed directly on the new node, without using the modification box. Finally, this change is cascaded to the node's parent, just like path copying.
With this algorithm, given any time t, at most one modification box exists in the data structure with time t. Thus, a modification at time t splits the tree into three parts: one part contains the data from before time t, one part contains the data from after time t, and one part was unaffected by the modification.

Complexity of the combination

Time and space for modifications require amortized analysis. A modification takes O amortized space, and O amortized time. To see why, use a potential function ϕ, where ϕ is the number of full live nodes in T. The live nodes of T are just the nodes that are reachable from the current root at the current time. The full live nodes are the live nodes whose modification boxes are full.
Each modification involves some number of copies, say k, followed by 1 change to a modification box. Consider each of the k copies. Each costs O space and time, but decreases the potential function by one. The final step fills a modification box, which costs O time and increases ϕ by one.
Putting it all together, the change in ϕ is Δϕ =1 − k. Thus, the algorithm takes O= O space and O = O time

Generalized form of persistence

Path copying is one of the simple methods to achieve persistency in a certain data structure such as binary search trees. It is nice to have a general strategy for implementing persistence that works with any given data structure. In order to achieve that, we consider a directed graph. We assume that each vertex in has a constant number of outgoing edges that are represented by pointers. Each vertex has a label representing the data. We consider that a vertex has a bounded number of edges leading into it which we define as inedges. We allow the following different operations on.

CREATE-NODE: Creates a new vertex with no incoming or outgoing edges.
CHANGE-EDGE: Changes the edge of to point to
CHANGE-LABEL: Changes the value of the data stored at to

Any of the above operations is performed at a specific time and the purpose of the persistent graph representation is to be able to access any version of at any given time. For this purpose we define a table for each vertex in. The table contains columns and rows. Each row contains in addition to the pointers for the outgoing edges, a label which represents the data at the vertex and a time at which the operation was performed. In addition to that there is an array inedges that keeps track of all the incoming edges to. When a table is full, a new table with rows can be created. The old table becomes inactive and the new table becomes the active table.

CREATE-NODE

A call to CREATE-NODE creates a new table and set all the references to null

CHANGE-EDGE

If we assume that CHANGE-EDGE is called, then there are two cases to consider.

There is an empty row in the table of the vertex : In this case we copy the last row in the table and we change the edge of vertex to point to the new vertex
Table of the vertex is full: In this case we need to create a new table. We copy the last row of the old table into the new table. We need to loop in the array inedges in order to let each vertex in the array point to the new table created. In addition to that, we need to change the entry in the inedges for every vertex such that edge exists in the graph.