Tagged Deterministic Finite Automaton

In the automata theory, a tagged deterministic finite automaton is an extension of deterministic finite automaton. In addition to solving the recognition problem for regular languages, TDFA is also capable of submatch extraction and parsing. While canonical DFA can find out if a string belongs to the language defined by a regular expression, TDFA can also extract substrings that match specific subexpressions. More generally, TDFA can identify positions in the input string that match tagged positions in a regular expression.

History

TDFA were first described by Ville Laurikari in 2000.
Prior to that it was unknown whether it is possible to perform submatch extraction in one pass on a deterministic finite-state automaton,
so this paper was an important advancement.
Laurikari described TDFA construction and gave a proof that the determinization process terminates,
however the algorithm did not handle disambiguation correctly.
In 2007 Chris Kuklewicz implemented TDFA in a Haskell library Regex-TDFA with POSIX longest-match semantics.
Kuklewicz gave an informal description of the algorithm
and answered the principal question whether TDFA are capable of POSIX longest-match disambiguation,
which was doubted by other researchers.
In 2017 Ulya Trafimovich described TDFA with one-symbol lookahead.
The use of a lookahead symbol reduces the number of registers and register operations in a TDFA,
which makes it faster and often smaller than Laurikari TDFA.
Trafimovich called TDFA variants with and without lookahead TDFA and TDFA by analogy with LR parsers LR and LR.
The algorithm was implemented in the open-source lexer generator RE2C.
Trafimovich formalized Kuklewicz disambiguation algorithm.
In 2018 Angelo Borsotti worked on an experimental Java implementation of TDFA;
it was published later in 2021.
In 2019 Borsotti and Trafimovich adapted POSIX disambiguation algorithm by Okui and Suzuki to TDFA.
They gave a formal proof of correctness of the new algorithm
and showed that it is faster than Kuklewicz algorithm in practice.
In 2020 Trafimovich published an article about TDFA implementation in RE2C.
In 2022 Borsotti and Trafimovich published a paper with a detailed description of TDFA construction.
The paper incorporated their past research and presented multi-pass TDFA that are better suited to just-in-time determinization.
They also compared TDFA against other algorithms and provided benchmarks.

Formal definition

TDFA have the same basic structure as ordinary DFA: a finite set of states linked by transitions. In addition to that, TDFA have a fixed set of registers that hold tag values, and register operations on transitions that set or copy register values.
The values may be scalar offsets, or offset lists for tags that match repeatedly. There is no one-to-one mapping between tags in a regular expression and registers in a TDFA: a single tag may need many registers, and the same register may hold values of different tags.
The following definition is according to Trafimovich and Borsotti. The original definition by Laurikari is slightly different.
A tagged deterministic finite automaton is a tuple
, where:

is a finite set of symbols
is a finite set of tags
is a finite set of states with initial state and a subset of final states
is a finite set of registers with a subset of final registers
is a transition function
is a final function, where is a set of register operations of the following types:
* set register to nil or to the current position:, where
* copy register to register :
* copy register to register and append history:, where is a string over

Example

Figure 0 shows an example TDFA for regular expression
with alphabet and a set of tags
that matches strings of the form with at least one symbol.
TDFA has four states three of which are final.
The set of registers is
with a subset of final registers
where register corresponds to -th tag.
Transitions have operations defined by the function,
and final states have operations defined by the function.
For example, to match string,
one starts in state 0,
matches the first and moves to state 1,
matches the second and loops to state 1,
matches and moves to state 2,
executes the final operations in state 2
and finally exits TDFA.

Complexity

Canonical DFA solve the recognition problem in linear time.
The same holds for TDFA, since the number of registers and register operations is fixed and depends only on the regular expression, but not on the length of input.
The overhead on submatch extraction depends on tag density in a regular expression
and nondeterminism degree of each tag.
On one extreme, if there are no tags, a TDFA is identical to a canonical DFA.
On the other extreme, if every subexpression is tagged, a TDFA effectively performs full parsing and has many operations on every transition.
In practice for real-world regular expressions with a few submatch groups the overhead is negligible compared to matching with canonical DFA.

TDFA construction

TDFA construction is performed in a few steps.
First, a regular expression is converted to a tagged nondeterministic finite automaton.
Second, a TNFA is converted to a TDFA using a determinization procedure;
this step also includes disambiguation that resolves conflicts between ambiguous TNFA paths.
After that, a TDFA can optionally go through a number of optimizations that reduce the number of registers and operations,
including minimization that reduces the number of states.
Algorithms for all steps of TDFA construction with pseudocode are given in the paper by Borsotti and Trafimovich.
This section explains TDFA construction on the example of a regular expression,
where is a tag and are alphabet symbols.

Tagged NFA

TNFA is a nondeterministic finite automaton with tagged ε-transitions.
It was first described by Laurikari,
although similar constructions were known much earlier as Mealy machines and nondeterministic finite-state transducers.
TNFA construction is very similar to Thompson's construction:
it mirrors the structure of a regular expression.
Importantly, TNFA preserves ambiguity in a regular expression:
if it is possible to match a string in two different ways,
then TNFA for this regular expression has two different accepting paths for this string.
TNFA definition by Borsotti and Trafimovich differs from the original one by Laurikari
in that TNFA can have negative tags on transitions:
they are needed to make the absence of match explicit in cases when there is a bypass for a tagged transition.
Figure 1 shows TNFA for the example regular expression.
It has three kinds of transitions:
transitions on alphabet symbols,
tagged ε-transitions, and
untagged ε-transitions.
TNFA has a single start state 0 and a single final state 11.
In order to understand TNFA determinization, it helps to understand TNFA simulation first.
Recall the canonical ε-NFA simulation: it constructs a subset of active states as an ε-closure of the start state,
and then in a loop repeatedly steps on the next input symbol and constructs ε-closure of the active state set.
Eventually the loop terminates: either the active set becomes empty,
or all input symbols get consumed.
TNFA simulation is similar, but it additionally tracks tag values.
Every time simulation encounters a tagged transition, it updates tag value to the current offset.
Since the simulation tracks multiple nondeterministic paths simultaneously, tag values along these paths may differ and should be tracked separately.
Another difficulty is the need for disambiguation:
unlike canonical NFA simulation, TNFA simulation needs to distinguish ambiguous paths,
as they may have different tag values.
Figure 2 shows an example of TNFA simulation on a string :

Disambiguation

Ambiguity means the existence of multiple different parse trees for the same input.
It is a property of a regular expression;
ambiguity is preserved by TNFA construction
and gets resolved during TNFA simulation or determinization.
One way to resolve ambiguity is use a disambiguation policy, the most notable examples being the leftmost-greedy
and the longest-match policies.
The leftmost-greedy policy is defined in terms of regular expression structure;
it admits a simple and efficient implementation.
The POSIX policy, on the other hand, is defined in terms of the structure of parse results;
it is more difficult to implement and computationally more complex than the leftmost-greedy policy.
TDFA can work with both policies, and there is no runtime overhead
on disambiguation, since it happens during determinization and gets built into TDFA structure.
For TNFA simulation, on the other hand, the time spent on disambiguation is included in the runtime.
The example regular expression is deliberately ambiguous, as
it allows one to parse in two different ways:
either as the left alternative, or the right one.
Depending on which alternative is preferred, tag should either have value
,
or .
Both POSIX and leftmost greedy disambiguation policies agree that the first alternative is preferable in this case.

Determinization

TNFA determinization is based on the canonical powerset construction algorithm that converts an NFA to a DFA.
The algorithm simulates NFA on all possible strings. At each step of the simulation, the active set of NFA states forms a new DFA state. If the new state is identical to an existing DFA state, it is discarded and replaced with the existing one, and the current branch of simulation terminates. Otherwise the new state is added to the growing set of DFA states and simulation from this state continues. Eventually determinization terminates: although the set of all possible strings is infinite, the set of different DFA states is finite, and at some point all new states become identical to existing ones.
In the case of TDFA naive powerset construction faces a problem:
TDFA states contain tag information, which changes at every step of the simulation.
This prevents TDFA states from mapping: a pair of states that contain identical TNFA states but different tag offsets are not identical and cannot be mapped.
As a result, simulation continues indefinitely, the set of TDFA states grows, and determinization does not terminate.
To solve this problem, Laurikari applied the idea of indirection:
instead of storing immediate tag values in TDFA states, he suggested storing them indirectly in registers.
Tag values in registers may be different, but it doesn't matter as long as the registers are identical.
This solves the termination problem:
even if TDFA states have different registers, they can be made identical
by adding operations that copy the corresponding register values.
Indirection is not free: it requires adding runtime register operations on transitions that update register values.
To reduce the runtime overhead, Trafimovich used the lookahead optimization.
The idea is to move register operations from the incoming transition into a TDFA state
to the outgoing transitions from this state.
This way the operations get split on the lookahead symbol,
which reduces the overlap between register lifetimes and results in a faster TDFA.
To use this optimization, it is necessary to track lookahead tags in each TDFA state under construction
and take them into account when mapping TDFA states.
Figure 3 shows the determinization process for the running example.
Figure 4 shows the resulting TDFA.

Optimizations

The goal of optimizations is to reduce TDFA size and the number of registers and operations on transitions.
This section describes a few optimizations that are used in a practical TDFA implementation.
None of these optimizations are particularly complex or vital for TDFA operation, but when applied
together and in the correct order they can make TDFA considerably faster and smaller.
Fixed-tags optimization is applied at the regular expression level.
The idea is, if a pair of tags happens to be within fixed distance from each other, there
is no need to track both of them: the value of one tag can be computed from the value of the other tag one by adding a fixed offset.
For example, in the regular expression
tags and are within one symbol distance from each other,
so can be computed as.
This optimization can reduce both TDFA construction time
and matching time.
Register optimizations are applied after TDFA construction.
A TDFA induces a control flow graph on registers:
operations on transitions form basic blocks,
and there is an arc between two blocks if one of them is reachable from the other one by TDFA transitions, without passing through other register operations.
CFG represents a program on registers as variables, so the usual compiler optimizations can be applied to it
.
TDFA minimization is very similar to DFA minimization, except for one additional restriction:
register actions on TDFA transitions must be taken into account.
So, TDFA states that are identical, but have different register actions on incoming transitions on the same symbol, cannot be merged.
All the usual algorithms for DFA minimization can be adapted to TDFA minimization,
for example Moore's algorithm is used in the RE2C lexer generator.
Figure 5 shows an optimized TDFA for regular expression.
Note that it is in fact the same as a TDFA for a semantically equivalent regular expression,
where ambiguity has been removed.
Ambiguity is resolved during determinization, and the unoptimized TDFA on Figure 4 is unambiguous,
but it has some built-in redundancy that can be removed by optimizations.

Multi-pass TDFA

TDFA with registers are well suited for ahead-of-time determinization, when the time spent on TDFA construction is not included in the runtime.
But for just-in-time determinization it is desirable to reduce TDFA construction time.
Another concern is tag density in a regular expression:
TDFA with registers are efficient if the number of tags is relatively small.
But for heavily tagged regular expressions TDFA with registers are suboptimal: transitions get cluttered with operations, making TDFA execution slow.
Register optimizations also become problematic due to the size of liveness and interference information.
Multi-pass TDFA address both issues: they reduce TDFA construction time,
and they are better suited to dense submatch extraction.
The main difference with canonical TDFA is that multi-pass TDFA have no register operations.
Instead they have multiple passes: a forward pass that matches the input
string and records a sequence of TDFA states, and one or more backward passes that iterate through the recorded
states and collect submatch information.
A single backward pass is sufficient, but an extra pass may be used e.g. to estimate the necessary amount of memory for submatch results.
In order to trace back the matching TNFA path from a sequence of TDFA states,
multi-pass TDFA use backlinks on transitions and in final states.
A backlink is a pair, where the first component is an index in backlink arrays on preceding transitions,
and the second component is a sequence of tags for a fragment of TNFA path between TDFA states.
Figure 6 shows an example of a multi-pass TDFA for regular expression matching on a string
:

The forward pass collects a sequence of states 0, 1, 1, 2.
The backward pass traces backlinks from state 2 to state 0 and concatenates tag sequences and input symbols. It starts with backlink. Tagged string is. Index 0 selects backlink on preceding transition from state 1 to state 2. Tagged string becomes. Index 0 selects backlink on preceding looping transition from state 1 to state 1. Tagged string becomes. Index 0 selects backlink on preceding transition from state 0 to state 1. The resulting tagged string is. It contains all the necessary information to reconstruct last offsets or offset lists for each tag.

Related automata

StaDFA described by Mohammad Imran Chowdhury
are very similar to TDFA, except that they have register operations in states, not on transitions.
DSSTs described by Grathwohl
are more distant relatives to TDFA, better suited to full parsing than submatch extraction.
DSST states contain path trees constructed by the ε-closure,
while TDFA states contain similar information in a decomposed form.