Two-way string-matching algorithm
In computer science, the two-way string-matching algorithm is a string-searching algorithm, discovered by Maxime Crochemore and Dominique Perrin in 1991. It takes a pattern of size m, called a “needle”, and preprocesses it in linear time O(m), producing information that can then be used to search for the needle in any “haystack” string in linear time O(n), where n is the haystack's length.
The two-way algorithm can be viewed as a combination of the forward-going Knuth–Morris–Pratt algorithm (KMP) and the backward-running Boyer–Moore string-search algorithm (BM).
Like those two, the two-way algorithm preprocesses the pattern to find partially repeating periods and computes “shifts” based on them, indicating what offset to “jump” to in the haystack when a given character is encountered.
Unlike BM and KMP, it uses only O(log m) additional space to store information about those partial repeats: the search pattern is split into two parts, represented only by the position of that split. Being a number less than m, it can be represented in ⌈log₂ m⌉ bits. This is sometimes treated as "close enough to O(1) in practice", as the needle's size is limited by the size of addressable memory; the overhead is a number that can be stored in a single register, and treating it as O(1) is like treating the size of a loop counter as O(1) rather than as the logarithm of the number of iterations.
The actual matching operation performs at most 2n − m comparisons.
Breslauer later published two improved variants performing fewer comparisons, at the cost of storing additional data about the preprocessed needle:
- The first one performs at most n + ⌊(n − m)/2⌋ comparisons, ⌈(n − m)/2⌉ fewer than the original. It must however store ⌈log_φ m⌉ additional offsets in the needle, where φ is the golden ratio, using O(log² m) space.
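- The second adapts it to only store a constant number of such offsets, denoted c, but must perform n + ⌊(1/2 + ε)(n − m)⌋ comparisons, with ε = (1/2)(F_{c+2} − 1)^(−1) = O(φ^(−c)) going to zero exponentially quickly as c increases. Here F_k denotes the k-th Fibonacci number and φ the golden ratio.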
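The gap between the original bound and the first variant's bound can be checked with a few lines of arithmetic. The following Python sketch is only an illustration: the function names and the sample lengths are hypothetical, and the bounds are taken as 2n − m and n + ⌊(n − m)/2⌋.

```python
def two_way_bound(n, m):
    """Upper bound on comparisons for the original algorithm: 2n - m."""
    return 2 * n - m


def breslauer_first_bound(n, m):
    """Upper bound for the first variant: n + floor((n - m) / 2)."""
    return n + (n - m) // 2


# Hypothetical sizes: a 1000-character haystack and a 16-character needle.
n, m = 1000, 16
saved = two_way_bound(n, m) - breslauer_first_bound(n, m)
print(two_way_bound(n, m), breslauer_first_bound(n, m), saved)  # 1984 1492 492
```

The saving of 492 equals ⌈(1000 − 16)/2⌉, as expected from the two formulas.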
Critical factorization
Before defining a critical factorization, we need the following notions:
- A factorization is a partition (u, v) of a string x. For example, ("Wiki", "pedia") is a factorization of "Wikipedia".
- A period of a string x is an integer p such that all characters p-distance apart are equal. More precisely, with positions numbered from 1, x[i] = x[i + p] holds for any integer 0 < i ≤ len(x) − p. This definition is allowed to be vacuously true, so that any word of length n has a period of n. To illustrate, the 8-letter word "educated" has period 6 in addition to the trivial periods of 8 and above. The minimum period of x is denoted as p(x).
- A repetition w in (u, v) is a non-empty string such that:
  - w is a suffix of u or u is a suffix of w;
  - w is a prefix of v or v is a prefix of w.
  In other words, w occurs on both sides of the cut with a possible overflow on either side. Examples include "an" for ("ban", "ana") and "voca" for ("a", "vocado"). Each factorization trivially has at least one repetition: the string vu.
- A local period is the length of a repetition in (u, v). The smallest local period in (u, v) is denoted as r(u, v). Because the trivial repetition vu is guaranteed to exist and has the same length as x, we see that 1 ≤ r(u, v) ≤ len(x).
A critical factorization is then a factorization (u, v) of x such that r(u, v) = p(x), that is, one whose smallest local period equals the minimum period of the whole string.
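The definitions above can be explored by brute force on short strings. The sketch below is a quadratic-or-worse illustration in Python, not part of the two-way algorithm itself. The helper names (periods, local_period, critical_cuts) are hypothetical, indices are 0-based, a cut is identified with len(u), and the input is assumed to be non-empty.

```python
def periods(x):
    """All periods p in 1..len(x): x[i] == x[i + p] wherever both indices exist."""
    return [p for p in range(1, len(x) + 1)
            if all(x[i] == x[i + p] for i in range(len(x) - p))]


def local_period(u, v):
    """Smallest length of a repetition in the factorization (u, v)."""
    x, cut = u + v, len(u)
    for length in range(1, len(x) + 1):
        # A repetition of this length exists iff the characters that fall
        # inside x on both sides of the cut agree at distance `length`.
        lo, hi = max(0, cut - length), min(cut, len(x) - length)
        if all(x[i] == x[i + length] for i in range(lo, hi)):
            return length
    return len(x)  # only reached for the empty string


def critical_cuts(x):
    """Cut positions whose smallest local period equals the minimum period."""
    p = min(periods(x))
    return [cut for cut in range(len(x))
            if local_period(x[:cut], x[cut:]) == p]


print(min(periods("educated")))     # 6
print(local_period("ban", "ana"))   # 2, from the repetition "an"
print(local_period("a", "vocado"))  # 4, from the repetition "voca"
print(critical_cuts("banana"))      # [1, 2]
```

For "banana" the minimum period is 6, and only the cuts ("b", "anana") and ("ba", "nana") have a smallest local period of 6.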
The algorithm
The algorithm starts by computing a critical factorization of the needle as the preprocessing step. This step produces the index of the periodic right half and the period of this stretch. The suffix computation here follows the authors' formulation. It can alternatively be computed using Duval's algorithm, which is simpler and still linear time but slower in practice.
// Three-way comparison. Negating the result inverts the order.
function cmp(a, b)
    if a > b return 1
    if a = b return 0
    if a < b return -1
function maxsuf(needle, rev)
    length ← len(needle)
    cur_period ← 1         // currently known period
    period_test_idx ← 1    // index for period testing, 0 < period_test_idx <= cur_period
    maxsuf_test_idx ← 0    // index for maxsuf testing, always greater than maxsuf_idx
    maxsuf_idx ← -1        // the proposed starting index of maxsuf, minus one
    while maxsuf_test_idx + period_test_idx < length
        cmp_val ← cmp(needle[maxsuf_test_idx + period_test_idx], needle[maxsuf_idx + period_test_idx])
        if rev
            cmp_val ← -cmp_val    // invert the comparison
        if cmp_val < 0
            // Suffix is smaller. Period is the entire prefix so far.
            maxsuf_test_idx ← maxsuf_test_idx + period_test_idx
            period_test_idx ← 1
            cur_period ← maxsuf_test_idx - maxsuf_idx
        else if cmp_val = 0
            // They are the same - we should go on.
            if period_test_idx = cur_period
                // We are done checking this stretch of cur_period. Reset period_test_idx.
                maxsuf_test_idx ← maxsuf_test_idx + cur_period
                period_test_idx ← 1
            else
                period_test_idx ← period_test_idx + 1
        else
            // Suffix is larger. Start over from here.
            maxsuf_idx ← maxsuf_test_idx
            maxsuf_test_idx ← maxsuf_test_idx + 1
            cur_period ← 1
            period_test_idx ← 1
    return [maxsuf_idx, cur_period]
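function crit_fact(needle)
    [idx1, per1] ← maxsuf(needle, false)
    [idx2, per2] ← maxsuf(needle, true)
    // Keep the later split. The right half begins at index idx + 1.
    if idx1 > idx2
        return [idx1, per1]
    else
        return [idx2, per2]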
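To make concrete what the preprocessing returns, the following Python sketch computes the same pair by brute force. It is a quadratic reference written for illustration, not the linear-time method described above. The names maxsuf_reference, crit_fact_reference and min_period are hypothetical, and the needle is assumed to be non-empty.

```python
def min_period(s):
    """Smallest p >= 1 with s[i] == s[i + p] wherever both indices exist."""
    return next(p for p in range(1, len(s) + 1)
                if all(s[i] == s[i + p] for i in range(len(s) - p)))


def maxsuf_reference(needle, rev):
    """Return (start - 1, period) for the lexicographically maximal suffix.

    With rev=True the character order is reversed, while a proper prefix
    still counts as smaller than the longer string it starts.
    """
    sign = -1 if rev else 1

    def key(start):
        return [sign * ord(c) for c in needle[start:]]

    start = max(range(len(needle)), key=key)
    return start - 1, min_period(needle[start:])


def crit_fact_reference(needle):
    """Pick the maximal suffix that starts later, as in the pseudocode."""
    idx1, per1 = maxsuf_reference(needle, False)
    idx2, per2 = maxsuf_reference(needle, True)
    return (idx1, per1) if idx1 > idx2 else (idx2, per2)


print(maxsuf_reference("banana", False))  # (1, 2): suffix "nana", period 2
print(maxsuf_reference("banana", True))   # (0, 2): suffix "anana", period 2
print(crit_fact_reference("banana"))      # (1, 2): split is ("ba", "nana")
```

The returned index is the last position of the left half, so an index of −1 means that the left half is empty.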
The search first matches the right-hand side of the needle against the haystack and then, only if that succeeds, the left-hand side. Skipping ahead by the period keeps the total running time linear.
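function match(needle, haystack)
    needle_len ← len(needle)
    haystack_len ← len(haystack)
    [crit_idx, period] ← crit_fact(needle)
    Matches ← {}    // set of matches
    // Check whether the left half recurs one period later, i.e. whether period is a period of the whole needle.
    // Use a library function like memcmp, or write your own loop.
    if needle[0] ... needle[crit_idx] = needle[period] ... needle[period + crit_idx]
        pos ← 0
        s ← 0
    // The search loop, including the period-based skip, is left unfinished here.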
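A complete search can be sketched in Python as follows. This is an illustrative sketch, not the authors' text. It assumes the skip scheme used by common C library implementations of this algorithm: shift past the mismatch in the right half, otherwise shift by the period and remember the already-matched prefix when the needle is periodic. The name two_way_find_all and the variables split, mem and mem_reset are hypothetical.

```python
def maxsuf(needle, rev):
    """Linear-time maximal suffix: returns (start index - 1, period)."""
    i, j, k, p = -1, 0, 1, 1
    while j + k < len(needle):
        a, b = needle[j + k], needle[i + k]
        c = (a > b) - (a < b)
        if rev:
            c = -c
        if c < 0:          # candidate suffix is smaller
            j += k
            k = 1
            p = j - i
        elif c == 0:       # equal so far, keep extending
            if k == p:
                j += p
                k = 1
            else:
                k += 1
        else:              # candidate suffix is larger, restart from it
            i = j
            j += 1
            k = p = 1
    return i, p


def crit_fact(needle):
    idx1, per1 = maxsuf(needle, False)
    idx2, per2 = maxsuf(needle, True)
    return (idx1, per1) if idx1 > idx2 else (idx2, per2)


def two_way_find_all(needle, haystack):
    """Return all start positions of needle in haystack (overlaps included)."""
    n_len, h_len = len(needle), len(haystack)
    if n_len == 0 or n_len > h_len:
        return []
    crit_idx, period = crit_fact(needle)
    split = crit_idx + 1                      # first index of the right half
    if needle[:split] == needle[period:period + split]:
        mem_reset = n_len - period            # periodic needle: keep a memory
    else:
        period = max(split, n_len - split) + 1
        mem_reset = 0
    matches, pos, mem = [], 0, 0
    while pos + n_len <= h_len:
        k = max(split, mem)                   # right half, left to right
        while k < n_len and needle[k] == haystack[pos + k]:
            k += 1
        if k < n_len:                         # mismatch in the right half
            pos += k - split + 1
            mem = 0
            continue
        k = split                             # left half, right to left
        while k > mem and needle[k - 1] == haystack[pos + k - 1]:
            k -= 1
        if k <= mem:
            matches.append(pos)
        pos += period
        mem = mem_reset
    return matches


print(two_way_find_all("banana", "xbananabanana"))  # [1, 7]
print(two_way_find_all("aa", "aaaa"))               # [0, 1, 2]
```

In this sketch an empty needle yields no matches, which is a simplifying choice rather than something the algorithm prescribes.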