Matching wildcards
In computer science, an algorithm for matching wildcards is useful in comparing text strings that may contain wildcard syntax. Common uses of these algorithms include command-line interfaces, e.g. the Bourne shell or Microsoft Windows command-line or text editor or file manager, as well as the interfaces for some search engines and databases. Wildcard matching is a subset of the problem of matching regular expressions and string matching in general.
The problem
A wildcard matcher tests a wildcard pattern p against an input string s. It performs an anchored match, returns true only when p matches the entirety of s.The pattern can be based on any common syntax, but on Windows programmers tend to only discuss a simplified syntax supported by the native C runtime:
- No escape characters are defined
- Wildcards: matches exactly one occurrence of any character. matches arbitrary many occurrences of any character.
Definition
Stated in zero-based indices, the wildcard-matching problem can be defined recursively as:where mij is the result of matching the pattern p against the text t truncated at i and j characters respectively. This is the formulation used by Richter's algorithm and the Snippets algorithm found in Cantatore's collection. This description is similar to the Levenshtein distance.
Related problems
Directly related problems in computer science include:- Pattern matching with don't cares or gaps, an unanchored string search with only the equivalent of defined.
- Pattern matching with wildcards, an unanchored string search with the equivalent of both wildcards defined. Has an exponential runtime unless a length-bound is given in the pattern matching with flexible wildcards variant.
History
Among both recursive and non-recursive algorithms, strategies for performing the pattern matching operation vary widely, as evidenced among the variety of example algorithms referenced below. Test case development and performance optimization techniques have been demonstrably brought to bear on certain algorithms, particularly those developed by critics of the recursive algorithms.
Recursive algorithms
The recursion generally happens on matching* when there is more suffix to match against. This is a form of backtracking, also done by some regular expression matchers.- Rich Salz' wildmat algorithm
- Filip's algorithm and Vignesh Murugesan's algorithm
- Martin Richter's algorithm
- C library fnmatch implementations :
- * Guido van Rossum's BSD libc fnmatch, also part of Apple libc
- * Glibc fnmatch
- The ABORT signal against over-recursion. While it is correct to naively recurse by the entire rest of the strings on
*and making sure that ONE of the substrings return a positive match, the running time becomes exponential for rejecting a match with many*in the text. Lars Mathiesen changes the return to three classes, match, no-match, and ABORT The ABORT value is returned when the text is consumed too early or when another asterisk match has failed, guaranteeing a linear performance with respect to the number of asterisks. Git/Rsync's wildmatch ABORT also covers invalid inputs. The new INN uwildmat does the same. - Asterisk advancement in recursion. This wildmatch tweak is relatively more minor. It applies to when the recursion wants to match "*X" on "abcX": when an asterisk is followed by a literal like "X", it is obvious that only the last comparison with equal lengths would have a chance of producing a match. This is seen earlier in uwildmat in 2000 and more implicitly in van Rossum's fnmatch for.
The recursive algorithms are in general easier to reason about, and with the ABORT modification they perform acceptably in terms of worst-case complexity. On strings without * they take linear-to-string-size time to match since there is a fixed one-to-one relation.
Non-recursive algorithms
The following are developed by critics of the recursive algorithms:- Kirk J. Krauss's wildcard-matching algorithm, used by IBM
- Alessandro Cantatore's collection of wildcard matching algorithms
- Dogan Kurt's iterative matcher and slower NFA matcher.
- Siler's incorrect algorithm
- Jack Handy's incorrect algorithm
In addition, the problem of wildcard matching can be converted into regular expression matching using a naive text-replacement approach. Although non-recursive regular expression matchers such as Thompson's construction are less used in practice due to lack of backreference support, wildcard matching in general does not come with a similarly rich set of features. The Russ Cox implementation of Thompson NFA can be trivially modified for such. Gustavo Navarro's -based nrgrep algorithm provides a more streamlined implementation with emphasis on efficient suffixes. See also.