| separator/punctuator | Punctuation characters and paired delimiters. | Lexical grammarThe specification of a programming language often includes a set of rules, the lexical grammar, which defines the lexical syntax. The lexical syntax is usually a regular language, with the grammar rules consisting of regular expressions; they define the set of possible character sequences of a token. A lexer recognizes strings, and for each kind of string found, the lexical program takes an action, most simply producing a token. Two important common lexical categories are white space and comments. These are also defined in the grammar and processed by the lexer but may be discarded and considered non-significant, at most separating two tokens. There are two important exceptions to this. First, in off-side rule languages that delimit blocks with indenting, initial whitespace is significant, as it determines block structure, and is generally handled at the lexer level; see phrase structure, below. Secondly, in some uses of lexers, comments and whitespace must be preserved – for examples, a prettyprinter also needs to output the comments and some debugging tools may provide messages to the programmer showing the original source code. In the 1960s, notably for ALGOL, whitespace and comments were eliminated as part of the line reconstruction phase, but this separate phase has been eliminated and these are now handled by the lexer.DetailsScannerThe first stage, the scanner, is usually based on a finite-state machine. It has encoded within it information on the possible sequences of characters that can be contained within any of the tokens it handles. For example, an integer lexeme may contain any sequence of numerical digit characters. In many cases, the first non-whitespace character can be used to deduce the kind of token that follows and subsequent input characters are then processed one at a time until reaching a character that is not in the set of characters acceptable for that token. In some languages, the lexeme creation rules are more complex and may involve backtracking over previously read characters. For example, in C, one 'L' character is not enough to distinguish between an identifier that begins with 'L' and a wide-character string literal.EvaluatorA lexeme, however, is only a string of characters known to be of a certain kind. The second stage of a lexical analyzer, the evaluator, goes over the characters of the lexeme to produce a value containing relevant information for the parser. The lexeme's type combined with its value is what properly constitutes a token. The value in the token can be whatever is deemed necessary for the parser to interpret a token of that type. Some examples of typical values produced by an evaluator include:
- A token for an identifier will often simply contain the characters of the associated lexeme.
- Token values for keywords and special characters are usually omitted, as the type alone contains all the information needed.
- Evaluators processing integer literals may pass the string on as is, or may perform evaluation themselves to produce numeric values.
- For a simple quoted string literal, the evaluator needs to remove only the quotes, but the evaluator for an escaped string literal may also incorporate a lexer, which unescapes the escape sequences.
The evaluator may also suppress a lexeme entirely, concealing it from the parser, which is useful for whitespace and comments. For example, in the source code of a computer program, the string might be converted into the following lexical token stream; where each line represents a token composed of a TYPE followed by an optional value: IDENTIFIER "net_worth_future" EQUALS OPEN_PARENTHESIS IDENTIFIER "assets" MINUS IDENTIFIER "liabilities" CLOSE_PARENTHESIS SEMICOLON Lexers may be written by hand. This is practical if the list of tokens is small, but lexers generated by automated tooling as part of a compiler-compiler toolchain are more practical for a larger number of potential tokens. These tools generally accept regular expressions that describe the tokens allowed in the input stream. Each regular expression is associated with a production rule in the lexical grammar of the programming language that evaluates the lexemes matching the regular expression. These tools may generate source code that can be compiled and executed or construct a state transition table for a finite-state machine. Regular expressions compactly represent patterns that the characters in lexemes might follow. For example, for an English-based language, an IDENTIFIER token might be any English alphabetic character or an underscore, followed by any number of instances of ASCII alphanumeric characters and/or underscores. This could be represented compactly by the string. This means "any character a-z, A-Z or _, followed by 0 or more of a-z, A-Z, _ or 0-9". Regular expressions and the finite-state machines they generate are not powerful enough to handle recursive patterns, such as "n opening parentheses, followed by a statement, followed by n closing parentheses." They are unable to keep count, and verify that n is the same on both sides, unless a finite set of permissible values exists for n. It takes a full parser to recognize such patterns in their full generality. A parser can push parentheses on a stack and then try to pop them off and see if the stack is empty at the end.ObstaclesTypically, lexical tokenization occurs at the word level. However, it is sometimes difficult to define what is meant by a "word". Often, a tokenizer relies on simple heuristics, for example:
- Punctuation and whitespace may or may not be included in the resulting list of tokens.
- All contiguous strings of alphabetic characters are part of one token; likewise with numbers.
- Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters.
In languages that use inter-word spaces, this approach is fairly straightforward. However, even here there are many edge cases such as contractions, hyphenated words, emoticons, and larger constructs such as URIs. A classic example is "New York-based", which a naive tokenizer may break at the space even though the better break is at the hyphen. Tokenization is particularly difficult for languages written in scriptio continua, which exhibit no word boundaries, such as Ancient Greek, Chinese, or Thai. Agglutinative languages, such as Korean, also make tokenization tasks complicated. Some ways to address the more difficult problems include developing more complex heuristics, querying a table of common special cases, or fitting the tokens to a language model that identifies collocations in a later processing step.
|