Lexical analysis

Lexical tokenization is conversion of a text into meaningful lexical tokens belonging to categories defined by a "lexer" program. In case of a natural language, those categories include nouns, verbs, adjectives, punctuations etc. In case of a programming language, the categories include identifiers, operators, grouping symbols, data types and language keywords. Lexical tokenization is related to the type of tokenization used in large language models but with two differences. First, lexical tokenization is usually based on a lexical grammar, whereas LLM tokenizers are usually probability-based. Second, LLM tokenizers perform a second step that converts the tokens into numerical values.

Rule-based programs

A rule-based program, performing lexical tokenization, is called tokenizer, or scanner, although scanner is also a term for the first stage of a lexer. A lexer forms the first phase of a compiler frontend in processing. Analysis generally occurs in one pass. Lexers and parsers are most often used for compilers, but can be used for other computer language tools, such as prettyprinters or linters. Lexing can be divided into two stages: the scanning, which segments the input string into syntactic units called [|lexemes] and categorizes these into token classes, and the evaluating, which converts lexemes into processed values.
Lexers are generally quite simple, with most of the complexity deferred to the syntactic analysis or semantic analysis phases, and can often be generated by a lexer generator, notably lex or derivatives. However, lexers can sometimes include some complexity, such as [|phrase structure] processing to make input easier and simplify the parser, and may be written partly or fully by hand, either to support more features or for performance.

Disambiguation of "lexeme"

What is called "lexeme" in rule-based natural language processing is not equal to what is called lexeme in linguistics. What is called "lexeme" in rule-based natural language processing can be equal to the linguistic equivalent only in analytic languages, such as English, but not in highly synthetic languages, such as fusional languages. What is called a [|lexeme] in rule-based natural language processing is more similar to what is called a word in linguistics, although in some cases it may be more similar to a morpheme.

Lexical token and lexical tokenization

A lexical token is a string with an assigned and thus identified meaning, in contrast to the probabilistic token used in large language models. A lexical token consists of a token name and an optional token value. The token name is a category of a rule-based lexical unit.

Token name	Explanation	Sample token values
identifier	Names assigned by the programmer.	,,
keyword	Reserved words of the language.	,,
separator/punctuator	Punctuation characters and paired delimiters.	Lexical grammar The specification of a programming language often includes a set of rules, the lexical grammar, which defines the lexical syntax. The lexical syntax is usually a regular language, with the grammar rules consisting of regular expressions; they define the set of possible character sequences of a token. A lexer recognizes strings, and for each kind of string found, the lexical program takes an action, most simply producing a token. Two important common lexical categories are white space and comments. These are also defined in the grammar and processed by the lexer but may be discarded and considered non-significant, at most separating two tokens. There are two important exceptions to this. First, in off-side rule languages that delimit blocks with indenting, initial whitespace is significant, as it determines block structure, and is generally handled at the lexer level; see phrase structure, below. Secondly, in some uses of lexers, comments and whitespace must be preserved – for examples, a prettyprinter also needs to output the comments and some debugging tools may provide messages to the programmer showing the original source code. In the 1960s, notably for ALGOL, whitespace and comments were eliminated as part of the line reconstruction phase, but this separate phase has been eliminated and these are now handled by the lexer. Details Scanner The first stage, the scanner, is usually based on a finite-state machine. It has encoded within it information on the possible sequences of characters that can be contained within any of the tokens it handles. For example, an integer lexeme may contain any sequence of numerical digit characters. In many cases, the first non-whitespace character can be used to deduce the kind of token that follows and subsequent input characters are then processed one at a time until reaching a character that is not in the set of characters acceptable for that token. In some languages, the lexeme creation rules are more complex and may involve backtracking over previously read characters. For example, in C, one 'L' character is not enough to distinguish between an identifier that begins with 'L' and a wide-character string literal. Evaluator A lexeme, however, is only a string of characters known to be of a certain kind. The second stage of a lexical analyzer, the evaluator, goes over the characters of the lexeme to produce a value containing relevant information for the parser. The lexeme's type combined with its value is what properly constitutes a token. The value in the token can be whatever is deemed necessary for the parser to interpret a token of that type. Some examples of typical values produced by an evaluator include: A token for an identifier will often simply contain the characters of the associated lexeme. Token values for keywords and special characters are usually omitted, as the type alone contains all the information needed. Evaluators processing integer literals may pass the string on as is, or may perform evaluation themselves to produce numeric values. For a simple quoted string literal, the evaluator needs to remove only the quotes, but the evaluator for an escaped string literal may also incorporate a lexer, which unescapes the escape sequences. The evaluator may also suppress a lexeme entirely, concealing it from the parser, which is useful for whitespace and comments. For example, in the source code of a computer program, the string might be converted into the following lexical token stream; where each line represents a token composed of a TYPE followed by an optional value: IDENTIFIER "net_worth_future" EQUALS OPEN_PARENTHESIS IDENTIFIER "assets" MINUS IDENTIFIER "liabilities" CLOSE_PARENTHESIS SEMICOLON Lexers may be written by hand. This is practical if the list of tokens is small, but lexers generated by automated tooling as part of a compiler-compiler toolchain are more practical for a larger number of potential tokens. These tools generally accept regular expressions that describe the tokens allowed in the input stream. Each regular expression is associated with a production rule in the lexical grammar of the programming language that evaluates the lexemes matching the regular expression. These tools may generate source code that can be compiled and executed or construct a state transition table for a finite-state machine. Regular expressions compactly represent patterns that the characters in lexemes might follow. For example, for an English-based language, an IDENTIFIER token might be any English alphabetic character or an underscore, followed by any number of instances of ASCII alphanumeric characters and/or underscores. This could be represented compactly by the string. This means "any character a-z, A-Z or _, followed by 0 or more of a-z, A-Z, _ or 0-9". Regular expressions and the finite-state machines they generate are not powerful enough to handle recursive patterns, such as "n opening parentheses, followed by a statement, followed by n closing parentheses." They are unable to keep count, and verify that n is the same on both sides, unless a finite set of permissible values exists for n. It takes a full parser to recognize such patterns in their full generality. A parser can push parentheses on a stack and then try to pop them off and see if the stack is empty at the end. Obstacles Typically, lexical tokenization occurs at the word level. However, it is sometimes difficult to define what is meant by a "word". Often, a tokenizer relies on simple heuristics, for example: Punctuation and whitespace may or may not be included in the resulting list of tokens. All contiguous strings of alphabetic characters are part of one token; likewise with numbers. Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters. In languages that use inter-word spaces, this approach is fairly straightforward. However, even here there are many edge cases such as contractions, hyphenated words, emoticons, and larger constructs such as URIs. A classic example is "New York-based", which a naive tokenizer may break at the space even though the better break is at the hyphen. Tokenization is particularly difficult for languages written in scriptio continua, which exhibit no word boundaries, such as Ancient Greek, Chinese, or Thai. Agglutinative languages, such as Korean, also make tokenization tasks complicated. Some ways to address the more difficult problems include developing more complex heuristics, querying a table of common special cases, or fitting the tokens to a language model that identifies collocations in a later processing step. Lexer generator Lexers are often generated by a lexer generator, analogous to parser generators, and such tools often come together. The most established is lex, paired with the yacc parser generator, or rather some of their many reimplementations, like flex. These generators are a form of domain-specific language, taking in a lexical specification – generally regular expressions with some markup – and emitting a lexer. These tools yield very fast development, which is very important in early development, both to get a working lexer and because a language specification may change often. Further, they often provide advanced features, such as pre- and post-conditions which are hard to program by hand. However, an automatically generated lexer may lack flexibility, and thus may require some manual modification, or an all-manually written lexer. Lexer performance is a concern, and optimizing is worthwhile, more so in stable languages where the lexer runs very often. lex/flex-generated lexers are reasonably fast, but improvements of two to three times are possible using more tuned generators. Hand-written lexers are sometimes used, but modern lexer generators produce faster lexers than most hand-coded ones. The lex/flex family of generators uses a table-driven approach which is much less efficient than the directly coded approach. With the latter approach the generator produces an engine that directly jumps to follow-up states via goto statements. Tools like re2c have proven to produce engines that are between two and three times faster than flex produced engines. It is in general difficult to hand-write analyzers that perform better than engines generated by these latter tools. Phrase structure Lexical analysis mainly segments the input stream of characters into tokens, simply grouping the characters into pieces and categorizing them. However, the lexing may be significantly more complex; most simply, lexers may omit tokens or insert added tokens. Omitting tokens, notably whitespace and comments, is very common when these are not needed by the compiler. Less commonly, added tokens may be inserted. This is done mainly to group tokens into statements, or statements into blocks, to simplify the parser. Line continuation Line continuation is a feature of some languages where a newline is normally a statement terminator. Most often, ending a line with a backslash results in the line being continued – the following line is joined to the prior line. This is generally done in the lexer: The backslash and newline are discarded, rather than the newline being tokenized. Examples include bash, other shell scripts and Python. Semicolon insertion Many languages use the semicolon as a statement terminator. Most often this is mandatory, but in some languages the semicolon is optional in many contexts. This is mainly done at the lexer level, where the lexer outputs a semicolon into the token stream, despite one not being present in the input character stream, and is termed semicolon insertion or automatic semicolon insertion. In these cases, semicolons are part of the formal phrase grammar of the language, but may not be found in input text, as they can be inserted by the lexer. Optional semicolons or other terminators or separators are also sometimes handled at the parser level, notably in the case of trailing commas or semicolons. Semicolon insertion is a feature of BCPL and its distant descendant Go, though it is absent in B or C. Semicolon insertion is present in JavaScript, though the rules are somewhat complex and much-criticized; to avoid bugs, some recommend always using semicolons, while others use initial semicolons, termed defensive semicolons, at the start of potentially ambiguous statements. Semicolon insertion and line continuation can be seen as complementary: Semicolon insertion adds a token even though newlines generally do not generate tokens, while line continuation prevents a token from being generated even though newlines generally do generate tokens. Off-side rule The off-side rule can be implemented in the lexer, as in Python, where increasing the indenting results in the lexer emitting an INDENT token and decreasing the indenting results in the lexer emitting one or more DEDENT tokens. These tokens correspond to the opening brace in languages that use braces for blocks and means that the phrase grammar does not depend on whether braces or indenting are used. This requires that the lexer hold state, namely a stack of indent levels, and thus can detect changes in indenting when this changes, and thus the lexical grammar is not context-free: INDENT–DEDENT depend on the contextual information of prior indent levels. Context-sensitive lexing Generally lexical grammars are context-free, or almost so, and thus require no looking back or ahead, or backtracking, which allows a simple, clean, and efficient implementation. This also allows simple one-way communication from lexer to parser, without needing any information flowing back to the lexer. There are exceptions, however. Simple examples include semicolon insertion in Go, which requires looking back one token; concatenation of consecutive string literals in Python, which requires holding one token in a buffer before emitting it ; and the off-side rule in Python, which requires maintaining a count of indent level. These examples all only require lexical context, and while they complicate a lexer somewhat, they are invisible to the parser and later phases. A more complex example is the lexer hack in C, where the token class of a sequence of characters cannot be determined until the semantic analysis phase since typedef names and variable names are lexically identical but constitute different token classes. Thus in the hack, the lexer calls the semantic analyzer and checks if the sequence requires a typedef name. In this case, information must flow back not from the parser only, but from the semantic analyzer back to the lexer, which complicates design. Books 1 Project Hail Mary 36,266 2 Wuthering Heights 19,402 3 Gregg shorthand 13,480 4 I Am Legend (novel) 12,851 5 The Count of Monte Cristo 9,073 6 Dungeon Crawler Carl 7,267 7 Hamlet 7,090 8 Tales of Dunk and Egg 6,985 9 Frankenstein 6,907 10 Bridgerton (novel series) 6,598 Films 1 The Bride! 61,211 2 Hoppers (film) 54,919 3 XXX (film series) 52,230 4 Project Hail Mary (film) 48,295 5 The Super Mario Galaxy Movie 40,974 6 XXX: Return of Xander Cage 37,559 7 XXX (2002 film) 37,019 8 Scream 7 34,665 9 Hamnet (film) 31,978 10 With Love (2026 film) 30,723 Programming Languages 1 Python (programming language) 4,412 2 C (programming language) 4,342 3 Scratch (programming language) 3,453 4 C++ 2,263 5 JavaScript 2,167 6 Rust (programming language) 2,041 7 YAML 1,620 8 Java (programming language) 1,614 9 R (programming language) 1,588 10 Go (programming language) 1,457 TV Series 1 One Piece (2023 TV series) 83,678 2 Paradise (2025 TV series) 53,522 3 Young Sherlock (British TV series) 43,978 4 Bridgerton 40,936 5 Love Story (2026 TV series) 39,425 6 DTF St. Louis 39,351 7 The Pitt 35,417 8 Rooster (TV series) 30,705 9 A Knight of the Seven Kingdoms (TV series) 30,437 10 Ted (TV series) 28,091 Video Games 1 Resident Evil Requiem 34,490 2 Wordle 32,742 3 Pokémon Pokopia 29,773 4 Crimson Desert 13,194 5 Roblox 11,340 6 Pokémon (video game series) 9,154 7 Resident Evil Village 8,614 8 Minecraft 8,370 9 Marathon (2026 video game) 8,337 10 Resident Evil 7: Biohazard 7,886 © 2026 OWIKI.org. Content is available under Creative Commons Attribution-ShareAlike 4.0 unless otherwise noted. Status: ONLINE Version: 1.05

Lexical analysis

Rule-based programs

Disambiguation of "lexeme"

Lexical token and lexical tokenization

Lexical grammar

Details

Scanner

Evaluator

Obstacles

Lexer generator

Phrase structure

Line continuation

Semicolon insertion

Off-side rule

Context-sensitive lexing