LL parser
In computer science, an LL parser is a top-down parser for a restricted context-free language. It parses the input from Left to right, performing Leftmost derivation of the sentence.
An LL parser is called an LL(k) parser if it uses k tokens of lookahead when parsing a sentence. A grammar is called an LL(k) grammar if an LL(k) parser can be constructed from it. A formal language is called an LL(k) language if it has an LL(k) grammar. The set of LL(k) languages is properly contained in that of LL(k+1) languages, for each k ≥ 0. A corollary of this is that not all context-free languages can be recognized by an LL(k) parser.
An LL parser is called LL-regular (LLR) if it parses an LL-regular language. The class of LLR grammars contains every LL(k) grammar for every k. For every LLR grammar there exists an LLR parser that parses the grammar in linear time.
Two parser types whose names depart from this scheme are LL(*) and LL(finite). A parser is called LL(*)/LL(finite) if it uses the LL(*)/LL(finite) parsing strategy. LL(*) and LL(finite) parsers are functionally closer to PEG parsers. An LL(finite) parser can parse an arbitrary LL(k) grammar optimally in the amount of lookahead and lookahead comparisons. The class of grammars parsable by the LL(*) strategy encompasses some context-sensitive languages due to the use of syntactic and semantic predicates, and has not been identified. It has been suggested that LL(*) parsers are better thought of as TDPL parsers.
Contrary to a popular misconception, LL(*) parsers are not LLR in general, and are guaranteed by construction to perform worse than LLR parsers on average (super-linear rather than linear time) and far worse in the worst case (exponential rather than linear time).
LL grammars, particularly LL(1) grammars, are of great practical interest, as parsers for these grammars are easy to construct, and many computer languages are designed to be LL(1) for this reason. LL parsers may be table-based, i.e. similar to LR parsers, but LL grammars can also be parsed by recursive descent parsers. According to Waite and Goos, LL(k) grammars were introduced by Stearns and Lewis.
Overview
For a given context-free grammar, the parser attempts to find the leftmost derivation. Given an example grammar G:
1. S → E
2. E → ( E + E )
3. E → i
the leftmost derivation for w = ((i+i)+i) is:
S ⇒ E ⇒ (E+E) ⇒ ((E+E)+E) ⇒ ((i+E)+E) ⇒ ((i+i)+E) ⇒ ((i+i)+i)
Generally, there are multiple possibilities when selecting a rule to expand the leftmost non-terminal. In step 2 of the previous example, the parser must choose whether to apply rule 2 or rule 3:
S ⇒ E ⇒ ?
To be efficient, the parser must be able to make this choice deterministically when possible, without backtracking. For some grammars, it can do this by peeking at the unread input without consuming it. In our example, if the parser knows that the next unread symbol is '(', the only rule that can be used is rule 2. Generally, an LL(k) parser can look ahead at k symbols. However, given a grammar, the problem of determining whether there exists an LL(k) parser for some k that recognizes it is undecidable. For each k, there is a language that cannot be recognized by an LL(k) parser, but can be by an LL(k+1) parser.
We can use the above analysis to give the following formal definition:
Let G be a context-free grammar and k ≥ 1. We say that G is LL(k) if, for any two leftmost derivations:
1. S ⇒ ... ⇒ wAα ⇒ wβα ⇒ ... ⇒ wu
2. S ⇒ ... ⇒ wAα ⇒ wγα ⇒ ... ⇒ wv
the following condition holds: if the prefix of u of length k equals the prefix of v of length k, then β = γ.
In this definition, S is the start symbol and A any non-terminal. The already derived input w, and the yet unread u and v, are strings of terminals. The Greek letters α, β and γ represent any string of both terminals and non-terminals (possibly empty). The prefix length corresponds to the lookahead buffer size, and the definition says that this buffer is enough to distinguish between any two derivations of different words.
Parser
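The LL(k) parser is a deterministic pushdown automaton with the ability to peek on the next k input symbols without reading. This peek capability can be emulated by storing the lookahead buffer contents in the finite state space, since both buffer and input alphabet are finite in size. As a result, this does not make the automaton more powerful, but is a convenient abstraction.
The stack alphabet is Γ = N ∪ Σ, where:
- N is the set of non-terminals;
- Σ is the set of terminal (input) symbols with a special end-of-input symbol $.
The parser stack initially contains the start symbol above the end-of-input symbol: [ S $ ]. During operation, the parser repeatedly replaces the symbol X on top of the stack:
- with some α, if X ∈ N and there is a rule X → α;
- with ε, i.e. X is popped off the stack, if X ∈ Σ. In this case, an input symbol x is read and if x ≠ X, the parser rejects the input.
If the last symbol to be removed from the stack is the end-of-input symbol, the parsing is successful; the automaton accepts via an empty stack.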
The states and the transition function are not explicitly given; they are specified using a more convenient parse table instead. The table provides the following mapping:
- row: top-of-stack symbol X
- column: lookahead buffer contents w, with |w| ≤ k
- cell: rule number for X → α or ε
Concrete example
Set up
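To explain an LL(1) parser's workings we will consider the following small LL(1) grammar:
1. S → F
2. S → ( S + F )
3. F → a
and parse the following input:
( a + a )
An LL(1) parsing table for a grammar has a row for each of the non-terminals and a column for each terminal (including the special terminal $, which marks the end of the input stream).
Each cell of the table may point to at most one rule of the grammar, identified by its number. For example, in the parsing table for the above grammar, the cell for the non-terminal 'S' and terminal '(' points to rule number 2:
      (    )    a    +    $
S     2    -    1    -    -
F     -    -    3    -    -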
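As a minimal sketch, the parsing table for this grammar can be held in a dictionary keyed by (non-terminal, terminal) pairs. The names `RULES`, `TABLE` and `lookup` are illustrative only, and the sketch assumes that rule 2 is S → ( S + F ); pairs missing from the dictionary stand for empty cells.

```python
from typing import Optional

# Grammar rules, numbered as in the text (rule 2 is assumed to be S -> ( S + F )).
RULES = {
    1: ("S", ["F"]),
    2: ("S", ["(", "S", "+", "F", ")"]),
    3: ("F", ["a"]),
}

# LL(1) parsing table: (non-terminal, lookahead terminal) -> rule number.
# Pairs that are absent correspond to empty cells, i.e. syntax errors.
TABLE = {
    ("S", "("): 2,
    ("S", "a"): 1,
    ("F", "a"): 3,
}


def lookup(nonterminal: str, lookahead: str) -> Optional[int]:
    """Return the rule number to apply, or None if the cell is empty."""
    return TABLE.get((nonterminal, lookahead))


print(lookup("S", "("))  # 2
print(lookup("F", "+"))  # None
```

A dictionary keeps the sketch short; a dense two-dimensional array indexed by symbol numbers is the more usual choice when the table is generated.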
The algorithm to construct a parsing table is described in a later section, but first let's see how the parser uses the parsing table to process its input.
Parsing procedure
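In each step, the parser reads the next-available symbol from the input stream, and the top-most symbol from the stack. If the input symbol and the stack-top symbol match, the parser discards them both by moving to the next input symbol and popping the top-most symbol off the stack. This is repeated until the input symbol and top-most symbol on the stack do not match.
Thus, in its first step, the parser reads the input symbol '(' and the stack-top symbol 'S'. The parsing table instruction comes from the column headed by the input symbol '(' and the row headed by the stack-top symbol 'S'; this cell contains 2, which instructs the parser to apply rule 2. The parser rewrites 'S' to '( S + F )' on the stack by removing 'S' and pushing ')', 'F', '+', 'S', '(' onto the stack, and writes the rule number 2 to the output stream. The stack then becomes:
[ (, S, +, F, ), $ ]
In the second step, the parser removes the '(' from its input stream and from its stack, since they now match. The stack now becomes:
[ S, +, F, ), $ ]
Now the parser has an 'a' on its input stream and an 'S' as its stack top. The parsing table instructs it to apply rule 1 from the grammar and write the rule number 1 to the output stream. The stack becomes:
[ F, +, F, ), $ ]
The parser now has an 'a' on its input stream and an 'F' as its stack top. The parsing table instructs it to apply rule 3 from the grammar and write the rule number 3 to the output stream. The stack becomes:
[ a, +, F, ), $ ]
The parser now has an 'a' on the input stream and an 'a' at its stack top. Because they are the same, it removes the symbol from the input stream and pops it from the top of the stack. The parser then has a '+' on the input stream and a '+' at the top of the stack, so, as with 'a', it is popped from the stack and removed from the input stream. This results in:
[ F, ), $ ]
In the next three steps the parser will replace 'F' on the stack by 'a', write the rule number 3 to the output stream and remove the 'a' and ')' from both the stack and the input stream. The parser thus ends with '$' on both its stack and its input stream.
In this case the parser will report that it has accepted the input string and write the following list of rule numbers to the output stream:
[ 2, 1, 3, 3 ]
This is indeed a list of rules for a leftmost derivation of the input string, which is:
S → ( S + F ) → ( F + F ) → ( a + F ) → ( a + a )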
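The claim that the output is a leftmost derivation can be checked mechanically. The following sketch replays a list of rule numbers by always rewriting the leftmost non-terminal. It assumes the three-rule grammar of this example with rule 2 being S → ( S + F ); `leftmost_derivation` is a hypothetical helper name.

```python
# Grammar of the example; rule 2 is assumed to be S -> ( S + F ).
RULES = {
    1: ("S", ["F"]),
    2: ("S", ["(", "S", "+", "F", ")"]),
    3: ("F", ["a"]),
}
NONTERMINALS = {"S", "F"}


def leftmost_derivation(start, rule_numbers):
    """Apply each rule to the leftmost non-terminal; return all sentential forms."""
    form = [start]
    steps = [" ".join(form)]
    for number in rule_numbers:
        head, body = RULES[number]
        # Position of the leftmost non-terminal in the current sentential form.
        index = next(i for i, sym in enumerate(form) if sym in NONTERMINALS)
        if form[index] != head:
            raise ValueError(f"rule {number} does not apply to {form[index]}")
        form[index:index + 1] = body
        steps.append(" ".join(form))
    return steps


print(" => ".join(leftmost_derivation("S", [2, 1, 3, 3])))
# S => ( S + F ) => ( F + F ) => ( a + F ) => ( a + a )
```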
Parser implementation in C++
Below follows a C++ implementation of a table-based LL parser for the example language:
#include
#include
#include
/*
Converts a valid token to the corresponding terminal symbol
*/
int main
Parser implementation in Python
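from enum import Enum
from collections.abc import Generator


class Term(Enum):
    pass


class Rule(Enum):
    pass


# All constants are indexed from 0
class Terminal(Term):
    LPAR = 0
    RPAR = 1
    A = 2
    PLUS = 3
    END = 4
    INVALID = 5

    def __str__(self) -> str:
        return f"T_{self.name}"


class NonTerminal(Rule):
    S = 0
    F = 1

    def __str__(self) -> str:
        return f"N_{self.name}"


# Parse table: one row per non-terminal, one column per terminal; -1 marks an empty cell
table = [
    [1, -1, 0, -1, -1, -1],
    [-1, -1, 2, -1, -1, -1],
]

# Right-hand sides of the grammar rules, indexed from 0
RULES = [
    [NonTerminal.F],
    [Terminal.LPAR, NonTerminal.S, Terminal.PLUS, NonTerminal.F, Terminal.RPAR],
    [Terminal.A],
]

stack = [Terminal.END, NonTerminal.S]


def lexical_analysis(input_string: str) -> Generator[Terminal, None, None]:
    for c in input_string:
        match c:
            case "a":
                yield Terminal.A
            case "+":
                yield Terminal.PLUS
            case "(":
                yield Terminal.LPAR
            case ")":
                yield Terminal.RPAR
            case _:
                yield Terminal.INVALID
    yield Terminal.END


def syntactic_analysis(tokens: list[Terminal]) -> None:
    position = 0
    while stack:
        svalue = stack.pop()
        token = tokens[position]
        if isinstance(svalue, Term):
            # A terminal on top of the stack must match the current token
            if svalue == token:
                position += 1
                print("pop", svalue)
                if token == Terminal.END:
                    print("input accepted")
            else:
                raise ValueError(f"bad term on input: {token}")
        elif isinstance(svalue, Rule):
            rule = table[svalue.value][token.value]
            if rule < 0:
                # Without this check, -1 would silently select the last rule
                raise ValueError(f"no rule for {svalue} on input {token}")
            print("rule", rule)
            # Push the right-hand side in reverse so its first symbol ends up on top
            for r in reversed(RULES[rule]):
                stack.append(r)


if __name__ == "__main__":
    inputstring = "(a+a)"
    syntactic_analysis(list(lexical_analysis(inputstring)))
Outputs:
rule 1
pop T_LPAR
rule 0
rule 2
pop T_A
pop T_PLUS
rule 2
pop T_A
pop T_RPAR
pop T_END
input accepted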
Remarks
As can be seen from the example, the parser performs three types of steps depending on whether the top of the stack is a nonterminal, a terminal or the special symbol $:
- If the top is a nonterminal then the parser looks up in the parsing table, on the basis of this nonterminal and the symbol on the input stream, which rule of the grammar it should use to replace the nonterminal on the stack. The number of the rule is written to the output stream. If the parsing table indicates that there is no such rule then the parser reports an error and stops.
- If the top is a terminal then the parser compares it to the symbol on the input stream and if they are equal they are both removed. If they are not equal the parser reports an error and stops.
- If the top is $ and on the input stream there is also a $ then the parser reports that it has successfully parsed the input, otherwise it reports an error. In both cases the parser will stop.
Constructing an LL(1) parsing table
In order to fill the parsing table, we have to establish what grammar rule the parser should choose if it sees a nonterminal A on the top of its stack and a symbol a on its input stream.
It is easy to see that such a rule should be of the form A → w and that the language corresponding to w should have at least one string starting with a.
For this purpose we define the First-set of w, written here as Fi(w), as the set of terminals that can be found at the start of some string in w, plus ε if the empty string also belongs to w.
Given a grammar with the rules A1 → w1, ..., An → wn, we can compute the Fi(wi) and Fi(Ai) for every rule as follows:
1. initialize every Fi(Ai) with the empty set
2. add Fi(wi) to Fi(Ai) for every rule Ai → wi, where Fi is defined as follows:
   - Fi(aw') = { a } for every terminal a
   - Fi(Aw') = Fi(A) for every nonterminal A with ε not in Fi(A)
   - Fi(Aw') = (Fi(A) \ { ε }) ∪ Fi(w') for every nonterminal A with ε in Fi(A)
   - Fi(ε) = { ε }
3. add Fi(wi) to Fi(Ai) for every rule Ai → wi
4. do steps 2 and 3 until all Fi sets stay the same.
The result is the least fixed point solution to the following system:
- Fi(A) ⊇ Fi(w) for each rule A → w
- Fi(a) ⊇ { a } for each terminal a
- Fi(w0 w1) ⊇ Fi(w0)·Fi(w1) for all words w0 and w1
- Fi(ε) ⊇ { ε }
where, for sets of words U and V, the truncated product is defined by U·V = { (uv):1 | u ∈ U, v ∈ V }, and w:1 denotes the initial length-1 prefix of words w of length 2 or more, or w itself if w has length 0 or 1.
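The fixed-point iteration for the First-sets can be sketched as follows. This is an illustration under assumptions, not a reference implementation: rules are (head, body) pairs, the empty string stands for ε, every symbol that never appears as a rule head is treated as a terminal, and the three-rule grammar is used only as sample input.

```python
EPS = ""  # the empty string stands for ε

# Sample grammar (an assumption for illustration): S -> F | ( S + F ), F -> a
GRAMMAR = [
    ("S", ["F"]),
    ("S", ["(", "S", "+", "F", ")"]),
    ("F", ["a"]),
]


def first_of_word(word, first):
    """Fi(w) for a sequence of symbols, given the current Fi sets of the non-terminals."""
    result = set()
    for symbol in word:
        if symbol not in first:             # terminal: Fi(a w') = {a}
            return result | {symbol}
        result |= first[symbol] - {EPS}     # contribute Fi(A) without ε
        if EPS not in first[symbol]:        # A cannot vanish, so w' is irrelevant
            return result
    return result | {EPS}                   # every symbol can vanish (or w is empty)


def first_sets(grammar):
    """Repeat the 'add Fi(wi) to Fi(Ai)' step until no set changes."""
    first = {head: set() for head, _ in grammar}
    changed = True
    while changed:
        changed = False
        for head, body in grammar:
            new = first_of_word(body, first)
            if not new <= first[head]:
                first[head] |= new
                changed = True
    return first


fi = first_sets(GRAMMAR)
print(sorted(fi["S"]))  # ['(', 'a']
print(sorted(fi["F"]))  # ['a']
```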
Unfortunately, the First-sets are not sufficient to compute the parsing table.
This is because a right-hand side w of a rule might ultimately be rewritten to the empty string.
So the parser should also use the rule A → w if ε is in Fi(w) and it sees on the input stream a symbol that could follow A. Therefore, we also need the Follow-set of A, written as Fo(A) here, which is defined as the set of terminals a such that there is a string of symbols αAaβ that can be derived from the start symbol. We use $ as a special terminal indicating end of input stream, and S as start symbol.
Computing the Follow-sets for the nonterminals in a grammar can be done as follows:
1. initialize Fo(S) with { $ } and every other Fo(Ai) with the empty set
2. if there is a rule of the form Aj → wAiw', then
   - if the terminal a is in Fi(w'), then add a to Fo(Ai)
   - if ε is in Fi(w'), then add Fo(Aj) to Fo(Ai)
   - if w' has length 0, then add Fo(Aj) to Fo(Ai)
3. repeat step 2 until all Fo sets stay the same.
This provides the least fixed point solution to the following system:
- Fo(S) ⊇ { $ }
- Fo(A) ⊇ Fi(w)·Fo(B) for each rule of the form B → ... A w
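A sketch of the Follow-set iteration, under the same assumptions as the First-set computation (rules as (head, body) pairs, the empty string for ε). The sample grammar E → T Z, Z → + T Z | ε, T → a is chosen here only because it contains an ε-rule; the helper names are illustrative.

```python
EPS, END = "", "$"

# Sample grammar (an assumption): E -> T Z, Z -> + T Z | ε, T -> a
GRAMMAR = [
    ("E", ["T", "Z"]),
    ("Z", ["+", "T", "Z"]),
    ("Z", []),
    ("T", ["a"]),
]


def first_of_word(word, first):
    """Fi(w) for a sequence of symbols, given the Fi sets of the non-terminals."""
    result = set()
    for symbol in word:
        if symbol not in first:             # terminal
            return result | {symbol}
        result |= first[symbol] - {EPS}
        if EPS not in first[symbol]:
            return result
    return result | {EPS}


def first_sets(grammar):
    first = {head: set() for head, _ in grammar}
    changed = True
    while changed:
        changed = False
        for head, body in grammar:
            new = first_of_word(body, first)
            if not new <= first[head]:
                first[head] |= new
                changed = True
    return first


def follow_sets(grammar, start):
    first = first_sets(grammar)
    follow = {head: set() for head, _ in grammar}
    follow[start].add(END)                          # step 1: Fo(S) = {$}
    changed = True
    while changed:                                  # step 3: repeat until stable
        changed = False
        for head, body in grammar:                  # step 2: rule Aj -> w Ai w'
            for i, symbol in enumerate(body):
                if symbol not in follow:            # terminals have no Fo set
                    continue
                rest = first_of_word(body[i + 1:], first)   # Fi(w')
                new = rest - {EPS}
                if EPS in rest:                     # w' is empty or can vanish
                    new |= follow[head]
                if not new <= follow[symbol]:
                    follow[symbol] |= new
                    changed = True
    return follow


fo = follow_sets(GRAMMAR, "E")
print(sorted(fo["T"]))  # ['$', '+']
print(sorted(fo["Z"]))  # ['$']
```

The cases "ε is in Fi(w')" and "w' has length 0" collapse into one test here because `first_of_word` returns { ε } for the empty word.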
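Now we can define exactly which rules will appear where in the parsing table.
If T[A, a] denotes the entry in the table for nonterminal A and terminal a, then T[A, a] contains the rule A → w if and only if
- a is in Fi(w), or
- ε is in Fi(w) and a is in Fo(A).
Equivalently: T[A, a] contains the rule A → w for each a ∈ Fi(w)·Fo(A), where · is the truncated product that keeps only the first symbol of each concatenated word.
If the table contains at most one rule in every one of its cells, then the parser will always know which rule it has to use and can therefore parse strings without backtracking.
It is in precisely this case that the grammar is called an LL(1) grammar.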
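The table-filling rule can be sketched as below. To keep the block short, the First- and Follow-sets of the sample grammar E → T Z, Z → + T Z | ε, T → a are written out by hand rather than computed; the grammar, the rule numbering and the helper names are all assumptions made for illustration.

```python
EPS = ""  # the empty string stands for ε

# Sample grammar with hand-computed Fi and Fo sets (assumptions for this sketch).
RULES = {
    1: ("E", ["T", "Z"]),
    2: ("Z", ["+", "T", "Z"]),
    3: ("Z", []),
    4: ("T", ["a"]),
}
FIRST = {"E": {"a"}, "Z": {"+", EPS}, "T": {"a"}}
FOLLOW = {"E": {"$"}, "Z": {"$"}, "T": {"+", "$"}}


def first_of_word(word):
    """Fi(w) for a right-hand side, using the Fi sets of the non-terminals."""
    result = set()
    for symbol in word:
        if symbol not in FIRST:             # terminal
            return result | {symbol}
        result |= FIRST[symbol] - {EPS}
        if EPS not in FIRST[symbol]:
            return result
    return result | {EPS}


def build_table(rules):
    """Map (non-terminal, terminal) to the list of rule numbers placed in that cell."""
    table = {}
    for number, (head, body) in rules.items():
        fi = first_of_word(body)
        lookaheads = fi - {EPS}             # a is in Fi(w)
        if EPS in fi:                       # ε is in Fi(w): also use Fo(A)
            lookaheads |= FOLLOW[head]
        for terminal in lookaheads:
            table.setdefault((head, terminal), []).append(number)
    return table


def is_ll1(table):
    """The grammar is LL(1) exactly when no cell holds more than one rule."""
    return all(len(cell) == 1 for cell in table.values())


table = build_table(RULES)
print(table[("Z", "$")])  # [3]
print(is_ll1(table))      # True
```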
Constructing an LL(k) parsing table
The construction for LL(1) parsers can be adapted to LL(k) for k > 1 with the following modifications:
- the truncated product is defined as U·V = { (uv):k | u ∈ U, v ∈ V }, where w:k denotes the initial length-k prefix of words of length > k, or w itself if w has length k or less,
- Fo(S) = { $^k }, i.e. the input is suffixed by k end-markers $,
- apply Fi(αβ) = Fi(α)·Fi(β) also in step 2 of the Fi construction given for LL(1),
- in step 2 of the Fo construction, for Aj → wAiw' simply add Fi(w')·Fo(Aj) to Fo(Ai).
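The k-truncated product used in these modifications is simple to write down. In this sketch words are Python strings with one character per terminal, which is an assumption made for brevity; `truncated_product` is an illustrative name.

```python
def truncated_product(u_set, v_set, k):
    """Concatenate every pair of words and keep only the first k symbols."""
    return {(u + v)[:k] for u in u_set for v in v_set}


# Hypothetical sets: Fi(w') = {ε, a} and Fo(Aj) = {b$, $$}, with k = 2.
print(sorted(truncated_product({"", "a"}, {"b$", "$$"}, 2)))
# ['$$', 'a$', 'ab', 'b$']

# With k = 1 the same call gives the LL(1) lookahead symbols.
print(sorted(truncated_product({"", "a"}, {"b$", "$$"}, 1)))
# ['$', 'a', 'b']
```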
Until the mid-1990s, it was widely believed that LL(k) parsing (for k > 1) was impractical, since the parser table would have exponential size in k in the worst case. This perception changed gradually after the release of the Purdue Compiler Construction Tool Set around 1992, when it was demonstrated that many programming languages can be parsed efficiently by an LL(k) parser without triggering the worst-case behavior of the parser. Moreover, in certain cases LL parsing is feasible even with unlimited lookahead. By contrast, traditional parser generators like yacc use LALR(1) parser tables to construct a restricted LR parser with a fixed one-token lookahead.
Conflicts
As described in the introduction, LL(1) parsers recognize languages that have LL(1) grammars, which are a special case of context-free grammars; LL(1) parsers cannot recognize all context-free languages. The LL(1) languages are a proper subset of the LR(1) languages, which in turn are a proper subset of all context-free languages. In order for a context-free grammar to be an LL(1) grammar, certain conflicts must not arise.
Terminology
Let A be a non-terminal. FIRST(A) is defined as the set of terminals that can appear in the first position of any string derived from A. FOLLOW(A) is the union over:
- FIRST(B) where B is any non-terminal that immediately follows A in the right-hand side of a production rule.
- FOLLOW(B) where B is any head of a rule of the form B → wA.
LL(1) conflicts
There are two main types of LL(1) conflicts:
FIRST/FIRST conflict
The FIRST sets of two different grammar rules for the same non-terminal intersect.
An example of an LL(1) FIRST/FIRST conflict:
S -> E | E 'a'
E -> 'b' | ε
FIRST(E) = {b, ε} and FIRST(E 'a') = {b, a}, so when the table is drawn, there is a conflict under terminal b of production rule S.
Special case: left recursion
Left recursion will cause a FIRST/FIRST conflict with all alternatives.
E -> E '+' term | alt1 | alt2
FIRST/FOLLOW conflict
The FIRST and FOLLOW sets of a grammar rule overlap. With an empty string (ε) in the FIRST set, it is unknown which alternative to select.
An example of an LL(1) FIRST/FOLLOW conflict:
S -> A 'a' 'b'
A -> 'a' | ε
The FIRST set of A is {a, ε}, and the FOLLOW set is {a}.
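Both kinds of conflict reduce to set intersections. The sketch below checks the two example grammars, with the FIRST and FOLLOW sets worked out by hand from their rules; the function names are illustrative and the empty string stands for ε.

```python
EPS = ""  # the empty string stands for ε


def first_first_conflict(first_a, first_b):
    """Symbols on which two alternatives of the same non-terminal collide."""
    return first_a & first_b


def first_follow_conflict(first, follow):
    """If ε is in FIRST(A), any terminal in both FIRST(A) and FOLLOW(A) is ambiguous."""
    return first & follow if EPS in first else set()


# S -> E | E 'a', E -> 'b' | ε: FIRST of the two alternatives of S
print(first_first_conflict({"b", EPS}, {"b", "a"}))  # {'b'}

# S -> A 'a' 'b', A -> 'a' | ε: FIRST(A) and FOLLOW(A)
print(first_follow_conflict({"a", EPS}, {"a"}))      # {'a'}
```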
Solutions to LL(1) conflicts
Left factoring
A common left-factor is "factored out".
A -> X | X Y Z
becomes
A -> X B
B -> Y Z | ε
This can be applied when two alternatives start with the same symbol, as in a FIRST/FIRST conflict.
Another example, using the FIRST/FIRST conflict example above:
S -> E | E 'a'
E -> 'b' | ε
becomes
S -> 'b' | ε | 'b' 'a' | 'a'
then through left-factoring, becomes
S -> 'b' E | E
E -> 'a' | ε
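One step of left factoring can be sketched as a small function. It factors the longest common prefix out of the first group of alternatives that share a leading symbol; the representation (alternatives as lists of symbols, the empty list for ε) and the caller-supplied name for the new non-terminal are assumptions of this sketch.

```python
def common_prefix(sequences):
    """Longest prefix shared by all symbol sequences."""
    prefix = []
    for symbols in zip(*sequences):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return prefix


def left_factor(head, alternatives, new_head):
    """Factor out one common prefix; return the resulting rules as a dict."""
    groups = {}
    for alt in alternatives:
        groups.setdefault(tuple(alt[:1]), []).append(alt)
    for key, group in groups.items():
        if key and len(group) > 1:          # two or more alternatives share a first symbol
            prefix = common_prefix(group)
            rest = [alt for alt in alternatives if alt not in group]
            return {
                head: [prefix + [new_head]] + rest,
                new_head: [alt[len(prefix):] for alt in group],
            }
    return {head: alternatives}             # nothing to factor


# A -> X | X Y Z  becomes  A -> X B,  B -> ε | Y Z
print(left_factor("A", [["X"], ["X", "Y", "Z"]], "B"))
# {'A': [['X', 'B']], 'B': [[], ['Y', 'Z']]}
```

Repeated application, each time with a fresh non-terminal name, handles grammars in which several groups of alternatives share prefixes.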
Substitution
Substituting a rule into another rule to remove indirect or FIRST/FOLLOW conflicts.
Note that this may cause a FIRST/FIRST conflict.
Left recursion removal
For a general method, see removing left recursion.
A simple example for left recursion removal:
The following production rule has left recursion on E
E -> E '+' T
E -> T
This rule is nothing but a list of Ts separated by '+'. In regular expression form: T ('+' T)*.
So the rule could be rewritten as
E -> T Z
Z -> '+' T Z
Z -> ε
Now there is no left recursion and no conflicts on either of the rules.
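The rewrite above follows a fixed pattern: A → A α | β becomes A → β A' and A' → α A' | ε. A sketch for immediate left recursion only (indirect recursion is not handled) is given below; alternatives are lists of symbols, the empty list stands for ε, and the function name is illustrative.

```python
def remove_immediate_left_recursion(head, alternatives, new_head):
    """Rewrite A -> A α | β as A -> β A', A' -> α A' | ε."""
    # Tails α of the left-recursive alternatives A -> A α
    recursive = [alt[1:] for alt in alternatives if alt[:1] == [head]]
    # Remaining alternatives β that do not start with A
    others = [alt for alt in alternatives if alt[:1] != [head]]
    if not recursive:
        return {head: alternatives}
    return {
        head: [alt + [new_head] for alt in others],
        new_head: [alt + [new_head] for alt in recursive] + [[]],
    }


# E -> E '+' T | T  becomes  E -> T Z,  Z -> '+' T Z | ε
print(remove_immediate_left_recursion("E", [["E", "+", "T"], ["T"]], "Z"))
# {'E': [['T', 'Z']], 'Z': [['+', 'T', 'Z'], []]}
```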
However, not all context-free grammars have an equivalent LL-grammar, e.g.:
S -> A | B
A -> 'a' A 'b' | ε
B -> 'a' B 'b' 'b' | ε
It can be shown that there does not exist any LL-grammar accepting the language generated by this grammar.