Control-flow graph


In computer science, a control-flow graph (CFG) is a representation, using graph notation, of all paths that might be traversed through a function during its execution. The control-flow graph was conceived by Frances E. Allen, who noted that Reese T. Prosser had earlier used boolean connectivity matrices for flow analysis.
The CFG is essential to many compiler optimizations and static-analysis tools.

Definition

A control-flow graph is a directed graph whose nodes are the basic blocks of a function and whose edges represent the possible transfers of control between them.
The exact details vary between representations. Typically, a basic block consists of a straight-line sequence of instructions or statements. Only the last instruction or statement in a block can transfer control, and control can be transferred only to the first instruction or statement of a block.
In most CFG representations, there are two specially designated blocks: the entry block, through which control enters into the function, and the exit block, through which all control exits the function. Some representations permit multiple exit blocks, especially if there are different kinds of exit, or permit the exit block to be omitted if it is not reachable.
If there is a control-flow edge from block A to block B, then A is called a predecessor of B, and B is called a successor of A.
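For concreteness, a compiler might represent a CFG with a pair of structures along the following lines; the type and field names here are illustrative rather than taken from any particular implementation:

#include <stddef.h>

struct instruction;   /* opaque here; the details depend on the IR */

/* One block of straight-line code plus its edges in the graph. */
struct basic_block {
    struct instruction  *first, *last;   /* the straight-line instruction sequence */
    struct basic_block **succs;          /* successor blocks (outgoing edges) */
    size_t               num_succs;
    struct basic_block **preds;          /* predecessor blocks (incoming edges) */
    size_t               num_preds;
};

/* A function's CFG: all of its blocks plus the designated entry and exit. */
struct cfg {
    struct basic_block **blocks;
    size_t               num_blocks;
    struct basic_block  *entry;
    struct basic_block  *exit;
};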

Examples

As a simple example, consider a C function, print_within_parentheses, that prints a string inside a pair of parentheses, skipping the string when its pointer is null.
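One possible definition, assuming the output is produced with printf, is:

#include <stdio.h>

void print_within_parentheses(const char *p)
{
    printf("(");
    if (p != NULL) {
        printf("%s", p);   /* printed only when the pointer is non-null */
    }
    printf(")");
}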

This function has three basic blocks:
  • Block 1 is the entry block. It runs from the start of the function to the end of the expression p != NULL. It ends with the if statement's conditional branch, entering block 2 if the condition is true or skipping ahead to block 3 if it is false.
  • Block 2 is the body of the if statement. It ends by unconditionally falling into block 3.
  • Block 3 is the exit block. It runs from the end of the if statement to the implicit return at the end of the function body.
The nested structure of the source program makes the basic blocks a little hard to see. As an alternative representation that makes the blocks more obvious, we can flatten that structure into a sequence of labeled blocks. To make the control flow explicit, we will require each block to end in a goto, an if/else of gotos, or a return.
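Rewritten under these conventions, and keeping the same statements, print_within_parentheses becomes the following (the labels block1 through block3 are arbitrary names for the three blocks identified above):

#include <stdio.h>

void print_within_parentheses(const char *p)
{
block1:
    printf("(");
    if (p != NULL) goto block2; else goto block3;
block2:
    printf("%s", p);
    goto block3;           /* the unconditional fall-through made explicit */
block3:
    printf(")");
    return;
}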

This is a source-level representation of the CFG. The basic blocks consist of sequences of C statements taken directly from the source program. Only the structured control flow has been eliminated.
Now consider a more complex example that includes a loop: a function print_as_ordered_tuple that prints the elements of an integer array as a comma-separated tuple enclosed in parentheses.
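A definition with that behavior, again assuming printf for output (the element type int and the parameter names values and n are illustrative), might be:

#include <stdio.h>

void print_as_ordered_tuple(const int *values, int n)
{
    printf("(");
    for (int i = 0; i < n; i++) {
        if (i > 0)
            printf(", ");          /* separator before every element but the first */
        printf("%d", values[i]);
    }
    printf(")");
}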

This function has six basic blocks, which can be seen clearly in a source-level representation of its CFG.
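Flattened under the same conventions as before (with arbitrary labels block1 through block6), it reads:

#include <stdio.h>

void print_as_ordered_tuple(const int *values, int n)
{
    int i;                 /* the declaration, raised out of the for statement */
block1:
    printf("(");
    i = 0;
    goto block2;
block2:
    if (i < n) goto block3; else goto block6;
block3:
    if (i > 0) goto block4; else goto block5;
block4:
    printf(", ");
    goto block5;
block5:
    printf("%d", values[i]);
    i = i + 1;
    goto block2;
block6:
    printf(")");
    return;
}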

This style of statement-level CFG representation is not commonly used, and it is shown here only for illustrative purposes. Not all programs can easily be represented this way, even in C. Notice how the local variable declaration has been raised to the top of the function, and its initialization in block1 has been turned into an assignment. This would cause difficulties for scope-driven features such as shadowing and variable-length arrays. Statements would also need to be significantly rewritten to deal with expression-level control flow, such as that introduced by the &&, ||, and ?: operators.
Most practical CFG representations use something other than complete source statements as the components of their basic blocks. For example, compiler frameworks such as LLVM use a CFG whose basic blocks consist of abstract instructions in static single-assignment (SSA) form as their primary intermediate representation (IR). The function above can be translated into LLVM IR along the following lines.
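The sketch below uses opaque pointers, private string constants, and hand-picked value names such as %i and %more, all chosen for readability; the exact output of a real front end would differ in detail, but the block structure mirrors the source-level CFG above:

@.lparen = private unnamed_addr constant [2 x i8] c"(\00"
@.comma  = private unnamed_addr constant [3 x i8] c", \00"
@.elem   = private unnamed_addr constant [3 x i8] c"%d\00"
@.rparen = private unnamed_addr constant [2 x i8] c")\00"

declare i32 @printf(ptr, ...)

define void @print_as_ordered_tuple(ptr %values, i32 %n) {
block1:
  call i32 (ptr, ...) @printf(ptr @.lparen)
  br label %block2

block2:
  %i = phi i32 [ 0, %block1 ], [ %i.next, %block5 ]   ; the loop counter in SSA form
  %more = icmp slt i32 %i, %n
  br i1 %more, label %block3, label %block6

block3:
  %notfirst = icmp sgt i32 %i, 0
  br i1 %notfirst, label %block4, label %block5

block4:
  call i32 (ptr, ...) @printf(ptr @.comma)
  br label %block5

block5:
  %addr = getelementptr inbounds i32, ptr %values, i32 %i
  %value = load i32, ptr %addr
  call i32 (ptr, ...) @printf(ptr @.elem, i32 %value)
  %i.next = add nsw i32 %i, 1
  br label %block2

block6:
  call i32 (ptr, ...) @printf(ptr @.rparen)
  ret void
}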

Notice how the block structure is the same in the source-level and LLVM CFGs. In both representations, the function has the same basic semantics, performing the same sequence of operations and the same control flow. Only the representation of the individual operations differs.

Construction

A very explicit control-flow graph can be obtained from a source function by placing each statement into its own basic block. If the statement is not a control-flow statement, a "fall-through" jump to the block for the next statement is added to the block. Unfortunately, this tends to create a large number of unnecessary basic blocks and fall-through jumps, which makes subsequent analysis more cumbersome and expensive. It is usually desirable that successive statements be placed in the same basic block whenever possible. Another way to say this is that, across the entire CFG, every edge A→B should have the property that either A has more than one successor or B has more than one predecessor (or both).
Such a graph can be derived from the one-statement-per-block CFG by performing an edge contraction for every edge that falsifies the predicate above, which is to say, by merging two blocks whenever the source block always jumps to the destination block and the destination block can only be jumped to by the source block. However, because of the expense of building the initial form, this contraction-based algorithm is of little practical importance except as an aid to understanding CFG construction. In typical implementations, the CFG is built directly from the program in a way that minimizes unnecessary blocks by construction, such as by scanning the program for the necessary boundaries between basic blocks.
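As an illustration of the scanning approach, the following sketch marks the instructions that must begin a new basic block (the "leaders") in a hypothetical flat instruction array; each block then extends from one leader to the instruction before the next. The struct insn here is an assumption, not any particular compiler's representation.

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flat instruction form: each instruction records whether it
   transfers control and, if so, whether it names an explicit target index. */
struct insn {
    bool   is_branch;    /* conditional or unconditional jump, or return */
    bool   has_target;   /* true when the branch names an explicit target */
    size_t target;       /* index of that target, when has_target is true */
};

/* Mark every instruction that must begin a basic block: the first
   instruction, every branch target, and every instruction that
   immediately follows a branch. */
void find_leaders(const struct insn *code, size_t n, bool *is_leader)
{
    for (size_t i = 0; i < n; i++)
        is_leader[i] = false;
    if (n > 0)
        is_leader[0] = true;
    for (size_t i = 0; i < n; i++) {
        if (!code[i].is_branch)
            continue;
        if (code[i].has_target)
            is_leader[code[i].target] = true;   /* a jump target starts a block */
        if (i + 1 < n)
            is_leader[i + 1] = true;            /* so does the fall-through point */
    }
}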

Reachability

Reachability is a graph property useful in optimization. If a block cannot be reached by any path from the entry block, then it cannot possibly be executed, in which case it is called unreachable code. Unreachable code can usually be removed from the control-flow graph without any negative consequences.
If the exit block cannot be reached by any path from the entry block, then control flow cannot leave the graph normally. This indicates the presence of either an infinite loop or, in representations that directly support these features, an abnormal exit or termination of the program. The exit block being reachable by some path does not mean that the program will necessarily reach it, and it is impossible to prove that it will be reached in a general graph; see Halting problem.
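Reachability itself is straightforward to compute by traversing the graph from the entry block. The sketch below, written over a hypothetical block structure, marks every reachable block with a depth-first search, so any block left unmarked is unreachable code.

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical block representation: each block lists its successors. */
struct block {
    struct block **succs;
    size_t         num_succs;
    bool           reached;    /* assumed to be false in every block initially */
};

/* Depth-first search from the entry block; blocks never visited are unreachable. */
void mark_reachable(struct block *b)
{
    if (b == NULL || b->reached)
        return;
    b->reached = true;
    for (size_t i = 0; i < b->num_succs; i++)
        mark_reachable(b->succs[i]);
}
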
Optimization can reveal unreachable code and infinite loops that were not apparent in the original program. For example, consider an LLVM function @double_until_odd, which repeatedly doubles its argument until the result is odd.
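A minimal version of such a function, using the value numbers %0 through %3 and the block names referred to in the discussion below (the exact instruction choices are illustrative), might be:

define void @double_until_odd(i32 %n) {
block1:
  br label %block2

block2:
  %0 = phi i32 [ %n, %block1 ], [ %1, %block2 ]   ; the current value
  %1 = mul i32 %0, 2                              ; double it
  %2 = and i32 %1, 1                              ; low bit: zero when even
  %3 = icmp eq i32 %2, 0                          ; keep looping while even
  br i1 %3, label %block2, label %block3

block3:
  ret void
}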

In this form, the loop is not an infinite loop, because there is a path that leaves it: if %3 is false, the program branches to block3 and returns. However, a numerical analysis can prove that the product of any number and 2 is even, which means that %2 is always zero and %3 can never be false.
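The function can therefore be optimized by folding the always-true branch into an unconditional branch and deleting the now-dead arithmetic, which might leave:

define void @double_until_odd(i32 %n) {
block1:
  br label %block2

block2:
  br label %block2          ; the loop now branches back unconditionally

block3:                     ; no longer reachable from the entry block
  ret void
}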

There is now an infinite loop in the control flow graph, and the exit block can no longer be reached. Note that this optimization did not change the behavior of the program: it was always true that the loop would never exit. All that has changed is that the control flow graph is more accurately reflecting that truth.

Dominance relationships

A block M dominates a block N if every path from the entry that reaches block N also has to visit block M. The entry block dominates all blocks. M is said to properly dominate N if M dominates N and they are different blocks. Furthermore, an individual statement or instruction X is said to properly dominate a statement or instruction Y if the block containing X dominates the block containing Y and, if those are the same block, X strictly precedes Y in the block.
It is said that a block M immediately dominates block N if M dominates N and there is no intervening block P such that M dominates P and P dominates N. In other words, M is the last dominator on every path from the entry block to N. Every reachable block has a unique immediate dominator, except for the entry block, which has none. In the CFG of print_as_ordered_tuple above, for example, block2 immediately dominates block3 and block6, while block3 immediately dominates block4 and block5.
The dominator tree is a directed graph representing the dominance relationships in the function. The nodes of the graph are the reachable basic blocks of the function, and there is an edge from block M to block N if M is an immediate dominator of N. Since each non-entry reachable block has a unique immediate dominator, this is a tree rooted at the entry block. The dominator tree can be calculated efficiently using Lengauer–Tarjan's algorithm.
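When the worst-case bound of Lengauer–Tarjan is not needed, the simpler iterative algorithm of Cooper, Harvey, and Kennedy is a common alternative. The sketch below assumes the reachable blocks have been numbered 0 to n-1 in reverse postorder, with the entry block as number 0, and that preds[b] lists the predecessors of block b; these conventions are assumptions of the sketch, not part of the algorithm's usual presentation.

#include <stddef.h>

#define NO_IDOM ((size_t)-1)

/* Walk up the partially built dominator tree from two blocks until the walks
   meet; the meeting point dominates both.  Relies on the fact that a block's
   immediate dominator always has a smaller reverse-postorder number. */
static size_t intersect(const size_t *idom, size_t a, size_t b)
{
    while (a != b) {
        while (a > b) a = idom[a];
        while (b > a) b = idom[b];
    }
    return a;
}

/* Iterative immediate-dominator computation in the style of Cooper, Harvey,
   and Kennedy.  On return, idom[b] is the immediate dominator of block b,
   with the entry block (block 0) recorded as its own dominator and
   unreachable blocks left as NO_IDOM. */
void compute_idoms(size_t n, size_t **preds, const size_t *num_preds, size_t *idom)
{
    for (size_t b = 0; b < n; b++)
        idom[b] = NO_IDOM;
    idom[0] = 0;

    int changed = 1;
    while (changed) {
        changed = 0;
        for (size_t b = 1; b < n; b++) {          /* reverse postorder, entry excluded */
            size_t new_idom = NO_IDOM;
            for (size_t i = 0; i < num_preds[b]; i++) {
                size_t p = preds[b][i];
                if (idom[p] == NO_IDOM)
                    continue;                      /* this predecessor is not processed yet */
                new_idom = (new_idom == NO_IDOM) ? p : intersect(idom, new_idom, p);
            }
            if (new_idom != NO_IDOM && idom[b] != new_idom) {
                idom[b] = new_idom;
                changed = 1;
            }
        }
    }
}
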
In the reverse direction, block M post-dominates block N if every path from N to the exit also has to visit block M. The exit block post-dominates all blocks. M is said to properly post-dominate N if it post-dominates N and they are different blocks. M is an immediate post-dominator of N if it post-dominates N and there is no block P such that M post-dominates P and P post-dominates N. Every block from which the exit block is reachable has a unique immediate post-dominator except for the exit block, which has none. This gives rise to a post-dominator tree, rooted at the exit block, over the blocks from which the exit block is reachable.
The standard definition of dominance includes unreachable blocks, and the standard definition of post-dominance includes blocks from which the exit block cannot be reached. However, such blocks have degenerate properties under these relationships: an unreachable block is dominated by all blocks, and a block from which the exit cannot be reached is post-dominated by all blocks. Most analyses relying on dominance or post-dominance will want to ignore such blocks, and they are not included in dominator or post-dominator trees. Some representations instead add artificial "impossible" edges, for example from a block that cannot reach the exit to the exit block, so that, at least formally, no such blocks exist.