In computing, code generation is the process by which a compiler's code generator converts some intermediate representation of source code into a form that can be readily executed by a machine. Sophisticated compilers typically perform multiple passes over various intermediate forms. This multi-stage process is used because many algorithms for code optimization are easier to apply one at a time, or because the input to one optimization relies on the completed processing performed by another optimization. This organization also facilitates the creation of a single compiler that can target multiple architectures, as only the last of the code generation stages needs to change from target to target. The input to the code generator typically consists of a parse tree or an abstract syntax tree. The tree is converted into a linear sequence of instructions, usually in an intermediate language such as three-address code. Further stages of compilation may or may not be referred to as "code generation", depending on whether they involve a significant change in the representation of the program.
Major tasks
In addition to the basic conversion from an intermediate representation into a linear sequence of machine instructions, a typical code generator tries to optimize the generated code in some way. Tasks which are typically part of a sophisticated compiler's "code generation" phase include:
Instruction selection: which instructions to use.
Instruction scheduling: in which order to put those instructions. Scheduling is a speed optimization that can have a critical effect on pipelined machines.
Instruction selection is typically carried out by doing a recursivepostorder traversal on the abstract syntax tree, matching particular tree configurations against templates; for example, the tree W := ADD might be transformed into a linear sequence of instructions by recursively generating the sequences for t1 := X and t2 := MUL, and then emitting the instruction ADD W, t1, t2. In a compiler that uses an intermediate language, there may be two instruction selection stages—one to convert the parse tree into intermediate code, and a second phase much later to convert the intermediate code into instructions from the instruction set of the target machine. This second phase does not require a tree traversal; it can be done linearly, and typically involves a simple replacement of intermediate-language operations with their corresponding opcodes. However, if the compiler is actually a language translator, then the second code-generation phase may involve building a tree from the linear intermediate code.
The fundamental task of taking input in one language and producing output in a non-trivially different language can be understood in terms of the coretransformational operations of formal language theory. Consequently, some techniques that were originally developed for use in compilers have come to be employed in other ways as well. For example, YACC takes input in Backus-Naur form and converts it to a parser in C. Though it was originally created for automatic generation of a parser for a compiler, yacc is also often used to automate writing code that needs to be modified each time specifications are changed. Many integrated development environments support some form of automatic source code generation, often using algorithms in common with compiler code generators, although commonly less complicated.
Reflection
In general, a syntax and semantic analyzer tries to retrieve the structure of the program from the source code, while a code generator uses this structural information to produce code. In other words, the former adds information while the latter loses some of the information. One consequence of this information loss is that reflection becomes difficult or even impossible. To counter this problem, code generators often embed syntactic and semantic information in addition to the code necessary for execution.