Optimizing compiler


An optimizing compiler is a compiler designed to generate code that is optimized in aspects such as program execution time, memory usage, storage size, and power consumption. Optimization is generally implemented as a sequence of optimizing transformations, also known as compiler optimizations: algorithms that transform code to produce semantically equivalent code that is optimized for some aspect.
Optimization is limited by a number of factors. Theoretical analysis indicates that some optimization problems are NP-complete, or even undecidable. Moreover, producing perfectly optimal code is generally not possible, since optimizing for one aspect often degrades the result for another. In practice, optimization is a collection of heuristic methods for improving resource usage in typical programs.

Categorization

Local vs. global scope

Scope describes how much of the input code is considered when applying optimizations.
Local scope optimizations use information local to a basic block. Since there is no control flow within a basic block, these optimizations require minimal analysis, reducing time and storage requirements. However, no information is retained across jumps.
Global scope optimizations, also known as intra-procedural optimizations, operate on individual functions. This gives them more information to work with but often makes expensive computations necessary. Worst-case assumptions need to be made when function calls occur or global variables are accessed because little information about them is available.
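As a simplified illustration of a local-scope optimization, consider common-subexpression elimination within a single basic block. The C fragment below uses arbitrary names; the second function shows, conceptually, the code a compiler might produce.

 /* Before: the subexpression a * b appears twice inside one basic block. */
 int combine(int a, int b, int c) {
     int x = a * b + c;
     int y = a * b - c;
     return x + y;
 }

 /* After local common-subexpression elimination (conceptual result):
    the shared subexpression is evaluated once and reused. */
 int combine_cse(int a, int b, int c) {
     int t = a * b;   /* single evaluation of the shared subexpression */
     int x = t + c;
     int y = t - c;
     return x + y;
 }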

Peephole optimization

Peephole optimizations are usually performed late in the compilation process after machine code has been generated. This optimization examines a few adjacent instructions to see whether they can be replaced by a single instruction or a shorter sequence of instructions. For instance, a multiplication of a value by two might be more efficiently executed by left-shifting the value or by adding the value to itself.
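A peephole pass can be sketched as a scan over a window of adjacent instructions. The toy instruction encoding below is purely illustrative; real peephole optimizers work on the compiler's actual machine-code or intermediate representation. It rewrites the multiply-by-two case mentioned above into a shift.

 #include <stddef.h>

 /* Toy instruction encoding, for illustration only. */
 enum opcode { OP_MUL_IMM, OP_SHL_IMM, OP_OTHER };

 struct insn {
     enum opcode op;
     int dst;   /* destination register number */
     int imm;   /* immediate operand */
 };

 /* Scan adjacent instructions and replace "multiply by 2"
    with the usually cheaper "shift left by 1". */
 void peephole(struct insn *code, size_t n) {
     for (size_t i = 0; i < n; i++) {
         if (code[i].op == OP_MUL_IMM && code[i].imm == 2) {
             code[i].op = OP_SHL_IMM;
             code[i].imm = 1;
         }
     }
 }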

Inter-procedural optimization

Inter-procedural optimizations analyze all of a program's source code. The more information available, the more effective the optimizations can be. The information can be used for various optimizations, including function inlining, where a call to a function is replaced by a copy of the function body.
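Function inlining can be illustrated in C as follows (the names are arbitrary); the second function shows, conceptually, the result of replacing each call with the callee's body.

 /* Before inlining: a small helper called from another function. */
 static int square(int x) { return x * x; }

 int sum_of_squares(int a, int b) {
     return square(a) + square(b);
 }

 /* After inlining (conceptual result): the call overhead is gone and
    later passes can optimize across the former call boundary. */
 int sum_of_squares_inlined(int a, int b) {
     return (a * a) + (b * b);
 }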

Link-time optimization

Link-time optimization (LTO), or whole-program optimization, is a more general class of interprocedural optimization. During LTO, the compiler has visibility across translation units, which allows it to perform more aggressive optimizations such as cross-module inlining and devirtualization.
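The situation LTO addresses can be sketched with two hypothetical translation units, shown together below; the file names and functions are illustrative, and the -flto option mentioned in the comment is the flag GCC and Clang use to enable LTO.

 /* util.c -- one translation unit */
 int clamp(int v, int lo, int hi) {
     return v < lo ? lo : (v > hi ? hi : v);
 }

 /* main.c -- another translation unit.  Compiled separately, the
    compiler cannot see the body of clamp() here and must emit a call.
    When both files are built with link-time optimization (for example,
    -flto in GCC or Clang), the link-time pass can inline clamp()
    across the module boundary. */
 int brightness(int v) {
     return clamp(v, 0, 255);
 }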

Machine and object code optimization

Machine code optimization involves using an object code optimizer to analyze the program after all machine code has been linked. Techniques such as macro compression, which conserves space by condensing common instruction sequences, become more effective when the entire executable task image is available for analysis.

Language-independent vs. language-dependent

Most high-level programming languages share common programming constructs and abstractions, such as branching, looping, and encapsulation constructs, so similar optimization techniques can be used across languages. However, certain language features make some optimizations harder. For instance, pointers in C and C++ complicate array optimization; see alias analysis. Languages such as PL/I, which also support pointers, nevertheless provide optimizations for arrays. Conversely, some language features make certain optimizations easier. For example, in some languages functions are not permitted to have side effects. Therefore, if a program makes several calls to the same function with the same arguments, the compiler can infer that the function's result only needs to be computed once. In languages where functions are allowed to have side effects, the compiler can restrict such an optimization to functions it can determine to be side-effect free.
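As a simplified C illustration (the names are arbitrary), a compiler that can prove a function has no side effects may merge repeated calls with identical arguments:

 /* A function with no side effects: its result depends only on its
    arguments and it modifies no global state. */
 static int dist_sq(int x, int y) {
     return x * x + y * y;
 }

 int classify(int x, int y) {
     /* Two calls with identical arguments: a compiler that proves
        dist_sq is side-effect free may compute the result once, as if
        the source were
            int t = dist_sq(x, y);
            return (t > 100) + (t > 10000);                          */
     return (dist_sq(x, y) > 100) + (dist_sq(x, y) > 10000);
 }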

Machine-independent vs. machine-dependent

Many optimizations that operate on abstract programming concepts are independent of the machine targeted by the compiler, but many of the most effective optimizations are those that best exploit special features of the target platform. Examples are instructions that do several things at once, such as decrement register and branch if not zero.
The following is an instance of a local machine-dependent optimization. To set a register to 0, the obvious way is to use the constant '0' in an instruction that sets a register value to a constant. A less obvious way is to XOR a register with itself or subtract it from itself. It is up to the compiler to know which instruction variant to use. On many RISC machines, both instructions would be equally appropriate, since they would both be the same length and take the same time. On many other microprocessors such as the Intel x86 family, it turns out that the XOR variant is shorter and probably faster, as there will be no need to decode an immediate operand, nor use the internal "immediate operand register"; the same applies on IBM System/360 and successors for the subtract variant. A potential problem with this is that XOR or subtract may introduce a data dependency on the previous value of the register, causing a pipeline stall, which occurs when the processor must delay execution of an instruction because it depends on the result of a previous instruction. However, processors often treat the XOR of a register with itself or the subtract of a register from itself as a special case that does not cause stalls.
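The point can be seen from a trivial C function; the instructions in the comment are typical x86 output, not a guarantee of what any particular compiler emits.

 int zero(void) {
     return 0;
     /* Typical x86 code generation:
          xor eax, eax      ; 2 bytes, recognized as a zeroing idiom
        rather than
          mov eax, 0        ; 5 bytes, carries an immediate operand
        The XOR form is shorter, and modern processors treat
        xor-with-itself as dependency-breaking, so no stall occurs.   */
 }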

Factors affecting optimization

;Target machine: Whether particular optimizations can and should be applied may depend on the characteristics of the target machine. Some compilers such as GCC and Clang parameterize machine-dependent factors so that they can be used to optimize for different machines.
;Target CPU architecture
  • Number of registers: Registers can be used to optimize for performance. Local variables can be stored in registers instead of the stack. Temporary/intermediate results can be accessed in registers instead of slower memory.
  • RISC vs. CISC: CISC instruction sets often have variable instruction lengths, a larger number of possible instructions, and instructions whose execution times differ. RISC instruction sets attempt to limit this variability: instructions are usually of constant length, with few exceptions; there are usually fewer combinations of registers and memory operations; and the instruction issue rate is usually constant in cases where memory latency is not a factor. There may be several ways of carrying out a certain task, with CISC usually offering more alternatives than RISC; the compiler has to know the relative costs of the various instructions and choose the best instruction sequence.
  • Pipelines: A pipeline is a CPU broken up into an assembly line. It allows parts of the CPU to be used for different instructions by breaking the execution of instructions into various stages: instruction decode, address decode, memory fetch, register fetch, compute, register store, etc. One instruction could be in the register store stage while another is in the register fetch stage. Pipeline conflicts occur when an instruction in one stage of the pipeline depends on the result of another instruction ahead of it in the pipeline but not yet completed. Pipeline conflicts can lead to pipeline stalls, where the CPU wastes cycles waiting for a conflict to resolve. Compilers can schedule, or reorder, instructions so that pipeline stalls occur less frequently (see the sketch after this list).
  • Number of functional units: Some CPUs have several ALUs and FPUs that allow them to execute multiple instructions simultaneously. There may be restrictions on which instructions can pair with which other instructions, and which functional unit can execute which instruction. They also have issues similar to pipeline conflicts. Instructions can be scheduled so that the functional units are fully loaded.
;Machine architecture
  • CPU cache size and type: Techniques such as inline expansion and loop unrolling may increase the size of the generated code and reduce code locality. The program may slow down drastically if a highly used section of code no longer fits in the cache as a result of optimizations that increase code size. Also, caches that are not fully associative have higher chances of cache collisions even in an unfilled cache.
  • Cache/memory transfer rates: These give the compiler an indication of the penalty for cache misses. This is used mainly in specialized applications.
;Intended use
  • Debugging: During development, optimizations are often disabled to speed compilation or to make the executable code easier to debug. Optimizing transformations, particularly those that reorder code, can make it difficult to relate the executable code to the source code.
  • General-purpose use: Prepackaged software is often expected to run on a variety of machines that may share the same instruction set but have different performance characteristics. The code may not be optimized to any particular machine or may be tuned to work best on the most popular machine while working less optimally on others.
  • Special-purpose use: If the software is compiled for machines with uniform characteristics, then the compiler can heavily optimize the generated code for those machines.
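As a conceptual sketch of the instruction scheduling mentioned above (the arithmetic and names are arbitrary, and real scheduling operates on machine instructions rather than C statements), consider a dependency chain followed by an independent computation:

 int schedule_demo(int b, int c, int e, int g, int h) {
     int a = b + c;   /* 1: produces a                    */
     int d = a + e;   /* 2: depends on the result of 1    */
     int f = g + h;   /* 3: independent of 1 and 2        */
     return d + f;
 }
 /* A scheduler may emit the code for statement 3 between 1 and 2, so
    the pipeline has useful work while the result of 1 is still in
    flight, instead of stalling.                                      */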

Common themes

Optimization includes the following, sometimes conflicting, themes.
;Optimize the common case: The common case may have unique properties that allow a fast path at the expense of a slow path. If the fast path is taken more often, the result is better overall performance.
;Avoid redundancy: Reuse results that are already computed and store them for later use, instead of recomputing them.
;Less code: Remove unnecessary computations and intermediate values. Less work for the CPU, cache, and memory usually results in faster execution. Alternatively, in embedded systems, less code brings a lower product cost.
;Fewer jumps by using straight-line code, also called branch-free code: This keeps the code less complicated. Jumps interfere with the prefetching of instructions, thus slowing down code. Using inlining or loop unrolling can reduce branching, at the cost of increasing binary file size by the length of the repeated code. This tends to merge several basic blocks into one.
;Locality: Code and data that are accessed closely together in time should be placed close together in memory to increase spatial locality of reference.
;Exploit the memory hierarchy: Accesses to memory become increasingly expensive at each level of the memory hierarchy, so place the most commonly used items in registers first, then caches, then main memory, before going to disk.
;Parallelize: Reorder operations to allow multiple computations to happen in parallel, either at the instruction, memory, or thread level.
;More precise information is better: The more precise the information the compiler has, the better it can employ any or all of these optimization techniques.
;Runtime metrics can help: Information gathered during a test run can be used in profile-guided optimization. Information gathered at runtime, ideally with minimal overhead, can be used by a JIT compiler to dynamically improve optimization.
;Strength reduction: Replace complex, difficult, or expensive operations with simpler ones. For example, replacing division by a constant with multiplication by its reciprocal, or using induction variable analysis to replace multiplication by a loop index with addition (see the example below).
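A simplified C example of induction-variable strength reduction (the names are arbitrary); the second loop shows, conceptually, the transformed code.

 /* Before: each iteration multiplies the loop index by a stride. */
 void scale_index(int *out, int n, int stride) {
     for (int i = 0; i < n; i++)
         out[i] = i * stride;
 }

 /* After strength reduction (conceptual result): a running addition
    replaces the multiplication.                                     */
 void scale_index_reduced(int *out, int n, int stride) {
     int v = 0;
     for (int i = 0; i < n; i++) {
         out[i] = v;
         v += stride;   /* addition replaces i * stride */
     }
 }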