Compiler


In computing, a compiler is software that translates computer code written in one programming language into another language. The name "compiler" is primarily used for programs that translate source code from a high-level programming language to a low-level programming language to create an executable program.
There are many different types of compilers which produce output in different useful forms. A cross-compiler produces code for a different CPU or operating system than the one on which the cross-compiler itself runs. A bootstrap compiler is often a temporary compiler, used for compiling a more permanent or better optimized compiler for a language.
Related software includes decompilers, programs that translate from low-level languages to higher-level ones; programs that translate between high-level languages, usually called source-to-source compilers or transpilers; language rewriters, usually programs that translate the form of expressions without a change of language; and compiler-compilers, compilers that produce compilers, often in a generic and reusable way so as to be able to produce many differing compilers.
A compiler is likely to perform some or all of the following operations, often called phases: preprocessing, lexical analysis, parsing, semantic analysis, conversion of input programs to an intermediate representation, code optimization and machine specific code generation. Compilers generally implement these phases as modular components, promoting efficient design and correctness of transformations of source input to target output. Program faults caused by incorrect compiler behavior can be very difficult to track down and work around; therefore, compiler implementers invest significant effort to ensure compiler correctness.
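As a rough illustration of how such phases can be kept modular, the following Python sketch chains lexical analysis, parsing, and code generation for a toy arithmetic language. The token format, the tuple-based tree, and the stack-machine instruction names (PUSH, ADD, and so on) are assumptions made up for this example; preprocessing, semantic analysis, and optimization are omitted for brevity.

```python
import operator
import re

# Hypothetical instruction set for a toy stack machine.
OPCODES = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}
APPLY = {"ADD": operator.add, "SUB": operator.sub,
         "MUL": operator.mul, "DIV": operator.truediv}


def lex(source):
    """Lexical analysis: turn characters into (kind, value) tokens."""
    tokens = []
    for text in re.findall(r"\d+|\S", source):
        if text.isdecimal():
            tokens.append(("num", int(text)))
        elif text in "+-*/()":
            tokens.append(("op", text))
        else:
            raise SyntaxError(f"unexpected character {text!r}")
    return tokens


def parse(tokens):
    """Parsing: build a tree from the tokens, honoring operator precedence."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else ("eof", None)

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def expr():    # expr := term (("+" | "-") term)*
        node = term()
        while peek() in (("op", "+"), ("op", "-")):
            op = advance()[1]
            node = (op, node, term())
        return node

    def term():    # term := factor (("*" | "/") factor)*
        node = factor()
        while peek() in (("op", "*"), ("op", "/")):
            op = advance()[1]
            node = (op, node, factor())
        return node

    def factor():  # factor := number | "(" expr ")"
        kind, value = advance()
        if kind == "num":
            return ("num", value)
        if (kind, value) == ("op", "("):
            node = expr()
            if advance() != ("op", ")"):
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {value!r}")

    tree = expr()
    if peek()[0] != "eof":
        raise SyntaxError("trailing input")
    return tree


def generate(node):
    """Code generation: emit stack-machine instructions in postfix order."""
    if node[0] == "num":
        return [("PUSH", node[1])]
    op, left, right = node
    return generate(left) + generate(right) + [(OPCODES[op], None)]


def compile_source(source):
    """Chain the phases: each one consumes the previous phase's output."""
    return generate(parse(lex(source)))


def run(code):
    """A tiny stack machine standing in for the target hardware."""
    stack = []
    for opcode, arg in code:
        if opcode == "PUSH":
            stack.append(arg)
        else:
            right, left = stack.pop(), stack.pop()
            stack.append(APPLY[opcode](left, right))
    return stack.pop()
```

For instance, `compile_source("2 + 3 * (4 - 1)")` yields a list of seven instructions, and passing that list to `run` evaluates it to 11. Because each phase only depends on the data produced by the one before it, any phase could be replaced or tested in isolation.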

Comparison with interpreter

With respect to making source code runnable, an interpreter serves a similar function to a compiler, but via a different mechanism. An interpreter executes code without converting it to machine code. Some interpreters execute source code directly, while others execute an intermediate form such as bytecode.
A program compiled to native code tends to run faster than the same program interpreted. Environments with a bytecode intermediate form tend toward intermediate speed, while just-in-time compilation allows native execution speed at the cost of one-time startup processing.
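The bytecode intermediate form can be observed directly in some environments. The sketch below uses CPython, chosen here only as a convenient example and not mentioned above: its built-in `compile()` translates source text into a code object once, and `eval()` then has the virtual machine interpret that bytecode as many times as needed. The expression and the variable names `a` and `b` are made up for illustration.

```python
import dis

SOURCE = "(a + b) * 2"

# Translate the source text once into a code object that holds bytecode.
# This is an intermediate form, not native machine code.
code = compile(SOURCE, "<example>", "eval")


def run(a, b):
    """Interpret the already-compiled bytecode with the given bindings."""
    return eval(code, {"a": a, "b": b})


if __name__ == "__main__":
    print(run(1, 2))
    # Show the bytecode instructions; the exact listing varies between
    # Python versions, so no expected output is given here.
    dis.dis(code)
```

Calling `run(1, 2)` returns 6 without parsing the source text again, which is the main saving a bytecode form offers over interpreting source directly.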
Low-level programming languages, such as assembly and C, are typically compiled, especially when speed matters more than cross-platform support. For such languages, there are more one-to-one correspondences between the source code and the resulting machine code, making it easier for programmers to control the use of hardware.
In theory, a programming language can be used via either a compiler or an interpreter, but in practice, each language tends to be used with only one or the other. Nonetheless, it is possible to write a compiler for a language that is commonly interpreted. For example, Common Lisp can be compiled to Java bytecode, to C code, or directly to native code.

History

Theoretical computing concepts developed by scientists, mathematicians, and engineers formed the basis of modern digital computing development during World War II. Primitive binary languages evolved because digital devices only understand ones and zeros and the circuit patterns in the underlying machine architecture. In the late 1940s, assembly languages were created to offer a more workable abstraction of the computer architectures. The limited memory capacity of early computers led to substantial technical challenges when the first compilers were designed, so the compilation process had to be divided into several small programs. The front-end programs produce the analysis products used by the back-end programs to generate target code. As computer technology provided more resources, compiler designs could align better with the compilation process.
It is usually more productive for a programmer to use a high-level language, so the development of high-level languages followed naturally from the capabilities offered by digital computers. High-level languages are formal languages that are strictly defined by their syntax and semantics, which together form the high-level language architecture. Elements of these formal languages include:
  • Alphabet, any finite set of symbols;
  • String, a finite sequence of symbols;
  • Language, any set of strings on an alphabet.
The sentences in a language may be defined by a set of rules called a grammar.
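These definitions can be made concrete with a small Python sketch. The alphabet {a, b} and the one-rule grammar S -> "a" S "b" | "" are hypothetical choices for illustration only; together they define the language of strings with some number of a's followed by the same number of b's.

```python
from itertools import product

# Alphabet: a finite set of symbols (a hypothetical two-symbol example).
ALPHABET = {"a", "b"}


def is_string_over(s, alphabet):
    """A string is a finite sequence of symbols drawn from the alphabet."""
    return all(symbol in alphabet for symbol in s)


def in_language(s):
    """Return True if s can be derived from S, where S -> "a" S "b" | ""."""
    if s == "":
        return True  # rule S -> ""
    # rule S -> "a" S "b": strip one symbol from each end and recurse
    return s[0] == "a" and s[-1] == "b" and in_language(s[1:-1])


def language_up_to(max_len):
    """Enumerate the sentences of the language with length <= max_len."""
    sentences = []
    for n in range(max_len + 1):
        for symbols in product(sorted(ALPHABET), repeat=n):
            s = "".join(symbols)
            if in_language(s):
                sentences.append(s)
    return sentences
```

Here `language_up_to(4)` returns `["", "ab", "aabb"]`: of the 31 strings over the alphabet with at most four symbols, only these three are sentences of the language. The language itself is infinite, but the grammar describes it with a single finite rule.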
Backus–Naur form describes the syntax of "sentences" of a language. It was developed by John Backus and used for the syntax of Algol 60. The ideas derive from the context-free grammar concepts by linguist Noam Chomsky. "BNF and its extensions have become standard tools for describing the syntax of programming notations. In many cases, parts of compilers are generated automatically from a BNF description."
Between 1942 and 1945, Konrad Zuse designed Plankalkül, the first programming language for computers. Zuse also envisioned a Planfertigungsgerät to automatically translate the mathematical formulation of a program into machine-readable punched film stock. While no actual implementation occurred until the 1970s, Plankalkül presented concepts later seen in APL, a language for mathematical computations designed by Ken Iverson in the late 1950s.
Between 1949 and 1951, Heinz Rutishauser proposed Superplan, a high-level language and automatic translator. His ideas were later refined by Friedrich L. Bauer and Klaus Samelson.
High-level language design during the formative years of digital computing provided useful programming tools for a variety of applications:
  • FORTRAN, for engineering and science applications, is considered one of the first high-level languages to be actually implemented, and its compiler the first optimizing compiler.
  • COBOL evolved from A-0 and FLOW-MATIC to become the dominant high-level language for business applications.
  • LISP for symbolic computation.
Compiler technology evolved from the need for a strictly defined transformation of the high-level source program into a low-level target program for the digital computer. The compiler could be viewed as a front end to deal with the analysis of the source code and a back end to synthesize the analysis into the target code. Optimization between the front end and back end could produce more efficient target code.
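The front end, optimization, and back end split can be sketched using Python's standard `ast` module as a stand-in for a real compiler's internals. The three function names and the `ConstantFolder` class are hypothetical; the optimization shown, constant folding, is just one simple example of a transformation that can sit between analysis and synthesis.

```python
import ast
import operator


class ConstantFolder(ast.NodeTransformer):
    """Optimization stage: replace constant sub-expressions by their value."""

    FOLDABLE = {ast.Add: operator.add,
                ast.Sub: operator.sub,
                ast.Mult: operator.mul}

    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold the operands first
        fold = self.FOLDABLE.get(type(node.op))
        operands = (node.left, node.right)
        if fold and all(isinstance(n, ast.Constant)
                        and isinstance(n.value, (int, float))
                        for n in operands):
            value = fold(node.left.value, node.right.value)
            return ast.copy_location(ast.Constant(value=value), node)
        return node


def front_end(source):
    """Analysis: parse source text into a tree representation."""
    return ast.parse(source, mode="eval")


def optimize(tree):
    """Transform the tree without changing what it computes."""
    return ast.fix_missing_locations(ConstantFolder().visit(tree))


def back_end(tree):
    """Synthesis: generate target code (here, a CPython code object)."""
    return compile(tree, "<example>", "eval")
```

For the input `"x * (2 + 3 * 4)"`, the optimized tree multiplies `x` by the single constant 14, and evaluating `back_end(optimize(front_end("x * (2 + 3 * 4)")))` with `x` bound to 2 gives 28. CPython's own compiler performs similar folding internally; the point of the sketch is only that the optimizer works on the representation handed over by the front end and needs no knowledge of the source text or the target.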
Some early milestones in the development of compiler technology:
  • May 1952: Grace Hopper's team at Remington Rand wrote the compiler for the A-0 programming language, although the A-0 compiler functioned more as a loader or linker than the modern notion of a full compiler.
  • 1952, before September: Autocode, developed by Alick Glennie for the Manchester Mark I computer at the University of Manchester, is considered by some to be the first compiled programming language.
  • 1954–1957: A team led by John Backus at IBM developed FORTRAN, which is usually considered the first high-level language. In 1957, they completed a FORTRAN compiler that is generally credited as the first unambiguously complete compiler.
  • 1959: The Conference on Data Systems Languages (CODASYL) initiated development of COBOL. The COBOL design drew on A-0 and FLOW-MATIC. By the early 1960s, COBOL was compiled on multiple architectures.
  • 1958–1960: ALGOL 58 was the precursor to ALGOL 60. It introduced code blocks, a key advance in the rise of structured programming. ALGOL 60 was the first language to implement nested function definitions with lexical scope. It included recursion. Its syntax was defined using BNF. ALGOL 60 inspired many languages that followed it. Tony Hoare remarked: "... it was not only an improvement on its predecessors but also on nearly all its successors."
  • 1958–1962: John McCarthy at MIT designed LISP. The symbol processing capabilities provided useful features for artificial intelligence research. In 1962, the LISP 1.5 release noted some tools: an interpreter written by Stephen Russell and Daniel J. Edwards, and a compiler and assembler written by Tim Hart and Mike Levin.
Early operating systems and software were written in assembly language. In the 1960s and early 1970s, the use of high-level languages for system programming was still controversial due to resource limitations. However, several research and industry efforts began the shift toward high-level systems programming languages, for example, BCPL, BLISS, B, and C.
BCPL, designed in 1966 by Martin Richards at the University of Cambridge, was originally developed as a compiler-writing tool. Several compilers have been implemented, and Richards' book provides insights into the language and its compiler. BCPL was not only an influential systems programming language that is still used in research but also provided a basis for the design of the B and C languages.
BLISS was developed for a Digital Equipment Corporation PDP-10 computer by W. A. Wulf's Carnegie Mellon University research team. The CMU team went on to develop the BLISS-11 compiler one year later, in 1970.
Multics, a time-sharing operating system project, involved MIT, Bell Labs, and General Electric, and was led by Fernando Corbató from MIT. Multics was written in the PL/I language developed by IBM and the IBM User Group. IBM's goal was to satisfy business, scientific, and systems programming requirements. There were other languages that could have been considered, but PL/I offered the most complete solution even though it had not been implemented. For the first few years of the Multics project, a subset of the language could be compiled to assembly language with the Early PL/I (EPL) compiler by Doug McIlroy and Bob Morris from Bell Labs. EPL supported the project until a boot-strapping compiler for the full PL/I could be developed.
Bell Labs left the Multics project in 1969 and developed a system programming language, B, based on BCPL concepts, written by Dennis Ritchie and Ken Thompson. Ritchie created a boot-strapping compiler for B and wrote the Unics operating system for a PDP-7 in B. The spelling Unics eventually became Unix.
Bell Labs started the development and expansion of C based on B and BCPL. The BCPL compiler had been transported to Multics by Bell Labs, and BCPL was a preferred language at Bell Labs. Initially, a front-end program to Bell Labs' B compiler was used while a C compiler was developed. In 1971, a new PDP-11 provided the resources to define extensions to B and rewrite the compiler. By 1973, the design of the C language was essentially complete and the Unix kernel for a PDP-11 was rewritten in C. Steve Johnson started development of the Portable C Compiler to support retargeting of C compilers to new machines.
Object-oriented programming (OOP) offered some interesting possibilities for application development and maintenance. OOP concepts go further back, having been part of LISP and Simula language science. Bell Labs became interested in OOP with the development of C++. C++ was first used in 1980 for systems programming. The initial design combined the systems programming capabilities of the C language with Simula concepts. Object-oriented facilities were added in 1983. The Cfront program implemented a C++ front end for the C84 language compiler. In subsequent years, several C++ compilers were developed as C++ popularity grew.
In many application domains, the idea of using a higher-level language quickly caught on. Because of the expanding functionality supported by newer programming languages and the increasing complexity of computer architectures, compilers became more complex.
DARPA sponsored a compiler project with Wulf's CMU research team in 1970. The Production Quality Compiler-Compiler (PQCC) design would produce a Production Quality Compiler (PQC) from formal definitions of the source language and the target. PQCC tried, without much success, to extend the term compiler-compiler beyond its traditional meaning as a parser generator. PQCC might more properly be referred to as a compiler generator.
PQCC research into the code generation process sought to build a truly automatic compiler-writing system. The effort discovered and designed the phase structure of the PQC. The BLISS-11 compiler provided the initial structure. The phases included analyses, intermediate translation to a virtual machine, and translation to the target. TCOL was developed for the PQCC research to handle language-specific constructs in the intermediate representation. Variations of TCOL supported various languages. The PQCC project investigated techniques of automated compiler construction. The design concepts proved useful in optimizing compilers and compilers for the programming language Ada.
The Ada STONEMAN document formalized the Ada program support environment (APSE) along with its kernel (KAPSE) and minimal (MAPSE) forms. An Ada interpreter, NYU/ED, supported development and standardization efforts with the American National Standards Institute and the International Organization for Standardization. Initial Ada compiler development by the U.S. Military Services included the compilers in a complete integrated design environment along the lines of the STONEMAN document. The Army and Navy worked on the Ada Language System project targeted to the DEC/VAX architecture, while the Air Force started on the Ada Integrated Environment targeted to the IBM 370 series. While the projects did not provide the desired results, they did contribute to the overall effort on Ada development.
Other Ada compiler efforts got underway in Britain at the University of York and in Germany at the University of Karlsruhe. In the U.S., Verdix delivered the Verdix Ada Development System (VADS) to the Army. VADS provided a set of development tools including a compiler. Unix/VADS could be hosted on a variety of Unix platforms such as DEC Ultrix and the Sun 3/60 Solaris targeted to Motorola 68020 in an Army CECOM evaluation. There were soon many Ada compilers available that passed the Ada Validation tests. The Free Software Foundation GNU project developed the GNU Compiler Collection (GCC), which provides a core capability to support multiple languages and targets. The Ada version, GNAT, is one of the most widely used Ada compilers. GNAT is free, but commercial support is also available; for example, AdaCore was founded in 1994 to provide commercial software solutions for Ada. GNAT Pro includes the GCC-based GNAT with a tool suite to provide an integrated development environment.
High-level languages continued to drive compiler research and development. Focus areas included optimization and automatic code generation. Trends in programming languages and development environments influenced compiler technology. More compilers were included in language distributions and as components of an IDE. The interrelationship and interdependence of technologies grew. The advent of web services promoted the growth of web languages and scripting languages. Scripts trace back to the early days of command-line interfaces, where the user could enter commands to be executed by the system. User shell concepts developed alongside languages for writing shell programs. Early Windows designs offered a simple batch programming capability. These languages were conventionally processed by an interpreter. Bash and Batch compilers have been written, although they are not widely used. More recently, sophisticated interpreted languages became part of the developer's toolkit. Modern scripting languages include PHP, Python, Ruby, and Lua. All of these have interpreter and compiler support.
"When the field of compiling began in the late 50s, its focus was limited to the translation of high-level language programs into machine code... The compiler field is increasingly intertwined with other disciplines including computer architecture, programming languages, formal methods, software engineering, and computer security." The "Compiler Research: The Next 50 Years" article noted the importance of object-oriented languages and Java. Security and parallel computing were cited among the future research targets.