Markup language
A markup language is a text-encoding system which specifies the structure and formatting of a document and potentially the relationships among its parts. Markup can control the display of a document or enrich its content to facilitate automated processing.
A markup language is a set of rules governing what markup information may be included in a document and how it is combined with the content of the document in a way that facilitates use by humans and computer programs. The idea and terminology evolved from the practice of marking up paper manuscripts, in which revision instructions were traditionally written on authors' manuscripts with a red pen or blue pencil.
Older markup languages, which typically focus on typesetting and presentation, include troff, TeX, and LaTeX. Scribe and most modern markup languages, such as XML, identify document components, with the expectation that technology, such as stylesheets, will be used to apply formatting or other processing.
Some markup languages, such as the widely used HTML, have pre-defined presentation semantics, meaning that their specifications prescribe some aspects of how to present the structured data on particular media. HTML, like DocBook, Open eBook, JATS, and many others, is based on the markup metalanguages XML and SGML. That is, SGML and XML allow designers to specify particular schemas, which determine which elements, attributes, and other features are permitted, and where.
A key characteristic of most markup languages is that they allow combining markup with content such as text and pictures. For example, if a few words in a sentence need to be emphasized, or identified as a proper name, defined term, or another special item, the markup may be inserted between the characters of the sentence.
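As a minimal sketch of markup inserted between the characters of a sentence, the following Python snippet parses one marked-up sentence with the standard library's `xml.etree.ElementTree`. The element names `<s>`, `<name>`, and `<em>` are invented for illustration and do not belong to any particular markup language.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence: <s>, <name>, and <em> are illustrative element names.
SENTENCE = "<s>Knuth created <name>TeX</name> to typeset <em>mathematical</em> books.</s>"

root = ET.fromstring(SENTENCE)

# Dropping the markup recovers the plain sentence.
plain_text = "".join(root.itertext())

# The markup identifies which runs of characters are special, and how.
marked_spans = [(child.tag, child.text) for child in root]

print(plain_text)
print(marked_spans)
```

The content and the markup live in the same stream of characters, yet a program can separate them again: `plain_text` holds the sentence alone, while `marked_spans` lists each labeled run.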
Etymology
The word markup is derived from the traditional publishing practice of marking up a manuscript, which involves adding handwritten annotations, in the form of conventional symbolic printer's instructions, in the margins and text of a paper or printed manuscript.
For centuries, this task was done primarily by skilled typographers known as markup men or markers who marked up text to indicate what typeface, style, and size should be applied to each part, and then passed the manuscript to others for typesetting by hand or machine.
The markup was also commonly applied by editors, proofreaders, publishers, and graphic designers, and by authors themselves, all of whom might also mark things such as corrections and changes.
Types
There are three general categories of electronic markup, articulated by James Coombs, Allen Renear, and Steven DeRose in 1987, and by Tim Bray in 2003.
Presentational markup
Presentational markup is used by traditional word-processing systems. Binary codes embedded within document text produce the WYSIWYG effect. Such markup is usually hidden from human users, even authors and editors. Such systems use procedural and descriptive markup internally but convert them to present the user with formatted arrangements of type.
Procedural markup
Procedural markup is embedded in text and provides instructions for programs that process the text. Well-known examples include troff, TeX, and Markdown. Generally, software processes the text sequentially from beginning to end, following the instructions as encountered. Such text is often edited with the markup visible and directly manipulated by the author. Popular procedural markup systems usually include programming constructs, especially macros, allowing complex sets of instructions to be invoked by a simple name. This is much faster, less error-prone, and more maintenance-friendly than re-stating the same or similar instructions in many places.
Descriptive markup
Descriptive markup is specifically used to describe parts of the document for what they are, rather than how they should be processed. Well-known systems that provide many such labels include LaTeX, HTML, and XML. The objective is to decouple the structure of the document from any particular treatment or rendition of it. Such markup is often described as semantic. An example of descriptive markup is HTML's <cite> tag, which is used to label a citation. Descriptive markup (sometimes called logical markup or conceptual markup) encourages authors to write in a way that describes the material conceptually, rather than visually.
There is considerable overlap and concurrent use of markup types. In modern word-processing systems, presentational markup is often saved in descriptive-markup-oriented systems such as XML, and then processed procedurally by implementations. The programming in procedural-markup systems, such as TeX, may be used to create higher-level markup systems that are more descriptive in nature, such as LaTeX.
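The decoupling that descriptive markup aims for can be sketched in Python: one document labels a citation with `<cite>`, and two separate style tables decide how that label is rendered. The style tables, the `render` helper, and the book title are all hypothetical, and the helper handles only one level of nesting.

```python
import xml.etree.ElementTree as ET

# The document says only WHAT the span is (a citation), not how to show it.
DOC = "<p>See <cite>A Hypothetical Book</cite> for details.</p>"

# Two hypothetical "stylesheets": each maps a descriptive tag to a rendition.
TERMINAL_STYLE = {"cite": lambda text: f"_{text}_"}
SPEECH_STYLE = {"cite": lambda text: f"[title] {text} [end title]"}


def render(source, style):
    """Render a flat (one-level) marked-up paragraph using the given style table."""
    root = ET.fromstring(source)
    parts = [root.text or ""]
    for child in root:
        # Tags without a style entry fall back to their plain text.
        fmt = style.get(child.tag, lambda text: text)
        parts.append(fmt(child.text or ""))
        parts.append(child.tail or "")
    return "".join(parts)


print(render(DOC, TERMINAL_STYLE))
print(render(DOC, SPEECH_STYLE))
```

The document itself never changes between the two calls; only the style table does, which is the separation of structure from rendition described above.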
In recent years, several markup languages have been developed with ease of use as a key goal, and without input from standards organizations, aimed at allowing authors to create formatted text via web browsers, for example in wikis and web forums. These are sometimes called lightweight markup languages. Markdown, BBCode, and the markup language used by Wikipedia are examples of such languages.
History
GenCode
The first well-known public presentation of markup languages in computer text processing was made by William W. Tunnicliffe at a conference in 1967, although he preferred to call it generic coding. It can be seen as a response to the emergence of processing programs such as RUNOFF that each used their own control notation, often specific to the target typesetting device. In the 1970s, Tunnicliffe led the development of a standard called GenCode for the publishing industry. Book designer Stanley Rice published speculation along similar lines in 1970.
Brian Reid, in his 1980 dissertation at Carnegie Mellon University, developed a theory and working implementation of descriptive markup in actual use. However, IBM researcher Charles Goldfarb is more commonly considered the inventor of markup languages. Goldfarb developed the basic idea while working on a primitive document management system intended for law firms in 1969, and helped invent IBM's Generalized Markup Language later that same year. GML was first publicly disclosed in 1973.
In 1975, Goldfarb moved from Cambridge, Massachusetts to Silicon Valley and became a product planner at the IBM Almaden Research Center. There, he convinced IBM's executives to deploy GML commercially in 1978 as part of IBM's Document Composition Facility product, and it was widely used in business within a few years.
Standard Generalized Markup Language, the first standard descriptive markup language, was based on both GML and GenCode. It was the result of an International Organization for Standardization committee that was first chaired by Tunnicliffe, and which Goldfarb also worked on beginning in 1974. Goldfarb eventually became chair of the committee. SGML was first released by ISO as the ISO 8879 standard in October 1986.
troff and nroff
Some early examples of computer markup languages available outside the publishing industry can be found in typesetting tools on Unix systems such as troff and nroff. In these systems, formatting commands were inserted into the document text so that typesetting software could format the text according to the editor's specifications. Printing a document correctly was an iterative, trial-and-error process. The availability of WYSIWYG publishing software supplanted much use of these languages among casual users, though professional publishing work still uses markup to specify the non-visual structure of texts, and WYSIWYG editors now usually save documents in a markup-language-based format.
TeX
Another major publishing standard is TeX, created and refined by Donald Knuth in the 1970s and 1980s. TeX concentrated on the detailed layout of text and font descriptions to typeset mathematical books. This required Knuth to spend considerable time investigating the art of typesetting. TeX is mainly used in academia, where it is a de facto standard in many scientific disciplines. A TeX macro package known as LaTeX provides a descriptive markup system on top of TeX, and is widely used in both the scientific community and the publishing industry.
Scribe, GML, and SGML
The first language to make a clear distinction between structure and presentation was Scribe, developed by Brian Reid and described in his doctoral thesis in 1980. Scribe was revolutionary in a number of ways, introducing the idea of styles separated from the marked-up document, and a grammar that controlled the usage of descriptive elements. Scribe influenced the development of GML and later SGML, and is a direct ancestor to HTML and LaTeX.
In the early 1980s, the idea that markup should focus on the structural aspects of a document and leave the visual presentation of that structure to the interpreter led to the creation of SGML. The language was developed by a committee chaired by Goldfarb. It incorporated ideas from many different sources, including Tunnicliffe's project, GenCode. Sharon Adler, Anders Berglund, and James A. Marke were also key members of the SGML committee.
SGML specifies a syntax for including the markup in documents, as well as one for separately describing what tags are allowed, and where. This allows authors to create and use any markup they want, selecting tags that make the most sense to them and are named in their own natural languages, while also allowing automated verification. Thus, SGML is properly a metalanguage, and many markup languages are derived from it. From the late 1980s onward, most substantial new markup languages have been based on SGML, including the Text Encoding Initiative guidelines and DocBook. SGML was promulgated as the ISO 8879 standard in 1986.
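The idea of separately describing which tags are allowed, and where, can be illustrated with a toy checker in Python. This is only a sketch under simplifying assumptions: the `ALLOWED_CHILDREN` table and the `validate` function are hypothetical, the element names are invented, and a real SGML or XML schema also constrains order, repetition, and attributes, none of which is modeled here.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model: element name -> child elements allowed inside it.
ALLOWED_CHILDREN = {
    "book": {"title", "chapter"},
    "chapter": {"title", "para"},
    "title": set(),
    "para": {"emph"},
    "emph": set(),
}


def validate(element, errors=None):
    """Collect messages for elements that are unknown or appear in the wrong place."""
    if errors is None:
        errors = []
    if element.tag not in ALLOWED_CHILDREN:
        errors.append(f"unknown element <{element.tag}>")
        return errors
    for child in element:
        if child.tag not in ALLOWED_CHILDREN[element.tag]:
            errors.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        validate(child, errors)
    return errors


good = ET.fromstring(
    "<book><title>T</title><chapter><title>One</title>"
    "<para>Some <emph>text</emph>.</para></chapter></book>"
)
bad = ET.fromstring("<book><title>T</title><para>text</para></book>")

print(validate(good))
print(validate(bad))
```

Because the rules live in a table outside the document, a designer can pick different element names or nesting rules without changing the checker, and documents can be verified automatically against whichever rules apply.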
SGML found wide acceptance and use in fields with very large-scale documentation requirements. However, many found it cumbersome and difficult to learn, a side effect of a design that attempted to do too much and was too flexible. For example, SGML made end tags optional in certain contexts, because its developers thought markup would be done manually by overworked support staff who would appreciate saving keystrokes.