Overlapping markup
In markup languages and the digital humanities, overlap occurs when a document has two or more structures that interact in a non-hierarchical manner.
A document with overlapping markup cannot be represented as a tree.
This is also known as concurrent markup.
Overlap happens, for instance, in poetry, where there may be a metrical structure of feet and lines; a linguistic structure of sentences and quotations; and a physical structure of volumes and pages and editorial annotations.
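As a minimal sketch of what interacting "in a non-hierarchical manner" means, two structures can be modelled as lists of half-open character spans. A pair of spans overlaps when the spans intersect but neither contains the other. The function names and the offsets below are illustrative assumptions, not part of any markup standard.

```python
def properly_nested(a, b):
    """True if spans a and b are disjoint or one contains the other."""
    (a_start, a_end), (b_start, b_end) = a, b
    disjoint = a_end <= b_start or b_end <= a_start
    a_in_b = b_start <= a_start and a_end <= b_end
    b_in_a = a_start <= b_start and b_end <= a_end
    return disjoint or a_in_b or b_in_a


def find_overlaps(spans_x, spans_y):
    """Return every pair of spans that cannot coexist in a single tree."""
    return [(x, y) for x in spans_x for y in spans_y
            if not properly_nested(x, y)]


# Hypothetical offsets: two lines of verse and two sentences over 20 characters.
lines = [(0, 10), (10, 20)]
sentences = [(0, 15), (15, 20)]

print(find_overlaps(lines, sentences))
```

In this toy input the second line and the first sentence share characters 10 to 15 without either containing the other, so no single tree can hold both structures.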
History
The problem of non-hierarchical structures in documents has been recognised since 1988; resolving it against the dominant paradigm of text as a single hierarchy was initially thought to be merely a technical issue, but has, in fact, proven much more difficult. In 2008, Jeni Tennison identified markup overlap as "the main remaining problem area for markup technologists".
As of 2019, markup overlap remained a primary issue in the digital study of theological texts, and it is a major reason the field retains specialised markup formats (the Open Scripture Information Standard and the Theological Markup Language) rather than the inter-operable Text Encoding Initiative-based formats common to the rest of the digital humanities.
Properties and types
A distinction exists between schemes that allow non-contiguous overlap, and those that allow only contiguous overlap. Often, 'markup overlap' strictly means the latter. Contiguous overlap can always be represented as a linear document with milestones, without the need for fragmenting a component into multiple physical ones. Non-contiguous overlap may require document fragmentation. Another distinction in overlapping markup schemes is whether elements can overlap with other elements of the same kind.
A scheme may have a privileged hierarchy.
Some XML-based schemes, for example, represent one hierarchy directly in the XML document tree, and represent other, overlapping, structures by another means;
these are said to be non-privileged.
One tripartite classification of instances of overlap distinguishes: 1. "Variation of content and structure", 2. "Overlay of multiple perspectives or markup sets", and 3. "Overlap of individual start and end tags within a single markup perspective";
additionally, some apparent instances of overlap are in fact schema definition problems, which can be resolved hierarchically.
Its author contends that type 1 is best resolved by a system of multiple documents external to the markup, but that types 2 and 3 must be dealt with internally.
Approaches and implementations
Several criteria have been identified for judging solutions to the overlap problem:
- readability and maintainability,
- tool support and compatibility with XML,
- possible validation schemes, and
- ease of processing.
Some web browsers attempted to represent overlapping start and end tags with non-hierarchical Document Object Models, but this was not standardised across all browsers and was incompatible with the innately hierarchical nature of the DOM.
HTML5 defines how processors should deal with such mis-nested markup in the HTML syntax and turn it into a single hierarchy.
With XHTML and SGML-based HTML, however, mis-nested markup is a strict error and makes processing by standards-compliant systems impossible.
The HTML standard defines a paragraph concept which can cause overlap with other elements and can be non-contiguous.
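The strict treatment of mis-nested tags in XML-based syntaxes, described above, can be observed with any conforming XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` module; the helper name `is_well_formed` and the two sample strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Properly nested: <i> opens and closes inside <b>.
nested = "<p><b>bold <i>both</i></b> plain</p>"

# Mis-nested: <b> is closed while <i> is still open.
misnested = "<p><b>bold <i>both</b> italic</i></p>"

print(is_well_formed(nested))
print(is_well_formed(misnested))
```

The second string is the kind of input that the HTML syntax repairs into a single hierarchy but that an XML parser must reject outright.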
SGML, which early versions of HTML were based on, has a feature called CONCUR that allows multiple independent hierarchies to co-exist without privileging any.
DTD validation is only defined for each individual hierarchy with CONCUR. Validation across hierarchies is not defined by the standard. CONCUR cannot support self-overlap, and it interacts poorly with some of SGML's abbreviatory features.
This feature has been poorly supported by tools and has seen very little actual use;
using CONCUR to represent document overlap was not a recommended use case, according to a commentary by the standard's editor.
Within hierarchical languages
There are several approaches to representing overlap in a non-overlapping language. The Text Encoding Initiative, as an XML-based markup scheme, cannot directly represent overlapping markup.
The TEI guidelines suggest all four of the approaches described below.
The Open Scripture Information Standard is another XML-based scheme, designed to mark up the Bible.
It uses empty milestone elements to encode non-privileged components.
To illustrate these approaches, the running example marks up the sentences and lines of a fragment of Richard III by William Shakespeare. Where there is a privileged hierarchy, it is that of the lines.
Multiple documents
Multiple documents can each provide different internally consistent hierarchies. The advantage of this approach is that each document is simple and can be processed with existing tools, but it requires maintaining redundant content, and cross-referencing between the different views can be difficult. With multiple documents, the overlap can be analysed with data comparison and delta encoding techniques, and, in an XML context, specific XML tree differencing algorithms are available. This approach has been recommended for encoding multiple variants of a single text, accepting the duplication of the parts that do not vary rather than attempting to create a structure that represents all of the variation present.
The same recommendation further suggests that this alignment be performed automatically, and that misalignment is rare in practice.
For example, one document can mark up the lines of the fragment while a second document marks up its sentences.
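A minimal sketch of the multiple-document approach, assuming hypothetical element names `text`, `line` and `sentence`: each view is an ordinary well-formed XML document, and a simple consistency check confirms that both views carry the same underlying text. The helper names are likewise assumptions.

```python
import xml.etree.ElementTree as ET

# View 1: the metrical lines (element names are hypothetical).
# "\u2014" is the dash that appears in the verse itself.
LINES_DOC = """<text><line>I, by attorney, bless thee from thy mother,</line>
<line>Who prays continually for Richmond's good.</line>
<line>So much for that.\u2014The silent hours steal on,</line>
<line>And flaky darkness breaks within the east.</line></text>"""

# View 2: the sentences of the same passage.
SENTENCES_DOC = """<text><sentence>I, by attorney, bless thee from thy mother,
Who prays continually for Richmond's good.</sentence>
<sentence>So much for that.</sentence><sentence>\u2014The silent hours steal on,
And flaky darkness breaks within the east.</sentence></text>"""


def plain_text(xml_text):
    """Return the character content of a document with all tags removed."""
    return "".join(ET.fromstring(xml_text).itertext())


def views_agree(*documents):
    """Check that every view carries exactly the same underlying text."""
    return len({plain_text(doc) for doc in documents}) == 1


print(views_agree(LINES_DOC, SENTENCES_DOC))
```

The check illustrates the maintenance cost noted above: any correction to the text has to be applied to every view, or the views drift apart.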
Milestones
Milestones are empty elements that mark the beginning and end of a component, typically using the XML ID mechanism to indicate which "begin" element goes with which "end" element. Milestones can be used to embed a non-privileged structure within a hierarchical language. In their basic form they can only represent contiguous overlap. Generic XML tools can of course parse the milestone elements, but they do not understand their special meaning and so cannot easily process or validate the non-privileged structure. Milestones have the advantage that the markup for overlapping elements is located right at the relevant boundaries, like other markup, which helps maintainability and readability. CLIX is an example of such an approach.
Example:
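A minimal sketch of the milestone approach, with lines as the privileged hierarchy and sentences marked by empty elements. The element names `sentenceStart` and `sentenceEnd` and the attributes `id` and `ref` are hypothetical, chosen only for this illustration; the code shows the extra, scheme-specific processing a generic XML tool needs in order to recover the non-privileged structure.

```python
import xml.etree.ElementTree as ET

DOC = """<text><line><sentenceStart id="s1"/>I, by attorney, bless thee from thy mother,</line>
<line>Who prays continually for Richmond's good.<sentenceEnd ref="s1"/></line>
<line><sentenceStart id="s2"/>So much for that.<sentenceEnd ref="s2"/><sentenceStart id="s3"/>\u2014The silent hours steal on,</line>
<line>And flaky darkness breaks within the east.<sentenceEnd ref="s3"/></line></text>"""

MILESTONES = ("sentenceStart", "sentenceEnd")


def events(elem):
    """Yield text chunks and milestone elements in document order."""
    if elem.tag in MILESTONES:
        yield ("milestone", elem)
    if elem.text:
        yield ("text", elem.text)
    for child in elem:
        yield from events(child)
        if child.tail:
            yield ("text", child.tail)


def sentences(root):
    """Rebuild each sentence from the text between its paired milestones."""
    open_chunks = {}  # sentence id -> text chunks seen since its start
    finished = {}
    for kind, value in events(root):
        if kind == "text":
            for chunks in open_chunks.values():
                chunks.append(value)
        elif value.tag == "sentenceStart":
            open_chunks[value.get("id")] = []
        else:
            sentence_id = value.get("ref")
            finished[sentence_id] = "".join(open_chunks.pop(sentence_id))
    return finished


result = sentences(ET.fromstring(DOC))
print(result["s2"])
```

Because the start and end points are only paired by identifier, nothing in the XML grammar itself prevents an unmatched or misplaced milestone; checks of that kind would have to be added on top of this sketch.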
Punctuation and spaces have been identified as a type of milestone-style 'crypto-overlap' or 'pseudo-markup', as the boundaries of words, clauses, sentences and the like do not necessarily align with the formal markup boundaries hierarchically.
It is also possible to use more complex milestones to represent non-contiguous structures. TAGML's "suspend" and "resume" semantics, for example, can be expressed by adding an attribute to each milestone that indicates whether it represents a start, suspend, resume, or end point. Re-ordering and even self-overlap can be achieved similarly, by annotating each milestone with a "next chunk" reference.
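The start, suspend, resume and end points described above can be sketched as follows. The element name `m`, the attributes `ref` and `point`, and the sample sentence (an interrupted quotation invented for this illustration) are all assumptions; the sketch also assumes a flat document in which every milestone is a direct child of the root.

```python
import xml.etree.ElementTree as ET

# A quotation "q1" interrupted by the narrator's words.
DOC = (
    '<text>'
    '<m ref="q1" point="start"/>I am, '
    '<m ref="q1" point="suspend"/>he said, '
    '<m ref="q1" point="resume"/>tired.'
    '<m ref="q1" point="end"/>'
    '</text>'
)


def collect_discontinuous(root):
    """Gather the text of each component, skipping its suspended stretches."""
    active = set()  # components currently "switched on"
    chunks = {}     # component id -> collected text chunks
    for milestone in root:
        ref, point = milestone.get("ref"), milestone.get("point")
        if point in ("start", "resume"):
            active.add(ref)
            chunks.setdefault(ref, [])
        else:  # "suspend" or "end"
            active.discard(ref)
        # In a flat document, the text after a milestone is its tail.
        if milestone.tail:
            for open_ref in active:
                chunks[open_ref].append(milestone.tail)
    return {ref: "".join(parts) for ref, parts in chunks.items()}


print(collect_discontinuous(ET.fromstring(DOC)))
```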
Joins
Joins are pointers within a privileged hierarchy to other components of the privileged hierarchy, which may be used to reconstruct a non-privileged component akin to following a linked list. A single non-privileged element is segmented into several partial elements within the privileged hierarchy; the partial elements themselves do not represent a single unit in the non-privileged hierarchy, which can be misleading and make processing difficult. While this approach can support some discontiguous structures, it is not able to re-order elements. A slightly different approach can, however, express re-ordering by expressing the join away from the content, at the cost of directness and maintainability. Join-based representations can introduce the possibility of cycles between elements; detecting and rejecting these adds complexity to implementations.
Example:
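A minimal sketch of the join approach, again with lines as the privileged hierarchy. Each sentence is split into partial `seg` elements linked by a `next` attribute; these names, and the helper `join_sentences`, are hypothetical. The function follows each chain like a linked list and rejects chains that revisit a segment.

```python
import xml.etree.ElementTree as ET

DOC = """<text>
<line><seg id="a" next="b">I, by attorney, bless thee from thy mother,</seg></line>
<line><seg id="b">Who prays continually for Richmond's good.</seg></line>
<line><seg id="c">So much for that.</seg><seg id="d" next="e">\u2014The silent hours steal on,</seg></line>
<line><seg id="e">And flaky darkness breaks within the east.</seg></line>
</text>"""


def join_sentences(root):
    """Reassemble sentences by following the 'next' pointers between segments."""
    segs = {seg.get("id"): seg for seg in root.iter("seg")}
    targets = {seg.get("next") for seg in segs.values()}
    visited = set()
    sentences = []
    for seg_id in segs:
        if seg_id in targets:
            continue  # not the head of a chain
        parts = []
        current = seg_id
        while current is not None:
            if current in visited:
                raise ValueError("segment reached twice: " + current)
            visited.add(current)
            parts.append(segs[current].text or "")
            current = segs[current].get("next")
        sentences.append(" ".join(parts))
    if len(visited) != len(segs):
        # Segments never reached from any head can only form a closed loop.
        raise ValueError("cycle detected among joined segments")
    return sentences


joined = join_sentences(ET.fromstring(DOC))
print(len(joined))
```

The two error branches correspond to the cycle problem noted above: a loop with no head is caught by the final count, and a chain that re-enters itself is caught while it is being followed.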
Stand-off markup
Stand-off markup is similar to using joins, except that there may be no privileged hierarchy: each part of the document is given a label, and the document structure is expressed by pointing to the content from markup that 'stands off' from the content, and might contain no content itself. The TEI guidelines identify the unity of the elements as a primary advantage of stand-off markup over joins, in addition to the ability to produce and distribute annotations separately from the text, possibly even by different authors applying markup to a read-only document, allowing collaborative approaches to markup by a divide and conquer strategy.
Example:
I, by attorney, bless thee from thy mother,
Who prays continually for Richmond's good.
So much for that.—The silent hours steal on,
And flaky darkness breaks within the east.
...
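A minimal sketch of stand-off annotation over this text, assuming character offsets as the pointing mechanism (real schemes may instead point at labelled spans or tokens). The base text is never modified; the line layer and the sentence layer are kept as separate lists of annotations. All function and field names are hypothetical.

```python
# The base text is treated as read-only; annotations only point into it.
# "\u2014" is the dash that appears in the verse itself.
TEXT = (
    "I, by attorney, bless thee from thy mother,\n"
    "Who prays continually for Richmond's good.\n"
    "So much for that.\u2014The silent hours steal on,\n"
    "And flaky darkness breaks within the east."
)


def line_layer(text):
    """Build one annotation per line, using half-open character offsets."""
    layer, start = [], 0
    for line in text.split("\n"):
        layer.append({"type": "line", "start": start, "end": start + len(line)})
        start += len(line) + 1  # skip the newline separator
    return layer


def anchor(text, kind, first, last):
    """Build one annotation from the opening and closing words of a span."""
    start = text.index(first)
    end = text.index(last, start) + len(last)
    return {"type": kind, "start": start, "end": end}


def resolve(text, annotation):
    """Look up the content an annotation points to."""
    return text[annotation["start"]:annotation["end"]]


LINES = line_layer(TEXT)
SENTENCES = [
    anchor(TEXT, "sentence", "I, by", "good."),
    anchor(TEXT, "sentence", "So much", "that."),
    anchor(TEXT, "sentence", "\u2014The", "east."),
]

print(resolve(TEXT, SENTENCES[1]))
```

Because neither layer is embedded in the text, the two lists could be produced and distributed independently, which is the collaborative advantage described above.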
It has been claimed that separating markup and text can result in overall simplification and increased maintainability, and by 2017, "[t]he current state of the art to linguistically annotated data is to use a graph-based representation serialized as standoff XML as a pivot format", i.e., that standoff was the most widely accepted approach to address the overlapping markup challenge.
Standoff formalisms have been the basis for an ISO standard for linguistic annotation, have been successfully applied in developing corpus management systems, and are actively being developed in the TEI. One published example of a successful stand-off annotation scheme was developed as part of a bitext natural language documentation project focused on the preservation of low-resource or endangered languages.