Unicode compatibility characters


In Unicode and the UCS, a compatibility character is a character that is encoded solely to maintain round-trip convertibility with other, often older, standards. According to the Unicode Glossary:

A character that would not have been encoded except for compatibility and round-trip convertibility with other standards

Although the word compatibility appears in some character names, it is not itself marked as a character property, and the definition is more complicated than the glossary suggests. One of the properties the Unicode Consortium assigns to characters is a decomposition mapping, which may be a compatibility decomposition. More than five thousand characters have a compatibility decomposition that maps the compatibility character to one or more other UCS characters. By setting a character's compatibility decomposition, Unicode establishes that character as a compatibility character. The reasons for these designations vary and are discussed in further detail below. The term decomposition can be confusing because a character's decomposition is, in some cases, a singleton: the decomposition of one character is simply another, approximately equivalent character.
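The decomposition property can be inspected directly. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `decomposition_info` is an illustrative choice, not part of any standard API. It splits the raw property value into an optional keyword and the target character sequence, which makes singleton decompositions easy to spot.

```python
import unicodedata
from typing import Optional, Tuple


def decomposition_info(ch: str) -> Optional[Tuple[Optional[str], str]]:
    """Return (keyword, target) for a character's decomposition mapping.

    keyword is a string such as "<compat>" for a compatibility mapping,
    or None for a canonical mapping. The function returns None when the
    character has no decomposition mapping at all.
    """
    raw = unicodedata.decomposition(ch)  # e.g. "<fraction> 0031 2044 0034"
    if not raw:
        return None
    parts = raw.split()
    keyword = None
    if parts[0].startswith("<"):
        keyword = parts[0]
        parts = parts[1:]
    # The remaining fields are hexadecimal code points.
    target = "".join(chr(int(p, 16)) for p in parts)
    return keyword, target


if __name__ == "__main__":
    for ch in ["\u00b5", "\u00bc", "\u00e9", "A"]:
        print(f"U+{ord(ch):04X}", decomposition_info(ch))
```

Here the micro sign (U+00B5) is a singleton compatibility decomposition: it maps to the single Greek letter mu (U+03BC).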

Compatibility character types and keywords

The compatibility decomposition property for the 5,402 Unicode compatibility characters includes a keyword that divides the compatibility characters into 17 logical groups. Characters that have a decomposition mapping without a keyword are termed canonically decomposable characters, and those characters are not compatibility characters. Keywords for compatibility decomposable characters include: <initial>, <medial>, <final>, <isolated>, <wide>, <narrow>, <small>, <square>, <vertical>, <circle>, <noBreak>, <fraction>, <sub>, <super>, <font>, and <compat>. These keywords give some indication of the relation between the compatibility character and its compatibility decomposition character sequence. Compatibility characters fall into three basic categories:
  1. Characters corresponding to multiple alternate glyph forms and precomposed diacritics to support software and font implementations that do not include complete Unicode text layout capabilities.
  2. Characters included from other character sets or otherwise added to the UCS that constitute rich text rather than the plain text goals of Unicode.
  3. Some other characters that are semantically distinct, but visually similar.
Because these semantically distinct characters may be displayed with glyphs similar to the glyphs of other characters, text processing software should try to address possible confusion for the sake of end users. When comparing and collating text strings, different forms and rich text variants of characters should not alter the text processing results. For example, software users may be confused when performing a find on a page for a capital Latin letter 'I' and their software application fails to find the visually similar Roman numeral 'Ⅰ'.
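The search problem described above can be sketched with compatibility normalization. This is a minimal illustration, assuming Python's standard `unicodedata` module and the NFKC normalization form; the function name `compat_find` is hypothetical. It reports only whether a match exists, because normalization can change string lengths and so match offsets in the folded text do not map directly back to the original.

```python
import unicodedata


def compat_find(haystack: str, needle: str) -> bool:
    """Report whether needle occurs in haystack after both strings are
    folded with NFKC, so compatibility variants match their plain forms."""
    def fold(s: str) -> str:
        return unicodedata.normalize("NFKC", s)

    return fold(needle) in fold(haystack)


if __name__ == "__main__":
    page = "Chapter \u2160"  # ends with ROMAN NUMERAL ONE, not the letter I
    print("I" in page)             # plain substring search misses it
    print(compat_find(page, "I"))  # folded search finds it
```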

Compatibility mapping types

Glyph substitution and composition

Some compatibility characters are completely dispensable for text processing and display software that conforms to the Unicode standard. These include:
;Ligatures: Ligatures such as 'ffi' in the Latin script were often encoded as separate characters in legacy character sets. Unicode's approach to ligatures is to treat them as rich text and, if they are turned on, to handle them through glyph substitution.
;Precomposed Roman numerals: For example, the Roman numeral twelve 'Ⅻ' (U+216B) can be decomposed into a capital letter 'X' and two capital letter 'I' characters. Precomposed characters are in the Number Forms block.
;Precomposed fractions: These decompositions have the keyword <fraction>. A fully conforming text handler should display the precomposed fraction '¼' (U+00BC) identically to the composed fraction 1⁄4 (the digit 1, the fraction slash U+2044, and the digit 4). Precomposed characters are in the Number Forms block.
;Contextual glyphs or forms: These arise primarily in the Arabic script. Using fonts with glyph substitution capabilities such as OpenType and TrueTypeGX, Unicode conforming software can substitute the proper glyph for the same character depending on whether that character appears at the beginning, middle, or end of a word, or in isolation. Such glyph substitution is also necessary for vertical text layout for some East Asian languages. In this case glyphs must be substituted or synthesized for wide, narrow, small and square glyph forms. Non-conforming software, or software using other character sets, instead uses multiple separate characters for the same letter depending on its position, further complicating text processing.
The UCS, Unicode character properties and the Unicode algorithms provide software implementations with everything needed to properly display these characters from their decomposition equivalents. Therefore, these decomposable compatibility characters are redundant. Their existence in the character set requires extra text processing to ensure text is properly compared and collated. Moreover, these compatibility characters provide no additional or distinct semantics, nor do they provide any visually distinct rendering, provided the text layout and fonts are Unicode conforming. Also, none of these characters are required for round-trip convertibility to other character sets, since the conversion can easily map decomposed characters to precomposed counterparts in another character set. Similarly, a contextual form, such as a final Arabic letter, can be mapped based on its position within a word to the appropriate form character in the legacy character set.
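The round-trip argument can be illustrated with a small sketch. It assumes Python's standard `unicodedata` module; the `LEGACY_PRECOMPOSED` repertoire and the helper names are hypothetical and stand in for whatever precomposed characters a real legacy character set offers. Text is stored in decomposed form, and the precomposed characters are restored only at conversion time by a longest-match lookup on their compatibility decompositions.

```python
import unicodedata

# Hypothetical legacy repertoire: precomposed characters that a target
# character set is assumed to provide (ligature ffi, one quarter, Roman XII).
LEGACY_PRECOMPOSED = ["\ufb03", "\u00bc", "\u216b"]


def build_reverse_map(chars):
    """Map each character's compatibility decomposition back to it."""
    return {unicodedata.normalize("NFKD", c): c for c in chars}


def to_legacy(text, reverse_map):
    """Replace decomposed sequences with their precomposed counterparts,
    trying longer sequences first."""
    keys = sorted(reverse_map, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(reverse_map[key])
                i += len(key)
                break
        else:
            # No decomposition sequence starts here; copy the character.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    legacy = "o\ufb03ce \u00bc \u216b"
    plain = unicodedata.normalize("NFKD", legacy)
    print(plain)
    print(to_legacy(plain, build_reverse_map(LEGACY_PRECOMPOSED)) == legacy)
```

A real converter would need more context than this sketch uses: replacing every occurrence of the letters "XII" with a Roman numeral character, for instance, is only appropriate where the text actually denotes a numeral.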
In order to dispense with these compatibility characters, text software must conform to several Unicode protocols. The software must be able to:
  1. Compose diacritic marked graphemes from letter characters and one or more separate combining diacritic marks.
  2. Substitute ligatures and contextual glyph variants.
  3. Lay out CJKV text vertically, substituting glyphs for small, vertical, narrow, wide, and square forms, either from font data or synthesized as needed.
  4. Combine fractions using the fraction slash (U+2044) and any other arbitrary characters.
  5. Combine a combining long solidus overlay (U+0338 ̸) with other symbols: for example, '∃' followed by U+0338 for '∄'.
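The composition steps in this list are largely a matter for rendering software, but the canonical cases can be checked at the character level. The following sketch assumes Python's standard `unicodedata` module; the helper `canonically_equal` is an illustrative name. It shows that a base symbol followed by a combining mark is canonically equivalent to the precomposed character, whereas a precomposed fraction is only compatibility equivalent to its fraction slash sequence.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings under canonical equivalence (NFD on both sides)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    # THERE EXISTS + COMBINING LONG SOLIDUS OVERLAY vs. THERE DOES NOT EXIST
    print(canonically_equal("\u2203\u0338", "\u2204"))
    # e + COMBINING ACUTE ACCENT vs. precomposed e with acute
    print(canonically_equal("e\u0301", "\u00e9"))
    # Fraction slash sequence vs. VULGAR FRACTION ONE QUARTER:
    # related only by a <fraction> compatibility mapping, so not equal here.
    print(canonically_equal("1\u20444", "\u00bc"))
```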
Altogether, the compatibility characters included for incomplete Unicode implementations total 3,779 of the 5,402 designated compatibility characters. They include all of the compatibility characters marked with the keywords <initial>, <medial>, <final>, <isolated>, <fraction>, <wide>, <narrow>, <small>, <vertical>, and <square>, as well as nearly all of the canonical and most of the <compat> keyword compatibility characters.
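Per-keyword tallies of this kind can be reproduced from the character database. The sketch below assumes Python's standard `unicodedata` module and a hypothetical helper name, `tally_decomposition_keywords`. The totals it prints depend on the Unicode version bundled with the interpreter, so they should not be expected to match the figures quoted here exactly.

```python
import sys
import unicodedata
from collections import Counter


def tally_decomposition_keywords() -> Counter:
    """Count characters by decomposition keyword across all code points.

    Mappings without a keyword are counted under the label "canonical".
    """
    counts = Counter()
    for cp in range(sys.maxunicode + 1):
        raw = unicodedata.decomposition(chr(cp))
        if not raw:
            continue  # no decomposition mapping recorded
        first = raw.split()[0]
        counts[first if first.startswith("<") else "canonical"] += 1
    return counts


if __name__ == "__main__":
    for keyword, n in tally_decomposition_keywords().most_common():
        print(f"{keyword:12} {n}")
```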

Rich text compatibility characters

Many other compatibility characters constitute what Unicode considers rich text and are therefore outside the goals of Unicode and the UCS. In some sense even the compatibility characters discussed in the previous section (those that aid legacy software in displaying ligatures and vertical text) constitute a form of rich text, since the rich text protocols determine whether text is displayed in one way or another. However, the choice to display text with or without ligatures, or vertically rather than horizontally, is non-semantic rich text: these are simply style differences. This is in contrast to other rich text such as italics, superscripts and subscripts, or list markers, where the styling implies certain semantics.
For comparing, collating, handling and storing plain text, rich text variants are semantically redundant. For example, using a superscript character for the numeral 4 is likely indistinguishable from using the standard character for a numeral 4 and then using rich text protocols to make it superscript. Such alternate rich text characters therefore create ambiguity because they appear visually the same as their plain text counterpart characters with rich text formatting applied. These rich text compatibility characters include:
;Mathematical Alphanumeric Symbols: These symbols are simply clones of the Latin and Greek alphabets and Indic-Arabic decimal digits repeated in 15 different typefaces. They are intended as an arbitrary palette for mathematical notation. However, they tend to undermine the distinction between encoding characters and encoding visual glyphs, as well as Unicode's goal of supporting only plain text characters. Such alternate styling for a mathematical symbol palette could easily be created through rich text protocols instead.
;Enclosed alphanumerics and ideographs: These are characters included primarily for list markers. They do not constitute plain text characters. Moreover, the use of other rich text protocols is more appropriate, since the set of enclosed alphanumerics or ideographs provisioned in the UCS is limited.
;Circled alphanumerics and ideographs: The circled forms are also likely intended for use as markers. Again, using plain characters along with rich text protocols to encircle character strings is more flexible.
;Spaces and no-break spaces of varying widths: These characters are simply rich text variants of the space (U+0020) and the no-break space (U+00A0). Other rich text protocols should be used instead, such as tracking, kerning or word-spacing attributes.
;Some subscript and superscript form characters: Many of the subscript and superscript characters are actually semantically distinct characters from the International Phonetic Alphabet and other writing systems and do not really fall in the category of rich text. However, others simply constitute rich text presentation forms of other Greek, Latin and numeral characters. These rich text superscript and subscript characters therefore properly belong to this category of rich text compatibility characters. Most of these are in the "Superscripts and Subscripts" or the "Latin-1 Supplement" blocks.
For all of these rich text compatibility characters, the displayed glyphs are typically distinct from those of their compatibility decomposition characters. However, they are considered compatibility characters, and the Unicode Consortium discourages their use, because they are not plain text characters, which is what Unicode seeks to support with its UCS and associated protocols. Rich text should be handled through non-Unicode protocols such as HTML, CSS and RTF.
The rich text compatibility characters comprise 1,451 of the 5,402 compatibility characters. These include all of the compatibility characters marked with the keywords <circle> and <font>; 11 space variants from the <compat> and canonical characters; and some of the <super> and <sub> keyword characters from the "Superscripts and Subscripts" block.
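The ambiguity created by rich text variants can be made concrete. The following sketch assumes Python's standard `unicodedata` module; `styling_lost` is a hypothetical helper name. It folds a string to its plain text counterpart with NFKC and reports which characters carried a decomposition keyword, that is, which styling distinctions disappear in the folded form.

```python
import unicodedata


def styling_lost(text: str):
    """Return (plain, lost): the NFKC-folded text, and a list of
    (character, keyword) pairs for characters whose compatibility
    keyword indicates styling that the folded text no longer records."""
    lost = []
    for ch in text:
        raw = unicodedata.decomposition(ch)
        if raw.startswith("<"):
            lost.append((ch, raw.split()[0]))
    return unicodedata.normalize("NFKC", text), lost


if __name__ == "__main__":
    samples = [
        "x\u2074",      # x followed by SUPERSCRIPT FOUR
        "\U0001d400",   # MATHEMATICAL BOLD CAPITAL A
        "\u2463",       # CIRCLED DIGIT FOUR
        "a\u2003b",     # EM SPACE between two letters
    ]
    for s in samples:
        print(styling_lost(s))
```

After folding, a superscript four and an ordinary four compare equal, which is the desired behaviour for searching and collation but means that any intended styling has to be carried by a rich text protocol instead.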