ISO/IEC 2022


ISO/IEC 2022 Information technology—Character code structure and extension techniques, is an ISO/IEC standard in the field of character encoding. It is equivalent to the ECMA standard ECMA-35, the ANSI standard ANSI X3.41 and the Japanese Industrial Standard JIS X 0202. Originating in 1971, it was most recently revised in 1994.
ISO 2022 specifies a general structure which character encodings can conform to, dedicating particular ranges of bytes to be used for non-printing control codes for formatting and in-band instructions, rather than graphical characters. It also specifies a syntax for escape sequences, multiple-byte sequences beginning with the ESC control code, which can likewise be used for in-band instructions. Specific sets of control codes and escape sequences designed to be used with ISO 2022 include ISO/IEC 6429, portions of which are implemented by ANSI.SYS and terminal emulators.
ISO 2022 itself also defines particular control codes and escape sequences which can be used for switching between different coded character sets so as to use multiple in a single document, effectively combining them into a single stateful encoding. It is designed to be usable in both 8-bit environments and 7-bit environments.

Encodings and conformance

The ASCII character set supports the ISO Basic Latin alphabet, and does not provide good support for languages which use additional letters, or which use a different writing system altogether. Other writing systems with relatively few characters, such as Greek, Cyrillic, Arabic or Hebrew, as well as forms of the Latin script using diacritics or letters absent from the ISO Basic Latin alphabet, have historically been represented on personal computers with different 8-bit, single byte, extended ASCII encodings, which follow ASCII when the most significant bit is 0, and include additional characters for a most significant bit of 1. Some of these, such as the ISO 8859 series, conform to ISO 2022, while others such as DOS code page 437 do not, usually due to not reserving the bytes 0x80–9F for control codes.
Certain East Asian languages, specifically Chinese, Japanese, and Korean, are written using far more characters than the maximum of 256 which can be represented in a single byte, and were first represented on computers with language-specific double-byte encodings or variable-width encodings; some of these conform to ISO 2022, while others do not. Control codes in ISO 2022 are always represented with a single byte, regardless of the number of bytes used for graphical characters. CJK encodings used in 7-bit environments which use mechanisms to switch between character sets are often given names starting with "ISO-2022-", most notably ISO-2022-JP, although some other CJK encodings such as EUC-JP also make use of ISO 2022 mechanisms.
Since the first 256 code points of Unicode were taken from ISO 8859-1, Unicode inherits the concept of C0 and C1 control codes from ISO 2022, although it adds other non-printing characters besides the ISO 2022 control codes. However, Unicode transformation formats such as UTF-8 generally deviate from the ISO 2022 structure in various ways, including:
  • Using 8-bit bytes, but not representing the C1 codes in their single-byte forms specified in ISO 2022
  • Representing all characters, including control codes, with multiple bytes
  • Mixing bytes with the most significant bit set and unset within the coded representation for a single code point
ISO 2022 escape sequences do, however, exist for switching to and from UTF-8 as a "coding system different from that of ISO 2022", which are supported by certain terminal emulators such as xterm.
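The deviations listed above can be observed directly with Python's built-in codecs. The following is a minimal illustration only; the variable names are hypothetical, and the characters chosen (the C1 control U+0085, the ESC control and a Greek letter) are arbitrary examples.

```python
# U+0085 is a C1 control code. In an 8-bit ISO 2022 code such as ISO 8859-1
# it is the single byte 0x85, but UTF-8 represents it with two bytes.
nel = "\u0085"
latin1_nel = nel.encode("latin-1")
utf8_nel = nel.encode("utf-8")

# UTF-16 represents every character, including C0 controls such as ESC,
# with at least two bytes.
utf16_esc = "\x1b".encode("utf-16-be")

# UTF-16 can also mix a byte with the high bit clear (0x03) and a byte with
# the high bit set (0x91) within one code point (U+0391, Greek capital alpha).
utf16_alpha = "\u0391".encode("utf-16-be")

print(latin1_nel.hex(), utf8_nel.hex(), utf16_esc.hex(), utf16_alpha.hex())
```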

Overview

Elements

ISO/IEC 2022 specifies the following:
  • An infrastructure of multiple character sets with particular structures which may be included in a single character encoding system, including multiple graphical character sets and multiple sets of both primary (C0) and secondary (C1) control codes,
  • A format for encoding these sets, assuming that 8 bits are available per byte,
  • A format for encoding these sets in the same encoding system when only 7 bits are available per byte, and a method for transforming any conformant character data to pass through such a 7-bit environment,
  • The general structure of ANSI escape codes, and
  • Specific escape code formats for identifying individual character sets, for announcing the use of particular encoding features or subsets, and for interacting with or switching to other encoding systems.

Code versions

A specific implementation does not have to implement all of the standard; the conformance level and the supported character sets are defined by the implementation. Although many of the mechanisms defined by the ISO/IEC 2022 standard are infrequently used, several established encodings are based on a subset of the ISO/IEC 2022 system. In particular, 7-bit encoding systems using ISO/IEC 2022 mechanisms include ISO-2022-JP, which has primarily been used in Japanese-language e-mail. 8-bit encoding systems conforming to ISO/IEC 2022 include ISO/IEC 4873, which is in turn conformed to by ISO/IEC 8859, and Extended Unix Code, which is used for East Asian languages. More specialised applications of ISO 2022 include the MARC-8 encoding system used in MARC 21 library records.

Designation escape sequences

The escape sequences for switching to particular character sets or encodings are registered with the ISO-IR registry and follow the patterns defined within the standard. Character encodings making use of these escape sequences require data to be processed sequentially in a forward direction, since the correct interpretation of the data depends on previously encountered escape sequences.
Specific profiles such as ISO-2022-JP may impose extra conditions, such as that the current character set is reset to US-ASCII before the end of a line. Furthermore, the escape sequences declaring the national character sets may be absent if a specific ISO-2022-based encoding permits or requires this, and dictates that particular national character sets are to be used. For example, ISO-8859-1 states that no defining escape sequence is needed.

Multi-byte characters

To represent large character sets, ISO/IEC 2022 builds on ISO/IEC 646's property that a seven-bit character representation will normally be able to represent 94 graphic characters (in addition to the space); if only the C0 control codes are excluded, this can be expanded to 96 characters. Using two bytes, it is thus possible to represent up to 8,836 (94×94) characters; and, using three bytes, up to 830,584 (94×94×94) characters. Though the standard defines it, no registered character set uses three bytes.
For the two-byte character sets, the code point of each character is normally specified in so-called row-cell or kuten form, which comprises two numbers between 1 and 94 inclusive, specifying a row and cell of that character within the zone. For a three-byte set, an additional plane number is included at the beginning. The escape sequences do not only declare which character set is being used, but also whether the set is single-byte or multi-byte, and also whether each byte has 94 or 96 permitted values.
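The capacity figures and the row-cell form can be sketched in Python. This sketch assumes the usual convention that row and cell numbers are offset by 0x20 to give GL bytes (0x21–0x7E) or by 0xA0 to give GR bytes (0xA1–0xFE); the function name kuten_to_bytes is hypothetical, and row 16, cell 1 of JIS X 0208 is used purely as an example.

```python
TWO_BYTE_CAPACITY = 94 ** 2    # 94 rows of 94 cells
THREE_BYTE_CAPACITY = 94 ** 3  # 94 planes of 94 rows of 94 cells


def kuten_to_bytes(row: int, cell: int, gr: bool = False) -> bytes:
    """Convert a row-cell (kuten) pair to its two-byte GL or GR form."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be between 1 and 94")
    base = 0xA0 if gr else 0x20  # assumed offsets for GR and GL
    return bytes([base + row, base + cell])


# GL form, wrapped in the ISO-2022-JP designations for JIS X 0208 and ASCII.
gl_form = b"\x1b$B" + kuten_to_bytes(16, 1) + b"\x1b(B"
# GR form, as used by EUC-JP.
gr_form = kuten_to_bytes(16, 1, gr=True)

print(gl_form.decode("iso2022_jp"), gr_form.decode("euc_jp"))
```

Both forms decode to the same kanji, which illustrates that the GL and GR representations of a two-byte set differ only in the high bit of each byte.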

Code structure

Notation and nomenclature

ISO/IEC 2022 coding specifies a two-layer mapping between character codes and displayed characters. Escape sequences allow any of a large registry of graphic character sets to be "designated" into one of four working sets, named G0 through G3, and shorter control sequences specify the working set that is "invoked" to interpret bytes in the stream.
Encoding byte values are often given in column-line notation, where two decimal numbers in the range 00–15 are separated by a slash. Hence, for instance, codes 2/0 through 2/15 inclusive may be referred to as "column 02". This is the notation used in the ISO/IEC 2022 / ECMA-35 standard itself. They may be described elsewhere using hexadecimal, as is often used in this article, or using the corresponding ASCII characters, although the escape sequences are actually defined in terms of byte values, and the graphic assigned to that byte value may be altered without affecting the control sequence.
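Converting between column-line notation and byte values is simple arithmetic: the column is the high four bits and the line is the low four bits. A minimal Python sketch follows; the function names are hypothetical, and the two-digit zero padding is an assumption about presentation.

```python
def to_column_line(byte: int) -> str:
    """Format a byte value in column/line notation, e.g. 0x1B -> '01/11'."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("not a byte value")
    return f"{byte >> 4:02d}/{byte & 0x0F:02d}"


def from_column_line(text: str) -> int:
    """Parse column/line notation such as '2/15' back into a byte value."""
    column, line = (int(part) for part in text.split("/"))
    if not (0 <= column <= 15 and 0 <= line <= 15):
        raise ValueError("column and line must be between 0 and 15")
    return column * 16 + line


print(to_column_line(0x1B), hex(from_column_line("2/15")))
```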
Byte values from the 7-bit ASCII graphic range (0x20–0x7F), being on the left side of a character code table, are referred to as "GL" codes, while bytes from the "high ASCII" range (0xA0–0xFF), if available, are referred to as the "GR" codes. The terms "CL" and "CR" are defined for the control ranges, but the CL range always invokes the primary controls, whereas the CR range always either invokes the secondary controls or is unused.

Fixed coded characters

The delete character DEL, the escape character ESC and the space character SP are designated "fixed" coded characters and are always available when G0 is invoked over GL, irrespective of what character sets are designated. They may not be included in graphical character sets, although other sizes or types of whitespace character may be.

General syntax of escape sequences

Sequences using the ESC character take the form ESC [I...] F, where the ESC character is followed by zero or more intermediate bytes (I) from the range 0x20–0x2F, and one final byte (F) from the range 0x30–0x7E.
The first I byte, or absence thereof, determines the type of escape sequence; it might, for instance, designate a working set, or denote a single control function. In all types of escape sequences, F bytes in the range 0x30–0x3F are reserved for unregistered private uses defined by prior agreement between parties.
Control functions from some sets may make use of further bytes following the escape sequence proper. For example, the ISO 6429 control function "Control Sequence Introducer" (CSI), which can be represented using an escape sequence, is followed by zero or more bytes in the range 0x30–0x3F, then zero or more bytes in the range 0x20–0x2F, then by a single byte in the range 0x40–0x7E, the entire sequence being called a "control sequence".
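The general syntax can be sketched as a small parser. The following Python is illustrative only: parse_escape and classify are hypothetical names, the labels Fp, Fe and Fs follow the types named in this article, and "nF" is used here as a catch-all label for any sequence with intermediate bytes.

```python
ESC = 0x1B


def parse_escape(data: bytes, start: int = 0):
    """Return (intermediates, final, end) for the escape sequence at data[start]."""
    if start >= len(data) or data[start] != ESC:
        raise ValueError("no ESC at the given position")
    pos = start + 1
    # Zero or more intermediate bytes from 0x20-0x2F.
    while pos < len(data) and 0x20 <= data[pos] <= 0x2F:
        pos += 1
    # Exactly one final byte from 0x30-0x7E.
    if pos >= len(data) or not 0x30 <= data[pos] <= 0x7E:
        raise ValueError("missing or invalid final byte")
    return data[start + 1:pos], data[pos], pos + 1


def classify(intermediates: bytes, final: int) -> str:
    """Name the escape sequence type from its intermediate and final bytes."""
    if intermediates:
        return "nF"   # type determined by the first intermediate byte
    if final <= 0x3F:
        return "Fp"   # private use
    if final <= 0x5F:
        return "Fe"   # C1 control function
    return "Fs"       # registered single control function


print(parse_escape(b"\x1b$B"), classify(b"", 0x63))
```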

Graphical character sets

Each of the four working sets G0 through G3 may be a 94-character set or a 94^n-character multi-byte set. Additionally, G1 through G3 may be a 96- or 96^n-character set.
In a 96- or 96^n-character set, the bytes 0x20 through 0x7F when GL-invoked, or 0xA0 through 0xFF when GR-invoked, are allocated to and may be used by the set. In a 94- or 94^n-character set, the bytes 0x20 and 0x7F are not used. When a 96- or 96^n-character set is invoked in the GL region, the space and delete characters are not available until a 94- or 94^n-character set is invoked in GL. 96-character sets cannot be designated to G0.
Registration of a set as a 96-character set does not necessarily mean that the 0x20/A0 and 0x7F/FF bytes are actually assigned by the set; some examples of graphical character sets which are registered as 96-sets but do not use those bytes include the G1 set of I.S. 434, the box drawing set from ISO/IEC 10367, and ISO-IR-164.

Combining characters

Characters are expected to be spacing characters, not combining characters, unless specified otherwise by the graphical set in question. ISO 2022 / ECMA-35 also recognizes the use of the backspace and carriage return control characters as means of combining otherwise spacing characters, as well as the CSI sequence "Graphic Character Combination" (GCC), which ends with the bytes 0x20 0x5F.
Use of the backspace and carriage return in this manner is permitted by ISO/IEC 646 but prohibited by ISO/IEC 4873 / ECMA-43 and by ISO/IEC 8859, on the basis that it leaves the graphical character repertoire undefined. ISO/IEC 4873 / ECMA-43 does, however, permit the use of the GCC function provided that the sequence of characters is kept the same and merely displayed in one space, rather than being over-stamped to form a character with a different meaning.

Control character sets

Control character sets are classified as "primary" or "secondary" control code sets, respectively also called "C0" and "C1" control code sets.
A C0 control set must contain the ESC control character at 0x1B, whereas a C1 control set may not contain the escape control whatsoever. Hence, they are entirely separate registrations, with a C0 set being only a C0 set and a C1 set being only a C1 set.
If codes from the C0 set of ISO 6429 / ECMA-48, i.e. the ASCII control codes, appear in the C0 set, they are required to appear at their ISO 6429 / ECMA-48 locations. Inclusion of transmission control characters in the C0 set, besides the ten included by ISO 6429 / ECMA-48, or inclusion of any of those ten in the C1 set, is also prohibited by the ISO/IEC 2022 / ECMA-35 standard.
A C0 control set is invoked over the CL range 0x00 through 0x1F, whereas a C1 control function may be invoked over the CR range 0x80 through 0x9F or by using escape sequences, but not both. Which style of C1 invocation is used must be specified in the definition of the code version. For example, ISO/IEC 4873 specifies CR bytes for the C1 controls which it uses. If necessary, which invocation is used may be communicated using announcer sequences.
In the latter case, single control functions from the C1 control code set are invoked using "type Fe" escape sequences, meaning those where the ESC control character is followed by a byte from columns 04 or 05 (that is, ESC 0x40 through ESC 0x5F).
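The two representations of a C1 control differ by a fixed offset: the second byte of the type Fe escape sequence is the CR byte minus 0x40. A minimal Python sketch, with hypothetical function names:

```python
ESC = 0x1B


def c1_to_escape(byte: int) -> bytes:
    """Convert an 8-bit C1 byte (0x80-0x9F) to its type Fe escape sequence."""
    if not 0x80 <= byte <= 0x9F:
        raise ValueError("not a CR-range byte")
    return bytes([ESC, byte - 0x40])


def escape_to_c1(seq: bytes) -> int:
    """Convert a type Fe escape sequence (ESC 0x40-0x5F) to its 8-bit C1 byte."""
    if len(seq) != 2 or seq[0] != ESC or not 0x40 <= seq[1] <= 0x5F:
        raise ValueError("not a type Fe escape sequence")
    return seq[1] + 0x40


# SS2 is 0x8E in the CR area and ESC N (1B 4E) as an escape sequence.
print(c1_to_escape(0x8E), hex(escape_to_c1(b"\x1bO")))
```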

Other control functions

Additional control functions are assigned to "type Fs" escape sequences (in the range ESC 0x60 through ESC 0x7E); these have permanently assigned meanings rather than depending on the C0 or C1 designations. Registration of control functions to type "Fs" sequences must be approved by ISO/IEC JTC 1/SC 2. Other single control functions may be registered to type "3Ft" escape sequences (beginning with ESC 0x23 and ending with a byte from 0x40 through 0x7E), although no "3Ft" sequences are currently assigned. Some of the type "Fs" functions are specified in ECMA-35, others in ECMA-48. ECMA-48 refers to these as "independent control functions".
Code   Hex    Abbr.  Name                     Effect
ESC `  1B 60  DMI    Disable manual input     Disables some or all of the manual input facilities of the device.
ESC a  1B 61  INT    Interrupt                Interrupts the current process.
ESC b  1B 62  EMI    Enable manual input      Enables the manual input facilities of the device.
ESC c  1B 63  RIS    Reset to initial state   The device's display and input subsystems revert to the state they were in when first powered on. Connections to clients are unaffected.
ESC d  1B 64  CMD    Coding method delimiter  Used when interacting with an outer coding or representation system (see below).
ESC n  1B 6E  LS2    Locking shift two        Shift function (see below).
ESC o  1B 6F  LS3    Locking shift three      Shift function (see below).
ESC |  1B 7C  LS3R   Locking shift three right  Shift function (see below).
ESC }  1B 7D  LS2R   Locking shift two right    Shift function (see below).
ESC ~  1B 7E  LS1R   Locking shift one right    Shift function (see below).

Escape sequences of type "Fp" (in the range ESC 0x30 through ESC 0x3F) or of type "3Fp" (beginning with ESC 0x23 and ending with a byte from 0x30 through 0x3F) are reserved for single private use control codes, by prior agreement between parties. Several such sequences of both types are used by DEC terminals such as the VT100, and are thus supported by terminal emulators.

Shift functions

By default, GL codes specify G0 characters and GR codes specify G1 characters; this may be otherwise specified by prior agreement. The set invoked over each area may also be modified with control codes referred to as shifts, as shown in the table below.
An 8-bit code may have GR codes specifying G1 characters, i.e. with its corresponding 7-bit code using Shift In and Shift Out to switch between the sets, although some instead have GR codes specifying G2 characters, with the corresponding 7-bit code using a single-shift code to access the second set.
The codes shown in the table below are the most common encodings of these control codes, conforming to ISO/IEC 6429. The LS2, LS3, LS1R, LS2R and LS3R shifts are registered as single control functions and are always encoded as the escape sequences listed below, whereas the others are part of a C0 or C1 control code set (SI and SO are C0 controls, while SS2 and SS3 are C1 controls), meaning that their coding and availability may vary depending on which control sets are designated: they must be present in the designated control sets if their functionality is used. The C1 controls themselves, as mentioned above, may be represented using escape sequences or 8-bit bytes, but not both.
Alternative encodings of the single-shifts as C0 control codes are available in certain control code sets. For example, SS2 and SS3 are usually available at 0x19 and 0x1D respectively in T.51 and T.61. This coding is currently recommended by ISO/IEC 2022 / ECMA-35 for applications requiring 7-bit single-byte representations of SS2 and SS3, and may also be used for SS2 only, although older code sets with SS2 at 0x1C also exist, and were mentioned as such in an earlier edition of the standard. The 0x8E and 0x8F coding of the single shifts as shown below is mandatory for ISO/IEC 4873 levels 2 and 3.
Code                                 Hex                                Abbr.     Name                            Effect
SI                                   0F                                 SI / LS0  Shift In / Locking shift zero   GL encodes G0 from now on
SO                                   0E                                 SO / LS1  Shift Out / Locking shift one   GL encodes G1 from now on
ESC n                                1B 6E                              LS2       Locking shift two               GL encodes G2 from now on
ESC o                                1B 6F                              LS3       Locking shift three             GL encodes G3 from now on
CR area: SS2; escape code: ESC N     CR area: 8E; escape code: 1B 4E    SS2       Single shift two                GL or GR encodes G2 for the immediately following character only
CR area: SS3; escape code: ESC O     CR area: 8F; escape code: 1B 4F    SS3       Single shift three              GL or GR encodes G3 for the immediately following character only
ESC ~                                1B 7E                              LS1R      Locking shift one right         GR encodes G1 from now on
ESC }                                1B 7D                              LS2R      Locking shift two right         GR encodes G2 from now on
ESC |                                1B 7C                              LS3R      Locking shift three right       GR encodes G3 from now on

Although officially considered shift codes and named accordingly, single-shift codes are not always viewed as shifts, and they may simply be viewed as prefix bytes, since they do not require the encoder to keep the currently active set as state, unlike locking shift codes. In 8-bit environments, either GL or GR, but not both, may be used as the single-shift area. This must be specified in the definition of the code version. For instance, ISO/IEC 4873 specifies GL, whereas packed EUC specifies GR. In 7-bit environments, only GL is used as the single-shift area. If necessary, which single-shift area is used may be communicated using announcer sequences.
The names "locking shift zero" and "locking shift one" refer to the same pair of C0 control characters as the names "shift in" and "shift out". However, the standard refers to them as LS0 and LS1 when they are used in 8-bit environments and as SI and SO when they are used in 7-bit environments.
The ISO/IEC 2022 / ECMA-35 standard permits, but discourages, invoking G1, G2 or G3 in both GL and GR simultaneously.
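The difference between locking and single shifts can be illustrated with a small state machine for a 7-bit stream. This Python sketch is a simplification under stated assumptions: the function name tag_working_sets is hypothetical, every working set is assumed to be single-byte, only the GL shifts from the table above are handled, and all other control codes and escape sequences are skipped.

```python
SI, SO, ESC = 0x0F, 0x0E, 0x1B
LOCKING_ESCAPES = {0x6E: 2, 0x6F: 3}  # ESC n = LS2, ESC o = LS3
SINGLE_ESCAPES = {0x4E: 2, 0x4F: 3}   # ESC N = SS2, ESC O = SS3


def tag_working_sets(data: bytes) -> list:
    """Pair each graphic byte with the index (0-3) of the working set it selects."""
    gl = 0          # G0 is invoked over GL by default
    single = None   # pending single shift, if any
    tagged = []
    i = 0
    while i < len(data):
        b = data[i]
        nxt = data[i + 1] if i + 1 < len(data) else None
        if b == SI:
            gl = 0
        elif b == SO:
            gl = 1
        elif b == ESC and nxt in LOCKING_ESCAPES:
            gl = LOCKING_ESCAPES[nxt]   # state persists until the next shift
            i += 1
        elif b == ESC and nxt in SINGLE_ESCAPES:
            single = SINGLE_ESCAPES[nxt]  # applies to one character only
            i += 1
        elif 0x21 <= b <= 0x7E:
            tagged.append((gl if single is None else single, b))
            single = None
        i += 1
    return tagged


print(tag_working_sets(b"A\x0eB\x0fC"))
```

The locking shifts change the gl state until another shift is encountered, whereas a single shift is consumed by the very next graphic byte, which is why single shifts can be treated as prefix bytes.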

Registration of graphical and control code sets

The ISO International register of coded character sets to be used with escape sequences lists graphical character sets, control code sets, single control codes and so forth which have been registered for use with ISO/IEC 2022. The procedure for registering codes and sets with the ISO-IR registry is specified by ISO/IEC 2375. Each registration receives a unique escape sequence, and a unique registry entry number to identify it. For example, the CCITT character set for Simplified Chinese is known as ISO-IR-165.
Registration of coded character sets with the ISO-IR registry identifies the documents specifying the character set or control function associated with an ISO/IEC 2022 non‑private-use escape sequence. This may be a standard document; however, registration does not create a new ISO standard, does not commit the ISO or IEC to adopt it as an international standard, and does not commit the ISO or IEC to add any of its characters to the Universal Coded Character Set.
ISO-IR registered escape sequences are also used encapsulated in a Formal Public Identifier (FPI) to identify character sets used for numeric character references in SGML. For example, an FPI of this kind can be used to identify the International Reference Version of ISO 646-1983, and the HTML 4.01 specification uses one to identify Unicode. The textual representation of the escape sequence, included in the third element of the FPI, will be recognised by SGML implementations for supported character sets.

Character set designations

Escape sequences to designate character sets take the form ESC I [I...] F. As mentioned above, the intermediate (I) bytes are from the range 0x20–0x2F, and the final (F) byte is from the range 0x30–0x7E. The first I byte (or, for a multi-byte set, the first two) identifies the type of character set and the working set it is to be designated to, whereas the F byte (and any additional I bytes) identify the character set itself, as assigned in the ISO-IR register.
Additional I bytes may be added before the F byte to extend the F byte range. This is currently only used with 94-character sets, where codes of the form ESC ( ! F have been assigned. As with other escape sequence types, F bytes in the range 0x30–0x3F are reserved for private use. However, in a graphical set designation sequence, if the second I byte (for a single-byte set) or the third I byte (for a multi-byte set) is 0x20, the set denoted is a "dynamically redefinable character set" (DRCS) defined by prior agreement, which is also considered private use. A graphical set being considered a DRCS implies that it represents a font of exact glyphs, rather than a set of abstract characters. The manner in which DRCS sets and associated fonts are transmitted, allocated and managed is not stipulated by ISO/IEC 2022 / ECMA-35 itself, although it recommends allocating them sequentially starting with F byte 0x40; however, a manner for transmitting DRCS fonts is defined within some telecommunication protocols such as World System Teletext.
There are also three special cases for multi-byte codes. The code sequences ESC $ @, ESC $ A, and ESC $ B were all registered when the contemporary version of the standard allowed multi-byte sets only in G0, so must be accepted in place of the sequences ESC $ ( @ through ESC $ ( B to designate to the G0 character set.
There are additional features for switching control character sets, but this is a single-level lookup, in that the C0 set is always invoked over CL, and the C1 set is always invoked over CR or by using escape codes. As noted above, it is required that any C0 character set include the ESC character at position 0x1B, so that further changes are possible. The control set designation sequences may also be used from within ISO/IEC 10646, in contexts where processing ANSI escape codes is appropriate, provided that each byte in the sequence is padded to the code unit size of the encoding.
A table of escape sequence bytes and the designation or other function which they perform is below.
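Code        Hex         Abbr.  Name                           Effect                                                Example
ESC SP F    1B 20 F     ACS    Announce code structure        Specifies code features used, e.g. working sets.      ESC SP L
ESC ! F     1B 21 F     CZD    C0-designate                   F selects a C0 control character set to be used.      ESC ! @
ESC " F     1B 22 F     C1D    C1-designate                   F selects a C1 control character set to be used.      ESC " C
ESC # F     1B 23 F     -      (single control function)      Reserved for single control functions (see above).    ESC # 6
ESC $ F     1B 24 F     GZDM4  G0-designate multibyte 94-set  F selects a 94^n-character set to be used for G0.     ESC $ A
ESC $ ( F   1B 24 28 F  GZDM4  G0-designate multibyte 94-set  F selects a 94^n-character set to be used for G0.     ESC $ ( D
ESC $ ) F   1B 24 29 F  G1DM4  G1-designate multibyte 94-set  F selects a 94^n-character set to be used for G1.     ESC $ ) A
ESC $ * F   1B 24 2A F  G2DM4  G2-designate multibyte 94-set  F selects a 94^n-character set to be used for G2.     ESC $ * B
ESC $ + F   1B 24 2B F  G3DM4  G3-designate multibyte 94-set  F selects a 94^n-character set to be used for G3.     ESC $ + D
ESC $ , F   1B 24 2C F  -      (not used)                     -                                                     -
ESC $ - F   1B 24 2D F  G1DM6  G1-designate multibyte 96-set  F selects a 96^n-character set to be used for G1.     ESC $ - 1
ESC $ . F   1B 24 2E F  G2DM6  G2-designate multibyte 96-set  F selects a 96^n-character set to be used for G2.     ESC $ . 2
ESC $ / F   1B 24 2F F  G3DM6  G3-designate multibyte 96-set  F selects a 96^n-character set to be used for G3.     ESC $ / 3
ESC % F     1B 25 F     DOCS   Designate other coding system  Switches coding system (see below).                   ESC % G
ESC & F     1B 26 F     IRR    Identify revised registration  Prefixes designation escape to denote revision.       ESC & @ ESC $ B
ESC ' F     1B 27 F     -      (not used)                     -                                                     -
ESC ( F     1B 28 F     GZD4   G0-designate 94-set            F selects a 94-character set to be used for G0.       ESC ( B
ESC ) F     1B 29 F     G1D4   G1-designate 94-set            F selects a 94-character set to be used for G1.       ESC ) I
ESC * F     1B 2A F     G2D4   G2-designate 94-set            F selects a 94-character set to be used for G2.       ESC * v
ESC + F     1B 2B F     G3D4   G3-designate 94-set            F selects a 94-character set to be used for G3.       ESC + D
ESC , F     1B 2C F     -      (not used)                     -                                                     -
ESC - F     1B 2D F     G1D6   G1-designate 96-set            F selects a 96-character set to be used for G1.       ESC - A
ESC . F     1B 2E F     G2D6   G2-designate 96-set            F selects a 96-character set to be used for G2.       ESC . B
ESC / F     1B 2F F     G3D6   G3-designate 96-set            F selects a 96-character set to be used for G3.       ESC / b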

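Note that the registry of F bytes is independent for the different types. The 94-character graphic set designated by ESC ( A through ESC + A is not related to the 96-character set designated by ESC - A through ESC / A, nor to the multi-byte set designated by ESC $ ( A through ESC $ + A; the final bytes must be interpreted in context.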
Also note that C0 and C1 control character sets are independent; the C0 control character set designated by ESC ! A is not the same as the C1 control character set designated by ESC " A.
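How a graphical set designation is taken apart can be sketched in Python. The sketch assumes that the intermediate bytes ( ) * + select G0 through G3 for 94-sets and - . / select G1 through G3 for 96-sets, with a preceding $ marking a multi-byte set; parse_designation is a hypothetical name, and sequences with extra intermediate bytes (including DRCS designations) are rejected rather than handled.

```python
ESC = 0x1B


def parse_designation(seq: bytes):
    """Decode a designation into (working set, set size, multi-byte?, final byte)."""
    if len(seq) < 3 or seq[0] != ESC:
        raise ValueError("not a designation sequence")
    body = seq[1:]
    multibyte = body[0] == 0x24  # '$' introduces a multi-byte set
    if multibyte:
        body = body[1:]
    # Legacy forms ESC $ @, ESC $ A and ESC $ B designate to G0.
    if multibyte and len(body) == 1 and body[0] in (0x40, 0x41, 0x42):
        return 0, 94, True, body[0]
    if len(body) != 2:
        raise ValueError("unsupported number of intermediate bytes")
    intro, final = body
    if 0x28 <= intro <= 0x2B:      # ( ) * +  -> G0..G3, 94-set
        working_set, size = intro - 0x28, 94
    elif 0x2D <= intro <= 0x2F:    # - . /    -> G1..G3, 96-set
        working_set, size = intro - 0x2C, 96
    else:
        raise ValueError("not a graphical set designation")
    if not 0x30 <= final <= 0x7E:
        raise ValueError("invalid final byte")
    return working_set, size, multibyte, final


print(parse_designation(b"\x1b$)C"))
```

The returned final byte only identifies a set in combination with the other three values, reflecting the independence of the F byte registries for the different designation types.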

Interaction with other coding systems

The standard also defines a way to specify coding systems that do not follow its own structure.
A sequence is also defined for returning to ISO/IEC 2022; the registrations which support this sequence as encoded in ISO/IEC 2022 comprise various Videotex formats, UTF-8, and UTF-1. A second I byte of 0x2F (/) is included in the designation sequences of codes which do not use that byte sequence to return to ISO 2022; they may have their own means to return to ISO 2022 or none at all. All existing registrations of the latter type are either transparent raw data, Unicode/UCS formats, or subsets thereof.
Code       Hex         Abbr.  Name                           Effect
ESC % @    1B 25 40    DOCS   Designate other coding system  Return to ISO/IEC 2022 from another encoding.
ESC % F    1B 25 F     DOCS   Designate other coding system  F selects an 8-bit code; use ESC % @ to return.
ESC % / F  1B 25 2F F  DOCS   Designate other coding system  F selects an 8-bit code; there is no standard way to return.
ESC d      1B 64       CMD    Coding method delimiter        Denotes the end of an ISO/IEC 2022 coded sequence.

Of particular interest are the sequences which switch to ISO/IEC 10646 formats which do not follow the ISO/IEC 2022 structure. These include UTF-8, its predecessor UTF-1, and UTF-16 and UTF-32.
Several codes were also registered for subsets of UTF-8, UTF-16 and UTF-32, as well as for three levels of UCS-2. However, the only codes currently specified by ISO/IEC 10646 are the level-3 codes for UTF-8, UTF-16 and UTF-32 and the unspecified-level code for UTF-8, with the rest being listed as deprecated. ISO/IEC 10646 stipulates that the big-endian formats of UTF-16 and UTF-32 are designated by their escape sequences.
Of the sequences switching to UTF-8, ESC % G is the one supported by, for example, xterm.
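A stream which switches into UTF-8 with ESC % G and back with ESC % @ can be split into segments as in the following Python sketch. The function names embed_utf8 and split_segments are hypothetical, the segment labels are arbitrary, and the sketch assumes that no other coding system is designated within the stream.

```python
UTF8_ON = b"\x1b%G"   # designate UTF-8 (with standard return)
RETURN = b"\x1b%@"    # return to ISO/IEC 2022


def embed_utf8(text: str) -> bytes:
    """Wrap text as a UTF-8 segment inside an ISO 2022 stream."""
    return UTF8_ON + text.encode("utf-8") + RETURN


def split_segments(data: bytes) -> list:
    """Split a stream into ("iso2022", bytes) and ("utf-8", str) segments."""
    segments = []
    pos = 0
    while pos < len(data):
        start = data.find(UTF8_ON, pos)
        if start == -1:
            segments.append(("iso2022", data[pos:]))
            break
        if start > pos:
            segments.append(("iso2022", data[pos:start]))
        payload_start = start + len(UTF8_ON)
        end = data.find(RETURN, payload_start)
        if end == -1:
            # No return sequence: the rest of the stream is UTF-8.
            payload_end, pos = len(data), len(data)
        else:
            payload_end, pos = end, end + len(RETURN)
        segments.append(("utf-8", data[payload_start:payload_end].decode("utf-8")))
    return segments


print(split_segments(b"abc" + embed_utf8("\u00e9") + b"def"))
```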
Although use of a variant of the standard return sequence from UTF-16 and UTF-32 is permitted, the bytes of the escape sequence must be padded to the size of the code unit of the encoding, i.e. the coding of the standard return sequence does not conform exactly to ISO/IEC 2022. For this reason, the designations for UTF-16 and UTF-32 use a without-standard-return syntax.
For specifying encodings by labels, the X Consortium's Compound Text format defines five private-use DOCS sequences.

Code structure announcements

The sequence "announce code structure" is used to announce a specific code structure, or a specific group of ISO 2022 facilities which are used in a particular code version. Although announcements can be combined, certain contradictory combinations are prohibited by the standard, as is using additional announcements on top of ISO/IEC 4873 level announcements 12–14. Announcement sequences are as follows:
Number  Code      Hex       Code version feature announced
1       ESC SP A  1B 20 41  G0 in GL, GR absent or unused, no locking shifts.
2       ESC SP B  1B 20 42  G0 and G1 invoked to GL by locking shifts, GR absent or unused.
3       ESC SP C  1B 20 43  G0 in GL, G1 in GR, no locking shifts, requires an 8-bit environment.
4       ESC SP D  1B 20 44  G0 in GL, G1 in GR if 8-bit, no locking shifts unless in a 7-bit environment.
5       ESC SP E  1B 20 45  Shift functions preserved during 7-bit/8-bit conversion.
6       ESC SP F  1B 20 46  C1 controls using escape sequences.
7       ESC SP G  1B 20 47  C1 controls in CR region in 8-bit environments, as escape sequences otherwise.
8       ESC SP H  1B 20 48  94-character graphical sets only.
9       ESC SP I  1B 20 49  94-character and/or 96-character graphical sets.
10      ESC SP J  1B 20 4A  Uses a 7-bit code, even if an eighth bit is available for use.
11      ESC SP K  1B 20 4B  Requires an 8-bit code.
12      ESC SP L  1B 20 4C  Complies with ISO/IEC 4873 level 1.
13      ESC SP M  1B 20 4D  Complies with ISO/IEC 4873 level 2.
14      ESC SP N  1B 20 4E  Complies with ISO/IEC 4873 level 3.
16      ESC SP P  1B 20 50  SI / LS0 used.
18      ESC SP R  1B 20 52  SO / LS1 used.
19      ESC SP S  1B 20 53  LS1R used in 8-bit environments, SO used in 7-bit environments.
20      ESC SP T  1B 20 54  LS2 used.
21      ESC SP U  1B 20 55  LS2R used in 8-bit environments, LS2 used in 7-bit environments.
22      ESC SP V  1B 20 56  LS3 used.
23      ESC SP W  1B 20 57  LS3R used in 8-bit environments, LS3 used in 7-bit environments.
26      ESC SP Z  1B 20 5A  SS2 used.
27      ESC SP [  1B 20 5B  SS3 used.
28      ESC SP \  1B 20 5C  Single-shifts invoke over GR.
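In the table above, each announcement number equals the final byte of the sequence minus 0x40. The following Python sketch relies on that observed pattern rather than on wording from the standard, and the function name announcement_number is hypothetical.

```python
def announcement_number(seq: bytes) -> int:
    """Return the announcement number for an ESC SP F sequence (F from A to \\)."""
    if len(seq) != 3 or seq[:2] != b"\x1b " or not 0x41 <= seq[2] <= 0x5C:
        raise ValueError("not an announcement sequence covered by the table")
    return seq[2] - 0x40


# ESC SP L announces ISO/IEC 4873 level 1 (number 12).
print(announcement_number(b"\x1b L"))
```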

ISO/IEC 2022 code versions

Six 7-bit ISO 2022 code versions are defined by IETF RFCs, of which ISO-2022-JP and ISO-2022-KR have been extensively used in the past. A number of other variants are defined by vendors, including IBM. Although UTF-8 is the preferred encoding in HTML5, legacy content in ISO-2022-JP remains sufficiently widespread that the WHATWG encoding standard retains support for it, in contrast to mapping ISO-2022-KR, ISO-2022-CN and ISO-2022-CN-EXT entirely to the replacement character, due to concerns about code injection attacks such as cross-site scripting.
8-bit code versions include Extended Unix Code. The ISO/IEC 8859 encodings also follow ISO 2022, in a subset stipulated in ISO/IEC 4873.

Japanese e-mail versions

ISO-2022-JP

ISO-2022-JP is a widely used encoding for Japanese, in particular in e-mail. It was introduced for use on the JUNET network and later codified in IETF RFC 1468, dated 1993. It has an advantage over other encodings for Japanese in that it does not require 8-bit clean transmission. Microsoft calls it Code page 50220. It starts in ASCII and includes the following escape sequences:
  • ESC ( B to switch to ASCII
  • ESC ( J to switch to the JIS X 0201-1976 Roman set
  • ESC $ @ to switch to JIS X 0208-1978
  • ESC $ B to switch to JIS X 0208-1983
Use of the two characters added in JIS X 0208-1990 is permitted, but without including the IRR sequence, i.e. using the same escape sequence as JIS X 0208-1983. Also, due to being registered before designating multi-byte sets except to G0 was possible, the escapes for JIS X 0208 do not include the second I byte, (.
The RFC notes that some existing systems did not distinguish ESC ( B from ESC ( J, or ESC $ @ from ESC $ B.
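The behaviour of the profile can be observed with Python's bundled iso2022_jp codec. This is an illustration using one particular implementation; the sample text is arbitrary and the variable names are hypothetical.

```python
text = "こんにちは"
encoded = text.encode("iso2022_jp")

# The encoder designates JIS X 0208-1983 with ESC $ B before the kana and
# switches back to ASCII with ESC ( B at the end of the text.
starts_with_jis = encoded.startswith(b"\x1b$B")
ends_in_ascii = encoded.endswith(b"\x1b(B")

# Every byte fits in 7 bits, so 8-bit clean transmission is not required.
seven_bit_only = all(byte < 0x80 for byte in encoded)

print(encoded, starts_with_jis, ends_in_ascii, seven_bit_only)
```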

Versions with halfwidth katakana

Use of ESC ( I to switch to the JIS X 0201 katakana set is not part of the ISO-2022-JP profile, but is also sometimes used. Python allows it in a variant which it labels ISO-2022-JP-EXT; this is close in both name and structure to an encoding denoted ISO-2022-JPext by DEC, which furthermore adds a two-byte user-defined region accessed with an additional ESC $ designation. These variants are not widely used; JIS X 0208 support in extended 8-bit JIS X 0201 is more commonly achieved via Shift JIS. Microsoft's code page for JIS X 0201-based ISO 2022 with single-byte katakana via Shift Out and Shift In is Code page 50222.

ISO-2022-JP-2

ISO-2022-JP-2 is a multilingual extension of ISO-2022-JP, defined in RFC 1554, which permits the following escape sequences in addition to the ISO-2022-JP ones. The ISO/IEC 8859 parts are 96-character sets which cannot be designated to G0, and are accessed from G2 using the 7-bit escape sequence form of the single-shift code SS2:
  • ESC $ A to switch to GB 2312-1980
  • ESC $ ( C to switch to KS X 1001-1992
  • ESC $ ( D to switch to JIS X 0212-1990
  • ESC . A to switch to ISO/IEC 8859-1 high part, Extended Latin 1 set
  • ESC . F to switch to ISO/IEC 8859-7 high part, Basic Greek set
ISO-2022-JP with the ISO-2022-JP-2 representation of JIS X 0212, but not the other extensions, was subsequently dubbed ISO-2022-JP-1 by RFC 2237, dated 1997.

IBM Japanese TCP

IBM implements nine 7-bit ISO 2022 based encodings for Japanese, each using a different set of escape sequences: IBM-956, IBM-957, IBM-958, IBM-959, IBM-5052, IBM-5053, IBM-5054, IBM-5055 and ISO-2022-JP, which are collectively termed "TCP/IP Japanese coded character sets". CCSID 9148 is the standard ISO-2022-JP.
Code page / CCSID  ACRI definition number
956                TCP-01
957                TCP-02
958                TCP-03
959                TCP-04
5052               TCP-05
5053               TCP-06
5054               TCP-07
5055               TCP-08
9148               TCP-16

JIS X 0213

The JIS X 0213 standard, first published in 2000, defines an updated version of ISO-2022-JP, without the ISO-2022-JP-2 extensions, named ISO-2022-JP-3. The additions made by JIS X 0213 compared to the base JIS X 0208 standard resulted in a new registration being made for the extended JIS plane 1, while the new plane 2 received its own registration. The further additions to plane 1 in the 2004 edition of the standard resulted in an additional registration being added to a further revision of the profile, dubbed ISO-2022-JP-2004. In addition to the basic ISO-2022-JP designation codes, four further designations are recognized: one of the form ESC ... for a single-byte set, and three of the form ESC $ ... for the multi-byte registrations described above.

Other 7-bit versions

ISO-2022-KR is defined in RFC 1557, dated 1993. It encodes ASCII and the Korean double-byte KS X 1001-1992, previously named KS C 5601-1987. Unlike ISO-2022-JP-2, it makes use of the Shift Out and Shift In characters to switch between them, after including ESC $ ) C once at the start of a line to designate KS X 1001 to G1.
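The header-plus-locking-shift structure can be illustrated with a small hand-rolled encoder, checked against Python's built-in `iso2022_kr` codec. The function name `to_iso2022_kr` is hypothetical. The sketch assumes a single line of input, that every non-ASCII character exists in KS X 1001, and that the 7-bit form of a KS X 1001 character is its EUC-KR form with the high bit cleared.

```python
ESC, SO, SI = b"\x1b", b"\x0e", b"\x0f"
KR_HEADER = ESC + b"$)C"  # designates KS X 1001 to G1

def to_iso2022_kr(text: str) -> bytes:
    """Illustrative single-line ISO-2022-KR encoder (hypothetical helper)."""
    out = bytearray(KR_HEADER)
    in_g1 = False  # True while G1 is invoked by a preceding Shift Out
    for ch in text:
        if ch.isascii():
            if in_g1:
                out += SI  # Shift In: back to ASCII
                in_g1 = False
            out += ch.encode("ascii")
        else:
            if not in_g1:
                out += SO  # Shift Out: switch to KS X 1001
                in_g1 = True
            # Assumption: EUC-KR with the high bit cleared is the 7-bit form.
            out += bytes(b & 0x7F for b in ch.encode("euc_kr"))
    if in_g1:
        out += SI  # finish in the ASCII state
    return bytes(out)

sample = to_iso2022_kr("abc 한글")
print(sample)
print(sample.decode("iso2022_kr"))
```

Because the shift state persists until the next Shift In, the two Hangul syllables share a single Shift Out, whereas the single-shift approach of ISO-2022-JP-2 applies to one character at a time.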
ISO-2022-CN and ISO-2022-CN-EXT are defined in RFC 1922, dated 1996. They are 7-bit encodings making use both of the Shift Out and Shift In functions, and of the 7-bit escape code forms of the single-shift functions SS2 and SS3. They support the character sets GB 2312 and CNS 11643.
The basic ISO-2022-CN profile uses ASCII as its G0 set, and also includes GB 2312 and the first two planes of CNS 11643:
  • ESC $ ) A to switch to GB 2312-1980
  • ESC $ ) G to switch to CNS 11643-1992 Plane 1
  • ESC $ * H to switch to CNS 11643-1992 Plane 2
The ISO-2022-CN-EXT profile permits the following additional sets and planes:
  • ESC $ ) E to switch to ISO-IR-165
  • ESC $ + I to switch to CNS 11643-1992 Plane 3
  • ESC $ + J to switch to CNS 11643-1992 Plane 4
  • ESC $ + K to switch to CNS 11643-1992 Plane 5
  • ESC $ + L to switch to CNS 11643-1992 Plane 6
  • ESC $ + M to switch to CNS 11643-1992 Plane 7
The ISO-2022-CN-EXT profile further lists additional Guobiao standard graphical sets as being permitted, but conditional on their being assigned registered ISO 2022 escape sequences:
  • GB 12345 in G1
  • GB 7589 or GB 13131 in G2
  • GB 7590 or GB 13132 in G3
The character after the ESC or ESC $ specifies the type of character set and the working set that it is designated to. In the above examples, the characters ), * and + designate to the G1, G2 and G3 sets respectively.
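The way a designation escape sequence breaks down can be sketched as a small parser. The function name `parse_designation` and the table `INTERMEDIATES` are hypothetical. The mapping for ), * and + follows the examples above; the remaining entries (the G0 intermediate, the 96-character-set intermediates and the short `ESC $ F` form that designates to G0) are my reading of ISO/IEC 2022 and should be treated as assumptions.

```python
# Intermediate byte -> (working set, size of the set being designated).
INTERMEDIATES = {
    0x28: ("G0", 94), 0x29: ("G1", 94), 0x2A: ("G2", 94), 0x2B: ("G3", 94),
    0x2D: ("G1", 96), 0x2E: ("G2", 96), 0x2F: ("G3", 96),
}

def parse_designation(seq: bytes):
    """Return (working_set, set_size, multibyte, final_byte) for one escape.

    Hypothetical helper; handles only the plain designation forms.
    """
    if len(seq) < 3 or seq[0] != 0x1B:
        raise ValueError("not a designation escape sequence")
    body = seq[1:]
    multibyte = body[0] == 0x24  # '$' marks a multiple-byte set
    if multibyte:
        body = body[1:]
    final = body[-1]
    if not 0x30 <= final <= 0x7E:
        raise ValueError("invalid final byte")
    if multibyte and len(body) == 1:
        # Short form such as ESC $ A: no intermediate byte, designates to G0.
        return ("G0", 94, True, chr(final))
    if len(body) == 2 and body[0] in INTERMEDIATES:
        working_set, size = INTERMEDIATES[body[0]]
        return (working_set, size, multibyte, chr(final))
    raise ValueError("unsupported designation sequence")

for raw in (b"\x1b$)A", b"\x1b$*H", b"\x1b$+I", b"\x1b.A", b"\x1b$A"):
    print(raw, parse_designation(raw))
```

The parser only classifies the sequence; looking up which registered character set a given final byte refers to would need a separate table.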
ISO-2022-KR and ISO-2022-CN are used less frequently than ISO-2022-JP, and are sometimes deliberately not supported due to security concerns. Notably, the WHATWG Encoding Standard used by HTML5 maps ISO-2022-KR, ISO-2022-CN and ISO-2022-CN-EXT to the "replacement" decoder, which maps all input to the replacement character, in order to prevent certain cross-site scripting and related attacks, which utilize a difference in encoding support between the client and server. Although the same security concern also applies to ISO-2022-JP and UTF-16, they could not be given this treatment due to being much more frequently used in deployed content.
In April 2024, a security flaw was found in the implementation of ISO-2022-CN-EXT in glibc, which led to recommendations to disable the encoding entirely on Linux systems.

ISO/IEC 4873

A subset of ISO 2022 applied to 8-bit single-byte encodings is defined by ISO/IEC 4873, also published by Ecma International as ECMA-43. ISO/IEC 8859 defines 8-bit codes for ISO/IEC 4873 level 1.
ISO/IEC 4873 / ECMA-43 defines three levels of encoding:
  • Level 1, which includes a C0 set, the ASCII G0 set, an optional C1 set and an optional single-byte G1 set. G0 is invoked over GL, and G1 is invoked over GR. Use of shift functions is not permitted.
  • Level 2, which includes a single-byte G2 and/or G3 set in addition to a mandatory G1 set. Only the single-shift functions SS2 and SS3 are permitted, and they invoke over the GL region. SS2 and SS3 must be available in C1 at 0x8E and 0x8F respectively. This minimal required C1 set for ISO 4873 is registered as ISO-IR-105.
  • Level 3, which permits the GR locking-shift functions LS1R, LS2R and LS3R in addition to the single shifts, but otherwise has the same restrictions as level 2.
Earlier editions of the standard permitted non-ASCII assignments in the G0 set, provided that the ISO/IEC 646 invariant positions were preserved, that the other positions were assigned to spacing characters, that 0x23 was assigned to either £ or #, and that 0x24 was assigned to either $ or ¤. For instance, the 8-bit encoding of JIS X 0201 is compliant with earlier editions. This was subsequently changed to fully specify the ISO/IEC 646:1991 IRV / ISO-IR No. 6 set.
The use of the ISO/IEC 646 IRV at ISO/IEC 4873 Level 1 with no C1 or G1 set, i.e. using the IRV in an 8-bit environment in which shift codes are not used and the high bit is always zero, is known as ISO 4873 DV, in which DV stands for "Default Version".
In cases where duplicate characters are available in different sets, the current edition of ISO/IEC 4873 / ECMA-43 only permits using these characters in the lowest numbered working set which they appear in. For instance, if a character appears in both the G1 set and the G3 set, it must be used from the G1 set. However, use from other sets is noted as having been permitted in earlier editions.
ISO/IEC 8859 defines complete encodings at level 1 of ISO/IEC 4873, and does not allow for use of multiple ISO/IEC 8859 parts together. It stipulates that ISO/IEC 10367 should be used instead for levels 2 and 3 of ISO/IEC 4873. ISO/IEC 10367:1991 includes G0 and G1 sets matching those used by the first 9 parts of ISO/IEC 8859, and some supplementary sets.
Character set designation escape sequences are used for identifying or switching between versions during information interchange only if required by a further protocol, in which case the standard requires an ISO/IEC 2022 announcer sequence specifying the ISO/IEC 4873 level, followed by a complete set of escapes specifying the character set designations for C0, C1, G0, G1, G2 and G3 respectively, with an F-byte of 0x7E denoting an empty set. Each ISO/IEC 4873 level has its own single ISO/IEC 2022 announcer sequence, which are as follows:
Code | Hex | Announcement
ESC SP L | 1B 20 4C | ISO 4873 Level 1
ESC SP M | 1B 20 4D | ISO 4873 Level 2
ESC SP N | 1B 20 4E | ISO 4873 Level 3
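Since the three announcers differ only in their final byte, they can be generated or recognised with a few lines of Python. The function names below are hypothetical, and the sketch covers only the announcer itself, not the designation escapes that the standard requires to follow it.

```python
ESC, SP = 0x1B, 0x20
LEVEL_1_FINAL = 0x4C  # 'L'; levels 2 and 3 use the next two bytes, 'M' and 'N'

def iso4873_announcer(level: int) -> bytes:
    """Build the announcer sequence for an ISO/IEC 4873 level (1 to 3)."""
    if level not in (1, 2, 3):
        raise ValueError("ISO/IEC 4873 defines levels 1 to 3 only")
    return bytes([ESC, SP, LEVEL_1_FINAL + level - 1])

def iso4873_level(announcer: bytes) -> int:
    """Inverse lookup: return the level named by an announcer sequence."""
    for level in (1, 2, 3):
        if announcer == iso4873_announcer(level):
            return level
    raise ValueError("not an ISO/IEC 4873 level announcer")

for level in (1, 2, 3):
    print(level, iso4873_announcer(level).hex(" "))
```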

Extended Unix Code

Extended Unix Code is an 8-bit variable-width character encoding system used primarily for Japanese, Korean, and simplified Chinese. It is based on ISO 2022, and only character sets which conform to the ISO 2022 structure can have EUC forms. Up to four coded character sets can be represented. The G0 set is invoked over GL, the G1 set is invoked over GR, and the G2 and G3 sets are invoked using the single shifts SS2 and SS3, which are used as CR bytes and invoke over GR. Locking shift codes are not used.
The code assigned to the G0 set is ASCII, or the country's national ISO 646 character set such as KS-Roman or JIS-Roman. Hence, 0x5C is used to represent a Yen sign in some versions of EUC-JP and a Won sign in some versions of EUC-KR.
G1 is used for a 94x94 coded character set represented in two bytes. The EUC-CN form of GB 2312 and EUC-KR are examples of such two-byte EUC codes. EUC-JP includes characters represented by up to three bytes whereas a single character in EUC-TW can take up to four bytes.
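Because EUC uses no locking shifts, the length of each character follows from its first byte alone. The Python sketch below splits an EUC-JP byte string into characters on that basis. The function name `split_euc_jp` is hypothetical, and the sketch assumes the usual EUC-JP layout, in which SS2 (0x8E) is followed by one byte from G2 and SS3 (0x8F) by two bytes from G3.

```python
SS2, SS3 = 0x8E, 0x8F  # single shifts, used as C1 bytes

def split_euc_jp(data: bytes) -> list[bytes]:
    """Split EUC-JP bytes into per-character byte strings (hypothetical helper)."""
    chars, i = [], 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            width = 1  # G0 character or C0 control
        elif lead == SS2:
            width = 2  # SS2 + one G2 byte (assumed single-byte G2)
        elif lead == SS3:
            width = 3  # SS3 + two G3 bytes (assumed two-byte G3)
        elif 0xA1 <= lead <= 0xFE:
            width = 2  # two-byte G1 character invoked over GR
        else:
            raise ValueError(f"unexpected lead byte 0x{lead:02X} at offset {i}")
        if i + width > len(data):
            raise ValueError("truncated character at end of input")
        chars.append(data[i:i + width])
        i += width
    return chars

encoded = "Aあｱ".encode("euc_jp")
print([c.hex() for c in split_euc_jp(encoded)])
```

The function validates only lead bytes and lengths; it does not check that trail bytes fall in the GR range.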
The EUC code itself does not make use of the announcer or designation sequences from ISO 2022; however, it corresponds to the following sequence of four announcer sequences, with meanings breaking down as follows.
Individual sequence | Hexadecimal | Feature of EUC denoted
ESC SP C | 1B 20 43 | ISO-8
ESC SP Z | 1B 20 5A | G2 accessed using SS2
ESC SP [ | 1B 20 5B | G3 accessed using SS3
ESC SP \ | 1B 20 5C | Single-shifts invoke over GR

Compound Text (X11)

The X Consortium defined an ISO 2022 profile named Compound Text as an interchange format in 1989. This uses only four control codes: NL, with the SDS CSI sequence being used for bidirectional text control. It is an 8-bit code using G0 and G1 for GL and GR, and follows ISO-8859-1 in its initial state. The following F-bytes are used:
Escape sequence type | Final byte | Graphical set
GZD4, G1D4 | | ASCII
GZD4, G1D4 | | JIS X 0201 katakana
GZD4, G1D4 | | JIS X 0201 Roman
G1D6 | | ISO-8859-1 high part
G1D6 | | ISO-8859-2 high part
G1D6 | | ISO-8859-3 high part
G1D6 | | ISO-8859-4 high part
G1D6 | | ISO-8859-7 high part
G1D6 | | ISO-8859-6 high part
G1D6 | | ISO-8859-8 high part
G1D6 | | ISO-8859-5 high part
G1D6 | | ISO-8859-9 high part
GZDM4, G1DM4 | | GB 2312
GZDM4, G1DM4 | | JIS X 0208
GZDM4, G1DM4 | | KS C 5601

For specifying encodings by labels, X11 Compound Text defines five private-use DOCS sequences: for variable-length encodings, and through for fixed-length encodings using one through four bytes respectively. Rather than using another escape sequence to return to, the two bytes following the initial escape sequence specify the remaining length in bytes, coded in base-128 using bytes. The encoding label is included in ISO 8859-1 before the encoded text, and terminated with .
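The two-byte base-128 length field can be sketched as follows. The function names are hypothetical, and the sketch assumes that each base-128 digit is stored with its high bit set (so both bytes lie in 0x80 to 0xFF) and that the more significant digit comes first; both points should be checked against the X Consortium specification before relying on them.

```python
MAX_LENGTH = 128 * 128 - 1  # largest value two base-128 digits can hold

def encode_segment_length(n: int) -> bytes:
    """Encode a remaining-length value as two base-128 digits (assumed layout)."""
    if not 0 <= n <= MAX_LENGTH:
        raise ValueError("length does not fit in two base-128 digits")
    # High bit set on both bytes (assumption), most significant digit first.
    return bytes([0x80 | (n >> 7), 0x80 | (n & 0x7F)])

def decode_segment_length(pair: bytes) -> int:
    """Decode the two length bytes back into an integer."""
    if len(pair) != 2 or pair[0] < 0x80 or pair[1] < 0x80:
        raise ValueError("expected two bytes with the high bit set")
    return ((pair[0] & 0x7F) << 7) | (pair[1] & 0x7F)

print(encode_segment_length(300).hex(" "))
print(decode_segment_length(encode_segment_length(300)))
```

An explicit length lets a decoder that does not recognise the labelled encoding skip the whole segment without interpreting its bytes.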

Comparison with other encodings

Advantages

  • As ISO/IEC 2022's entire range of graphical character encodings can be invoked over GL, the available glyphs are not significantly limited by an inability to represent GR and C1, such as in a system limited to 7-bit encodings. It accordingly enables the representation of a large set of characters in such a system. In practice, this 7-bit compatibility is an advantage only for backwards compatibility with older systems, since the vast majority of modern computers use 8 bits for each byte.
  • As compared to Unicode, ISO/IEC 2022 sidesteps Han unification by using sequence codes to switch between discrete encodings for different East Asian languages. This avoids the issues associated with unification, such as difficulty supporting multiple CJK languages with their associated character variants in a single document and font.

Disadvantages

  • Since ISO/IEC 2022 is a stateful encoding, a program cannot jump in the middle of a block of text to search, insert or delete characters. This makes manipulation of the text very cumbersome and slow when compared to non-stateful encodings. Any jump in the middle of the text may require a backup to the previous escape sequence before the bytes following the escape sequence can be interpreted.
  • Due to the stateful nature of ISO/IEC 2022, an identical and equivalent character may be encoded in different character sets, which may be designated to any of G0 through G3, which may be invoked using single shifts or by using locking shifts to GL or GR. Consequently, characters can be represented in multiple ways, meaning that two visually identical and equivalent strings cannot be reliably compared for equality.
  • Some systems, like DICOM and several e-mail clients, use a variant of ISO-2022 in addition to supporting several other encodings. This type of variation makes it difficult to portably transfer text between computer systems.
  • UTF-1, the multi-byte Unicode transformation format compatible with ISO/IEC 2022's representation of 8-bit control characters, has various disadvantages in comparison with UTF-8, and switching from or to other charsets, as supported by ISO/IEC 2022, is typically unnecessary in Unicode documents.
  • Because of its escape sequences, it is possible to construct attack byte sequences in which a malicious string is masked until it is decoded to Unicode, which may allow it to bypass sanitisation. Use of this encoding is thus treated as suspicious by malware protection suites, and 7-bit ISO 2022 data is mapped in its entirety to the replacement character in HTML5 to prevent attacks. Restricted ISO 2022 8-bit code versions which do not use designation escapes or locking shift codes, such as Extended Unix Code, do not share this problem.
  • Concatenation can pose issues. Profiles such as ISO-2022-JP specify that the stream starts in the ASCII state and must end in the ASCII state. This is necessary to ensure that characters in concatenated ISO-2022-JP and/or ASCII streams will be interpreted in the correct set. This has the consequence that if a stream that ends in a multi-byte character is concatenated with one that starts with a multi-byte character, a pair of escape codes are generated switching to ASCII and immediately away from it. However, as stipulated in Unicode Technical Report #36, pairs of ISO 2022 escape sequences with no characters between them should generate a replacement character to prevent them from being used to mask malicious sequences such as cross-site scripting. Implementing this measure, e.g. in Mozilla Thunderbird, has led to interoperability issues, with unexpected characters being generated where two ISO-2022-JP streams have been concatenated.
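Several of the points above can be observed with Python's built-in `iso2022_jp` codec. The sketch below shows the adjacent escape sequences produced by concatenation, two different byte strings that decode to the same text, and bytes from the middle of a stream that look like ASCII once the preceding escape sequence is lost. The variable names are illustrative, and the behaviour shown is that of Python's codec rather than a statement about other implementations.

```python
TO_ASCII = b"\x1b(B"     # escape sequence ending each encoded stream
TO_JISX0208 = b"\x1b$B"  # escape sequence Python's encoder emits for kanji

first = "日本".encode("iso2022_jp")
second = "語".encode("iso2022_jp")
joined = first + second               # naive byte-level concatenation
direct = "日本語".encode("iso2022_jp")  # the same text encoded in one pass

# Each stream starts and ends in ASCII, so the seam holds two escapes
# with no characters between them.
print(TO_ASCII + TO_JISX0208 in joined)

# Different bytes, same text: byte equality does not imply string equality.
print(joined == direct)
print(joined.decode("iso2022_jp") == direct.decode("iso2022_jp"))

# Without the preceding escape sequence, the two bytes of the first kanji
# are indistinguishable from two ASCII characters.
print(first[3:5].decode("ascii"))
```

Python's decoder accepts the empty escape pair silently; a decoder that follows the Unicode Technical Report #36 recommendation would instead emit a replacement character at the seam.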
