UTF-EBCDIC
UTF-EBCDIC is a character encoding capable of encoding all 1,112,064 valid character code points in Unicode using 1 to 5 bytes. It is meant to be EBCDIC-friendly, so that legacy EBCDIC applications on mainframes may process the characters without much difficulty. Its advantages for existing EBCDIC-based systems are similar to UTF-8's advantages for existing ASCII-based systems. Details on UTF-EBCDIC are defined in Unicode Technical Report #16.
To produce the UTF-EBCDIC encoded version of a series of Unicode code points, an encoding based on UTF-8 (called UTF-8-Mod in the specification) is applied first. The main difference between this encoding and UTF-8 is that it allows Unicode code points U+0080 through U+009F (the C1 control codes) to be represented as a single byte and therefore later mapped to the corresponding EBCDIC control codes. To achieve this, UTF-8-Mod uses 101XXXXX instead of 10XXXXXX as the format for trailing bytes in a multi-byte sequence. As each trailing byte can hold only 5 bits rather than 6, many code points above U+009F need more bytes in UTF-8-Mod than in UTF-8; for example, U+0400 through U+07FF take three bytes rather than two.
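The UTF-8-Mod step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name utf8_mod_encode and the range table are assumptions made for this example, derived from the rule that single bytes cover U+0000 through U+009F and every trailing byte has the form 101XXXXX (5 payload bits). It produces only the intermediate ASCII-based bytes, before the byte-for-byte lookup table is applied.

```python
def utf8_mod_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8-Mod bytes (illustrative sketch)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a valid Unicode code point: {cp:#x}")
    if cp <= 0x9F:
        # ASCII and the C0/C1 control codes are encoded as a single byte.
        return bytes([cp])
    # (highest code point, lead-byte marker, number of trailing bytes)
    for limit, lead, trail in (
        (0x3FF, 0xC0, 1),      # 110yyyyy + 1 trailing byte
        (0x3FFF, 0xE0, 2),     # 1110zzzz + 2 trailing bytes
        (0x3FFFF, 0xF0, 3),    # 11110www + 3 trailing bytes
        (0x10FFFF, 0xF8, 4),   # 1111100v + 4 trailing bytes
    ):
        if cp <= limit:
            break
    # Trailing bytes carry 5 bits each in the form 101XXXXX (0xA0 | bits),
    # collected here from the low-order group upwards.
    out = [0xA0 | ((cp >> (5 * i)) & 0x1F) for i in range(trail)]
    # The lead byte carries whatever bits remain above the trailing groups.
    out.append(lead | (cp >> (5 * trail)))
    return bytes(reversed(out))


print(utf8_mod_encode(0x41).hex())     # 41 (unchanged from ASCII)
print(utf8_mod_encode(0xA0).hex())     # c5a0
print(utf8_mod_encode(0x400).hex())    # e1a0a0 (UTF-8 needs only d080)
print(utf8_mod_encode(0x10FFFF).hex()) # f9a1bfbfbf
```

The U+0400 case shows the size penalty of 5-bit trailing bytes: three bytes here against two in UTF-8.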
The UTF-8-Mod transformation leaves the data in an ASCII-based format, so each byte is then fed through a reversible (one-to-one) lookup table to produce the final UTF-EBCDIC encoding. For example, 0x41 in this table maps to 0xC1; thus the UTF-EBCDIC encoding of U+0041 (Unicode "A") is 0xC1 (EBCDIC "A").
UTF-EBCDIC is rarely used, even on the EBCDIC-based mainframes for which it was designed. IBM EBCDIC-based mainframe operating systems, such as z/OS, usually use UTF-16 for complete Unicode support. For example, IBM Db2, COBOL, PL/I, Java and the IBM XML toolkit support UTF-16 on IBM mainframes.
Codepage layout
There are 160 characters with single-byte encodings in UTF-EBCDIC. The single-byte portion is similar to IBM-1047 rather than IBM-37 because of the location of the square brackets: CCSID 37 has [ and ] at hex BA and BB instead of at hex AD and BD respectively.
In the code page table, bytes fall into four classes:
Start bytes for a sequence of that many bytes.
Start bytes where not all combinations of continuation bytes are valid, either because the result is an invalid overlong form or because it encodes a code point greater than U+10FFFF.
Continuation bytes, each of which adds 5 bits.
Unused bytes, including lead bytes that can only start an invalid overlong form. For example, 0x76 is unused because even 0x76 0x73 would merely be an overlong encoding of U+005F.
Oracle UTFE
Oracle UTFE is a Unicode 3.0 UTF-8 variation used by the Oracle database, similar to the CESU-8 variant of UTF-8, in which supplementary characters are encoded as two 4-byte characters rather than a single 4- or 5-byte character. It is used only on EBCDIC platforms.
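The size difference for supplementary characters can be illustrated with a short Python sketch. It assumes, by analogy with CESU-8, that the character is first split into a UTF-16 surrogate pair and that each surrogate is then encoded with 5-bit trailing bytes of the form 101XXXXX; whether Oracle's implementation matches this byte for byte is not confirmed here. The helper names i8_bytes and cesu_style_bytes are hypothetical, and the final EBCDIC lookup table is not applied.

```python
def i8_bytes(value: int) -> bytes:
    """UTF-8-Mod style bytes for a value up to 0x10FFFF (no surrogate check)."""
    if value <= 0x9F:
        return bytes([value])
    # (highest value, lead-byte marker, number of trailing bytes)
    for limit, lead, trail in ((0x3FF, 0xC0, 1), (0x3FFF, 0xE0, 2),
                               (0x3FFFF, 0xF0, 3), (0x10FFFF, 0xF8, 4)):
        if value <= limit:
            break
    # Trailing bytes are 101XXXXX, gathered from the low-order group upwards.
    out = [0xA0 | ((value >> (5 * i)) & 0x1F) for i in range(trail)]
    out.append(lead | (value >> (5 * trail)))
    return bytes(reversed(out))


def cesu_style_bytes(cp: int) -> bytes:
    """Assumed CESU-8-like scheme: encode each UTF-16 surrogate separately."""
    if cp < 0x10000:
        return i8_bytes(cp)
    v = cp - 0x10000
    high = 0xD800 | (v >> 10)    # high (lead) surrogate
    low = 0xDC00 | (v & 0x3FF)   # low (trail) surrogate
    return i8_bytes(high) + i8_bytes(low)


cp = 0x1F600  # a supplementary-plane code point
print(len(i8_bytes(cp)))          # 4 bytes as a single sequence
print(len(cesu_style_bytes(cp)))  # 8 bytes as two 4-byte sequences
```

Each surrogate lies in the range U+4000 through U+3FFFF, which is why it occupies four bytes under this scheme.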