The variable-length encoding described above is sometimes referred to as the EUC packed format, which is the encoding format usually labeled as EUC. However, internal processing of EUC data may make use of a fixed-length transformation format called the EUC complete two-byte format. This represents:
- Code set 0 as two bytes in the range 0x21–0x7E.
- Code set 1 as two bytes in the range 0xA0–0xFF.
- Code set 2 as a byte in the range 0x21–0x7E followed by a byte in the range 0xA0–0xFF.
- Code set 3 as a byte in the range 0xA0–0xFF followed by a byte in the range 0x21–0x7E.
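As a rough illustration of the ranges listed here, the following Python sketch converts packed EUC-JP bytes into two-byte units and then recovers the code set from the high bit of each byte. The function names `packed_to_fixed` and `code_set_of` are hypothetical. The padding behaviour is inferred only from the ranges given in this section, assuming the EUC-JP layout in which code set 2 holds one byte after 0x8E and code set 3 holds two bytes after 0x8F, with a single-byte code set padded by an initial 0x00 or 0x80 byte. Input is assumed to be well formed.

```python
def packed_to_fixed(data: bytes) -> list[tuple[int, int]]:
    """Convert well-formed packed EUC-JP bytes to (first, second) byte pairs.

    Illustrative sketch only: no validation of truncated or invalid input.
    """
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Code set 0: one byte, padded with an initial 0x00.
            units.append((0x00, b))
            i += 1
        elif b == 0x8E:
            # Code set 2: drop the single shift, keep the high-bit byte,
            # and pad with 0x00 so the first byte has its high bit clear.
            units.append((0x00, data[i + 1]))
            i += 2
        elif b == 0x8F:
            # Code set 3: drop the single shift, clear the high bit of
            # the second byte only.
            units.append((data[i + 1], data[i + 2] & 0x7F))
            i += 3
        else:
            # Code set 1: two bytes, both with the high bit set.
            units.append((b, data[i + 1]))
            i += 2
    return units


def code_set_of(first: int, second: int) -> int:
    """Recover the code set number from the high bits of a two-byte unit."""
    high_first = bool(first & 0x80)
    high_second = bool(second & 0x80)
    if not high_first and not high_second:
        return 0
    if high_first and high_second:
        return 1
    if not high_first:
        return 2  # high bit clear, then high bit set
    return 3      # high bit set, then high bit clear


sample = b"A\xa4\xa2\x8e\xb1\x8f\xb0\xa1"
fixed = packed_to_fixed(sample)
print([f"{a:02X}{b:02X}" for a, b in fixed])
print([code_set_of(a, b) for a, b in fixed])
```

The point of the fixed-width form is visible in `code_set_of`: once every character occupies two bytes, the code set can be read from two bits without scanning for single-shift bytes.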
Initial bytes of 0x00 and 0x80 are used in cases where the code set uses only one byte. There is also a four-byte fixed-length format. These fixed-length encoding formats are suited to internal processing and are not usually encountered in interchange.

EUC-JP is registered with the IANA in both formats, the packed format as "EUC-JP" or "csEUCPkdFmtJapanese" and the fixed-width format as "csEUCFixWidJapanese". Only the packed format is included in the WHATWG Encoding Standard used by HTML5.

EUC-CN

EUC-CN is the usual encoded form of the GB 2312 standard for simplified Chinese characters. Unlike the case of Japanese JIS X 0208 and ISO-2022-JP, GB 2312 is not normally used in a 7-bit code version, although a variant form called HZ was sometimes used on USENET. An ASCII character is represented in its usual encoding. A character from GB 2312 is represented by two bytes, both from the range 0xA1–0xFE.

748 code

An encoding related to EUC-CN is the "748" code used in the WITS typesetting system developed by Beijing's Founder Technology. The 748 code contains all of GB 2312, but is not ISO 2022-compliant and is therefore not a true EUC code. The non-GB 2312 portion of the 748 code contains traditional and Hong Kong characters and other glyphs used in newspaper typesetting.

IBM code pages 1380, 1381, 1382 and 1383

IBM code page 1381 comprises the single-byte code page 1115 and the double-byte code page 1380, which encodes GB 2312 the same way as EUC-CN, but deviates from the EUC structure by extending the lead byte range back to 0x8C, adding 31 IBM-selected characters in 0x8CE0 through 0x8CFE and adding 1880 user-defined characters with lead bytes 0x8D through 0xA0. IBM code page 1383 comprises the single-byte code page 367 and the double-byte code page 1382, which differs by conforming to the EUC structure, adding the 31 IBM-selected characters in 0xFEE0 through 0xFEFE instead, and including only 1360 user-defined characters, interspersed in the positions not used by GB 2312.
The alternative CCSID 5479 is used for the pure EUC-CN code page: it uses CCSID 9574 as its double-byte set, which uses CPGID 1382 but excludes the IBM-selected and user-defined characters.

GBK and GB 18030

GBK is an extension to GB 2312. It defines an extended form of the EUC-CN encoding capable of representing a larger array of CJK characters sourced largely from Unicode 1.1, including traditional Chinese characters and characters used only in Japanese. It is not, however, a true EUC code, because ASCII bytes may appear as trail bytes, due to a larger encoding space being required. Variants of GBK are implemented by Windows code page 936 and by IBM's code page 1386.

The Unicode-based character encoding GB 18030 defines an extension of GBK capable of encoding the entirety of Unicode. However, Unicode encoded as GB 18030 is a variable-length encoding which may use up to four bytes per character, due to an even larger encoding space being required. Being an extension of GBK, it is a superset of EUC-CN but is not itself a true EUC code. Being a Unicode encoding, its repertoire is identical to that of other Unicode transformation formats such as UTF-8.

Mac OS Chinese Simplified

Other EUC-CN variants deviating from the EUC mechanism include the classic Mac OS Chinese Simplified script. It uses the bytes 0x80, 0x81, 0x82, 0xA0, 0xFD, 0xFE, and 0xFF for the U with umlaut (ü), two special font metric characters, the non-breaking space, the copyright sign (©), the trademark sign (™) and the ellipsis (…) respectively. This differs from both EUC and GBK in what is regarded as a single-byte character versus the first byte of a two-byte character. This use of 0xA0, 0xFD, 0xFE and 0xFF matches Apple's Shift_JIS variant. Besides these changes to the lead byte range, the other distinctive feature of the double-byte portion of Mac OS Chinese Simplified is the inclusion of two extensions to the basic GB 2312-80 set in rows 6 and 8.
These are considered "standard extensions to GB 2312", neither of which is proprietary to Apple: the row 8 extension was taken from GB 6345.1, both extensions are included by GB/T 12345, and both extensions are included by GB 18030.

EUC-JP

EUC-JP is a variable-length encoding used to represent the elements of three Japanese character set standards, namely JIS X 0208, JIS X 0212, and JIS X 0201. Other names for this encoding include Unixized JIS and AT&T JIS. Since January 2025, less than 0.1% of all web pages have used EUC-JP, while 2.3% of websites written in Japanese use this second-most popular encoding. It is called Code page 954 by IBM. Microsoft has two code page numbers for this encoding.

This encoding scheme allows the easy mixing of 7-bit ASCII and 8-bit Japanese without the need for the escape characters employed by ISO-2022-JP, which is based on the same character set standards, and without ASCII bytes appearing as trail bytes. A related and partially compatible encoding, called EUC-JISx0213 or EUC-JIS-2004, encodes JIS X 0201 and JIS X 0213.

Compared to EUC-CN or EUC-KR, EUC-JP did not become as widely adopted on PC and Macintosh systems in Japan, which used Shift_JIS or its extensions, although it became heavily used by Unix or Unix-like operating systems. Therefore, whether Japanese websites use EUC-JP or Shift_JIS often depends on what OS the author uses.

Characters are encoded as follows:
- As an EUC/ISO 2022 compliant encoding, the C0 control characters, space, and DEL are represented as in ASCII.
- A graphical character from ASCII is represented as its usual one-byte representation, in the range 0x21–0x7E. While some variants of EUC-JP encode the lower half of JIS X 0201 here, most encode ASCII, including the W3C/WHATWG Encoding Standard used by HTML5, and so does EUC-JIS-2004. While this means that 0x5C is typically mapped to Unicode as U+005C REVERSE SOLIDUS, U+005C may be displayed as a Yen sign by certain Japanese-locale fonts, e.g. on Microsoft Windows, for compatibility with the lower half of JIS X 0201.
- A character from JIS X 0208 is represented by two bytes, both in the range 0xA1–0xFE. This differs from the ISO-2022-JP representation by having the high bit set. This code set may also contain vendor extensions in some EUC-JP variants. In EUC-JIS-2004, the first plane of JIS X 0213 is encoded here, which is effectively a superset of standard JIS X 0208.
- A character from the upper half of JIS X 0201 is represented by two bytes, the first being 0x8E, the second being the usual representation in the range 0xA1–0xDF. This set may contain IBM vendor extensions in some variants.
- A character from JIS X 0212 is represented in EUC-JP by three bytes, the first being 0x8F, the following two being in the range 0xA1–0xFE, i.e. with the high bit set. In addition to standard JIS X 0212, code set 3 of some EUC-JP variants may also contain extensions in rows 83 and 84 to represent characters from IBM's Shift JIS extensions which lack standard JIS X 0212 mappings, which may be coded in either of two layouts, one defined by IBM themselves and one defined by the OSF. In EUC-JIS-2004, the second plane of JIS X 0213 is encoded here, which does not collide with the allocated rows in standard JIS X 0212. Some implementations of EUC-JIS-2004, such as the one used by Python, allow both JIS X 0212 and JIS X 0213 plane 2 characters in this set.
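The byte rules in the list above can be sketched as a small Python splitter that walks a packed EUC-JP byte string and reports which code set each character belongs to. The function name `split_euc_jp` is hypothetical, and the sketch covers only the standard ranges described here; it makes no attempt to handle vendor extensions or EUC-JIS-2004 differences.

```python
def split_euc_jp(data: bytes):
    """Yield (code_set, raw_bytes) for each character in packed EUC-JP.

    Illustrative sketch: follows only the lead/trail byte ranges listed
    above and raises ValueError on anything outside them.
    """
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            code_set, length = 0, 1      # ASCII, controls, space, DEL
        elif lead == 0x8E:
            code_set, length = 2, 2      # upper half of JIS X 0201
        elif lead == 0x8F:
            code_set, length = 3, 3      # JIS X 0212
        elif 0xA1 <= lead <= 0xFE:
            code_set, length = 1, 2      # JIS X 0208
        else:
            raise ValueError(f"invalid lead byte 0x{lead:02X} at offset {i}")

        chunk = data[i:i + length]
        if len(chunk) != length:
            raise ValueError(f"truncated sequence at offset {i}")

        # Trail bytes: 0xA1-0xDF for code set 2, 0xA1-0xFE otherwise.
        upper = 0xDF if code_set == 2 else 0xFE
        if any(not 0xA1 <= t <= upper for t in chunk[1:]):
            raise ValueError(f"invalid trail byte at offset {i}")

        yield code_set, chunk
        i += length


sample = b"A\xa4\xa2\x8e\xb1\x8f\xb0\xa1"
for code_set, raw in split_euc_jp(sample):
    print(code_set, raw.hex())
```

Because no trail byte falls below 0x80, an ASCII byte in the stream is always a complete character, which is the property the text contrasts with GBK. For real decoding, Python's built-in `euc_jp` codec can be used instead, for example `b"\xa4\xa2".decode("euc_jp")`.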
Vendor extensions to EUC-JP were often allocated within the individual code sets, as opposed to using invalid EUC sequences. However, some vendor-specific encodings are partially compatible with EUC-JP, due to encoding over GR, but do not follow the packed EUC structure. Often, these do not include use of the single shifts from EUC-JP, and are thus not straight extensions of EUC-JP, with the exception of Super DEC Kanji.