KPS 9566


KPS 9566 is a North Korean standard specifying a character encoding for the Chosŏn'gŭl writing system used for the Korean language. The edition of 1997 specified an ISO 2022-compliant 94×94 two-byte coded character set. Subsequent editions have added encoded characters outside of the 94×94 plane, in a manner comparable to UHC or GBK.
KPS 9566 differs in approach from KS X 1001, its South Korean counterpart, in four respects: it uses a different ordering of Chosŏn'gŭl, it encodes explicit vertical presentation forms of punctuation, it does not encode duplicate Hanja for multiple readings, and it includes several characters specific to the North Korean political system, among them special encodings for the names of the country's past and present leaders.
Although KPS 9566 was the original source of several characters added to Unicode, not all KPS 9566 characters have Unicode equivalents. Those which do not are mapped to similar Unicode characters or to the Private Use Area.

Background and other standards

The ASCII character set originated in the United States in 1963, and was revised in 1967 to the form it has today. ASCII also became accepted as an international standard in 1967, becoming ECMA-6, designated ISO/IEC 646 by the International Organization for Standardization. It is presently designated ANSI X3.4-1986 and ISO 646:1991. ASCII was a 7-bit, single-byte encoding including 94 graphical characters, the space, and 33 control codes, which provided basic support for representing American English text as a series of bytes.
The next edition of ISO 646, published in 1972, revised the standard to introduce the concept of national versions of the code, allowing countries to replace a few less commonly used codes with their own required characters. At the same time, work was underway on defining extension mechanisms for ASCII, intended to apply to both 7-bit and 8-bit environments. This was completed in 1973 and published as JIS X 0202, ECMA-35 and ISO 2022. ISO 2022 specifies mechanisms for using single-byte and multiple-byte character sets with a certain structure in both 7-bit and 8-bit environments, and for declaring and switching between them in a standard fashion using shift codes and escape sequences.
Countries in East Asia, due to using large repertoires of Chinese characters, introduced standardised double-byte encodings for their writing systems, since the number of characters representable in a single-byte code was not sufficient. In an ISO 2022-compliant double-byte character set (DBCS), every character can be represented with two ASCII printing character bytes; the location of a character can be referenced by these byte values, or by two numbers from 1 to 94, equal to the respective bytes minus 32. The first registered ISO 2022-compliant DBCS, and the first East Asian DBCS to be established as a national standard, was the first edition of JIS X 0208, published in 1978. This was followed by GB 2312 in 1980, and by Wansung code in 1987. Big5, defined in 1984, did not follow the ISO 2022 structure. When used in an 8-bit environment, GB 2312 and Wansung code were usually used with the eighth bit set, with ASCII or a similar single-byte character set (SBCS) used with the eighth bit unset; these encoding schemes are known as EUC-CN and EUC-KR, respectively.
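The relationship between row-cell numbers and byte values described above can be sketched in Python. The function names are illustrative and are not taken from any standard; the sketch assumes only the arithmetic stated in the text (bytes minus 32 in the 7-bit form, with the eighth bit additionally set on both bytes in the EUC-style form).

```python
def rowcell_to_bytes(row: int, cell: int, eighth_bit: bool = False) -> bytes:
    """Convert a 1-based row-cell pair to its two-byte form.

    eighth_bit=False gives the 7-bit ISO 2022 form (each byte 0x21..0x7E);
    eighth_bit=True gives the EUC-style form (each byte 0xA1..0xFE).
    """
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be in 1..94")
    offset = 0x20 | (0x80 if eighth_bit else 0x00)
    return bytes([row + offset, cell + offset])


def bytes_to_rowcell(pair: bytes) -> tuple[int, int]:
    """Inverse conversion; accepts either the 7-bit or the 8-bit form."""
    if len(pair) != 2:
        raise ValueError("expected exactly two bytes")
    if (pair[0] & 0x80) != (pair[1] & 0x80):
        raise ValueError("both bytes must share the same eighth-bit setting")
    # Clear the eighth bit, then subtract 32 (0x20) to get a number in 1..94.
    row, cell = ((b & 0x7F) - 0x20 for b in pair)
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("bytes fall outside the 94x94 plane")
    return row, cell


# A 94x94 plane has room for 94 * 94 = 8836 two-byte codes.
PLANE_CAPACITY = 94 * 94
```

For example, rowcell_to_bytes(16, 1) gives the bytes 0x30 0x21, and the same position with eighth_bit=True gives 0xB0 0xA1.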
Although the Korean writing system includes individual symbols for consonants and vowels, serving as an alphabet, Korean text is properly typeset with these symbols composed into blocks for each syllable. Wansung code included individual Korean syllable blocks separately, treating them as a large set of characters similarly to Hanja, and was first defined by the third edition of the South Korean standard KS C 5601. The first edition had defined an encoding of individual jamo, named N-byte Hangul, which allowed syllable blocks to be encoded as sequences but had not been adopted as widely as intended.
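The composition of jamo into syllable blocks can be illustrated with the arithmetic that Unicode uses for its precomposed Hangul syllables (19 initial consonants, 21 vowels, and 28 final positions including "no final"). This is Unicode's scheme, shown only as an illustration; it is not the layout of Wansung code, N-byte Hangul or KPS 9566, and the function names are hypothetical.

```python
import unicodedata

S_BASE = 0xAC00                            # first precomposed syllable in Unicode
L_BASE, V_BASE, T_BASE = 0x1100, 0x1161, 0x11A7   # conjoining jamo blocks
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28     # initials, vowels, finals (incl. none)
S_COUNT = L_COUNT * V_COUNT * T_COUNT      # 11172 possible modern syllables


def decompose(syllable: str) -> tuple[int, int, int]:
    """Return the (initial, vowel, final) indices of a precomposed syllable."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead, rest = divmod(index, V_COUNT * T_COUNT)
    vowel, tail = divmod(rest, T_COUNT)
    return lead, vowel, tail


def compose(lead: int, vowel: int, tail: int = 0) -> str:
    """Build a precomposed syllable from jamo indices (tail 0 means no final)."""
    return chr(S_BASE + (lead * V_COUNT + vowel) * T_COUNT + tail)


def to_jamo(syllable: str) -> str:
    """Spell a syllable out as a sequence of conjoining jamo."""
    lead, vowel, tail = decompose(syllable)
    jamo = chr(L_BASE + lead) + chr(V_BASE + vowel)
    if tail:
        jamo += chr(T_BASE + tail)
    return jamo


ttom = chr(0xB620)  # the syllable 똠
assert compose(*decompose(ttom)) == ttom
assert to_jamo(ttom) == unicodedata.normalize("NFD", ttom)
```

The sequence returned by to_jamo matches Unicode's canonical decomposition, which is why the final assertion can compare it against unicodedata.normalize.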
Wansung code did not encode all possible modern Korean syllables, only a selection of the 2350 most common; it allowed the remainder to be specified using combining sequences, but these were often not supported. An alternative South Korean encoding named Johab did encode all of them, and served as a competitor to Wansung for some time. Unified Hangul Code, introduced by Microsoft with Windows 95, extended EUC-KR by using double-byte codes that are invalid in EUC to represent all other syllables available in Johab. A similar approach was taken by the Mainland Chinese GBK encoding, which extended GB 2312 with support for Traditional Chinese and for less common Chinese characters by encoding them to double-byte codes invalid in EUC-CN.
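The distinction between codes inside the EUC plane and extension codes outside it can be checked with a short Python sketch. It relies on Python's built-in cp949 codec, which implements Unified Hangul Code; the helper name in_euc_plane is hypothetical. The Python standard library is assumed here not to ship a KPS 9566 codec, so the South Korean encoding is used for the demonstration.

```python
def in_euc_plane(pair: bytes) -> bool:
    """True if both bytes lie in 0xA1..0xFE, i.e. inside the 94x94 EUC plane."""
    return len(pair) == 2 and all(0xA1 <= b <= 0xFE for b in pair)


# A common syllable that is part of the Wansung repertoire.
common = "한".encode("cp949")

# U+B620 (똠) is one of the syllables missing from Wansung code, so
# Unified Hangul Code places it at a double-byte code outside the EUC plane.
rare = chr(0xB620).encode("cp949")

assert in_euc_plane(common)
assert len(rare) == 2 and not in_euc_plane(rare)
```

Because the extension codes reuse byte values that plain EUC never produces in a double-byte position, existing EUC-KR text keeps its meaning when read as Unified Hangul Code.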
South Korea was not the only country developing an ISO 2022 DBCS for Korean: the Mainland Chinese GB 12052 was published in 1989. This was not closely related to Wansung code, although it also included composed syllables. Instead, it corresponded to GB 2312 with Korean syllables replacing the Chinese characters, except for the inclusion of a dollar sign in place of a yuan sign. It was developed for use by the Korean minority in north-eastern China.
Likewise, North Korea developed KPS 9566. Although North Korea and South Korea both use Korean Chosŏn'gŭl as their primary writing system, they use different lexicographical orders. Hence, character ordering differs between Wansung code and KPS 9566.
KPS 9566 has undergone several revisions, including editions of 1997 and 2003, mainly to enhance compatibility with Unicode. Editions are commonly distinguished by specifying the year. The current edition as of the release of Red Star OS 3.0 appears to be KPS 9566-2011, which adds Kim Jong Un to the list of leaders. The publicly available code chart for the 1997 edition of KPS 9566 shows an ISO 2022 94×94 plane. The more recent editions, judging from the sources of information available outside of North Korea itself, appear to define additional allocations outside of the EUC plane.
Due to the interoperability issues arising from the use of multiple character encodings, some defined by national standards and others proprietary and specific to a platform or font, the Unicode standard was developed with the intent of allowing all representable text to be interchanged in a single, universal format. The first edition of Unicode was published in two volumes, in 1991 and 1992, and ISO/IEC 10646 was established in sync with Unicode in 1993. Unicode formats are preferred for international use on the World Wide Web, where legacy character encodings are treated as partial encodings of Unicode by means of mapping files.

Design

In principle, KPS 9566 is similar to the Wansung character set defined by the South Korean KS X 1001 standard, although the two are not compatible. Both encode a section of punctuation, symbols, jamo, kana and alphabetical characters, followed by a subset of the possible modern Chosŏn'gŭl syllables, followed by a section of Hanja. However, KPS 9566 uses a different ordering of jamo and syllables to conform with North Korean lexicographical ordering standards. KPS 9566 also includes 28 explicitly rotated punctuation characters for vertical typography, which KS X 1001 does not, and encodes each Hanja only once, whereas KS X 1001 encodes several Hanja with multiple readings multiple times.
KPS 9566-97 encodes a total of 2679 Chosŏn'gŭl syllables and 4653 Hanja. This provides better coverage than the 2350 syllables encoded by Wansung code: for instance, the 똠 character used in the name of 똠방각하, a noted Korean literary work, does not have an assigned Wansung codepoint, but has one in KPS 9566. The Hanja section includes 4652 characters from the Unified Repertoire and Ordering and one from CJK Unified Ideographs Extension A. The entirety of row 15, the latter half of row 44 and the latter half of row 94 may be used for user-defined purposes.
KPS 9566 is especially distinguished by its inclusion of several special characters from North Korean political life. Specifically, it includes the hammer, sickle and brush emblem of the Workers' Party of Korea, both uncircled and circled, and two groups of three special-purpose characters which spell out the names of the North Korean leaders Kim Il Sung and Kim Jong Il in a special decorative font. The syllables for Kim and Il, which are identical in the spelling of both names, are encoded twice. KPS 9566-2011 additionally includes the name of Kim Jong Un as code points 04-78 to 04-80.
Due to these special characters, there is currently no full round-trip compatibility between KPS 9566 and Unicode, unless unsupported characters are mapped to the Private Use Area.
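A Private Use Area mapping of the kind mentioned above can be sketched as a pair of lookup tables. Everything in this sketch is hypothetical: the function name, the choice of U+E000 as a starting point, and the resulting assignments do not reproduce any published KPS 9566 mapping table. The row-cell codes 04-78 to 04-80 are used purely as sample input.

```python
PUA_START = 0xE000  # first code point of the Basic Multilingual Plane PUA
PUA_END = 0xF8FF    # last code point of that Private Use Area


def build_pua_maps(unmapped_codes):
    """Assign consecutive PUA code points to codes lacking a Unicode equivalent.

    Returns (to_unicode, from_unicode) dictionaries so that a conversion
    to Unicode and back yields the original row-cell code.
    """
    codes = list(unmapped_codes)
    if len(codes) > PUA_END - PUA_START + 1:
        raise ValueError("too many codes for this Private Use Area")
    to_unicode = {code: chr(PUA_START + i) for i, code in enumerate(codes)}
    from_unicode = {ch: code for code, ch in to_unicode.items()}
    return to_unicode, from_unicode


# Sample input only: (row, cell) pairs for 04-78, 04-79 and 04-80.
sample = [(4, 78), (4, 79), (4, 80)]
to_unicode, from_unicode = build_pua_maps(sample)

# Round trip: every sample code survives conversion to Unicode and back.
assert all(from_unicode[to_unicode[code]] == code for code in sample)
```

Such a mapping only round-trips between systems that agree on the same private assignments, which is the usual limitation of Private Use Area code points.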

KPS 10721

North Korea also developed a second character set, KPS 10721 "Code of the supplementary Korean Hanja Set for Information Interchange", which was published in 2000. KPS 10721 encodes a set of at least 19469 Hanja additional to those included in KPS 9566. These did not all have mappings to Unicode, but included 10358 from the Unified Repertoire and Ordering, 3187 from CJK Unified Ideographs Extension A and 107 from CJK Compatibility Ideographs, as well as 5767 from CJK Unified Ideographs Extension B and 50 from CJK Compatibility Ideographs Supplement. All KPS 9566 Hanja are also included in KPS 10721, which uses a different encoding structure, unrelated to ISO 2022.
Besides the mapping of these Hanja to Unicode, little was known about the KPS 10721 standard outside of North Korea prior to 2022. North Korean reference glyphs were provided for only a subset of these Hanja in the Unicode code charts, due to a lack of suitable font data available to the Unicode Consortium. Unicode Hanja characters with KPS 9566 or KPS 10721 sources are nonetheless cross-referenced to their KPS codes in the Unihan database with the key kIRG_KPSource; the Unihan source codes use "KP0" to refer to KPS 9566 and "KP1" for KPS 10721.
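The cross-references described above can be read with a small parser. The sketch assumes the tab-separated three-field layout of Unihan data files (code point, key, value) and assumes that a kIRG_KPSource value consists of the "KP0" or "KP1" prefix, a hyphen and four hexadecimal digits. The function name is hypothetical, and the sample lines in the usage note are placeholders for illustration rather than quoted database entries.

```python
import re
from typing import Optional, Tuple

# Prefixes as described in the text: KP0 is KPS 9566, KP1 is KPS 10721.
SOURCE_NAMES = {"KP0": "KPS 9566", "KP1": "KPS 10721"}

# Assumed value syntax: prefix, hyphen, four uppercase hexadecimal digits.
_VALUE = re.compile(r"^(KP[01])-([0-9A-F]{4})$")


def parse_kp_source(line: str) -> Optional[Tuple[str, str, str]]:
    """Parse one Unihan-style line into (character, standard, hex code).

    Returns None for comments, other keys, and values in an unexpected form.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3 or fields[1] != "kIRG_KPSource":
        return None
    if not fields[0].startswith("U+"):
        return None
    match = _VALUE.match(fields[2])
    if match is None:
        return None
    character = chr(int(fields[0][2:], 16))
    return character, SOURCE_NAMES[match.group(1)], match.group(2)
```

With a placeholder line such as "U+4E00", a tab, "kIRG_KPSource", a tab and "KP0-FCD3", the function returns the character U+4E00 together with "KPS 9566" and "FCD3"; lines for other keys yield None.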
In 2022, a Hanja font was isolated from the North Korean Okpyon Android app, which was used to correct some errors in the KPS-10721-to-Unicode mapping data and to supply new North Korean reference glyphs for the Unicode code charts; while doing so, the mappings of KPS 9566 Hanja to KPS 10721 were also deduced. The existing reference glyphs were updated in April 2022, ready for the publication of Unicode 15 in September 2022. In November 2022, the Unicode Consortium's CJK and Unihan Group recommended that the Unicode Technical Committee include the additional reference glyphs in the next version of Unicode, Unicode 15.1, due in September 2023.