CJK Unified Ideographs

The Chinese, Japanese and Korean (CJK) scripts share a common background, and the characters they share are collectively known as CJK characters. During the process called Han unification, the common characters were identified and named CJK Unified Ideographs. As of Unicode 17.0, Unicode defines a total of 101,996 CJK Unified Ideographs.
The term ideographs is a misnomer, since the Chinese script is logographic rather than ideographic; it was chosen because it is more common in English.
Until the early 20th century, Vietnam also used Chinese characters, so sometimes the abbreviation CJKV is used.

UTC sources

The majority of characters submitted by the Unicode Technical Committee (UTC) to the Ideographic Research Group (IRG) are derived from UTC documents. Other sources include:

ABC Chinese-English Dictionary by John DeFrancis
The Adobe-CNS1 glyph collection
The Adobe-Japan1 glyph collection
A Complete Checklist of Species and Subspecies of Chinese Birds
The Great Nom Dictionary
Annotations to Shuowen Jiezi
GB18030-2000
Required Character List Supplied by the Church of Jesus Christ of Latter-day Saints
Commercial Press New Dictionary, Hong Kong
Modern Chinese Dictionary, by Chinese Academy of Social Sciences, Linguistics Research Institute, Dictionary Editorial Office
Working Group documents

WWW unicode.org/charts/unihan...

Ordering

The ordering of CJK Unified Ideographs within Unicode blocks was initially determined by consulting the following four dictionaries. The characters were arranged primarily in Kangxi Dictionary order. For characters not found in the Kangxi Dictionary, the other dictionaries were consulted, in the order listed, to determine which Kangxi Dictionary character they should follow in the ordering.

Kangxi Dictionary
Dai Kan-Wa Jiten
Hanyu Da Zidian
Dae Jaweon

This system is not used for more recently-added Unicode blocks. The Ideographic Research Group no longer uses the Dae Jaweon, nor the Dai Kan-Wa Jiten, in its work. The Kangxi Dictionary and Hanyu Da Zidian are still used both in existing character source references, and as potential replacements for existing source references discovered to be erroneous. Similarly, although a Kangxi Dictionary index was previously provided as part of the submission data for UTC-source characters, this is no longer the case. Instead, the stroke type of the first residual stroke is supplied with all submitted characters, and used to order characters with the same radical and stroke count within the new Unicode block.
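The ordering rule for newer blocks can be sketched as a three-part sort key: radical, then stroke count, then the type of the first residual stroke. The record layout, the integer coding of stroke types, and the sample values below are all hypothetical and chosen only for illustration; they are not taken from any actual submission.

```python
from typing import NamedTuple

class Submission(NamedTuple):
    # Hypothetical record; field names are illustrative only.
    label: str         # placeholder name for a submitted character
    radical: int       # Kangxi radical number
    strokes: int       # stroke count
    first_stroke: int  # type of the first residual stroke (assumed coded as a small integer)

def block_order(items):
    """Sort by radical, then stroke count, then first residual stroke type."""
    return sorted(items, key=lambda s: (s.radical, s.strokes, s.first_stroke))

sample = [
    Submission("c", 85, 4, 3),
    Submission("a", 9, 6, 2),
    Submission("d", 85, 4, 1),
    Submission("b", 9, 2, 5),
]
ordered = [s.label for s in block_order(sample)]
print(ordered)  # ['b', 'a', 'd', 'c']
```

Records "d" and "c" share a radical and stroke count, so only the first-stroke type separates them.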

CJK Unified Ideographs blocks

CJK Unified Ideographs

The basic block named CJK Unified Ideographs contains 20,992 basic Chinese characters in the range U+4E00 through U+9FFF. The block includes not only characters used in the Chinese writing system but also kanji used in the Japanese writing system, hanja used in Korean, and chữ Nôm characters used in Vietnamese. Many characters in this block are used in all three writing systems, while others are in only one or two of the three.
This block is also known as the Unified Repertoire and Ordering (URO), especially when it needs to be differentiated from the other CJK Unified Ideographs blocks.
The first 20,902 characters in the block are arranged according to the Kangxi Dictionary ordering of radicals. In this system the characters written with the fewest strokes are listed first. The remaining characters were added later, and so are not in radical order.
The block is the result of Han unification, which was somewhat controversial within East Asia. Since single characters used in more than one of Chinese, Japanese and Korean were coded in the same location, and the modern typographical conventions and handwriting curricula differ slightly between regions, the appearance of a selected glyph could depend on the particular font being used. However, the URO applies the source separation rule, meaning that pairs of characters treated as distinct in a character set used as a source for the URO would remain pairs of separate characters in the new Unicode encoding.
Using variation selectors, it is possible to specify certain variant CJK ideographs within Unicode. The Adobe-Japan1 character set, which has 14,684 ideographic variation sequences, is an extreme example of the use of variation selectors.
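As a minimal sketch of how a variation sequence is formed, the snippet below appends a selector from the Variation Selectors Supplement (U+E0100 through U+E01EF) to a base ideograph. The helper names `make_ivs` and `is_ivs` are hypothetical, the base character is an arbitrary choice, and the code does not check whether the resulting sequence is actually registered in any collection.

```python
# Selectors used in ideographic variation sequences: U+E0100..U+E01EF.
IVS_FIRST, IVS_LAST = 0xE0100, 0xE01EF

def make_ivs(base: str, selector_index: int) -> str:
    """Return base followed by the selector at the given 0-based index."""
    cp = IVS_FIRST + selector_index
    if not (IVS_FIRST <= cp <= IVS_LAST):
        raise ValueError("selector index out of range")
    return base + chr(cp)

def is_ivs(text: str) -> bool:
    """True if text is one character followed by one selector in the range above."""
    return len(text) == 2 and IVS_FIRST <= ord(text[1]) <= IVS_LAST

seq = make_ivs("\u845b", 0)  # arbitrary base U+845B plus U+E0100
print([f"U+{ord(c):04X}" for c in seq])  # ['U+845B', 'U+E0100']
```

A font that does not know the sequence would normally just show its default glyph for the base character.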

Charts

4E00–62FF,
6300–77FF,
7800–8CFF,
8D00–9FFF.

CJK Unified Ideographs Extension A

The block named CJK Unified Ideographs Extension A contains 6,592 additional characters in the range U+3400 through U+4DBF.

Charts

3400–4DBF.

CJK Unified Ideographs Extension B

The block named CJK Unified Ideographs Extension B contains 42,720 characters in the range U+20000 through U+2A6DF. These include most of the characters used in the Kangxi Dictionary that are not in the basic CJK Unified Ideographs block, as well as many Hán-Nôm characters that were formerly used to write Vietnamese.

Charts

20000–215FF,
21600–230FF,
23100–245FF,
24600–260FF,
26100–275FF,
27600–290FF,
29100–2A6DF.

CJK Unified Ideographs Extension C

The block named CJK Unified Ideographs Extension C contains 4,160 characters in the range U+2A700 through U+2B73F. It was initially added in Unicode 5.2.

Charts

2A700–2B73F.

CJK Unified Ideographs Extension D

The block named CJK Unified Ideographs Extension D contains 222 characters in the range U+2B740 through U+2B81D that were added in Unicode 6.0.

Charts

2B740–2B81F.

CJK Unified Ideographs Extension E

The block named CJK Unified Ideographs Extension E contains 5,774 characters in the range U+2B820 through U+2CEAD. It was originally added in Unicode 8.0.

Charts

2B820–2CEAF.

CJK Unified Ideographs Extension F

The block named CJK Unified Ideographs Extension F contains 7,473 characters in the range U+2CEB0 through U+2EBE0 that were added in Unicode 10.0. It includes more than 1,000 Sawndip characters for Zhuang.

Charts

2CEB0–2EBEF.

CJK Unified Ideographs Extension G

A block named CJK Unified Ideographs Extension G was added as part of Unicode 13.0 to the Tertiary Ideographic Plane in the range U+30000 through U+3134F, containing 4,939 characters.

Charts

30000–3134F.

CJK Unified Ideographs Extension H

A block named CJK Unified Ideographs Extension H was added as part of Unicode 15.0 to the Tertiary Ideographic Plane in the range U+31350 through U+323AF, containing 4,192 characters.

Charts

31350–323AF.

CJK Unified Ideographs Extension I

A block named CJK Unified Ideographs Extension I was added as part of Unicode 15.1 to the Supplementary Ideographic Plane in the range U+2EBF0 through U+2EE5F, containing 622 characters.

Charts

2EBF0–2EE5F.

CJK Unified Ideographs Extension J

A block named CJK Unified Ideographs Extension J was added as part of Unicode 17.0 to the Tertiary Ideographic Plane in the range U+323B0 through U+33479, containing 4,298 characters.

Charts

323B0–3347F.
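The block ranges given in the chart lists above can be collected into a small lookup table. The sketch below is illustrative only: the function names are hypothetical, the short block labels are abbreviations, and membership in a block range does not by itself mean that a code point is assigned.

```python
# (label, first, last) for each block, using the chart ranges listed above.
BLOCKS = [
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
    ("Extension A", 0x3400, 0x4DBF),
    ("Extension B", 0x20000, 0x2A6DF),
    ("Extension C", 0x2A700, 0x2B73F),
    ("Extension D", 0x2B740, 0x2B81F),
    ("Extension E", 0x2B820, 0x2CEAF),
    ("Extension F", 0x2CEB0, 0x2EBEF),
    ("Extension G", 0x30000, 0x3134F),
    ("Extension H", 0x31350, 0x323AF),
    ("Extension I", 0x2EBF0, 0x2EE5F),
    ("Extension J", 0x323B0, 0x3347F),
]

def cjk_block(ch: str):
    """Return the label of the block containing ch, or None."""
    cp = ord(ch)
    for label, first, last in BLOCKS:
        if first <= cp <= last:
            return label
    return None

# Number of code points spanned by each block range (last - first + 1).
sizes = {label: last - first + 1 for label, first, last in BLOCKS}

print(cjk_block("\u4e00"))              # CJK Unified Ideographs
print(cjk_block("\U00020000"))          # Extension B
print(sizes["CJK Unified Ideographs"])  # 20992
```

For the basic block and Extensions A, B and C the range size equals the character count stated in the text; for the other blocks the range is slightly larger than the number of characters.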

CJK Compatibility Ideographs

The block named CJK Compatibility Ideographs was created to retain round-trip compatibility with other standards.
However, twelve characters in this block actually have the "Unified Ideograph" property: U+FA0E 﨎, U+FA0F 﨏, U+FA11 﨑, U+FA13 﨓, U+FA14 﨔, U+FA1F 﨟, U+FA21 﨡, U+FA23 﨣, U+FA24 﨤, U+FA27 﨧, U+FA28 﨨, and U+FA29 﨩. None of the other characters in this and other "Compatibility" blocks relate to CJK unification.
While 龜 and 亀 are not considered unifiable, the compatibility ideograph U+F907 is considered a duplicate of 龜 (U+9F9C).
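The difference between a true compatibility ideograph and one of the twelve unified ideographs in this block can be observed through Unicode normalization. The sketch below uses Python's standard `unicodedata` module; U+F907 is picked here as a sample compatibility ideograph, and the results depend on the Unicode data bundled with the Python interpreter.

```python
import unicodedata

compat = "\uf907"  # sample code point from the CJK Compatibility Ideographs block
unified = unicodedata.normalize("NFC", compat)
print(f"U+{ord(compat):04X} -> U+{ord(unified):04X}")  # U+F907 -> U+9F9C

# U+FA0E is one of the twelve code points with the "Unified Ideograph" property,
# so it has no decomposition and survives normalization unchanged.
print(repr(unicodedata.decomposition("\ufa0e")))  # ''
print(unicodedata.normalize("NFC", "\ufa0e") == "\ufa0e")  # True
```

Because normalization replaces the compatibility code point, text that must preserve it cannot be normalized safely.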

Charts

F900–FAFF.

Known issues

Disunification

U+4039

The character U+4039 was a unification of two different characters until Unicode 5.0. However, they were lexically different characters that should not have been unified; they have different pronunciations and different meanings.
The proposal of disunification of U+4039 was accepted for Unicode 5.1, encoding a new character at U+9FC3 to represent shǎn.

Three other glyphs in Extension B

In CJK Unified Ideographs Extension B, some characters were incorrectly unified with others. These characters include U+2017B, U+204AF and U+24CB2. The first two wrongly unified the Chinese and Vietnamese sources of their glyphs, while the last wrongly unified the Chinese and Taiwanese sources.
The glyphs for U+2017B and U+204AF were corrected in Unicode 10.0, and the erroneous UCS2003 source glyph for U+24CB2 was removed in Unicode 13.0.

Unifiable variants and exact duplicates

Also in CJK Unified Ideographs Extension B, hundreds of glyph variants were encoded by mistake. Additionally, an ISO/IEC JTC 1/SC 2 report has found that six exact duplicates and two semi-duplicates were encoded by mistake:

U+34A8 㒨 = U+20457: U+20457 is the same as the China-source glyph for U+34A8, but it is significantly different from the Taiwan-source glyph for U+34A8
U+3DB7 㶷 = U+2420E: same glyph shapes
U+8641 虁 = U+27144: U+27144 is the same as the Korean-source glyph for U+8641, but it is significantly different from the mainland China-, Taiwan- and Japan-source glyphs for U+8641
U+204F2 = U+23515: same glyph shapes, but ordered under different radicals
U+249BC = U+249E9: same glyph shapes
U+24BD2 = U+2A415: same glyph shapes, but ordered under different radicals
U+26842 = U+26866: same glyph shapes
U+FA23 﨣 = U+27EAF: same glyph shapes

Other CJK ideographs in Unicode, not Unified

Apart from the eleven blocks of "Unified Ideographs", Unicode has about a dozen more blocks with not-unified CJK characters. These are mainly CJK radicals, strokes, punctuation, marks, symbols and compatibility characters. Although some characters have their counterparts in other blocks, the usages can be different. An example of a not-unified CJK character is U+3007 〇 in the CJK Symbols and Punctuation block. Although it is not covered under "CJK Unified Ideographs", it is treated as a CJK character for all other intents and purposes.
Four blocks of compatibility characters are included for compatibility with legacy text handling systems and older character sets:

They include forms of characters for vertical text layout and rich text characters that Unicode recommends handling through other means. Therefore, their use is discouraged.

Font support

The blocks CJK Unified Ideographs and CJK Unified Ideographs Extension A, being parts of the Basic Multilingual Plane, are supported by the majority of CJK fonts. However, Japanese and Korean fonts usually have fewer characters than Chinese fonts. Extensions B, C and D are supported by the additional fonts MingLiU-ExtB, MingLiU_HKSCS-ExtB, PMingLiU-ExtB and SimSun-ExtB, which have been included in Microsoft Windows since Vista.
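Since support tends to differ by plane, it can be useful to check which plane a character lives in. A minimal sketch, assuming only that the plane number is the code point divided by 0x10000; the helper name `plane_of` is hypothetical.

```python
# Plane numbers for the planes that hold CJK Unified Ideographs.
PLANE_NAMES = {0: "BMP", 2: "SIP", 3: "TIP"}

def plane_of(ch: str) -> str:
    """Return a short plane name for a single character."""
    index = ord(ch) >> 16  # each plane spans 0x10000 code points
    return PLANE_NAMES.get(index, f"plane {index}")

print(plane_of("\u4e00"))      # BMP
print(plane_of("\U00020000"))  # SIP
print(plane_of("\U00030000"))  # TIP
```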

Unicode version history

Unicode version	Addition	Plane	Characters added	Total characters
1.0	CJK Compatibility Ideographs	Basic Multilingual Plane	12	20,914
1.0	CJK Unified Ideographs	BMP	20,902	20,914
3.0	CJK Unified Ideographs Extension A	BMP	6,582	27,496
3.1	CJK Unified Ideographs Extension B	Supplementary Ideographic Plane	42,711	70,207
4.1	CJK Unified Ideographs: Ideographs from HKSCS-2004 and GB 18030-2000 not in ISO 10646	BMP	22	70,229
5.1	CJK Unified Ideographs: Ideographs from Adobe Japan and disunification of U+4039	BMP	8	70,237
5.2	CJK Unified Ideographs: Characters from ARIB #47, #95, #93 and HKSCS	BMP	8	74,394
5.2	CJK Unified Ideographs Extension C	SIP	4,149	74,394
6.0	CJK Unified Ideographs Extension D	SIP	222	74,616
6.1	CJK Unified Ideographs: Character corresponding to Adobe-Japan1-6 CID+20156	BMP	1	74,617
8.0	CJK Unified Ideographs	BMP	9	80,388
8.0	CJK Unified Ideographs Extension E	SIP	5,762	80,388
10.0	CJK Unified Ideographs	BMP	21	87,882
10.0	CJK Unified Ideographs Extension F	SIP	7,473	87,882
11.0	CJK Unified Ideographs	BMP	5	87,887
13.0	CJK Unified Ideographs	BMP	13	92,856
13.0	CJK Unified Ideographs Extension A	BMP	10	92,856
13.0	CJK Unified Ideographs Extension B	SIP	7	92,856
13.0	CJK Unified Ideographs Extension G	Tertiary Ideographic Plane	4,939	92,856
14.0	CJK Unified Ideographs	BMP	3	92,865
14.0	CJK Unified Ideographs Extension B	SIP	2	92,865
14.0	CJK Unified Ideographs Extension C	SIP	4	92,865
15.0	CJK Unified Ideographs Extension C	SIP	1	97,058
15.0	CJK Unified Ideographs Extension H	TIP	4,192	97,058
15.1	CJK Unified Ideographs Extension I	SIP	622	97,680
17.0	CJK Unified Ideographs Extension C	SIP	6	101,996
17.0	CJK Unified Ideographs Extension E	SIP	12	101,996
17.0	CJK Unified Ideographs Extension J	TIP	4,298	101,996
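The "Total characters" column is a running sum of the "Characters added" column, grouped by version. The following sketch recomputes it from the figures in the table above; the variable names are illustrative.

```python
# Characters added per Unicode version, copied from the table above.
ADDITIONS = [
    ("1.0", [12, 20902]),
    ("3.0", [6582]),
    ("3.1", [42711]),
    ("4.1", [22]),
    ("5.1", [8]),
    ("5.2", [8, 4149]),
    ("6.0", [222]),
    ("6.1", [1]),
    ("8.0", [9, 5762]),
    ("10.0", [21, 7473]),
    ("11.0", [5]),
    ("13.0", [13, 10, 7, 4939]),
    ("14.0", [3, 2, 4]),
    ("15.0", [1, 4192]),
    ("15.1", [622]),
    ("17.0", [6, 12, 4298]),
]

running = 0
totals = {}
for version, added in ADDITIONS:
    running += sum(added)      # all additions in one version count together
    totals[version] = running

print(totals["13.0"])  # 92856
print(totals["17.0"])  # 101996
```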