Big5

Big-5 or Big5 is a Chinese character encoding method used in Taiwan, Hong Kong, and Macau for traditional Chinese characters.
The People's Republic of China (PRC), which uses simplified Chinese characters, uses the GB 18030 character set instead.
Big5 gets its name from the consortium of five companies in Taiwan that developed it.

Encoding

The original Big5 character set is sorted first by usage frequency, second by stroke count, lastly by Kangxi radical.
The original Big5 character set lacked many commonly used characters. To solve this problem, each vendor developed its own extension. The ETen extension became part of the current Big5 standard through popularity.
The structure of Big5 does not conform to the ISO 2022 standard, but rather bears a certain similarity to the encoding. It is a double-byte character set (DBCS) with the following structure:

First byte	to
Second byte	to, to

Standard assignments do not use the bytes through, nor, as either lead or trail bytes. Bytes through are used for both lead and trail bytes for double-byte codes. Bytes through are used as trail bytes following a lead byte, or for single-byte codes otherwise. If the second byte is not in either range, behavior is unspecified. Additionally, certain variants of the Big5 character set, for example the HKSCS, use an expanded range for the lead byte, including values in the to range, whereas others use reduced lead byte ranges.
The numerical value of an individual Big5 code is frequently given as a 4-digit hexadecimal number, which describes the two bytes that comprise the Big5 code as if they were a big-endian representation of a 16-bit number. For example, the Big5 code for a full-width space, which consists of the bytes A1 and 40, is usually written as 0xA140 or just A140.
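As a small illustration of this notation, the following Python sketch converts between a two-byte sequence and its four-digit hexadecimal form. The helper names are hypothetical, and the final decoding step relies on Python's built-in `big5` codec, which is only one of several Big5 mappings.

```python
def bytes_to_code(pair: bytes) -> int:
    """Read a two-byte Big5 sequence as a big-endian 16-bit number."""
    if len(pair) != 2:
        raise ValueError("a Big5 double-byte code has exactly two bytes")
    return int.from_bytes(pair, "big")


def code_to_bytes(code: int) -> bytes:
    """Split a 16-bit Big5 code back into its lead and trail bytes."""
    return code.to_bytes(2, "big")


space = code_to_bytes(0xA140)
print(" ".join(f"{b:02X}" for b in space))   # A1 40
print(f"{bytes_to_code(space):04X}")         # A140
print(space.decode("big5") == "\u3000")      # True: ideographic (full-width) space
```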
Strictly speaking, the Big5 encoding contains only DBCS characters. However, in practice, the Big5 codes are always used together with an unspecified, system-dependent single-byte character set (SBCS), so that Big5-encoded text contains a mix of double-byte characters and single-byte characters. Bytes in the range to that are not part of a double-byte character are assumed to be single-byte characters.
The meaning of non-ASCII single bytes outside the permitted values that are not part of a double-byte character varies from system to system. In old MS-DOS-based systems, they are likely to be displayed as 8-bit characters; in modern systems, they are likely either to give unpredictable results or to generate an error.
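The mix of single-byte and double-byte units described above can be sketched as a simple scanner. This is an illustration only, not a reference decoder: the lead and trail byte ranges in the code are the values commonly cited for Big5 and are an assumption here, and the function name `split_units` is hypothetical. A high byte that cannot start a valid pair is passed through as a single byte, since its meaning is system-dependent.

```python
from typing import Iterator, Tuple

# Assumed ranges (commonly cited for Big5; variants such as HKSCS differ).
LEAD = range(0x81, 0xFF)                          # 0x81..0xFE
TRAIL = (range(0x40, 0x7F), range(0xA1, 0xFF))    # 0x40..0x7E and 0xA1..0xFE


def is_trail(b: int) -> bool:
    """True if b falls in one of the assumed trail-byte ranges."""
    return any(b in r for r in TRAIL)


def split_units(data: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield ('double', two bytes) or ('single', one byte) units in order."""
    i = 0
    while i < len(data):
        if data[i] in LEAD and i + 1 < len(data) and is_trail(data[i + 1]):
            yield ("double", data[i:i + 2])
            i += 2
        else:
            # ASCII, or a stray high byte left to the paired SBCS
            yield ("single", data[i:i + 1])
            i += 1


print(list(split_units(b"A\xa1\x40B")))
# [('single', b'A'), ('double', b'\xa1@'), ('single', b'B')]
```

Note that the trail byte 0x40 is also the ASCII code for "@", which is why a scanner has to track lead bytes rather than inspect bytes in isolation.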

A more detailed look at the organization

In the original Big5, the encoding is compartmentalized into different zones:

to	Reserved for user-defined characters 造字
to	"Graphical characters" 圖形碼
to	Reserved, not for user-defined characters
to	Frequently used characters 常用字
to	Reserved for user-defined characters
to	Less frequently used characters 次常用字
to	Reserved for user-defined characters

The "graphical characters" actually comprise punctuation marks, partial punctuation marks, dingbats, foreign characters, and other special characters.
In most vendor extensions, extended characters are placed in the various zones reserved for user-defined characters, each of which is normally regarded as associated with the preceding zone. For example, additional "graphical characters" would be expected to be placed in the – range, and additional logograms would be placed in either the – or the – range. Sometimes this is not possible because of the large number of extended characters to be added; for example, Cyrillic letters and Japanese kana have been placed in the zone associated with "frequently-used characters".

Duplicates

Big5 has encoded two duplicate characters: "兀" on 0xA461 and 0xC94A, "嗀" on 0xDCD1 and 0xDDFC.
Some encoding mappings also map the three Suzhou numerals, "〸", "〹" and "〺", in the graphical section to ideograph characters instead of CJK Symbols and Punctuation.
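One way software can cope with the two duplicate pairs is to fold each duplicate onto a single code before comparing text. The sketch below does this with a small lookup table. Treating the first-listed code of each pair as the canonical one is an assumption made for illustration, and the function names are hypothetical.

```python
# Duplicate code -> chosen canonical code, using the two pairs named above.
# Picking the lower code as canonical is an arbitrary choice for this sketch.
DUPLICATES = {
    0xC94A: 0xA461,  # second encoding of "兀"
    0xDDFC: 0xDCD1,  # second encoding of "嗀"
}


def canonical_code(code: int) -> int:
    """Return the canonical Big5 code, folding known duplicates together."""
    return DUPLICATES.get(code, code)


def same_character(a: int, b: int) -> bool:
    """True if two Big5 codes denote the same character after folding."""
    return canonical_code(a) == canonical_code(b)


print(same_character(0xA461, 0xC94A))  # True
print(same_character(0xA461, 0xDCD1))  # False
```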

What a Big5 code actually encodes

An individual Big5 code does not always represent a complete semantic unit. The Big5 codes of logograms are always logograms, but codes in the "graphical characters" section are not always complete "graphical characters". What Big5 encodes are particular graphical representations of characters, or parts of characters, that happen to fit in the space taken by two monospaced ASCII characters. This is a property of CJK double-byte character sets in general, and is not a problem unique to Big5.
To illustrate this point, consider the Big5 code . To English speakers this looks like an ellipsis and the Unicode standard identifies it as such; however, in Chinese, the ellipsis consists of six dots that fit in the space of two Chinese characters, so in fact there is no Big5 code for the Chinese ellipsis, and the Big5 code just represents half of a Chinese ellipsis. It represents only half of an ellipsis because the whole ellipsis should take the space of two Chinese characters, and in many DBCS systems one DBCS character must take exactly the space of one Chinese character.
Characters encoded in Big5 do not always represent things that can be readily used in plain text files. One example is the "citation mark", which, when used, must be typeset under the title of a literary work. Another example is the Suzhou numerals, which are a form of scientific notation that requires the number to be laid out in a 2-D form consisting of at least two rows.

The Matching SBCS

In practice, Big5 cannot be used without a matching SBCS, mostly for compatibility reasons. However, as with other CJK DBCS character sets, the SBCS to use has never been specified. Big5 has always been defined as a DBCS, though when used it must be paired with a suitable, unspecified SBCS and is therefore used as what some people call an MBCS; nevertheless, Big5 by itself, as defined, is strictly a DBCS.
Because the SBCS is unspecified, the SBCS used can theoretically vary from system to system. Nowadays, ASCII is the only SBCS one would realistically use. However, in old DOS-based systems, code page 437 (with its extra special symbols in the control code area, including position 127) was much more common. On a Macintosh system with the Chinese Language Kit, or on a Unix system running the cxterm terminal emulator, by contrast, the SBCS paired with Big5 would not be code page 437.
Outside the valid range of Big5, the old DOS-based systems would routinely interpret things according to the SBCS that is paired with Big5 on that system. In such systems, characters 127 to 160, for example, were very likely not avoided because they would produce invalid Big5, but used because they would be valid characters in code page 437.
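The effect of the paired SBCS on stray single bytes can be illustrated by decoding the same byte under two single-byte codecs. This is only a sketch: the helper name is hypothetical, and Python's `cp437` codec stands in for the DOS code page, so its handling of control positions may not match what a DOS display actually showed.

```python
def show_single_byte(value: int, sbcs: str) -> str:
    """Decode one byte under the given single-byte codec, or report failure."""
    try:
        return bytes([value]).decode(sbcs)
    except UnicodeDecodeError:
        return "<invalid>"


# 0x41 is plain ASCII; 0x80 and 0xA0 lie outside ASCII but inside code page 437.
for value in (0x41, 0x80, 0xA0):
    print(f"{value:#04x}",
          show_single_byte(value, "ascii"),
          show_single_byte(value, "cp437"))
```

Under `ascii` the two high bytes are rejected, whereas under `cp437` they decode to accented letters, which mirrors the point that such bytes were usable characters on DOS-based systems.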
The modern characterization of Big5 as an MBCS consisting of the DBCS of Big5 plus the SBCS of ASCII is therefore historically incorrect and potentially flawed, as the choice of the matching SBCS was, and theoretically still is, quite independent of the flavour of Big5 being used.

History

The inability of ASCII to support large Chinese, Japanese and Korean (CJK) character sets led governments and industry to find creative solutions to enable their languages to be rendered on computers. A variety of ad hoc and usually proprietary input methods led to efforts to develop a standard system. As a result, Big5 encoding was defined by the Institute for Information Industry of Taiwan in 1984.
The name "Big5" is in recognition that the standard emerged from collaboration of five of Taiwan's largest IT firms:

Big5 was rapidly popularized in Taiwan, and worldwide among Chinese speakers who used the traditional Chinese character set, through its adoption in several commercial software packages, notably the E-TEN Chinese DOS input system. The Republic of China government declared Big5 its standard in the mid-1980s since it was, by then, the de facto standard for using traditional Chinese on computers.

Extensions

The original Big-5 includes only the CJK logograms from the Charts of Standard Forms of Common National Characters and Less-Than-Common National Characters. It does not include characters used in people's names, place names, dialects, chemistry, or biology, nor Japanese kana. As a result, many programs that support Big-5 include extensions to address these gaps.
The plethora of variations makes UTF-8 a more consistent encoding for modern use.

Vendor extensions

ETen extensions

In the ETen Chinese operating system, the following code points are added to support some characters present in the IBM 5550's code page but absent from generic Big5:

0xA3C0–0xA3E0	33 control characters
0xC6A1–0xC875	circle 1–10, bracket 1–10, Roman numerals 1–9, CJK radical glyphs, Japanese hiragana, Japanese katakana, Cyrillic characters
0xF9D6–0xF9FE	the characters '碁', '銹', '恒', '裏', '墻', '粧' and '嫺', followed by 34 additional semigraphic symbols

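The three ETen ranges listed above lend themselves to a simple lookup. The following sketch labels a Big5 code by the ETen area it falls in, if any. The function name and the short labels are hypothetical, and the check compares only the 16-bit numeric value, so it does not verify that the trail byte is itself valid.

```python
from typing import Optional

# (first code, last code, short label), taken from the ranges listed above.
ETEN_RANGES = [
    (0xA3C0, 0xA3E0, "control characters"),
    (0xC6A1, 0xC875, "list markers, radicals, kana, Cyrillic"),
    (0xF9D6, 0xF9FE, "extra logograms and semigraphic symbols"),
]


def eten_area(code: int) -> Optional[str]:
    """Return a label for the ETen extension area containing code, or None."""
    for first, last, label in ETEN_RANGES:
        if first <= code <= last:
            return label
    return None


print(eten_area(0xF9D6))  # extra logograms and semigraphic symbols
print(eten_area(0xA140))  # None
```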
In some versions of ETen, there are extra graphical symbols and simplified Chinese characters.

Microsoft code pages

Microsoft created its own version of Big5 extension as code page 950 for use with Microsoft Windows, which supports the F9D6–F9FE code points from ETEN's extensions. In some versions of Windows, the euro currency symbol is mapped to Big-5 code point A3E1.
After installing Microsoft's on top of traditional Chinese Windows, applications using code page 950 automatically use a hidden code page 951 table. The table supports all code points in HKSCS-2001, except for the compatibility code points specified by the standard.

IBM code pages

In contrast to Microsoft's code page 950, IBM's CCSID 950 comprises single byte code page 1114 and double byte code page 947. It incorporates ETEN extensions for lead bytes,, and, while omitting those with lead byte, mapping them instead to the Private Use Area as user-defined characters. It also includes two non-ETEN extension regions with trail bytes, i.e. outside the usual Big5 trail byte range but similar to the Big5+ trail byte range: area 5 has lead bytes and contains IBM-selected characters, while area 9 has lead bytes and is a user-defined region.
IBM refers to the euro sign update of its Big-5 variant as CCSID 1370, which includes both single-byte and double-byte euro signs. It comprises single byte code page 1114 and double byte code page 947. For better compatibility with Microsoft's variant in IBM Db2, IBM also defines the pure double-byte code page 1372 and the associated variable-width CCSID 1373, which corresponds to Microsoft's code page 950.
IBM assigns CCSID 5471 to the HKSCS-2001 Big5 code page, CCSID 9567 to the HKSCS-2004 code page, and CCSID 13663 to the HKSCS-2008 code page, while CCSID 1375 is assigned to a growing HKSCS code page, currently equivalent to CCSID 13663.

ChinaSea font

ChinaSea fonts are Traditional Chinese fonts made by ChinaSea. The fonts are rarely sold separately, but are bundled with other products, such as the Chinese version of Microsoft Office 97. The fonts support Japanese kana, kokuji, and other characters missing in Big-5. As a result, the ChinaSea extensions have become more popular than the government-supported extensions. Some Hong Kong BBSes had used encodings in ChinaSea fonts before the introduction of HKSCS.

'Sakura' font

The 'Sakura' font was developed in Hong Kong and is designed to be compatible with HKSCS. It adds support for kokuji and proprietary dingbats not found in HKSCS.

Unicode-at-on

Unicode-at-on, formerly BIG5 extension, extends BIG-5 by altering code page tables, but uses the ChinaSea extensions starting with version 2. However, with the bankruptcy of ChinaSea, late development, and the increasing popularity of HKSCS and Unicode, the success of this extension is limited at best.
Despite the problems, characters previously mapped to Unicode Private Use Area are remapped to the standardized equivalents when exporting characters to Unicode format.

OPG

The web sites of the Oriental Daily News and Sun Daily, belonging to the Oriental Press Group Limited in Hong Kong, used a downloadable font with a different Big-5 extension coding than the HKSCS.

Official extensions

Taiwan Ministry of Education font

The Taiwan Ministry of Education supplied its own font, the Taiwan Ministry of Education font, for internal use.

Taiwan Council of Agriculture font

The Executive Yuan introduced a 133-character custom font, the Taiwan Council of Agriculture font, which includes 84 characters with the fish radical and 7 with the bird radical.

Big5+

The Chinese Foundation for Digitization Technology introduced Big5+ in 1997, which used over 20000 code points to incorporate all CJK logograms in Unicode 1.1. However, the extra code points exceeded the original Big-5 definition, preventing it from being installed on Microsoft Windows without new codepage files.

Big-5E

To allow Windows users to use custom fonts, the Chinese Foundation for Digitization Technology introduced Big-5E, which added 3954 characters and removed the Japanese kana from the ETEN extension. Unlike Big-5+, Big5E extends Big-5 within its original definition. Mac OS X 10.3 and later supports Big-5E in the fonts LiHei Pro and LiSong Pro.

Big5-2003

The Chinese Foundation for Digitization Technology made a Big5 definition and put it into CNS 11643 in note form, making it part of the official standard in Taiwan.
Big5-2003 incorporates all Big-5 characters introduced in the 1984 ETEN extensions and the Euro symbol. Cyrillic characters were not included because the authority claimed CNS 11643 does not include such characters.

CDP

Academia Sinica made a Chinese Data Processing font in the late 1990s. Its latest release, version 2.5, included 112,533 characters, somewhat fewer than the Mojikyo fonts.

HKSCS

Hong Kong also adopted Big5 for character encoding. However, written Cantonese has its own characters not available in the normal Big5 character set. To solve this problem, the Hong Kong Government created the Big5 extensions Government Chinese Character Set in 1995 and Hong Kong Supplementary Character Set in 1999. The Hong Kong extensions were commonly distributed as a patch. It is still being distributed as a patch by Microsoft, but a full Unicode font is also available from the Hong Kong Government's web site.
There are two encoding schemes of HKSCS: one encoding scheme is for the Big-5 coding standard and the other is for the ISO 10646 standard. Subsequent to the initial release, there are also HKSCS-2001 and HKSCS-2004. The HKSCS-2004 is aligned technically with the ISO/IEC 10646:2003 and its Amendment 1 published in April 2004 by the International Organization for Standardization.
HKSCS includes all the characters from the common ETen extension, plus some characters from simplified Chinese, place names, people's names, and Cantonese phrases.
The most recent edition of HKSCS is HKSCS-2016; however, the last edition of HKSCS to encode all of its characters in Big5 was HKSCS-2008, while the characters added in more recent editions are mapped to ISO 10646 / Unicode only. Similarly to Hong Kong's situation, there are also characters needed by Macao that are included in neither Big5 nor HKSCS; hence the Macao Supplementary Character Set (MSCS) was developed, comprising characters not found in Big5 or HKSCS. This set, however, is also not encoded in Big5. The first batch of 121 MSCS characters was submitted for inclusion in or mapping to Unicode in 2009, and the first final version of MSCS was established in 2020.

Kana and Cyrillic

There are two major Big5 extension layouts for encoding kana, Russian Cyrillic and list markers in the range 0xC6A1 through 0xC875. These are not compatible with one another. They are compared in the table below.
The ETEN layout of kana and Cyrillic is also used by the HKSCS and Unicode-At-On variants, as well as by IBM's version of code page 950, and the ETEN layout of the kana is also used by the Big5-2003 variant. The published mapping files for Windows-950 include neither, and this Big5 range is mapped to the Private Use Area by the Windows-950 implementation from International Components for Unicode. Python's built-in codec implementation uses the BIG5.TXT layout. The classic Mac OS version includes neither layout.
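A quick way to observe such differences on a given system is to decode the same code under several codecs and compare the results. The sketch below does this with three codecs that ship with Python (`big5`, `cp950` and `big5hkscs`). The helper name and the sample codes are arbitrary choices for illustration, and no particular output is claimed here because it depends on each codec's mapping tables.

```python
from typing import Optional


def decode_or_none(code: int, codec: str) -> Optional[str]:
    """Decode a 16-bit Big5 code with the given codec; None if unmapped."""
    try:
        return code.to_bytes(2, "big").decode(codec)
    except UnicodeDecodeError:
        return None


# Sample codes inside the 0xC6A1-0xC875 range discussed above.
for code in (0xC6A1, 0xC6E7, 0xC875):
    row = {codec: decode_or_none(code, codec)
           for codec in ("big5", "cp950", "big5hkscs")}
    print(f"{code:04X}", row)
```

A `None` entry means that codec has no mapping for the code, which is one way the incompatibility between layouts shows up in practice.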

Big5 codes 0xC6A1 through 0xC875	-	-