Modern Chinese characters


Modern Chinese characters are the Chinese characters used in modern languages, mostly in modern Chinese, and additionally in modern Japanese and Korean. Chinese characters are composed of components, which are in turn composed of strokes.
The 100 most frequently used characters cover over 40% of modern Chinese texts. The 1000 most frequently used characters cover approximately 90% of the texts.
Modern Chinese characters can be studied from several angles, including orthography, phonology, and semantics, as well as collation and organization, statistical analysis, computer processing, and pedagogy.

Background

Historical development

Since maturing as a complete writing system, Chinese characters have had an uninterrupted history of development over more than 3,000 years, passing through several stages that led to the modern written forms.
In 1980, Zhou Youguang, often considered to be the "father of pinyin", published a paper entitled "Introduction to the Studies of Modern Chinese Characters", in which he detailed the numbers, orders, forms, sounds, meanings, and pedagogy of the modern characters. His paper was followed by Gao Jiaying's "A Brief Discussion on the Establishment of Modern Chinese Character Studies" and other related writings on the subject. At least five textbooks have been published in this area.

Regional varieties

Chinese characters were originally invented for writing the Chinese language, and were later adopted for other East Asian languages, developing as part of a shared orthographic tradition. In current or historical use, simplified characters are primarily used in mainland China, Singapore, and Malaysia, and traditional characters in Taiwan, Hong Kong, and Macau, alongside kanji in Japan, hanja in Korea, and chữ Hán in Vietnam. For example, the traditional character 廣 has the simplified form 广 and the kanji form 広.

Characteristics

In contrast with the Latin alphabet used to write many languages, including English, Chinese characters have several distinctive properties:
  • Each character occupies a two-dimensional block.
  • A character may have dozens of strokes.
  • A character denotes a morpheme in most cases.
  • A character is normally read as one syllable.
  • Texts written in Chinese characters are intelligible to readers of different dialects and from different dynasties.

Number and sets

Because languages develop continuously, there is no definite number of modern Chinese characters. However, a reasonable estimate can be made by surveying the character sets of the relevant standard lists and influential dictionaries in the countries and regions where Chinese characters are used.

Mainland China

Earlier standards in the People's Republic of China included the List of Frequently Used Characters in Modern Chinese, totalling 3,500 characters, and the List of Commonly Used Characters in Modern Chinese.
The current standard is the List of Commonly Used Standard Chinese Characters, which was released by the State Council in June 2013 to replace the previous two lists and other standards. It includes 8,105 characters of the Simplified Chinese writing system, 3,500 as primary, 3,000 as secondary, and 1,605 as tertiary. In addition, there are 2,574 traditional characters and 1,023 variants.
The character sets of Xinhua Zidian and Xiandai Hanyu Cidian, the most popular modern Chinese character dictionary and word dictionary respectively, each include over 13,000 characters, counting simplified characters, traditional characters, and variants.

Taiwan

In Taiwan, the standards are the Chart of Standard Forms of Common National Characters, with 4,808 characters, and the Chart of Standard Forms of Less-Than-Common National Characters, with 6,341 characters. Both lists were released by the Ministry of Education, and together they contain 11,149 characters of the Traditional Chinese writing system.

Hong Kong

In Hong Kong, the standard is the List of Graphemes of Commonly-Used Chinese Characters for elementary and junior secondary education, totalling 4,762 characters. The list was released by the Education Bureau and is very influential in educational circles.

Japan

In Japan, the standard is the jōyō kanji, a list of 2,136 frequently used characters designated by the Japanese Ministry of Education, as well as 983 jinmeiyō kanji for use in personal names.

Korea

In Korea, the standards are the Basic Hanja for educational use and the Table of Hanja for Personal Name Use, published by the Supreme Court of Korea in March 1991. The latter list expanded gradually, and by 2015 there were 8,142 hanja permitted for use in Korean names.

Overall estimates

Taking all the character sets mentioned above into account, the total number of modern Chinese characters in the world is over 10,000, probably around 15,000. This estimate is not unduly rough, considering that Unicode contains over 90,000 Chinese characters in total, and the number would be higher still if every Chinese character that has ever appeared were included.
A college graduate who is literate in written Chinese knows between three and four thousand characters. Specialists in classical literature or history, who would often encounter characters no longer in use, are estimated to have a working vocabulary of between 5,000 and 6,000 characters.

Frequency

Chinese character frequencies are calculated from corpus data. A corpus is a collection of texts representative of one or more languages. The frequency of a character is the ratio of the number of its occurrences in the corpus to the total number of characters in the corpus. The formula for calculating frequency is
$F = n / N$, where $n$ is the number of times a certain Chinese character appears in the corpus, and $N$ is the total number of characters in the corpus.

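The frequency ratio defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name char_frequencies is hypothetical, and the sketch counts only characters in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), ignoring punctuation, Latin letters, and rarer extension blocks.

```python
from collections import Counter


def char_frequencies(corpus: str) -> dict[str, float]:
    """Return occurrences / total for each Chinese character in `corpus`."""
    # Assumption: only the basic CJK Unified Ideographs block counts as
    # "Chinese characters"; everything else (punctuation, spaces) is skipped.
    chars = [c for c in corpus if "\u4e00" <= c <= "\u9fff"]
    total = len(chars)          # total number of characters in the corpus
    if total == 0:
        return {}
    counts = Counter(chars)     # occurrences of each character
    return {c: n / total for c, n in counts.items()}


# Toy corpus of 7 Chinese characters plus one punctuation mark.
sample = "我的书是你的书。"
freq = char_frequencies(sample)
```

In this toy corpus, 的 occurs twice among seven characters, so its frequency is 2/7; a real study would use a corpus of hundreds of thousands of characters or more.
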
Origins

The first person to make a statistical study of the frequency of Chinese characters was Chen Heqin. In the 1920s, he and his assistants spent two years manually counting the characters in a corpus of 554,478 characters, and obtained 4,261 different characters with frequency information. They then compiled a book, Applied Lexis of Vernacular Chinese.
The 10 most frequently used characters in their corpus are, by descending frequency,
, , , , , , , , , .

CUHK survey

In 2001, the Chinese University of Hong Kong published a number of frequency lists on their website, entitled "Hong Kong, Mainland China and Taiwan Chinese Frequency: a Trans-regional Diachronic Survey". The frequency data came from a grand corpus with a number of sub-corpora representing the Chinese languages in the three regions of Hong Kong, mainland China and Taiwan and in the two time periods of the 1960s and 1980s–90s. Each sub-corpus includes about 5,000 different characters, as shown by their frequency lists.
From the data of these frequency lists, some important and interesting features of Chinese can be discovered:
  1. , and are the three most frequently used characters across the regions and time periods of the corpora. is number one in all the frequency lists.
  2. The 10 most frequently used characters across the three regions and two time periods are very consistent. That means a frequently used character in one region or period is very likely to be frequently used in another region or period.
  3. The 100 most frequently used characters in the 80s and 90s cover 41.00% of the Hong Kong texts of that period, 41.34% of the mainland texts, and 41.88% of the Taiwan texts. That is more than 4 out of every 10 characters for the three regions.
  4. The 1000 most frequently used characters in the 80s and 90s cover 89.25% of the Hong Kong texts of that period, 90.26% of the mainland texts, and 88.74% of the Taiwan texts.
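
Coverage figures such as those in items 3 and 4 are cumulative frequencies of the top-ranked characters. A minimal sketch of that calculation follows; the function name coverage_of_top and the toy corpus are hypothetical, and the character filter is the same simplifying assumption of counting only the basic CJK block.

```python
from collections import Counter


def coverage_of_top(corpus: str, k: int) -> float:
    """Share of the corpus covered by its k most frequent Chinese characters."""
    # Assumption: only U+4E00..U+9FFF is counted as Chinese text.
    counts = Counter(c for c in corpus if "\u4e00" <= c <= "\u9fff")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # most_common(k) returns the k highest-count (character, count) pairs.
    return sum(n for _, n in counts.most_common(k)) / total


# Toy corpus: 的 x4, 一 x3, 是 x2, 人 x1 (10 characters in total).
toy = "的的的的一一一是是人"
top1 = coverage_of_top(toy, 1)   # 4 of 10 characters
top2 = coverage_of_top(toy, 2)   # 7 of 10 characters
```

Applied to a real corpus with k = 100 or k = 1000, the same function would yield coverage rates comparable in kind to those reported in the survey, though the exact values depend on the corpus.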

Chinese government survey

Large-scale surveys by the Ministry of Education and the State Language Commission of the PRC over the years have shown that the use of Chinese characters and words follows a strong distribution pattern, with a small number of characters accounting for most usage. The number of characters used in modern Chinese has remained stable at about 10,000 for several years. About 590, 960, and 2,400 of the most frequently used characters are needed to reach coverage rates of 80%, 90%, and 99% respectively.
Chinese character frequency is essential to quantitative research on Chinese characters and has been applied to language teaching, dictionary compilation, the compilation of character lists, Chinese character information processing, and other fields.
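
The survey figures for 80%, 90%, and 99% coverage answer the inverse question: how many of the most frequent characters are needed to reach a given coverage rate. One possible way to compute this is sketched below; chars_needed is a hypothetical helper, not the survey's own method, and it again counts only the basic CJK block.

```python
from collections import Counter


def chars_needed(corpus: str, target: float) -> int:
    """Smallest number of top-ranked characters whose coverage reaches `target`."""
    # Assumption: only U+4E00..U+9FFF is counted as Chinese text.
    counts = Counter(c for c in corpus if "\u4e00" <= c <= "\u9fff")
    total = sum(counts.values())
    covered = 0
    # Walk the characters from most to least frequent, accumulating counts.
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= target:
            return rank
    return len(counts)  # target not reachable (or empty corpus)


# Toy corpus: 的 x4, 一 x3, 是 x2, 人 x1 (10 characters in total).
toy = "的的的的一一一是是人"
half = chars_needed(toy, 0.5)    # 的 alone covers 40%, 的 and 一 cover 70%
```

On this toy corpus two characters suffice for 50% coverage; on a large modern corpus the same loop would be expected to return values on the order of those reported above.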

Orders

The orders or sorting methods of Chinese dictionaries and other lists of text entries are traditionally divided into three categories: form-based orders, sound-based orders and meaning-based orders. In modern Chinese, people also use frequency orders.

Form-based

In form-based ordering, characters and words are sorted according to various features of the forms or shapes of Chinese characters. Compared to sound-based orders, form-based orders have the advantage of allowing lookup of characters and words without knowing their pronunciations, and they can collate large character sets effectively without support from other sorting methods. There are two subcategories of form-based orders: stroke-based orders and component-based orders, the latter including radical-based orders, among others.
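
As a rough illustration of a stroke-based order, the sketch below sorts characters by stroke count. The STROKES table is a tiny hand-entered sample rather than a real data source, and breaking ties by Unicode code point is an assumption made for simplicity; published stroke-based orders typically break ties by stroke shape instead.

```python
# Hypothetical, hand-entered stroke counts for a few characters.
STROKES = {"一": 1, "人": 2, "大": 3, "天": 4, "中": 4, "国": 8}


def stroke_key(ch: str) -> tuple[int, int]:
    """Sort key for one character: stroke count, then code point as tie-break."""
    return (STROKES[ch], ord(ch))


def word_key(word: str) -> tuple[tuple[int, int], ...]:
    """Sort key for a word: compare character by character."""
    return tuple(stroke_key(ch) for ch in word)


chars_sorted = sorted("国天中大人一", key=stroke_key)
words_sorted = sorted(["中国", "大人", "一天"], key=word_key)
```

Note that no pronunciation data is needed, which reflects the lookup advantage of form-based orders described above.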

Sound-based

There are two major sound representation systems for Standard Chinese: pinyin and bopomofo. Accordingly, there is a pinyin alphabetical order and a bopomofo-based order.
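
A pinyin alphabetical order can be sketched by sorting on the romanized syllable and then on the tone. The PINYIN table below is a small hand-entered sample with one reading per character, which is a simplification since many characters have several readings; a bopomofo-based order would need a different key and is not shown.

```python
# Hypothetical sample: one (syllable, tone number) reading per character.
PINYIN = {
    "大": ("da", 4),
    "答": ("da", 2),
    "人": ("ren", 2),
    "天": ("tian", 1),
    "一": ("yi", 1),
    "中": ("zhong", 1),
}


def pinyin_key(ch: str) -> tuple[str, int]:
    """Sort key: syllable in alphabetical order, then tone number."""
    return PINYIN[ch]


by_sound = sorted(["中", "一", "大", "答", "人", "天"], key=pinyin_key)
```

Here 答 (da, tone 2) sorts before 大 (da, tone 4) because the syllables are identical and the lower tone number comes first.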

Meaning-based

Meaning-based orders, also called semantics-based orders, arrange characters and words in a hierarchical structure of semantic categories.