Universal Character Set characters
The Unicode Consortium and ISO/IEC JTC 1/SC 2/WG 2 collaborate on the list of characters in the Universal Coded Character Set (UCS). The UCS, most commonly called the Universal Character Set, is an international standard that maps characters (discrete symbols used in natural language, mathematics, music, and other domains) to unique machine-readable data values. By creating this mapping, the UCS enables computer software vendors to interoperate and to interchange UCS-encoded text strings with one another. Because it is a universal map, it can represent multiple languages at the same time. This avoids the confusion of multiple legacy character encodings, in which the same sequence of codes can have several interpretations depending on the encoding in use, producing mojibake if the wrong one is chosen.
The UCS has a potential capacity of over 1 million characters. Each UCS character is abstractly represented by a code point, an integer between 0 and 1,114,111, which stands for the character within the internal logic of text processing software. As of the Unicode version released in September 2025, 303,808 of these code points are allocated, 159,866 have been assigned characters, 137,468 are reserved for private use, 2,048 are used to enable the mechanism of surrogates, and 66 are designated as noncharacters, leaving the remaining 810,304 unallocated. The number of encoded characters is made up as follows:
- 159,629 graphical characters
- 237 special purpose characters for control and formatting.
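The bounds and tallies above can be checked with a few lines of Python. This is only a sketch: the counts are copied from the text, and the constant names are illustrative rather than taken from any standard.

```python
import sys

# Highest code point named in the text (hexadecimal 10FFFF).
MAX_CODE_POINT = 1_114_111

# Counts quoted in the text above.
GRAPHICAL = 159_629
SPECIAL_PURPOSE = 237
ALLOCATED = 303_808
UNALLOCATED = 810_304

# Python strings use the same code point range.
print(sys.maxunicode == MAX_CODE_POINT)   # True
print(hex(MAX_CODE_POINT))                # 0x10ffff

# Graphical plus special purpose characters gives the assigned total.
print(GRAPHICAL + SPECIAL_PURPOSE)        # 159866

# Allocated plus unallocated covers every code point, counting from 0.
print(ALLOCATED + UNALLOCATED == MAX_CODE_POINT + 1)  # True

# chr() rejects integers above the code point range.
try:
    chr(MAX_CODE_POINT + 1)
except ValueError:
    print("1,114,112 is not a code point")
```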
In addition to the UCS, the supplementary Unicode Standard provides other implementation details such as:
- mappings between UCS and other character sets
- different collations of characters and character strings for different languages
- an algorithm for laying out bidirectional text, where text on the same line may shift between left-to-right and right-to-left
- a case-folding algorithm
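Two of these facilities are visible from Python's standard library, which gives a rough feel for what they do. The sketch below assumes that `str.casefold()` is an adequate stand-in for the case-folding algorithm and that `unicodedata.bidirectional()` exposes the per-character class that the bidirectional layout algorithm consumes. It does not implement the layout algorithm itself, and `caseless_equal` is an illustrative helper name.

```python
import unicodedata


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings after Unicode case folding."""
    return a.casefold() == b.casefold()


# Case folding maps the German sharp s to "ss", so these compare equal.
print(caseless_equal("Straße", "STRASSE"))  # True

# Bidirectional classes: left-to-right, right-to-left, Arabic letter.
print(unicodedata.bidirectional("a"))       # L
print(unicodedata.bidirectional("\u05D0"))  # R   (Hebrew letter alef)
print(unicodedata.bidirectional("\u0627"))  # AL  (Arabic letter alef)
```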
The UCS can be divided in various ways, such as by plane, block, character category, or character property.
Character reference overview
An HTML or XML numeric character reference refers to a character by its Universal Character Set/Unicode code point, and uses the format &#nnnn; or &#xhhhh;
where nnnn is the code point in decimal form, and hhhh is the code point in hexadecimal form. The x must be lowercase in XML documents. The nnnn or hhhh may be any number of digits and may include leading zeros. The hhhh may mix uppercase and lowercase, though uppercase is the usual style.
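The decimal and hexadecimal forms can be generated and decoded with Python's standard library. This is a minimal sketch: `numeric_reference` is a hypothetical helper written for this example, while `html.unescape` is the standard library decoder for HTML character references.

```python
import html


def numeric_reference(char: str, hexadecimal: bool = False) -> str:
    """Return the decimal or hexadecimal numeric character reference for one character."""
    code_point = ord(char)
    if hexadecimal:
        # Lowercase x (required by XML), uppercase digits (the usual style).
        return f"&#x{code_point:X};"
    return f"&#{code_point};"


print(numeric_reference("€"))                    # &#8364;
print(numeric_reference("€", hexadecimal=True))  # &#x20AC;

# Leading zeros and mixed-case hexadecimal digits decode to the same character.
print(html.unescape("&#8364; &#x20AC; &#x0020ac;"))  # € € €
```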
In contrast, a character entity reference refers to a character by the name of an entity which has the desired character as its replacement text. The entity must either be predefined or explicitly declared in a Document Type Definition. The format is the same as for any entity reference: &name;
where name is the case-sensitive name of the entity. The semicolon is required.
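The case sensitivity of entity names is easy to see with the entities predefined for HTML. The sketch below relies on Python's `html.entities.name2codepoint` table, which covers the HTML 4 entity set. It does not cover entities declared in a custom Document Type Definition.

```python
import html
from html.entities import name2codepoint

# "Agrave" and "agrave" are different entities that name different characters.
print(name2codepoint["Agrave"], name2codepoint["agrave"])  # 192 224
print(html.unescape("&Agrave; &agrave;"))                  # À à

# A named reference and a numeric reference can denote the same code point.
print(html.unescape("&euro;") == html.unescape("&#8364;"))  # True
```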
Planes
Unicode and ISO divide the set of code points into 17 planes, each capable of containing 65,536 distinct characters, for 1,114,112 in total. As of 2025, ISO and the Unicode Consortium have allocated characters and blocks in only seven of the 17 planes. The others remain empty and reserved for future use.
Most characters are currently assigned to the first plane: the Basic Multilingual Plane. This is to help ease the transition for legacy software, since the Basic Multilingual Plane is addressable with just two octets. The characters outside the first plane usually have very specialized or rare use.
Each plane corresponds to the value of the one or two hexadecimal digits preceding the final four: hence U+24321 is in Plane 2, U+4321 is in Plane 0, and U+10A200 would be in Plane 16. Within one plane, the range of code points is hexadecimal 0000 to FFFF, yielding a maximum of 65,536 code points. Planes restrict code points to a subset of that range.
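Because the plane is carried by the digits above the final four hexadecimal digits, it can be recovered with a 16-bit shift. The following sketch uses illustrative function names and the three code points quoted above.

```python
PLANE_COUNT = 17
PLANE_SIZE = 0x10000  # 65,536 code points per plane


def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains the code point."""
    if not 0 <= code_point < PLANE_COUNT * PLANE_SIZE:
        raise ValueError(f"not a code point: {code_point:#x}")
    return code_point >> 16


def offset_in_plane(code_point: int) -> int:
    """Return the position of the code point within its plane (0x0000 to 0xFFFF)."""
    return code_point & 0xFFFF


for cp in (0x24321, 0x4321, 0x10A200):
    print(f"U+{cp:04X} -> plane {plane_of(cp)}, offset {offset_in_plane(cp):04X}")
# U+24321 -> plane 2, offset 4321
# U+4321 -> plane 0, offset 4321
# U+10A200 -> plane 16, offset A200

print(PLANE_COUNT * PLANE_SIZE)  # 1114112
```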
Blocks
Unicode adds a block property to the UCS that further divides each plane into separate blocks. Each block is a grouping of characters by their use, such as "mathematical operators" or "Hebrew script characters". When assigning characters to previously unassigned code points, the Consortium typically allocates entire blocks of similar characters: for example, all the characters belonging to the same script, or all similarly purposed symbols, get assigned to a single block. Blocks may also retain unassigned or reserved code points when the Consortium expects a block to require additional assignments.
The first 256 code points in the UCS correspond to those of ISO 8859-1, the most popular 8-bit character encoding in the Western world. As a result, the first 128 characters are also identical to ASCII. Though Unicode treats these as two Latin script blocks (Basic Latin and Latin-1 Supplement), they contain many characters that are commonly useful outside of the Latin script. In general, not all characters in a given block need be of the same script, and a given script can occur in several different blocks.
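Python's standard library has no block lookup, so the sketch below uses a small hand-picked table of four blocks to show how a range lookup works. The table is far from complete, and `block_of` is an illustrative name. The last lines check the correspondence with ISO 8859-1 and ASCII using Python's `latin-1` and `ascii` codecs.

```python
from bisect import bisect_right
from typing import Optional

# (first code point, last code point, block name); a tiny illustrative subset.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x2200, 0x22FF, "Mathematical Operators"),
]
STARTS = [first for first, _, _ in BLOCKS]


def block_of(char: str) -> Optional[str]:
    """Return the block name for a character, or None if it is not in the table."""
    cp = ord(char)
    index = bisect_right(STARTS, cp) - 1
    if index >= 0 and BLOCKS[index][0] <= cp <= BLOCKS[index][1]:
        return BLOCKS[index][2]
    return None


print(block_of("A"))       # Basic Latin
print(block_of("é"))       # Latin-1 Supplement
print(block_of("\u05D0"))  # Hebrew
print(block_of("\u2211"))  # Mathematical Operators
print(block_of("\u03A9"))  # None (Greek is not in this small table)

# Decoding every ISO 8859-1 byte yields the first 256 code points in order.
print(bytes(range(256)).decode("latin-1") == "".join(map(chr, range(256))))  # True
print(bytes(range(128)).decode("ascii") == "".join(map(chr, range(128))))    # True
```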
Categories
Unicode assigns to every UCS character a general category and subcategory. The general categories are: letter, mark, number, punctuation, symbol, separator, and other (which includes control characters).
Types include:
- Modern, Historic, and Ancient Scripts. As of 2025, the UCS identifies 172 scripts that are, or have been, used throughout the world. Many more are in various approval stages for future inclusion in the UCS.
- International Phonetic Alphabet. The UCS devotes several blocks to characters for the International Phonetic Alphabet.
- Combining Diacritical Marks. An important advance conceived by Unicode in designing the UCS and related algorithms for handling text was the introduction of combining diacritical marks. By providing accents that can combine with any letter character, Unicode and the UCS significantly reduce the number of characters needed. While the UCS also includes precomposed characters, these were included primarily to facilitate support within the UCS for non-Unicode text processing systems.
- Punctuation. Along with unifying diacritical marks, the UCS also sought to unify punctuation across scripts. Many scripts nonetheless have their own punctuation characters, where that punctuation has no counterpart with similar semantics in other scripts.
- Symbols. Many mathematics, technical, geometrical and other symbols are included within the UCS. This provides distinct symbols with their own code point or character rather than relying on switching fonts to provide symbolic glyphs.
- * Currency.
- * Letterlike. These symbols look like combinations of common Latin-script letters. Unicode designates many of the letterlike symbols as compatibility characters, usually because they can be represented in plain text by substituting a single glyph for a composed sequence of characters.
- * Number Forms. Number forms primarily consist of precomposed fractions and Roman numerals. As with other composing sequences of characters, the Unicode approach prefers the flexibility of composing fractions by combining characters together: to create a fraction, one combines digits with the fraction slash character. The UCS includes nineteen precomposed fraction characters, but there are infinitely many possible fractions, and no character set could include a code point for every one of them. With composing characters, any fraction can be written using just 11 characters (the ten digits and the fraction slash). Ideally a text system should present the same glyphs for a fraction whether it is one of the precomposed fractions or a composing sequence of characters, so that precomposed fractions and composed fractions appear compatible next to each other. However, web browsers are typically not that sophisticated in their Unicode and text handling.
- * Arrows.
- * Mathematical.
- * Geometric Shapes.
- * Legacy Computing.
- * Control Pictures. Graphical representations of many control characters.
- * Box Drawing.
- * Block Elements.
- * Braille Patterns.
- * Optical Character Recognition.
- * Technical.
- * Dingbats.
- * Miscellaneous Symbols.
- * Emoticons.
- * Symbols and Pictographs.
- * Alchemical Symbols.
- * Game Pieces.
- * Chess Symbols.
- * Tai Xuan Jing.
- * Yijing Hexagram Symbols.
- CJK. Devoted to ideographs and other characters to support languages in China, Japan, Korea, Taiwan, Vietnam, and Thailand.
- * Radicals and Strokes.
- * Ideographs. By far the largest portion of the UCS is devoted to ideographs used in languages of Eastern Asia. While the glyph representations of these ideographs have diverged in the languages that use them, the UCS unifies these Han characters in what Unicode refers to as Unihan. With Unihan, the text layout software must work together with the available fonts and these Unicode characters to produce the appropriate glyph for the appropriate language. Despite unifying these characters, the UCS still includes over 101,000 Unihan ideographs.
- Musical Notation.
- Duployan shorthands.
- Sutton SignWriting.
- Compatibility Characters. Several blocks in the UCS are devoted almost entirely to compatibility characters. Compatibility characters are those included for support of legacy text handling systems that do not make a distinction between character and glyph the way Unicode does. For example, many Arabic letters are represented by a different glyph when the letter appears at the end of a word than when the letter appears at the beginning of a word. Unicode's approach prefers to have these letters mapped to the same character for ease of internal machine text processing and storage. To complement this approach, the text software must select different glyph variants for display of the character based on its context. Over 4000 characters are included for such compatibility reasons.
- Control Characters.
- Surrogates. The UCS includes 2,048 code points in the Basic Multilingual Plane (BMP) for surrogate code point pairs. Together these surrogates allow any code point in the sixteen other planes to be addressed by using two surrogate code points. This provides a simple built-in method for encoding the 20.1-bit UCS within a 16-bit encoding such as UTF-16. In this way UTF-16 can represent any character within the BMP with a single 16-bit word. Characters outside the BMP are encoded as surrogate pairs, using two 16-bit words.
- Private Use. The Consortium provides several private use blocks and planes in which characters can be assigned by various communities, as well as by operating system and font vendors.
- Noncharacters. The Consortium guarantees that certain code points will never be assigned a character and calls these code points noncharacters. These include the range U+FDD0..U+FDEF, and the last two code points of each plane.
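Several entries in the list above describe behavior that can be checked directly. The sketch below uses Python's `unicodedata` module for general categories and normalization, and two illustrative helpers, `surrogate_pair` and `is_noncharacter`, written for this example from the ranges given in the text. It assumes the 2,048 surrogates occupy U+D800..U+DFFF, with the first half used as lead surrogates.

```python
import unicodedata

# General category codes: the first letter is the category, the second the subcategory.
for ch in ("A", "\u0301", "5", "!", "€", "\n"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch))
# Lu, Mn, Nd, Po, Sc, Cc respectively

# Combining marks: a letter followed by a combining acute accent composes to é.
print(unicodedata.normalize("NFC", "e\u0301") == "\u00E9")  # True

# Compatibility characters: the precomposed fraction one half decomposes to
# digit one, fraction slash (U+2044), digit two.
print(unicodedata.normalize("NFKD", "\u00BD") == "1\u20442")  # True


def surrogate_pair(code_point: int) -> tuple:
    """Split a code point outside the BMP into its two UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP use surrogates")
    offset = code_point - 0x10000               # a 20-bit value
    lead = 0xD800 + (offset >> 10)              # top 10 bits
    trail = 0xDC00 + (offset & 0x3FF)           # bottom 10 bits
    return lead, trail


print([f"{unit:04X}" for unit in surrogate_pair(0x1F600)])  # ['D83D', 'DE00']


def is_noncharacter(code_point: int) -> bool:
    """True for U+FDD0..U+FDEF and for the last two code points of each plane."""
    return 0xFDD0 <= code_point <= 0xFDEF or (code_point & 0xFFFE) == 0xFFFE


print(sum(is_noncharacter(cp) for cp in range(0x110000)))  # 66
```

The count of 66 is the 32 code points in U+FDD0..U+FDEF plus two for each of the 17 planes.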