Numeric character reference


A numeric character reference (NCR) is a common markup construct used in SGML and SGML-derived markup languages such as HTML and XML. It consists of a short sequence of characters that, in turn, represents a single character. Since WebSGML, XML and HTML 4, the code points of the Universal Character Set (UCS) of Unicode are used. NCRs are typically used to represent characters that are not directly encodable in a particular document. When the document is interpreted by a markup-aware reader, each NCR is treated as if it were the character it represents.

Examples

In SGML, HTML, and XML, the following are all valid numeric character references for the Greek capital letter Sigma:
Unicode character   Numerical base   Numerical reference in markup   Effect
U+03A3              Decimal          &#931;                          Σ
U+03A3              Decimal          &#0931;                         Σ
U+03A3              Hexadecimal      &#x3A3;                         Σ
U+03A3              Hexadecimal      &#x03A3;                        Σ
U+03A3              Hexadecimal      &#x3a3;                         Σ

In SGML, HTML, and XML, the following are all valid numeric character references for the Latin capital letter AE:
Unicode character   Numerical base   Numerical reference in markup   Effect
U+00C6              Decimal          &#198;                          Æ
U+00C6              Hexadecimal      &#xC6;                          Æ

In SGML, HTML, and XML, the following are all valid numeric character references for the Latin small letter sharp s:
Unicode character   Numerical base   Numerical reference in markup   Effect
U+00DF              Decimal          &#223;                          ß
U+00DF              Hexadecimal      &#xDF;                          ß
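
The equivalence of these spellings can be checked with a short sketch. It assumes Python's standard `html.unescape` as a stand-in for a markup-aware reader; the `references` table and the `decode_all` helper are names invented for this illustration.

```python
import html

# Each list holds spellings that should all decode to the same character.
references = {
    "\u03a3": ["&#931;", "&#0931;", "&#x3A3;", "&#x03A3;", "&#x3a3;"],  # Sigma
    "\u00c6": ["&#198;", "&#xC6;"],                                     # AE
    "\u00df": ["&#223;", "&#xDF;"],                                     # sharp s
}


def decode_all(refs):
    """Decode every reference in the list with html.unescape."""
    return [html.unescape(ref) for ref in refs]


for expected, refs in references.items():
    decoded = decode_all(refs)
    # Leading zeros and the case of hexadecimal digits do not change the result.
    print(f"U+{ord(expected):04X}", refs, "->", decoded)
```

Leading zeros and the case of the hexadecimal digits make no difference to the decoded character, which is why several spellings are valid for one code point.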

List of numeric character references for the printable ASCII characters:
Unicode character   Decimal reference   Hexadecimal reference   Effect
U+0020              &#32;               &#x20;                  (space)
U+0021              &#33;               &#x21;                  !
U+0022              &#34;               &#x22;                  "
U+0023              &#35;               &#x23;                  #
U+0024              &#36;               &#x24;                  $
U+0025              &#37;               &#x25;                  %
U+0026              &#38;               &#x26;                  &
U+0027              &#39;               &#x27;                  '
U+0028              &#40;               &#x28;                  (
U+0029              &#41;               &#x29;                  )
U+002A              &#42;               &#x2A;                  *
U+002B              &#43;               &#x2B;                  +
U+002C              &#44;               &#x2C;                  ,
U+002D              &#45;               &#x2D;                  -
U+002E              &#46;               &#x2E;                  .
U+002F              &#47;               &#x2F;                  /
U+0030              &#48;               &#x30;                  0
U+0031              &#49;               &#x31;                  1
U+0032              &#50;               &#x32;                  2
U+0033              &#51;               &#x33;                  3
U+0034              &#52;               &#x34;                  4
U+0035              &#53;               &#x35;                  5
U+0036              &#54;               &#x36;                  6
U+0037              &#55;               &#x37;                  7
U+0038              &#56;               &#x38;                  8
U+0039              &#57;               &#x39;                  9
U+003A              &#58;               &#x3A;                  :
U+003B              &#59;               &#x3B;                  ;
U+003C              &#60;               &#x3C;                  <
U+003D              &#61;               &#x3D;                  =
U+003E              &#62;               &#x3E;                  >
U+003F              &#63;               &#x3F;                  ?
U+0040              &#64;               &#x40;                  @
U+0041              &#65;               &#x41;                  A
U+0042              &#66;               &#x42;                  B
U+0043              &#67;               &#x43;                  C
U+0044              &#68;               &#x44;                  D
U+0045              &#69;               &#x45;                  E
U+0046              &#70;               &#x46;                  F
U+0047              &#71;               &#x47;                  G
U+0048              &#72;               &#x48;                  H
U+0049              &#73;               &#x49;                  I
U+004A              &#74;               &#x4A;                  J
U+004B              &#75;               &#x4B;                  K
U+004C              &#76;               &#x4C;                  L
U+004D              &#77;               &#x4D;                  M
U+004E              &#78;               &#x4E;                  N
U+004F              &#79;               &#x4F;                  O
U+0050              &#80;               &#x50;                  P
U+0051              &#81;               &#x51;                  Q
U+0052              &#82;               &#x52;                  R
U+0053              &#83;               &#x53;                  S
U+0054              &#84;               &#x54;                  T
U+0055              &#85;               &#x55;                  U
U+0056              &#86;               &#x56;                  V
U+0057              &#87;               &#x57;                  W
U+0058              &#88;               &#x58;                  X
U+0059              &#89;               &#x59;                  Y
U+005A              &#90;               &#x5A;                  Z
U+005B              &#91;               &#x5B;                  [
U+005C              &#92;               &#x5C;                  \
U+005D              &#93;               &#x5D;                  ]
U+005E              &#94;               &#x5E;                  ^
U+005F              &#95;               &#x5F;                  _
U+0060              &#96;               &#x60;                  `
U+0061              &#97;               &#x61;                  a
U+0062              &#98;               &#x62;                  b
U+0063              &#99;               &#x63;                  c
U+0064              &#100;              &#x64;                  d
U+0065              &#101;              &#x65;                  e
U+0066              &#102;              &#x66;                  f
U+0067              &#103;              &#x67;                  g
U+0068              &#104;              &#x68;                  h
U+0069              &#105;              &#x69;                  i
U+006A              &#106;              &#x6A;                  j
U+006B              &#107;              &#x6B;                  k
U+006C              &#108;              &#x6C;                  l
U+006D              &#109;              &#x6D;                  m
U+006E              &#110;              &#x6E;                  n
U+006F              &#111;              &#x6F;                  o
U+0070              &#112;              &#x70;                  p
U+0071              &#113;              &#x71;                  q
U+0072              &#114;              &#x72;                  r
U+0073              &#115;              &#x73;                  s
U+0074              &#116;              &#x74;                  t
U+0075              &#117;              &#x75;                  u
U+0076              &#118;              &#x76;                  v
U+0077              &#119;              &#x77;                  w
U+0078              &#120;              &#x78;                  x
U+0079              &#121;              &#x79;                  y
U+007A              &#122;              &#x7A;                  z
U+007B              &#123;              &#x7B;                  {
U+007C              &#124;              &#x7C;                  |
U+007D              &#125;              &#x7D;                  }
U+007E              &#126;              &#x7E;                  ~
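
Because each reference is derived mechanically from the code point, a list like this one can be generated rather than typed by hand. The following is a minimal sketch; the function name `ascii_reference_rows` is made up for this example.

```python
def ascii_reference_rows():
    """Return (code point, decimal NCR, hexadecimal NCR, character) tuples
    for the printable ASCII range U+0020 through U+007E."""
    rows = []
    for code_point in range(0x20, 0x7F):
        rows.append((
            f"U+{code_point:04X}",   # conventional U+XXXX notation
            f"&#{code_point};",      # decimal form
            f"&#x{code_point:X};",   # hexadecimal form
            chr(code_point),         # the character itself
        ))
    return rows


for row in ascii_reference_rows():
    print("  ".join(row))
```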

Discussion

Markup languages are typically defined in terms of UCS or Unicode characters. That is, a document consists, at its most fundamental level of abstraction, of a sequence of characters, which are abstract units that exist independently of any encoding.
Ideally, when the characters of a document that uses a markup language are encoded for storage or transmission over a network as a sequence of bits, the encoding will be one that can represent every character in the document, if not the whole of Unicode, directly as a particular bit sequence.
Sometimes, though, for reasons of convenience or due to technical limitations, documents are encoded with an encoding that cannot represent some characters directly. For example, the widely used encodings based on ISO 8859 can only represent, at most, 256 unique characters as one 8-bit byte each.
In practice, documents are rarely allowed to use more than one encoding internally, so the onus is usually on the markup language to provide a means for document authors to express unencodable characters in terms of encodable ones. This is generally done through some kind of "escaping" mechanism.
The SGML-based markup languages allow document authors to use special sequences of characters from the ASCII range to represent, or reference, any Unicode character, regardless of whether the character being represented is directly available in the document's encoding. These special sequences are character references.
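
As an illustration of this escaping, Python's built-in `xmlcharrefreplace` error handler substitutes a decimal character reference for any character the target encoding cannot represent. The wrapper name `encode_with_ncrs` below is hypothetical, and the sketch is not a complete serializer: it does not escape a literal `&` or `<` already present in the text.

```python
def encode_with_ncrs(text, encoding):
    """Encode text, replacing unencodable characters with decimal NCRs."""
    return text.encode(encoding, errors="xmlcharrefreplace")


sample = "Stra\u00dfe \u03a3"  # sharp s (U+00DF) and capital Sigma (U+03A3)

# ISO 8859-1 can hold the sharp s as one byte but has no Sigma.
print(encode_with_ncrs(sample, "iso-8859-1"))

# ASCII can hold neither, so both become references.
print(encode_with_ncrs(sample, "ascii"))
```

In the first call only the Sigma is replaced by a reference, because ISO 8859-1 encodes the sharp s directly; in the second call both characters are replaced.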
Character references that are based on the referenced character's UCS or Unicode code point are called numeric character references. In HTML 4 and in all versions of XHTML and XML, the code point can be expressed either as a decimal number or as a hexadecimal number. The syntax is as follows:
Character U+0026 (ampersand), followed by character U+0023 (number sign), followed by one of the following choices:
  • one or more decimal digits, zero through nine (U+0030 through U+0039); or
  • character U+0078 (x) followed by one or more hexadecimal digits, which are zero through nine, Latin capital letter A through F (U+0041 through U+0046), and Latin small letter a through f (U+0061 through U+0066);
all followed by character U+003B (semicolon). Older versions of HTML disallowed the hexadecimal syntax.
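
The syntax described above maps directly onto a regular expression. The sketch below is one possible reading of it, not a conforming HTML or XML parser: the names `NCR_PATTERN` and `decode_ncrs` are invented here, and the only validation performed is that the number is a code point Python's `chr` accepts. Real parsers apply further rules, for example rejecting code points that a given language forbids.

```python
import re

# "&#", then decimal digits or "x" plus hexadecimal digits, then ";".
NCR_PATTERN = re.compile(r"&#(?:([0-9]+)|x([0-9A-Fa-f]+));")


def decode_ncrs(text):
    """Replace each well-formed numeric character reference in text."""
    def replace(match):
        decimal, hexadecimal = match.groups()
        try:
            if decimal is not None:
                code_point = int(decimal)
            else:
                code_point = int(hexadecimal, 16)
            return chr(code_point)
        except (ValueError, OverflowError):
            # Number is not a valid code point: leave the text unchanged.
            return match.group(0)
    return NCR_PATTERN.sub(replace, text)


print(decode_ncrs("&#931; &#x3a3; &#x03A3;"))
print(decode_ncrs("AT&#38;T"))
```

Input that does not match the syntax, such as a reference with no terminating semicolon, is simply left as it is by this sketch.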
The characters that make up a numeric character reference can be represented in every character encoding used in computing and telecommunications today, so there is no risk of the reference itself being unencodable.
There is another kind of character reference called a character entity reference, which allows a character to be referred to by a name instead of a number. HTML defines some character entities, but not many; all other characters can only be included by direct encoding or using NCRs.
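
The relationship between the two kinds of reference can be seen in Python's `html.entities.name2codepoint` table, which maps the HTML 4 entity names to code points. The helper `entity_to_ncr` is a hypothetical name used only for this sketch.

```python
from html.entities import name2codepoint


def entity_to_ncr(name):
    """Return the decimal NCR equivalent to the named entity, e.g. 'AElig'."""
    return f"&#{name2codepoint[name]};"


for name in ("Sigma", "AElig", "szlig"):
    # Show the named form next to its numeric equivalent.
    print(f"&{name};", "=", entity_to_ncr(name))
```

A character with no entry in such a table has no named form, so only direct encoding or a numeric reference remains available for it.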