UTF-8
UTF-8 is a character encoding standard used for electronic communication. It is defined by the Unicode Standard, and its name is derived from Unicode Transformation Format 8-bit. As of 2026, almost every webpage is transmitted as UTF-8.
UTF-8 supports all 1,112,064 valid Unicode code points using a variable-width encoding of one to four one-byte code units.
Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. It was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that a UTF-8-encoded file using only those characters is identical to an ASCII file. Most software designed for any extended ASCII can read and write UTF-8, and this results in fewer internationalization issues than any alternative text encoding.
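This identity is easy to verify; for example, in Python:

```python
text = "plain ASCII text"
# An ASCII-only string produces byte-for-byte identical output
# under both encodings.
assert text.encode("utf-8") == text.encode("ascii") == b"plain ASCII text"
```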
UTF-8 is the dominant encoding on the internet for all countries and languages; it is used in most standards, often as the only allowed encoding, and is supported by all modern operating systems and programming languages.
History
The International Organization for Standardization set out to compose a universal multi-byte character set in 1989. The draft ISO 10646 standard contained a non-required annex called UTF-1 that provided a byte stream encoding of its 32-bit code points. This encoding was not satisfactory on performance grounds, among other problems, and the biggest problem was probably that it did not have a clear separation between ASCII and non-ASCII: new UTF-1 tools would be backward compatible with ASCII-encoded text, but UTF-1-encoded text could confuse existing code expecting ASCII, because it could contain continuation bytes in the range 0x21–0x7E that meant something else in ASCII, e.g., 0x2F for /, the Unix path directory separator.
In July 1992, the X/Open committee XoJIG was looking for a better encoding. Dave Prosser of Unix System Laboratories submitted a proposal for one that had faster implementation characteristics and introduced the improvement that 7-bit ASCII characters would only represent themselves; multi-byte sequences would only include bytes with the high bit set. The name File System Safe UCS Transformation Format (FSS-UTF) and most of the text of this proposal were later preserved in the final specification. In August 1992, this proposal was circulated by an IBM X/Open representative to interested parties.
A modification by Ken Thompson of the Plan 9 operating system group at Bell Labs made it self-synchronizing, letting a reader start anywhere and immediately detect character boundaries, at the cost of being somewhat less bit-efficient than the previous proposal. It also abandoned the use of biases that prevented overlong encodings. Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. In the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout, and then communicated their success back to X/Open, which accepted it as the specification for FSS-UTF. UTF-8 was first officially presented at the USENIX conference in San Diego, from January 25 to 29, 1993. The Internet Engineering Task Force adopted UTF-8 in its Policy on Character Sets and Languages in RFC 2277 for future internet standards work in January 1998, replacing Single Byte Character Sets such as Latin-1 in older RFCs.
In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences.
Description
UTF-8 encodes code points in one to four bytes, depending on the value of the code point. In the following table, the x characters are replaced by the bits of the code point:

First code point   Last code point   Byte 1     Byte 2     Byte 3     Byte 4
U+0000             U+007F            0xxxxxxx
U+0080             U+07FF            110xxxxx   10xxxxxx
U+0800             U+FFFF            1110xxxx   10xxxxxx   10xxxxxx
U+10000            U+10FFFF          11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

As an example, the character 桁 has the code point U+6841, which is 0110 1000 0100 0001 in binary; distributing those bits over the three-byte pattern gives 1110 0110, 1010 0001, 1000 0001, making its UTF-8 encoding 0xE6 0xA1 0x81.
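The table translates directly into bit operations. Below is a minimal Python sketch (the function name utf8_encode is chosen here for illustration, not a standard library API) that encodes a single code point by hand and checks the 桁 example against Python's built-in encoder:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point into UTF-8, following the table above."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp <= 0x7F:                       # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("surrogates are not valid scalar values")
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                   # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point above U+10FFFF")

assert utf8_encode(0x6841) == "桁".encode("utf-8") == b"\xe6\xa1\x81"
```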
The first 128 code points need one byte. The next 1,920 code points need two bytes to encode, which covers the remainder of almost all Latin-script alphabets, and also IPA extensions, Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko alphabets, as well as Combining Diacritical Marks. Three bytes are needed for the remaining 61,440 code points of the Basic Multilingual Plane, including most Chinese, Japanese and Korean characters. Four bytes are needed for the 1,048,576 non-BMP code points, which include emoji, less common CJK characters, and other useful characters.
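These length classes are easy to see with Python's built-in encoder, taking one character from each range:

```python
# One character from each encoding-length class.
for ch in "A", "é", "€", "😀":   # U+0041, U+00E9, U+20AC, U+1F600
    print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} byte(s)")
# U+0041 -> 1 byte(s)
# U+00E9 -> 2 byte(s)
# U+20AC -> 3 byte(s)
# U+1F600 -> 4 byte(s)
```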
UTF-8 is a prefix code: a decoder never needs to read past the last byte of a code point to decode it. Unlike many earlier multi-byte text encodings such as Shift-JIS, it is self-synchronizing, so searches for short strings or characters are possible, and the start of a code point can be found from a random position by backing up at most three bytes. The values chosen for the lead bytes mean that sorting a list of UTF-8 strings byte-by-byte puts them in the same order as sorting the corresponding UTF-32 strings.
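Self-synchronization makes that back-up search trivial; a minimal Python sketch (the helper name codepoint_start is ours, for illustration):

```python
def codepoint_start(buf: bytes, i: int) -> int:
    """Back up from an arbitrary byte offset to the start of the
    code point containing it; at most three steps are ever needed."""
    while (buf[i] & 0xC0) == 0x80:   # 10xxxxxx marks a continuation byte
        i -= 1
    return i

buf = "a桁b".encode("utf-8")          # bytes: 61 E6 A1 81 62
assert codepoint_start(buf, 2) == 1   # offset 2 is mid-character; start is 1
```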
Overlong encodings
Using a row in the above table to encode a code point less than "First code point" is termed an overlong encoding. These are a security problem because they allow character sequences such as malicious JavaScript and ../ to bypass security validations, which has been reported in numerous high-profile products such as Microsoft's IIS web server and Apache's Tomcat servlet container. Overlong encodings should therefore be considered an error and never decoded.
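For example, 0xC0 0xAF is an overlong two-byte form of / (U+002F), which the one-byte row already covers; a conforming decoder such as Python's rejects it:

```python
b"/".decode("utf-8")                  # '/': the only valid encoding
try:
    b"\xc0\xaf".decode("utf-8")       # overlong two-byte form of '/'
except UnicodeDecodeError as e:
    print(e)                          # 0xC0 is never a valid start byte
```

Error handling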
Not all sequences of bytes are valid UTF-8. A UTF-8 decoder should be prepared for:
- A "continuation byte" at the start of a character
- A non-continuation byte before the end of a character
- An overlong encoding
- A 4-byte sequence that decodes to a value greater than U+10FFFF
states "Implementations of the decoding algorithm MUST protect against decoding invalid sequences." The Unicode Standard requires decoders to: "... treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence." The standard now recommends replacing each error with the replacement character "" and continue decoding.
Some decoders consider the sequence 0xE1, 0xA0, 0x20 (a truncated three-byte code followed by a space) as a single error. This is not a good idea as a search for a space character would find the one hidden in the error. Since Unicode 6 the standard has recommended a "best practice" where the error is either one continuation byte, or ends at the first byte that is disallowed, so 0xE1, 0xA0, 0x20 is a two-byte error followed by a space. An error is no more than three bytes long, never contains the start of a valid character, and there are 21,952 different possible errors. Many decoders instead treat each byte of the error as a separate error, in which case 0xE1, 0xA0, 0x20 is two errors followed by a space; there are then only 128 different errors, which makes it practical to store the errors in the output string, or replace them with characters from a legacy encoding.
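CPython's decoder (since Python 3.3) follows the recommended practice, replacing each maximal ill-formed subsequence with a single U+FFFD, which makes the error granularity visible:

```python
print(b"\xe1\xa0\x20".decode("utf-8", errors="replace"))
# '� ' : E1 A0 is one two-byte error, then the space survives
print(b"\x80\x80 ".decode("utf-8", errors="replace"))
# '�� ': two lone continuation bytes are two separate errors
```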
Only a small subset of possible byte strings are error-free UTF-8: several bytes cannot appear at all; a byte with the high bit set cannot be alone; and in a truly random string a byte with the high bit set has only a roughly 1-in-15 chance of starting a valid UTF-8 character. This has the consequence of making it easy to detect if a legacy text encoding is accidentally used instead of UTF-8, making conversion of a system to UTF-8 easier and avoiding the need to require a byte-order mark or any other metadata.
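That property underlies a common heuristic: try UTF-8 first and fall back to a legacy encoding only if decoding fails. A minimal sketch (the choice of Windows-1252 as the fallback is an assumption for illustration, not part of any standard):

```python
def decode_guess(data: bytes) -> str:
    """Decode as UTF-8 if valid; otherwise assume a legacy encoding."""
    try:
        return data.decode("utf-8")   # random non-UTF-8 data almost never passes
    except UnicodeDecodeError:
        return data.decode("cp1252", errors="replace")  # assumed legacy fallback
```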
Surrogates
Since RFC 3629, the high and low surrogates used by UTF-16 (U+D800 through U+DFFF) are not legal Unicode values, and their UTF-8 encodings must be treated as an invalid byte sequence. These encodings all start with 0xED followed by 0xA0 or higher. This rule is often ignored, as surrogates are allowed in Windows filenames and this means there must be a way to store them in a string. UTF-8 that allows these surrogate halves has been called WTF-8, for "wobbly transformation format", while another variation that also encodes all non-BMP characters as two surrogates is called CESU-8.
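Python exposes both behaviors: the default strict mode rejects lone surrogates, while the 'surrogatepass' error handler emits the WTF-8-style bytes:

```python
lone = "\ud800"                       # an unpaired high surrogate
try:
    lone.encode("utf-8")              # strict mode rejects it
except UnicodeEncodeError as e:
    print(e)
print(lone.encode("utf-8", "surrogatepass"))   # b'\xed\xa0\x80'
```

Byte map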
In a stream encoded in UTF-8, each byte falls into one of a few ranges: 0x00–0x7F are single-byte ASCII characters, 0x80–0xBF are continuation bytes, 0xC2–0xDF lead two-byte sequences, 0xE0–0xEF lead three-byte sequences, 0xF0–0xF4 lead four-byte sequences, and 0xC0, 0xC1, and 0xF5–0xFF cannot appear in valid UTF-8.
Byte-order mark
If the Unicode byte-order mark (BOM, U+FEFF) is at the start of a UTF-8 file, the first three bytes will be 0xEF, 0xBB, 0xBF. The Unicode Standard neither requires nor recommends the use of the BOM for UTF-8, but warns that it may be encountered at the start of a file transcoded from another encoding. While ASCII text encoded using UTF-8 is backward compatible with ASCII, this is not true when Unicode Standard recommendations are ignored and a BOM is added. A BOM can confuse software that isn't prepared for it but can otherwise accept UTF-8, e.g. programming languages that permit non-ASCII bytes in string literals but not at the start of the file. Nevertheless, there was and still is software that always inserts a BOM when writing UTF-8, and that refuses to correctly interpret UTF-8 unless the first character is a BOM.
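When a BOM may be present, it can be stripped explicitly; Python's 'utf-8-sig' codec does this, decoding plain UTF-8 unchanged and removing a leading BOM if there is one:

```python
data = b"\xef\xbb\xbfhello"        # UTF-8 bytes with a leading BOM
print(data.decode("utf-8"))        # '\ufeffhello' - BOM kept as U+FEFF
print(data.decode("utf-8-sig"))    # 'hello'       - BOM stripped
```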