Mojibake
Mojibake is garbled or gibberish text that results from text being decoded using an unintended character encoding. The result is a systematic replacement of symbols with completely unrelated ones, often from a different writing system.
The display may include the generic replacement character (�) in places where the binary representation is considered invalid. A replacement can also involve multiple consecutive symbols, as viewed in one encoding, when the same binary code constitutes one symbol in the other encoding. This is due either to encodings of differing fixed lengths or to the use of variable-length encodings.
Failed rendering of glyphs due to either missing fonts or missing glyphs in a font is a different issue that is not to be confused with mojibake. Symptoms of this failed rendering include blocks with the code point displayed in hexadecimal or using the generic replacement character. Importantly, these replacements are valid and are the result of correct error handling by the software.
Causes
To correctly reproduce the original text that was encoded, the correspondence between the encoded data and the notion of its encoding must be preserved. As mojibake is an instance of non-compliance between these, it can be produced by manipulating the data itself or by merely relabelling it.

Mojibake is often seen with text data that have been tagged with a wrong encoding; the data may not even be tagged at all, but moved between computers with different default encodings. A major source of trouble is communication protocols that rely on settings on each computer rather than sending or storing metadata together with the data.
The differing default settings between computers are due in part to differing deployments of Unicode among operating system families, and in part to the legacy encodings' specialization for different writing systems of human languages. Whereas Linux distributions mostly switched to UTF-8 in 2004, Microsoft Windows generally uses UTF-16, and sometimes uses 8-bit code pages for text files in different languages.
For some writing systems, such as Japanese, several encodings have historically been employed, causing users to see mojibake relatively often. As an example, the word mojibake itself stored as EUC-JP might be incorrectly displayed as "ハクサ嵂ス、ア" or "ハクサ郾ス、ア" if interpreted as Shift-JIS, or as "Ê¸»ú²½¤±" in software that assumes text to be in the Windows-1252 or ISO 8859-1 encodings, usually labelled Western or Western European. This is further exacerbated if other locales are involved: the same text stored as UTF-8 appears garbled if interpreted as Shift-JIS, as "æ–‡å—åŒ–ã‘" if interpreted as Western, or as "鏂囧瓧鍖栥亼" if interpreted as being in a GBK locale.
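The mechanism behind these examples can be reproduced in a few lines. A minimal sketch, assuming Python and its codec names (cp932 is the Windows variant of Shift-JIS; the exact garbage varies with the decoder's variant and error handling):

```python
# The word "mojibake" itself, saved as EUC-JP.
data = "文字化け".encode("euc_jp")

# The same bytes decoded under unintended encodings; errors="replace"
# stands in for whatever error handling the displaying software applies.
print(data.decode("cp932", errors="replace"))  # katakana/kanji garbage
print(data.decode("latin-1"))                  # Ê¸»ú²½¤±
```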
Underspecification
If the encoding is not specified, it is up to the software to decide it by other means. Depending on the type of software, the typical solution is either configuration or charset detection heuristics, both of which are prone to mis-prediction.

The encoding of text files is affected by the locale setting, which depends on the user's language and operating system, among other conditions. Therefore, the assumed encoding is systematically wrong for files that come from a computer with a different setting, or even from a differently localized piece of software within the same system. For Unicode, one solution is to use a byte order mark, but many parsers do not tolerate this for source code or other machine-readable text. Another solution is to store the encoding as metadata in the file system; file systems that support extended file attributes can store it as user.charset. This also requires support in software that wants to take advantage of it, but does not disturb other software.

While some encodings are easy to detect, such as UTF-8, many are hard to distinguish. For example, a web browser may not be able to distinguish between a page coded in EUC-JP and one in Shift-JIS if the encoding is not assigned explicitly, either using HTTP headers sent along with the document or using the document's meta tags, which substitute for missing HTTP headers when the server cannot be configured to send them; see character encodings in HTML.
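A rough sketch of such a detection heuristic, assuming Python: try candidate encodings in order and accept the first that decodes cleanly. Because many legacy encodings accept the same byte sequences, this kind of guessing is inherently prone to mis-prediction.

```python
def guess_encoding(data: bytes) -> str:
    # The candidate order is an assumption; real detectors also apply
    # statistics over the decoded text, not just decodability.
    for encoding in ("utf-8", "euc_jp", "shift_jis", "windows-1252"):
        try:
            data.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return "unknown"

# Halfwidth katakana saved as Shift-JIS happen to form valid EUC-JP
# byte pairs, so the heuristic guesses wrong here.
print(guess_encoding("ｱｲｳｴ".encode("shift_jis")))  # "euc_jp", not "shift_jis"
```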
Mis-specification
Mojibake also occurs when the encoding is incorrectly specified. This often happens between similar encodings. For example, the Eudora email client for Windows was known to send emails labelled as ISO 8859-1 that were in reality Windows-1252. Windows-1252 contains extra printable characters in the C1 range that were not displayed properly in software complying with the ISO standard; this especially affected software running under other operating systems such as Unix.
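A minimal sketch of that C1-range difference, assuming Python's codec names: bytes in the 0x80–0x9F range are printable punctuation in Windows-1252 but (mostly unprintable) control characters in ISO 8859-1.

```python
data = b"\x93smart quotes\x94"       # 0x93/0x94: curly quotes in Windows-1252

print(data.decode("windows-1252"))   # “smart quotes” with curly quotes
print(data.decode("iso-8859-1"))     # same bytes become invisible C1 controls
```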
User oversight
Of the encodings still in common use, many originated from taking ASCII and appending to it; as a result, these encodings are partially compatible with each other. Examples include Windows-1252 and ISO 8859-1. People may thus mistake the extended encoding they are using for plain ASCII.
Overspecification
When there are layers of protocols, each trying to specify the encoding based on different information, the least certain information may be misleading to the recipient. For example, consider a web server serving a static HTML file over HTTP. The character set may be communicated to the client in any of three ways:
- in the HTTP header. This information can be based on server configuration or controlled by the application running on the server.
- in the file, as an HTML meta tag or the encoding attribute of an XML declaration. This is the encoding that the author meant to save the particular file in.
- in the file, as a byte order mark. This is the encoding that the author's editor actually saved it in. Unless an accidental encoding conversion has happened, this will be correct. It is, however, only available in Unicode encodings such as UTF-8 or UTF-16; a brief sketch follows this list.
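As a minimal sketch of the third mechanism, assuming Python (the file name is hypothetical): the byte order mark is simply the first bytes of the file, so it reflects what the editor actually wrote, independent of any header or tag.

```python
# "utf-8-sig" prepends the UTF-8 byte order mark EF BB BF on writing.
with open("page.html", "w", encoding="utf-8-sig") as f:
    f.write("<html></html>")

# Any reader can check for the mark before deciding how to decode.
with open("page.html", "rb") as f:
    print(f.read(3) == b"\xef\xbb\xbf")  # True
```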
Lack of hardware or software support
Resolutions
Applications using UTF-8 as a default encoding may achieve a greater degree of interoperability because of its widespread use and backward compatibility with ASCII. UTF-8 can also be recognised directly by a simple algorithm, so that well-written software should be able to avoid mixing UTF-8 up with other encodings.

The difficulty of resolving an instance of mojibake varies depending on the application within which it occurs and its causes. Two of the most common applications in which mojibake may occur are web browsers and word processors. Modern browsers and word processors often support a wide array of character encodings. Browsers often allow a user to change their rendering engine's encoding setting on the fly, while word processors allow the user to select the appropriate encoding when opening a file. It may take some trial and error to find the correct encoding.
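The "simple algorithm" mentioned above exploits UTF-8's rigid byte structure: lead bytes announce the sequence length, and continuation bytes must lie in 0x80–0xBF. A simplified Python sketch (it does not reject every overlong or surrogate sequence):

```python
def looks_like_utf8(data: bytes) -> bool:
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # ASCII
            n = 0
        elif 0xC2 <= b <= 0xDF:      # 2-byte sequence
            n = 1
        elif 0xE0 <= b <= 0xEF:      # 3-byte sequence
            n = 2
        elif 0xF0 <= b <= 0xF4:      # 4-byte sequence
            n = 3
        else:                        # stray continuation byte or invalid lead
            return False
        if i + n >= len(data):       # truncated sequence
            return False
        if any(not 0x80 <= data[i + j] <= 0xBF for j in range(1, n + 1)):
            return False
        i += n + 1
    return True
```

Legacy 8-bit text almost never passes this check by accident, which is why treating input that validates as UTF-8 is a comparatively safe default.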
The problem gets more complicated when it occurs in an application that normally does not support a wide range of character encodings, such as a non-Unicode computer game. In this case, the user must change the operating system's encoding settings to match that of the game. However, changing the system-wide encoding settings can also cause mojibake in pre-existing applications. In Windows XP or later, a user also has the option to use Microsoft AppLocale, an application that allows changing per-application locale settings. Even so, changing the operating system encoding settings is not possible on earlier operating systems such as Windows 98; to resolve this issue on such systems, a user would have to use third-party font rendering applications.
Problems in different writing systems
English
Mojibake in English texts generally occurs in punctuation, such as em dashes, en dashes, and curly quotes, but rarely in character text, since most encodings agree with ASCII on the encoding of the English alphabet. For example, the pound sign £ will appear as Â£ if it was encoded by the sender as UTF-8 but interpreted by the recipient as one of the Western European encodings. If iterated using CP1252, this can lead to Ã‚Â£, then Ãƒâ€šÃ‚Â£, and so on. Similarly, the right single quotation mark (’), when encoded in UTF-8 and decoded using Windows-1252, becomes â€™, then Ã¢â‚¬â„¢, and so on.

In older eras, some computers had vendor-specific encodings which caused mismatches even for English text. Commodore brand 8-bit computers used the PETSCII encoding, particularly notable for inverting the upper and lower case compared to standard ASCII. PETSCII printers worked fine on other computers of the era, but inverted the case of all letters. IBM mainframes use the EBCDIC encoding, which does not match ASCII at all.
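This iterated corruption is easy to reproduce. A minimal sketch, assuming Python: each round encodes the current string as UTF-8 and misreads the resulting bytes as Windows-1252, compounding the damage.

```python
# Start with a pound sign and repeatedly decode its UTF-8 bytes
# under the wrong (Windows-1252) encoding.
s = "£"
for _ in range(3):
    s = s.encode("utf-8").decode("windows-1252")
    print(s)
# Prints Â£, then Ã‚Â£, then Ãƒâ€šÃ‚Â£
```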