Code page

In computing, a code page is a character encoding: a specific association of a set of printable characters and control characters with unique numbers. Typically each number represents the binary value in a single byte.
The term "code page" originated from IBM's EBCDIC-based mainframe systems, but Microsoft, SAP, and Oracle Corporation are among the vendors that also use it. Most vendors identify their own character sets by name. Where there are many character sets, identifying them by number is a convenient way to distinguish them. Originally, the code page numbers referred to the page numbers in the IBM standard character set manual, but this correspondence has not held for a long time. Vendors that use a code page system allocate their own code page number to a character encoding, even if it is better known by another name; for example, UTF-8 has been assigned code page number 1208 at IBM, 65001 at Microsoft, and 4110 at SAP.
Hewlett-Packard uses a similar concept in its HP-UX operating system and its Printer Command Language protocol for printers. The terminology, however, is different: what others call a character set, HP calls a symbol set, and what IBM or Microsoft call a code page, HP calls a symbol set code. HP developed a series of symbol sets, each with an associated symbol set code, to encode both its own character sets and other vendors' character sets.
The multitude of character sets leads many vendors to recommend Unicode.

The code page numbering system

IBM introduced the concept of systematically assigning a small, but globally unique, 16-bit number to each character encoding that a computer system or collection of computer systems might encounter. The IBM origin of the numbering scheme is reflected in the fact that the smallest numbers are assigned to variations of IBM's EBCDIC encoding, while slightly larger numbers refer to variations of IBM's extended ASCII encoding as used in its PC hardware.
With the release of PC DOS version 3.3, IBM introduced the code page numbering system to regular PC users: the code page numbers were used in new commands that allowed the character encoding used by all parts of the OS to be set in a systematic way.
After IBM and Microsoft ceased to cooperate in the 1990s, the two companies maintained their lists of assigned code page numbers independently of each other, resulting in some conflicting assignments. At least one third-party vendor also has its own, different list of numeric assignments. IBM's current assignments are listed in its CCSID repository, while Microsoft's assignments are documented on MSDN. Additionally, a list of the names and approximate IANA abbreviations for the installed code pages on any given Windows machine can be found in the Registry on that machine.
Most well-known code pages, excluding those for the CJK languages and Vietnamese, fit all their code points into eight bits and involve nothing more than mapping each code point to a single character; techniques such as combining characters and complex-script handling are not involved.
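The one-byte-to-one-character mapping described above can be illustrated with a minimal sketch. It assumes Python's built-in codecs `cp437`, `cp850`, and `cp1252` as stand-ins for the corresponding code pages; the helper name `decode_in_each` is hypothetical and chosen only for this illustration.

```python
# One byte value, several single-byte code pages: the number stays the same,
# but the character it stands for depends on which code page is in effect.
RAW = bytes([0xE9])


def decode_in_each(raw: bytes, code_pages: list[str]) -> dict[str, str]:
    """Decode the same bytes under each named code page (hypothetical helper)."""
    return {name: raw.decode(name) for name in code_pages}


decoded = decode_in_each(RAW, ["cp437", "cp850", "cp1252"])
for name, text in decoded.items():
    # Each result is exactly one character; print its Unicode code point.
    print(f"{name}: byte 0xE9 -> U+{ord(text):04X}")
```

Each code page yields exactly one character for the byte, but a different one in each case, which is why the code page in use has to be known before stored text can be interpreted.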
The text mode of standard PC graphics hardware is built around using an 8-bit code page, though it is possible to use two at once with some sacrifice of color depth, and up to eight may be stored in the display adapter for easy switching. There was a selection of third-party code page fonts that could be loaded into such hardware. It is now commonplace, however, for operating system vendors to provide their own character encoding and rendering systems that run in a graphics mode and bypass this hardware limitation entirely. Even so, the system of referring to character encodings by a code page number remains applicable, as an efficient alternative to string identifiers such as those specified by the IETF and IANA for use in various protocols such as e-mail and web pages.

Relationship to ASCII

The majority of code pages in current use are supersets of ASCII, a 7-bit code representing 128 control codes and printable characters. Early 8-bit implementations of ASCII set the top bit to zero or used it as a parity bit in network data transmissions. When the top bit was made available for representing character data, a total of 256 characters and control codes could be represented. Most vendors used this extended range to encode characters used by various languages, as well as graphical elements that allowed the imitation of primitive graphics on text-only output devices. No formal standard existed for these "extended ASCII character sets", and vendors referred to the variants as code pages, as IBM had always done for variants of EBCDIC encodings.
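What "superset of ASCII" means in practice can be checked with a short sketch: the lower 128 byte values must decode to the same characters as in ASCII. The function name `is_ascii_superset` is hypothetical, and the codec names are Python's built-in ones, used here as an assumption about how the code pages are exposed.

```python
# The 128 byte values that ASCII defines (0x00 through 0x7F).
ASCII_RANGE = bytes(range(128))


def is_ascii_superset(codec_name: str) -> bool:
    """Return True if the codec decodes bytes 0x00-0x7F exactly as ASCII does."""
    try:
        return ASCII_RANGE.decode(codec_name) == ASCII_RANGE.decode("ascii")
    except UnicodeDecodeError:
        # A codec that cannot decode some of these bytes is not a superset.
        return False


for name in ("cp437", "cp850", "cp1252", "cp037"):
    print(name, is_ascii_superset(name))
```

The EBCDIC-based `cp037` is included as a counterexample: it assigns the letters and digits to different byte values, so it fails the check.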

Relationship to Unicode

Unicode is an effort to include all characters from all currently and historically used human languages in a single character enumeration, removing the need to distinguish between different code pages when handling digitally stored text. Unicode tries to retain backwards compatibility with many legacy code pages, and some code pages were copied 1:1 in the design process. An explicit design goal of Unicode was to allow round-trip conversion between all common legacy code pages, although this goal has not always been achieved.
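Round-trip conversion can be sketched as follows: every byte value is decoded to Unicode and encoded back, and the result is compared with the input. This is only an illustration for single-byte code pages as implemented by Python's codecs; the function name `round_trips` is hypothetical, and a failure here reflects the codec's mapping table rather than a statement about Unicode itself.

```python
def round_trips(codec_name: str) -> bool:
    """Check whether all 256 byte values survive bytes -> Unicode -> bytes.

    Only meaningful for single-byte code pages; multi-byte encodings need
    valid byte sequences rather than every isolated byte value.
    """
    every_byte = bytes(range(256))
    try:
        text = every_byte.decode(codec_name)
        return text.encode(codec_name) == every_byte
    except UnicodeError:
        # Raised when the codec leaves some byte value unassigned.
        return False


for name in ("cp437", "cp850", "cp037"):
    print(name, round_trips(name))
```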
Some vendors, namely IBM and Microsoft, have anachronistically assigned code page numbers to Unicode encodings. This convention allows code page numbers to be used as metadata that identifies the correct decoding algorithm when stored binary data is encountered.
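Using a code page number as metadata might look like the sketch below. The lookup table is a small, hypothetical excerpt rather than any vendor's registry: the three UTF-8 numbers are the ones quoted in this article, the other two entries assume that Python's `cp037` and `cp1252` codecs correspond to IBM code page 37 and Microsoft code page 1252, and the names `CODEC_FOR_CODE_PAGE` and `decode_tagged` are invented for the example.

```python
# Hypothetical, deliberately incomplete table: (vendor, number) -> Python codec.
# Numbers are vendor-specific, so the vendor is part of the key.
CODEC_FOR_CODE_PAGE = {
    ("ibm", 1208): "utf-8",
    ("microsoft", 65001): "utf-8",
    ("sap", 4110): "utf-8",
    ("ibm", 37): "cp037",
    ("microsoft", 1252): "cp1252",
}


def decode_tagged(data: bytes, vendor: str, code_page: int) -> str:
    """Decode data using the codec selected by its code page metadata."""
    try:
        codec = CODEC_FOR_CODE_PAGE[(vendor.lower(), code_page)]
    except KeyError:
        raise LookupError(
            f"no codec known for {vendor} code page {code_page}"
        ) from None
    return data.decode(codec)


utf8_bytes = "café".encode("utf-8")
print(decode_tagged(utf8_bytes, "IBM", 1208))
print(decode_tagged(utf8_bytes, "Microsoft", 65001))
```

Keying the table on the vendor as well as the number reflects the point made earlier that the independently maintained lists contain conflicting assignments.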

IBM code pages

EBCDIC-based code pages

These code pages are used by IBM in its EBCDIC character sets for mainframe computers.

1 – USA WP, Original
2 – USA
3 – USA Accounting, Version A
4 – USA
5 – USA
6 – Latin America
7 – Germany F.R. / Austria
8 – Germany F.R.
9 – France, Belgium
10 – Canada
11 – Canada
12 – Italy
13 – Netherlands
14 – Spain
15 – Switzerland
16 – Switzerland
17 – Switzerland
18 – Sweden / Finland
19 – Sweden / Finland WP, version 2
20 – Denmark/Norway
21 – Brazil
22 – Portugal
23 – United Kingdom
24 – United Kingdom
25 – Japan
26 – Japan
27 – Greece
29 – Iceland
30 – Turkey
31 – South Africa
32 – Czechoslovakia
33 – Czechoslovakia
34 – Czechoslovakia
35 – Romania
36 – Romania
37 – USA/Canada - CECP
37-2 – The real 3279 APL code page, as used by C/370. This is very close to 1047, except that the caret and not-sign are swapped. It is not officially recognized by IBM, even though SHARE has pointed out its existence.
38 – USA ASCII
39 – United Kingdom / Israel
40 – United Kingdom
251 – China
252 – Poland
254 – Hungary
256 – International #1
257 – International #2
258 – International #3
259 – Symbols, Set 7
260 – Canadian French - 116
264 – Print Train & Text processing extended
273 – Germany F.R./Austria - CECP
274 – Old Belgium Code Page
275 – Brazil - CECP
276 – Canada - 94
277 – Denmark, Norway - CECP
278 – Finland, Sweden - CECP
279 – French - 94
280 – Italy - CECP
281 – Japan - CECP
282 – Portugal - CECP
283 – Spain - 190
284 – Spain/Latin America - CECP
285 – United Kingdom - CECP
286 – Austria / Germany F.R. Alternate
287 – Denmark / Norway Alternate
288 – Finland / Sweden Alternate
289 – Spain Alternate
290 – Japanese Extended
293 – APL
297 – France
298 – Japan
300 – Japan DBCS
310 – Graphic Escape APL/TN
320 – Hungary
321 – Yugoslavia
322 – Turkey
330 – International #4
340 – EBCDIC, OCR
351 – GDDM default
352 – Printing and publishing option
353 – BCDIC-A
354 – BCDIC-B
355 – PTTC/BCD standard option
357 – PTTC/BCD H option
358 – PTTC/BCD Correspondence option
359 – PTTC/BCD Monocase option
360 – PTTC/BCD Duocase option
361 – EBCDIC Publishing International
363 – Symbols, set 8
382 – EBCDIC Publishing Austria, Germany F.R. Alternate
383 – EBCDIC Publishing Belgium
384 – EBCDIC Publishing Brazil
385 – EBCDIC Publishing Canada
386 – EBCDIC Publishing Denmark, Norway
387 – EBCDIC Publishing Finland, Sweden
388 – EBCDIC Publishing France
389 – EBCDIC Publishing Italy
390 – EBCDIC Publishing Japan
391 – EBCDIC Publishing Portugal
392 – EBCDIC Publishing Spain, Philippines
393 – EBCDIC Publishing Latin America
394 – EBCDIC Publishing China, UK, Ireland
395 – EBCDIC Publishing Australia, New Zealand, USA, Canada
396 – BookMaster Specials
410 – Cyrillic
420 – Arabic
421 – Maghreb/French
423 – Greek
424 – Hebrew
425 – Arabic / Latin for OS/390 Open Edition
435 – Teletext Isomorphic
500 – International #5
803 – Hebrew Character Set A
829 – Host Math Symbols - Publishing
830 – Math Format
831 – Portugal
833 – Korean Extended
834 – Korean Hangul
835 – Traditional Chinese DBCS
836 – Simplified Chinese Extended
837 – Simplified Chinese DBCS
838 – Thai with Low Marks & Accented Characters
839 – Thai DBCS
870 – Latin 2
871 – Iceland
875 – Greek
880 – Cyrillic
881 – United States - 5080 Graphics System
882 – United Kingdom - 5080 Graphics System
883 – Sweden - 5080 Graphics System
884 – Germany - 5080 Graphics System
885 – France - 5080 Graphics System
886 – Italy - 5080 Graphics System
887 – Japan - 5080 Graphics System
888 – France AZERTY - 5080 Graphics System
889 – Thailand
890 – Yugoslavia
892 – EBCDIC, OCR A
893 – EBCDIC, OCR B
905 – Latin 3
918 – Urdu Bilingual
924 – Latin 9
930 – Japan MIX
931 – Japan MIX
933 – Korea MIX
935 – Simplified Chinese MIX
937 – Traditional Chinese MIX
939 – Japan MIX
1001 – MICR
1002 – EBCDIC DCF Release 2 Compatibility
1003 – EBCDIC DCF, US Text subset
1005 – EBCDIC Isomorphic Text Communication
1007 – EBCDIC Arabic
1024 – EBCDIC T.61
1025 – Cyrillic, Multilingual
1026 – EBCDIC Turkey
1027 – Japanese Extended
1028 – EBCDIC Publishing Hebrew
1030 – Japanese Extended
1031 – Japanese Extended
1032 – MICR, E13-B Combined
1033 – MICR, CMC-7 Combined
1037 – Korea - 5080/6090 Graphics System
1039 – GML Compatibility
1047 – Latin 1/Open Systems
1068 – DCF Compatibility
1069 – Latin 4
1070 – USA / Canada Version 0
1071 – Germany F.R. / Austria
1072 – Belgium
1073 – Brazil
1074 – Denmark, Norway
1075 – Finland, Sweden
1076 – Italy
1077 – Japan
1078 – Portugal
1079 – Spain / Latin America Version 0
1080 – United Kingdom
1081 – France Version 0
1082 – Israel
1083 – Israel
1084 – International #5 Version 0
1085 – Iceland
1087 – Symbol Set
1091 – Modified Symbols, Set 7
1093 – IBM Logo
1097 – Farsi Bilingual
1110 – Latin 2
1112 – Baltic Multilingual
1113 – Latin 6
1122 – Estonia
1123 – Cyrillic, Ukraine
1130 – Vietnamese
1132 – Lao EBCDIC
1136 – Hitachi Katakana
1137 – Devanagari EBCDIC
1140 – USA, Canada, etc. ECECP
1141 – Austria, Germany ECECP
1142 – Denmark, Norway ECECP
1143 – Finland, Sweden ECECP
1144 – Italy ECECP
1145 – Spain, Latin America ECECP
1146 – UK ECECP
1147 – France ECECP with euro
1148 – International ECECP with euro
1149 – Icelandic ECECP with euro
1150 – Korean Extended with box characters
1151 – Simplified Chinese Extended with box characters
1152 – Traditional Chinese Extended with box characters
1153 – Latin 2 Multilingual with euro
1154 – Cyrillic, Multilingual with euro
1155 – Turkey with euro
1156 – Baltic Multi with euro
1157 – Estonia with euro
1158 – Cyrillic, Ukraine with euro
1159 – T-Chinese EBCDIC
1160 – Thai with Low Marks & Accented Characters with euro
1164 – Vietnamese with euro
1165 – Latin 2/Open Systems
1166 – Cyrillic Kazakh
1175 – Turkey with euro and lira
1278 – EBCDIC Adobe Standard Encoding
1279 – Hitachi Japanese Katakana Host
1300 – Generic Bar Code/OCR-B
1301 – Zip + 4 POSTNET Bar Code
1302 – Facing Identification Marks
1303 – EBCDIC Bar Code
1364 – Korea MIX
1371 – Traditional Chinese MIX
1376 – Traditional Chinese DBCS Host extension for HKSCS
1377 – Mixed Host HKSCS Growing
1378 – Traditional Chinese DBCS Host extension for HKSCS and Simplified Chinese
1379 – Mixed Host HKSCS and Simplified Chinese Growing
1388 – Simplified Chinese MIX
1390 – Simplified Chinese MIX Japan MIX
1399 – Japan MIX
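A few of the EBCDIC code pages listed above are available as codecs in Python's standard library (for example `cp037`, `cp500`, and `cp1140`), which makes it possible to compare them byte by byte. The sketch below assumes those codecs match the IBM code pages of the same number; the helper `differing_bytes` is hypothetical.

```python
def differing_bytes(codec_a: str, codec_b: str) -> list[int]:
    """List the byte values that two single-byte codecs decode differently."""
    return [
        value
        for value in range(256)
        if bytes([value]).decode(codec_a) != bytes([value]).decode(codec_b)
    ]


# The letter "A" occupies different byte values in EBCDIC and ASCII.
print("cp037:", "A".encode("cp037").hex())
print("ascii:", "A".encode("ascii").hex())

# Positions where the euro-enabled code page 1140 departs from code page 37.
print([hex(v) for v in differing_bytes("cp037", "cp1140")])

# Number of positions where code page 500 departs from code page 37.
print(len(differing_bytes("cp037", "cp500")))
```

Comparisons of this kind show why the many national variants need distinct numbers: they share most assignments and differ only in a handful of positions.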