Code page
In computing, a code page is a character encoding: a specific association of a set of printable characters and control characters with unique numbers. Typically, each number represents the binary value of a single byte.
The term "code page" originated with IBM's EBCDIC-based mainframe systems, but Microsoft, SAP, and Oracle Corporation are among the vendors that also use it. Most vendors identify their own character sets by name. Where many character sets exist, identifying them by number is a convenient way to distinguish them. Originally, the code page numbers referred to the page numbers in the IBM standard character set manual, a correspondence that has not held for a long time. Vendors that use a code page system allocate their own code page number to a character encoding, even if it is better known by another name; for example, UTF-8 has been assigned code page number 1208 at IBM, 65001 at Microsoft, and 4110 at SAP.
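Because each vendor numbers the same encoding differently, a code page number is only meaningful together with the vendor that assigned it. The following Python fragment is a minimal sketch of such a lookup, restricted to the three UTF-8 assignments quoted above. The names CODE_PAGE_REGISTRY and resolve_code_page are hypothetical and do not belong to any vendor's API.

```python
# Hypothetical lookup table: (vendor, code page number) -> encoding name.
# Only the three UTF-8 assignments quoted in the text are included.
CODE_PAGE_REGISTRY = {
    ("IBM", 1208): "utf-8",
    ("Microsoft", 65001): "utf-8",
    ("SAP", 4110): "utf-8",
}


def resolve_code_page(vendor, number):
    """Return the encoding name for a vendor-specific code page number.

    Returns None when the (vendor, number) pair is not in the table.
    """
    return CODE_PAGE_REGISTRY.get((vendor, number))


if __name__ == "__main__":
    print(resolve_code_page("IBM", 1208))        # known pair
    print(resolve_code_page("Microsoft", 1208))  # IBM's number under the wrong vendor
```

Keying the table on the vendor as well as the number keeps one vendor's assignment from being read as another's.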
Hewlett-Packard uses a similar concept in its HP-UX operating system and its Printer Command Language protocol for printers. The terminology, however, is different: What others call a character set, HP calls a symbol set, and what IBM or Microsoft call a code page, HP calls a symbol set code. HP developed a series of symbol sets, each with an associated symbol set code, to encode both its own character sets and other vendors’ character sets.
The multitude of character sets leads many vendors to recommend Unicode.
The code page numbering system
IBM introduced the concept of systematically assigning a small, but globally unique, 16-bit number to each character encoding that a computer system or collection of computer systems might encounter. The IBM origin of the numbering scheme is reflected in the fact that the smallest numbers are assigned to variations of IBM's EBCDIC encoding and slightly larger numbers refer to variations of IBM's extended ASCII encoding as used in its PC hardware.
With the release of PC DOS version 3.3, IBM introduced the code page numbering system to regular PC users, as the code page numbers were used in new commands to allow the character encoding used by all parts of the OS to be set in a systematic way.
Since IBM and Microsoft ceased to cooperate in the 1990s, the two companies have maintained the list of assigned code page numbers independently of each other, resulting in some conflicting assignments. At least one third-party vendor also has its own different list of numeric assignments. IBM's current assignments are listed in its CCSID repository, while Microsoft's assignments are documented on MSDN. Additionally, a list of the names and approximate IANA abbreviations for the installed code pages on any given Windows machine can be found in the Registry on that machine.
Most well-known code pages, excluding those for the CJK languages and Vietnamese, fit all their code-points into eight bits and do not involve anything more than mapping each code-point to a single character; furthermore, techniques such as combining characters, complex scripts, etc., are not involved.
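The one-byte, one-character mapping described above can be illustrated with Python's built-in codecs. This is a sketch only: the helper names decode_byte and is_one_to_one are hypothetical, and the two code pages used (cp437 and cp1252, as named by Python) are chosen for illustration and are not taken from the text.

```python
def decode_byte(value, encoding):
    """Decode a single byte value (0-255) with the given single-byte code page."""
    return bytes([value]).decode(encoding)


def is_one_to_one(encoding):
    """Check that every byte the codec defines decodes to exactly one character.

    Byte values the codec leaves undefined are skipped.
    """
    for value in range(256):
        try:
            if len(decode_byte(value, encoding)) != 1:
                return False
        except UnicodeDecodeError:
            continue
    return True


if __name__ == "__main__":
    # The same byte value names different characters in different code pages.
    print(decode_byte(0xE9, "cp437"))
    print(decode_byte(0xE9, "cp1252"))
    print(is_one_to_one("cp437"), is_one_to_one("cp1252"))
```

Here the byte 0xE9 decodes to the Greek capital theta under cp437 and to "é" under cp1252, while 0x41 decodes to "A" under both.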
The text mode of standard PC graphics hardware is built around an 8-bit code page, though it is possible to use two at once at the cost of some color depth, and up to eight may be stored in the display adapter for easy switching. A selection of third-party code page fonts could be loaded into such hardware. It is now commonplace, however, for operating system vendors to provide their own character encoding and rendering systems that run in a graphics mode and bypass this hardware limitation entirely. Even so, the system of referring to character encodings by a code page number remains applicable, as an efficient alternative to string identifiers such as those specified by the IETF and IANA for use in protocols such as e-mail and web pages.
Relationship to ASCII
The majority of code pages in current use are supersets of ASCII, a 7-bit code representing 128 control codes and printable characters. In the distant past, 8-bit implementations of the ASCII code set the top bit to zero or used it as a parity bit in network data transmissions. When the top bit was made available for representing character data, a total of 256 characters and control codes could be represented. Most vendors used this extended range to encode characters used by various languages and graphical elements that allowed the imitation of primitive graphics on text-only output devices. No formal standard existed for these "extended ASCII character sets" and vendors referred to the variants as code pages, as IBM had always done for variants of EBCDIC encodings.
Relationship to Unicode
Unicode is an effort to include all characters from all currently and historically used human languages in a single character enumeration, removing the need to distinguish between different code pages when handling digitally stored text. Unicode tries to retain backwards compatibility with many legacy code pages, copying some code pages 1:1 in the design process. An explicit design goal of Unicode was to allow round-trip conversion between all common legacy code pages, although this goal has not always been achieved.
Some vendors, namely IBM and Microsoft, have anachronistically assigned code page numbers to Unicode encodings. This convention allows code page numbers to be used as metadata to identify the correct decoding algorithm when encountering binary stored data.
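Both ideas in this section, using a code page number as metadata to select a decoder and converting to Unicode and back without loss, can be sketched in Python. The table below is an illustrative subset, not a vendor list: 65001 is the Microsoft number for UTF-8 quoted earlier, while 437 and 1252 are assumed here to map to Python's cp437 and cp1252 codecs. The names CODEC_BY_CODE_PAGE, decode_tagged and round_trips are hypothetical.

```python
# Illustrative subset: code page number -> Python codec name.
# 437 and 1252 are single-byte code pages; 65001 is Microsoft's number for UTF-8.
CODEC_BY_CODE_PAGE = {437: "cp437", 1252: "cp1252", 65001: "utf-8"}


def decode_tagged(data, code_page):
    """Decode `data` using the codec selected by its code page number."""
    return data.decode(CODEC_BY_CODE_PAGE[code_page])


def round_trips(data, code_page):
    """Return True if bytes -> Unicode -> bytes reproduces the original bytes."""
    codec = CODEC_BY_CODE_PAGE[code_page]
    return data.decode(codec).encode(codec) == data


if __name__ == "__main__":
    # The same text stored under two different code page tags.
    print(decode_tagged(b"caf\xe9", 1252))
    print(decode_tagged(b"caf\xc3\xa9", 65001))
    # Every byte value of code page 437 survives the trip through Unicode.
    print(round_trips(bytes(range(256)), 437))
```

The round-trip check passes for a code page only when each of its byte values maps to a distinct Unicode character, which is the property the design goal above refers to.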
IBM code pages
EBCDIC-based code pages
These code pages are used by IBM in its EBCDIC character sets for mainframe computers.
- 1 – USA WP, Original
- 2 – USA
- 3 – USA Accounting, Version A
- 4 – USA
- 5 – USA
- 6 – Latin America
- 7 – Germany F.R. / Austria
- 8 – Germany F.R.
- 9 – France, Belgium
- 10 – Canada
- 11 – Canada
- 12 – Italy
- 13 – Netherlands
- 14 – Spain
- 15 – Switzerland
- 16 – Switzerland
- 17 – Switzerland
- 18 – Sweden / Finland
- 19 – Sweden / Finland WP, version 2
- 20 – Denmark/Norway
- 21 – Brazil
- 22 – Portugal
- 23 – United Kingdom
- 24 – United Kingdom
- 25 – Japan
- 26 – Japan
- 27 – Greece
- 29 – Iceland
- 30 – Turkey
- 31 – South Africa
- 32 – Czechoslovakia
- 33 – Czechoslovakia
- 34 – Czechoslovakia
- 35 – Romania
- 36 – Romania
- 37 – USA/Canada - CECP
- 37-2 – The real 3279 APL code page, as used by C/370. This is very close to 1047, except that the caret and not-sign are swapped. It is not officially recognized by IBM, even though SHARE has pointed out its existence.
- 38 – USA ASCII
- 39 – United Kingdom / Israel
- 40 – United Kingdom
- 251 – China
- 252 – Poland
- 254 – Hungary
- 256 – International #1
- 257 – International #2
- 258 – International #3
- 259 – Symbols, Set 7
- 260 – Canadian French - 116
- 264 – Print Train & Text processing extended
- 273 – Germany F.R./Austria - CECP
- 274 – Old Belgium Code Page
- 275 – Brazil - CECP
- 276 – Canada - 94
- 277 – Denmark, Norway - CECP
- 278 – Finland, Sweden - CECP
- 279 – French - 94
- 280 – Italy - CECP
- 281 – Japan - CECP
- 282 – Portugal - CECP
- 283 – Spain - 190
- 284 – Spain/Latin America - CECP
- 285 – United Kingdom - CECP
- 286 – Austria / Germany F.R. Alternate
- 287 – Denmark / Norway Alternate
- 288 – Finland / Sweden Alternate
- 289 – Spain Alternate
- 290 – Japanese Extended
- 293 – APL
- 297 – France
- 298 – Japan
- 300 – Japan DBCS
- 310 – Graphic Escape APL/TN
- 320 – Hungary
- 321 – Yugoslavia
- 322 – Turkey
- 330 – International #4
- 340 – EBCDIC, OCR
- 351 – GDDM default
- 352 – Printing and publishing option
- 353 – BCDIC-A
- 354 – BCDIC-B
- 355 – PTTC/BCD standard option
- 357 – PTTC/BCD H option
- 358 – PTTC/BCD Correspondence option
- 359 – PTTC/BCD Monocase option
- 360 – PTTC/BCD Duocase option
- 361 – EBCDIC Publishing International
- 363 – Symbols, set 8
- 382 – EBCDIC Publishing Austria, Germany F.R. Alternate
- 383 – EBCDIC Publishing Belgium
- 384 – EBCDIC Publishing Brazil
- 385 – EBCDIC Publishing Canada
- 386 – EBCDIC Publishing Denmark, Norway
- 387 – EBCDIC Publishing Finland, Sweden
- 388 – EBCDIC Publishing France
- 389 – EBCDIC Publishing Italy
- 390 – EBCDIC Publishing Japan
- 391 – EBCDIC Publishing Portugal
- 392 – EBCDIC Publishing Spain, Philippines
- 393 – EBCDIC Publishing Latin America
- 394 – EBCDIC Publishing China, UK, Ireland
- 395 – EBCDIC Publishing Australia, New Zealand, USA, Canada
- 396 – BookMaster Specials
- 410 – Cyrillic
- 420 – Arabic
- 421 – Maghreb/French
- 423 – Greek
- 424 – Hebrew
- 425 – Arabic / Latin for OS/390 Open Edition
- 435 – Teletext Isomorphic
- 500 – International #5
- 803 – Hebrew Character Set A
- 829 – Host Math Symbols - Publishing
- 830 – Math Format
- 831 – Portugal
- 833 – Korean Extended
- 834 – Korean Hangul
- 835 – Traditional Chinese DBCS
- 836 – Simplified Chinese Extended
- 837 – Simplified Chinese DBCS
- 838 – Thai with Low Marks & Accented Characters
- 839 – Thai DBCS
- 870 – Latin 2
- 871 – Iceland
- 875 – Greek
- 880 – Cyrillic
- 881 – United States - 5080 Graphics System
- 882 – United Kingdom - 5080 Graphics System
- 883 – Sweden - 5080 Graphics System
- 884 – Germany - 5080 Graphics System
- 885 – France - 5080 Graphics System
- 886 – Italy - 5080 Graphics System
- 887 – Japan - 5080 Graphics System
- 888 – France AZERTY - 5080 Graphics System
- 889 – Thailand
- 890 – Yugoslavia
- 892 – EBCDIC, OCR A
- 893 – EBCDIC, OCR B
- 905 – Latin 3
- 918 – Urdu Bilingual
- 924 – Latin 9
- 930 – Japan MIX
- 931 – Japan MIX
- 933 – Korea MIX
- 935 – Simplified Chinese MIX
- 937 – Traditional Chinese MIX
- 939 – Japan MIX
- 1001 – MICR
- 1002 – EBCDIC DCF Release 2 Compatibility
- 1003 – EBCDIC DCF, US Text subset
- 1005 – EBCDIC Isomorphic Text Communication
- 1007 – EBCDIC Arabic
- 1024 – EBCDIC T.61
- 1025 – Cyrillic, Multilingual
- 1026 – EBCDIC Turkey
- 1027 – Japanese Extended
- 1028 – EBCDIC Publishing Hebrew
- 1030 – Japanese Extended
- 1031 – Japanese Extended
- 1032 – MICR, E13-B Combined
- 1033 – MICR, CMC-7 Combined
- 1037 – Korea - 5080/6090 Graphics System
- 1039 – GML Compatibility
- 1047 – Latin 1/Open Systems
- 1068 – DCF Compatibility
- 1069 – Latin 4
- 1070 – USA / Canada Version 0
- 1071 – Germany F.R. / Austria
- 1072 – Belgium
- 1073 – Brazil
- 1074 – Denmark, Norway
- 1075 – Finland, Sweden
- 1076 – Italy
- 1077 – Japan
- 1078 – Portugal
- 1079 – Spain / Latin America Version 0
- 1080 – United Kingdom
- 1081 – France Version 0
- 1082 – Israel
- 1083 – Israel
- 1084 – International #5 Version 0
- 1085 – Iceland
- 1087 – Symbol Set
- 1091 – Modified Symbols, Set 7
- 1093 – IBM Logo
- 1097 – Farsi Bilingual
- 1110 – Latin 2
- 1112 – Baltic Multilingual
- 1113 – Latin 6
- 1122 – Estonia
- 1123 – Cyrillic, Ukraine
- 1130 – Vietnamese
- 1132 – Lao EBCDIC
- 1136 – Hitachi Katakana
- 1137 – Devanagari EBCDIC
- 1140 – USA, Canada, etc. ECECP
- 1141 – Austria, Germany ECECP
- 1142 – Denmark, Norway ECECP
- 1143 – Finland, Sweden ECECP
- 1144 – Italy ECECP
- 1145 – Spain, Latin America ECECP
- 1146 – UK ECECP
- 1147 – France ECECP with euro
- 1148 – International ECECP with euro
- 1149 – Icelandic ECECP with euro
- 1150 – Korean Extended with box characters
- 1151 – Simplified Chinese Extended with box characters
- 1152 – Traditional Chinese Extended with box characters
- 1153 – Latin 2 Multilingual with euro
- 1154 – Cyrillic, Multilingual with euro
- 1155 – Turkey with euro
- 1156 – Baltic Multi with euro
- 1157 – Estonia with euro
- 1158 – Cyrillic, Ukraine with euro
- 1159 – T-Chinese EBCDIC
- 1160 – Thai with Low Marks & Accented Characters with euro
- 1164 – Vietnamese with euro
- 1165 – Latin 2/Open Systems
- 1166 – Cyrillic Kazakh
- 1175 – Turkey with euro and lira
- 1278 – EBCDIC Adobe Standard Encoding
- 1279 – Hitachi Japanese Katakana Host
- 1300 – Generic Bar Code/OCR-B
- 1301 – Zip + 4 POSTNET Bar Code
- 1302 – Facing Identification Marks
- 1303 – EBCDIC Bar Code
- 1364 – Korea MIX
- 1371 – Traditional Chinese MIX
- 1376 – Traditional Chinese DBCS Host extension for HKSCS
- 1377 – Mixed Host HKSCS Growing
- 1378 – Traditional Chinese DBCS Host extension for HKSCS and Simplified Chinese
- 1379 – Mixed Host HKSCS and Simplified Chinese Growing
- 1388 – Simplified Chinese MIX
- 1390 – Simplified Chinese MIX Japan MIX
- 1399 – Japan MIX
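A few of the EBCDIC code pages listed above are available as codecs in Python's standard library, which makes it possible to show how they differ from each other and from ASCII. The sketch below assumes that Python's cp037 and cp500 codecs correspond to code pages 37 and 500 in the list; the names EBCDIC_CODECS, decode_ebcdic and encode_ebcdic are hypothetical.

```python
# Illustrative subset: IBM code page number -> Python codec name.
EBCDIC_CODECS = {37: "cp037", 500: "cp500"}


def decode_ebcdic(data, code_page):
    """Decode EBCDIC bytes using the codec for the given code page number."""
    return data.decode(EBCDIC_CODECS[code_page])


def encode_ebcdic(text, code_page):
    """Encode text to EBCDIC bytes using the codec for the given code page number."""
    return text.encode(EBCDIC_CODECS[code_page])


if __name__ == "__main__":
    hello = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6])
    # Letters sit at the same positions in both code pages.
    print(decode_ebcdic(hello, 37), decode_ebcdic(hello, 500))
    # Some punctuation, such as "[", is placed differently.
    print(encode_ebcdic("[", 37).hex(), encode_ebcdic("[", 500).hex())
    # None of these bytes match the ASCII encoding of the same text.
    print(hello == b"HELLO")
```

The five bytes decode to "HELLO" under both code pages, whereas the square bracket is encoded at a different byte value in each, which is why the code page number has to accompany EBCDIC data.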