IETF language tag
An IETF BCP 47 language tag is a standardized code that is used to identify human languages on the internet. The tag structure has been standardized by the Internet Engineering Task Force in Best Current Practice 47; the subtags are maintained by the IANA Language Subtag Registry.
To distinguish language variants for countries, regions, or writing systems, IETF language tags combine subtags from other standards such as ISO 639, ISO 15924, ISO 3166-1 and UN M.49.
For example, the tag en stands for English; es-419 for Latin American Spanish; rm-sursilv for Romansh Sursilvan; sr-Cyrl for Serbian written in Cyrillic script; nan-Hant-TW for Min Nan Chinese using traditional Han characters, as spoken in Taiwan; yue-Hant-HK for Cantonese using traditional Han characters, as spoken in Hong Kong; and gsw-u-sd-chzh for Zürich German.
IETF language tags are used by computing standards such as HTTP, HTML, XML and PNG.
History
IETF language tags were first defined in RFC 1766, edited by Harald Tveit Alvestrand and published in March 1995. The tags used ISO 639 two-letter language codes and ISO 3166 two-letter country codes, and allowed registration of whole tags that included variant or script subtags of three to eight letters. In January 2001, this was updated by RFC 3066, which added the use of ISO 639-2 three-letter codes, permitted subtags with digits, and adopted the concept of language ranges from HTTP/1.1 to help with matching of language tags.
The next revision of the specification came in September 2006 with the publication of RFC 4646, edited by Addison Phillips and Mark Davis, and RFC 4647. RFC 4646 introduced a more structured format for language tags, added the use of ISO 15924 four-letter script codes and UN M.49 three-digit geographical region codes, and replaced the old registry of tags with a new registry of subtags. The small number of previously defined tags that did not conform to the new structure were grandfathered in order to maintain compatibility with RFC 3066.
The current version of the specification, RFC 5646, was published in September 2009. The main purpose of this revision was to incorporate three-letter codes from ISO 639-3 and 639-5 into the Language Subtag Registry, in order to increase the interoperability between ISO 639 and BCP 47.
Syntax of language tags
Each language tag is composed of one or more "subtags" separated by hyphens. Each subtag is composed of basic Latin letters or digits only. With the exceptions of private-use language tags beginning with an x- prefix and grandfathered language tags, subtags occur in the following order:
- A single primary language subtag based on a two-letter language code from ISO 639-1 or a three-letter code from ISO 639-2, ISO 639-3 or ISO 639-5, or registered through the BCP 47 process and composed of five to eight letters;
- Up to three optional extended language subtags composed of three letters each, separated by hyphens;
- An optional script subtag, based on a four-letter script code from ISO 15924;
- An optional region subtag based on a two-letter country code from ISO 3166-1 alpha-2, or a three-digit code from UN M.49 for geographical regions;
- Optional variant subtags, separated by hyphens, each composed of five to eight alphanumeric characters, or of four characters starting with a digit;
- Optional extension subtags, separated by hyphens, each composed of a single character, with the exception of the letter x, and a hyphen followed by one or more subtags of two to eight characters each, separated by hyphens;
- An optional private-use subtag, composed of the letter x and a hyphen followed by subtags of one to eight characters each, separated by hyphens.
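The ordering above can be sketched as a single regular expression. This is a minimal illustration under stated assumptions, not a validator: the names `LANGTAG` and `parse_tag` are hypothetical, the pattern checks only the shape of each subtag (it does not consult the Language Subtag Registry), and it deliberately ignores grandfathered tags and tags that consist solely of an x- private-use sequence.

```python
import re

# Simplified shape check for the subtag order described above.
# Assumption: grandfathered tags and whole-tag private use ("x-...") are out of scope.
LANGTAG = re.compile(
    r"""
    ^(?P<language>[a-z]{2,3}|[a-z]{5,8})                      # primary language
    (?P<extlang>(?:-[a-z]{3}){0,3})                           # up to three extended language subtags
    (?:-(?P<script>[a-z]{4}))?                                # four-letter script
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?                       # two letters or three digits
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)    # variants
    (?P<extensions>(?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*)      # singleton (not x) plus subtags
    (?:-(?P<private>x(?:-[a-z0-9]{1,8})+))?$                  # private use
    """,
    re.IGNORECASE | re.VERBOSE,
)


def _split(run):
    """Turn a run such as '-a-b' into ['a', 'b']; an empty run gives []."""
    return [part for part in run.split("-") if part]


def parse_tag(tag):
    """Return a dict of subtags, or None if the tag does not fit the pattern."""
    match = LANGTAG.match(tag)
    if match is None:
        return None
    return {
        "language": match.group("language"),
        "extlang": _split(match.group("extlang")),
        "script": match.group("script"),
        "region": match.group("region"),
        "variants": _split(match.group("variants")),
        # All extension sequences are kept together as one string, e.g. "u-sd-chzh".
        "extensions": match.group("extensions").lstrip("-") or None,
        "private": match.group("private"),
    }


if __name__ == "__main__":
    for example in ("sr-Cyrl", "es-419", "ca-ES-valencia", "gsw-u-sd-chzh"):
        print(example, parse_tag(example))
```

Because every optional group starts with a hyphen and the whole pattern is anchored, the regular expression engine backtracks to the only split that respects the subtag lengths, so a four-letter script is not mistaken for a three-letter extended language subtag.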
Optional script and region subtags should be omitted when they add no distinguishing information to a language tag. For example, es is preferred over es-Latn, as Spanish is fully expected to be written in the Latin script; ja is preferred over ja-JP, as Japanese as used in Japan does not differ markedly from Japanese as used elsewhere.
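For the script case, the Language Subtag Registry records a Suppress-Script field for languages that are overwhelmingly written in one script. The sketch below shows how a tool might use that information to drop a redundant script subtag. The function name `drop_redundant_script` and the small `SUPPRESS_SCRIPT` table are illustrative assumptions: the table is a hand-copied excerpt, not the registry itself, and the function only looks at the second subtag, so tags with extended language subtags are left unchanged.

```python
# Hand-copied excerpt for illustration only; a real tool would read the
# Suppress-Script fields from the Language Subtag Registry.
SUPPRESS_SCRIPT = {"en": "Latn", "es": "Latn", "ja": "Jpan", "ru": "Cyrl"}


def drop_redundant_script(tag):
    """Remove the script subtag when it matches the language's default script."""
    subtags = tag.split("-")
    language = subtags[0].lower()
    if (
        len(subtags) > 1
        and len(subtags[1]) == 4
        and subtags[1].isalpha()  # four letters: a script, not a digit-led variant
        and subtags[1].title() == SUPPRESS_SCRIPT.get(language)
    ):
        del subtags[1]
    return "-".join(subtags)


if __name__ == "__main__":
    print(drop_redundant_script("es-Latn"))  # script is redundant for Spanish
    print(drop_redundant_script("sr-Cyrl"))  # kept: Serbian has no entry in the table
```

Deciding when a region subtag is redundant, as in ja-JP, is a matter of judgement about usage rather than a registry lookup, so the sketch does not attempt it.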
Not all linguistic regions can be represented with a valid region subtag: the subnational regional dialects of a primary language are registered as variant subtags. For example, the valencia variant subtag for the Valencian variant of Catalan is registered in the Language Subtag Registry with the prefix ca. As this dialect is spoken almost exclusively in Spain, the region subtag ES can normally be omitted.
Furthermore, there are script subtags that do not refer to traditional scripts such as Latin, or to scripts at all; these usually begin with a Z. For example, Zsye refers to emojis, Zmth to mathematical notation, Zxxx to unwritten documents and Zyyy to undetermined scripts.
IETF language tags have been used as locale identifiers in many applications. It may be necessary for these applications to establish their own strategy for defining, encoding and matching locales if the strategy described in RFC 4647 is not adequate.
The use, interpretation and matching of IETF language tags are currently defined in RFC 5646 and RFC 4647. The Language Subtag Registry lists all currently valid public subtags. Private-use subtags are not included in the Registry as they are implementation-dependent and subject to private agreements between third parties using them. These private agreements are out of scope of BCP 47.
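As an illustration of the matching side, the following sketch implements two of the schemes described in RFC 4647: basic filtering, which returns every tag that equals a language range or begins with it followed by a hyphen, and lookup, which truncates the range from the end until one tag matches. The function names and the `default` parameter are assumptions made for this example, and extended filtering is not covered.

```python
def basic_filter(language_range, tags):
    """Basic filtering: keep tags equal to the range or prefixed by range + '-'."""
    wanted = language_range.lower()
    if wanted == "*":  # the wildcard range matches every tag
        return list(tags)
    return [
        tag for tag in tags
        if tag.lower() == wanted or tag.lower().startswith(wanted + "-")
    ]


def lookup(language_range, tags, default=None):
    """Lookup: progressively truncate the range until exactly one tag matches."""
    available = {tag.lower(): tag for tag in tags}
    subtags = language_range.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in available:
            return available[candidate]
        subtags.pop()
        # A single-character subtag (such as "x" or an extension singleton)
        # is removed together with the subtag that followed it.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return default


if __name__ == "__main__":
    supported = ["de", "de-CH", "de-CH-1996", "en-DE", "fr"]
    print(basic_filter("de-CH", supported))
    print(lookup("de-AT", supported, default="en-DE"))
```

Filtering suits cases where several results are acceptable, such as listing all documents in a language, while lookup suits cases where exactly one resource must be chosen, such as selecting a user interface translation.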
List of common primary language subtags
The following is a list of some of the more commonly used primary language subtags. The list represents only a small subset of primary language subtags; for full information, the Language Subtag Registry should be consulted directly.
| English name | Native name | Subtag |
| Afrikaans | Afrikaans | af |
| Amharic | አማርኛ | am |
| Arabic | | ar |
| Mapudungun | Mapudungun | arn |
| Moroccan Arabic | | ary |
| Assamese | অসমীয়া | as |
| Azerbaijani | Azərbaycan | az |
| Bashkir | Башҡорт | ba |
| Belarusian | беларуская | be |
| Bulgarian | български | bg |
| Bengali | বাংলা | bn |
| Tibetan | བོད་ཡིག | bo |
| Breton | brezhoneg | br |
| Bosnian | bosanski/босански | bs |
| Catalan | català | ca |
| Central Kurdish | | ckb |
| Corsican | Corsu | co |
| Czech | čeština | cs |
| Welsh | Cymraeg | cy |
| Danish | dansk | da |
| German | Deutsch | de |
| Lower Sorbian | dolnoserbšćina | dsb |
| Divehi | | dv |
| Greek | Ελληνικά | el |
| English | English | en |
| Spanish | español | es |
| Estonian | eesti | et |
| Basque | euskara | eu |
| Persian | | fa |
| Finnish | suomi | fi |
| Filipino | Filipino | fil |
| Faroese | føroyskt | fo |
| French | français | fr |
| Frisian | Frysk | fy |
| Irish | Gaeilge | ga |
| Scottish Gaelic | Gàidhlig | gd |
| Gilbertese | Taetae ni Kiribati | gil |
| Galician | galego | gl |
| Swiss German | Schweizerdeutsch | gsw |
| Gujarati | ગુજરાતી | gu |
| Hausa | Hausa | ha |
| Hebrew | | he |
| Hindi | हिंदी | hi |
| Croatian | hrvatski | hr |
| Upper Sorbian | hornjoserbšćina | hsb |
| Hungarian | magyar | hu |
| Armenian | Հայերեն | hy |
| Indonesian | Bahasa Indonesia | id |
| Igbo | Igbo | ig |
| Yi | ꆈꌠꁱꂷ | ii |
| Icelandic | íslenska | is |
| Italian | italiano | it |
| Inuktitut | Inuktitut/ ᐃᓄᒃᑎᑐᑦ | iu |
| Japanese | 日本語 | ja |
| Georgian | ქართული | ka |
| Kazakh | Қазақша | kk |
| Greenlandic | kalaallisut | kl |
| Khmer | ខ្មែរ | km |
| Kannada | ಕನ್ನಡ | kn |
| Korean | 한국어 | ko |
| Konkani | कोंकणी | kok |
| Kurdish | Kurdî | ku |
| Kyrgyz | Кыргыз | ky |
| Luxembourgish | Lëtzebuergesch | lb |
| Lao | ລາວ | lo |
| Lithuanian | lietuvių | lt |
| Latvian | latviešu | lv |
| Maori | Reo Māori | mi |
| Macedonian | македонски јазик | mk |
| Malayalam | മലയാളം | ml |
| Mongolian | Монгол хэл/ ᠮᠤᠨᠭᠭᠤᠯ ᠬᠡᠯᠡ | mn |
| Mohawk | Kanien'kéha | moh |
| Marathi | मराठी | mr |
| Malay | Bahasa Malaysia | ms |
| Maltese | Malti | mt |
| Burmese | မြန်မာဘာသာ | my |
| Norwegian (Bokmål) | norsk (bokmål) | nb |
| Nepali | नेपाली | ne |
| Dutch | Nederlands | nl |
| Norwegian (Nynorsk) | norsk (nynorsk) | nn |
| Norwegian | norsk | no |
| Occitan | occitan | oc |
| Odia | ଓଡ଼ିଆ | or |
| Papiamento | Papiamentu | pap |
| Punjabi | ਪੰਜਾਬੀ | pa |
| Polish | polski | pl |
| Dari | | prs |
| Pashto | | ps |
| Portuguese | português | pt |
| K'iche | K'iche | quc |
| Quechua | runasimi | qu |
| Romansh | Rumantsch | rm |
| Romanian | română | ro |
| Russian | русский | ru |
| Kinyarwanda | Kinyarwanda | rw |
| Sanskrit | संस्कृत | sa |
| Yakut | саха | sah |
| Sindhi | | sd |
| Northern Sami | davvisámegiella | se |
| Sinhala | සිංහල | si |
| Slovak | slovenčina | sk |
| Slovenian | slovenščina | sl |
| Southern Sami | åarjelsaemiengiele | sma |
| Lule Sami | julevusámegiella | smj |
| Inari Sami | sämikielâ | smn |
| Skolt Sami | sääʹmǩiõll | sms |
| Albanian | shqip | sq |
| Serbian | srpski/српски | sr |
| Sesotho | Sesotho | st |
| Swedish | svenska | sv |
| Kiswahili | Kiswahili | sw |
| Syriac | | syc |
| Tamil | தமிழ் | ta |
| Telugu | తెలుగు | te |
| Tajik | Тоҷикӣ | tg |
| Thai | ไทย | th |
| Turkmen | türkmençe | tk |
| Tagalog | Tagalog | tl |
| Tswana | Setswana | tn |
| Turkish | Türkçe | tr |
| Tatar | Татарча | tt |
| Tamazight | Tamazight | tzm |
| Uyghur | | ug |
| Ukrainian | українська | uk |
| Urdu | | ur |
| Uzbek | Uzbek/Ўзбек | uz |
| Vietnamese | Tiếng Việt | vi |
| Wolof | Wolof | wo |
| Xhosa | isiXhosa | xh |
| Yiddish | יידיש | yi |
| Yoruba | Yoruba | yo |
| Chinese | 中文 | zh |
| Zulu | isiZulu | zu |