Collation


Collation is the assembly of written information into a standard order. Many systems of collation are based on numerical order or alphabetical order, or extensions and combinations thereof. Collation is a fundamental element of most office filing systems, library catalogs, and reference books.
Collation differs from classification in that the classes themselves are not necessarily ordered. However, even if the order of the classes is irrelevant, the identifiers of the classes may be members of an ordered set, allowing a sorting algorithm to arrange the items by class.
Formally speaking, a collation method typically defines a total order on a set of possible identifiers, called sort keys, which consequently produces a total preorder on the set of items of information.
A collation algorithm such as the Unicode collation algorithm defines an order through the process of comparing two given character strings and deciding which should come before the other. When an order has been defined in this way, a sorting algorithm can be used to put a list of any number of items into that order.
The main advantage of collation is that it makes it fast and easy for a user to find an element in the list, or to confirm that it is absent from the list. In automatic systems this can be done using a binary search algorithm or interpolation search; manual searching may be performed using a roughly similar procedure, though this will often be done unconsciously. Other advantages are that one can easily find the first or last elements on the list, or elements in a given range.

Ordering

Numerical and chronological

Strings representing numbers may be sorted based on the values of the numbers that they represent. For example, "−4", "2.5", "10", "89", "30,000". Pure application of this method may provide only a partial ordering on the strings, since different strings can represent the same number.
A similar approach may be taken with strings representing dates or other items that can be ordered chronologically or in some other natural fashion.

[|Alphabetical]

is the basis for many systems of collation where items of information are identified by strings consisting principally of letters from an alphabet. The ordering of the strings relies on the existence of a standard ordering for the letters of the alphabet in question.
To decide which of two strings comes first in alphabetical order, initially their first letters are compared. The string whose first letter appears earlier in the alphabet comes first in alphabetical order. If the first letters are the same, then the second letters are compared, and so on, until the order is decided. The result of arranging a set of strings in alphabetical order is that words with the same first letter are grouped together, and within such a group words with the same first two letters are grouped together, and so on.
Capital letters are typically treated as equivalent to their corresponding lowercase letters.
Certain limitations, complications, and special conventions may apply when alphabetical order is used:
  • When strings contain spaces or other word dividers, the decision must be taken whether to ignore these dividers or to treat them as symbols preceding all other letters of the alphabet. For example, if the first approach is taken then "car park" will come after "carbon" and "carp", whereas in the second approach "car park" will come before those two words. The first rule is used in many dictionaries, the second in telephone directories.
  • Abbreviations may be treated as if they were spelt out in full. For example, names containing "St." are often ordered as if they were written out as "Saint". There is also a traditional convention in English that surnames beginning Mc and M' are listed as if those prefixes were written Mac.
  • Strings that represent personal names will often be listed by alphabetical order of surname, even if the given name comes first. For example, Juan Hernandes and Brian O'Leary should be sorted as "Hernandes, Juan" and "O'Leary, Brian" even if they are not written this way.
  • Very common initial words, such as The in English, are often ignored for sorting purposes. So The Shining would be sorted as just "Shining" or "Shining, The".
  • When some of the strings contain numerals, various approaches are possible. Sometimes such characters are treated as if they came before or after all the letters of the alphabet. Another method is for numbers to be sorted alphabetically as they would be spelled: for example 1776 would be sorted as if spelled out "seventeen seventy-six", and 24 heures du Mans as if spelled "vingt-quatre...". When numerals or other symbols are used as special graphical forms of letters, as in 1337 for leet or Se7en for the movie title Seven, they may be sorted as if they were those letters.
  • Languages have different conventions for treating modified letters and certain letter combinations. For example, in Spanish the letter ñ is treated as a basic letter following n, and the digraphs ch and ll were formerly treated as basic letters following c and l, although they are now alphabetized as two-letter combinations. A list of such conventions for various languages can be found at.
In several languages the rules have changed over time, and so older dictionaries may use a different order than modern ones. Furthermore, collation may depend on use. For example, German dictionaries and telephone directories use different approaches.

Root sorting

Some Arabic dictionaries, such as Hans Wehr's bilingual A Dictionary of Modern Written Arabic, group and sort Arabic words by semitic root. For example, the words kitāba, kitāb, kātib, maktaba, maktab, maktūb, are agglomerated under the triliteral root k-t-b, which denotes 'writing'.

Radical-and-stroke sorting

Another form of collation is radical-and-stroke sorting, used for non-alphabetic writing systems such as the hanzi of Chinese and the kanji of Japanese, whose thousands of symbols defy ordering by convention. In this system, common components of characters are identified; these are called radicals in Chinese and logographic systems derived from Chinese. Characters are then grouped by their primary radical, then ordered by number of pen strokes within radicals. When there is no obvious radical or more than one radical, convention governs which is used for collation. For example, the Chinese character 妈 is sorted as a six-stroke character under the three-stroke primary radical 女.
The radical-and-stroke system is cumbersome compared to an alphabetical system in which there are a few characters, all unambiguous. The choice of which components of a logograph comprise separate radicals and which radical is primary is not clear-cut. As a result, logographic languages often supplement radical-and-stroke ordering with alphabetic sorting of a phonetic conversion of the logographs. For example, the kanji word Tōkyō can be sorted as if it were spelled out in the Japanese characters of the hiragana syllabary as "to-u-ki-yo-u", using the conventional sorting order for these characters.
In addition, Chinese characters can also be sorted by stroke-based sorting. In Greater China, surname stroke ordering is a convention in some official documents where people's names are listed without hierarchy.

Automation

When information is stored in digital systems, collation may become an automated process. It is then necessary to implement an appropriate collation algorithm that allows the information to be sorted in a satisfactory manner for the application in question. Often the aim will be to achieve an alphabetical or numerical ordering that follows the standard criteria as described in the preceding sections. However, not all of these criteria are easy to automate.
The simplest kind of automated collation is based on the numerical codes of the symbols in a character set, such as ASCII coding, with the symbols being ordered in increasing numerical order of their codes, and this ordering being extended to strings in accordance with the basic principles of alphabetical ordering. So a computer program might treat the characters a, b, C, d, and $ as being ordered $, C, a, b, d. Therefore, strings beginning with C, M, or Z would be sorted before strings with lower-case a, b, etc. This is sometimes called ASCIIbetical order. This deviates from the standard alphabetical order, particularly due to the ordering of capital letters before all lower-case ones. It is therefore often applied with certain alterations, the most obvious being case conversion before comparison of ASCII values.
In many collation algorithms, the comparison is based not on the numerical codes of the characters, but with reference to the collating sequence – a sequence in which the characters are assumed to come for the purpose of collation – as well as other ordering rules appropriate to the given application. This can serve to apply the correct conventions used for alphabetical ordering in the language in question, dealing properly with differently cased letters, modified letters, digraphs, particular abbreviations, and so on, as mentioned above under Alphabetical order, and in detail in the Alphabetical order article. Such algorithms are potentially quite complex, possibly requiring several passes through the text.
Problems are nonetheless still common when the algorithm has to encompass more than one language. For example, in German dictionaries the word ökonomisch comes between offenbar and olfaktorisch, while Turkish dictionaries treat o and ö as different letters, placing oyun before öbür.
A standard algorithm for collating any collection of strings composed of any standard Unicode symbols is the Unicode Collation Algorithm. This can be adapted to use the appropriate collation sequence for a given language by tailoring its default collation table. Several such tailorings are collected in Common Locale Data Repository.