Apertium


Apertium is a free and open-source rule-based machine translation platform, released under the terms of the GNU General Public License.

Overview

Apertium is a transfer-based machine translation system. It uses finite-state transducers for all of its lexical transformations, and Constraint Grammar taggers, hidden Markov models or perceptrons for part-of-speech tagging (word category disambiguation). A structural transfer component is responsible for word movement and agreement; most Apertium language pairs to date use "chunking" or shallow-transfer rules, though newer pairs use rules defined in a context-free grammar.
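As an illustration of how a finite-state transducer can map a surface form to a dictionary form plus tags, the following minimal Python sketch walks a hand-written transition table. It is not Apertium's implementation: the table, the state numbers and the function name `analyse` are assumptions made for this example, and the angle-bracket tags such as `<n>` and `<pl>` are used purely as illustrative labels.

```python
from typing import Dict, List, Tuple

# (state, input symbol) -> list of (emitted output, next state).
# Hypothetical toy table that only knows the words "cat" and "cats".
TRANSITIONS: Dict[Tuple[int, str], List[Tuple[str, int]]] = {
    (0, "c"): [("c", 1)],
    (1, "a"): [("a", 2)],
    (2, "t"): [("t", 3)],
    (3, "s"): [("<n><pl>", 4)],  # plural suffix becomes a tag sequence
}

# Accepting states and the output appended when the input ends there.
FINAL_OUTPUT: Dict[int, str] = {3: "<n><sg>", 4: ""}


def analyse(surface: str) -> List[str]:
    """Return every analysis the toy transducer accepts for `surface`."""
    configs: List[Tuple[int, str]] = [(0, "")]  # (state, output so far)
    for symbol in surface:
        next_configs: List[Tuple[int, str]] = []
        for state, output in configs:
            for emitted, target in TRANSITIONS.get((state, symbol), []):
                next_configs.append((target, output + emitted))
        configs = next_configs
    return [
        output + FINAL_OUTPUT[state]
        for state, output in configs
        if state in FINAL_OUTPUT
    ]


if __name__ == "__main__":
    for word in ("cat", "cats", "dog"):
        print(word, analyse(word))
```

Because the function tracks a list of configurations rather than a single state, the same loop would also return several analyses for an ambiguous word if the table contained more than one matching transition.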
Many existing machine translation systems are commercial or use proprietary technologies, which makes them very hard to adapt to new uses. Apertium's code and data are free software and follow a language-independent specification, which eases contribution, makes development more efficient, and supports the project's overall growth.
At present, Apertium has released 51 stable language pairs, delivering fast translation with reasonably intelligible results. Being an open-source project, Apertium provides tools for potential developers to build their own language pair and contribute to the project.

History

Apertium originated as one of the machine translation engines in the OpenTrad project, which was funded by the Spanish government and developed by the Transducens research group at the Universitat d'Alacant. It was originally designed to translate between closely related languages, although it has since been expanded to handle more divergent language pairs. To create a new machine translation system, one only has to develop linguistic data in well-specified XML formats.
Language data developed for it currently support the Arabic, Aragonese, Asturian, Basque, Belarusian, Breton, Bulgarian, Catalan, Crimean Tatar, Danish, English, Esperanto, French, Galician, Hindi, Icelandic, Indonesian, Italian, Kazakh, Macedonian, Malaysian, Maltese, Northern Sami, Norwegian, Occitan, Polish, Portuguese, Romanian, Russian, Sardinian, Serbo-Croatian, Silesian, Slovene, Spanish, Swedish, Tatar, Ukrainian, Urdu, and Welsh languages. A full list is available below. Several companies are also involved in the development of Apertium, including Prompsit Language Engineering, Imaxin Software and Eleka Ingeniaritza Linguistikoa.
The project has taken part in the 2009, 2010, 2011, 2012, 2013 and 2014 editions of Google Summer of Code and the 2010, 2011, 2012, 2013, 2014, 2015, 2016 and 2017 editions of Google Code-In.

Translation methodology

This is an overall, step-by-step view of how Apertium works.
The diagram displays the steps that Apertium takes to translate a source-language text into a target-language text.
  1. Source language text is passed into Apertium for translation.
  2. The deformatter separates out formatting markup, which must be preserved but not translated.
  3. The morphological analyser segments the text and looks up the segments in the language dictionaries, returning dictionary forms and tags for all matches. In pairs that involve agglutinative morphology, including a number of Turkic languages, a Helsinki Finite State Transducer is used. Otherwise, an Apertium-specific finite-state transducer system called lttoolbox is used.
  4. The morphological disambiguator resolves ambiguous segments by choosing one match. Apertium uses Constraint Grammar rules for most of its language pairs.
  5. Retokenisation uses a finite-state transducer to match sequences of lexical units and may reorder or translate tags.
  6. Lexical transfer looks up disambiguated source-language basewords to find their target-language equivalents. For lexical transfer, Apertium uses an XML-based dictionary format called bidix.
  7. Lexical selection chooses between alternative translations when the source text word has alternative meanings. Apertium uses a specific XML-based technology, apertium-lex-tools, to perform lexical selection.
  8. Structural transfer can consist of one-step chunking transfer, three-step chunking transfer or a CFG-based transfer module. The chunking modules flag grammatical differences between the source language and the target language by creating a sequence of chunks that contain markers for these differences. They then reorder or modify the chunks to produce a grammatical translation in the target language. The newer CFG-based module parses input sequences into possible parse trees, selects the best-ranking one and applies transformation rules to that tree.
  9. The morphological generator uses the tags to deliver the correct target language surface form. The morphological generator is a morphological transducer, just like the morphological analyser. A morphological transducer both analyses and generates forms.
  10. The post-generator makes any necessary orthographic changes due to the contact of words.
  11. The reformatter restores the formatting markup that was removed by the deformatter in step 2.
  12. Apertium delivers the target-language translation.
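The flow of the steps above can be sketched as a chain of small Python functions. This is a toy under stated assumptions, not Apertium's code: the three dictionaries, the single disambiguation rule, the noun-adjective reordering rule and all function names are hypothetical, and the retokenisation and post-generation steps are omitted. Lexical selection is reduced to taking the first listed translation.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical toy data for a Spanish-to-English demonstration.
ANALYSER: Dict[str, List[str]] = {
    "los": ["lo<prn><pl>", "el<det><pl>"],  # ambiguous: pronoun or determiner
    "gatos": ["gato<n><pl>"],
    "negros": ["negro<adj><pl>"],
}
BIDIX: Dict[str, List[str]] = {"el": ["the"], "gato": ["cat"], "negro": ["black"]}
GENERATOR: Dict[str, str] = {
    "the<det><pl>": "the",
    "black<adj><pl>": "black",
    "cat<n><pl>": "cats",
}


def deformat(text: str) -> Tuple[str, str, str]:
    """Split leading and trailing markup from the translatable text."""
    m = re.fullmatch(r"((?:<[^>]+>)*)(.*?)((?:</[^>]+>)*)", text, re.S)
    return m.group(1), m.group(2), m.group(3)


def analyse(text: str) -> List[List[str]]:
    """Look up each token; unknown tokens are passed on marked with '*'."""
    return [ANALYSER.get(tok, ["*" + tok]) for tok in text.split()]


def disambiguate(candidates: List[List[str]]) -> List[str]:
    """Toy rule: prefer a determiner reading when a noun follows."""
    chosen = []
    for i, options in enumerate(candidates):
        noun_follows = i + 1 < len(candidates) and any(
            "<n>" in a for a in candidates[i + 1]
        )
        dets = [a for a in options if "<det>" in a]
        chosen.append(dets[0] if dets and noun_follows else options[0])
    return chosen


def lexical_transfer(units: List[str]) -> List[str]:
    """Replace each source lemma by its first listed target lemma."""
    out = []
    for unit in units:
        lemma, sep, tags = unit.partition("<")
        out.append(BIDIX.get(lemma, [lemma])[0] + sep + tags)
    return out


def structural_transfer(units: List[str]) -> List[str]:
    """Reorder noun + adjective into adjective + noun."""
    units = list(units)
    i = 0
    while i + 1 < len(units):
        if "<n>" in units[i] and "<adj>" in units[i + 1]:
            units[i], units[i + 1] = units[i + 1], units[i]
            i += 2
        else:
            i += 1
    return units


def generate(units: List[str]) -> str:
    """Map lemma-plus-tags units to target surface forms."""
    return " ".join(GENERATOR.get(u, u) for u in units)


def translate(text: str) -> str:
    prefix, body, suffix = deformat(text)
    units = disambiguate(analyse(body))
    units = structural_transfer(lexical_transfer(units))
    return prefix + generate(units) + suffix  # reformatting step


if __name__ == "__main__":
    print(translate("<b>los gatos negros</b>"))
```

Keeping each stage as a separate function with a plain list interface mirrors the modular design described in the list: any single stage can be replaced, for example by a richer disambiguator, without touching the others.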

Supported languages

The following 108 pairs and 51 languages and language varieties are supported by Apertium.
  1. Afrikaans to Dutch
  2. Arabic to Maltese
  3. Aragonese to Catalan
  4. Aragonese to Spanish
  5. Arpitan to French
  6. Basque to English
  7. Basque to Spanish
  8. Belarusian to Russian
  9. Breton to French
  10. Bulgarian to Macedonian
  11. Catalan to Aragonese
  12. Catalan to English
  13. Catalan to Esperanto
  14. Catalan to French
  15. Catalan to Italian
  16. Catalan to Occitan
  17. Catalan to Aranese
  18. Catalan to Portuguese
  19. Catalan to Brazilian Portuguese
  20. Catalan to European Portuguese
  21. Catalan to Romanian
  22. Catalan to Sardinian
  23. Catalan to Spanish
  24. Crimean Tatar to Turkish
  25. Danish to Norwegian
  26. Danish to Norwegian
  27. Danish to Swedish
  28. Dutch to Afrikaans
  29. English to Catalan
  30. English to Valencian
  31. English to Esperanto
  32. English to Galician
  33. English to Serbo-Croatian
  34. English to Spanish
  35. Esperanto to English
  36. French to Arpitan
  37. French to Catalan
  38. French to Esperanto
  39. French to Occitan
  40. French to Gascon
  41. French to Spanish
  42. Galician to English
  43. Galician to Portuguese
  44. Galician to Spanish
  45. Hindi to Urdu
  46. Icelandic to English
  47. Icelandic to Swedish
  48. Indonesian to Malay
  49. Italian to Catalan
  50. Italian to Sardinian
  51. Italian to Spanish
  52. Kazakh to Tatar
  53. Macedonian to Bulgarian
  54. Macedonian to English
  55. Malay to Indonesian
  56. Maltese to Arabic
  57. Northern Sámi to Norwegian
  58. Norwegian to Danish
  59. Norwegian to Norwegian
  60. Norwegian to East Norwegian, vi→vi
  61. Norwegian to Swedish
  62. Norwegian to Danish
  63. Norwegian to Norwegian
  64. Norwegian to East Norwegian, vi→vi
  65. Norwegian to Swedish
  66. East Norwegian, vi→vi to Norwegian
  67. Occitan to Catalan
  68. Occitan to French
  69. Occitan to Spanish
  70. Aranese to Catalan
  71. Aranese to Spanish
  72. Gascon to French
  73. Polish to Silesian
  74. Portuguese to Catalan
  75. Portuguese to Galician
  76. Portuguese to Spanish
  77. Romanian to Catalan
  78. Romanian to Spanish
  79. Russian to Belarusian
  80. Russian to Ukrainian
  81. Sardinian to Italian
  82. Serbo-Croatian to English
  83. Serbo-Croatian to Macedonian
  84. Serbo-Croatian to Slovenian
  85. Silesian to Polish
  86. Slovenian to Serbo-Croatian
  87. Spanish to Aragonese
  88. Spanish to Asturian
  89. Spanish to Catalan
  90. Spanish to Valencian
  91. Spanish to English
  92. Spanish to Esperanto
  93. Spanish to French
  94. Spanish to Galician
  95. Spanish to Italian
  96. Spanish to Occitan
  97. Spanish to Aranese
  98. Spanish to Portuguese
  99. Spanish to Brazilian Portuguese
  100. Swedish to Danish
  101. Swedish to Icelandic
  102. Swedish to Norwegian
  103. Swedish to Norwegian
  104. Tatar to Kazakh
  105. Turkish to Crimean Tatar
  106. Ukrainian to Russian
  107. Urdu to Hindi
  108. Welsh to English

End-user services and software

Online translation websites

Offline applications
