Machine translation


Machine translation is the use of computational techniques to translate text or speech from one language to another, including the contextual, idiomatic, and pragmatic nuances of both languages.
While some language models can generate comprehensible results, machine translation tools remain limited by the complexity of language and emotion, often lacking depth and semantic precision. Translation quality is influenced by linguistic, grammatical, tonal, and cultural differences, so machine translation cannot yet fully replace human translators. Substantially improving quality requires an understanding of the target society's customs and historical context, and human intervention and visual cues remain necessary in simultaneous interpretation. On the other hand, domain-specific customization, such as for technical documentation or official texts, can yield more stable results and is commonly employed in multilingual websites and professional databases.
Initial approaches were mostly rule-based or statistical in nature. However, these methods have since been superseded by neural machine translation and large language models.

History

Origins

The origins of machine translation can be traced back to the work of Al-Kindi, a ninth-century Arabic cryptographer who developed techniques for systemic language translation, including cryptanalysis, frequency analysis, and probability and statistics, which are used in modern machine translation. The idea of machine translation later appeared in the 17th century. In 1629, René Descartes proposed a universal language, with equivalent ideas in different tongues sharing one symbol.
The idea of using digital computers for translation of natural languages was proposed as early as 1947 by England's A. D. Booth and, in the same year, by Warren Weaver of the Rockefeller Foundation. "The memorandum written by Warren Weaver in 1949 is perhaps the single most influential publication in the earliest days of machine translation." Others followed. A demonstration of a rudimentary translation of English into French was made in 1954 on the APEXC machine at Birkbeck College. Several papers on the topic were published at the time, and even articles in popular journals. A similar application, also pioneered at Birkbeck College at the time, was reading and composing Braille texts by computer.

1950s

The first researcher in the field, Yehoshua Bar-Hillel, began his research at MIT in 1951. A Georgetown University MT research team, led by Professor Michael Zarechnak, followed with a public demonstration of its Georgetown-IBM experiment system in 1954. MT research programs emerged in Japan and Russia, and the first MT conference was held in London.
David G. Hays "wrote about computer-assisted language processing as early as 1957" and "was project leader on computational linguistics at Rand from 1955 to 1968."

1960–1975

Researchers continued to join the field as the Association for Machine Translation and Computational Linguistics was formed in the U.S. and the National Academy of Sciences formed the Automatic Language Processing Advisory Committee (ALPAC) to study MT. Real progress was much slower, however, and after the ALPAC report, which found that the ten-year-long research had failed to fulfill expectations, funding was greatly reduced. According to a 1972 report by the Director of Defense Research and Engineering, the feasibility of large-scale MT was reestablished by the success of the Logos MT system in translating military manuals into Vietnamese during that conflict.
The French Textile Institute also used MT to translate abstracts from and into French, English, German and Spanish; Brigham Young University started a project to translate Mormon texts by automated translation.

1975–1980s

, which "pioneered the field under contracts from the U.S. government" in the 1960s, was used by Xerox to translate technical manuals. Beginning in the late 1980s, as computational power increased and became less expensive, more interest was shown in statistical models for machine translation. MT became more popular after the advent of computers. SYSTRAN's first implementation system was implemented in 1988 by the online service of the French Postal Service called Minitel. Various computer based translation companies were also launched, including Trados, which was the first to develop and market Translation Memory technology, though this is not the same as MT. The first commercial MT system for Russian / English / German-Ukrainian was developed at Kharkov State University.

1990s and early 2000s

By 1998, "for as little as $29.95" one could "buy a program for translating in one direction between English and a major European language of your choice" to run on a PC.
MT on the web started with SYSTRAN offering free translation of small texts and then providing this via AltaVista Babelfish, which racked up 500,000 requests a day. The second free translation service on the web was Lernout & Hauspie's GlobaLink. Atlantic Magazine wrote in 1998 that "Systran's Babelfish and GlobaLink's Comprende" handled "Don't bank on it" with a "competent performance."
Franz Josef Och, who later headed Google's translation development efforts, won DARPA's speed MT competition in 2003. More innovations during this time included MOSES, the open-source statistical MT engine, a text/SMS translation service for mobiles in Japan, and a mobile phone with built-in speech-to-speech translation functionality for English, Japanese and Chinese. In 2012, Google announced that Google Translate was translating roughly enough text to fill 1 million books in one day.

ANNs and LLMs in the 2020s

Approaches

Before the advent of deep learning methods, statistical methods required many hand-crafted rules accompanied by morphological, syntactic, and semantic annotations.

Rule-based

The rule-based machine translation approach was used mostly in the creation of dictionaries and grammar programs. Its biggest downfall was that everything had to be made explicit: orthographical variation and erroneous input had to be handled by the source-language analyser, and lexical selection rules had to be written for every instance of ambiguity.
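The flavor of such rules can be illustrated with a minimal sketch (the lexicon and the selection rule below are invented for this example and do not come from any real system): an English-to-French word list is paired with one explicit lexical selection rule for an ambiguous word.

    # Toy rule-based sketch: every source of ambiguity needs an explicit, hand-written rule.
    LEXICON = {"the": "le", "is": "est", "near": "près de", "river": "fleuve", "money": "argent"}

    def select_bank(tokens):
        # Hand-written lexical selection rule: "bank" is ambiguous in English,
        # so the rule must inspect the surrounding context explicitly.
        return "banque" if "money" in tokens else "rive"

    def translate(sentence):
        tokens = sentence.lower().split()
        out = [select_bank(tokens) if t == "bank" else LEXICON.get(t, t) for t in tokens]
        return " ".join(out)

    print(translate("the bank is near the river"))  # -> "le rive est près de le fleuve"
    print(translate("the bank is near the money"))  # -> "le banque est près de le argent"
    # Even these outputs still lack gender agreement and elision ("la rive", "l'argent"),
    # which would require yet more explicit rules.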

Transfer-based machine translation

Transfer-based machine translation was similar to interlingual machine translation in that it created a translation from an intermediate representation that simulated the meaning of the original sentence. Unlike interlingual MT, it depended partially on the language pair involved in the translation.

Interlingual

Interlingual machine translation was one instance of rule-based machine-translation approaches. In this approach, the source language, i.e. the text to be translated, was transformed into an interlingua, i.e. a "language-neutral" representation that is independent of any language. The target language was then generated out of the interlingua. The only interlingual machine translation system that was made operational at the commercial level was the KANT system, which was designed to translate Caterpillar Technical English into other languages.
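The idea can be sketched in a deliberately simplified form (the frame and word lists below are invented and bear no relation to the internals of KANT or any other real system): the source sentence is analysed into a language-neutral frame, and the target sentence is generated from that frame alone.

    # Toy interlingual sketch: source text -> language-neutral representation -> target text.
    def analyze_english(sentence):
        # Maps "the cat eats the fish" to a simple predicate-argument frame.
        concepts = {"cat": "CAT", "fish": "FISH", "eats": "EAT"}
        w = sentence.lower().split()
        return {"event": concepts[w[2]], "agent": concepts[w[1]], "patient": concepts[w[4]]}

    def generate_spanish(frame):
        # Generation consults only the interlingua, never the English source.
        lex = {"CAT": "el gato", "FISH": "el pescado", "EAT": "come"}
        return f"{lex[frame['agent']]} {lex[frame['event']]} {lex[frame['patient']]}"

    frame = analyze_english("the cat eats the fish")
    print(frame)                    # {'event': 'EAT', 'agent': 'CAT', 'patient': 'FISH'}
    print(generate_spanish(frame))  # el gato come el pescado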

Dictionary-based

Dictionary-based machine translation translated words individually, as given by dictionary entries, typically without accounting for how the words relate to one another in context.
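A minimal sketch (with an invented five-word dictionary) shows both the method and its main limitation: the result is a word-for-word gloss with no reordering or agreement.

    # Toy dictionary-based sketch: each word is simply replaced by its dictionary entry.
    EN_ES = {"i": "yo", "see": "veo", "a": "una", "white": "blanca", "house": "casa"}

    def gloss(sentence):
        return " ".join(EN_ES.get(w, w) for w in sentence.lower().split())

    print(gloss("I see a white house"))  # -> "yo veo una blanca casa"
    # Correct Spanish would be "veo una casa blanca": the word order is wrong and the
    # subject pronoun is redundant, because each word was translated in isolation.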

Statistical

Statistical machine translation tried to generate translations using statistical methods based on bilingual text corpora, such as the Canadian Hansard corpus (the English-French record of the Canadian parliament) and EUROPARL (the record of the European Parliament). Where such corpora were available, good results were achieved translating similar texts, but such corpora were rare for many language pairs. The first statistical machine translation software was CANDIDE from IBM. In 2005, Google improved its internal translation capabilities by using approximately 200 billion words from United Nations materials to train its system; translation accuracy improved.
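The standard formulation was the noisy-channel model: given a foreign sentence f, choose the target sentence e that maximizes P(f|e) · P(e), where the translation model P(f|e) is estimated from a bilingual corpus and the language model P(e) from monolingual target-language text. A minimal sketch, with probabilities invented purely for illustration:

    # Toy noisy-channel sketch of SMT: pick the English candidate e maximizing P(f|e) * P(e).
    # All numbers below are invented for illustration.
    f = "l'esprit est fort"
    candidates = ["the spirit is willing", "the liquor is strong"]

    p_f_given_e = {"the spirit is willing": 0.30,   # translation model, from parallel text
                   "the liquor is strong":  0.35}
    p_e         = {"the spirit is willing": 0.020,  # language model, from monolingual text
                   "the liquor is strong":  0.005}

    best = max(candidates, key=lambda e: p_f_given_e[e] * p_e[e])
    print(best)  # -> "the spirit is willing" (0.30 * 0.020 = 0.006 beats 0.35 * 0.005 = 0.00175)
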
SMT's biggest downfalls included its dependence on huge amounts of parallel text, its problems with morphologically rich languages, and its inability to correct singleton errors.
Some work has been done on the use of multiparallel corpora, that is, bodies of text that have been translated into three or more languages. Using these methods, a text that has been translated into two or more languages can be used in combination to provide a more accurate translation into a third language than if just one of those source languages were used alone.

Neural MT

A deep learning-based approach to MT, neural machine translation has made rapid progress in recent years. However, the current consensus is that the so-called human parity achieved is not real, being based wholly on limited domains, language pairs, and certain test benchmarks, i.e., it lacks statistical power.
Translations by neural MT tools like DeepL Translator, which is thought to usually deliver the best machine translation results as of 2022, typically still need post-editing by a human.
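As a minimal, hedged sketch of running a specialized neural translation model locally (assuming the Hugging Face transformers library and the publicly available Helsinki-NLP/opus-mt-en-fr model; any comparable pretrained model would serve):

    # Minimal sketch: translating with a pretrained neural MT model.
    # Assumes: pip install transformers sentencepiece torch
    from transformers import pipeline

    translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
    result = translator("Neural machine translation has made rapid progress in recent years.")
    print(result[0]["translation_text"])
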
Instead of training specialized translation models on parallel datasets, one can also directly prompt generative large language models like GPT to translate a text. This approach is considered promising, but is still more resource-intensive than specialized translation models.
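Prompting can be sketched as follows (assuming the openai Python client with its v1-style chat interface; the model name is a placeholder, and output quality and cost depend entirely on the model chosen):

    # Minimal sketch: prompting a generative LLM to translate, with no task-specific training.
    # Assumes: pip install openai, and an OPENAI_API_KEY set in the environment.
    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; substitute any available chat model
        messages=[
            {"role": "system", "content": "You are a professional English-to-French translator."},
            {"role": "user", "content": "Translate into French: Machine translation still benefits from human post-editing."},
        ],
    )
    print(response.choices[0].message.content)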

Issues

Studies using human evaluation have systematically identified various issues with the latest advanced MT outputs. Some quality evaluation studies have found that, in several languages, human translations outperform ChatGPT-produced translations in terminological accuracy and clarity of expression. Common issues include the translation of ambiguous passages whose correct rendering requires common-sense-like semantic processing or broader context. There can also be errors in the source texts and a lack of high-quality training data, and the severity and frequency of several types of problems may not be reduced with the techniques used to date, so some level of active human participation remains necessary.