Natural language processing
Natural language processing (NLP) is the processing of natural language information by a computer. NLP is a subfield of computer science and is closely associated with artificial intelligence. It is also related to information retrieval, knowledge representation, computational linguistics, and linguistics more broadly.
Major processing tasks in an NLP system include: speech recognition, text classification, natural language understanding, and natural language generation.
History
Natural language processing has its roots in the 1950s. Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence, though at the time that was not articulated as a problem separate from artificial intelligence. The proposed test includes a task that involves the automated interpretation and generation of natural language.
Symbolic NLP (1950s – early 1990s)
The premise of symbolic NLP is often illustrated using John Searle's Chinese room thought experiment: Given a collection of rules, the computer emulates natural language understanding by applying those rules to the data it confronts.
- 1950s: The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem. However, real progress was much slower, and after the ALPAC report in 1966, which found that ten years of research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted in America until the late 1980s, when the first statistical machine translation systems were developed.
- 1960s: Some notably successful natural language processing systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 and 1966. Despite using minimal information about human thought or emotion, ELIZA was able to produce interactions that appeared human-like. When the "patient" exceeded the very small knowledge base, ELIZA might provide a generic response, for example, responding to "My head hurts" with "Why do you say your head hurts?". Ross Quillian's successful work on natural language was demonstrated with a vocabulary of only twenty words, because that was all that would fit in a computer memory at the time.
- 1970s: During the 1970s, many programmers began to write "conceptual ontologies", which structured real-world information into computer-understandable data. Examples are MARGIE, SAM, PAM, TaleSpin, QUALM, Politics, and Plot Units. During this time, the first chatterbots were written.
- 1980s: The 1980s and early 1990s mark the heyday of symbolic methods in NLP. Focus areas of the time included research on rule-based parsing, morphology, semantics, reference, and other areas of natural language understanding. Other lines of research were continued, e.g., the development of chatterbots with Racter and Jabberwacky. A notable trend of this period was the rising importance of quantitative evaluation.
Statistical NLP (1990s–present)
- 1990s: Many of the notable early successes in statistical methods in NLP occurred in the field of machine translation, due especially to work at IBM Research, such as IBM alignment models. These systems were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government. However, many systems relied on corpora that were specifically developed for the tasks they were designed to perform. This reliance has been a major limitation to their broader effectiveness and continues to affect similar systems. Consequently, significant research has focused on methods for learning effectively from limited amounts of data.
- 2000s: With the growth of the web, increasing amounts of raw language data have become available since the mid-1990s. Research has thus increasingly focused on unsupervised and semi-supervised learning algorithms. Such algorithms can learn from data that has not been hand-annotated with the desired answers, or from a combination of annotated and non-annotated data. Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data. However, large quantities of non-annotated data are available, which can often compensate for the lower sample efficiency, provided the algorithm used has a low enough time complexity to be practical.
- 2003: A word n-gram model, at the time the best statistical algorithm for language modeling, is outperformed by a multi-layer perceptron (a sketch of such an n-gram model follows this timeline).
- 2010: Tomáš Mikolov and co-authors applied a simple recurrent neural network with a single hidden layer to language modeling, and in the following years he went on to develop Word2vec. In the 2010s, representation learning and deep neural network-style machine learning methods became widespread in natural language processing. This shift gained momentum due to results showing that such techniques can achieve state-of-the-art results in many natural language tasks, e.g., in language modeling and parsing. This is increasingly important in medicine and healthcare, where NLP helps analyze notes and text in electronic health records that would otherwise be inaccessible for study by those seeking to improve care or protect patient privacy.
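For readers unfamiliar with word n-gram models, the sketch below shows a bigram language model with add-one smoothing trained on a toy corpus; the corpus and the smoothing choice are illustrative assumptions, not the setup of any published system.

```python
# Minimal bigram language model with add-one (Laplace) smoothing.
from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()
vocab = sorted(set(corpus))

bigrams = Counter(zip(corpus, corpus[1:]))   # counts of adjacent word pairs
unigrams = Counter(corpus)                   # counts of single words

def prob(word, prev):
    """P(word | prev) with add-one smoothing over the vocabulary."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

def sentence_prob(words):
    """Probability of a sentence as a product of conditional bigram probabilities."""
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= prob(word, prev)
    return p

print(sentence_prob("the cat sat on the rug .".split()))
```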
Approaches: Symbolic, statistical, neural networks
Machine learning approaches, which include both statistical and neural-network methods, have many advantages over the symbolic approach:
- both statistical and neural-network methods can focus more on the most common cases extracted from a corpus of texts, whereas the rule-based approach needs to provide rules for rare cases and common ones alike.
- language models, produced by either statistical or neural-network methods, are more robust to unfamiliar and erroneous input than rule-based systems, which are also more costly to produce.
- the larger such a language model is, the more accurate it becomes, in contrast to rule-based systems, which can gain accuracy only by increasing the amount and complexity of their rules, leading to intractability problems.
Nevertheless, rule-based approaches remain in common use:
- when the amount of training data is insufficient to successfully apply machine learning methods, e.g., for the machine translation of low-resource languages such as those handled by the Apertium system,
- for preprocessing in NLP pipelines, e.g., tokenization (a minimal rule-based sketch follows this list), or
- for post-processing and transforming the output of NLP pipelines, e.g., for knowledge extraction from syntactic parses.
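To make the preprocessing case concrete, below is a minimal sketch of a rule-based tokenizer: a single regular expression splits raw text into word, number, and punctuation tokens. The token pattern and example sentence are illustrative assumptions; production tokenizers handle many more cases (abbreviations, URLs, hyphenation, language-specific conventions).

```python
# A minimal rule-based tokenizer: a handful of regex rules split text into
# words, numbers, and punctuation characters.
import re

TOKEN_PATTERN = re.compile(
    r"""
    \d+(?:\.\d+)?      # numbers, with an optional decimal part
  | \w+(?:'\w+)?       # words, allowing a simple internal apostrophe
  | [^\w\s]            # any single punctuation character
    """,
    re.VERBOSE,
)

def tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(tokenize("Mr. Smith paid $3.50, didn't he?"))
# ['Mr', '.', 'Smith', 'paid', '$', '3.50', ',', "didn't", 'he', '?']
```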
Statistical approach
The earliest decision trees, producing systems of hard if–then rules, were still very similar to the old rule-based approaches.
Only the introduction of hidden Markov models, applied to part-of-speech tagging, announced the end of the old rule-based approach.
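To make this concrete, here is a minimal sketch of Viterbi decoding for a hidden Markov model part-of-speech tagger. The tag set, probabilities, and example sentence are toy values invented for illustration, not estimates from a real corpus.

```python
# Viterbi decoding for a toy HMM part-of-speech tagger.
# States are tags; observations are words; probabilities are made up for illustration.

tags = ["DET", "NOUN", "VERB"]

start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {                      # P(next_tag | tag)
    "DET":  {"DET": 0.05, "NOUN": 0.9,  "VERB": 0.05},
    "NOUN": {"DET": 0.1,  "NOUN": 0.3,  "VERB": 0.6},
    "VERB": {"DET": 0.5,  "NOUN": 0.4,  "VERB": 0.1},
}
emit_p = {                       # P(word | tag)
    "DET":  {"the": 0.7, "a": 0.3},
    "NOUN": {"dog": 0.4, "walk": 0.2, "park": 0.4},
    "VERB": {"walk": 0.6, "walks": 0.4},
}

def viterbi(words):
    # V[t][tag] = probability of the best tag sequence ending in `tag` at position t
    V = [{t: start_p[t] * emit_p[t].get(words[0], 1e-8) for t in tags}]
    back = []
    for word in words[1:]:
        col, ptr = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: V[-1][p] * trans_p[p][t])
            col[t] = V[-1][best_prev] * trans_p[best_prev][t] * emit_p[t].get(word, 1e-8)
            ptr[t] = best_prev
        V.append(col)
        back.append(ptr)
    # Follow the backpointers from the best final tag.
    best = max(tags, key=lambda t: V[-1][t])
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi("the dog walks".split()))   # ['DET', 'NOUN', 'VERB']
```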
Neural networks
A major drawback of statistical methods is that they require elaborate feature engineering. Since 2015, neural network–based methods have increasingly replaced traditional statistical approaches, using semantic networks and word embeddings to capture the semantic properties of words. Intermediate tasks (e.g., part-of-speech tagging and dependency parsing) are no longer needed.
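As a rough illustration of how embeddings capture semantic properties, the sketch below trains skip-gram-style word vectors with a full softmax on a toy corpus. The corpus, vector size, and training loop are simplifying assumptions; Word2vec itself uses tricks such as negative sampling and much larger data.

```python
# Skip-gram-style word embeddings trained with a full softmax on a toy corpus.
import numpy as np

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D, window, lr = len(vocab), 8, 2, 0.05

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))    # one embedding per vocabulary word
W_out = rng.normal(scale=0.1, size=(V, D))   # output ("context") vectors

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(200):                          # a few passes over the toy corpus
    for t, word in enumerate(corpus):
        for o in range(max(0, t - window), min(len(corpus), t + window + 1)):
            if o == t:
                continue
            target, context = idx[word], idx[corpus[o]]
            v = W_in[target]
            p = softmax(W_out @ v)            # predicted distribution over context words
            p[context] -= 1.0                 # gradient of the cross-entropy loss
            W_in[target] -= lr * (W_out.T @ p)
            W_out -= lr * np.outer(p, v)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Words used in similar contexts ("cat"/"dog") end up with similar vectors.
print(cosine(W_in[idx["cat"]], W_in[idx["dog"]]))
```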
Neural machine translation, based on then-newly invented sequence-to-sequence transformations, made obsolete the intermediate steps, such as word alignment, previously necessary for statistical machine translation.
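Below is a minimal sketch of such an encoder–decoder (sequence-to-sequence) model, assuming PyTorch and toy vocabulary sizes; it only shows how a source sentence is encoded into a hidden state from which the target sentence is decoded, with no attention mechanism, training loop, or real data.

```python
# A bare-bones GRU encoder–decoder, illustrative of sequence-to-sequence translation.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the whole source sentence into a single hidden state...
        _, h = self.encoder(self.src_emb(src_ids))
        # ...then decode the target sentence conditioned on that state.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), h)
        return self.out(dec_out)               # per-position scores over the target vocabulary

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (2, 7))   # a batch of 2 source sentences of length 7
tgt = torch.randint(0, 1200, (2, 9))   # corresponding target sentences of length 9
logits = model(src, tgt)
print(logits.shape)                    # torch.Size([2, 9, 1200])
```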
Common NLP tasks
The following is a list of some of the most commonly researched tasks in natural language processing. Some of these tasks have direct real-world applications, while others more commonly serve as subtasks that are used to aid in solving larger tasks. Though natural language processing tasks are closely intertwined, they can be subdivided into categories for convenience. A coarse division is given below.
Text and speech processing
; Optical character recognition (OCR): Given an image representing printed text, determine the corresponding text.
; Speech recognition: Given a sound clip of a person or people speaking, determine the textual representation of the speech. This is the opposite of text-to-speech and is one of the extremely difficult problems colloquially termed "AI-complete". In natural speech there are hardly any pauses between successive words, and thus speech segmentation is a necessary subtask of speech recognition. In most spoken languages, the sounds representing successive letters blend into each other in a process termed coarticulation, so the conversion of the analog signal to discrete characters can be a very difficult process. Also, given that words in the same language are spoken by people with different accents, speech recognition software must be able to map a wide variety of spoken input to a single textual equivalent.
; Speech segmentation: Given a sound clip of a person or people speaking, separate it into words. A subtask of speech recognition and typically grouped with it.
; Text-to-speech
; Word segmentation