Treebank
In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data.
Etymology
The term treebank was coined by linguist Geoffrey Leech in the 1980s, by analogy to other repositories such as a seedbank or bloodbank. This is because both syntactic and semantic structure are commonly represented compositionally as a tree structure. The term parsed corpus is often used interchangeably with the term treebank, with the emphasis on the primacy of sentences rather than trees.
Construction
Treebanks are often created on top of a corpus that has already been annotated with part-of-speech tags. In turn, treebanks are sometimes enhanced with semantic or other linguistic information. Treebanks can be created completely manually, where linguists annotate each sentence with syntactic structure, or semi-automatically, where a parser assigns some syntactic structure which linguists then check and, if necessary, correct. In practice, fully checking and completing the parsing of natural language corpora is a labour-intensive project that can take teams of graduate linguists several years. The level of annotation detail and the breadth of the linguistic sample determine the difficulty of the task and the length of time required to build a treebank.
Figure: Example phrase structure tree for John loves Mary
Figure: Hybrid constituency/dependency tree from the Quranic Arabic Corpus
Some treebanks follow a specific linguistic theory in their syntactic annotation but most try to be less theory-specific. However, two main groups can be distinguished: treebanks that annotate phrase structure and those that annotate dependency structure.
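The difference between the two groups can be sketched with two small data structures. This is only an illustration: the class names `Phrase` and `Token`, the Penn-style category labels and the relation names `nsubj`, `obj` and `root` are assumptions chosen for the example, not the scheme of any particular treebank.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Phrase:
    """Node of a phrase structure tree: a category label over children or one word."""
    label: str
    children: List["Phrase"] = field(default_factory=list)
    word: Optional[str] = None

    def leaves(self):
        """Return the words covered by this node, left to right."""
        if self.word is not None:
            return [self.word]
        return [w for child in self.children for w in child.leaves()]

    def size(self):
        """Count this node and all nodes below it."""
        return 1 + sum(child.size() for child in self.children)


@dataclass
class Token:
    """Node of a dependency tree: one word pointing at its head word."""
    index: int      # 1-based position in the sentence
    form: str       # the word itself
    head: int       # index of the head word, 0 for the root
    relation: str   # label on the edge from head to this word


# Phrase structure: words are grouped under non-terminal categories.
phrase_tree = Phrase("S", [
    Phrase("NP", [Phrase("NNP", word="John")]),
    Phrase("VP", [
        Phrase("VBZ", word="loves"),
        Phrase("NP", [Phrase("NNP", word="Mary")]),
    ]),
])

# Dependency structure: every node is a word; edges link words directly.
dependency_tree = [
    Token(1, "John", 2, "nsubj"),
    Token(2, "loves", 0, "root"),
    Token(3, "Mary", 2, "obj"),
]

print(phrase_tree.leaves())
print(phrase_tree.size(), len(dependency_tree))
```

In this toy sentence the phrase structure tree needs seven nodes for three words, while the dependency tree has exactly one node per word.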
It is important to distinguish the formal representation from the file format used to store the annotated data. Treebanks are necessarily constructed according to a particular grammar, and the same grammar may be implemented by different file formats. For example, the syntactic analysis of John loves Mary shown in the figure may be represented by simple labelled brackets in a text file, like this:
(S (NP (NNP John))
   (VP (VBZ loves)
       (NP (NNP Mary)))
   (. .))
This type of representation is popular because it is light on resources, and the tree structure is relatively easy to read without software tools. However, as corpora become increasingly complex, other file formats may be preferred. Alternatives include treebank-specific XML schemes, numbered indentation and various types of standoff notation.
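Labelled-bracket notation is also simple to read by program. The following is a minimal sketch of a reader, assuming Penn-style brackets in which every opening bracket is followed by a category label; the function names `parse_brackets` and `leaves` and the sample sentence string are illustrative and do not come from any specific toolkit.

```python
import re


def parse_brackets(text):
    """Parse labelled brackets into nested (label, children) tuples.

    Children are either further (label, children) tuples or plain word strings.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word under a part-of-speech tag
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    try:
        tree = read_node()
    except IndexError:
        raise ValueError("unbalanced brackets") from None
    if pos != len(tokens):
        raise ValueError("unexpected text after the closing bracket")
    return tree


def leaves(tree):
    """Return the words of a parsed tree, left to right."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


sample = "(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))))"
tree = parse_brackets(sample)
print(tree[0], leaves(tree))
```

Because whitespace is ignored by the tokenizer, the same reader accepts both single-line and indented multi-line bracketings.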
Applications
From a computational linguistics perspective, treebanks have been used to engineer state-of-the-art natural language processing systems such as part-of-speech taggers, parsers, semantic analyzers and machine translation systems. Most computational systems use gold-standard treebank data. However, an automatically parsed corpus that has not been corrected by human linguists can still be useful: a parser may be improved by applying it to large amounts of text and gathering rule frequencies. Only by correcting and completing a corpus by hand, however, is it possible to identify rules absent from the parser's knowledge base. In addition, frequencies drawn from a hand-corrected corpus are likely to be more accurate.
In corpus linguistics, treebanks are used to study syntactic phenomena. Once parsed, a corpus will contain frequency evidence showing how common different grammatical structures are in use. Treebanks also provide evidence of coverage and support the discovery of new, unanticipated, grammatical phenomena.
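Gathering rule frequencies from a parsed corpus can be sketched in a few lines. The example below assumes trees stored as nested `(label, children)` tuples and a two-sentence toy treebank invented for illustration; the function names are hypothetical, and lexical rules (a tag rewriting as a word) are deliberately left out of the counts.

```python
from collections import Counter


def productions(tree):
    """Yield the phrasal rules (parent label, tuple of child labels) used in a tree."""
    label, children = tree
    subtrees = [child for child in children if not isinstance(child, str)]
    if subtrees:
        yield (label, tuple(sub[0] for sub in subtrees))
    for sub in subtrees:
        yield from productions(sub)


def rule_counts(treebank):
    """Count how often each phrasal rule occurs across all trees."""
    counts = Counter()
    for tree in treebank:
        counts.update(productions(tree))
    return counts


def relative_frequencies(counts):
    """Divide each rule count by the total count of rules with the same parent label."""
    parent_totals = Counter()
    for (parent, _), n in counts.items():
        parent_totals[parent] += n
    return {rule: n / parent_totals[rule[0]] for rule, n in counts.items()}


# Toy treebank (an assumption for this example): "John loves Mary", "Mary sleeps".
toy_treebank = [
    ("S", [("NP", [("NNP", ["John"])]),
           ("VP", [("VBZ", ["loves"]), ("NP", [("NNP", ["Mary"])])])]),
    ("S", [("NP", [("NNP", ["Mary"])]),
           ("VP", [("VBZ", ["sleeps"])])]),
]

counts = rule_counts(toy_treebank)
frequencies = relative_frequencies(counts)
for (parent, children), n in counts.most_common():
    print(parent, "->", " ".join(children), n, frequencies[(parent, children)])
```

Relative frequencies of this kind are the usual starting point for estimating rule probabilities; a rule that never appears in the corpus simply receives no count, which is why hand correction is needed to reveal rules the parser lacks.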
Another use of treebanks in theoretical linguistics and psycholinguistics is interaction evidence. A completed treebank can help linguists carry out experiments as to how the decision to use one grammatical construction tends to influence the decision to form others, and to try to understand how speakers and writers make decisions as they form sentences. Interaction research is particularly fruitful as further layers of annotation, e.g. semantic, pragmatic, are added to a corpus. It is then possible to evaluate the impact of non-syntactic phenomena on grammatical choices.
In syntactic research, annotated treebank data has been used to test linguistic theories of sentence structure against large quantities of naturally occurring examples.
Semantic treebanks
A semantic treebank is a collection of natural language sentences annotated with a meaning representation. These resources use a formal representation of each sentence's semantic structure. Semantic treebanks vary in the depth of their semantic representation. A notable example of deep semantic annotation is the Groningen Meaning Bank, developed at the University of Groningen and annotated using Discourse Representation Theory. An example of a shallow semantic treebank is PropBank, which provides annotation of verbal propositions and their arguments, without attempting to represent every word in the corpus in logical form.
| Language | Treebank | Semantic Formalism | Distribution / License |
| Chinese | | PropBank semantics | |
| English | Abstract Meaning Representation Bank | Deep semantics | |
| English | FrameNet | Shallow semantics | |
| English | Universal Conceptual Cognitive Annotation | Deep semantics | |
| English | | Deep semantics | |
| English | | Deep semantics | |
| English | | Deep semantics | |
| Dutch | | Deep semantics | |
| German | | Deep semantics | |
| Italian | | Deep semantics | |
| English | | Deep semantics | |
| English | | Deep semantics | |
| English | | Deep semantics | |
| English | | Deep semantics | |
| English | | PropBank semantics | |
| Finnish | | PropBank semantics | |
| Finnish | | PropBank semantics | |
| French | | PropBank semantics | |
| German | | PropBank semantics | |
| Italian | | PropBank semantics | |
| Portuguese | | PropBank semantics | |
| Portuguese | | PropBank semantics | |
| Spanish | | PropBank semantics | |
| Turkish | | PropBank semantics | |
Syntactic treebanks
Many syntactic treebanks have been developed for a wide variety of languages:
| Language | Treebank | Syntactic Formalism | Distribution / License |
| Abaza | ATB | Dependency | |
| Afrikaans | AfriBooms | Dependency | |
| Akkadian | PISANDUB | Dependency | |
| Albanian | TSA | Dependency | |
| Amharic | ATT | Dependency | |
| Ancient Greek | Perseus | Dependency | |
| Ancient Greek | PROIEL | Dependency | |
| Greek | | Dependency | |
| Greek | | Dependency | |
| Arabic | | Dependency | |
| Arabic | | Dependency | |
| Arabic | NYUAD | Dependency | |
| Arabic | PADT | Dependency | |
| Arabic | PUD | Dependency | |
| Arabic | | Phrase structure | |
| Armenian | ArmTDP | Dependency | |
| Assyrian | AS | Dependency | |
| Bambara | CRB | Dependency | |
| Basque | BDT | Dependency | |
| Belarusian | HSE | Dependency | |
| Bhojpuri | BhEn | Dependency | |
| Bhojpuri | BHTB | Dependency | |
| Breton | KEB | Dependency | |
| Bulgarian | BTB | Dependency | |
| Bulgarian | | HPSG | |
| Buryat | BDT | Dependency | |
| Cantonese | HK | Dependency | |
| Catalan | | Phrase structure | |
| Catalan | AnCora | Dependency | |
| Chinese | | Case grammar | |
| Chinese | CFL | Dependency | |
| Chinese | GSD | Dependency | |
| Chinese | GSDSimp | Dependency | |
| Chinese | HK | Dependency | |
| Chinese | PUD | Dependency | |
| Chinese | | Phrase structure | |
| Chinese | | Dependency | |
| Arabic | | Dependency | |
| Classical Armenian | | Dependency | |
| Coptic | Coptic Scriptorium | Dependency | |
| Croatian | | Dependency | |
| Croatian | SET | Dependency | |
| Czech | | Dependency | |
| Czech | CAC | Dependency | |
| Czech | CLTT | Dependency | |
| Czech | FicTree | Dependency | |
| Czech | PDT | Dependency | |
| Czech | PUD | Dependency | |
| Danish | | Dependency | |
| Danish | | Phrase structure | |
| Danish | DDT | Dependency | |
| Danish | DTB | Dependency | |
| Dutch | | Phrase structure | |
| Dutch | Alpino | Dependency | |
| Dutch | LassySmall | Dependency | |
| Dutch | | Dependency | |
| Dutch | | Dependency | |
| Egyptian | Pre-Coptic | Dependency | |
| English | | Combinatory categorial grammar | |
| English | | HPSG | |
| English | | Phrase structure | |
| English | | Dependency | |
| English | BhEn | Dependency | |
| English | ESL | Dependency | |
| English | EWT | Dependency | |
| English | GUM | Dependency | |
| English | GUMReddit | Dependency | |
| English | LinES | Dependency | |
| English | ParTUT | Dependency | |
| English | Pronouns | Dependency | |
| English | PUD | Dependency | |
| English | | Phrase structure | |
| English | | Phrase structure | |
| English | | Phrase structure | |
| English | | Phrase structure | |
| English | | Phrase structure | |
| English | | HPSG | |
| English | | Phrase structure | |
| English | | Phrase structure | |
| English | | Dependency | |
| English | | Dependency | |
| English | | Phrase structure | |
| English | | Phrase structure | |
| English | | Dependency | |
| English | | Phrase structure | |
| Erzya | JR | Dependency | |
| Estonian | | Phrase structure | |
| Estonian | | Dependency | |
| Estonian | EDT | Dependency | |
| Estonian | EWT | Dependency | |
| Faroese | FarPaHC | Dependency | |
| Faroese | OFT | Dependency | |
| Finnish | | Dependency | |
| Finnish | FTB | Dependency | |
| Finnish | PUD | Dependency | |
| Finnish | TDT | Dependency | |
| French | | Dependency and macrosyntactic annotation | |
| French | | Phrase structure | |
| French | CrapBank | Dependency | |
| French | FQB | Dependency | |
| French | FTB | Dependency | |
| French | GSD | Dependency | |
| French | ParTUT | Dependency | |
| French | PUD | Dependency | |
| French | Sequoia | Dependency | |
| French | Spoken | Dependency | |
| French | | Phrase structure | |
| French | | Phrase structure | |
| French | | Phrase structure and dependency | |
| Galician | CTG | Dependency | |
| Galician | TreeGal | Dependency | |
| German | | Dependency | |
| German | GSD | Dependency | |
| German | LIT | Dependency | |
| German | PUD | Dependency | |
| German | | Phrase structure | |
| German | | Phrase structure | |
| German | | Phrase structure | |
| German | | Phrase structure | |
| German | | Phrase structure | |
| German | | Phrase structure | |
| Gothic | | Dependency | |
| Gothic | PROIEL | Dependency | |
| Greek | | Dependency | |
| Greek | GDT | Dependency | |
| Hebrew | HTB | Dependency | |
| Hebrew | | Dependency | |
| Hindi English | HIENCS | Dependency | |
| Hindi | HDTB | Dependency | |
| Hindi | PUD | Dependency | |
| Hindi | | Dependency | |
| English | | Phrase structure | |
| English | | Phrase structure | |
| French | | Phrase structure | |
| Portuguese | | Phrase structure | |
| Hungarian | Szeged | Dependency | |
| Hungarian | | Phrase structure | |
| Icelandic | | Phrase structure | |
| Icelandic | IcePaHC | Dependency | |
| Icelandic | PUD | Dependency | |
| Indonesian | GSD | Dependency | |
| Indonesian | PUD | Dependency | |
| Indonesian | | Phrase structure | |
| Irish | IDT | Dependency | |
| Italian | | Phrase structure and dependency | |
| Italian | | Dependency | |
| Italian | | Phrase structure and dependency | |
| Italian | ISDT | Dependency | |
| Italian | ParTUT | Dependency | |
| Italian | PoSTWITA | Dependency | |
| Italian | PUD | Dependency | |
| Italian | TWITTIRO | Dependency | |
| Italian | VIT | Dependency | |
| Italian | | Dependency | |
| Italian | | | |
| Italian | | Dependency | |
| Italian | | Dependency | |
| Japanese | | | |
| Japanese | BCCWJ | Dependency | |
| Japanese | GSD | Dependency | |
| Japanese | KTC | Dependency | |
| Japanese | Modern | Dependency | |
| Japanese | PUD | Dependency | |
| Japanese | | Phrase structure | |
| Japanese | | Phrase structure | |
| Japanese | | Dependency | |
| Karelian | KKPP | Dependency | |
| Kazakh | KTB | Dependency | |
| Komi Permyak | UH | Dependency | |
| Komi Zyrian | IKDP | Dependency | |
| Komi Zyrian | Lattice | Dependency | |
| Korean | GSD | Dependency | |
| Korean | Kaist | Dependency | |
| Korean | Penn | Dependency | |
| Korean | PUD | Dependency | |
| Korean | Sejong | Dependency | |
| Korean | | Phrase structure | |
| Kurmanji | MG | Dependency | |
| Latin | ITTB | Dependency | |
| Latin | LLCT | Dependency | |
| Latin | Perseus | Dependency | |
| Latin | PROIEL | Dependency | |
| Latin | | Dependency | |
| Latin | | Dependency | |
| Latin | | Dependency | |
| Latvian | LVTB | Dependency | |
| Lithuanian | ALKSNIS | Dependency | |
| Lithuanian | HSE | Dependency | |
| Livvi | KKPP | Dependency | |
| Magahi | MGTB | Dependency | |
| Maltese | MUDT | Dependency | |
| Marathi | UFAL | Dependency | |
| Mbya Guarani | Dooley | Dependency | |
| Mbya Guarani | Thomas | Dependency | |
| Middle Irish | CritMITB | Dependency | |
| Middle Irish | DipMITB | Dependency | |
| Moksha | JR | Dependency | |
| Naija | NSC | Dependency | |
| North Sami | Giella | Dependency | |
| Norwegian | | LFG | |
| Norwegian | Bokmaal | Dependency | |
| Norwegian | Nynorsk | Dependency | |
| Norwegian | NynorskLIA | Dependency | |
| Old Church Slavonic | PROIEL | Dependency | |
| Old Church Slavonic | | Dependency | |
| Old French | SRCMF | Dependency | |
| Old Russian | RNC | Dependency | |
| Old Russian | TOROT | Dependency | |
| Old Russian | | Dependency | |
| Persian | | Dependency | |
| Persian | | HPSG | |
| Persian | Seraji | Dependency | |
| Polish | | HPSG | |
| Polish | LFG | Dependency | |
| Polish | PDB | Dependency | |
| Polish | PUD | Dependency | |
| Polish | | Phrase structure and dependency | |
| Portuguese | Bosque | Dependency | |
| Portuguese | GSD | Dependency | |
| Portuguese | PUD | Dependency | |
| Portuguese | | Phrase structure and dependency | |
| Romanian | | Dependency | |
| Romanian | Nonstandard | Dependency | |
| Romanian | RRT | Dependency | |
| Romanian | SiMoNERo | Dependency | |
| Russian | GSD | Dependency | |
| Russian | PUD | Dependency | |
| Russian | SynTagRus | Dependency | |
| Russian | Taiga | Dependency | |
| Russian | SynTagRus Dependency Treebank | Dependency | |
| Sanskrit | UFAL | Dependency | |
| Sanskrit | Vedic | Dependency | |
| Scottish Gaelic | ARCOSG | Dependency | |
| Serbian | SET | Dependency | |
| Sindhi | MazharDootio | Dependency | |
| Skolt Sami | Giellagas | Dependency | |
| Slovak | SNK | Dependency | |
| Slovene | | Dependency | |
| Slovenian | SSJ | Dependency | |
| Slovenian | SST | Dependency | |
| Spanish | | Phrase structure and dependency | |
| Spanish | AnCora | Dependency | |
| Spanish | GSD | Dependency | |
| Spanish | PUD | Dependency | |
| Spanish | | Phrase structure | |
| Swedish | | Phrase structure and dependency | |
| Swedish | | Phrase structure | |
| Swedish | LinES | Dependency | |
| Swedish | PUD | Dependency | |
| Swedish | Talbanken | Dependency | |
| Swedish | | Phrase structure | |
| Swedish Sign Language | SSLC | Dependency | |
| Swiss German | UZH | Dependency | |
| Tagalog | TRG | Dependency | |
| Tagalog | Ugnayan | Dependency | |
| Tamil | TTB | Dependency | |
| Telugu | MTG | Dependency | |
| Thai | | Dependency | |
| Thai | PUD | Dependency | |
| Thai | | Phrase structure | |
| Turkish | | Dependency | |
| Turkish | BOUN | Dependency | |
| Turkish | GB | Dependency | |
| Turkish | IMST | Dependency | |
| Turkish | PUD | Dependency | |
| Ukrainian | | Dependency | |
| Ukrainian | IU | Dependency | |
| Upper Sorbian | UFAL | Dependency | |
| Urdu | | Phrase structure | |
| Urdu | | Phrase and Hyper Dependency Structure | |
| Urdu | UDTB | Dependency | |
| Uyghur | UDT | Dependency | |
| Vietnamese | VTB | Dependency | |
| Vietnamese | | Phrase structure | |
| Vietnamese | | Dependency | |
| Warlpiri | UFAL | Dependency | |
| Welsh | CCG | Dependency | |
| Wolof | WTB | Dependency | |
| Yoruba | YTB | Dependency | |
To facilitate research on multilingual tasks, some researchers have proposed universal annotation schemes that apply across languages, with the aim of combining the strengths of different treebanks. Examples include a universal annotation approach for dependency treebanks and a universal annotation approach for phrase structure treebanks.