Treebank

In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data.

Etymology

The term treebank was coined by the linguist Geoffrey Leech in the 1980s, by analogy with other repositories such as a seed bank or blood bank. The "tree" in the name reflects the fact that both syntactic and semantic structure are commonly represented compositionally as trees. The term parsed corpus is often used interchangeably with treebank, with the emphasis on the primacy of sentences rather than trees.

Construction

Treebanks are often created on top of a corpus that has already been annotated with part-of-speech tags. In turn, treebanks are sometimes enhanced with semantic or other linguistic information. Treebanks can be created completely manually, where linguists annotate each sentence with syntactic structure, or semi-automatically, where a parser assigns some syntactic structure which linguists then check and, if necessary, correct. In practice, fully checking and completing the parsing of natural language corpora is a labour-intensive project that can take teams of graduate linguists several years. The level of annotation detail and the breadth of the linguistic sample determine the difficulty of the task and the length of time required to build a treebank.
Some treebanks follow a specific linguistic theory in their syntactic annotation, but most try to be less theory-specific. Even so, two main groups can be distinguished: treebanks that annotate phrase structure and those that annotate dependency structure.
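The two groups can be contrasted on a single short sentence. The sketch below is illustrative only: the category labels (S, NP, VP, V) and the relation names (nsubj, obj, root) are assumptions chosen for the example and are not taken from any particular treebank.

```python
# Phrase structure: nested (label, child, ...) tuples; leaves are plain strings.
phrase_structure = (
    "S",
    ("NP", "John"),
    ("VP", ("V", "loves"), ("NP", "Mary")),
)

# Dependency structure: one (word, head, relation) triple per word.
# `head` is the 1-based position of the governing word; 0 marks the root.
dependencies = [
    ("John", 2, "nsubj"),
    ("loves", 0, "root"),
    ("Mary", 2, "obj"),
]


def leaves(node):
    """Return the words of a phrase-structure tree in left-to-right order."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def root_word(deps):
    """Return the word whose head is 0, i.e. the root of the dependency tree."""
    return next(word for word, head, _ in deps if head == 0)


print(leaves(phrase_structure))
print(root_word(dependencies))
```

The phrase-structure version groups words into nested constituents, while the dependency version records only word-to-word links, so both describe the same three words in different ways.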
It is important to distinguish the formal representation from the file format used to store the annotated data. Treebanks are necessarily constructed according to a particular grammar, and the same grammar may be implemented by different file formats. For example, the syntactic analysis of John loves Mary, shown in the accompanying figure, may be represented by simple labelled brackets in a text file, like this:

(S (NP (NNP John))
   (VP (VBZ loves)
       (NP (NNP Mary)))
   (. .))
This type of representation is popular because it is light on resources, and the tree structure is relatively easy to read without software tools. However, as corpora become increasingly complex, other file formats may be preferred. Alternatives include treebank-specific XML schemes, numbered indentation and various types of standoff notation.
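A labelled-bracket string of this kind can be read into a nested data structure with a few lines of code. The following is a minimal sketch, assuming well-formed, balanced input; the function names and the Penn-style labels in the sample string are assumptions made for illustration.

```python
import re


def parse_brackets(text):
    """Parse a labelled-bracket string into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1
        node = [tokens[pos]]  # the first token after '(' is the node label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse_node())  # nested constituent
            else:
                node.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # consume the closing ')'
        return node

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected text after the closing bracket")
    return tree


def words(node):
    """Collect the words (leaf strings) of a parsed tree, left to right."""
    out = []
    for child in node[1:]:
        if isinstance(child, list):
            out.extend(words(child))
        else:
            out.append(child)
    return out


sample = "(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))) (. .))"
tree = parse_brackets(sample)
print(tree)
print(words(tree))
```

Because whitespace is ignored by the tokenizer, the same function reads the string whether it is written on one line or indented over several.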

Applications

From a computational linguistics perspective, treebanks have been used to engineer state-of-the-art natural language processing systems such as part-of-speech taggers, parsers, semantic analyzers and machine translation systems. Most such systems rely on gold-standard treebank data, that is, data checked by human annotators. However, an automatically parsed corpus that has not been corrected by human linguists can still be useful: applying a parser to large amounts of text and gathering rule frequencies provides evidence of how often each rule is used, which can in turn be used to improve the parser. Only by correcting and completing a corpus by hand, however, is it possible to identify rules that are absent from the parser's knowledge base. Frequencies drawn from a hand-corrected corpus are also likely to be more accurate.
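Gathering rule frequencies from parsed trees can be sketched as follows. This is a toy illustration under stated assumptions: trees are stored as nested tuples, the two sample trees and their labels are invented for the example, and rules that rewrite a tag as a word are counted alongside phrasal rules.

```python
from collections import Counter


def count_rules(node, counts):
    """Add one count for every rewrite rule (parent -> children) under node."""
    label, children = node[0], node[1:]
    # The right-hand side lists child labels, or the word itself for a leaf.
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if not isinstance(child, str):
            count_rules(child, counts)
    return counts


def relative_frequencies(counts):
    """Divide each rule count by the total count of rules with the same parent."""
    totals = Counter()
    for (lhs, _), n in counts.items():
        totals[lhs] += n
    return {rule: n / totals[rule[0]] for rule, n in counts.items()}


# A hypothetical two-sentence treebank.
treebank = [
    ("S", ("NP", ("NNP", "John")), ("VP", ("VBZ", "loves"), ("NP", ("NNP", "Mary")))),
    ("S", ("NP", ("NNP", "Mary")), ("VP", ("VBZ", "sleeps"))),
]

counts = Counter()
for tree in treebank:
    count_rules(tree, counts)
probs = relative_frequencies(counts)

for (lhs, rhs), n in sorted(counts.items()):
    print(f"{lhs} -> {' '.join(rhs)}: count={n}, relative frequency={probs[(lhs, rhs)]:.2f}")
```

A rule that never occurs in the parsed trees receives no count at all, which is why rules missing from the parser can only be found by inspecting and correcting the corpus by hand.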
In corpus linguistics, treebanks are used to study syntactic phenomena. Once parsed, a corpus will contain frequency evidence showing how common different grammatical structures are in use. Treebanks also provide evidence of coverage and support the discovery of new, unanticipated, grammatical phenomena.
Another use of treebanks in theoretical linguistics and psycholinguistics is interaction evidence. A completed treebank can help linguists carry out experiments as to how the decision to use one grammatical construction tends to influence the decision to form others, and to try to understand how speakers and writers make decisions as they form sentences. Interaction research is particularly fruitful as further layers of annotation, e.g. semantic, pragmatic, are added to a corpus. It is then possible to evaluate the impact of non-syntactic phenomena on grammatical choices.
In syntactic research, annotated treebank data has been used to test linguistic theories of sentence structure against large quantities of naturally occurring examples.

Semantic treebanks

A semantic treebank is a collection of natural language sentences annotated with a meaning representation. These resources use a formal representation of each sentence's semantic structure. Semantic treebanks vary in the depth of their semantic representation. A notable example of deep semantic annotation is the Groningen Meaning Bank, developed at the University of Groningen and annotated using Discourse Representation Theory. An example of a shallow semantic treebank is PropBank, which provides annotation of verbal propositions and their arguments, without attempting to represent every word in the corpus in logical form.

Language	Treebank	Semantic Formalism	Distribution / License
Chinese		PropBank semantics
English	Abstract Meaning Representation Bank	Deep semantics
English	FrameNet	Shallow semantics
English	Universal Conceptual Cognitive Annotation	Deep semantics
English		Deep semantics
English		Deep semantics
English		Deep semantics
Dutch		Deep semantics
German		Deep semantics
Italian		Deep semantics
English		Deep semantics
English		Deep semantics
English		Deep semantics
English		Deep semantics
English		PropBank semantics
Finnish		PropBank semantics
Finnish		PropBank semantics
French		PropBank semantics
German		PropBank semantics
Italian		PropBank semantics
Portuguese		PropBank semantics
Portuguese		PropBank semantics
Spanish		PropBank semantics
Turkish		PropBank semantics

Syntactic treebanks

Many syntactic treebanks have been developed for a wide variety of languages:

Language	Treebank	Syntactic Formalism	Distribution / License
Abaza	, ATB	Dependency
Afrikaans	, AfriBooms	Dependency
Akkadian	, PISANDUB	Dependency
Albanian	, TSA	Dependency
Amharic	, ATT	Dependency
Ancient Greek	, Perseus	Dependency
Ancient Greek	, PROIEL	Dependency
Greek (ancient)		Dependency
Greek (ancient)		Dependency
Arabic		Dependency
Arabic		Dependency
Arabic	, NYUAD	Dependency
Arabic	, PADT	Dependency
Arabic	, PUD	Dependency
Arabic		Phrase structure
Armenian	, ArmTDP	Dependency
Assyrian (Neo-Aramaic)	, AS	Dependency
Bambara	, CRB	Dependency
Basque	, BDT	Dependency
Belarusian	, HSE	Dependency
Bhojpuri	, BhEn	Dependency
Bhojpuri	, BHTB	Dependency
Breton	, KEB	Dependency
Bulgarian	, BTB	Dependency
Bulgarian		HPSG
Buryat	, BDT	Dependency
Cantonese	, HK	Dependency
Catalan		Phrase structure
Catalan	, AnCora	Dependency
Chinese		Case grammar
Chinese	, CFL	Dependency
Chinese	, GSD	Dependency
Chinese	, GSDSimp	Dependency
Chinese	, HK	Dependency
Chinese	, PUD	Dependency
Chinese		Phrase structure
Chinese		Dependency
Arabic (classical)		Dependency
Classical Armenian		Dependency
Coptic	, Coptic Scriptorium	Dependency
Croatian		Dependency
Croatian	, SET	Dependency
Czech		Dependency
Czech	, CAC	Dependency
Czech	, CLTT	Dependency
Czech	, FicTree	Dependency
Czech	, PDT	Dependency
Czech	, PUD	Dependency
Danish		Dependency
Danish		Phrase structure
Danish	, DDT	Dependency
Danish	, DTB	Dependency
Dutch		Phrase structure
Dutch	, Alpino	Dependency
Dutch	, LassySmall	Dependency
Dutch		Dependency
Dutch		Dependency
Egyptian	, Pre-Coptic	Dependency
English		Combinatory categorial grammar
English		HPSG
English		Phrase structure
English		Dependency
English	, BhEn	Dependency
English	, ESL	Dependency
English	, EWT	Dependency
English	, GUM	Dependency
English	, GUMReddit	Dependency
English	, LinES	Dependency
English	, ParTUT	Dependency
English	, Pronouns	Dependency
English	, PUD	Dependency
English		Phrase structure
English		Phrase structure
English		Phrase structure
English		Phrase structure
English		Phrase structure
English		HPSG
English		Phrase structure
English		Phrase structure
English		Dependency
English		Dependency
English		Phrase structure
English		Phrase structure
English		Dependency
English		Phrase structure
Erzya	, JR	Dependency
Estonian		Phrase structure
Estonian		Dependency
Estonian	, EDT	Dependency
Estonian	, EWT	Dependency
Faroese	, FarPaHC	Dependency
Faroese	, OFT	Dependency
Finnish		Dependency
Finnish	, FTB	Dependency
Finnish	, PUD	Dependency
Finnish	, TDT	Dependency
French (spoken)		Dependency and macrosyntactic annotation
French		Phrase structure
French	, CrapBank	Dependency
French	, FQB	Dependency
French	, FTB	Dependency
French	, GSD	Dependency
French	, ParTUT	Dependency
French	, PUD	Dependency
French	, Sequoia	Dependency
French	, Spoken	Dependency
French		Phrase structure
French		Phrase structure
French		Phrase structure & Dependency
Galician	, CTG	Dependency
Galician	, TreeGal	Dependency
German		Dependency
German	, GSD	Dependency
German	, LIT	Dependency
German	, PUD	Dependency
German		Phrase structure
German		Phrase structure
German		Phrase structure
German		Phrase structure
German		Phrase structure
German		Phrase structure
Gothic		Dependency
Gothic	, PROIEL	Dependency
Greek		Dependency
Greek	, GDT	Dependency
Hebrew	, HTB	Dependency
Hebrew		Dependency
Hindi English	, HIENCS	Dependency
Hindi	, HDTB	Dependency
Hindi	, PUD	Dependency
Hindi		Dependency
English (historical)	;	Phrase structure
English (historical)		Phrase structure
French (historical)		Phrase structure
Portuguese (historical)		Phrase structure
Hungarian	, Szeged	Dependency
Hungarian		Phrase structure
Icelandic		Phrase structure
Icelandic	, IcePaHC	Dependency
Icelandic	, PUD	Dependency
Indonesian	, GSD	Dependency
Indonesian	, PUD	Dependency
Indonesian		Phrase structure
Irish	, IDT	Dependency
Italian		Phrase structure and dependency
Italian		Dependency
Italian		Phrase structure and dependency
Italian	, ISDT	Dependency
Italian	, ParTUT	Dependency
Italian	, PoSTWITA	Dependency
Italian	, PUD	Dependency
Italian	, TWITTIRO	Dependency
Italian	, VIT	Dependency
Italian		Dependency
Italian
Italian		Dependency
Italian		Dependency
Japanese
Japanese	, BCCWJ	Dependency
Japanese	, GSD	Dependency
Japanese	, KTC	Dependency
Japanese	, Modern	Dependency
Japanese	, PUD	Dependency
Japanese		Phrase structure
Japanese		Phrase structure
Japanese		Dependency
Karelian	, KKPP	Dependency
Kazakh	, KTB	Dependency
Komi Permyak	, UH	Dependency
Komi Zyrian	, IKDP	Dependency
Komi Zyrian	, Lattice	Dependency
Korean	, GSD	Dependency
Korean	, Kaist	Dependency
Korean	, Penn	Dependency
Korean	, PUD	Dependency
Korean	, Sejong	Dependency
Korean		Phrase structure
Kurmanji	, MG	Dependency
Latin	, ITTB	Dependency
Latin	, LLCT	Dependency
Latin	, Perseus	Dependency
Latin	, PROIEL	Dependency
Latin		Dependency
Latin		Dependency
Latin		Dependency
Latvian	, LVTB	Dependency
Lithuanian	, ALKSNIS	Dependency
Lithuanian	, HSE	Dependency
Livvi	, KKPP	Dependency
Magahi	, MGTB	Dependency
Maltese	, MUDT	Dependency
Marathi	, UFAL	Dependency
Mbya Guarani	, Dooley	Dependency
Mbya Guarani	, Thomas	Dependency
Middle Irish	, CritMITB	Dependency
Middle Irish	, DipMITB	Dependency
Moksha	, JR	Dependency
Naija	, NSC	Dependency
North Sami	, Giella	Dependency
Norwegian		LFG
Norwegian	, Bokmaal	Dependency
Norwegian	, Nynorsk	Dependency
Norwegian	, NynorskLIA	Dependency
Old Church Slavonic	, PROIEL	Dependency
Old Church Slavonic		Dependency
Old French	, SRCMF	Dependency
Old Russian	, RNC	Dependency
Old Russian	, TOROT	Dependency
Old Russian		Dependency
Persian		Dependency
Persian		HPSG
Persian	, Seraji	Dependency
Polish		HPSG
Polish	, LFG	Dependency
Polish	, PDB	Dependency
Polish	, PUD	Dependency
Polish		Phrase structure and Dependency
Portuguese	, Bosque	Dependency
Portuguese	, GSD	Dependency
Portuguese	, PUD	Dependency
Portuguese		Dependency, Phrase structure
Romanian		Dependency
Romanian	, Nonstandard	Dependency
Romanian	, RRT	Dependency
Romanian	, SiMoNERo	Dependency
Russian	, GSD	Dependency
Russian	, PUD	Dependency
Russian	, SynTagRus	Dependency
Russian	, Taiga	Dependency
Russian	SynTagRus Dependency Treebank	Dependency
Sanskrit	, UFAL	Dependency
Sanskrit	, Vedic	Dependency
Scottish Gaelic	, ARCOSG	Dependency
Serbian	, SET	Dependency
Sindhi	, MazharDootio	Dependency
Skolt Sami	, Giellagas	Dependency
Slovak	, SNK	Dependency
Slovene		Dependency
Slovenian	, SSJ	Dependency
Slovenian	, SST	Dependency
Spanish		Phrase structure and dependency
Spanish	, AnCora	Dependency
Spanish	, GSD	Dependency
Spanish	, PUD	Dependency
Spanish		Phrase structure
Swedish		Phrase structure and dependency
Swedish		Phrase structure
Swedish	, LinES	Dependency
Swedish	, PUD	Dependency
Swedish	, Talbanken	Dependency
Swedish		Phrase structure
Swedish Sign Language	, SSLC	Dependency
Swiss German	, UZH	Dependency
Tagalog	, TRG	Dependency
Tagalog	, Ugnayan	Dependency
Tamil	, TTB	Dependency
Telugu	, MTG	Dependency
Thai		Dependency
Thai	, PUD	Dependency
Thai		Phrase structure
Turkish		Dependency
Turkish	, BOUN	Dependency
Turkish	, GB	Dependency
Turkish	, IMST	Dependency
Turkish	, PUD	Dependency
Ukrainian		Dependency
Ukrainian	, IU	Dependency
Upper Sorbian	, UFAL	Dependency
Urdu		Phrase structure
Urdu		Phrase and Hyper Dependency Structure
Urdu	, UDTB	Dependency
Uyghur	, UDT	Dependency
Vietnamese	, VTB	Dependency
Vietnamese		Phrase structure
Vietnamese		Dependency
Warlpiri	, UFAL	Dependency
Welsh	, CCG	Dependency
Wolof	, WTB	Dependency
Yoruba	, YTB	Dependency

To facilitate research on multilingual tasks, some researchers have proposed universal annotation schemes that apply across languages, so that the strengths of different treebank corpora can be used together or merged. Examples include the universal annotation approach for dependency treebanks and the universal annotation approach for phrase structure treebanks.
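One widely used cross-linguistic scheme for dependency treebanks is Universal Dependencies, whose treebanks are distributed in the CoNLL-U format: a plain-text format with one token per line and ten tab-separated columns. The sketch below reads five of those columns (ID, FORM, UPOS, HEAD, DEPREL). The function name is hypothetical, and the sample sentence and its analysis were constructed for illustration rather than taken from a released treebank.

```python
def read_conllu(text):
    """Yield each sentence as a list of token dicts with a subset of the columns."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        sentence.append({
            "id": int(cols[0]),
            "form": cols[1],
            "upos": cols[3],
            "head": int(cols[6]) if cols[6].isdigit() else None,
            "deprel": cols[7],
        })
    if sentence:
        yield sentence


# Sample built with explicit tab joins so the column separators are unambiguous.
ROWS = [
    ["1", "John", "John", "PROPN", "_", "_", "2", "nsubj", "_", "_"],
    ["2", "loves", "love", "VERB", "_", "_", "0", "root", "_", "_"],
    ["3", "Mary", "Mary", "PROPN", "_", "_", "2", "obj", "_", "_"],
]
SAMPLE = "# text = John loves Mary\n" + "\n".join("\t".join(r) for r in ROWS) + "\n"

sentences = list(read_conllu(SAMPLE))
for token in sentences[0]:
    print(token["id"], token["form"], token["upos"], token["head"], token["deprel"])
```

Because every treebank in such a scheme shares the same columns and label inventory, the same reader can be applied to data in different languages without modification.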

Search tools

One of the key ways to extract evidence from a treebank is through search tools. Search tools for parsed corpora typically depend on the annotation scheme that was applied to the corpus. User interfaces range in sophistication from expression-based query systems aimed at computer programmers to full exploration environments aimed at general linguists. Wallis discusses the principles of searching treebanks in detail and reviews the state of the art at the time of writing.
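At its simplest, a structural query asks for every node with a given label that has an immediate child with another given label. The following is a minimal sketch of such a search, not a model of any existing query tool; the tuple-based tree encoding, the function names and the two sample trees are assumptions made for the example.

```python
def subtrees(node):
    """Yield node and every subtree below it, depth-first."""
    yield node
    for child in node[1:]:
        if not isinstance(child, str):
            yield from subtrees(child)


def search(treebank, parent_label, child_label):
    """Return all nodes labelled parent_label that immediately dominate child_label."""
    hits = []
    for tree in treebank:
        for node in subtrees(tree):
            if node[0] != parent_label:
                continue
            if any(not isinstance(c, str) and c[0] == child_label for c in node[1:]):
                hits.append(node)
    return hits


# A hypothetical two-sentence treebank.
treebank = [
    ("S", ("NP", ("NNP", "John")), ("VP", ("VBZ", "loves"), ("NP", ("NNP", "Mary")))),
    ("S", ("NP", ("NNP", "Mary")), ("VP", ("VBZ", "sleeps"))),
]

# Find verb phrases that contain a direct noun-phrase child.
hits = search(treebank, "VP", "NP")
for hit in hits:
    print(hit)
```

The query depends directly on the labels used in the annotation scheme, which illustrates why search tools are usually tied to the scheme applied to the corpus.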