Treebank


In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data.

Etymology

The term treebank was coined by linguist Geoffrey Leech in the 1980s, by analogy to other repositories such as a seedbank or bloodbank. The "tree" in the name reflects the fact that both syntactic and semantic structure are commonly represented compositionally as a tree structure. The term parsed corpus is often used interchangeably with the term treebank, with the emphasis on the primacy of sentences rather than trees.

Construction

Treebanks are often created on top of a corpus that has already been annotated with part-of-speech tags. In turn, treebanks are sometimes enhanced with semantic or other linguistic information. Treebanks can be created completely manually, where linguists annotate each sentence with syntactic structure, or semi-automatically, where a parser assigns some syntactic structure which linguists then check and, if necessary, correct. In practice, fully checking and completing the parsing of natural language corpora is a labour-intensive project that can take teams of graduate linguists several years. The level of annotation detail and the breadth of the linguistic sample determine the difficulty of the task and the length of time required to build a treebank.
Figure: Example phrase structure tree for John loves Mary
Figure: Hybrid constituency/dependency tree from the Quranic Arabic Corpus
Some treebanks follow a specific linguistic theory in their syntactic annotation but most try to be less theory-specific. However, two main groups can be distinguished: treebanks that annotate phrase structure and those that annotate dependency structure.
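The difference between the two groups can be made concrete with a small sketch. The Python below encodes the sentence John loves Mary once as a phrase structure tree and once as a set of dependencies. The labels (S, NP, VP) and the relation names (subject, object) are illustrative assumptions and are not taken from any particular treebank's annotation scheme.

```python
from dataclasses import dataclass

# Phrase structure: nested tuples of the form (label, child, child, ...).
# Plain strings are words. The labels are illustrative only.
phrase_structure = ("S", ("NP", "John"), ("VP", "loves", ("NP", "Mary")))


@dataclass(frozen=True)
class Dependency:
    dependent: str  # the word being attached
    head: str       # the word it depends on; "ROOT" marks the sentence head
    relation: str   # illustrative relation name


# Dependency structure: one head-dependent link per word, no phrase nodes.
dependencies = [
    Dependency("loves", "ROOT", "root"),
    Dependency("John", "loves", "subject"),
    Dependency("Mary", "loves", "object"),
]


def leaves(node):
    """Return the words of a phrase structure tree in left-to-right order."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def count_phrases(node):
    """Count the labelled (non-word) nodes in a phrase structure tree."""
    if isinstance(node, str):
        return 0
    return 1 + sum(count_phrases(child) for child in node[1:])
```

In this sketch the phrase structure tree introduces nodes that are not words, whereas the dependency analysis contains exactly one entry per word.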
It is important to distinguish the formal representation from the file format used to store the annotated data. Treebanks are necessarily constructed according to a particular grammar, and the same grammar may be implemented by different file formats. For example, the syntactic analysis for John loves Mary, shown in the figure, may be represented by simple labelled brackets in a text file, like this:

(S (NP (NNP John))
   (VP (VBZ loves)
       (NP (NNP Mary))))
This type of representation is popular because it is light on resources, and the tree structure is relatively easy to read without software tools. However, as corpora become increasingly complex, other file formats may be preferred. Alternatives include treebank-specific XML schemes, numbered indentation and various types of standoff notation.
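Part of the appeal of labelled brackets is that they can be read with very little code. The following is a minimal sketch of a reader for this notation, assuming that every bracket opens with a label and that words contain no whitespace or parentheses. The function names and the example string (including its tags) are illustrative and do not follow any specific treebank's conventions.

```python
import re


def parse_brackets(text):
    """Parse labelled brackets into nested lists of the form [label, child, ...]."""
    # Tokens are "(", ")" or any run of characters without spaces or brackets.
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0

    def peek():
        if pos >= len(tokens):
            raise ValueError("unexpected end of input: unbalanced brackets")
        return tokens[pos]

    def parse_node():
        nonlocal pos
        if peek() != "(":
            raise ValueError(f"expected '(' but found {peek()!r}")
        pos += 1
        label = peek()
        if label in ("(", ")"):
            raise ValueError("a bracket must start with a label")
        pos += 1
        children = []
        while peek() != ")":
            if peek() == "(":
                children.append(parse_node())
            else:
                children.append(peek())  # a word
                pos += 1
        pos += 1  # consume the closing ")"
        return [label, *children]

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected material after the closing bracket")
    return tree


def words(tree):
    """Return the words of a parsed tree in left-to-right order."""
    result = []
    for child in tree[1:]:
        result.extend(words(child) if isinstance(child, list) else [child])
    return result


example = "(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))))"
tree = parse_brackets(example)
```

Here `words(tree)` recovers the original sentence, and malformed input such as a missing closing bracket raises a `ValueError` rather than returning a partial tree.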

Applications

From a computational linguistics perspective, treebanks have been used to engineer state-of-the-art natural language processing systems such as part-of-speech taggers, parsers, semantic analyzers and machine translation systems. Most computational systems use gold-standard treebank data. However, an automatically parsed corpus that has not been corrected by human linguists can still be useful: applying a parser to large amounts of text and gathering rule frequencies provides evidence that can be used to improve the parser. Only by correcting and completing a corpus by hand, however, is it possible to identify rules that are absent from the parser's knowledge base. Frequencies drawn from a hand-corrected corpus are also likely to be more accurate.
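Gathering rule frequencies from a parsed corpus can be sketched as follows. The code assumes trees stored as nested tuples of the form (label, child, ...), with plain strings as words. The two-sentence treebank and its labels are made up for illustration, and the relative-frequency estimate shown is one simple choice rather than the method of any particular parser.

```python
from collections import Counter


def productions(tree):
    """Yield (parent_label, child_labels) for every labelled node in a tree."""
    label, *children = tree
    # A child contributes its label if it is a subtree, or itself if it is a word.
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    yield (label, rhs)
    for child in children:
        if isinstance(child, tuple):
            yield from productions(child)


def rule_frequencies(treebank):
    """Count how often each rule occurs across all trees."""
    counts = Counter()
    for tree in treebank:
        counts.update(productions(tree))
    return counts


def rule_probabilities(counts):
    """Relative frequency of each rule among rules with the same parent label."""
    totals = Counter()
    for (parent, _), n in counts.items():
        totals[parent] += n
    return {rule: n / totals[rule[0]] for rule, n in counts.items()}


# Hypothetical toy treebank; labels are illustrative.
toy_treebank = [
    ("S", ("NP", "John"), ("VP", ("V", "loves"), ("NP", "Mary"))),
    ("S", ("NP", "Mary"), ("VP", ("V", "sleeps"))),
]

counts = rule_frequencies(toy_treebank)
probabilities = rule_probabilities(counts)
```

A rule that never occurs in the corpus simply has no entry in `counts`, which is why rules missing from a parser's knowledge base can only be found when a human corrects the parser's output.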
In corpus linguistics, treebanks are used to study syntactic phenomena. Once parsed, a corpus will contain frequency evidence showing how common different grammatical structures are in use. Treebanks also provide evidence of coverage and support the discovery of new, unanticipated, grammatical phenomena.
Another use of treebanks in theoretical linguistics and psycholinguistics is interaction evidence. A completed treebank can help linguists carry out experiments as to how the decision to use one grammatical construction tends to influence the decision to form others, and to try to understand how speakers and writers make decisions as they form sentences. Interaction research is particularly fruitful as further layers of annotation, e.g. semantic, pragmatic, are added to a corpus. It is then possible to evaluate the impact of non-syntactic phenomena on grammatical choices.
In syntactic research, annotated treebank data has been used to test linguistic theories of sentence structure against large quantities of naturally occurring examples.

Semantic treebanks

A semantic treebank is a collection of natural language sentences annotated with a meaning representation. These resources use a formal representation of each sentence's semantic structure. Semantic treebanks vary in the depth of their semantic representation. A notable example of deep semantic annotation is the Groningen Meaning Bank, developed at the University of Groningen and annotated using Discourse Representation Theory. An example of a shallow semantic treebank is PropBank, which provides annotation of verbal propositions and their arguments, without attempting to represent every word in the corpus in logical form.
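What shallow annotation of a verbal proposition might look like as data can be sketched briefly. The class names, the token spans and the example sentence below are hypothetical. Only the numbered role labels (ARG0, ARG1) follow the PropBank style, and the sketch makes no claim about PropBank's actual file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Argument:
    role: str   # numbered role label in the PropBank style, e.g. "ARG0"
    start: int  # index of the first token in the argument span
    end: int    # index one past the last token in the span


@dataclass(frozen=True)
class Proposition:
    predicate_index: int  # position of the verb in the token list
    arguments: tuple      # Argument objects attached to this verb


tokens = ["The", "old", "man", "loves", "Mary"]
proposition = Proposition(
    predicate_index=3,
    arguments=(Argument("ARG0", 0, 3), Argument("ARG1", 4, 5)),
)


def describe(tokens, proposition):
    """Map the predicate and each role label to the text it covers."""
    result = {"predicate": tokens[proposition.predicate_index]}
    for arg in proposition.arguments:
        result[arg.role] = " ".join(tokens[arg.start:arg.end])
    return result
```

The argument "The old man" is recorded as a single labelled span with no further analysis of its internal words, which is what makes this kind of annotation shallow.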
Language | Treebank | Semantic Formalism | Distribution / License
Chinese | | PropBank semantics |
English | Abstract Meaning Representation Bank | Deep semantics |
English | FrameNet | Shallow semantics |
English | Universal Conceptual Cognitive Annotation | Deep semantics |
English | | Deep semantics |
English | | Deep semantics |
English | | Deep semantics |
Dutch | | Deep semantics |
German | | Deep semantics |
Italian | | Deep semantics |
English | | Deep semantics |
English | | Deep semantics |
English | | Deep semantics |
English | | Deep semantics |
English | | PropBank semantics |
Finnish | | PropBank semantics |
Finnish | | PropBank semantics |
French | | PropBank semantics |
German | | PropBank semantics |
Italian | | PropBank semantics |
Portuguese | | PropBank semantics |
Portuguese | | PropBank semantics |
Spanish | | PropBank semantics |
Turkish | | PropBank semantics |

Syntactic treebanks

Many syntactic treebanks have been developed for a wide variety of languages:
Language | Treebank | Syntactic Formalism | Distribution / License
Abaza | ATB | Dependency |
Afrikaans | AfriBooms | Dependency |
Akkadian | PISANDUB | Dependency |
Albanian | TSA | Dependency |
Amharic | ATT | Dependency |
Ancient Greek | Perseus | Dependency |
Ancient Greek | PROIEL | Dependency |
Greek | | Dependency |
Greek | | Dependency |
Arabic | | Dependency |
Arabic | | Dependency |
Arabic | NYUAD | Dependency |
Arabic | PADT | Dependency |
Arabic | PUD | Dependency |
Arabic | | Phrase structure |
Armenian | ArmTDP | Dependency |
Assyrian | AS | Dependency |
Bambara | CRB | Dependency |
Basque | BDT | Dependency |
Belarusian | HSE | Dependency |
Bhojpuri | BhEn | Dependency |
Bhojpuri | BHTB | Dependency |
Breton | KEB | Dependency |
Bulgarian | BTB | Dependency |
Bulgarian | | HPSG |
Buryat | BDT | Dependency |
Cantonese | HK | Dependency |
Catalan | | Phrase structure |
Catalan | AnCora | Dependency |
Chinese | | Case grammar |
Chinese | CFL | Dependency |
Chinese | GSD | Dependency |
Chinese | GSDSimp | Dependency |
Chinese | HK | Dependency |
Chinese | PUD | Dependency |
Chinese | | Phrase structure |
Chinese | | Dependency |
Arabic | | Dependency |
Classical Armenian | | Dependency |
Coptic | Coptic Scriptorium | Dependency |
Croatian | | Dependency |
Croatian | SET | Dependency |
Czech | | Dependency |
Czech | CAC | Dependency |
Czech | CLTT | Dependency |
Czech | FicTree | Dependency |
Czech | PDT | Dependency |
Czech | PUD | Dependency |
Danish | | Dependency |
Danish | | Phrase structure |
Danish | DDT | Dependency |
Danish | DTB | Dependency |
Dutch | | Phrase structure |
Dutch | Alpino | Dependency |
Dutch | LassySmall | Dependency |
Dutch | | Dependency |
Dutch | | Dependency |
Egyptian, Pre-Coptic | | Dependency |
English | | Combinatory categorial grammar |
English | | HPSG |
English | | Phrase structure |
English | | Dependency |
English | BhEn | Dependency |
English | ESL | Dependency |
English | EWT | Dependency |
English | GUM | Dependency |
English | GUMReddit | Dependency |
English | LinES | Dependency |
English | ParTUT | Dependency |
English | Pronouns | Dependency |
English | PUD | Dependency |
English | | Phrase structure |
English | | Phrase structure |
English | | Phrase structure |
English | | Phrase structure |
English | | Phrase structure |
English | | HPSG |
English | | Phrase structure |
English | | Phrase structure |
English | | Dependency |
English | | Dependency |
English | | Phrase structure |
English | | Phrase structure |
English | | Dependency |
English | | Phrase structure |
Erzya | JR | Dependency |
Estonian | | Phrase structure |
Estonian | | Dependency |
Estonian | EDT | Dependency |
Estonian | EWT | Dependency |
Faroese | FarPaHC | Dependency |
Faroese | OFT | Dependency |
Finnish | | Dependency |
Finnish | FTB | Dependency |
Finnish | PUD | Dependency |
Finnish | TDT | Dependency |
French | | Dependency and macrosyntactic annotation |
French | | Phrase structure |
French | CrapBank | Dependency |
French | FQB | Dependency |
French | FTB | Dependency |
French | GSD | Dependency |
French | ParTUT | Dependency |
French | PUD | Dependency |
French | Sequoia | Dependency |
French | Spoken | Dependency |
French | | Phrase structure |
French | | Phrase structure |
French | | Phrase structure & Dependency |
Galician | CTG | Dependency |
Galician | TreeGal | Dependency |
German | | Dependency |
German | GSD | Dependency |
German | LIT | Dependency |
German | PUD | Dependency |
German | | Phrase structure |
German | | Phrase structure |
German | | Phrase structure |
German | | Phrase structure |
German | | Phrase structure |
German | | Phrase structure |
Gothic | | Dependency |
Gothic | PROIEL | Dependency |
Greek | | Dependency |
Greek | GDT | Dependency |
Hebrew | HTB | Dependency |
Hebrew | | Dependency |
Hindi English | HIENCS | Dependency |
Hindi | HDTB | Dependency |
Hindi | PUD | Dependency |
Hindi | | Dependency |
English | | Phrase structure |
English | | Phrase structure |
French | | Phrase structure |
Portuguese | | Phrase structure |
Hungarian | Szeged | Dependency |
Hungarian | | Phrase structure |
Icelandic | | Phrase structure |
Icelandic | IcePaHC | Dependency |
Icelandic | PUD | Dependency |
Indonesian | GSD | Dependency |
Indonesian | PUD | Dependency |
Indonesian | | Phrase structure |
Irish | IDT | Dependency |
Italian | | Phrase structure and dependency |
Italian | | Dependency |
Italian | | Phrase structure and dependency |
Italian | ISDT | Dependency |
Italian | ParTUT | Dependency |
Italian | PoSTWITA | Dependency |
Italian | PUD | Dependency |
Italian | TWITTIRO | Dependency |
Italian | VIT | Dependency |
Italian | | Dependency |
Italian | | |
Italian | | Dependency |
Italian | | Dependency |
Japanese | | |
Japanese | BCCWJ | Dependency |
Japanese | GSD | Dependency |
Japanese | KTC | Dependency |
Japanese | Modern | Dependency |
Japanese | PUD | Dependency |
Japanese | | Phrase structure |
Japanese | | Phrase structure |
Japanese | | Dependency |
Karelian | KKPP | Dependency |
Kazakh | KTB | Dependency |
Komi Permyak | UH | Dependency |
Komi Zyrian | IKDP | Dependency |
Komi Zyrian | Lattice | Dependency |
Korean | GSD | Dependency |
Korean | Kaist | Dependency |
Korean | Penn | Dependency |
Korean | PUD | Dependency |
Korean | Sejong | Dependency |
Korean | | Phrase structure |
Kurmanji | MG | Dependency |
Latin | ITTB | Dependency |
Latin | LLCT | Dependency |
Latin | Perseus | Dependency |
Latin | PROIEL | Dependency |
Latin | | Dependency |
Latin | | Dependency |
Latin | | Dependency |
Latvian | LVTB | Dependency |
Lithuanian | ALKSNIS | Dependency |
Lithuanian | HSE | Dependency |
Livvi | KKPP | Dependency |
Magahi | MGTB | Dependency |
Maltese | MUDT | Dependency |
Marathi | UFAL | Dependency |
Mbya Guarani | Dooley | Dependency |
Mbya Guarani | Thomas | Dependency |
Middle Irish | CritMITB | Dependency |
Middle Irish | DipMITB | Dependency |
Moksha | JR | Dependency |
Naija | NSC | Dependency |
North Sami | Giella | Dependency |
Norwegian | | LFG |
Norwegian | Bokmaal | Dependency |
Norwegian | Nynorsk | Dependency |
Norwegian | NynorskLIA | Dependency |
Old Church Slavonic | PROIEL | Dependency |
Old Church Slavonic | | Dependency |
Old French | SRCMF | Dependency |
Old Russian | RNC | Dependency |
Old Russian | TOROT | Dependency |
Old Russian | | Dependency |
Persian | | Dependency |
Persian | | HPSG |
Persian | Seraji | Dependency |
Polish | | HPSG |
Polish | LFG | Dependency |
Polish | PDB | Dependency |
Polish | PUD | Dependency |
Polish | | Phrase structure and Dependency |
Portuguese | Bosque | Dependency |
Portuguese | GSD | Dependency |
Portuguese | PUD | Dependency |
Portuguese | | Dependency, Phrase structure |
Romanian | | Dependency |
Romanian | Nonstandard | Dependency |
Romanian | RRT | Dependency |
Romanian | SiMoNERo | Dependency |
Russian | GSD | Dependency |
Russian | PUD | Dependency |
Russian | SynTagRus | Dependency |
Russian | Taiga | Dependency |
Russian | SynTagRus Dependency Treebank | Dependency |
Sanskrit | UFAL | Dependency |
Sanskrit | Vedic | Dependency |
Scottish Gaelic | ARCOSG | Dependency |
Serbian | SET | Dependency |
Sindhi | MazharDootio | Dependency |
Skolt Sami | Giellagas | Dependency |
Slovak | SNK | Dependency |
Slovene | | Dependency |
Slovenian | SSJ | Dependency |
Slovenian | SST | Dependency |
Spanish | | Phrase structure and dependency |
Spanish | AnCora | Dependency |
Spanish | GSD | Dependency |
Spanish | PUD | Dependency |
Spanish | | Phrase structure |
Swedish | | Phrase structure and dependency |
Swedish | | Phrase structure |
Swedish | LinES | Dependency |
Swedish | PUD | Dependency |
Swedish | Talbanken | Dependency |
Swedish | | Phrase structure |
Swedish Sign Language | SSLC | Dependency |
Swiss German | UZH | Dependency |
Tagalog | TRG | Dependency |
Tagalog | Ugnayan | Dependency |
Tamil | TTB | Dependency |
Telugu | MTG | Dependency |
Thai | | Dependency |
Thai | PUD | Dependency |
Thai | | Phrase structure |
Turkish | | Dependency |
Turkish | BOUN | Dependency |
Turkish | GB | Dependency |
Turkish | IMST | Dependency |
Turkish | PUD | Dependency |
Ukrainian | | Dependency |
Ukrainian | IU | Dependency |
Upper Sorbian | UFAL | Dependency |
Urdu | | Phrase structure |
Urdu | | Phrase and Hyper Dependency Structure |
Urdu | UDTB | Dependency |
Uyghur | UDT | Dependency |
Vietnamese | VTB | Dependency |
Vietnamese | | Phrase structure |
Vietnamese | | Dependency |
Warlpiri | UFAL | Dependency |
Welsh | CCG | Dependency |
Wolof | WTB | Dependency |
Yoruba | YTB | Dependency |

To facilitate research on multilingual tasks, some researchers have proposed universal annotation schemes that apply across languages, with the aim of combining the strengths of different treebanks. Examples include a universal annotation approach for dependency treebanks and a universal annotation approach for phrase structure treebanks.