Russian National Corpus
The Russian National Corpus (RNC) is a corpus of the Russian language that has been partially accessible online through a query interface since April 29, 2004. It is being created by the Institute of Russian Language, Russian Academy of Sciences.
It currently contains more than 1 billion word forms that are automatically lemmatized and tagged for part of speech (POS) and grammemes; that is, every possible morphological analysis of an orthographic form is assigned to it. Lemmata, POS, grammatical features, and their combinations are searchable. In addition, a subcorpus of 6 million word forms has manually resolved homonymy.
The subcorpus with resolved morphological homonymy is also automatically accentuated. The whole corpus carries searchable lexical-semantic tagging, which covers morphosemantic POS subclasses, lexical-semantic characteristics proper, and derivation.
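The ambiguity-preserving tagging described above can be illustrated with a minimal sketch. It is not the RNC's software or query language: the `Analysis` class, the toy `LEXICON`, the tag names, and the `tag` and `search` functions are all hypothetical and exist only to show how a form that keeps every possible analysis can be searched by lemma, POS, grammemes, or a combination of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    """One possible morphological reading of an orthographic form."""
    lemma: str
    pos: str
    grammemes: frozenset


# Toy lexicon (illustrative only; tag names are assumptions, not RNC tags).
# "стали" is homonymous: a form of the noun "сталь" or of the verb "стать".
LEXICON = {
    "стали": [
        Analysis("сталь", "NOUN", frozenset({"gen", "sg"})),
        Analysis("стать", "VERB", frozenset({"past", "pl"})),
    ],
    "мы": [Analysis("мы", "PRON", frozenset({"nom", "pl"}))],
    "из": [Analysis("из", "PREP", frozenset())],
}


def tag(tokens):
    """Attach every known analysis to each token, without disambiguating."""
    return [(tok, LEXICON.get(tok.lower(), [])) for tok in tokens]


def search(tagged, lemma=None, pos=None, grammemes=()):
    """Return positions where at least one single analysis meets all criteria."""
    wanted = frozenset(grammemes)
    hits = []
    for i, (_form, analyses) in enumerate(tagged):
        for a in analyses:
            if ((lemma is None or a.lemma == lemma)
                    and (pos is None or a.pos == pos)
                    and wanted <= a.grammemes):
                hits.append(i)
                break
    return hits


tagged = tag(["Мы", "стали", "из", "стали"])
verb_hits = search(tagged, pos="VERB")
noun_gen_hits = search(tagged, lemma="сталь", grammemes=["gen"])
```

Because homonymy is left unresolved in this sketch, both occurrences of "стали" match a search for verbs and a search for the genitive of "сталь". A manually disambiguated subcorpus would keep only one analysis per occurrence, so such queries would return fewer false matches.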
The RNC also includes the following subcorpora:
- a treebank of syntactic dependencies;
- English⇔Russian, German⇒Russian, Ukrainian⇔Russian and Belarusian⇔Russian parallel corpora;
- a large separate corpus of modern newspapers;
- a corpus of Russian poetry, in which rhyming words and poetic prosody are additionally tagged;
- a corpus of Russian dialects with specific dialect grammar tagging;
- a multimedia corpus with searchable tagged fragments of Russian-language movies;
- a corpus showing the history of Russian stress;
- an educational subcorpus reflecting school standards.