General Regionally Annotated Corpus of Ukrainian
The General Regionally Annotated Corpus of the Ukrainian Language (GRAC) is a text corpus of Ukrainian comprising more than 2 billion tokens. It is intended for linguistic research on grammar, vocabulary, and the history of the Ukrainian literary language, as well as for compiling dictionaries and grammars.
The corpus can be used for language study and for preparing teaching materials, textbooks, learner’s dictionaries, and exercises based on examples from real texts that reflect frequency and collocational patterns. The corpus is not a model of standard Ukrainian: it may contain words and combinations that do not match current norms of the literary language.
The corpus covers the period from 1816 to 2025, and as of 29 November 2025 it contains more than 812,000 texts by about 35,000 authors.
Composition of the corpus
In the 10th version of the corpus, available for searching from 20 October 2020, fiction makes up 35% of the texts. Some fiction genres are singled out separately: children’s literature, folklore, dramatic works, and scripts.
Among non-fiction texts:
journalistic writing, including newspaper collections from 1888–1893, 1905, 1913–1918, 1919–1943, modern newspapers from different regions, and texts from online news/information sites;
memoirs, letters, and diaries, including a sizeable corpus of Facebook texts representing blogs by people from all regions of Ukraine and the diaspora;
scholarly and educational texts: monographs, dissertations, academic articles, textbooks; large subcorpora of academic literature in history, ethnography, philosophy, and law are singled out separately;
religious texts, including two Ukrainian translations of the Bible;
speeches and interviews.
Some dictionaries that include phrasal examples and phraseology have also been incorporated, including the Ukrainian dictionary by Borys Hrinchenko and the Russian-Ukrainian idiomatic dictionary by I. Vyrhan and M. Pylynska. Using the corpus tools, these dictionaries can be searched not only for words, but also for lexico-grammatical patterns within examples and phraseological expressions.
About 20% of the texts in the corpus are translations. The corpus includes translations from more than 80 languages, most often from English and Russian.
Dating
Texts in the corpus are dated by the year of writing, or by the latest year in which a work could have been written; translated texts are dated by the year the translation was produced. A publication year may also be indicated, corresponding to the edition from which the text is taken.
Regional annotation
The corpus’s regional annotation is based on the modern administrative division of Ukraine. The corpus includes texts from all oblasts of Ukraine and from Crimea. A single text may belong to several regional subcorpora.
In addition to regional subcorpora, there are subcorpora of works by authors of the Ukrainian diaspora. These are mostly texts by emigrants of the 1940s and, to a lesser extent, of 1917 to the 1920s.
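The dating and regional conventions described above can be pictured as a small metadata record. The following is only an illustrative sketch: the class name TextRecord, its field names, the helper regional_subcorpus, and the region labels are all hypothetical and do not reflect GRAC's actual storage format or query interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class TextRecord:
    """Hypothetical metadata record; not GRAC's real schema."""
    title: str
    year_written: Optional[int] = None          # known year of writing
    latest_possible_year: Optional[int] = None  # upper bound if the exact year is unknown
    translation_year: Optional[int] = None      # set only for translated texts
    publication_year: Optional[int] = None      # edition the text was taken from
    regions: Set[str] = field(default_factory=set)  # one text may carry several regions

    def corpus_date(self) -> Optional[int]:
        # Translated texts are dated by the year the translation was produced.
        if self.translation_year is not None:
            return self.translation_year
        # Otherwise use the year of writing, falling back to the latest possible year.
        if self.year_written is not None:
            return self.year_written
        return self.latest_possible_year


def regional_subcorpus(records: List[TextRecord], region: str) -> List[TextRecord]:
    # A text belongs to every regional subcorpus whose label it carries.
    return [r for r in records if region in r.regions]
```

Under these assumptions, a text tagged with two regions would be returned by regional_subcorpus for either region, and a translation would be dated by its translation year regardless of when the original was written.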
Morphological annotation
GRAC relies on a morphological analysis system developed by specialists from the r2u group. The program analyzes the text and, for each word form, determines the lemma and tags.
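To make the shape of such annotation concrete, here is a minimal sketch of what a per-word-form result might look like. It does not use the r2u analyzer or its tagset: the Analysis class, the analyze function, the two-entry toy lexicon, and the tag names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Analysis:
    """One annotated token: the word form as written, its lemma, and its tags."""
    word_form: str
    lemma: str
    tags: Tuple[str, ...]


# Toy lexicon standing in for a real morphological dictionary (hypothetical data).
TOY_LEXICON: Dict[str, Tuple[str, Tuple[str, ...]]] = {
    "читала": ("читати", ("verb", "past", "feminine", "singular")),
    "книгу": ("книга", ("noun", "accusative", "singular")),
}


def analyze(text: str) -> List[Analysis]:
    results = []
    # Naive whitespace tokenization; a real analyzer handles punctuation and ambiguity.
    for word in text.lower().split():
        # Unknown forms keep themselves as the lemma and get a placeholder tag.
        lemma, tags = TOY_LEXICON.get(word, (word, ("unknown",)))
        results.append(Analysis(word, lemma, tags))
    return results
```

Calling analyze("Читала книгу") in this sketch yields two Analysis records, one per word form. A real analyzer would also need to handle forms with several possible readings, which this toy lookup ignores.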