Czech National Corpus

The Czech National Corpus is a large electronic corpus of written and spoken Czech, developed by the Institute of the Czech National Corpus (ICNC) in the Faculty of Arts at Charles University in Prague. The collection is used for teaching and research in corpus linguistics. The ICNC collaborates with over 200 researchers and students, 270 publishers, and other similar research projects.

Website: ucnk.ff.cuni.cz/cs

Areas of focus

The Czech National Corpus focuses systematically on the following areas:

Synchronic written corpora: the SYN-series corpora map the Czech language of the 20th and 21st centuries and form the core of the project. Texts are enriched with metadata, lemmatization, and morphological tagging.

Contemporary spontaneous spoken Czech: the ORAL-series corpora contain contemporary, spontaneous spoken language used in informal situations throughout the Czech Republic.

Multilingual parallel corpus: InterCorp is a large corpus of Czech texts aligned at the sentence level with translations to or from more than 30 languages. The core of the corpus consists of manually aligned and proofread fiction texts.

Diachronic corpus of Czech: the DIAKORP corpus of historical Czech includes texts from the 14th century onwards. The current focus of DIAKORP is on the 19th century. The long-term goal of DIAKORP is to create a corpus covering the period from 1850 to the present and to interconnect its data with the SYN series.

Specialised linguistic data: the ICNC is also involved in the collection of language data for specific research purposes, including DIALEKT, CzeSL, DEAF, and Jerome.