Croatian National Corpus
The Croatian National Corpus is the largest and most important corpus of Croatian. Its compilation started in 1998 at the Institute of Linguistics of the Faculty of Humanities and Social Sciences, University of Zagreb, following the ideas of . The theoretical foundations for a general-purpose, representative, multi-million corpus of Croatian, and arguments for the need for one, had begun to appear even earlier. The corpus is compiled from selected texts written in Croatian, covering all fields, topics, genres and styles: from literary and scientific texts to textbooks, newspapers, user groups and chat rooms.
The initial composition was divided into two components:
- A 30-million corpus of contemporary Croatian, made up of samples from texts dating from 1990 onwards. To be included, text samples had to be written by native speakers and had to cover different fields, genres and topics. Translated texts and poetry were excluded.
- The Croatian Electronic Text Archive, which holds complete texts, particularly serial publications that would unbalance the 30-million corpus if they were included there.
The latest version of the corpus has 216.8 million tokens. It can be searched online through the Bonito 2 web interface, which is part of NoSketch Engine, a limited version of the Sketch Engine software.