Language resource
In linguistics and language technology, a language resource is a "[composition] of linguistic material used in the construction, improvement and/or evaluation of language processing applications, in language and language-mediated research studies and applications."
According to Bird & Simons, this includes
- data, i.e. "any information that documents or describes a language, such as a published monograph, a computer data file, or even a shoebox full of handwritten index cards. The information could range in content from unanalyzed sound recordings to fully transcribed and annotated texts to a complete descriptive grammar",
- tools, i.e., "computational resources that facilitate creating, viewing, querying, or otherwise using language data", and
- advice, i.e., "any information about what data sources are reliable, what tools are appropriate in a given situation, what practices to follow when creating new data". The latter aspect is usually referred to as "best practices" or "standards".
Typology
As of May 2020, no widely used standard typology of language resources has been established. Important classes of language resources include:
- data
- # lexical resources, e.g., machine-readable dictionaries,
- # linguistic corpora, i.e., digital collections of natural language data,
- # linguistic data bases such as the Cross-Linguistic Linked Data collection,
- tools
- # linguistic annotations and tools for creating such annotations in a manual or semiautomated fashion,
- # applications for search and retrieval over such data, and for automated annotation,
- metadata and vocabularies
- # vocabularies, repositories of linguistic terminology and language metadata, e.g., MetaShare, the ISO 12620 data category registry, or the Glottolog database.
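The classes listed above can be read as a simple two-level taxonomy. The following is a minimal sketch of how a catalogue might record it, not an established schema: the names `ResourceClass`, `SUBTYPES`, `LanguageResource` and `group_by_class`, as well as the subtype labels, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class ResourceClass(Enum):
    """Top-level classes named in the list above."""
    DATA = "data"
    TOOLS = "tools"
    METADATA = "metadata and vocabularies"


# Hypothetical subtype labels mapped to their top-level class.
SUBTYPES = {
    "lexical resource": ResourceClass.DATA,
    "corpus": ResourceClass.DATA,
    "linguistic database": ResourceClass.DATA,
    "annotation tool": ResourceClass.TOOLS,
    "search and retrieval application": ResourceClass.TOOLS,
    "vocabulary": ResourceClass.METADATA,
}


@dataclass(frozen=True)
class LanguageResource:
    name: str
    subtype: str  # must be a key of SUBTYPES

    @property
    def resource_class(self) -> ResourceClass:
        return SUBTYPES[self.subtype]


def group_by_class(resources):
    """Group resource names under their top-level class."""
    grouped = {cls: [] for cls in ResourceClass}
    for resource in resources:
        grouped[resource.resource_class].append(resource.name)
    return grouped


catalogue = [
    LanguageResource("Cross-Linguistic Linked Data", "linguistic database"),
    LanguageResource("Glottolog", "vocabulary"),
]
grouped = group_by_class(catalogue)
```

Since no standard typology exists, the mapping is kept as plain data, so that subtypes can be added or reassigned without changing the rest of the code.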
Language resource publication, dissemination and creation
A major concern of the language resource community has been to develop infrastructures and platforms to present, discuss and disseminate language resources. Selected contributions in this regard include:
- a series of International Conferences on Language Resources and Evaluation,
- the European Language Resources Association, and the Linguistic Data Consortium, which represent commercial hosting and dissemination platforms for language resources,
- the Open Language Archives Community (OLAC), which provides and aggregates language resource metadata,
- the Language Resources and Evaluation Journal,
- a European platform for language technologies, data and resources,
- ISO Technical Committee 37: Terminology and other language and content resources, developing standards for all aspects of language resources,
- W3C Community Group Best Practices for Multilingual Linked Open Data, working on best practice recommendations for publishing language resources as Linked Data or in RDF,
- W3C Community Group Linked Data for Language Technology, working on linguistic annotations on the web and language resource metadata,
- W3C Community Group Ontology-Lexica, working on lexical resources,
- the Open Linguistics working group of the Open Knowledge Foundation, working on conventions for publishing and linking open language resources, developing the Linguistic Linked Open Data cloud,
- the Text Encoding Initiative (TEI), working on XML-based specifications for language resources and digitally edited text.