Biocuration


Biocuration is the field of life sciences dedicated to organizing biomedical data, information and knowledge into structured formats, such as spreadsheets, tables and knowledge graphs. It is made possible by the cooperative work of biocurators, software developers and bioinformaticians, and it underpins the work of biological databases.

Biocuration as a profession

A biocurator is a professional scientist who collects, curates, annotates, and validates information that is disseminated by biological and model organism databases. It is a new profession, with the first mentions in the scientific literature dating from 2006, in the context of work on databases like the Immune Epitope Database and Analysis Resource. Biocurators are usually PhD-level scientists with a mix of experience in wet-lab research and computational representation of knowledge.
The role of a biocurator encompasses quality control of primary biological research data intended for publication, extracting and organizing data from original scientific literature, and describing the data with standard annotation protocols and vocabularies that enable powerful queries and biological database interoperability. Biocurators communicate with researchers to ensure the accuracy of curated information and to foster data exchanges with research laboratories.
Biocurators are present in diverse research environments, but may not self-identify as biocurators. Projects such as ELIXIR and GOBLET promote training and support biocuration as a career path.
In 2011, biocuration was already recognized as a profession, but there were no formal degree courses to prepare curators of biological data in a targeted fashion. With the growth of the field, the University of Cambridge and the EMBL-EBI started to jointly offer a Postgraduate Certificate in Biocuration, considered a step towards recognising biocuration as a discipline in its own right. There is a perceived increase in demand for biocuration, and a need for additional biocuration training in graduate programs.
Organizations that employ biocurators, like the Clinical Genome Resource, often provide specialized materials and training for biocuration.

Biological knowledgebases

The role of biocurators is best known in the field of biological knowledgebases. Such databases, like UniProt and PDB, rely on professional biocurators to organize information. Among other things, biocurators work to improve data quality, for example by merging duplicated entries.
An important subset of those knowledgebases are model organism databases, which rely on biocurators to curate information about particular organisms. Some notable examples are FlyBase, PomBase, and ZFIN, dedicated to curating information about Drosophila, Schizosaccharomyces pombe and zebrafish, respectively.
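The duplicate merging mentioned above can be illustrated with a minimal Python sketch. It assumes a hypothetical record layout (`name`, `taxon`, `sources`) and a naive matching rule (case-insensitive name plus taxon); real knowledgebases use their own schemas and far more careful matching, usually with a curator confirming each merge.

```python
def merge_duplicates(entries):
    """Merge entries that share a normalized (name, taxon) key.

    The record layout and the matching rule are assumptions made for
    illustration only; they do not describe any specific database.
    """
    merged = {}
    for entry in entries:
        # Normalize the name so "Abc1 " and "abc1" are treated as the same entry.
        key = (entry["name"].strip().lower(), entry["taxon"])
        if key not in merged:
            merged[key] = {
                "name": entry["name"].strip(),
                "taxon": entry["taxon"],
                "sources": [],
            }
        # Keep every distinct source so provenance is not lost in the merge.
        for source in entry["sources"]:
            if source not in merged[key]["sources"]:
                merged[key]["sources"].append(source)
    return list(merged.values())


# Hypothetical example records (placeholder names and identifiers).
example_entries = [
    {"name": "Abc1 ", "taxon": "EXTAXON:1", "sources": ["EXPUB:1"]},
    {"name": "abc1", "taxon": "EXTAXON:1", "sources": ["EXPUB:2", "EXPUB:1"]},
    {"name": "abc1", "taxon": "EXTAXON:2", "sources": ["EXPUB:3"]},
]
example_merged = merge_duplicates(example_entries)
```

The sketch keeps the union of sources rather than discarding one record, since losing provenance would undo part of the curator's work.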

Curation and annotation

Biocuration is the integration of biological information into on-line databases in a semantically standardized way, using appropriate unique traceable identifiers, and providing necessary metadata including source and provenance.
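As a rough illustration of this definition, a curated statement can be modelled as a record that carries identifiers, a source and provenance metadata. The Python sketch below is an assumption-laden example: the field names, the placeholder identifiers and the simplified "prefix:local" identifier check are all hypothetical and do not follow any particular database schema.

```python
import re
from dataclasses import dataclass
from datetime import date

# Simplified pattern for compact identifiers of the form "PREFIX:local_id".
# This is an illustrative assumption, not a full identifier specification.
COMPACT_ID = re.compile(r"^[A-Za-z][A-Za-z0-9_.]*:\S+$")


def is_compact_id(value):
    """Return True if the value looks like a 'PREFIX:local_id' identifier."""
    return bool(COMPACT_ID.match(value))


@dataclass(frozen=True)
class Annotation:
    subject_id: str    # identifier of the annotated entity (e.g. a gene)
    term_id: str       # identifier of the vocabulary or ontology term
    source_id: str     # identifier of the publication supporting the statement
    curator: str       # provenance: who made the annotation
    curated_on: date   # provenance: when it was made

    def problems(self):
        """List the fields whose values are not traceable identifiers."""
        checks = {
            "subject_id": self.subject_id,
            "term_id": self.term_id,
            "source_id": self.source_id,
        }
        return [name for name, value in checks.items() if not is_compact_id(value)]


# Hypothetical records with placeholder identifiers.
good = Annotation("EXGENE:0001", "EXONT:0000042", "EXPUB:12345",
                  "curator_a", date(2021, 1, 15))
bad = Annotation("abc1", "EXONT:0000042", "unpublished",
                 "curator_a", date(2021, 1, 15))
```

Here `good.problems()` returns an empty list, while `bad.problems()` flags the free-text gene name and the missing publication identifier.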

Ontologies, controlled vocabularies and standard names

Biocurators commonly employ and take part in the creation and development of shared biomedical ontologies: structured, controlled vocabularies that encompass many biological and medical knowledge domains, such as the Open Biomedical Ontologies. These domains include genomics and proteomics, anatomy, animal and plant development, biochemistry, metabolic pathways, taxonomic classification, and mutant phenotypes. Given the variety of existing ontologies, there are guidelines that orient researchers on how to choose a suitable one.
The Unified Medical Language System is one such system; it integrates and distributes millions of terms used in the life sciences domain.
Biocurators enforce the consistent use of gene nomenclature guidelines and participate in the genetic nomenclature committees of various model organisms, often in collaboration with the HUGO Gene Nomenclature Committee. They also enforce other nomenclature guidelines like those provided by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology, one example of which is the Enzyme Commission EC number.
More generally, the community encourages the use of persistent identifiers to improve clarity and facilitate knowledge

DNA annotation

In genome annotation, the identifiers defined by ontologists and consortia are used to describe parts of the genome. For example, the Gene Ontology curates terms for biological processes, which are used to describe what is known about specific genes.

Text annotation

As of 2021, life sciences communication is still done primarily in free-text natural languages, like English or German, which carry a degree of ambiguity and make it hard to connect knowledge. So, besides annotating biological sequences, biocurators also annotate texts, linking words to unique identifiers. This aids disambiguation, clarifies the intended meaning, and makes the texts processable by computers. One application of text annotation is to specify the exact gene a scientist is referring to.
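A minimal sketch of this idea in Python is a dictionary lookup that links words in a sentence to identifiers and records their character offsets. The lexicon, gene names and identifiers below are hypothetical placeholders, and real annotation tools handle synonyms, multi-word names and ambiguity, which this sketch does not.

```python
import re

# Hypothetical lexicon mapping lower-cased gene names to placeholder identifiers.
EXAMPLE_LEXICON = {
    "abc1": "EXGENE:0001",
    "xyz2": "EXGENE:0002",
}


def annotate(text, lexicon):
    """Link each word found in the lexicon to its identifier.

    Returns one dict per match with character offsets, so the
    annotation can be stored separately from the text itself.
    """
    annotations = []
    for match in re.finditer(r"\w+", text):
        identifier = lexicon.get(match.group().lower())
        if identifier is not None:
            annotations.append({
                "start": match.start(),
                "end": match.end(),
                "text": match.group(),
                "id": identifier,
            })
    return annotations


example_text = "Loss of Abc1 changes the xyz2 phenotype."
example_annotations = annotate(example_text, EXAMPLE_LEXICON)
```

Storing offsets rather than editing the sentence leaves the original text untouched, so annotations from several sources can be layered over the same document.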
Publicly available text annotations make it possible for biologists to take further advantage of biomedical text. Europe PMC has an application programming interface that centralizes text annotations from a variety of sources and makes them available in a graphical user interface called SciLite. PubTator Central also provides annotations, but is fully based on computerized text-mining and does not provide a user interface. There are also programs that allow users to manually annotate the biomedical texts they are interested in, such as the ezTag system.

Variant curation

A type of biocuration within the field of medical genetics, variant curation is a process for assessing genetic changes according to the likelihood that they cause disease. This is an evidence-based process that uses data from a multitude of sources, including population data, computational data, functional data, segregation data, de novo data and allelic data, among others. It is a collaborative process that can be automated; however, manual curation is considered the gold standard.
There is no single standardised process of variant curation; different researchers and organisations use different processes. However, a set of internationally accepted standards and guidelines for the interpretation of genetic variants has been jointly developed by the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. These are known as the ACMG/AMP guidelines. They provide a framework for classifying genetic variants as "pathogenic", "likely pathogenic", "uncertain significance", "likely benign" or "benign", in order from most likely to least likely to cause disease. The guidelines also define levels of evidence: very strong, strong, moderate and supporting. The combination of the types of evidence found and the strength of each piece of evidence allows each variant to be classified along the scale from "pathogenic" to "benign".
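The general shape of such a classification, in which pieces of evidence with a direction and a strength are combined into one of five classes, can be sketched in Python. The weights and thresholds below are hypothetical placeholders chosen only to make the example run; they are not the ACMG/AMP combining rules and must not be used to interpret real variants.

```python
# Class labels in order from most to least likely to cause disease.
CLASSES = (
    "pathogenic",
    "likely pathogenic",
    "uncertain significance",
    "likely benign",
    "benign",
)

# Hypothetical weights per evidence strength (an assumption for illustration).
TOY_WEIGHTS = {"very strong": 8, "strong": 4, "moderate": 2, "supporting": 1}


def toy_classify(evidence):
    """Combine (direction, strength) pairs into one of the five classes.

    Toy scoring scheme: pathogenic evidence adds its weight, benign
    evidence subtracts it, and placeholder thresholds pick the class.
    """
    score = 0
    for direction, strength in evidence:
        if strength not in TOY_WEIGHTS:
            raise ValueError(f"unknown evidence strength: {strength!r}")
        if direction == "pathogenic":
            score += TOY_WEIGHTS[strength]
        elif direction == "benign":
            score -= TOY_WEIGHTS[strength]
        else:
            raise ValueError(f"unknown evidence direction: {direction!r}")

    # Placeholder thresholds, symmetric around zero.
    if score >= 10:
        return CLASSES[0]
    if score >= 6:
        return CLASSES[1]
    if score <= -10:
        return CLASSES[4]
    if score <= -6:
        return CLASSES[3]
    return CLASSES[2]
```

For instance, with these placeholder numbers, one very strong plus one moderate piece of pathogenic evidence yields "pathogenic", while conflicting strong evidence in both directions cancels out to "uncertain significance".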

International Society for Biocuration (ISB)

The International Society for Biocuration is a non-profit organisation that "promotes the field of biocuration and provides a forum for information exchange through meetings and workshops." It grew out of the International Biocuration Conferences and was founded in early 2009.
The ISB offers two awards to biocurators in the community: the Biocuration Career Award and the ISB Award for Exceptional Contributions to Biocuration.
The official journal of the ISB, Database, is a venue specialized in articles about databases and biocuration.

Community curation

Traditionally, biocuration has been done by dedicated experts, who integrate data into databases. Community curation has emerged as a promising approach to improve the dissemination of knowledge from published data and to provide a cost-effective way of scaling biocuration. In some cases, community help is leveraged in jamborees that introduce domain experts to curation tasks, which are carried out during the event, while other efforts rely on asynchronous contributions from experts and non-experts.

Biological databases

Several biological databases include author contributions in their functional curation strategy to some extent. These contributions range from associating gene identifiers with publications or free text to more structured and detailed annotation of sequences and functional data, with output curated to the same standards as that of professional biocurators. Most community curation at model organism databases involves annotation by the original authors of published research, either to obtain accurate identifiers for the objects to be curated or to identify data types for detailed curation. For example:
  • WormBase successfully solicits first-pass annotation from users and has integrated author curation with the micropublication process. WormBase also integrates text-mining into its platform, providing suggestions to community curators.
  • FlyBase sends email requests to authors of new publications, inviting them to list the genes and data types described via an online tool and has also mobilized the community to write gene summary paragraphs.
Other databases, such as PomBase, rely on publication authors to submit highly detailed, ontology-based annotations for their publications, and metadata associated with genome-wide datasets, using controlled vocabularies. A web-based tool, Canto, was developed to facilitate community submissions. Since Canto is freely available, generic and highly configurable, it has been adopted by other projects. Curation is subject to review by professional curators, resulting in high-quality, in-depth curation of all molecular data types.
The widely used UniProt knowledgebase also has a community curation mechanism that allows researchers to add information about proteins.