Reference genome


A reference genome is a genome assembly that represents the complete genetic sequence of an organism as a continuous string of nucleotides. For an assembly to serve as a reference genome, it is typically accompanied by annotations, produced through a process known as DNA or genome annotation. The annotations specify the genomic coordinates of genes, exons, introns, and mRNA, and are often paired with corresponding transcript and protein sequences.
Reference genomes exist for a wide variety of species, including viruses, bacteria, fungi, plants and animals, and they differ in how they are constructed and represented. A reference may be derived from a single individual or from multiple individuals whose sequences are collapsed into one representative haplotype assembly. Two main factors determine the quality of a reference genome assembly: the sequencing technology, which affects sequence accuracy, and the assembly level, which indicates how complete the genome representation is.
The ideal is a chromosome-level assembly, which is a complete DNA sequence for each chromosome with no unplaced segments. However, achieving this remains technically challenging, especially for large or repetitive genomes. Earlier sequencing technologies often produced assemblies at the contig or scaffold level, with limited chromosomal context. The exact size of these fragments depends on the sequencing platform and bioinformatic methods available at the time.
For assemblies that are not fully resolved, summary statistics such as N50 and L50 are commonly used to characterise contiguity and assembly fragmentation; these metrics are explained in the Contigs and scaffolds section.
Reference genomes are central to omics research, particularly genomics. They provide a reference for "mapping" DNA sequence data from many individuals, enabling efficient identification of the genomic location of these sequences and the detection of polymorphisms through a process known as variant calling.
The limitations of this practice, such as reference bias and under-representation of population diversity, have led to the development of population-level reference sets and pangenomes.
Reference genomes and their annotations are publicly accessible through online genome browsers and archives such as Ensembl, the European Nucleotide Archive at EMBL-EBI, the UCSC Genome Browser, and NCBI.

Properties of reference genomes

Measures of length

The length of a genome can be measured in multiple different ways.
A simple way to measure genome length is to count the number of base pairs in the assembly.
The golden path is an alternative measure of length that omits redundant regions such as haplotypes and pseudoautosomal regions. It is usually constructed by layering sequencing information over a physical map to combine scaffold information. It is a 'best estimate' of what the genome will look like and typically includes gaps, making it longer than the typical base pair assembly.
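The difference between counting every position and counting only sequenced bases can be illustrated with a minimal Python sketch. It assumes the assembly is available as FASTA text and that gaps are written as runs of N, which is a common convention but not something stated above; the function name assembly_lengths and the example record chr_demo are hypothetical.

```python
from typing import Tuple


def assembly_lengths(fasta_text: str) -> Tuple[int, int]:
    """Return (total length, ungapped length) for FASTA-formatted text.

    Assumption: gap positions are encoded as the letter N.
    """
    total = 0
    ungapped = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        # Skip blank lines and FASTA header lines, which start with ">".
        if not line or line.startswith(">"):
            continue
        total += len(line)
        ungapped += sum(1 for base in line.upper() if base != "N")
    return total, ungapped


# Hypothetical toy record: 14 positions, 4 of which are gap characters.
example = """>chr_demo
ACGTNNNNAC
GTAC
"""

total, ungapped = assembly_lengths(example)
print(total, ungapped)
```

A length that includes gaps is always at least as large as the count of sequenced bases, which is why two length figures quoted for the same assembly can differ.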

Contigs and scaffolds

Assembling a reference genome requires overlapping reads to be merged into contigs, which are contiguous DNA regions of consensus sequence. If there are gaps between contigs, these can be bridged by scaffolding, either by amplifying contigs with PCR and sequencing them or by Bacterial Artificial Chromosome (BAC) cloning. Scaffolds are classified into three types: 1) Placed, whose chromosome, genomic coordinates and orientation are known; 2) Unlocalised, when only the chromosome is known but not the coordinates or orientation; 3) Unplaced, whose chromosome is not known.
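The three scaffold categories amount to a simple decision rule, sketched below in Python. The Scaffold record and its field names are hypothetical and chosen only for illustration; real assembly files encode this information in their own formats.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Scaffold:
    """Hypothetical record; None means the property is not known."""
    name: str
    chromosome: Optional[str] = None
    start: Optional[int] = None          # genomic coordinate on the chromosome
    orientation: Optional[str] = None    # "+" or "-"


def classify(scaffold: Scaffold) -> str:
    """Apply the placed / unlocalised / unplaced rule described above."""
    if scaffold.chromosome is None:
        return "unplaced"
    if scaffold.start is None or scaffold.orientation is None:
        return "unlocalised"
    return "placed"


print(classify(Scaffold("s1", "chr1", 1000, "+")))
print(classify(Scaffold("s2", "chr1")))
print(classify(Scaffold("s3")))
```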
The number of contigs and scaffolds, as well as their average lengths, are among the many parameters relevant to assessing the quality of a reference genome assembly, since they indicate how continuous the final assembly is relative to the original genome. The fewer scaffolds per chromosome, down to a single scaffold spanning an entire chromosome, the greater the continuity of the genome assembly. Other related parameters are N50 and L50. N50 is the contig/scaffold length such that 50% of the assembly is contained in fragments of this length or greater, while L50 is the smallest number of contigs/scaffolds whose combined length reaches 50% of the assembly, that is, the number of fragments of length N50 or greater. A higher N50 generally goes with a lower L50, and together they indicate high continuity in the assembly.
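As an illustration, N50 and L50 can be computed from a list of contig or scaffold lengths by sorting them from longest to shortest and accumulating until half of the total assembly length is reached. The following Python sketch uses this common definition; the function name n50_l50 and the example lengths are made up for illustration.

```python
from typing import List, Tuple


def n50_l50(lengths: List[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a list of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one contig or scaffold length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # The first fragment that pushes the running sum to half of the
        # assembly gives N50 (its length) and L50 (its rank).
        if running >= half:
            return length, count
    raise AssertionError("unreachable: the running sum always reaches half")


# Toy assembly of 290 bases in six fragments.
fragmented = [80, 70, 50, 40, 30, 20]
print(n50_l50(fragmented))   # (70, 2)

# Same total length in three fragments, one of them much longer.
contiguous = [200, 50, 40]
print(n50_l50(contiguous))   # (200, 1)
```

In the second toy assembly the same total length is held in fewer, longer fragments, so N50 rises and L50 falls.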

Mammalian genomes

The human and mouse reference genomes are maintained and improved by the Genome Reference Consortium (GRC), a group of fewer than 20 scientists from a number of genome research institutes, including the European Bioinformatics Institute, the National Center for Biotechnology Information, the Sanger Institute and the McDonnell Genome Institute at Washington University in St. Louis. The GRC continues to improve reference genomes by building new alignments that contain fewer gaps and by fixing misrepresentations in the sequence.

Human reference genome

The original human reference genome was derived from thirteen anonymous volunteers from Buffalo, New York. Donors were recruited by advertisement in The Buffalo News on Sunday, March 23, 1997. The first ten male and ten female volunteers were invited to make an appointment with the project's genetic counselors and donate blood from which DNA was extracted. As a result of how the DNA samples were processed, about 80 percent of the reference genome came from eight people, and one male, designated RP11, accounts for 66 percent of the total. The ABO blood group system differs among humans, but the human reference genome contains only an O allele, although the others are annotated.
As the cost of DNA sequencing falls and new full genome sequencing technologies emerge, more genome sequences continue to be generated. In several cases, individuals such as James D. Watson have had their genomes assembled using massively parallel DNA sequencing. Comparison between the reference and Watson's genome revealed 3.3 million single nucleotide polymorphism differences, while about 1.4 percent of his DNA could not be matched to the reference genome at all. For regions where there is known to be large-scale variation, sets of alternate loci are assembled alongside the reference locus.
The latest human reference genome assembly released by the Genome Reference Consortium is GRCh38, released in December 2013. Several patches have been added to update it, the latest being GRCh38.p14, published on 3 February 2022. This build has only 349 gaps across the entire assembly, a great improvement over the first version, which had roughly 150,000 gaps. The gaps are mostly in areas such as telomeres, centromeres, and long repetitive sequences, with the biggest gap along the long arm of the Y chromosome, a region of ~30 Mb in length. The number of genomic clone libraries contributing to the reference has increased steadily to >60 over the years, although individual RP11 still accounts for 70% of the reference genome. Genomic analysis of this anonymous male suggests that he is of African-European ancestry. According to the GRC website, the next assembly release for the human genome is currently "indefinitely postponed".
In 2022, the Telomere-to-Telomere Consortium, an open, community-based effort, published the first completely assembled reference genome, without any gaps in the assembly. It did not contain a Y-chromosome until version 2.0. This assembly allows for the examination of centromeric and pericentromeric sequence evolution. The consortium employed rigorous methods to assemble, clean, and validate complex repeat regions which are particularly difficult to sequence. It used ultra-long–read sequencing to accurately sequence segmental duplications.
The T2T-CHM13 assembly was sequenced from CHM13hTERT, a cell line derived from an essentially haploid hydatidiform mole. "CHM" stands for "Complete Hydatidiform Mole", and "13" is its line number. "hTERT" stands for "human Telomerase Reverse Transcriptase". The cell line has been transfected with the TERT gene, which is responsible for maintaining telomere length and thus contributes to the cell line's immortality. A hydatidiform mole contains two copies of the same parental genome, and thus is essentially haploid. This eliminates allelic variation and allows better sequencing accuracy.
Recent genome assemblies are as follows:
Release name       Date of release          Equivalent UCSC version
GRCh39             Indefinitely postponed   -
T2T-CHM13          January 2022             hs1
GRCh38             Dec 2013                 hg38
GRCh37             Feb 2009                 hg19
NCBI Build 36.1    Mar 2006                 hg18
NCBI Build 35      May 2004                 hg17
NCBI Build 34      Jul 2003                 hg16

Limitations

For much of a genome, the reference provides a good approximation of the DNA of any single individual. But in regions with high allelic diversity, such as the major histocompatibility complex in humans and the major urinary proteins of mice, the reference genome may differ significantly from other individuals. Because the reference genome is a "single" distinct sequence, which is what makes it useful as an index or locator of genomic features, it is limited in how faithfully it can represent the human genome and its variability. Most of the initial samples used for reference genome sequencing came from people of European ancestry. In 2010, de novo assembly of genomes from African and Asian populations and their comparison with the NCBI reference genome showed that these genomes contained ~5 Mb of sequence that did not align to any region of the reference genome.
Projects following the Human Genome Project have sought a deeper and more diverse characterization of human genetic variability, which the reference genome is not able to represent.
The HapMap Project, active from 2002 to 2010, aimed to create a map of haplotypes and their most common variations among different human populations. Up to 11 populations of different ancestry were studied, including individuals of the Han ethnic group from China, Gujaratis from India, the Yoruba people from Nigeria and Japanese people, among others.
The 1000 Genomes Project, carried out between 2008 and 2015, aimed to create a database that includes more than 95% of the variations present in the human genome and whose results can be used in association studies of diseases such as diabetes, cardiovascular disease or autoimmune diseases. A total of 26 ethnic groups were studied in this project, expanding the scope of the HapMap Project to new ethnic groups such as the Mende people of Sierra Leone, the Vietnamese people and the Bengali people.
The Human Pangenome Project, which started its initial phase in 2019 with the creation of the Human Pangenome Reference Consortium, seeks to create the largest map of human genetic variability, taking the results of previous studies as a starting point.