Human Pangenome Reference
The Human Pangenome Reference is a collection of genomes from a diverse cohort of individuals compiled by the Human Pangenome Reference Consortium.
This first draft pangenome comprises 47 phased, diploid assemblies from a diverse cohort of individuals and was intended to capture the genetic diversity of the human population. The development of this pangenome seeks to address perceived shortcomings in the current human reference genome by offering a more comprehensive and inclusive resource for genomic research and analysis.
The pangenome concept, originating from the study of prokaryotes, has been extended to multicellular eukaryotic organisms, including humans. The human pangenome has significant implications for population genetics, phylogenetics, and public health policy, as it can inform the genetic basis of diseases and personalized treatments by providing insights into the genetic diversity of human populations.
The new human pangenome reference integrates the missing 8% of the human genome sequence, adding over 100 million new bases. It aims to capture more population diversity than the previous reference sequence and is based on 94 high-quality haploid assemblies from individuals with broad genetic diversity. The generation of this reference genome focuses on eliminating gaps, incorporating complex genomic sequence features, and encompassing a broader spectrum of human genome diversity.
History
The human reference genome, initially drafted over 20 years ago, is a composite of merged haplotypes from more than 20 individuals, with a single individual contributing approximately 70% of the sequence. It has limitations, including biases and errors, and, like any linear human genome reference sequence, cannot fully represent global human genomic variation. Most genomic research has focused on individuals of European descent, which biases the datasets available for analysis. Consequently, precision medicine relies primarily on genomic variation found within populations of European ancestry. This limited scope overlooks a significant portion of global genetic diversity that is crucial for understanding clinical phenotypes. To overcome this, the Human Pangenome Reference Consortium (HPRC) has been working to create a more complete human reference genome: a graph-based, telomere-to-telomere representation of global genomic diversity that integrates genome sequences from a diverse array of individuals. Its primary objectives include enhancing gene-disease association studies across populations and serving as an extensive genetic resource for future biomedical research and precision medicine.
Capturing variants
The assemblies in the draft pangenome are reported to cover more than 99% of the expected sequence in each genome and to exhibit an accuracy of over 99% at both the structural and base pair levels. The pangenome captures known variants and haplotypes, reveals new alleles at structurally complex loci, and adds 119 million base pairs of euchromatic polymorphic sequence and 1,115 gene duplications relative to the existing reference GRCh38, with roughly 90 million of the additional base pairs derived from structural variation. Using this draft pangenome to analyze short-read data has shown a 34% reduction in small variant discovery errors and a 104% increase in the detection of structural variants per haplotype compared to GRCh38-based workflows.
Representation of diversity
The HPRC's efforts are part of a broader initiative to sequence and assemble genomes from individuals across diverse populations, with the goal of better representing the genomic landscape of human diversity. The consortium aims to increase the number of genome sequences to 350 by mid-2024, providing a more complete and inclusive resource for genomic research and analysis.
The development of the human pangenome reference marks a notable advancement in genomics, as it offers a more accurate and diverse depiction of global genomic variation. It is expected to enhance gene-disease association studies across populations, broaden the scope of genomics research to encompass the most repetitive and polymorphic regions of the genome, and serve as a valuable genetic resource for future studies.
HPRC sample subpopulations include ACB, African Caribbean in Barbados; ASW, African Ancestry in Southwest US; CHS, Han Chinese South; CLM, Colombian in Medellin, Colombia; ESN, Esan in Nigeria; GWD, Gambian in Western Division; KHV, Kinh in Ho Chi Minh City, Vietnam; MKK, Maasai in Kinyawa, Kenya; MSL, Mende in Sierra Leone; PEL, Peruvian in Lima, Peru; PJL, Punjabi in Lahore, Pakistan; PUR, Puerto Rican in Puerto Rico; YRI, Yoruba in Ibadan, Nigeria.
The human pangenome reference is more comprehensive than previous reference sequences: it incorporates over 100 million new bases from 47 people with diverse ancestries.
Human Pangenome generation
Sample selection and sequencing
The pangenome reference includes 47 fully phased diploid genomes. Of these, 29 were generated entirely by the HPRC, while the remaining 18 were produced by other efforts. The following sequencing technologies were used: Pacific Biosciences high-fidelity (HiFi) sequencing at 39.7× depth of coverage, Oxford Nanopore Technologies long-read sequencing, Bionano optical maps, and high-coverage Hi-C Illumina short-read sequencing. For the 18 additional samples, the nanopore unsheared long-read sequencing protocol was employed, yielding approximately 60× coverage of unsheared sequencing data.
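Depth of coverage figures such as 39.7× and 60× relate the total number of sequenced bases to the size of the genome. The following is a minimal sketch of that arithmetic, not part of the HPRC pipeline. The genome size of 3.1 billion bases is an assumed round figure for a haploid human genome, and the function names are hypothetical.

```python
# Assumed haploid human genome size in bases (a round figure for illustration).
ASSUMED_GENOME_SIZE = 3.1e9


def depth_of_coverage(total_bases, genome_size=ASSUMED_GENOME_SIZE):
    """Average number of times each genome position is covered by sequenced bases."""
    return total_bases / genome_size


def bases_for_depth(depth, genome_size=ASSUMED_GENOME_SIZE):
    """Total sequenced bases needed to reach a target average depth."""
    return depth * genome_size


# Depths quoted in the text above; the resulting base counts depend on the
# assumed genome size and are only illustrative.
hifi_bases = bases_for_depth(39.7)
nanopore_bases = bases_for_depth(60.0)

print(f"HiFi at 39.7x needs about {hifi_bases / 1e9:.1f} Gb of sequence")
print(f"Nanopore at 60x needs about {nanopore_bases / 1e9:.1f} Gb of sequence")
```

Under the assumed genome size, the calculation shows that each sample requires on the order of a hundred billion sequenced bases per technology.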
Assembling genomes
The Trio-Hifiasm tool was selected as the primary assembler following thorough benchmarking of multiple alternatives. Trio-Hifiasm leverages PacBio HiFi long-read sequences and parental Illumina short-read sequences to generate highly phased contig assemblies.
Constructing the pangenome graph
Three different tools were used to construct the pangenome graph:
- Minigraph: a method that maps assemblies to a graph rapidly using the minimap2 algorithm. Starting from a graph built on a reference input, in this case GRCh38, it greedily adds newly detected structural variants (SVs) to the graph.
- Minigraph-Cactus: this method aims to include smaller variants in the graph, ideally down to the SNP level, so that the graph represents most of the variation in the genomes. Each haplotype is represented as a path in the graph.
- Pangenome Graph Builder: an all-to-all comparison method that builds a graph representing all alignments between genomes. This method has three phases:
  1. Alignment: the wfmash aligner is used to generate all-vs-all alignments of the input sequences.
  2. Graph induction: seqwish induces the graph from the alignments.
  3. Graph normalization: smoothxg normalizes the induced graph.
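The idea that a pangenome graph stores sequence in nodes and represents each haplotype as a path can be illustrated with a toy example. The sketch below is not the data model of any of the tools above. The segment sequences, node numbers, and haplotype names are all made up for illustration.

```python
# Toy variation graph: node IDs map to sequence segments (all values are made up).
segments = {
    1: "ACGT",
    2: "G",       # reference allele at a SNP site
    3: "A",       # alternate allele at the same SNP site
    4: "TTC",
    5: "GGGAGG",  # inserted sequence that is absent from the reference path
    6: "CA",
}

# Each haplotype is a walk through the graph; the names are hypothetical.
paths = {
    "reference": [1, 2, 4, 6],
    "sample1_hap1": [1, 3, 4, 6],
    "sample1_hap2": [1, 2, 4, 5, 6],
}


def spell(path):
    """Reconstruct a haplotype sequence by concatenating the segments on its path."""
    return "".join(segments[node] for node in path)


def graph_edges(all_paths):
    """Collect every edge (pair of consecutive nodes) used by at least one path."""
    edges = set()
    for walk in all_paths.values():
        edges.update(zip(walk, walk[1:]))
    return edges


def non_reference_nodes(all_paths, ref_name="reference"):
    """For each non-reference haplotype, list the nodes it visits that the reference does not."""
    ref_nodes = set(all_paths[ref_name])
    return {
        name: [node for node in walk if node not in ref_nodes]
        for name, walk in all_paths.items()
        if name != ref_name
    }


for name, walk in paths.items():
    print(name, spell(walk))
print(non_reference_nodes(paths))
```

In this toy graph, the SNP appears as two alternative single-base nodes between shared flanking nodes, and the insertion appears as an extra node visited by only one haplotype. Shared sequence is stored once, regardless of how many haplotypes pass through it.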
Applications