Pan-genome graph construction
Pan-genome graph construction is the process of creating a graph-based representation of the collective genome of a species or a group of organisms. In such graphs, nodes often represent genomic sequences and edges represent adjacency relationships as they occur in individual genomes within a population. Thus, a pan-genome provides a way to encapsulate all genomic data for a species or clade rather than a single representative of such species or clade. This means that it can simultaneously capture multiple versions of any given locus.
Traditional linear reference genomes represent only a single consensus genome sequence, capturing just one version of each genomic locus. This approach is inherently limited, as it fails to account for genetic variations such as single-nucleotide polymorphism, insertions and deletions, and larger structural variants that commonly exist across populations. Linear reference genomes thus introduce biases by inadequately representing genomic diversity, potentially compromising the accuracy of analyses like variant calling and genotyping.
Pan-genome graphs address these limitations by incorporating all known genetic variations into their structure. This inclusive representation allows for unbiased analysis of genomic data, significantly improving sequencing read alignment, variant detection, and genotyping accuracy across diverse individuals. Advancements in both the quality and length of sequencing, alongside improved genome assembly techniques, have led to a rapidly growing collection of high quality genome assemblies, including haplotype-resolved human genome assemblies. As a result, pan-genome graphs have become an important paradigm in bioinformatics for analyzing population genomic data, improving read alignment, variant calling, and genotyping across diverse genomes.
Historical development
The concept of the pan-genome was first introduced in 2005 by Tettelin and colleagues during comparative analysis of multiple Streptococcus agalactiae bacterial genomes. In their study Tettelin et al. defined the pan-genome as consisting of a conserved "core genome" present in all strains and a variable "accessory genome" containing genes present in one or a subset of strains, highlighting the notion that any single genome contains only a fraction of a species' total gene repertoire. Early pan-genomic analyses in microbes focused on gene presence/absence using lists of genes or orthologous clusters, rather than nucleotide-level variation. As more genomes become available, researchers recognized the need to model genomic variation at the base-pair level and not just as gene lists.Graph-based representation of multiple sequences have roots in prior bioinformatic methods based on sequence graph. For instance, in early 2000s, partial order graphs were used to represent multiple sequence alignment and consensus sequences which can be seen as a kind of sequence graph capturing alternate alleles. In early 2010s, the first practical tools leveraging graph models for pan-genomes appeared. A notable example was the colored de Bruijn graph approach used by Cortex, which built a joint de Bruijn graph from multiple genomes or read sets to detect variations without a reference genome. Around the same time, different research group such as Dilthey et al. began constructing graphs for specific genomic regions in humans, such as the highly variable major histocompatibility complex locus and for collections of microbial genomes to represent all variations in one structure.
Recently, large-scale projects have implemented pan-genome graph construction for eukaryotic genomes. In 2023 the Human Pangenome Reference Consortium published a first draft human pan-genome, which contains 47 diverse, haplotype-resolved human genomes encoded in a graph structure. Similarly, graph pan-genomes have been built for crop species such as tomatoes, which were built from 838 genomes, enabling discovery of previously "missing" heritability and trait-linked variants that were not detectable with a single reference.
Graph-based methodologies
The main step in pan-genome graph building includes the initial alignment step and downstream processing for representation. Several graph-based methodologies have been developed to construct pan-genome representations, each with differing modeling methodology and graph construction algorithms. These methods can capture more complex rearrangements and efficiently represent large collections of sequences than traditional approaches such as multiple sequence alignments and K-mer–based strategies. Furthermore, graph methods are currently among the most effective at managing large-scale pan-genome data, even supporting the simultaneous representation of tens to hundreds of human haplotypes.Pan-genome graphs are often bidirected, meaning each node has two orientations, and edges account for all combinations of these orientations. Graph metrics, including the count of nodes, edges, and connected components, offer valuable insights into the granularity of represented variations as well as the complexity and accessibility of information within the pan-genome.
De Bruijn graphs
De Bruijn graphs are a classical data structure from genome assembly that have been adapted for pan-genome representation. In a de Bruijn graph, genomic sequences are broken into fixed-length k-mers. Each unique k-mer forms a node in the graph, and there is an edge from one k-mer to another if they overlap by k-1 bases. This graph encodes the sequence of each genome through a path of k-mer nodes. For pan-genomes, a single de Bruijn graph can be constructed from multiple genomes by taking the union of all k-mers present. To retain information about which genome contain a given k-mer, methods use colored de Bruijn graphs, where each node is annotated with one or more colors indicating the samples or strains in which that k-mer appears.The principle behind using de Bruijn graphs for pan-genomes is that the graph inherently compresses identical sequence regions and reveals variant sequences as alternative paths. This makes de Bruijn graph-based methods naturally reference-free; they do not require anchoring to a known genome, which is useful for discovering novel genes or rearrangements. A major advantage of de Bruijn graphs is their ability to handle repetitive sequences and high-diversity regions by breaking genomes into smaller k-mer pieces. However, a limitation is that the choice of k can affect the resolution of the graph, where a fixed k may be too small to capture structural variation or too large to include highly divergent regions.
De Bruijn graphs have proven to be highly scalable for massive datasets due to k-mer compaction. However, they have limited ability to represent structural variants or large indels, and they pose visualization and interpretation challenges because of their cyclic structures. Common applications include bacterial pan-genome analyses, short-read assembly, and k-mer-based population studies. Tools like Bifrost and mdbg optimize storage using colored compacted de Bruijn graphs.
Variation graphs
Variation graphs represent sequence homology and variation at the nucleotide level in a graph structure. In a variation graph model, nodes represent contiguous sequences, and edges connect nodes to indicate allowed adjacencies in some genome. The graph can be viewed as a merger of many individual genomes: each genome corresponds to a path that traverses the nodes in the order of its sequence. Unlike de Bruijn graphs, which are derived from fixed-length kmers, variation graphs are often derived from alignments or variant catalogs. For example, one can start with a reference genome and add alternate alleles as divergent branches at variant sites, or perform a multiple sequence alignment of several genomes and build a graph that embeds that alignment. Variation graphs accurately capture SNPs, indels, and structural variants, and they enable a complete, lossless representation of entire genomes, a relevant feature for clinical genomics. Nevertheless, graph complexity grows rapidly with the addition of haplotypes, and most implementations offer limited support for dynamic updates. Some applications include the Human Reference Pangenome and pathogen variant calling that integrates both SNPs and SVs. Tools like pggb and Minigraph enable chromosome-scale human pan-genomes.Partial Order alignment graphs
Partial order alignment represents a multiple sequence alignment as a directed acyclic graph instead of a linear consensus sequence. In a POA graph, each sequence in the pan-genome is a path through the graph, and aligned positions are merged into single nodes. This preserves the full alignment information without compressing it into a single linear profile. POA graphs capture small variants as branching and merging paths in the DAG, avoiding reference bias since no single genome is privileged as the reference. That being said, given that the POA graph must remain acyclic, graphs cannot represent structural rearrangements or complex genome variations that violate a single ordering of sequences. POA graphs resolve small-scale complexity through iterative smoothing, reducing spurious loops and underalignment in graphs. Furthermore, they are compatible with long-read sequencing error correction. However, they are computationally intensive due to repeated realignment passes and less effective for large-scale structural variation. Some applications include refining graph topology in repetitive regions and hybrid short-/long-read pan-genome polishing. Tools like abPOA generate consensus sequences and visualize alignment graphs.Cactus graphs
The Cactus graph is a graph-based structure specifically designed for whole-genome multiple alignments with complex rearrangements. A cactus graph is defined as a connected graph in which every edge lies in, at most, one simple cycle. In a cactus graph model of a pan-genome, nodes typically represent evolutionary breakpoints or homologous segments, and edges represent contiguous aligned regions between those breakpoints. When an inversion or rearrangement is present, it appears as a cycle in the graph connecting the segments, and because each edge is in only one cycle, each rearrangement is an isolated event in the graph structure. Although newer variations, such as Progressive Cactus, has improved significantly, the computational requirements remains high for such graphs. Cactus graphs minimize reference bias by treating all genomes equally and handle large structural variants and rearrangements efficiently. However, they require significant computational resources, and their accuracy depends on the quality of the initial alignment. Key applications include vertebrate-scale pangenomes and evolutionary studies of chromosomal rearrangements. Widely used implementations include Cactus and Minigraph-Cactus.Objectives of data structures
The pan-genome representation is challenging not only because it involves aligning massive quantities of sequence data, potentially on the order of hundreds of billions of bases, but also due to the difficulty of deciding which alignments should be incorporated, particularly for sequences that have been recently duplicated or contain repetitive elements.: Scaling pan-genome graph data structures to accommodate hundreds of genomes demands substantial computational power and engineering expertise. Nevertheless, every pangenome data structure must fulfill the following objectives- Construction and Maintenance: Must be constructible from multiple genomic sources, including linear reference genomes, haplotype panels, and raw sequencing data, with the ability to update stored information dynamically without requiring a complete reconstruction.
- Coordinate System: Must provide a well-defined coordinate system that allows for unambiguous identification of genetic loci and variations.
- Annotation of Biological Features: Must support the annotation of biological features across individual genomes, including genes, introns, transcription factor binding sites, epigenetic modifications, and regulatory elements.
- Data Retrieval: Must allow efficient retrieval of genomic sequences, genetic variants, and allele frequencies, while preserving haplotype information.
- Sequence Mapping and Variant Identification: Must facilitate the comparison of short and long sequences against its database.
- Comparative Genomics: Must enable genome comparisons within a single pangenome and among multiple pangenomes.
- Computational Efficiency: Must minimize memory usage while remaining compatible with computational tools that operate within a reasonable runtime.
Graph storage
Although several approaches exist for constructing pangenome graphs, they have largely converged on a shared data model known as the Graphical Fragment Assembly format. This standardization supports the development of a unified set of tools designed to operate within the pangenome graph framework.GFA format
GFA is a format intended to encode sequence graphs, whether they arise from assemblies, genome variation, gene splice patterns, or even overlaps among long-read sequences. It uses a tab-delimited text layout to record sequences and their relationships, relying on UTF-8 encoding restricted to codepoints no larger than 127. Fields include H for headers, indicating the file version; S for segments, representing DNA fragments or nucleotides; L for links, which connect oriented segments, with their overlap represented by a CIGAR string; J for jumps, defining connections between segments that cannot be associated with a specific overlap or sequence ; and P for paths, containing an ordered list of oriented segments supported by a link or a jump.The GFA format offers a standardized way to encode sequence nodes and their linkages in variation graphs. While other formats like Variant Call Format are integrated to embed variant information directly into the graph model, and succinct index structures like GCSA/GCSA2 exist, GFA's strength lies in providing a base for both structural connectivity and population variation data.