Pan-genome graph construction

Pan-genome graph construction is the process of creating a graph-based representation of the collective genome of a species or a group of organisms. In such graphs, nodes often represent genomic sequences and edges represent adjacency relationships as they occur in individual genomes within a population. Thus, a pan-genome provides a way to encapsulate all genomic data for a species or clade rather than a single representative of such species or clade. This means that it can simultaneously capture multiple versions of any given locus.
Traditional linear reference genomes represent only a single consensus genome sequence, capturing just one version of each genomic locus. This approach is inherently limited, as it fails to account for genetic variations such as single-nucleotide polymorphism, insertions and deletions, and larger structural variants that commonly exist across populations. Linear reference genomes thus introduce biases by inadequately representing genomic diversity, potentially compromising the accuracy of analyses like variant calling and genotyping.
Pan-genome graphs address these limitations by incorporating all known genetic variations into their structure. This inclusive representation allows for unbiased analysis of genomic data, significantly improving sequencing read alignment, variant detection, and genotyping accuracy across diverse individuals. Advancements in both the quality and length of sequencing, alongside improved genome assembly techniques, have led to a rapidly growing collection of high quality genome assemblies, including haplotype-resolved human genome assemblies. As a result, pan-genome graphs have become an important paradigm in bioinformatics for analyzing population genomic data, improving read alignment, variant calling, and genotyping across diverse genomes.

Historical development

The concept of the pan-genome was first introduced in 2005 by Tettelin and colleagues during comparative analysis of multiple Streptococcus agalactiae bacterial genomes. In their study Tettelin et al. defined the pan-genome as consisting of a conserved "core genome" present in all strains and a variable "accessory genome" containing genes present in one or a subset of strains, highlighting the notion that any single genome contains only a fraction of a species' total gene repertoire. Early pan-genomic analyses in microbes focused on gene presence/absence using lists of genes or orthologous clusters, rather than nucleotide-level variation. As more genomes become available, researchers recognized the need to model genomic variation at the base-pair level and not just as gene lists.
Graph-based representation of multiple sequences have roots in prior bioinformatic methods based on sequence graph. For instance, in early 2000s, partial order graphs were used to represent multiple sequence alignment and consensus sequences which can be seen as a kind of sequence graph capturing alternate alleles. In early 2010s, the first practical tools leveraging graph models for pan-genomes appeared. A notable example was the colored de Bruijn graph approach used by Cortex, which built a joint de Bruijn graph from multiple genomes or read sets to detect variations without a reference genome. Around the same time, different research group such as Dilthey et al. began constructing graphs for specific genomic regions in humans, such as the highly variable major histocompatibility complex locus and for collections of microbial genomes to represent all variations in one structure.
Recently, large-scale projects have implemented pan-genome graph construction for eukaryotic genomes. In 2023 the Human Pangenome Reference Consortium published a first draft human pan-genome, which contains 47 diverse, haplotype-resolved human genomes encoded in a graph structure. Similarly, graph pan-genomes have been built for crop species such as tomatoes, which were built from 838 genomes, enabling discovery of previously "missing" heritability and trait-linked variants that were not detectable with a single reference.

Graph-based methodologies

The main step in pan-genome graph building includes the initial alignment step and downstream processing for representation. Several graph-based methodologies have been developed to construct pan-genome representations, each with differing modeling methodology and graph construction algorithms. These methods can capture more complex rearrangements and efficiently represent large collections of sequences than traditional approaches such as multiple sequence alignments and K-mer–based strategies. Furthermore, graph methods are currently among the most effective at managing large-scale pan-genome data, even supporting the simultaneous representation of tens to hundreds of human haplotypes.
Pan-genome graphs are often bidirected, meaning each node has two orientations, and edges account for all combinations of these orientations. Graph metrics, including the count of nodes, edges, and connected components, offer valuable insights into the granularity of represented variations as well as the complexity and accessibility of information within the pan-genome.

De Bruijn graphs

De Bruijn graphs are a classical data structure from genome assembly that have been adapted for pan-genome representation. In a de Bruijn graph, genomic sequences are broken into fixed-length k-mers. Each unique k-mer forms a node in the graph, and there is an edge from one k-mer to another if they overlap by k-1 bases. This graph encodes the sequence of each genome through a path of k-mer nodes. For pan-genomes, a single de Bruijn graph can be constructed from multiple genomes by taking the union of all k-mers present. To retain information about which genome contain a given k-mer, methods use colored de Bruijn graphs, where each node is annotated with one or more colors indicating the samples or strains in which that k-mer appears.
The principle behind using de Bruijn graphs for pan-genomes is that the graph inherently compresses identical sequence regions and reveals variant sequences as alternative paths. This makes de Bruijn graph-based methods naturally reference-free; they do not require anchoring to a known genome, which is useful for discovering novel genes or rearrangements. A major advantage of de Bruijn graphs is their ability to handle repetitive sequences and high-diversity regions by breaking genomes into smaller k-mer pieces. However, a limitation is that the choice of k can affect the resolution of the graph, where a fixed k may be too small to capture structural variation or too large to include highly divergent regions.
De Bruijn graphs have proven to be highly scalable for massive datasets due to k-mer compaction. However, they have limited ability to represent structural variants or large indels, and they pose visualization and interpretation challenges because of their cyclic structures. Common applications include bacterial pan-genome analyses, short-read assembly, and k-mer-based population studies. Tools like Bifrost and mdbg optimize storage using colored compacted de Bruijn graphs.

Variation graphs

Variation graphs represent sequence homology and variation at the nucleotide level in a graph structure. In a variation graph model, nodes represent contiguous sequences, and edges connect nodes to indicate allowed adjacencies in some genome. The graph can be viewed as a merger of many individual genomes: each genome corresponds to a path that traverses the nodes in the order of its sequence. Unlike de Bruijn graphs, which are derived from fixed-length kmers, variation graphs are often derived from alignments or variant catalogs. For example, one can start with a reference genome and add alternate alleles as divergent branches at variant sites, or perform a multiple sequence alignment of several genomes and build a graph that embeds that alignment. Variation graphs accurately capture SNPs, indels, and structural variants, and they enable a complete, lossless representation of entire genomes, a relevant feature for clinical genomics. Nevertheless, graph complexity grows rapidly with the addition of haplotypes, and most implementations offer limited support for dynamic updates. Some applications include the Human Reference Pangenome and pathogen variant calling that integrates both SNPs and SVs. Tools like pggb and Minigraph enable chromosome-scale human pan-genomes.

Partial Order alignment graphs

Partial order alignment represents a multiple sequence alignment as a directed acyclic graph instead of a linear consensus sequence. In a POA graph, each sequence in the pan-genome is a path through the graph, and aligned positions are merged into single nodes. This preserves the full alignment information without compressing it into a single linear profile. POA graphs capture small variants as branching and merging paths in the DAG, avoiding reference bias since no single genome is privileged as the reference. That being said, given that the POA graph must remain acyclic, graphs cannot represent structural rearrangements or complex genome variations that violate a single ordering of sequences. POA graphs resolve small-scale complexity through iterative smoothing, reducing spurious loops and underalignment in graphs. Furthermore, they are compatible with long-read sequencing error correction. However, they are computationally intensive due to repeated realignment passes and less effective for large-scale structural variation. Some applications include refining graph topology in repetitive regions and hybrid short-/long-read pan-genome polishing. Tools like abPOA generate consensus sequences and visualize alignment graphs.

Cactus graphs

The Cactus graph is a graph-based structure specifically designed for whole-genome multiple alignments with complex rearrangements. A cactus graph is defined as a connected graph in which every edge lies in, at most, one simple cycle. In a cactus graph model of a pan-genome, nodes typically represent evolutionary breakpoints or homologous segments, and edges represent contiguous aligned regions between those breakpoints. When an inversion or rearrangement is present, it appears as a cycle in the graph connecting the segments, and because each edge is in only one cycle, each rearrangement is an isolated event in the graph structure. Although newer variations, such as Progressive Cactus, has improved significantly, the computational requirements remains high for such graphs. Cactus graphs minimize reference bias by treating all genomes equally and handle large structural variants and rearrangements efficiently. However, they require significant computational resources, and their accuracy depends on the quality of the initial alignment. Key applications include vertebrate-scale pangenomes and evolutionary studies of chromosomal rearrangements. Widely used implementations include Cactus and Minigraph-Cactus.

Objectives of data structures

The pan-genome representation is challenging not only because it involves aligning massive quantities of sequence data, potentially on the order of hundreds of billions of bases, but also due to the difficulty of deciding which alignments should be incorporated, particularly for sequences that have been recently duplicated or contain repetitive elements.: Scaling pan-genome graph data structures to accommodate hundreds of genomes demands substantial computational power and engineering expertise. Nevertheless, every pangenome data structure must fulfill the following objectives

Construction and Maintenance: Must be constructible from multiple genomic sources, including linear reference genomes, haplotype panels, and raw sequencing data, with the ability to update stored information dynamically without requiring a complete reconstruction.
Coordinate System: Must provide a well-defined coordinate system that allows for unambiguous identification of genetic loci and variations.
Annotation of Biological Features: Must support the annotation of biological features across individual genomes, including genes, introns, transcription factor binding sites, epigenetic modifications, and regulatory elements.
Data Retrieval: Must allow efficient retrieval of genomic sequences, genetic variants, and allele frequencies, while preserving haplotype information.
Sequence Mapping and Variant Identification: Must facilitate the comparison of short and long sequences against its database.
Comparative Genomics: Must enable genome comparisons within a single pangenome and among multiple pangenomes.
Computational Efficiency: Must minimize memory usage while remaining compatible with computational tools that operate within a reasonable runtime.

Graph storage

Although several approaches exist for constructing pangenome graphs, they have largely converged on a shared data model known as the Graphical Fragment Assembly format. This standardization supports the development of a unified set of tools designed to operate within the pangenome graph framework.

GFA format

GFA is a format intended to encode sequence graphs, whether they arise from assemblies, genome variation, gene splice patterns, or even overlaps among long-read sequences. It uses a tab-delimited text layout to record sequences and their relationships, relying on UTF-8 encoding restricted to codepoints no larger than 127. Fields include H for headers, indicating the file version; S for segments, representing DNA fragments or nucleotides; L for links, which connect oriented segments, with their overlap represented by a CIGAR string; J for jumps, defining connections between segments that cannot be associated with a specific overlap or sequence ; and P for paths, containing an ordered list of oriented segments supported by a link or a jump.
The GFA format offers a standardized way to encode sequence nodes and their linkages in variation graphs. While other formats like Variant Call Format are integrated to embed variant information directly into the graph model, and succinct index structures like GCSA/GCSA2 exist, GFA's strength lies in providing a base for both structural connectivity and population variation data.

Applications

Microbial genomics

Microbial species such as bacteria often have high levels of genome plasticity with high levels of horizontal gene transfer, gene content variation, and rearrangements. Thus, the pan-genome of a microbial species can encompass a core set of genes plus a large pool of accessory genes that vary by strain. In microbial genomics, gene-level pan-genome graphs are used to analyze the repertoire of genes. For instance, nodes representing genes can help identify which pathways are core vs. accessory, and edges can show conserved gene neighbourhoods or genomic islands. In this description, graphs can be used to track the gain or loss of pathogenicity islands, antibiotic resistance genes, or mobile elements across strains. Beyond gene presence, nucleotide-level graphs allow the detection of micro-variants within core genes and the mapping of sequencing reads from new strains to a graph of all known strains. Studies have shown that using a pan-genome graph instead of a single reference for read alignment in pathogens can improve mapping rates and variant calling.

Personalized medicine

In human genomics and medicine, pan-genome graphs promise to make genomic analyses more inclusive and accurate for individuals from diverse populations. The current practice of using a single human reference genome has well-known biases. For instance, if a person has genetic variants not present in the reference, reads carrying those variants may align poorly or not at all, leading to missed or miscalled variants. This is particularly problematic for structural variants or for populations under-represented in the reference genome. Accordingly, a human pan-genome graph, which incorporates sequences from many individuals, addresses this by providing alternate alleles at many loci. Pangenome graphs have transformed clinical genomics and personalized medicine by providing a comprehensive view of genetic diversity, especially in regions that are underrepresented in traditional reference genomes. This enhanced resolution not only facilitates tailored therapeutic strategies but also promotes global health equity, as demonstrated by initiatives focusing on Middle Eastern and South Asian populations.

Comparative genomic

Comparative genomics involves studying the similarities and differences in genome content and organization between different species or within a species. Pan-genome graphs are powerful in this arena because they provide a unifying framework to compare multiple genomes directly. For within-species comparisons, a graph can represent polymorphisms and structural variants, which enables analyses of genome diversity and evolution. For example, in plant genomics, a graph of many cultivars or wild accessions can reveal structural variations like the presence/absence of gene duplications or inversions that may underlie trait differences. For instance, the construction of a tomato pan-genome graph from 838 genomes enabled the discovery of previously "missing" heritability and trait-linked variants that were not detectable with a single reference. Additionally, pan-genome graphs allow for the integration and analysis of SVs across a diverse set of genomes, improving the accuracy in detecting rare SVs that are commonly overlooked with single reference genomes. For example, pan-genome graphs have substantially enhanced the analysis of SVs in rare genetic diseases by integrating SV data from haploid genomes into a unified graph structure, which enables precise delineation of complex and rare alleles. This approach not only boosts allele sharing between related individuals, confirming lower error rates but also facilitates the identification of ultra-rare SVs that disrupt coding sequences in known disease genes, such as the novel diagnostic deletion in KMT2E, ultimately offering a more robust framework for population genomic analyses and rare disease diagnostics.

Pan-Transcriptome analysis

By incorporating population-level genetic variation, spliced pan-genome graphs make it possible to identify transcripts that are unique to specific haplotypes. This increases the accuracy of gene expression measurements, especially for complex cases like allele-specific expression. In practice, tools can combine known transcripts with genomic variants to create a spliced pan-genome graph, generating haplotype-specific transcripts. Specialized aligners then map RNA-seq reads to this graph, capturing the full range of splicing and variation more effectively than traditional methods. Finally, expression levels for these HSTs are quantified, reducing bias and further improving accuracy. As a result, this approach can achieve very high accuracy in matching reads to an individual's specific haplotypes.

Genotyping

Genotyping accuracy and scalability, particularly for structural variants, have been enhanced by pangenome graphs because they address the limitations of linear reference genomes. Integrating population diversity into reference structures in pangenome graphs enables high-throughput genotyping. For instance, GraphTyper2 is a method for SV and small variant genotyping in a pangenome graph framework, paving the way for more comprehensive and bias‐reduced population-scale genome analyses. It employs graph-aware alignment via pangenome paths, mitigating biases inherent to linear references, improving SV representation in diverse populations such as pathogenic SVs in Icelandic families.