Compression of genomic sequencing data

High-throughput sequencing technologies have led to a dramatic decline in genome sequencing costs and to an astonishingly rapid accumulation of genomic data. These technologies are enabling ambitious genome sequencing endeavours, such as the 1000 Genomes Project and the 1001 Genomes Project. Storing and transferring this tremendous amount of genomic data has become a mainstream problem, motivating the development of high-performance compression tools designed specifically for genomic data. A recent surge of interest in novel algorithms and tools for storing and managing genomic re-sequencing data emphasizes the growing demand for efficient methods of genomic data compression.

General concepts

Although standard data compression tools are used to compress sequence data, this approach has been criticized as extravagant because it does not take advantage of two features of genomic data: individual sequences often contain repetitive content, and many sequences are highly similar to one another. Additionally, the statistical and information-theoretic properties of genomic sequences can potentially be exploited for compressing sequencing data.

Base variants

When a reference template is available, only the differences from it need to be recorded, which greatly reduces the amount of information to be stored. Relative compression is a particularly natural fit for genome re-sequencing projects, where the aim is to discover variations in individual genomes. A reference single nucleotide polymorphism map, such as dbSNP, can further reduce the number of variants that need to be stored.
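The idea of recording only differences from a reference can be illustrated with a minimal Python sketch. It assumes that the sample and the reference have the same length, so only substitutions are handled; the function names and the toy sequences are hypothetical and are not taken from any of the tools listed in this article.

```python
from __future__ import annotations


def diff_against_reference(reference: str, sample: str) -> list[tuple[int, str]]:
    """Return (position, base) for every site where sample differs from reference.

    Assumption: both sequences have equal length, so insertions and
    deletions are out of scope for this sketch.
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def apply_variants(reference: str, variants: list[tuple[int, str]]) -> str:
    """Rebuild the sample by writing each variant base into the reference."""
    bases = list(reference)
    for position, base in variants:
        bases[position] = base
    return "".join(bases)


reference = "ACGTACGTACGT"  # toy reference, 0-based coordinates
sample = "ACGTTCGTACGA"     # differs at positions 4 and 11
variants = diff_against_reference(reference, sample)
restored = apply_variants(reference, variants)
```

Only the two variant records need to be kept for this sample; the remaining ten bases are recovered from the reference during decompression.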

Relative genomic coordinates

Another useful idea is to store relative genomic coordinates in lieu of absolute coordinates. For example, if sequence variant bases are represented in the format ‘Position1Base1Position2Base2…’, then ‘123C125T130G’ can be shortened to ‘0C2T5G’, where each integer is the distance from the preceding variant and the first position, 123, is kept separately as a correction factor. The cost is the modest arithmetic needed to recover the absolute coordinates, plus the storage of the correction factor.
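The conversion in this example can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that positions are listed in ascending order and that bases are drawn from A, C, G, T and N.

```python
from __future__ import annotations

import re

# One record is a run of digits followed by a single base letter.
RECORD = re.compile(r"(\d+)([ACGTN])")


def to_relative(absolute: str) -> tuple[int, str]:
    """Split an absolute variant string into (correction factor, relative string)."""
    records = [(int(pos), base) for pos, base in RECORD.findall(absolute)]
    if not records:
        return 0, ""
    offset = records[0][0]  # correction factor: the first absolute position
    previous = offset
    parts = []
    for position, base in records:
        parts.append(f"{position - previous}{base}")  # gap to the preceding variant
        previous = position
    return offset, "".join(parts)


def to_absolute(offset: int, relative: str) -> str:
    """Recover absolute coordinates by accumulating the gaps onto the offset."""
    position = offset
    parts = []
    for gap, base in RECORD.findall(relative):
        position += int(gap)
        parts.append(f"{position}{base}")
    return "".join(parts)


offset, relative = to_relative("123C125T130G")
# offset is 123 and relative is "0C2T5G"
```

The gain comes from the gaps usually being much smaller than the absolute positions, so they need fewer digits or bits.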

Prior information about the genomes

Further reduction can be achieved if all possible positions of substitutions in a pool of genome sequences are known in advance. For instance, if all locations of SNPs in a human population are known, then there is no need to record variant coordinate information. This approach, however, is rarely appropriate because such information is usually incomplete or unavailable.
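As a rough illustration, suppose every site at which a sample may differ from the reference is known in advance. A sample can then be stored as the bases observed at those sites, in order, with no coordinates at all. The site list, sequences and function names below are hypothetical.

```python
from __future__ import annotations


def encode_alleles(sample: str, known_sites: list[int]) -> str:
    """Keep only the bases observed at the pre-agreed variant sites."""
    return "".join(sample[site] for site in known_sites)


def decode_alleles(reference: str, known_sites: list[int], alleles: str) -> str:
    """Rebuild the sample from the reference, the shared site list and the alleles.

    Assumption: the sample differs from the reference only at known_sites.
    """
    bases = list(reference)
    for site, allele in zip(known_sites, alleles):
        bases[site] = allele
    return "".join(bases)


reference = "ACGTACGTACGT"
known_sites = [3, 7, 10]  # hypothetical catalogue of variable positions
sample = "ACGAACGTACTT"   # differs from the reference at sites 3 and 10
alleles = encode_alleles(sample, known_sites)
restored = decode_alleles(reference, known_sites, alleles)
```

The sketch also shows why the approach is fragile: a variant at a position missing from the catalogue would be silently lost.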

Encoding genomic coordinates

Encoding schemes are used to convert coordinate integers into binary form to provide additional compression gains. Encoding designs, such as the Golomb code and the Huffman code, have been incorporated into genomic data compression tools. Of course, encoding schemes entail accompanying decoding algorithms. Choice of the decoding scheme potentially affects the efficiency of sequence information retrieval.
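A minimal sketch of one such scheme is the Rice code, the special case of the Golomb code in which the divisor is a power of two, 2^k. Each non-negative integer is split into a quotient, written in unary as a run of 1s ended by a 0, and a remainder, written in k binary digits. The sketch uses strings of '0' and '1' rather than packed bits for readability, the function names are hypothetical, and the decoder assumes a well-formed input stream.

```python
from __future__ import annotations


def rice_encode(value: int, k: int) -> str:
    """Encode a non-negative integer with Rice parameter k."""
    if value < 0 or k < 0:
        raise ValueError("value and k must be non-negative")
    quotient = value >> k
    remainder = value & ((1 << k) - 1)
    remainder_bits = format(remainder, f"0{k}b") if k else ""
    return "1" * quotient + "0" + remainder_bits


def rice_decode_stream(bits: str, k: int) -> list[int]:
    """Decode a concatenation of Rice codewords (assumes well-formed input)."""
    values = []
    i = 0
    while i < len(bits):
        quotient = 0
        while bits[i] == "1":  # unary part: count the leading 1s
            quotient += 1
            i += 1
        i += 1                 # skip the terminating 0
        remainder = int(bits[i:i + k], 2) if k else 0
        i += k
        values.append((quotient << k) | remainder)
    return values


gaps = [0, 2, 5]  # coordinate gaps between consecutive variants
stream = "".join(rice_encode(gap, k=2) for gap in gaps)
decoded = rice_decode_stream(stream, k=2)
```

Small gaps get short codewords, and because no codeword is a prefix of another, the stream can be decoded without separators.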

Algorithm design choices

A universal approach to compressing genomic data may not be optimal, because a particular method may suit some purposes and aims better than others. Several design choices that potentially impact compression performance therefore deserve consideration.

Encoding schemes

Different types of encoding schemes have been explored for encoding variant bases and genomic coordinates. Fixed codes, such as the Golomb code and the Rice code, are suitable when the variant or coordinate distribution is well defined. Variable codes, such as the Huffman code, provide a more general entropy encoding scheme when the underlying variant or coordinate distribution is not well defined.
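As an illustration of a code derived from the data, the following Python sketch builds a Huffman code table from the observed frequencies of variant bases. It is a textbook construction rather than the implementation used by any particular tool; the function names and the example input are hypothetical.

```python
from __future__ import annotations

import heapq
from collections import Counter


def huffman_table(symbols: str) -> dict[str, str]:
    """Build a prefix-free code in which frequent symbols get shorter codewords."""
    counts = Counter(symbols)
    if not counts:
        return {}
    if len(counts) == 1:  # a single distinct symbol still needs one bit
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: partial codeword}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def huffman_encode(symbols: str, table: dict[str, str]) -> str:
    return "".join(table[s] for s in symbols)


def huffman_decode(bits: str, table: dict[str, str]) -> str:
    inverse = {code: s for s, code in table.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # a complete codeword has been read
            decoded.append(inverse[current])
            current = ""
    return "".join(decoded)


variant_bases = "AAAAAAAACCCCGGT"  # skewed toy distribution of variant bases
table = huffman_table(variant_bases)
encoded = huffman_encode(variant_bases, table)
```

With this skewed input the most frequent base receives a one-bit codeword, and the whole string takes 25 bits instead of the 30 needed at a fixed two bits per base.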

List of genomic re-sequencing data compression tools

The compression ratio of currently available genomic data compression tools ranges between 65-fold and 1,200-fold for human genomes. Very close variants or revisions of the same genome can be compressed very efficiently. However, such compression is not indicative of the typical compression ratio for different genomes of the same organism. The most common encoding scheme amongst these tools is Huffman coding, which is used for lossless data compression.

Software	Description	Compression Ratio	Data Used for Evaluation	Approach/Encoding Scheme	Link	Use Licence	Reference
PetaSuite	Lossless compression tool for BAM and FASTQ.gz files; transparent on-the-fly readback through BAM and FASTQ.gz virtual files	60% to 90%	Human genome sequences from the 1000 Genomes Project		https://petagene.com	Commercial
Genozip	A universal compressor for genomic files – compresses FASTQ, SAM/BAM/CRAM, VCF/BCF, FASTA, GFF/GTF/GVF, PHYLIP, BED and 23andMe files		Human genome sequences from the 1000 Genomes Project	Genozip extensible framework	http://genozip.com	Commercial, but free for non-commercial use
Genomic Squeeze	Lossless compression tool designed for storing and analyzing sequencing read data	65% to 76%	Human genome sequences from the 1000 Genomes Project	Huffman coding	http://public.tgen.org/sqz	-Undeclared-
CRAM	Highly efficient and tunable reference-based compression of sequence data		European Nucleotide Archive	deflate and rANS	http://www.ebi.ac.uk/ena/software/cram-toolkit	Apache-2.0
Genome Compressor	A tool using a mixture of multiple Markov models for compressing reference and reference-free sequences		Human nuclear genome sequence	Arithmetic coding	http://bioinformatics.ua.pt/software/geco/ or https://pratas.github.io/geco/	GPLv3
GenomSys codecs	Lossless compression of BAM and FASTQ files into the standard format ISO/IEC 23092	60% to 90%	Human genome sequences from the 1000 Genomes Project	Context-adaptive binary arithmetic coding	https://www.genomsys.com	Commercial
fastafs	Compression of FASTA / UCSC2Bit files into random access compressed archives. Toolkit to mount FASTA files, indices and dictionary files virtually. This allows neat file system integration without the need to fully decompress archives for random / partial access.		FASTA files	Huffman coding as implemented by Zstd	https://github.com/yhoogstrate/fastafs	GPL-v2.0