While standard data compression tools (e.g., zip and rar) are being used to compress sequence data (e.g., the GenBank flat file database), this approach has been criticized as extravagant because genomic sequences often contain repetitive content (e.g., microsatellite sequences) and many sequences exhibit high levels of similarity (e.g., multiple genome sequences from the same species). Additionally, the statistical and information-theoretic properties of genomic sequences can potentially be exploited for compressing sequencing data.
With the availability of a reference template, only differences (e.g., single nucleotide substitutions and insertions/deletions) need to be recorded, thereby greatly reducing the amount of information to be stored. The notion of relative compression is especially apparent in genome re-sequencing projects, where the aim is to discover variations in individual genomes. A reference single nucleotide polymorphism (SNP) map, such as dbSNP, can be used to further reduce the number of variants that must be stored.
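The idea of recording only differences against a reference can be illustrated with a minimal Python sketch. It assumes substitutions only (no insertions or deletions) and sequences of equal length; the function names and the toy sequences are hypothetical and are not taken from any published tool.

```python
def diff_against_reference(reference: str, sample: str) -> list[tuple[int, str]]:
    """Return (0-based position, base) for every substitution in the sample."""
    if len(reference) != len(sample):
        # Indels would need an alignment step, which this sketch leaves out.
        raise ValueError("substitution-only sketch: lengths must match")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def reconstruct(reference: str, variants: list[tuple[int, str]]) -> str:
    """Rebuild the sample by applying the stored substitutions to the reference."""
    bases = list(reference)
    for position, base in variants:
        bases[position] = base
    return "".join(bases)


# Toy data, invented for illustration.
reference = "ACGTACGTACGT"
sample = "ACGTTCGTACGA"

variants = diff_against_reference(reference, sample)
print(variants)  # only two records are stored instead of twelve bases
restored = reconstruct(reference, variants)
```

Only the variant list has to be stored, provided that the same reference is available at decompression time.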
Another useful idea is to store relative genomic coordinates in lieu of absolute coordinates. For example, if sequence variant bases are represented in the format ‘Position1Base1Position2Base2…’, then ‘123C125T130G’ can be shortened to ‘0C2T5G’, where each integer is the interval from the preceding variant. The cost is the modest arithmetic required to recover the absolute coordinates, plus the storage of the offset of the first variant (‘123’ in this example).
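The conversion between absolute and relative coordinates can be sketched as follows. This is a minimal illustration that assumes the ‘PositionBase’ string format from the example and single-letter bases A, C, G, or T; the function names are hypothetical.

```python
import re


def to_relative(encoded: str) -> tuple[int, str]:
    """Convert e.g. '123C125T130G' into (123, '0C2T5G')."""
    pairs = re.findall(r"(\d+)([ACGT])", encoded)
    if not pairs:
        return 0, ""
    offset = int(pairs[0][0])  # stored once so absolute positions can be recovered
    previous = offset
    parts = []
    for position_text, base in pairs:
        position = int(position_text)
        parts.append(f"{position - previous}{base}")  # gap from the previous variant
        previous = position
    return offset, "".join(parts)


def to_absolute(offset: int, relative: str) -> str:
    """Invert to_relative by accumulating the gaps, starting from the offset."""
    running = offset
    parts = []
    for gap_text, base in re.findall(r"(\d+)([ACGT])", relative):
        running += int(gap_text)
        parts.append(f"{running}{base}")
    return "".join(parts)


offset, relative = to_relative("123C125T130G")
print(offset, relative)  # 123 0C2T5G
```

The gaps are typically much smaller than the absolute positions, so they need fewer digits (or fewer bits under an integer code).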
Further reduction can be achieved if all possible positions of substitutions in a pool of genome sequences are known in advance. For instance, if all locations of SNPs in a human population are known, then there is no need to record variant coordinate information (e.g., ‘123C125T130G’ can be abridged to ‘CTG’). This approach, however, is rarely appropriate because such information is usually incomplete or unavailable.
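A minimal sketch of this idea is given below. It assumes that the list of possible variant positions is shared between the encoder and the decoder and that a base is recorded for every listed position; the position list and function names are hypothetical.

```python
def encode_known_sites(calls: dict[int, str], known_positions: list[int]) -> str:
    """Write one base per known position, in order, and no coordinates at all."""
    # Assumption: every known position has a call; a missing one raises KeyError.
    return "".join(calls[position] for position in known_positions)


def decode_known_sites(bases: str, known_positions: list[int]) -> dict[int, str]:
    """Re-attach the shared coordinates to the stored bases."""
    if len(bases) != len(known_positions):
        raise ValueError("one base is expected per known position")
    return dict(zip(known_positions, bases))


# Hypothetical catalogue of variant positions known in advance.
known_positions = [123, 125, 130]
calls = {123: "C", 125: "T", 130: "G"}

compact = encode_known_sites(calls, known_positions)
print(compact)  # CTG
```

The saving depends entirely on the catalogue being complete: a variant at a position outside the list cannot be represented in this form.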
A universal approach to compressing genomic data may not necessarily be optimal, as a particular method may be more suitable for specific purposes and aims. Thus, several design choices that potentially impact compression performance merit consideration.
Selection of a reference sequence for relative compression can affect compression performance. Choosing a consensus reference sequence over a more specific reference sequence (e.g., the revised Cambridge Reference Sequence) can result in a higher compression ratio because the consensus reference may contain less bias in its data. Knowledge about the source of the sequence being compressed, however, may be exploited to achieve greater compression gains. The idea of using multiple reference sequences has been proposed. Brandon et al. (2009) alluded to the potential use of ethnic group-specific reference sequence templates, using the compression of mitochondrial DNA variant data as an example (see Figure 2). The authors found a biased haplotype distribution in the mitochondrial DNA sequences of Africans, Asians, and Eurasians relative to the revised Cambridge Reference Sequence. Their result suggests that the revised Cambridge Reference Sequence may not always be optimal, because a greater number of variants must be stored when it is used against data from ethnically distant individuals. Additionally, a reference sequence can be designed based on statistical properties or engineered to improve the compression ratio.
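The effect of reference choice on the number of stored variants can be illustrated with a small sketch. It assumes substitution-only differences and equal-length sequences; the reference names and all sequences are invented for illustration and do not correspond to real templates.

```python
def count_substitutions(reference: str, sample: str) -> int:
    """Number of positions at which the sample differs from the reference."""
    if len(reference) != len(sample):
        raise ValueError("substitution-only sketch: lengths must match")
    return sum(r != s for r, s in zip(reference, sample))


# Hypothetical candidate references.
references = {
    "ref_A": "ACGTACGAAC",
    "ref_B": "ACGTTCGTAA",
}
sample = "ACGTTCGTAC"

# Count the variants that would have to be stored against each candidate.
counts = {name: count_substitutions(seq, sample) for name, seq in references.items()}
best = min(counts, key=counts.get)
print(counts, best)
```

When several references are in use, the identifier of the chosen reference has to be stored alongside the variants so that the decoder can select the same one.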
Different types of encoding schemes have been explored to encode variant bases and genomic coordinates. Fixed codes, such as the Golomb code and the Rice code, are suitable when the distribution of the variants or coordinates (represented as integers) is well defined. Variable codes, such as the Huffman code, provide a more general entropy encoding scheme when the underlying variant and/or coordinate distribution is not well defined (this is typically the case in genomic sequence data).
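As an illustration of coding integer coordinates, the following sketch applies a Rice code (a Golomb code whose parameter is a power of two) to a list of gaps between variants. Bits are kept as a text string for readability; the parameter k = 1 and the function names are arbitrary choices for this example, not values taken from any particular tool.

```python
def rice_encode(value: int, k: int) -> str:
    """Rice code: quotient in unary (ones ended by a zero), remainder in k bits."""
    if value < 0 or k < 0:
        raise ValueError("value and k must be non-negative")
    quotient = value >> k
    remainder = value & ((1 << k) - 1)
    remainder_bits = format(remainder, f"0{k}b") if k > 0 else ""
    return "1" * quotient + "0" + remainder_bits


def rice_decode_stream(bits: str, k: int) -> list[int]:
    """Decode a concatenation of Rice codewords back into integers."""
    values = []
    i = 0
    while i < len(bits):
        quotient = 0
        while bits[i] == "1":  # unary part
            quotient += 1
            i += 1
        i += 1  # skip the terminating zero
        remainder = int(bits[i:i + k], 2) if k > 0 else 0
        i += k
        values.append((quotient << k) | remainder)
    return values


gaps = [0, 2, 5]  # the intervals from the '0C2T5G' example
stream = "".join(rice_encode(gap, 1) for gap in gaps)
print(stream)  # 001001101
```

Small gaps receive short codewords, which is why such codes work well when most intervals between variants are short.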
The compression ratio of currently available genomic data compression tools ranges between 65-fold and 1,200-fold for human genomes. Very close variants or revisions of the same genome can be compressed very efficiently (for example, a compression ratio of 18,133 was reported for two revisions of the same A. thaliana genome, which are 99.999% identical). However, such compression is not indicative of the typical compression ratio for different genomes (individuals) of the same organism. The most common encoding scheme amongst these tools is Huffman coding, which is used for lossless data compression.
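Huffman coding of variant bases can be sketched in a few lines using Python's standard library. This is a generic textbook construction, not the implementation used by any of the tools discussed here, and the input string is invented for illustration.

```python
import heapq
from collections import Counter


def huffman_code(symbols: str) -> dict[str, str]:
    """Build a prefix code in which frequent symbols get shorter codewords."""
    frequencies = Counter(symbols)
    if not frequencies:
        return {}
    if len(frequencies) == 1:
        return {next(iter(frequencies)): "0"}
    # Heap entries: (count, tie-breaker, partial code table for that subtree).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        count1, _, table1 = heapq.heappop(heap)
        count2, _, table2 = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in table1.items()}
        merged.update({sym: "1" + code for sym, code in table2.items()})
        heapq.heappush(heap, (count1 + count2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


def huffman_decode(bits: str, table: dict[str, str]) -> str:
    """Decode by matching codewords greedily; valid because the code is prefix-free."""
    inverse = {code: sym for sym, code in table.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            decoded.append(inverse[current])
            current = ""
    return "".join(decoded)


variant_bases = "CCCCCCTTTGGA"  # toy data with a skewed base distribution
table = huffman_code(variant_bases)
encoded = "".join(table[base] for base in variant_bases)
print(len(encoded), "bits versus", 2 * len(variant_bases), "bits at 2 bits per base")
```

The gain over a flat two-bit encoding comes only from the skew in the symbol frequencies; for a uniform distribution over four bases the two are equivalent.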
Genomic sequencing data compression tools compatible with standard genome sequencing file formats (BAM and FASTQ)
Giancarlo, R.; Scaturro, D.; Utro, F. (2009). "Textual data compression in computational biology: A synopsis". Bioinformatics. 25 (13): 1575–1586. doi:10.1093/bioinformatics/btp117. PMID 19251772. https://doi.org/10.1093%2Fbioinformatics%2Fbtp117
Nalbantoğlu, O. U.; Russell, D. J.; Sayood, K. (2010). "Data Compression Concepts and Algorithms and their Applications to Bioinformatics". Entropy. 12 (1): 34. doi:10.3390/e12010034. PMC 2821113. PMID 20157640. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2821113
Hosseini, Morteza; Pratas, Diogo; Pinho, Armando (2016). "A Survey on Data Compression Methods for Biological Sequences". Information. 7 (4): 56. doi:10.3390/info7040056. https://doi.org/10.3390%2Finfo7040056
Brandon, M. C.; Wallace, D. C.; Baldi, P. (2009). "Data structures and compression algorithms for genomic sequence data". Bioinformatics. 25 (14): 1731–1738. doi:10.1093/bioinformatics/btp319. PMC 2705231. PMID 19447783. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2705231
Deorowicz, S.; Grabowski, S. (2011). "Robust relative compression of genomes with random access". Bioinformatics. 27 (21): 2979–2986. doi:10.1093/bioinformatics/btr505. PMID 21896510. https://doi.org/10.1093%2Fbioinformatics%2Fbtr505
Wang, C.; Zhang, D. (2011). "A novel compression tool for efficient storage of genome resequencing data". Nucleic Acids Research. 39 (7): e45. doi:10.1093/nar/gkr009. PMC 3074166. PMID 21266471. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074166
Pinho, A. J.; Pratas, D.; Garcia, S. P. (2012). "GReEn: A tool for efficient compression of genome resequencing data". Nucleic Acids Research. 40 (4): e27. doi:10.1093/nar/gkr1124. PMC 3287168. PMID 22139935. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3287168
Tembe, W.; Lowey, J.; Suh, E. (2010). "G-SQZ: Compact encoding of genomic sequence and quality data". Bioinformatics. 26 (17): 2192–2194. doi:10.1093/bioinformatics/btq346. PMID 20605925.
Christley, S.; Lu, Y.; Li, C.; Xie, X. (2009). "Human genomes as email attachments". Bioinformatics. 25 (2): 274–275. doi:10.1093/bioinformatics/btn582. PMID 18996942. https://doi.org/10.1093%2Fbioinformatics%2Fbtn582
Pavlichin, D. S.; Weissman, T.; Yona, G. (2013). "The human genome contracts again". Bioinformatics. 29 (17): 2199–2202. doi:10.1093/bioinformatics/btt362. PMID 23793748. https://doi.org/10.1093%2Fbioinformatics%2Fbtt362
Kuruppu, Shanika; Puglisi, Simon J.; Zobel, Justin (2011). "Reference Sequence Construction for Relative Compression of Genomes". String Processing and Information Retrieval. Lecture Notes in Computer Science. Vol. 7024. pp. 420–425. doi:10.1007/978-3-642-24583-1_41. ISBN 978-3-642-24582-4. S2CID 16007637.
Grabowski, Szymon; Deorowicz, Sebastian (2011). "Engineering Relative Compression of Genomes". arXiv:1103.2351 [cs.CE].
Pratas, D., Pinho, A. J., and Ferreira, P. J. S. G. Efficient compression of genomic sequences. Data Compression Conference, Snowbird, Utah, 2016.
"The Importance of Data Compression in the Field of Genomics". IEEE Pulse. 2019-04-26. Retrieved 2024-02-22. https://www.embs.org/pulse/articles/the-importance-of-data-compression-in-the-field-of-genomics/
Lan, Divon; Llamas, Bastien (14 September 2022). "Genozip 14 - advances in compression of BAM and CRAM files". bioRxiv. doi:10.1101/2022.09.12.507582. S2CID 252357508.
Lan, Divon; Hughes, Daniel S T; Llamas, Bastien (7 July 2023). "Deep FASTQ and BAM co-compression in Genozip 15". bioRxiv. doi:10.1101/2023.07.07.548069. S2CID 259764998.
Lan, Divon; Tobler, Ray; Souilmi, Yassine; Llamas, Bastien (25 August 2021). "Genozip: a universal extensible genomic data compressor". Bioinformatics. 37 (16): 2225–2230. doi:10.1093/bioinformatics/btab102. PMC 8388020. PMID 33585897. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8388020
CRAM benchmarking http://www.htslib.org/benchmarks/CRAM.html
CRAM format specification (version 3.0) https://samtools.github.io/hts-specs/CRAMv3.pdf
"ISO/IEC 23092-2:2019 Information technology — Genomic information representation — Part 2: Coding of genomic information". iso.org. https://www.iso.org/standard/73536.html
Alberti, Claudio; Paridaens, Tom; Voges, Jan; Naro, Daniel; Ahmad, Junaid J.; Ravasi, Massimo; Renzi, Daniele; Zoia, Giorgio; Ochoa, Idoia; Mattavelli, Marco; Delgado, Jaime; Hernaez, Mikel (27 September 2018). "An introduction to MPEG-G, the new ISO standard for genomic information representation". bioRxiv 10.1101/426353.
Hoogstrate, Youri; Jenster, Guido W.; van de Werken, Harmen J. G. (December 2021). "FASTAFS: file system virtualisation of random access compressed FASTA files". BMC Bioinformatics. 22 (1): 535. doi:10.1186/s12859-021-04455-3. PMC 8558547. PMID 34724897. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8558547