Alignment-free methods can broadly be classified into six categories: a) methods based on k-mer/word frequency, b) methods based on the length of common substrings, c) methods based on the number of (spaced) word matches, d) methods based on micro-alignments, e) methods based on information theory, and f) methods based on graphical representation. Alignment-free approaches have been used in sequence similarity searches, in clustering and classification of sequences, and more recently in phylogenetics (Figure 1).
Such molecular phylogeny analyses employing alignment-free approaches are said to be part of next-generation phylogenomics. A number of review articles cover alignment-free methods in sequence analysis in depth.
The FFP-based method starts by counting each possible k-mer in each sequence (there are $4^k$ possible k-mers for nucleotide sequences and $20^k$ for protein sequences). Each k-mer count is then normalized by dividing it by the total count of all k-mers in that sequence, which converts each sequence into its feature frequency profile (FFP). The pairwise distance between two sequences is then calculated as the Jensen–Shannon (JS) divergence between their respective FFPs. The resulting distance matrix can be used to construct a phylogenetic tree using clustering algorithms such as neighbor-joining or UPGMA.
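The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the published FFP implementation: the function names `ffp`, `js_divergence` and `ffp_distance` are hypothetical, profiles are stored sparsely (only observed k-mers), and base-2 logarithms are an assumption.

```python
import math
from collections import Counter

def ffp(seq, k):
    """Feature frequency profile: maps each observed k-mer to its relative frequency."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence between two sparse profiles (base-2 logs, assumed)."""
    keys = set(p) | set(q)
    # m is the midpoint distribution of p and q
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in keys}

    def kl(a):
        # Kullback-Leibler divergence of a from m; absent k-mers contribute 0
        return sum(a[x] * math.log2(a[x] / m[x]) for x in a if a[x] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

def ffp_distance(a, b, k=3):
    """Pairwise distance between two sequences via their FFPs."""
    return js_divergence(ffp(a, k), ffp(b, k))
```

Calling `ffp_distance` on every pair of sequences fills the distance matrix that a clustering method such as neighbor-joining or UPGMA would then consume.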
The FCGR methods evolved from the chaos game representation (CGR) technique, which provides a scale-independent representation of genomic sequences. A CGR can be divided by grid lines so that each grid square denotes the occurrences of an oligonucleotide of a specific length in the sequence. Such a representation is termed a Frequency Chaos Game Representation (FCGR), and each sequence is converted into its FCGR. The pairwise distance between the FCGRs of two sequences can be calculated using the Pearson distance, the Hamming distance or the Euclidean distance.
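As a rough sketch of how an FCGR can be built, the following Python code places each k-mer directly into its grid square of a $2^k \times 2^k$ matrix. The corner assignment (A, C, G, T at (0,0), (0,1), (1,1), (1,0)) and the names `fcgr` and `euclidean` are assumptions made for illustration; published tools may use other conventions.

```python
import math

# Assumed corner assignment of the unit square: (x bit, y bit) per nucleotide
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}

def fcgr(seq, k):
    """Return a 2^k x 2^k grid of relative k-mer frequencies, indexed grid[y][x]."""
    size = 2 ** k
    grid = [[0.0] * size for _ in range(size)]
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(ch not in CORNERS for ch in kmer):
            continue  # skip k-mers with ambiguous symbols
        x = y = 0
        # In the chaos game the last letter decides the coarsest subdivision,
        # so it becomes the most significant bit of the cell index.
        for ch in reversed(kmer):
            bx, by = CORNERS[ch]
            x = (x << 1) | bx
            y = (y << 1) | by
        grid[y][x] += 1
        total += 1
    if total:
        grid = [[c / total for c in row] for row in grid]
    return grid

def euclidean(g1, g2):
    """Euclidean distance between two FCGR grids of equal size."""
    return math.sqrt(sum((a - b) ** 2
                         for r1, r2 in zip(g1, g2)
                         for a, b in zip(r1, r2)))
```

Indexing cells from the k-mer itself, instead of iterating floating-point CGR coordinates, avoids rounding problems on long sequences.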
While most alignment-free algorithms compare the word composition of sequences, Spaced Words uses a pattern of match ("care") and don't-care positions. The occurrence of a spaced word in a sequence is then defined by the characters at the match positions only, while the characters at the don't-care positions are ignored. Instead of comparing the frequencies of contiguous words in the input sequences, this approach compares the frequencies of the spaced words according to the pre-defined pattern. The pre-defined pattern can be selected by analyzing the variance of the number of matches, the probability of the first occurrence under several models, or the Pearson correlation coefficient between the expected word frequency and the true alignment distance.
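To make the idea of a spaced word concrete, here is a small Python sketch that counts spaced words for a binary pattern, where `1` marks a match position and `0` a don't-care position. The pattern encoding and the function name `spaced_word_counts` are illustrative assumptions, not the interface of the Spaced Words software.

```python
from collections import Counter

def spaced_word_counts(seq, pattern):
    """Count spaced words in seq; pattern is a string of '1' (match) and '0' (don't care)."""
    match_positions = [i for i, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    counts = Counter()
    for start in range(len(seq) - span + 1):
        window = seq[start:start + span]
        # Only the characters at match positions define the spaced word
        counts["".join(window[i] for i in match_positions)] += 1
    return counts
```

For example, with the pattern `101` the windows `ACG` and `ATG` both yield the spaced word `AG`, because the middle character is ignored. The resulting counts can be normalized and compared with the same distance measures used for contiguous words.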
This measure $d(A,B)$ is not symmetric, so one has to compute $d_{s}(A,B)=d_{s}(B,A)=(d(A,B)+d(B,A))/2$, which gives the final ACS measure between the two strings (A and B). The subsequence/substring search can be efficiently performed by using suffix trees.
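The symmetrization can be illustrated with a brute-force Python sketch. It assumes the commonly cited ACS definition, in which $L(A,B)$ is the average length of the longest substring starting at each position of A that also occurs in B, and $d(A,B)=\log|B|/L(A,B)-\log|A|/L(A,A)$; that definition and the function names are assumptions here. The substring search is done naively instead of with suffix trees, so it is only suitable for short strings.

```python
import math

def avg_longest_match(a, b):
    """L(a, b): mean, over positions i of a, of the longest prefix of a[i:] occurring in b."""
    total = 0
    for i in range(len(a)):
        length = 0
        # Extend while the current substring still occurs somewhere in b
        while i + length < len(a) and a[i:i + length + 1] in b:
            length += 1
        total += length
    return total / len(a)

def d(a, b):
    """Asymmetric ACS measure (assumed definition; undefined if a and b share no symbol)."""
    return (math.log(len(b)) / avg_longest_match(a, b)
            - math.log(len(a)) / avg_longest_match(a, a))

def acs_distance(a, b):
    """Symmetric ACS distance d_s(A, B) = (d(A, B) + d(B, A)) / 2."""
    return (d(a, b) + d(b, a)) / 2
```

The correction term makes `d(a, a)` equal to zero, and averaging the two directions makes the result independent of argument order.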
This approach is a generalization of the ACS approach. To define the distance between two DNA or protein sequences, kmacs estimates, for each position i of the first sequence, the longest substring starting at i and matching a substring of the second sequence with up to k mismatches. It defines the average of these values as a measure of similarity between the sequences and turns this into a symmetric distance measure. Kmacs does not compute exact k-mismatch substrings, since this would be computationally too costly, but approximates such substrings.
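The quantity that kmacs approximates can be written down exactly with a brute-force search, which is useful for understanding the definition on short strings. The sketch below is not the kmacs heuristic: it computes the exact k-mismatch lengths in quadratic time, and the function names are hypothetical.

```python
def longest_k_mismatch(a, b, i, k):
    """Length of the longest substring of a starting at i that matches
    some substring of b with at most k mismatches (exact, brute force)."""
    best = 0
    for j in range(len(b)):
        mismatches = 0
        length = 0
        while i + length < len(a) and j + length < len(b):
            if a[i + length] != b[j + length]:
                mismatches += 1
                if mismatches > k:
                    break  # the (k+1)-th mismatch ends the extension
            length += 1
        best = max(best, length)
    return best

def avg_k_mismatch_length(a, b, k):
    """Average of the k-mismatch lengths over all positions of a."""
    return sum(longest_k_mismatch(a, b, i, k) for i in range(len(a))) / len(a)
```

With `k = 0` this reduces to the exact-match lengths used by ACS.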
This approach is closely related to the ACS. It calculates the number of substitutions per site between two DNA sequences using the shortest absent substring (termed a shustring).
These approaches are variants of the $D_{2}$ statistic, which counts the number of $k$-mer matches between two sequences. They improve the simple $D_{2}$ statistic by taking the background distribution of the compared sequences into account.
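For reference, the plain $D_{2}$ statistic can be computed as the sum, over all $k$-mers, of the product of their counts in the two sequences. The Python sketch below shows only this basic statistic; the background-corrected variants are not implemented, and the function names are illustrative.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2(a, b, k):
    """Plain D2 statistic: number of k-mer matches between a and b."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # A k-mer occurring x times in a and y times in b contributes x * y matches
    return sum(count * cb[kmer] for kmer, count in ca.items())
```

Because a frequent word inflates the raw count regardless of relatedness, the variants described above adjust the word counts for what the background distribution would predict.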
This is an extremely fast method that uses the MinHash bottom sketch strategy for estimating the Jaccard index of the multi-sets of $k$-mers of two input sequences. That is, it estimates the ratio of $k$-mer matches to the total number of $k$-mers of the sequences. This can be used, in turn, to estimate the evolutionary distances between the compared sequences, measured as the number of substitutions per sequence position since the sequences evolved from their last common ancestor.
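A bottom-sketch Jaccard estimate can be sketched as follows. This is an illustration under several assumptions: $k$-mers are treated as plain sets, BLAKE2b from Python's `hashlib` stands in for whatever hash function a real tool uses, the function names are hypothetical, and the conversion from the Jaccard estimate $j$ to a distance uses the formula $D=-\frac{1}{k}\ln\frac{2j}{1+j}$ from the Mash publication.

```python
import hashlib
import math

def kmer_hashes(seq, k):
    """Set of 64-bit hash values of all k-mers in seq."""
    return {
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big")
        for i in range(len(seq) - k + 1)
    }

def bottom_sketch(seq, k, s):
    """Bottom-s sketch: the s smallest k-mer hash values."""
    return sorted(kmer_hashes(seq, k))[:s]

def jaccard_estimate(sketch_a, sketch_b, s):
    """Estimate the Jaccard index from two bottom-s sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)

def mash_distance(j, k):
    """Distance from a Jaccard estimate j; returns 1.0 when nothing is shared."""
    if j <= 0:
        return 1.0
    return -math.log(2 * j / (1 + j)) / k
```

When the sketch size `s` is at least the number of distinct $k$-mers, the estimate coincides with the exact Jaccard index; smaller sketches trade accuracy for speed and memory.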
This approach calculates a distance value between two protein sequences based on the decay of the number of $k$-mer matches as $k$ increases.
This method calculates the number $N_{k}$ of $k$-mer or spaced-word matches (SpaM) for different values of the word length or number of match positions $k$ in the underlying pattern, respectively. The slope of an affine-linear function $F$ that depends on $N_{k}$ is calculated to estimate the Jukes-Cantor distance between the input sequences.
andi estimates phylogenetic distances between genomic sequences based on ungapped local alignments that are flanked by maximal exact word matches. Such word matches can be efficiently found using suffix arrays. The gap-free alignments between the exact word matches are then used to estimate phylogenetic distances between genome sequences. The resulting distance estimates are accurate for up to around 0.6 substitutions per position.
Prot-SpaM (Proteome-based Spaced-word Matches) is an implementation of the FSWM algorithm for partial or whole proteome sequences.
Multi-SpaM (Multiple Spaced-word Matches) is an approach to genome-based phylogeny reconstruction that extends the FSWM idea to multiple sequence comparison. Given a binary pattern P of match positions and don't-care positions, the program searches for P-blocks, i.e. local gap-free four-way alignments with matching nucleotides at the match positions of P and possible mismatches at the don't-care positions. Such four-way alignments are randomly sampled from a set of input genome sequences. For each P-block, an unrooted tree topology is calculated using RAxML. The program Quartet MaxCut is then used to calculate a supertree from these trees.
Base–base correlation (BBC) converts the genome sequence into a unique 16-dimensional numeric vector using the following equation:

$$T_{ij}(K)=\sum_{\ell=1}^{K}P_{ij}(\ell)\cdot\log_{2}\left(\frac{P_{ij}(\ell)}{P_{i}P_{j}}\right)$$
Here $P_{i}$ and $P_{j}$ denote the probabilities of bases i and j in the genome, and $P_{ij}(\ell)$ indicates the probability of bases i and j at distance ℓ in the genome. The parameter K indicates the maximum distance between the bases i and j. The variation in the values of the 16 parameters reflects variation in genome content and length.
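The BBC vector can be computed directly from these definitions. In the Python sketch below, the probabilities are estimated as relative frequencies, base pairs that never occur at a given distance are taken to contribute zero, and the ordering of the 16 components (AA, AC, ..., TT) is an arbitrary choice; all three are assumptions, as is the function name `bbc_vector`.

```python
import math
from collections import Counter

BASES = "ACGT"

def bbc_vector(seq, K):
    """16-dimensional base-base correlation vector T_ij(K), ordered AA, AC, ..., TT."""
    n = len(seq)
    # P_i: relative frequency of each base in the sequence
    base_p = {b: seq.count(b) / n for b in BASES}
    t = {(i, j): 0.0 for i in BASES for j in BASES}
    for ell in range(1, K + 1):
        total = n - ell  # number of position pairs at distance ell
        if total <= 0:
            break
        pairs = Counter(zip(seq, seq[ell:]))
        for (i, j), count in pairs.items():
            if (i, j) not in t:
                continue  # ignore pairs involving non-ACGT symbols
            p_ij = count / total
            t[(i, j)] += p_ij * math.log2(p_ij / (base_p[i] * base_p[j]))
    return [t[(i, j)] for i in BASES for j in BASES]
```

Each component is a sum of pointwise mutual-information terms, so a sequence whose bases occur independently at every distance up to K yields values close to zero.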
The IC-PIC (information correlation and partial information correlation) based method employs the base correlation property of DNA sequences. IC and PIC are calculated using the following formulas:

$$IC_{\ell}=-2\sum_{i}P_{i}\log_{2}P_{i}+\sum_{ij}P_{ij}(\ell)\log_{2}P_{ij}(\ell)$$

$$PIC_{ij}(\ell)=(P_{ij}(\ell)-P_{i}P_{j}(\ell))^{2}$$

where $\ell$ is the distance between bases; the values of $\ell$ that are used define the range of distance between bases.
In context modeling complexity, the next-symbol predictions of one or more statistical models are combined, or compete, to yield a prediction that is based on events recorded in the past. The algorithmic information content derived from each symbol prediction can be used to compute algorithmic information profiles in time proportional to the length of the sequence. The process has been applied to DNA sequence analysis.
The use of iterated maps for sequence analysis was first introduced by HJ Jeffrey in 1990, when he proposed applying the Chaos Game to map genomic sequences into a unit square. That report named the procedure Chaos Game Representation (CGR). Only three years later, however, N Goldman dismissed the approach as a projection of a Markov transition table. This objection was overturned by the end of that decade, when the opposite was found to be the case: CGR bijectively maps Markov transition tables into a fractal, order-free (degree-free) representation. The realization that iterated maps provide a bijective map between the symbolic space and the numeric space led to the identification of a variety of alignment-free approaches to sequence comparison and characterization. These developments were reviewed in late 2013 by JS Almeida. A number of web apps, such as https://github.com/usm/usm.github.com/wiki, are available to demonstrate how to encode and compare arbitrary symbolic sequences in a manner that takes full advantage of modern MapReduce distribution developed for cloud computing.
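The bijection between symbolic and numeric space can be demonstrated with a short Python sketch of the chaos game. It uses exact rational arithmetic so that decoding recovers the sequence without rounding error; the corner assignment and the names `cgr_encode` and `cgr_decode` are assumptions made for illustration.

```python
from fractions import Fraction

# Assumed corner assignment of the unit square
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
BY_CORNER = {corner: base for base, corner in CORNERS.items()}

def cgr_encode(seq):
    """Map a sequence to a point in the unit square by the chaos game."""
    x = y = Fraction(1, 2)  # start at the centre
    for ch in seq:
        cx, cy = CORNERS[ch]
        # Move halfway from the current point toward the corner of ch
        x, y = (x + cx) / 2, (y + cy) / 2
    return x, y

def cgr_decode(x, y, length):
    """Recover the sequence of the given length from its CGR point."""
    out = []
    for _ in range(length):
        # The quadrant containing the point identifies the last symbol
        cx, cy = int(x >= Fraction(1, 2)), int(y >= Fraction(1, 2))
        out.append(BY_CORNER[(cx, cy)])
        x, y = 2 * x - cx, 2 * y - cy  # undo one chaos-game step
    return "".join(reversed(out))
```

For instance, `cgr_encode("AG")` gives the point (5/8, 5/8), and `cgr_decode` applied to that point with length 2 returns `"AG"` again.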
Vinga S, Almeida J (March 2003). "Alignment-free sequence comparison-a review". Bioinformatics. 19 (4): 513–523. doi:10.1093/bioinformatics/btg005. PMID 12611807. https://doi.org/10.1093%2Fbioinformatics%2Fbtg005
Rothberg J, Merriman B, Higgs G (September 2012). "Bioinformatics. Introduction". The Yale Journal of Biology and Medicine. 85 (3): 305–308. PMC 3447194. PMID 23189382. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3447194
Batzoglou S (March 2005). "The many faces of sequence alignment". Briefings in Bioinformatics. 6 (1): 6–22. doi:10.1093/bib/6.1.6. PMID 15826353. https://doi.org/10.1093%2Fbib%2F6.1.6
Mullan L (March 2006). "Pairwise sequence alignment--it's all about us!". Briefings in Bioinformatics. 7 (1): 113–115. doi:10.1093/bib/bbk008. PMID 16761368. /wiki/Doi_(identifier)
Kemena C, Notredame C (October 2009). "Upcoming challenges for multiple sequence alignment methods in the high-throughput era". Bioinformatics. 25 (19): 2455–2465. doi:10.1093/bioinformatics/btp452. PMC 2752613. PMID 19648142. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2752613
Hide W, Burke J, Davison DB (1994). "Biological evaluation of d2, an algorithm for high-performance sequence comparison". Journal of Computational Biology. 1 (3): 199–215. doi:10.1089/cmb.1994.1.199. PMID 8790465. /wiki/Doi_(identifier)
Miller RT, Christoffels AG, Gopalakrishnan C, Burke J, Ptitsyn AA, Broveak TR, Hide WA (November 1999). "A comprehensive approach to clustering of expressed human gene sequence: the sequence tag alignment and consensus knowledge base". Genome Research. 9 (11): 1143–1155. doi:10.1101/gr.9.11.1143. PMC 310831. PMID 10568754. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC310831
Domazet-Lošo M, Haubold B (June 2011). "Alignment-free detection of local similarity among viral and bacterial genomes". Bioinformatics. 27 (11): 1466–1472. doi:10.1093/bioinformatics/btr176. PMID 21471011. https://doi.org/10.1093%2Fbioinformatics%2Fbtr176
Chan CX, Ragan MA (January 2013). "Next-generation phylogenomics". Biology Direct. 8: 3. doi:10.1186/1745-6150-8-3. PMC 3564786. PMID 23339707. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3564786
Chan CX, Ragan MA (January 2013). "Next-generation phylogenomics". Biology Direct. 8: 3. doi:10.1186/1745-6150-8-3. PMC 3564786. PMID 23339707. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3564786
Vinga S, Almeida J (March 2003). "Alignment-free sequence comparison-a review". Bioinformatics. 19 (4): 513–523. doi:10.1093/bioinformatics/btg005. PMID 12611807. https://doi.org/10.1093%2Fbioinformatics%2Fbtg005
Song K, Ren J, Reinert G, Deng M, Waterman MS, Sun F (May 2014). "New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing". Briefings in Bioinformatics. 15 (3): 343–353. doi:10.1093/bib/bbt067. PMC 4017329. PMID 24064230. /wiki/Gesine_Reinert
Haubold B (May 2014). "Alignment-free phylogenetics and population genetics". Briefings in Bioinformatics. 15 (3): 407–418. doi:10.1093/bib/bbt083. PMID 24291823. https://doi.org/10.1093%2Fbib%2Fbbt083
Bonham-Carter O, Steele J, Bastola D (November 2014). "Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis". Briefings in Bioinformatics. 15 (6): 890–905. doi:10.1093/bib/bbt052. PMC 4296134. PMID 23904502. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4296134
Zielezinski A, Vinga S, Almeida J, Karlowski WM (October 2017). "Alignment-free sequence comparison: benefits, applications, and tools". Genome Biology. 18 (1): 186. doi:10.1186/s13059-017-1319-7. PMC 5627421. PMID 28974235. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5627421
Bernard G, Chan CX, Chan YB, Chua XY, Cong Y, Hogan JM, et al. (March 2019). "Alignment-free inference of hierarchical and reticulate phylogenomic relationships". Briefings in Bioinformatics. 20 (2): 426–435. doi:10.1093/bib/bbx067. PMC 6433738. PMID 28673025. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6433738
Ren J, Bai X, Lu YY, Tang K, Wang Y, Reinert G, Sun F (July 2018). "Alignment-Free Sequence Analysis and Applications". Annual Review of Biomedical Data Science. 1: 93–114. arXiv:1803.09727. Bibcode:2018arXiv180309727R. doi:10.1146/annurev-biodatasci-080917-013431. PMC 6905628. PMID 31828235. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6905628
Zielezinski A, Girgis HZ, Bernard G, Leimeister CA, Tang K, Dencker T, et al. (July 2019). "Benchmarking of alignment-free sequence comparison methods". Genome Biology. 20 (1): 144. doi:10.1186/s13059-019-1755-7. PMC 6659240. PMID 31345254. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6659240
Sims GE, Jun SR, Wu GA, Kim SH (October 2009). "Whole-genome phylogeny of mammals: evolutionary information in genic and nongenic regions". Proceedings of the National Academy of Sciences of the United States of America. 106 (40): 17077–17082. Bibcode:2009PNAS..10617077S. doi:10.1073/pnas.0909377106. PMC 2761373. PMID 19805074. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2761373
Sims GE, Kim SH (May 2011). "Whole-genome phylogeny of Escherichia coli/Shigella group by feature frequency profiles (FFPs)". Proceedings of the National Academy of Sciences of the United States of America. 108 (20): 8329–8334. Bibcode:2011PNAS..108.8329S. doi:10.1073/pnas.1105168108. PMC 3100984. PMID 21536867. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3100984
Gao L, Qi J (March 2007). "Whole genome molecular phylogeny of large dsDNA viruses using composition vector method". BMC Evolutionary Biology. 7 (1): 41. Bibcode:2007BMCEE...7...41G. doi:10.1186/1471-2148-7-41. PMC 1839080. PMID 17359548. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1839080
Wang H, Xu Z, Gao L, Hao B (August 2009). "A fungal phylogeny based on 82 complete genomes using the composition vector method". BMC Evolutionary Biology. 9 (1): 195. Bibcode:2009BMCEE...9..195W. doi:10.1186/1471-2148-9-195. PMC 3087519. PMID 19664262. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3087519
Kolekar P, Kale M, Kulkarni-Kale U (November 2012). "Alignment-free distance measure based on return time distribution for sequence analysis: applications to clustering, molecular phylogeny and subtyping". Molecular Phylogenetics and Evolution. 65 (2): 510–522. doi:10.1016/j.ympev.2012.07.003. PMID 22820020. /wiki/Doi_(identifier)
Hatje K, Kollmar M (2012). "A phylogenetic analysis of the brassicales clade based on an alignment-free sequence comparison method". Frontiers in Plant Science. 3: 192. doi:10.3389/fpls.2012.00192. PMC 3429886. PMID 22952468. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3429886
Leimeister CA, Boden M, Horwege S, Lindner S, Morgenstern B (July 2014). "Fast alignment-free sequence comparison using spaced-word frequencies". Bioinformatics. 30 (14): 1991–1999. doi:10.1093/bioinformatics/btu177. PMC 4080745. PMID 24700317. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4080745
Apostolico A, Denas O (October 2008). "Fast algorithms for computing sequence distances by exhaustive substring composition". Algorithms for Molecular Biology. 3: 13. doi:10.1186/1748-7188-3-13. PMC 2615014. PMID 18957094. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2615014
Apostolico A, Denas O, Dress A (September 2010). "Efficient tools for comparative substring analysis". Journal of Biotechnology. 149 (3): 120–126. doi:10.1016/j.jbiotec.2010.05.006. PMID 20682467. /wiki/Doi_(identifier)
Jeffrey HJ (April 1990). "Chaos game representation of gene structure". Nucleic Acids Research. 18 (8): 2163–2170. doi:10.1093/nar/18.8.2163. PMC 330698. PMID 2336393. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC330698
Wang Y, Hill K, Singh S, Kari L (February 2005). "The spectrum of genomic signatures: from dinucleotides to chaos game representation". Gene. 346: 173–185. doi:10.1016/j.gene.2004.10.021. PMID 15716010. /wiki/Doi_(identifier)
Leimeister CA, Boden M, Horwege S, Lindner S, Morgenstern B (July 2014). "Fast alignment-free sequence comparison using spaced-word frequencies". Bioinformatics. 30 (14): 1991–1999. doi:10.1093/bioinformatics/btu177. PMC 4080745. PMID 24700317. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4080745
Hahn L, Leimeister CA, Ounit R, Lonardi S, Morgenstern B (October 2016). "rasbhari: Optimizing Spaced Seeds for Database Searching, Read Mapping and Alignment-Free Sequence Comparison". PLOS Computational Biology. 12 (10): e1005107. arXiv:1511.04001. Bibcode:2016PLSCB..12E5107H. doi:10.1371/journal.pcbi.1005107. PMC 5070788. PMID 27760124. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5070788
Noé L (Feb 14, 2017). "Best hits of 11110110111: model-free selection and parameter-free sensitivity calculation of spaced seeds". Algorithms for Molecular Biology. 12 (1): 1. doi:10.1186/s13015-017-0092-1. PMC 5310094. PMID 28289437. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5310094
Noé L, Martin DE (December 2014). "A coverage criterion for spaced seeds and its applications to support vector machine string kernels and k-mer distances". Journal of Computational Biology. 21 (12): 947–963. arXiv:1412.2587. Bibcode:2014arXiv1412.2587N. doi:10.1089/cmb.2014.0173. PMC 4253314. PMID 25393923. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4253314
Gusfield D (1997). Algorithms on strings, trees, and sequences: computer science and computational biology (Reprinted (with corr.) ed.). Cambridge [u.a.]: Cambridge Univ. Press. ISBN 9780521585194. 9780521585194
Ulitsky I, Burstein D, Tuller T, Chor B (March 2006). "The average common substring approach to phylogenomic reconstruction". Journal of Computational Biology. 13 (2): 336–350. CiteSeerX 10.1.1.106.5122. doi:10.1089/cmb.2006.13.336. PMID 16597244. /wiki/CiteSeerX_(identifier)
Weiner P (1973). "Linear pattern matching algorithms". 14th Annual Symposium on Switching and Automata Theory (swat 1973). pp. 1–11. CiteSeerX 10.1.1.474.9582. doi:10.1109/SWAT.1973.13. /wiki/CiteSeerX_(identifier)
He D (2006). "Using suffix tree to discover complex repetitive patterns in DNA sequences". 2006 International Conference of the IEEE Engineering in Medicine and Biology Society. Vol. 1. pp. 3474–7. doi:10.1109/IEMBS.2006.260445. ISBN 978-1-4244-0032-4. PMID 17945779. S2CID 5953866. 978-1-4244-0032-4
Välimäki N, Gerlach W, Dixit K, Mäkinen V (March 2007). "Compressed suffix tree--a basis for genome-scale sequence analysis". Bioinformatics. 23 (5): 629–630. doi:10.1093/bioinformatics/btl681. PMID 17237063. https://doi.org/10.1093%2Fbioinformatics%2Fbtl681
Leimeister CA, Morgenstern B (July 2014). "Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison". Bioinformatics. 30 (14): 2000–2008. doi:10.1093/bioinformatics/btu331. PMC 4080746. PMID 24828656. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4080746
Haubold B, Pfaffelhuber P, Domazet-Loso M, Wiehe T (October 2009). "Estimating mutation distances from unaligned genomes". Journal of Computational Biology. 16 (10): 1487–1500. doi:10.1089/cmb.2009.0106. hdl:11858/00-001M-0000-000F-D624-D. PMID 19803738. /wiki/Doi_(identifier)
Leimeister CA, Morgenstern B (July 2014). "Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison". Bioinformatics. 30 (14): 2000–2008. doi:10.1093/bioinformatics/btu331. PMC 4080746. PMID 24828656. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4080746
Morgenstern B, Schöbel S, Leimeister CA (2017). "Phylogeny reconstruction based on the length distribution of k-mismatch common substrings". Algorithms for Molecular Biology. 12: 27. doi:10.1186/s13015-017-0118-8. PMC 5724348. PMID 29238399. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5724348
Reinert G, Chew D, Sun F, Waterman MS (December 2009). "Alignment-free sequence comparison (I): statistics and power". Journal of Computational Biology. 16 (12): 1615–1634. doi:10.1089/cmb.2009.0198. PMC 2818754. PMID 20001252. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2818754
Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM (June 2016). "Mash: fast genome and metagenome distance estimation using MinHash". Genome Biology. 17 (1): 132. doi:10.1186/s13059-016-0997-x. PMC 4915045. PMID 27323842. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4915045
Bromberg R, Grishin NV, Otwinowski Z (June 2016). "Phylogeny Reconstruction with Alignment-Free Method That Corrects for Horizontal Gene Transfer". PLOS Computational Biology. 12 (6): e1004985. Bibcode:2016PLSCB..12E4985B. doi:10.1371/journal.pcbi.1004985. PMC 4918981. PMID 27336403. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4918981
Röhling S, Linne A, Schellhorn J, Hosseini M, Dencker T, Morgenstern B (2020). "The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances". PLOS ONE. 15 (2): e0228070. Bibcode:2020PLoSO..1528070R. doi:10.1371/journal.pone.0228070. PMC 7010260. PMID 32040534. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7010260
Sarmashghi S, Bohmann K, P Gilbert MT, Bafna V, Mirarab S (February 2019). "Skmer: assembly-free and alignment-free sample identification using genome skims". Genome Biology. 20 (1): 34. doi:10.1186/s13059-019-1632-4. PMC 6374904. PMID 30760303. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6374904
Yi H, Jin L (April 2013). "Co-phylog: an assembly-free phylogenomic approach for closely related organisms". Nucleic Acids Research. 41 (7): e75. doi:10.1093/nar/gkt003. PMC 3627563. PMID 23335788. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3627563
Haubold B, Klötzl F, Pfaffelhuber P (April 2015). "andi: fast and accurate estimation of evolutionary distances between closely related genomes". Bioinformatics. 31 (8): 1169–1175. doi:10.1093/bioinformatics/btu815. PMID 25504847. https://doi.org/10.1093%2Fbioinformatics%2Fbtu815
Leimeister CA, Sohrabi-Jahromi S, Morgenstern B (April 2017). "Fast and accurate phylogeny reconstruction using filtered spaced-word matches". Bioinformatics. 33 (7): 971–979. doi:10.1093/bioinformatics/btw776. PMC 5409309. PMID 28073754. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5409309
Lau AK, Dörrer S, Leimeister CA, Bleidorn C, Morgenstern B (December 2019). "Read-SpaM: assembly-free and alignment-free comparison of bacterial genomes with low sequencing coverage". BMC Bioinformatics. 20 (Suppl 20): 638. doi:10.1186/s12859-019-3205-7. PMC 6916211. PMID 31842735. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6916211
Leimeister CA, Schellhorn J, Dörrer S, Gerth M, Bleidorn C, Morgenstern B (March 2019). "Prot-SpaM: fast alignment-free phylogeny reconstruction based on whole-proteome sequences". GigaScience. 8 (3): giy148. doi:10.1093/gigascience/giy148. PMC 6436989. PMID 30535314. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6436989
Dencker T, Leimeister CA, Gerth M, Bleidorn C, Snir S, Morgenstern B (March 2020). "'Multi-SpaM': a maximum-likelihood approach to phylogeny reconstruction using multiple spaced-word matches and quartet trees". NAR Genomics and Bioinformatics. 2 (1): lqz013. doi:10.1093/nargab/lqz013. PMC 7671388. PMID 33575565. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7671388
Stamatakis A (November 2006). "RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models". Bioinformatics. 22 (21): 2688–2690. doi:10.1093/bioinformatics/btl446. PMID 16928733. https://doi.org/10.1093%2Fbioinformatics%2Fbtl446
Vinga S (May 2014). "Information theory applications for biological sequence analysis". Briefings in Bioinformatics. 15 (3): 376–389. doi:10.1093/bib/bbt068. PMC 7109941. PMID 24058049. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7109941
Liu Z, Meng J, Sun X (April 2008). "A novel feature-based method for whole genome phylogenetic analysis without alignment: application to HEV genotyping and subtyping". Biochemical and Biophysical Research Communications. 368 (2): 223–230. doi:10.1016/j.bbrc.2008.01.070. PMID 18230342. /wiki/Doi_(identifier)
Liu ZH, Sun X (2008). "Coronavirus phylogeny based on base-base correlation". International Journal of Bioinformatics Research and Applications. 4 (2): 211–220. doi:10.1504/ijbra.2008.018347. PMID 18490264. /wiki/Doi_(identifier)
Cheng J, Zeng X, Ren G, Liu Z (March 2013). "CGAP: a new comprehensive platform for the comparative analysis of chloroplast genomes". BMC Bioinformatics. 14: 95. doi:10.1186/1471-2105-14-95. PMC 3636126. PMID 23496817. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3636126
Gao Y, Luo L (January 2012). "Genome-based phylogeny of dsDNA viruses by a novel alignment-free method". Gene. 492 (1): 309–314. doi:10.1016/j.gene.2011.11.004. PMID 22100880. /wiki/Doi_(identifier)
Bennett, C.H., Gacs, P., Li, M., Vitanyi, P. and Zurek, W., Information distance, IEEE Trans. Inform. Theory, 44, 1407--1423
Li, M., Badger, J.H., Chen, X., Kwong, S., Kearney, P. and
Zhang, H., (2001) An information-based sequence distance and
its application to whole mitochondrial genome phylogeny.
Bioinformatics, 17:(2001), 149--154
M. Li, X. Chen, X. Li, B. Ma, P.M.B. Vitanyi.
The similarity metric, IEEE Trans. Inform. Th., 50:12(2004),
3250--3264
R.L. Cilibrasi and P.M.B. Vitanyi, Clustering by compression,
IEEE Trans. Informat. Th., 51:4(2005), 1523--1545
Otu HH, Sayood K (November 2003). "A new sequence distance measure for phylogenetic tree construction". Bioinformatics. 19 (16): 2122–2130. doi:10.1093/bioinformatics/btg295. PMID 14594718. https://doi.org/10.1093%2Fbioinformatics%2Fbtg295
Pinho AJ, Garcia SP, Pratas D, Ferreira PJ (Nov 21, 2013). "DNA sequences at a glance". PLOS ONE. 8 (11): e79922. Bibcode:2013PLoSO...879922P. doi:10.1371/journal.pone.0079922. PMC 3836782. PMID 24278218. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3836782
Jeffrey HJ (April 1990). "Chaos game representation of gene structure". Nucleic Acids Research. 18 (8): 2163–2170. doi:10.1093/nar/18.8.2163. PMC 330698. PMID 2336393. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC330698
Goldman N (May 1993). "Nucleotide, dinucleotide and trinucleotide frequencies explain patterns observed in chaos game representations of DNA sequences". Nucleic Acids Research. 21 (10): 2487–2491. doi:10.1093/nar/21.10.2487. PMC 309551. PMID 8506142. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC309551
Almeida JS, Carriço JA, Maretzek A, Noble PA, Fletcher M (May 2001). "Analysis of genomic sequences by Chaos Game Representation". Bioinformatics. 17 (5): 429–437. doi:10.1093/bioinformatics/17.5.429. PMID 11331237. https://doi.org/10.1093%2Fbioinformatics%2F17.5.429
Almeida JS (May 2014). "Sequence analysis by iterated maps, a review". Briefings in Bioinformatics. 15 (3): 369–375. doi:10.1093/bib/bbt072. PMC 4017330. PMID 24162172. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4017330
Almeida JS, Grüneberg A, Maass W, Vinga S (May 2012). "Fractal MapReduce decomposition of sequence alignment". Algorithms for Molecular Biology. 7 (1): 12. doi:10.1186/1748-7188-7-12. PMC 3394223. PMID 22551205. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3394223
Vinga S, Carvalho AM, Francisco AP, Russo LM, Almeida JS (May 2012). "Pattern matching through Chaos Game Representation: bridging numerical and discrete data structures for biological sequence analysis". Algorithms for Molecular Biology. 7 (1): 10. doi:10.1186/1748-7188-7-10. PMC 3402988. PMID 22551152. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3402988
Pratas D, Silva RM, Pinho AJ, Ferreira PJ (May 2015). "An alignment-free method to find and visualise rearrangements between pairs of DNA sequences". Scientific Reports. 5 (10203): 10203. Bibcode:2015NatSR...510203P. doi:10.1038/srep10203. PMC 4434998. PMID 25984837. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4434998
Hosseini M, Pratas D, Morgenstern B, Pinho AJ (May 2020). "Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements". GigaScience. 9 (5): giaa048. doi:10.1093/gigascience/giaa048. PMC 7238676. PMID 32432328. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7238676
Chan CX, Ragan MA (January 2013). "Next-generation phylogenomics". Biology Direct. 8: 3. doi:10.1186/1745-6150-8-3. PMC 3564786. PMID 23339707. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3564786
Bernard G, Chan CX, Chan YB, Chua XY, Cong Y, Hogan JM, et al. (March 2019). "Alignment-free inference of hierarchical and reticulate phylogenomic relationships". Briefings in Bioinformatics. 20 (2): 426–435. doi:10.1093/bib/bbx067. PMC 6433738. PMID 28673025. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6433738
Bernard G, Greenfield P, Ragan MA, Chan CX (Nov 20, 2018). "k-mer Similarity, Networks of Microbial Genomes, and Taxonomic Rank". mSystems. 3 (6): e00257–18. doi:10.1128/mSystems.00257-18. PMC 6247013. PMID 30505941. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6247013
Song K, Ren J, Reinert G, Deng M, Waterman MS, Sun F (May 2014). "New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing". Briefings in Bioinformatics. 15 (3): 343–353. doi:10.1093/bib/bbt067. PMC 4017329. PMID 24064230. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4017329
Břinda K, Sykulski M, Kucherov G (November 2015). "Spaced seeds improve k-mer-based metagenomic classification". Bioinformatics. 31 (22): 3584–3592. arXiv:1502.06256. Bibcode:2015Bioin..31.3584B. doi:10.1093/bioinformatics/btv419. PMID 26209798. S2CID 8626694. /wiki/ArXiv_(identifier)
Ounit R, Lonardi S (December 2016). "Higher classification sensitivity of short metagenomic reads with CLARK-S". Bioinformatics. 32 (24): 3823–3825. doi:10.1093/bioinformatics/btw542. PMID 27540266. https://doi.org/10.1093%2Fbioinformatics%2Fbtw542
Pratas D, Pinho AJ, Silva RM, Rodrigues JM, Hosseini M, Caetano T, Ferreira PJ (February 2018). "FALCON: a method to infer metagenomic composition of ancient DNA". bioRxiv 10.1101/267179. /wiki/BioRxiv_(identifier)
Wood DE, Salzberg SL (March 2014). "Kraken: ultrafast metagenomic sequence classification using exact alignments". Genome Biology. 15 (3): R46. doi:10.1186/gb-2014-15-3-r46. PMC 4053813. PMID 24580807. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4053813
Song K, Ren J, Reinert G, Deng M, Waterman MS, Sun F (May 2014). "New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing". Briefings in Bioinformatics. 15 (3): 343–353. doi:10.1093/bib/bbt067. PMC 4017329. PMID 24064230. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4017329
Noé L, Martin DE (December 2014). "A coverage criterion for spaced seeds and its applications to support vector machine string kernels and k-mer distances". Journal of Computational Biology. 21 (12): 947–963. arXiv:1412.2587. Bibcode:2014arXiv1412.2587N. doi:10.1089/cmb.2014.0173. PMC 4253314. PMID 25393923. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4253314
Pinello L, Lo Bosco G, Yuan GC (May 2014). "Applications of alignment-free methods in epigenomics". Briefings in Bioinformatics. 15 (3): 419–430. doi:10.1093/bib/bbt078. PMC 4017331. PMID 24197932. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4017331
La Rosa M, Fiannaca A, Rizzo R, Urso A (2013). "Alignment-free analysis of barcode sequences by means of compression-based methods". BMC Bioinformatics. 14 (Suppl 7): S4. doi:10.1186/1471-2105-14-S7-S4. PMC 3633054. PMID 23815444. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3633054
Haubold B (May 2014). "Alignment-free phylogenetics and population genetics". Briefings in Bioinformatics. 15 (3): 407–418. doi:10.1093/bib/bbt083. PMID 24291823. https://doi.org/10.1093%2Fbib%2Fbbt083
Domazet-Lošo M, Haubold B (June 2011). "Alignment-free detection of local similarity among viral and bacterial genomes". Bioinformatics. 27 (11): 1466–1472. doi:10.1093/bioinformatics/btr176. PMID 21471011. https://doi.org/10.1093%2Fbioinformatics%2Fbtr176
Kolekar P, Kale M, Kulkarni-Kale U (November 2012). "Alignment-free distance measure based on return time distribution for sequence analysis: applications to clustering, molecular phylogeny and subtyping". Molecular Phylogenetics and Evolution. 65 (2): 510–522. doi:10.1016/j.ympev.2012.07.003. PMID 22820020. /wiki/Doi_(identifier)
Kolekar P, Hake N, Kale M, Kulkarni-Kale U (March 2014). "WNV Typer: a server for genotyping of West Nile viruses using an alignment-free method based on a return time distribution". Journal of Virological Methods. 198: 41–55. doi:10.1016/j.jviromet.2013.12.012. PMID 24388930. https://doi.org/10.1016%2Fj.jviromet.2013.12.012
Struck D, Lawyer G, Ternes AM, Schmit JC, Bercoff DP (October 2014). "COMET: adaptive context-based modeling for ultrafast HIV-1 subtype identification". Nucleic Acids Research. 42 (18): e144. doi:10.1093/nar/gku739. PMC 4191385. PMID 25120265. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4191385
Dimitrov I, Naneva L, Doytchinova I, Bangov I (March 2014). "AllergenFP: allergenicity prediction by descriptor fingerprints". Bioinformatics. 30 (6): 846–851. doi:10.1093/bioinformatics/btt619. PMID 24167156. https://doi.org/10.1093%2Fbioinformatics%2Fbtt619
Gardner SN, Hall BG (Dec 9, 2013). "When whole-genome alignments just won't work: kSNP v2 software for alignment-free SNP discovery and phylogenetics of hundreds of microbial genomes". PLOS ONE. 8 (12): e81760. Bibcode:2013PLoSO...881760G. doi:10.1371/journal.pone.0081760. PMC 3857212. PMID 24349125. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3857212
Haubold B, Krause L, Horn T, Pfaffelhuber P (December 2013). "An alignment-free test for recombination". Bioinformatics. 29 (24): 3121–3127. doi:10.1093/bioinformatics/btt550. PMC 5994939. PMID 24064419. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5994939
Silva JM, Pratas D, Caetano T, Matos S (August 2022). "The complexity landscape of viral genomes". GigaScience. 11: 1–16. doi:10.1093/gigascience/giac079. PMC 9366995. PMID 35950839. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9366995
Silva JM, Pratas D, Caetano T, Matos S (2022), Pinho AJ, Georgieva P, Teixeira LF, Sánchez JA (eds.), "Feature-Based Classification of Archaeal Sequences Using Compression-Based Methods", Pattern Recognition and Image Analysis, Lecture Notes in Computer Science, vol. 13256, Cham: Springer International Publishing, pp. 309–320, doi:10.1007/978-3-031-04881-4_25, ISBN 978-3-031-04880-7, retrieved 2022-08-31 978-3-031-04880-7
Silva, Jorge Miguel; Almeida, João Rafael (2024-10-01). "Enhancing metagenomic classification with compression-based features". Artificial Intelligence in Medicine. 156: 102948. doi:10.1016/j.artmed.2024.102948. ISSN 0933-3657. PMID 39173422. https://linkinghub.elsevier.com/retrieve/pii/S0933365724001908
Silva, Jorge M; Pinho, Armando J; Pratas, Diogo (2024). "AltaiR: a C toolkit for alignment-free and temporal analysis of multi-FASTA data". GigaScience. 13. doi:10.1093/gigascience/giae086. ISSN 2047-217X. PMC 11590114. PMID 39589438. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11590114
Silva JM, Qi W, Pinho AJ, Pratas D (December 2022). "AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data". GigaScience. 12. doi:10.1093/gigascience/giad101. PMC 10716826. PMID 38091509. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10716826
Di Biasi L, Piotto S. ARISE: Artificial Intelligence Semantic Search Engine. WIVACE2021.
Leimeister CA, Morgenstern B (July 2014). "Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison". Bioinformatics. 30 (14): 2000–2008. doi:10.1093/bioinformatics/btu331. PMC 4080746. PMID 24828656. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4080746
Leimeister CA, Boden M, Horwege S, Lindner S, Morgenstern B (July 2014). "Fast alignment-free sequence comparison using spaced-word frequencies". Bioinformatics. 30 (14): 1991–1999. doi:10.1093/bioinformatics/btu177. PMC 4080745. PMID 24700317. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4080745
Yi H, Jin L (April 2013). "Co-phylog: an assembly-free phylogenomic approach for closely related organisms". Nucleic Acids Research. 41 (7): e75. doi:10.1093/nar/gkt003. PMC 3627563. PMID 23335788. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3627563
Leimeister CA, Schellhorn J, Dörrer S, Gerth M, Bleidorn C, Morgenstern B (March 2019). "Prot-SpaM: fast alignment-free phylogeny reconstruction based on whole-proteome sequences". GigaScience. 8 (3): giy148. doi:10.1093/gigascience/giy148. PMC 6436989. PMID 30535314. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6436989
Leimeister CA, Sohrabi-Jahromi S, Morgenstern B (April 2017). "Fast and accurate phylogeny reconstruction using filtered spaced-word matches". Bioinformatics. 33 (7): 971–979. doi:10.1093/bioinformatics/btw776. PMC 5409309. PMID 28073754. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5409309
Sims GE, Jun SR, Wu GA, Kim SH (October 2009). "Whole-genome phylogeny of mammals: evolutionary information in genic and nongenic regions". Proceedings of the National Academy of Sciences of the United States of America. 106 (40): 17077–17082. Bibcode:2009PNAS..10617077S. doi:10.1073/pnas.0909377106. PMC 2761373. PMID 19805074. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2761373
Xu Z, Hao B (July 2009). "CVTree update: a newly designed phylogenetic study platform using composition vectors and whole genomes". Nucleic Acids Research. 37 (Web Server issue): W174 – W178. doi:10.1093/nar/gkp278. PMC 2703908. PMID 19398429. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2703908
Kolekar P, Kale M, Kulkarni-Kale U (November 2012). "Alignment-free distance measure based on return time distribution for sequence analysis: applications to clustering, molecular phylogeny and subtyping". Molecular Phylogenetics and Evolution. 65 (2): 510–522. doi:10.1016/j.ympev.2012.07.003. PMID 22820020. /wiki/Doi_(identifier)
Cheng J, Cao F, Liu Z (May 2013). "AGP: a multimethods web server for alignment-free genome phylogeny". Molecular Biology and Evolution. 30 (5): 1032–1037. doi:10.1093/molbev/mst021. PMC 7574599. PMID 23389766. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7574599
Domazet-Lošo M, Haubold B (June 2011). "Alignment-free detection of local similarity among viral and bacterial genomes". Bioinformatics. 27 (11): 1466–1472. doi:10.1093/bioinformatics/btr176. PMID 21471011. https://doi.org/10.1093%2Fbioinformatics%2Fbtr176
Höhl M, Rigoutsos I, Ragan MA (February 2007). "Pattern-based phylogenetic distance estimation and tree reconstruction". Evolutionary Bioinformatics Online. 2: 359–375. arXiv:q-bio/0605002. Bibcode:2006q.bio.....5002H. PMC 2674673. PMID 19455227. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2674673
Kolekar P, Kale M, Kulkarni-Kale U (November 2012). "Alignment-free distance measure based on return time distribution for sequence analysis: applications to clustering, molecular phylogeny and subtyping". Molecular Phylogenetics and Evolution. 65 (2): 510–522. doi:10.1016/j.ympev.2012.07.003. PMID 22820020. /wiki/Doi_(identifier)
Kolekar P, Hake N, Kale M, Kulkarni-Kale U (March 2014). "WNV Typer: a server for genotyping of West Nile viruses using an alignment-free method based on a return time distribution". Journal of Virological Methods. 198: 41–55. doi:10.1016/j.jviromet.2013.12.012. PMID 24388930. https://doi.org/10.1016%2Fj.jviromet.2013.12.012
Dimitrov I, Naneva L, Doytchinova I, Bangov I (March 2014). "AllergenFP: allergenicity prediction by descriptor fingerprints". Bioinformatics. 30 (6): 846–851. doi:10.1093/bioinformatics/btt619. PMID 24167156. https://doi.org/10.1093%2Fbioinformatics%2Fbtt619
Gardner SN, Hall BG (Dec 9, 2013). "When whole-genome alignments just won't work: kSNP v2 software for alignment-free SNP discovery and phylogenetics of hundreds of microbial genomes". PLOS ONE. 8 (12): e81760. Bibcode:2013PLoSO...881760G. doi:10.1371/journal.pone.0081760. PMC 3857212. PMID 24349125. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3857212
Wang Y, Liu L, Chen L, Chen T, Sun F (Jan 2, 2014). "Comparison of metatranscriptomic samples based on k-tuple frequencies". PLOS ONE. 9 (1): e84348. Bibcode:2014PLoSO...984348W. doi:10.1371/journal.pone.0084348. PMC 3879298. PMID 24392128. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3879298
Haubold B, Krause L, Horn T, Pfaffelhuber P (December 2013). "An alignment-free test for recombination". Bioinformatics. 29 (24): 3121–3127. doi:10.1093/bioinformatics/btt550. PMC 5994939. PMID 24064419. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5994939
Pratas D, Silva RM, Pinho AJ, Ferreira PJ (May 2015). "An alignment-free method to find and visualise rearrangements between pairs of DNA sequences". Scientific Reports. 5 (10203): 10203. Bibcode:2015NatSR...510203P. doi:10.1038/srep10203. PMC 4434998. PMID 25984837. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4434998
Hosseini M, Pratas D, Morgenstern B, Pinho AJ (May 2020). "Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements". GigaScience. 9 (5): giaa048. doi:10.1093/gigascience/giaa048. PMC 7238676. PMID 32432328. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7238676
Struck D, Lawyer G, Ternes AM, Schmit JC, Bercoff DP (October 2014). "COMET: adaptive context-based modeling for ultrafast HIV-1 subtype identification". Nucleic Acids Research. 42 (18): e144. doi:10.1093/nar/gku739. PMC 4191385. PMID 25120265. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4191385
Almeida JS, Grüneberg A, Maass W, Vinga S (May 2012). "Fractal MapReduce decomposition of sequence alignment". Algorithms for Molecular Biology. 7 (1): 12. doi:10.1186/1748-7188-7-12. PMC 3394223. PMID 22551205. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3394223
Pratas D, Pinho AJ, Silva RM, Rodrigues JM, Hosseini M, Caetano T, Ferreira PJ (February 2018). "FALCON: a method to infer metagenomic composition of ancient DNA". bioRxiv 10.1101/267179. /wiki/BioRxiv_(identifier)
Wood DE, Salzberg SL (March 2014). "Kraken: ultrafast metagenomic sequence classification using exact alignments". Genome Biology. 15 (3): R46. doi:10.1186/gb-2014-15-3-r46. PMC 4053813. PMID 24580807. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4053813
Pratas D, Silva JM (January 2021). "Persistent minimal sequences of SARS-CoV-2". Bioinformatics. 36 (21): 5129–5132. doi:10.1093/bioinformatics/btaa686. PMC 7559010. PMID 32730589. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7559010
"CLC Microbial Genomics Module". QIAGEN Bioinformatics. 2019. https://www.qiagenbioinformatics.com/products/clc-microbial-genomics-module/
Silva, Jorge Miguel; Almeida, João Rafael (2024-10-01). "Enhancing metagenomic classification with compression-based features". Artificial Intelligence in Medicine. 156: 102948. doi:10.1016/j.artmed.2024.102948. ISSN 0933-3657. PMID 39173422. https://linkinghub.elsevier.com/retrieve/pii/S0933365724001908
Silva JM, Qi W, Pinho AJ, Pratas D (December 2022). "AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data". GigaScience. 12. doi:10.1093/gigascience/giad101. PMC 10716826. PMID 38091509. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10716826
Silva, Jorge M; Pinho, Armando J; Pratas, Diogo (2024). "AltaiR: a C toolkit for alignment-free and temporal analysis of multi-FASTA data". GigaScience. 13. doi:10.1093/gigascience/giae086. ISSN 2047-217X. PMC 11590114. PMID 39589438. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11590114