Hirschberg's algorithm is a generally applicable algorithm for optimal sequence alignment. BLAST and FASTA are suboptimal heuristics. If X {\displaystyle X} and Y {\displaystyle Y} are strings, where length ( X ) = n {\displaystyle \operatorname {length} (X)=n} and length ( Y ) = m {\displaystyle \operatorname {length} (Y)=m} , the Needleman–Wunsch algorithm finds an optimal alignment in O ( n m ) {\displaystyle O(nm)} time, using O ( n m ) {\displaystyle O(nm)} space. Hirschberg's algorithm is a clever modification of the Needleman–Wunsch Algorithm, which still takes O ( n m ) {\displaystyle O(nm)} time, but needs only O ( min { n , m } ) {\displaystyle O(\min\{n,m\})} space and is much faster in practice.2 One application of the algorithm is finding sequence alignments of DNA or protein sequences. It is also a space-efficient way to calculate the longest common subsequence between two sets of data such as with the common diff tool.
The Hirschberg algorithm can be derived from the Needleman–Wunsch algorithm by observing that:3
X i {\displaystyle X_{i}} denotes the i-th character of X {\displaystyle X} , where 1 ⩽ i ⩽ length ( X ) {\displaystyle 1\leqslant i\leqslant \operatorname {length} (X)} . X i : j {\displaystyle X_{i:j}} denotes a substring of size j − i + 1 {\displaystyle j-i+1} , ranging from the i-th to the j-th character of X {\displaystyle X} . rev ( X ) {\displaystyle \operatorname {rev} (X)} is the reversed version of X {\displaystyle X} .
X {\displaystyle X} and Y {\displaystyle Y} are sequences to be aligned. Let x {\displaystyle x} be a character from X {\displaystyle X} , and y {\displaystyle y} be a character from Y {\displaystyle Y} . We assume that Del ( x ) {\displaystyle \operatorname {Del} (x)} , Ins ( y ) {\displaystyle \operatorname {Ins} (y)} and Sub ( x , y ) {\displaystyle \operatorname {Sub} (x,y)} are well defined integer-valued functions. These functions represent the cost of deleting x {\displaystyle x} , inserting y {\displaystyle y} , and replacing x {\displaystyle x} with y {\displaystyle y} , respectively.
We define NWScore ( X , Y ) {\displaystyle \operatorname {NWScore} (X,Y)} , which returns the last line of the Needleman–Wunsch score matrix S c o r e ( i , j ) {\displaystyle \mathrm {Score} (i,j)} :
Note that at any point, NWScore {\displaystyle \operatorname {NWScore} } only requires the two most recent rows of the score matrix. Thus, NWScore {\displaystyle \operatorname {NWScore} } is implemented in O ( min { length ( X ) , length ( Y ) } ) {\displaystyle O(\min\{\operatorname {length} (X),\operatorname {length} (Y)\})} space.
The Hirschberg algorithm follows:
In the context of observation (2), assume that X l + X r {\displaystyle X^{l}+X^{r}} is a partition of X {\displaystyle X} . Index y m i d {\displaystyle \mathrm {ymid} } is computed such that Y l = Y 1 : y m i d {\displaystyle Y^{l}=Y_{1:\mathrm {ymid} }} and Y r = Y y m i d + 1 : length ( Y ) {\displaystyle Y^{r}=Y_{\mathrm {ymid} +1:\operatorname {length} (Y)}} .
Let
X = AGTACGCA , Y = TATGC , Del ( x ) = − 2 , Ins ( y ) = − 2 , Sub ( x , y ) = { + 2 , if x = y − 1 , if x ≠ y . {\displaystyle {\begin{aligned}X&={\text{AGTACGCA}},\\Y&={\text{TATGC}},\\\operatorname {Del} (x)&=-2,\\\operatorname {Ins} (y)&=-2,\\\operatorname {Sub} (x,y)&={\begin{cases}+2,&{\text{if }}x=y\\-1,&{\text{if }}x\neq y.\end{cases}}\end{aligned}}}
The optimal alignment is given by
Indeed, this can be verified by backtracking its corresponding Needleman–Wunsch matrix:
One starts with the top level call to Hirschberg ( AGTACGCA , TATGC ) {\displaystyle \operatorname {Hirschberg} ({\text{AGTACGCA}},{\text{TATGC}})} , which splits the first argument in half: X = AGTA + CGCA {\displaystyle X={\text{AGTA}}+{\text{CGCA}}} . The call to NWScore ( AGTA , Y ) {\displaystyle \operatorname {NWScore} ({\text{AGTA}},Y)} produces the following matrix:
Likewise, NWScore ( rev ( CGCA ) , rev ( Y ) ) {\displaystyle \operatorname {NWScore} (\operatorname {rev} ({\text{CGCA}}),\operatorname {rev} (Y))} generates the following matrix:
Their last lines (after reversing the latter) and sum of those are respectively
The maximum (shown in bold) appears at ymid = 2, producing the partition Y = TA + TGC {\displaystyle Y={\text{TA}}+{\text{TGC}}} .
The entire Hirschberg recursion (which we omit for brevity) produces the following tree:
The leaves of the tree contain the optimal alignment.
Hirschberg's algorithm. http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Hirsch/ ↩
"The Algorithm". http://www.cs.tau.ac.il/~rshamir/algmb/98/scribe/html/lec02/node10.html ↩
Hirschberg, D. S. (1975). "A linear space algorithm for computing maximal common subsequences". Communications of the ACM. 18 (6): 341–343. CiteSeerX 10.1.1.348.4774. doi:10.1145/360825.360861. MR 0375829. S2CID 207694727. /wiki/Dan_Hirschberg ↩