Hirschberg's algorithm

<h2 id="algorithm-information">Algorithm information</h2>
Hirschberg's algorithm is a generally applicable algorithm for optimal sequence alignment. <a href="/facts/BLAST_(biotechnology)/uWufGxFH">BLAST</a> and <a href="/facts/FASTA/YetQcbUz">FASTA</a> are suboptimal <a href="/facts/Heuristic_(computer_science)/PM2qkoa2">heuristics</a>. If $X$ and $Y$ are strings, where $\operatorname{length}(X)=n$ and $\operatorname{length}(Y)=m$, the <a href="/facts/Needleman%25E2%2580%2593Wunsch_algorithm/B3HJ8xU5">Needleman–Wunsch algorithm</a> finds an optimal alignment in <a href="/facts/Big_O_Notation/weFFjSWg">$O(nm)$</a> time, using $O(nm)$ space. Hirschberg's algorithm is a clever modification of the Needleman–Wunsch algorithm, which still takes $O(nm)$ time, but needs only $O(\min\{n,m\})$ space and is much faster in practice.<a class="footnote-ref" id="fnref:2" href="#fn:2">2</a>
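The $O(nm)$ time bound can be motivated by a rough count of score-matrix cells. The estimate below is a sketch, not a full proof: it assumes that $X$ is split in half at every level of the recursion (as in the pseudocode under Algorithm description) and ignores rounding and the small base cases. The top level fills about $nm$ cells, and the two subproblems it creates together cover about half that area, so the work per level halves:

```latex
T(n, m) \;\le\; nm + \tfrac{1}{2}nm + \tfrac{1}{4}nm + \cdots
        \;=\; nm \sum_{k \ge 0} 2^{-k}
        \;=\; 2nm \;=\; O(nm)
```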
One application of the algorithm is finding sequence alignments of DNA or protein sequences. It is also a space-efficient way to calculate the <a href="/facts/Longest_common_subsequence_problem/2HKIa0yd">longest common subsequence</a> between two sets of data such as with the common <a href="/facts/Diff/MtzRWwql">diff</a> tool.
The Hirschberg algorithm can be derived from the Needleman–Wunsch algorithm by observing that:<a class="footnote-ref" id="fnref:3" href="#fn:3">3</a>

<ol><li>one can compute the optimal alignment score by only storing the current and previous row of the Needleman–Wunsch score matrix;</li>
<li>if $(Z,W)=\operatorname{NW}(X,Y)$ is the optimal alignment of $(X,Y)$, and $X=X^{l}+X^{r}$ is an arbitrary partition of $X$, there exists a partition $Y^{l}+Y^{r}$ of $Y$ such that $\operatorname{NW}(X,Y)=\operatorname{NW}(X^{l},Y^{l})+\operatorname{NW}(X^{r},Y^{r})$.</li></ol>
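Observation 2 has a consequence for scores that can be checked by brute force. The sketch below is illustrative only: it assumes the scoring used in the Example section (gap -2, match +2, mismatch -1), and the helper names `optimal_score` and `best_split_score` are made up for this sketch. For each cut of X it tries every cut of Y and confirms that the best pair of sub-scores equals the score of the whole alignment.

```python
def optimal_score(x, y, gap=-2, match=2, mismatch=-1):
    """Optimal global alignment score of x and y, keeping one row at a time."""
    prev = [j * gap for j in range(len(y) + 1)]
    for xc in x:
        cur = [prev[0] + gap]
        for j, yc in enumerate(y, start=1):
            pair = prev[j - 1] + (match if xc == yc else mismatch)
            cur.append(max(pair, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def best_split_score(x, y, xmid):
    """Best total score over all cuts of y, for a fixed cut of x at xmid."""
    return max(
        optimal_score(x[:xmid], y[:k]) + optimal_score(x[xmid:], y[k:])
        for k in range(len(y) + 1)
    )


x, y = "AGTACGCA", "TATGC"
# For every cut of x, some cut of y attains the optimal score.
assert all(best_split_score(x, y, i) == optimal_score(x, y)
           for i in range(len(x) + 1))
```

The check is quadratic in the number of cuts and only meant for small inputs; Hirschberg's algorithm finds a suitable cut of Y without trying each one separately.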
<h2 id="algorithm-description">Algorithm description</h2>
$X_{i}$ denotes the i-th character of $X$, where $1\leqslant i\leqslant \operatorname{length}(X)$. $X_{i:j}$ denotes a substring of size $j-i+1$, ranging from the i-th to the j-th character of $X$. $\operatorname{rev}(X)$ is the reversed version of $X$.

 
 
 
$X$ and $Y$ are sequences to be aligned. Let $x$ be a character from $X$, and $y$ be a character from $Y$. We assume that $\operatorname{Del}(x)$, $\operatorname{Ins}(y)$ and $\operatorname{Sub}(x,y)$ are well-defined integer-valued functions. These functions represent the cost of deleting $x$, inserting $y$, and replacing $x$ with $y$, respectively.
We define $\operatorname{NWScore}(X,Y)$, which returns the last line of the Needleman–Wunsch score matrix $\mathrm{Score}(i,j)$:

function NWScore(X, Y)
    // Score is a 2 * (length(Y) + 1) array: row 0 is the previous row, row 1 the current row
    Score(0, 0) = 0
    for j = 1 to length(Y) // Init first row
        Score(0, j) = Score(0, j - 1) + Ins(Yj)
    for i = 1 to length(X)
        Score(1, 0) = Score(0, 0) + Del(Xi)
        for j = 1 to length(Y)
            scoreSub = Score(0, j - 1) + Sub(Xi, Yj)
            scoreDel = Score(0, j) + Del(Xi)
            scoreIns = Score(1, j - 1) + Ins(Yj)
            Score(1, j) = max(scoreSub, scoreDel, scoreIns)
        end
        // Copy Score[1] to Score[0]
        Score(0, :) = Score(1, :)
    end
    // After the copy, row 0 holds the last computed row (also when X is empty)
    for j = 0 to length(Y)
        LastLine(j) = Score(0, j)
    return LastLine
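The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the default cost functions are assumptions borrowed from the Example section (insertion and deletion cost -2, substitution +2 for a match and -1 otherwise), and the parameter names `insert`, `delete` and `substitute` are chosen for this sketch.

```python
def nw_score(x, y,
             insert=lambda c: -2,
             delete=lambda c: -2,
             substitute=lambda a, b: 2 if a == b else -1):
    """Return the last row of the Needleman-Wunsch score matrix for x vs y."""
    # Row 0 of the matrix: the empty prefix of x against each prefix of y.
    prev = [0] * (len(y) + 1)
    for j in range(1, len(y) + 1):
        prev[j] = prev[j - 1] + insert(y[j - 1])

    for xc in x:
        # Only the previous row is needed to build the current one.
        cur = [prev[0] + delete(xc)] + [0] * len(y)
        for j in range(1, len(y) + 1):
            yc = y[j - 1]
            cur[j] = max(prev[j - 1] + substitute(xc, yc),
                         prev[j] + delete(xc),
                         cur[j - 1] + insert(yc))
        prev = cur
    return prev


print(nw_score("AGTA", "TATGC"))  # [-8, -4, 0, -2, -1, -3]
```

Replacing `prev` with a fresh list on each iteration plays the role of the "copy row 1 to row 0" step in the pseudocode.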

Note that at any point, $\operatorname{NWScore}$ only requires the two most recent rows of the score matrix. Thus, $\operatorname{NWScore}$ is implemented in $O(\min\{\operatorname{length}(X),\operatorname{length}(Y)\})$ space.
The Hirschberg algorithm follows:

function Hirschberg(X, Y)
    Z = ""
    W = ""
    if length(X) == 0
        for i = 1 to length(Y)
            Z = Z + '-'
            W = W + Yi
        end
    else if length(Y) == 0
        for i = 1 to length(X)
            Z = Z + Xi
            W = W + '-'
        end
    else if length(X) == 1 or length(Y) == 1
        (Z, W) = NeedlemanWunsch(X, Y)
    else
        xlen = length(X)
        xmid = length(X) / 2 // integer division
        ylen = length(Y)

        ScoreL = NWScore(X1:xmid, Y)
        ScoreR = NWScore(rev(Xxmid+1:xlen), rev(Y))
        // index in 0..ylen at which the element-wise sum is largest
        ymid = <a href="/facts/Arg_max/TfSepkFV">arg max</a> ScoreL + rev(ScoreR)

        (Z, W) = Hirschberg(X1:xmid, Y1:ymid) + Hirschberg(Xxmid+1:xlen, Yymid+1:ylen)
    end
    return (Z, W)
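A self-contained Python sketch of the recursion is given below. It is illustrative rather than definitive: the constant costs are assumptions taken from the Example section, the helper `needleman_wunsch` is a plain full-matrix aligner written for this sketch and used only on the tiny base cases, and ties in the arg max are broken by taking the first maximum. The function returns the gapped version of X first and the gapped version of Y second, matching the order in which the pseudocode builds its two strings.

```python
GAP, MATCH, MISMATCH = -2, 2, -1  # assumed costs, as in the Example section


def sub(a, b):
    return MATCH if a == b else MISMATCH


def nw_score(x, y):
    """Last row of the Needleman-Wunsch score matrix, one row at a time."""
    prev = [j * GAP for j in range(len(y) + 1)]
    for xc in x:
        cur = [prev[0] + GAP]
        for j, yc in enumerate(y, start=1):
            cur.append(max(prev[j - 1] + sub(xc, yc),
                           prev[j] + GAP,
                           cur[j - 1] + GAP))
        prev = cur
    return prev


def needleman_wunsch(x, y):
    """Full-matrix alignment with traceback (used only for base cases)."""
    s = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        s[i][0] = i * GAP
    for j in range(1, len(y) + 1):
        s[0][j] = j * GAP
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            s[i][j] = max(s[i - 1][j - 1] + sub(x[i - 1], y[j - 1]),
                          s[i - 1][j] + GAP,
                          s[i][j - 1] + GAP)
    z, w, i, j = [], [], len(x), len(y)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + sub(x[i - 1], y[j - 1]):
            z.append(x[i - 1])
            w.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and s[i][j] == s[i - 1][j] + GAP:
            z.append(x[i - 1])
            w.append("-")
            i -= 1
        else:
            z.append("-")
            w.append(y[j - 1])
            j -= 1
    return "".join(reversed(z)), "".join(reversed(w))


def hirschberg(x, y):
    """Return (gapped x, gapped y) for an optimal global alignment."""
    if len(x) == 0:
        return "-" * len(y), y
    if len(y) == 0:
        return x, "-" * len(x)
    if len(x) == 1 or len(y) == 1:
        return needleman_wunsch(x, y)
    xmid = len(x) // 2
    score_l = nw_score(x[:xmid], y)
    score_r = nw_score(x[xmid:][::-1], y[::-1])
    totals = [l + r for l, r in zip(score_l, reversed(score_r))]
    ymid = totals.index(max(totals))  # first maximum; other ties are equally valid
    left_x, left_y = hirschberg(x[:xmid], y[:ymid])
    right_x, right_y = hirschberg(x[xmid:], y[ymid:])
    return left_x + right_x, left_y + right_y


print(hirschberg("AGTACGCA", "TATGC"))  # ('AGTACGCA', '--TATGC-')
```

Python slices copy their substrings, so this sketch does not reach the linear-space bound exactly as written; passing index ranges instead of slices would avoid the copies.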

In the context of observation (2), assume that $X^{l}+X^{r}$ is a partition of $X$. Index $\mathrm{ymid}$ is computed such that $Y^{l}=Y_{1:\mathrm{ymid}}$ and $Y^{r}=Y_{\mathrm{ymid}+1:\operatorname{length}(Y)}$.
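Written out, the arg max in the pseudocode can be read as the formula below. The indexing is an interpretation made explicit here: $\mathrm{ScoreL}(j)$ is taken to be the optimal score of aligning $X^{l}$ with the first $j$ characters of $Y$, and $\mathrm{ScoreR}(k)$ the optimal score of aligning $\operatorname{rev}(X^{r})$ with the first $k$ characters of $\operatorname{rev}(Y)$, so that reversing $\mathrm{ScoreR}$ pairs each prefix of $Y$ with the complementary suffix.

```latex
\mathrm{ymid} \;=\; \operatorname*{arg\,max}_{0 \le j \le \operatorname{length}(Y)}
    \Bigl[\, \mathrm{ScoreL}(j) + \mathrm{ScoreR}\bigl(\operatorname{length}(Y) - j\bigr) \Bigr]
```

With this convention $\mathrm{ymid}=0$ means that $Y^{l}$ is empty, and $\mathrm{ymid}=\operatorname{length}(Y)$ means that $Y^{r}$ is empty.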

<h2 id="example">Example</h2>
Let

$$\begin{aligned}X&=\text{AGTACGCA},\\Y&=\text{TATGC},\\\operatorname{Del}(x)&=-2,\\\operatorname{Ins}(y)&=-2,\\\operatorname{Sub}(x,y)&=\begin{cases}+2,&\text{if }x=y\\-1,&\text{if }x\neq y.\end{cases}\end{aligned}$$

The optimal alignment is given by

 Z = AGTACGCA
 W = --TATGC-

Indeed, this can be verified by backtracking its corresponding Needleman–Wunsch matrix:

         T   A   T   G   C
     0  -2  -4  -6  -8 -10
 A  -2  -1   0  -2  -4  -6
 G  -4  -3  -2  -1   0  -2
 T  -6  -2  -4   0  -2  -1
 A  -8  -4   0  -2  -1  -3
 C -10  -6  -2  -1  -3   1
 G -12  -8  -4  -3   1  -1
 C -14 -10  -6  -5  -1   3
 A -16 -12  -8  -7  -3   1
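As a partial check that stops short of the full backtracking, the sketch below rebuilds the matrix and scores the stated alignment column by column. It assumes the example's costs, and the helper names `nw_matrix` and `alignment_score` are invented for this illustration. Agreement between the bottom-right entry and the alignment's score shows that the alignment is optimal, though not that it is the only optimal one.

```python
GAP, MATCH, MISMATCH = -2, 2, -1  # costs from the example


def nw_matrix(x, y):
    """Fill the full (len(x)+1) x (len(y)+1) Needleman-Wunsch score matrix."""
    s = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        s[i][0] = s[i - 1][0] + GAP
    for j in range(1, len(y) + 1):
        s[0][j] = s[0][j - 1] + GAP
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            pair = MATCH if x[i - 1] == y[j - 1] else MISMATCH
            s[i][j] = max(s[i - 1][j - 1] + pair,
                          s[i - 1][j] + GAP,
                          s[i][j - 1] + GAP)
    return s


def alignment_score(top, bottom):
    """Sum the column costs of an alignment given as two gapped strings."""
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += GAP
        else:
            total += MATCH if a == b else MISMATCH
    return total


matrix = nw_matrix("AGTACGCA", "TATGC")
print(matrix[-1][-1])                           # 1
print(alignment_score("AGTACGCA", "--TATGC-"))  # 1
```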

One starts with the top-level call to $\operatorname{Hirschberg}(\text{AGTACGCA},\text{TATGC})$, which splits the first argument in half: $X=\text{AGTA}+\text{CGCA}$. The call to $\operatorname{NWScore}(\text{AGTA},Y)$ produces the following matrix:

         T   A   T   G   C
     0  -2  -4  -6  -8 -10
 A  -2  -1   0  -2  -4  -6
 G  -4  -3  -2  -1   0  -2
 T  -6  -2  -4   0  -2  -1
 A  -8  -4   0  -2  -1  -3

Likewise, $\operatorname{NWScore}(\operatorname{rev}(\text{CGCA}),\operatorname{rev}(Y))$ generates the following matrix:

         C   G   T   A   T
     0  -2  -4  -6  -8 -10
 A  -2  -1  -3  -5  -4  -6
 C  -4   0  -2  -4  -6  -5
 G  -6  -2   2   0  -2  -4
 C  -8  -4   0   1  -1  -3

Their last lines (after reversing the latter) and sum of those are respectively

 ScoreL      = [ -8  -4   0  -2  -1  -3 ]
 rev(ScoreR) = [ -3  -1   1   0  -4  -8 ]
 Sum         = [-11  -5   1  -2  -5 -11 ]

The maximum value, 1, appears at ymid = 2 (counting positions from 0), producing the partition $Y=\text{TA}+\text{TGC}$.
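The same step can be reproduced in a few lines of Python. The two lists are the last rows of the matrices above; the variable names are chosen for this sketch, and ties (there are none here) would be broken by taking the first maximum.

```python
score_l = [-8, -4, 0, -2, -1, -3]  # last row for (AGTA, Y)
score_r = [-8, -4, 0, 1, -1, -3]   # last row for (rev(CGCA), rev(Y)), before reversal

totals = [l + r for l, r in zip(score_l, reversed(score_r))]
ymid = totals.index(max(totals))

y = "TATGC"
print(totals)             # [-11, -5, 1, -2, -5, -11]
print(ymid)               # 2
print(y[:ymid], y[ymid:]) # TA TGC
```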
The entire Hirschberg recursion (which we omit for brevity) produces the following tree:

                (AGTACGCA,TATGC)
                /               \
         (AGTA,TA)             (CGCA,TGC)
          /     \              /        \
      (AG, )   (TA,TA)      (CG,TG)     (CA,C)
               /   \        /   \
            (T,T) (A,A)  (C,T) (G,G)

The leaves of the tree contain the optimal alignment.
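Reading the leaves from left to right and concatenating them recovers the alignment. In the sketch below the gapped forms of the two non-trivial leaves, (AG, ) and (CA,C), are written out by hand as the base cases would produce them under the example's costs; that spelling-out is an assumption of this illustration, not something shown in the tree itself.

```python
# Leaves of the recursion tree, left to right, as (gapped X part, gapped Y part).
leaves = [
    ("AG", "--"),  # (AG, ): Y part is empty, so both characters face gaps
    ("T", "T"),
    ("A", "A"),
    ("C", "T"),
    ("G", "G"),
    ("CA", "C-"),  # (CA, C): C matches C, A faces a gap
]

aligned_x = "".join(x_part for x_part, _ in leaves)
aligned_y = "".join(y_part for _, y_part in leaves)
print(aligned_x)  # AGTACGCA
print(aligned_y)  # --TATGC-
```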

<h2 id="see-also">See also</h2>
<ul><li><a href="/facts/Longest_common_subsequence_problem/2HKIa0yd">Longest common subsequence</a></li></ul>

<h2 id="references">References</h2>

<ol>
<li id="fn:1">Hirschberg's algorithm. <a href="http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Hirsch/" target="_blank">http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Hirsch/</a> <a href="#fnref:1" class="footnote-back-ref">↩</a></li>
<li id="fn:2">"The Algorithm". <a href="http://www.cs.tau.ac.il/~rshamir/algmb/98/scribe/html/lec02/node10.html" target="_blank">http://www.cs.tau.ac.il/~rshamir/algmb/98/scribe/html/lec02/node10.html</a> <a href="#fnref:2" class="footnote-back-ref">↩</a></li>
<li id="fn:3">Hirschberg, D. S. (1975). "A linear space algorithm for computing maximal common subsequences". Communications of the ACM. 18 (6): 341–343. CiteSeerX 10.1.1.348.4774. doi:10.1145/360825.360861. MR 0375829. S2CID 207694727. <a href="#fnref:3" class="footnote-back-ref">↩</a></li>
</ol>
