Pairwise summation

<h2 id="the-algorithm">The algorithm</h2>
In <a href="/facts/Pseudocode/XgN19duK">pseudocode</a>, the pairwise summation algorithm for an <a href="/facts/Array_data_type/gHJJ3XPw">array</a> x of length n ≥ 0 can be written:

s = pairwise(x[1...n])
    if n ≤ N        // base case: naive summation for a sufficiently small array
        s = 0
        for i = 1 to n
            s = s + x[i]
    else            // divide and conquer: recursively sum two halves of the array
        m = <a href="/facts/Floor_and_ceiling_functions/ssehyMMf">floor</a>(n / 2)
        s = pairwise(x[1...m]) + pairwise(x[m+1...n])
    end if

For some sufficiently small N, this algorithm switches to a naive loop-based summation as a <a href="/facts/Recursion/CxSKxJoW">base case</a>, whose error bound is O(Nε).<a class="footnote-ref" id="fnref:8" href="#fn:8">8</a> The entire sum has a worst-case error that grows asymptotically as O(ε log n) for large n, for a given condition number (see below).
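The pseudocode can be sketched in C roughly as follows. This is a minimal illustration, not the code of any particular library: the function name `pairwise_sum` and the `base` parameter (which plays the role of N) are assumptions made for the example, and indexing is zero-based as is usual in C.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sums x[0..n-1] by pairwise summation.
 * `base` plays the role of N in the pseudocode. */
double pairwise_sum(const double *x, size_t n, size_t base)
{
    assert(base >= 1); /* with base = 0, an array of length 1 would recurse forever */
    if (n <= base) {
        /* Base case: naive left-to-right loop over a short array. */
        double s = 0.0;
        for (size_t i = 0; i < n; ++i)
            s += x[i];
        return s;
    }
    /* Divide and conquer: recursively sum the two halves. */
    size_t m = n / 2; /* floor(n / 2) */
    return pairwise_sum(x, m, base) + pairwise_sum(x + m, n - m, base);
}
```

A call might look like `pairwise_sum(x, n, 128)`; the threshold 128 is only a placeholder, since a suitable base case size depends on the hardware and compiler.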
In an algorithm of this sort (as for <a href="/facts/Divide_and_conquer_algorithm/pC1X7Ws7">divide and conquer algorithms</a> in general<a class="footnote-ref" id="fnref:9" href="#fn:9">9</a>), it is desirable to use a larger base case in order to <a href="/facts/Amortized_analysis/9fFYenJz">amortize</a> the overhead of the recursion. If N = 1, then there is roughly one recursive subroutine call for every input, but more generally there is one recursive call for (roughly) every N/2 inputs if the recursion stops at exactly n = N. By making N sufficiently large, the overhead of recursion can be made negligible (precisely this technique of a large base case for recursive summation is employed by high-performance FFT implementations<a class="footnote-ref" id="fnref:10" href="#fn:10">10</a>).
Regardless of N, exactly n−1 additions are performed in total, the same as for naive summation, so if the recursion overhead is made negligible then pairwise summation has essentially the same computational cost as naive summation.
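The effect of the base case size on the call count can be checked with a small counting sketch. It walks the same recursion tree as the algorithm without touching any data; the type `pairwise_cost` and the function `count_pairwise_cost` are hypothetical names introduced only for this illustration, and the first term of each naive loop is counted as an initialization rather than an addition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping type, for this illustration only. */
typedef struct {
    size_t calls;     /* invocations of the recursive routine */
    size_t additions; /* floating-point additions that would be performed */
} pairwise_cost;

/* Walks the recursion tree of pairwise summation for an array of
 * length n with base case size `base`, accumulating into *cost. */
void count_pairwise_cost(size_t n, size_t base, pairwise_cost *cost)
{
    assert(base >= 1);
    cost->calls += 1;
    if (n <= base) {
        if (n > 1)
            cost->additions += n - 1; /* naive loop over n values */
        return;
    }
    size_t m = n / 2;
    count_pairwise_cost(m, base, cost);
    count_pairwise_cost(n - m, base, cost);
    cost->additions += 1; /* combine the two half sums */
}
```

Under these conventions, n = 1024 gives 2047 calls with `base = 1` but only 15 calls with `base = 128` (about one call per 68 inputs, close to N/2 = 64), while the addition count is 1023 in both cases.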
A variation on this idea is to break the sum into b blocks at each recursive stage, summing each block recursively, and then summing the results, which was dubbed a "superblock" algorithm by its proposers.<a class="footnote-ref" id="fnref:11" href="#fn:11">11</a> The above pairwise algorithm corresponds to b = 2 for every stage except for the last stage which is b = N.
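A simplified sketch of the blocked idea, using the same number of blocks b at every stage, might look like the following. It is not the formulation from the cited paper: the function name `blocked_sum`, the even split into blocks, and the `base` cutoff are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch only: splits x[0..n-1] into b nearly equal
 * blocks at every stage, sums each block recursively, then adds the
 * block results in order. Names and the splitting rule are hypothetical. */
double blocked_sum(const double *x, size_t n, size_t b, size_t base)
{
    assert(b >= 2 && base >= 1);
    if (n <= base) {
        /* Base case: naive loop over a short array. */
        double s = 0.0;
        for (size_t i = 0; i < n; ++i)
            s += x[i];
        return s;
    }
    double s = 0.0;
    size_t start = 0;
    for (size_t k = 1; k <= b; ++k) {
        /* Block k covers x[start..end-1]; k * n is assumed not to overflow. */
        size_t end = k * n / b;
        s += blocked_sum(x + start, end - start, b, base);
        start = end;
    }
    return s;
}
```

With `b = 2` this reduces to the two-way split of the pairwise algorithm; larger `b` gives a shallower recursion tree with more terms combined at each node.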
Dalton, Wang & Blainey (2014) describe an iterative, "shift-reduce" formulation for pairwise summation. It can be <a href="/facts/Loop_unrolling/uasvbJWE">unrolled</a> and sped up using <a href="/facts/SIMD/gQHWQSpo">SIMD</a> instructions. The non-unrolled version is:<a class="footnote-ref" id="fnref:12" href="#fn:12">12</a>

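#include <stddef.h>

double shift_reduce_sum(double *x, size_t n) {
  double stack[64], v;  // pending partial sums, one per set bit of the index
  size_t p = 0;         // number of entries currently on the stack
  for (size_t i = 0; i < n; ++i) {
    v = x[i];                                // shift
    for (size_t b = 1; i & b; b <<= 1, --p)  // reduce
      v += stack[p-1];
    stack[p++] = v;
  }
  double sum = 0.0;
  while (p)             // add up the remaining partial sums
    sum += stack[--p];
  return sum;
}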
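To see how the shift-reduce loop relates to the recursive algorithm, the sketch below places a copy of the routine next to a recursive reference with base case N = 1. The reference function `recursive_pairwise_sum` and the `assert` guard on the stack depth are additions made for this illustration. After index i has been processed, the stack holds one partial sum per set bit of i + 1, so when n is a power of two a single entry remains and the additions performed are the same as in the recursive tree, up to operand order.

```c
#include <assert.h>
#include <stddef.h>

/* Shift-reduce pairwise sum, following the listing above. */
double shift_reduce_sum(const double *x, size_t n)
{
    double stack[64], v;
    size_t p = 0;
    for (size_t i = 0; i < n; ++i) {
        v = x[i];                               /* shift */
        for (size_t b = 1; i & b; b <<= 1, --p) /* reduce */
            v += stack[p - 1];
        assert(p < 64); /* guard added for this sketch */
        stack[p++] = v;
    }
    double sum = 0.0;
    while (p)
        sum += stack[--p];
    return sum;
}

/* Hypothetical reference: recursive pairwise sum with base case N = 1. */
double recursive_pairwise_sum(const double *x, size_t n)
{
    if (n == 0)
        return 0.0;
    if (n == 1)
        return x[0];
    size_t m = n / 2;
    return recursive_pairwise_sum(x, m) + recursive_pairwise_sum(x + m, n - m);
}
```

For lengths that are not powers of two, the two routines group the terms differently, so their results can differ in the last bits even though both are pairwise-style sums.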

<h2 id="accuracy">Accuracy</h2>
Suppose that one is summing n values $x_i$, for i = 1, ..., n. The exact sum is:

$$S_n = \sum_{i=1}^{n} x_i$$

(computed with infinite precision).
With pairwise summation for a base case N = 1, one instead obtains $S_n + E_n$, where the error $E_n$ is bounded above by:<a class="footnote-ref" id="fnref:13" href="#fn:13">13</a>

$$|E_n| \leq \frac{\varepsilon \log_2 n}{1 - \varepsilon \log_2 n} \sum_{i=1}^{n} |x_i|$$

where ε is the <a href="/facts/Machine_precision/Ujsx0nA5">machine precision</a> of the arithmetic being employed (e.g. $\varepsilon \approx 10^{-16}$ for standard <a href="/facts/Double_precision/JYyXXYFM">double precision</a> floating point). Usually, the quantity of interest is the <a href="/facts/Relative_error/orSvvIMC">relative error</a> $|E_n|/|S_n|$, which is therefore bounded above by:

$$\frac{|E_n|}{|S_n|} \leq \frac{\varepsilon \log_2 n}{1 - \varepsilon \log_2 n} \left( \frac{\sum_{i=1}^{n} |x_i|}{\left| \sum_{i=1}^{n} x_i \right|} \right).$$

In the expression for the relative error bound, the fraction $\sum |x_i| / \left|\sum x_i\right|$ is the <a href="/facts/Condition_number/9Tjtk6wh">condition number</a> of the summation problem. Essentially, the condition number represents the intrinsic sensitivity of the summation problem to errors, regardless of how it is computed.<a class="footnote-ref" id="fnref:14" href="#fn:14">14</a> The relative error bound of every (<a href="/facts/Backwards_stable/jRDdo8zV">backwards stable</a>) summation method by a fixed algorithm in fixed precision (i.e. not those that use <a href="/facts/Arbitrary-precision_arithmetic/cXxc8yB2">arbitrary-precision arithmetic</a>, nor algorithms whose memory and time requirements change based on the data) is proportional to this condition number.<a class="footnote-ref" id="fnref:15" href="#fn:15">15</a> An ill-conditioned summation problem is one in which this ratio is large, and in this case even pairwise summation can have a large relative error. For example, if the summands $x_i$ are uncorrelated random numbers with zero mean, the sum is a <a href="/facts/Random_walk/08v0jmfv">random walk</a> and the condition number will grow proportional to $\sqrt{n}$. On the other hand, for random inputs with nonzero mean the condition number asymptotes to a finite constant as $n \to \infty$. If the inputs are all <a href="/facts/Non-negative/z0xollg1">non-negative</a>, then the condition number is 1.
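The condition number can be estimated directly from the data, as in the following sketch. The function name `summation_condition_number` is an assumption for illustration, and both sums are themselves accumulated naively in double precision, so the result is only an estimate when the problem is badly conditioned.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: estimates sum(|x_i|) / |sum(x_i)| for x[0..n-1].
 * The result is infinite or NaN when the computed sum is exactly zero. */
double summation_condition_number(const double *x, size_t n)
{
    assert(n > 0);
    double sum = 0.0;     /* running sum of x_i */
    double abs_sum = 0.0; /* running sum of |x_i| */
    for (size_t i = 0; i < n; ++i) {
        sum += x[i];
        abs_sum += (x[i] < 0.0) ? -x[i] : x[i];
    }
    double abs_of_sum = (sum < 0.0) ? -sum : sum;
    return abs_sum / abs_of_sum;
}
```

For the non-negative inputs {1, 2, 3} this gives 1, while for {1, −1, 2} it gives 4/2 = 2, reflecting the cancellation between the first two terms.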
Note that the $1 - \varepsilon \log_2 n$ denominator is effectively 1 in practice, since $\varepsilon \log_2 n$ is much smaller than 1 until n becomes of order $2^{1/\varepsilon}$, which is roughly $10^{10^{15}}$ in double precision.
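As a rough check of these magnitudes, take ε ≈ 10⁻¹⁶ for double precision and, as an illustrative assumption, a sum of n = 10⁹ terms:

```latex
\begin{align*}
\varepsilon \log_2 n &\approx 10^{-16} \times \log_2 10^{9}
    \approx 10^{-16} \times 29.9 \approx 3 \times 10^{-15} \ll 1, \\
2^{1/\varepsilon} &\approx 2^{10^{16}} = 10^{10^{16} \log_{10} 2}
    \approx 10^{3 \times 10^{15}}.
\end{align*}
```

So even for a billion terms the denominator differs from 1 by only a few parts in 10¹⁵, and the value of n at which it would matter is a number with on the order of 10¹⁵ decimal digits.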
In comparison, the relative error bound for naive summation (simply adding the numbers in sequence, rounding at each step) grows as $O(\varepsilon n)$ multiplied by the condition number.<a class="footnote-ref" id="fnref:16" href="#fn:16">16</a> In practice, it is much more likely that the rounding errors have a random sign, with zero mean, so that they form a random walk; in this case, naive summation has a <a href="/facts/Root_mean_square/Cfmg7dws">root mean square</a> relative error that grows as $O(\varepsilon \sqrt{n})$ and pairwise summation has an error that grows as $O(\varepsilon \sqrt{\log n})$ on average.<a class="footnote-ref" id="fnref:17" href="#fn:17">17</a>
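The difference between the two methods is easiest to observe in single precision, where ε is much larger. The sketch below provides a naive and a pairwise sum over `float` data together with a helper that measures the error against a double-precision reference. All three function names are assumptions made for this illustration, and the comparison assumes that `float` arithmetic is actually rounded to single precision at each step (as on typical x86-64 and ARM64 builds without fast-math options).

```c
#include <assert.h>
#include <stddef.h>

/* Naive left-to-right summation in single precision. */
float naive_sum_f(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Pairwise summation in single precision; `base` plays the role of N. */
float pairwise_sum_f(const float *x, size_t n, size_t base)
{
    assert(base >= 1);
    if (n <= base)
        return naive_sum_f(x, n);
    size_t m = n / 2;
    return pairwise_sum_f(x, m, base) + pairwise_sum_f(x + m, n - m, base);
}

/* Absolute error of a float result against a double reference value. */
double abs_error(float computed, double reference)
{
    double d = (double)computed - reference;
    return (d < 0.0) ? -d : d;
}
```

Filling an array with a million copies of `0.1f`, accumulating a reference sum in `double`, and comparing `abs_error(naive_sum_f(x, n), ref)` with `abs_error(pairwise_sum_f(x, n, 8), ref)` typically shows the naive error to be larger by several orders of magnitude. This is a single data set with a systematic rounding pattern, not a statistical study of the root mean square behaviour.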

<h2 id="software-implementations">Software implementations</h2>
Pairwise summation is the default summation algorithm in <a href="/facts/NumPy/T6FnhWWD">NumPy</a><a class="footnote-ref" id="fnref:18" href="#fn:18">18</a> and the <a href="/facts/Julia_(programming_language)/AoB0PJ9C">Julia technical-computing language</a>,<a class="footnote-ref" id="fnref:19" href="#fn:19">19</a> where in both cases it was found to have comparable speed to naive summation (thanks to the use of a large base case).
Other software implementations include the HPCsharp library<a class="footnote-ref" id="fnref:20" href="#fn:20">20</a> for the <a href="/facts/C_Sharp_(programming_language)/UC2gxeyb">C#</a> language and the standard library summation<a class="footnote-ref" id="fnref:21" href="#fn:21">21</a> in <a href="/facts/D_(programming_language)/4qIzRjkF">D</a>.

<h2 id="references">References</h2>

<ol>
<li id="fn:1">Higham, Nicholas J. (1993), "The accuracy of floating point summation", SIAM Journal on Scientific Computing, 14 (4): 783–799, Bibcode:1993SJSC...14..783H, CiteSeerX 10.1.1.43.3535, doi:10.1137/0914050. <a href="#fnref:1" class="footnote-back-ref">↩</a></li>
<li id="fn:2">Higham, Nicholas J. (1993), "The accuracy of floating point summation", SIAM Journal on Scientific Computing, 14 (4): 783–799, Bibcode:1993SJSC...14..783H, CiteSeerX 10.1.1.43.3535, doi:10.1137/0914050. <a href="#fnref:2" class="footnote-back-ref">↩</a></li>
<li id="fn:3">Higham, Nicholas J. (1993), "The accuracy of floating point summation", SIAM Journal on Scientific Computing, 14 (4): 783–799, Bibcode:1993SJSC...14..783H, CiteSeerX 10.1.1.43.3535, doi:10.1137/0914050. <a href="#fnref:3" class="footnote-back-ref">↩</a></li>
<li id="fn:4">Higham, Nicholas J. (1993), "The accuracy of floating point summation", SIAM Journal on Scientific Computing, 14 (4): 783–799, Bibcode:1993SJSC...14..783H, CiteSeerX 10.1.1.43.3535, doi:10.1137/0914050. <a href="#fnref:4" class="footnote-back-ref">↩</a></li>
<li id="fn:5">Manfred Tasche and Hansmartin Zeuner, Handbook of Analytic-Computational Methods in Applied Mathematics (Boca Raton, FL: CRC Press, 2000). <a href="#fnref:5" class="footnote-back-ref">↩</a></li>
<li id="fn:6">Manfred Tasche and Hansmartin Zeuner, Handbook of Analytic-Computational Methods in Applied Mathematics (Boca Raton, FL: CRC Press, 2000). <a href="#fnref:6" class="footnote-back-ref">↩</a></li>
<li id="fn:7">S. G. Johnson and M. Frigo, "Implementing FFTs in practice," in Fast Fourier Transforms, edited by C. Sidney Burrus (2008). <a href="http://cnx.org/content/m16336/latest/" target="_blank">http://cnx.org/content/m16336/latest/</a> <a href="#fnref:7" class="footnote-back-ref">↩</a></li>
<li id="fn:8">Higham, Nicholas (2002). Accuracy and Stability of Numerical Algorithms (2 ed). SIAM. pp. 81–82. <a href="#fnref:8" class="footnote-back-ref">↩</a></li>
<li id="fn:9">Radu Rugina and Martin Rinard, "Recursion unrolling for divide and conquer programs," in Languages and Compilers for Parallel Computing, chapter 3, pp. 34–48. Lecture Notes in Computer Science vol. 2017 (Berlin: Springer, 2001). <a href="http://people.csail.mit.edu/rinard/paper/lcpc00.pdf" target="_blank">http://people.csail.mit.edu/rinard/paper/lcpc00.pdf</a> <a href="#fnref:9" class="footnote-back-ref">↩</a></li>
<li id="fn:10">S. G. Johnson and M. Frigo, "Implementing FFTs in practice," in Fast Fourier Transforms, edited by C. Sidney Burrus (2008). <a href="http://cnx.org/content/m16336/latest/" target="_blank">http://cnx.org/content/m16336/latest/</a> <a href="#fnref:10" class="footnote-back-ref">↩</a></li>
<li id="fn:11">Anthony M. Castaldo, R. Clint Whaley, and Anthony T. Chronopoulos, "Reducing floating-point error in dot product using the superblock family of algorithms," SIAM J. Sci. Comput., vol. 32, pp. 1156–1174 (2008). <a href="#fnref:11" class="footnote-back-ref">↩</a></li>
<li id="fn:12">Dalton, Barnaby; Wang, Amy; Blainey, Bob (16 February 2014). SIMDizing pairwise sums: a summation algorithm balancing accuracy with throughput. 2014 Workshop on Programming Models for SIMD/Vector Processing - WPMVP ’14. pp. 65–70. doi:10.1145/2568058.2568070. <a href="#fnref:12" class="footnote-back-ref">↩</a></li>
<li id="fn:13">Higham, Nicholas J. (1993), "The accuracy of floating point summation", SIAM Journal on Scientific Computing, 14 (4): 783–799, Bibcode:1993SJSC...14..783H, CiteSeerX 10.1.1.43.3535, doi:10.1137/0914050. <a href="#fnref:13" class="footnote-back-ref">↩</a></li>
<li id="fn:14">L. N. Trefethen and D. Bau, Numerical Linear Algebra (SIAM: Philadelphia, 1997). <a href="#fnref:14" class="footnote-back-ref">↩</a></li>
<li id="fn:15">Higham, Nicholas J. (1993), "The accuracy of floating point summation", SIAM Journal on Scientific Computing, 14 (4): 783–799, Bibcode:1993SJSC...14..783H, CiteSeerX 10.1.1.43.3535, doi:10.1137/0914050. <a href="#fnref:15" class="footnote-back-ref">↩</a></li>
<li id="fn:16">Higham, Nicholas J. (1993), "The accuracy of floating point summation", SIAM Journal on Scientific Computing, 14 (4): 783–799, Bibcode:1993SJSC...14..783H, CiteSeerX 10.1.1.43.3535, doi:10.1137/0914050. <a href="#fnref:16" class="footnote-back-ref">↩</a></li>
<li id="fn:17">Manfred Tasche and Hansmartin Zeuner, Handbook of Analytic-Computational Methods in Applied Mathematics (Boca Raton, FL: CRC Press, 2000). <a href="#fnref:17" class="footnote-back-ref">↩</a></li>
<li id="fn:18">ENH: implement pairwise summation, github.com/numpy/numpy pull request #3685 (September 2013). <a href="https://github.com/numpy/numpy/pull/3685" target="_blank">https://github.com/numpy/numpy/pull/3685</a> <a href="#fnref:18" class="footnote-back-ref">↩</a></li>
<li id="fn:19">RFC: use pairwise summation for sum, cumsum, and cumprod, github.com/JuliaLang/julia pull request #4039 (August 2013). <a href="https://github.com/JuliaLang/julia/pull/4039" target="_blank">https://github.com/JuliaLang/julia/pull/4039</a> <a href="#fnref:19" class="footnote-back-ref">↩</a></li>
<li id="fn:20">HPCsharp, NuGet package of high-performance C# algorithms. <a href="https://github.com/DragonSpit/HPCsharp" target="_blank">https://github.com/DragonSpit/HPCsharp</a> <a href="#fnref:20" class="footnote-back-ref">↩</a></li>
<li id="fn:21">"std.algorithm.iteration - D Programming Language". dlang.org. Retrieved 2021-04-23. <a href="https://dlang.org/phobos/std_algorithm_iteration.html#sum" target="_blank">https://dlang.org/phobos/std_algorithm_iteration.html#sum</a> <a href="#fnref:21" class="footnote-back-ref">↩</a></li>
</ol>
