Multiply–accumulate operation

<h2 id="in-floating-point-arithmetic">In floating-point arithmetic</h2>
When done with <a href="/facts/Integer/PUwwrawx">integers</a>, the operation is typically exact (computed <a href="/facts/Modular_arithmetic/0wtRa5dU">modulo</a> some <a href="/facts/Power_of_two/MA6WEpTt">power of two</a>). However, <a href="/facts/Floating-point_arithmetic/eIckahxe">floating-point</a> numbers have only finite <a href="/facts/Arithmetic_precision/RUfYtpgo">precision</a>, so each operation may have to round its result. As a consequence, digital floating-point arithmetic is generally not <a href="/facts/Associativity/EQX8lV6r">associative</a> or <a href="/facts/Distributivity/0LNBrVJl">distributive</a>. (See <a href="/facts/Floating-point_arithmetic/eIckahxe">Floating-point arithmetic § Accuracy problems</a>.)
Therefore, it makes a difference to the result whether the multiply–add is performed with two roundings, or in one operation with a single rounding (a fused multiply–add). <a href="/facts/IEEE_754-2008/gqqJPoYC">IEEE 754-2008</a> specifies that it must be performed with one rounding, yielding a more accurate result.<a class="footnote-ref" id="fnref:6" href="#fn:6">6</a>
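The loss of associativity can be observed with three additions. The following is a minimal sketch, assuming IEEE 754 binary64 doubles with round-to-nearest; the function names sum_left and sum_right are hypothetical, and the volatile temporaries are only there to force each intermediate sum to be rounded to double.

```c
/* Computes (a + b) + c, rounding the inner sum to double first. */
double sum_left(double a, double b, double c) {
    volatile double t = a + b;   /* volatile forces a rounded double here */
    return t + c;
}

/* Computes a + (b + c), rounding the inner sum to double first. */
double sum_right(double a, double b, double c) {
    volatile double t = b + c;   /* volatile forces a rounded double here */
    return a + t;
}
```

Under those assumptions, sum_left(0x1p60, -0x1p60, 1.0) returns 1, whereas sum_right with the same arguments returns 0, because 1 is far below the spacing of doubles near 2<sup>60</sup> and is lost when it is added to −2<sup>60</sup>.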

<h2 id="fused-multiplyadd">Fused multiply–add</h2>
A fused multiply–add (FMA or fmadd)<a class="footnote-ref" id="fnref:7" href="#fn:7">7</a> is a floating-point multiply–add operation performed in one step (<a href="/facts/Fused_operation/7TnU5LKu">fused operation</a>), with a single rounding. That is, where an unfused multiply–add would compute the product b × c, round it to N significant bits, add the result to a, and round the sum to N significant bits again, a fused multiply–add computes the entire expression a + (b × c) to its full precision before rounding the final result to N significant bits.
A fast FMA can speed up and improve the accuracy of many computations that involve the accumulation of products:

<ul><li><a href="/facts/Dot_product/tNz8MLjT">Dot product</a></li>
<li><a href="/facts/Matrix_multiplication/lYDB6Ro2">Matrix multiplication</a></li>
<li><a href="/facts/Polynomial_evaluation/wWhhHnrL">Polynomial evaluation</a> (e.g., with <a href="/facts/Horner%2527s_rule/KIBtC85Q">Horner's rule</a>)</li>
<li><a href="/facts/Newton%2527s_method/VlRhI2FL">Newton's method</a> for evaluating functions (from the inverse function)</li>
<li><a href="/facts/Convolutions/PVPrdz9J">Convolutions</a> and <a href="/facts/Artificial_neural_networks/6V1jMlkx">artificial neural networks</a></li>
<li>Multiplication in <a href="/facts/Quadruple-precision_floating-point_format/50oPmapW">double-double arithmetic</a></li></ul>
Fused multiply–add can usually be relied on to give more accurate results. However, <a href="/facts/William_Morton_Kahan/J1GNQDyN">William Kahan</a> has pointed out that it can cause problems if used unthinkingly.<a class="footnote-ref" id="fnref:8" href="#fn:8">8</a> If x<sup>2</sup> − y<sup>2</sup> is evaluated as ((x × x) − y × y) (following Kahan's suggested notation, in which redundant parentheses direct the compiler to round the (x × x) term first) using fused multiply–add, then the result may be negative even when x = y, because the first multiplication discards low-order bits while the second product enters the subtraction at full precision. This can lead to an error if, for instance, the square root of the result is then evaluated.
When implemented inside a <a href="/facts/Microprocessor/Fx9CaQht">microprocessor</a>, an FMA can be faster than a multiply operation followed by an add. However, standard industrial implementations based on the original IBM RS/6000 design require a 2N-bit adder to compute the sum properly.<a class="footnote-ref" id="fnref:9" href="#fn:9">9</a>
Another benefit of including this instruction is that it allows an efficient software implementation of <a href="/facts/Division_(mathematics)/FqLKNkmq">division</a> (see <a href="/facts/Division_algorithm/8xc3I1sJ">division algorithm</a>) and <a href="/facts/Square_root/AfrzfBdQ">square root</a> (see <a href="/facts/Methods_of_computing_square_roots/dRG2BtBp">methods of computing square roots</a>) operations, thus eliminating the need for dedicated hardware for those operations.<a class="footnote-ref" id="fnref:10" href="#fn:10">10</a>

<h3>Dot product instruction</h3>
Some machines combine multiple fused multiply–add operations into a single step, e.g. performing a four-element dot product on two 128-bit <a href="/facts/Single_instruction%2c_multiple_data/gQHWQSpo">SIMD</a> registers, a<sub>0</sub>×b<sub>0</sub> + a<sub>1</sub>×b<sub>1</sub> + a<sub>2</sub>×b<sub>2</sub> + a<sub>3</sub>×b<sub>3</sub>, with single-cycle throughput.

<h3>Support</h3>
The FMA operation is included in <a href="/facts/IEEE_754-2008/gqqJPoYC">IEEE 754-2008</a>.
The <a href="/facts/C99/o5uMWjTQ">1999 standard</a> of the <a href="/facts/C_(programming_language)/Ky2No763">C programming language</a> supports the FMA operation through the fma() standard math library function and the automatic transformation of a multiplication followed by an addition (contraction of floating-point expressions), which can be explicitly enabled or disabled with standard pragmas (#pragma STDC FP_CONTRACT). The <a href="/facts/GNU_Compiler_Collection/t07jh01k">GCC</a> and <a href="/facts/Clang/MH7cjC67">Clang</a> C compilers do such transformations by default for processor architectures that support FMA instructions. With GCC, which does not support the aforementioned pragma,<a class="footnote-ref" id="fnref:11" href="#fn:11">11</a> this can be globally controlled by the -ffp-contract command line option.<a class="footnote-ref" id="fnref:12" href="#fn:12">12</a>
The fused multiply–add operation was introduced as "multiply–add fused" in the IBM <a href="/facts/POWER1/c3WdDTXs">POWER1</a> (1990) processor,<a class="footnote-ref" id="fnref:13" href="#fn:13">13</a> and has since been added to numerous processors:

<ul><li>IBM <a href="/facts/POWER1/c3WdDTXs">POWER1</a> (1990)</li>
<li><a href="/facts/Hewlett-Packard/abuNSYNu">HP</a> <a href="/facts/PA-8000/f5sE1Izr">PA-8000</a> (1996) and above</li>
<li><a href="/facts/Hitachi%2c_Ltd./KeK38e82">Hitachi</a> <a href="/facts/SuperH/4GTLXc1N">SuperH SH-4</a> (1998)</li>
<li><a href="/facts/IBM/QszA7nwd">IBM</a> <a href="/facts/Z%2fArchitecture/FcKYkGn9">z/Architecture</a> (since 1998)</li>
<li><a href="/facts/Sony_Computer_Entertainment/tnH01NqQ">SCE</a>-<a href="/facts/Toshiba/QHRaw8Tr">Toshiba</a> <a href="/facts/Emotion_Engine/y9npRux0">Emotion Engine</a> (1999)</li>
<li>Intel <a href="/facts/Itanium/0wkjSDpe">Itanium</a> (2001)</li>
<li>STI <a href="/facts/Cell_(microprocessor)/p2MIQjMc">Cell</a> (2006)</li>
<li><a href="/facts/Fujitsu/6e5gvNSa">Fujitsu</a> <a href="/facts/SPARC64_VI/LSGPFyC9">SPARC64 VI</a> (2007) and above</li>
<li>(<a href="/facts/MIPS_architecture/efV4KgK4">MIPS</a>-compatible) <a href="/facts/Loongson/6AhDhaeU">Loongson</a>-2F (2008)<a class="footnote-ref" id="fnref:14" href="#fn:14">14</a></li>
<li><a href="/facts/RISC-V/pUSKDat1">RISC-V</a> instruction set (2010)</li>
<li>ARM processors with VFPv4 and/or NEONv2:
<ul><li><a href="/facts/ARM_Cortex-M4F/prW99nkV">ARM Cortex-M4F</a> (2010)</li>
<li>STM32 Cortex-M33 (VFMA operation)<a class="footnote-ref" id="fnref:15" href="#fn:15">15</a></li>
<li><a href="/facts/ARM_Cortex-A5/F8QS0Phr">ARM Cortex-A5</a> (2012)</li>
<li><a href="/facts/ARM_Cortex-A7_MPCore/Ch36KljD">ARM Cortex-A7</a> (2013)</li>
<li><a href="/facts/ARM_Cortex-A15_MPCore/z1Vccoow">ARM Cortex-A15</a> (2012)</li>
<li><a href="/facts/Krait_(CPU)/fIjdkwQx">Qualcomm Krait</a> (2012)</li>
<li><a href="/facts/Apple_A6/vbIa3hBp">Apple A6</a> (2012)</li>
<li>All <a href="/facts/ARM_architecture/SALsYA79">ARMv8</a> processors
<ul><li><a href="/facts/Fujitsu_A64FX/k5K0NoAt">Fujitsu A64FX</a> has "Four-operand FMA with Prefix Instruction".</li></ul></li></ul></li>
<li>x86 processors with <a href="/facts/FMA_instruction_set/mqglSHqk">FMA3 and/or FMA4 instruction set</a>
<ul><li>AMD <a href="/facts/Bulldozer_(processor)/fUPzoGrd">Bulldozer</a> (2011, FMA4 only)</li>
<li>AMD <a href="/facts/Piledriver_(microarchitecture)/Fhxly06H">Piledriver</a> (2012, FMA3 and FMA4)<a class="footnote-ref" id="fnref:16" href="#fn:16">16</a></li>
<li><a href="/facts/Intel_Haswell/C19Wlzhm">Intel Haswell</a> (2013, FMA3 only)<a class="footnote-ref" id="fnref:17" href="#fn:17">17</a></li>
<li>AMD <a href="/facts/Steamroller_(microarchitecture)/SxtLaauz">Steamroller</a> (2014, FMA3 and FMA4)</li>
<li>AMD <a href="/facts/Excavator_(microarchitecture)/Bzj9WKGV">Excavator</a> (2015, FMA3 and FMA4)</li>
<li>Intel <a href="/facts/Skylake_(microarchitecture)/qpcDcebM">Skylake</a> (2015, FMA3 only)</li>
<li>AMD <a href="/facts/Zen_(microarchitecture)/AjBvqDT7">Zen</a> (2017, FMA3 only)</li></ul></li>
<li><a href="/facts/Elbrus-8S/1XmvIkL2">Elbrus-8SV</a> (2018)</li>
<li>GPUs and GPGPU boards:
<ul><li><a href="/facts/List_of_AMD_graphics_processing_units/jaVxfufu">AMD GPUs</a> (2009) and newer
<ul><li><a href="/facts/TeraScale_(microarchitecture)/Yt6jwq9b">TeraScale 2 "Evergreen"</a>-series based</li>
<li><a href="/facts/Graphics_Core_Next/WYNvhDUP">Graphics Core Next</a>-based</li></ul></li>
<li><a href="/facts/List_of_Nvidia_graphics_processing_units/AXdfrvKo">Nvidia GPUs</a> (2010) and newer
<ul><li><a href="/facts/Fermi_(microarchitecture)/6YGKyDNQ">Fermi</a>-based (2010)</li>
<li><a href="/facts/Kepler_(microarchitecture)/NU6RAHIQ">Kepler</a>-based (2012)</li>
<li><a href="/facts/Maxwell_(microarchitecture)/wjnyCCAN">Maxwell</a>-based (2014)</li>
<li><a href="/facts/Pascal_(microarchitecture)/GnmTVMlb">Pascal</a>-based (2016)</li>
<li><a href="/facts/Volta_(microarchitecture)/bxD4hpFL">Volta</a>-based (2017)</li></ul></li>
<li>Intel GPUs since <a href="/facts/Intel_HD_and_Iris_Graphics/p87h2lSE">Sandy Bridge</a></li>
<li><a href="/facts/Intel_MIC/Fv5dhyKc">Intel MIC</a> (2012)</li>
<li><a href="/facts/Mali_(GPU)/hHeuRwXb">ARM Mali T600 Series</a> (2012) and above</li></ul></li>
<li>Vector Processors:
<ul><li><a href="/facts/NEC_SX-Aurora_TSUBASA/DRBhNFDU">NEC SX-Aurora TSUBASA</a></li></ul></li></ul>
<h2 id="see-also">See also</h2>
<ul><li><a href="/facts/Compound_operator_(computing)/7TnU5LKu">Compound operator</a></li></ul>

<h2 id="references">References</h2>

<ol>
<li id="fn:1">"The Feasibility of Ludgate's Analytical Machine". Archived from the original on 2019-08-07. Retrieved 2020-08-30. <a href="http://www.fano.co.uk/ludgate/" target="_blank">http://www.fano.co.uk/ludgate/</a> <a href="#fnref:1" class="footnote-back-ref">↩</a></li>
<li id="fn:2">Lyakhov, Pavel; Valueva, Maria; Valuev, Georgii; Nagornov, Nikolai (January 2020). "A Method of Increasing Digital Filter Performance Based on Truncated Multiply-Accumulate Units". Applied Sciences. 10 (24): 9052. doi:10.3390/app10249052. <a href="https://doi.org/10.3390%2Fapp10249052" target="_blank">https://doi.org/10.3390%2Fapp10249052</a> <a href="#fnref:2" class="footnote-back-ref">↩</a></li>
<li id="fn:3">Tung Thanh Hoang; Sjalander, M.; Larsson-Edefors, P. (May 2009). "Double Throughput Multiply-Accumulate unit for FlexCore processor enhancements". 2009 IEEE International Symposium on Parallel & Distributed Processing. pp. 1–7. doi:10.1109/IPDPS.2009.5161212. ISBN 978-1-4244-3751-1. S2CID 14535090. <a href="#fnref:3" class="footnote-back-ref">↩</a></li>
<li id="fn:4">Kang, Jongsung; Kim, Taewhan (2020-03-01). "PV-MAC: Multiply-and-accumulate unit structure exploiting precision variability in on-device convolutional neural networks". Integration. 71: 76–85. doi:10.1016/j.vlsi.2019.11.003. ISSN 0167-9260. S2CID 211264132. <a href="https://www.sciencedirect.com/science/article/abs/pii/S0167926019302809" target="_blank">https://www.sciencedirect.com/science/article/abs/pii/S0167926019302809</a> <a href="#fnref:4" class="footnote-back-ref">↩</a></li>
<li id="fn:5">"mad - ps". 20 November 2019. Retrieved 2021-08-14. <a href="https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/mad---ps" target="_blank">https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/mad---ps</a> <a href="#fnref:5" class="footnote-back-ref">↩</a></li>
<li id="fn:6">Whitehead, Nathan; Fit-Florea, Alex (2011). "Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs" (PDF). nvidia. Retrieved 2013-08-31. <a href="https://developer.nvidia.com/sites/default/files/akamai/cuda/files/NVIDIA-CUDA-Floating-Point.pdf" target="_blank">https://developer.nvidia.com/sites/default/files/akamai/cuda/files/NVIDIA-CUDA-Floating-Point.pdf</a> <a href="#fnref:6" class="footnote-back-ref">↩</a></li>
<li id="fn:7">"fmadd instrs". IBM. <a href="https://www.ibm.com/support/knowledgecenter/ssw_aix_61/com.ibm.aix.alangref/idalangref_fmadd_instrs.htm" target="_blank">https://www.ibm.com/support/knowledgecenter/ssw_aix_61/com.ibm.aix.alangref/idalangref_fmadd_instrs.htm</a> <a href="#fnref:7" class="footnote-back-ref">↩</a></li>
<li id="fn:8">Kahan, William (1996-05-31). "IEEE Standard 754 for Binary Floating-Point Arithmetic". <a href="#fnref:8" class="footnote-back-ref">↩</a></li>
<li id="fn:9">Quinnell, Eric (May 2007). Floating-Point Fused Multiply–Add Architectures (PDF) (PhD thesis). Retrieved 2011-03-28. <a href="http://repositories.lib.utexas.edu/bitstream/handle/2152/3082/quinnelle60861.pdf" target="_blank">http://repositories.lib.utexas.edu/bitstream/handle/2152/3082/quinnelle60861.pdf</a> <a href="#fnref:9" class="footnote-back-ref">↩</a></li>
<li id="fn:10">Markstein, Peter (November 2004). Software Division and Square Root Using Goldschmidt's Algorithms (PDF). 6th Conference on Real Numbers and Computers. CiteSeerX 10.1.1.85.9648. <a href="http://www.informatik.uni-trier.de/Reports/TR-08-2004/rnc6_12_markstein.pdf" target="_blank">http://www.informatik.uni-trier.de/Reports/TR-08-2004/rnc6_12_markstein.pdf</a> <a href="#fnref:10" class="footnote-back-ref">↩</a></li>
<li id="fn:11">"Bug 20785 - Pragma STDC * (C99 FP) unimplemented". gcc.gnu.org. Retrieved 2022-02-02. <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=20785" target="_blank">https://gcc.gnu.org/bugzilla/show_bug.cgi?id=20785</a> <a href="#fnref:11" class="footnote-back-ref">↩</a></li>
<li id="fn:12">"Optimize Options (Using the GNU Compiler Collection (GCC))". gcc.gnu.org. Retrieved 2022-02-02. <a href="https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html" target="_blank">https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html</a> <a href="#fnref:12" class="footnote-back-ref">↩</a></li>
<li id="fn:13">Montoye, R. K.; Hokenek, E.; Runyon, S. L. (January 1990). "Design of the IBM RISC System/6000 floating-point execution unit". IBM Journal of Research and Development. 34 (1): 59–70. doi:10.1147/rd.341.0059. <a href="#fnref:13" class="footnote-back-ref">↩</a></li>
<li id="fn:14">"Godson-3 Emulates x86: New MIPS-Compatible Chinese Processor Has Extensions for x86 Translation". <a href="http://www.mdronline.com/mpr/h/2008/1103/224401.html" target="_blank">http://www.mdronline.com/mpr/h/2008/1103/224401.html</a> <a href="#fnref:14" class="footnote-back-ref">↩</a></li>
<li id="fn:15">"STM32 Cortex-M33 MCUs programming manual" (PDF). ST. Retrieved 2024-05-06. <a href="https://www.st.com/resource/en/programming_manual/pm0264-stm32-cortexm33-mcus-programming-manual-stmicroelectronics.pdf" target="_blank">https://www.st.com/resource/en/programming_manual/pm0264-stm32-cortexm33-mcus-programming-manual-stmicroelectronics.pdf</a> <a href="#fnref:15" class="footnote-back-ref">↩</a></li>
<li id="fn:16">Hollingsworth, Brent (October 2012). "New "Bulldozer" and "Piledriver" Instructions". AMD Developer Central. <a href="https://developer.amd.com/resources/developer-guides-manuals/new-bulldozer-and-piledriver-instructions/" target="_blank">https://developer.amd.com/resources/developer-guides-manuals/new-bulldozer-and-piledriver-instructions/</a> <a href="#fnref:16" class="footnote-back-ref">↩</a></li>
<li id="fn:17">"Intel adds 22nm octo-core 'Haswell' to CPU design roadmap". The Register. Archived from the original on 2012-02-17. Retrieved 2008-08-19. <a href="https://web.archive.org/web/20120217051330/http://www.reghardware.com/2008/08/19/idf_intel_architecture_roadmap/" target="_blank">https://web.archive.org/web/20120217051330/http://www.reghardware.com/2008/08/19/idf_intel_architecture_roadmap/</a> <a href="#fnref:17" class="footnote-back-ref">↩</a></li>
</ol>
