For large positive values of the parameter α > 0, the following formulation is a smooth, differentiable approximation of the maximum function. For negative values of the parameter that are large in absolute value, it approximates the minimum:

S_α(x_1, …, x_n) = (∑_{i=1}^{n} x_i e^{α x_i}) / (∑_{i=1}^{n} e^{α x_i})
S_α has the following properties:

- S_α → max as α → ∞
- S_0 is the arithmetic mean of its inputs
- S_α → min as α → −∞
The gradient of S_α is closely related to softmax and is given by

∇_{x_i} S_α(x_1, …, x_n) = (e^{α x_i} / ∑_{j=1}^{n} e^{α x_j}) · [1 + α (x_i − S_α(x_1, …, x_n))].
This makes the smooth maximum function useful for optimization techniques that use gradient descent.
This operator is sometimes called the Boltzmann operator,[1] after the Boltzmann distribution.
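A minimal Python sketch of the Boltzmann operator and its gradient (the function names are mine, not from the literature):

```python
import math

def boltzmann(xs, alpha):
    """Smooth maximum S_alpha: the exp(alpha*x)-weighted average of the inputs.
    Large alpha > 0 approximates max; large negative alpha approximates min;
    alpha = 0 gives the arithmetic mean."""
    weights = [math.exp(alpha * x) for x in xs]
    return sum(w * x for w, x in zip(weights, xs)) / sum(weights)

def boltzmann_grad(xs, alpha):
    """Partial derivatives of S_alpha: each component is a softmax weight
    times the correction factor 1 + alpha * (x_i - S_alpha)."""
    s = boltzmann(xs, alpha)
    weights = [math.exp(alpha * x) for x in xs]
    total = sum(weights)
    return [(w / total) * (1 + alpha * (x - s)) for w, x in zip(weights, xs)]
```

For large α the weights concentrate on the largest input, so the result approaches max(xs). Note that exp(α·x) can overflow for large arguments; production code would subtract max(xs) from every input first.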
Main article: LogSumExp
Another smooth maximum is LogSumExp:

LSE_α(x_1, …, x_n) = (1/α) log(e^{α x_1} + ⋯ + e^{α x_n})
This can also be normalized if the x_i are all non-negative, yielding a function with domain [0, ∞)^n and range [0, ∞):

g(x_1, …, x_n) = log(e^{x_1} + ⋯ + e^{x_n} − (n − 1))
The (n − 1) term corrects for the fact that exp(0) = 1 by canceling out all but one zero exponential, and log 1 = 0 if all x_i are zero.
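In Python this might look as follows (a sketch; the max-shift for numerical stability assumes α > 0, and the function names are mine):

```python
import math

def logsumexp(xs, alpha=1.0):
    """LSE_alpha(x) = (1/alpha) * log(sum_i exp(alpha * x_i)).
    Shifting by the maximum avoids overflow (valid for alpha > 0)."""
    m = max(xs)
    return m + math.log(sum(math.exp(alpha * (x - m)) for x in xs)) / alpha

def normalized_lse(xs):
    """log(sum_i exp(x_i) - (n - 1)) for non-negative inputs; the (n - 1)
    term cancels all but one exp(0) = 1, so the result is 0 at the origin."""
    return math.log(sum(math.exp(x) for x in xs) - (len(xs) - 1))
```

Note that LSE_α always over-estimates the maximum by at most log(n)/α, which is why the mellowmax normalization below divides by n.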
The mellowmax operator[2] is defined as follows:

mm_α(x_1, …, x_n) = (1/α) log((1/n) ∑_{i=1}^{n} e^{α x_i})
It is a non-expansive operator. As α → ∞, it acts like a maximum. As α → 0, it acts like an arithmetic mean. As α → −∞, it acts like a minimum. This operator can be viewed as a particular instantiation of the quasi-arithmetic mean. It can also be derived from information theoretical principles as a way of regularizing policies with a cost function defined by KL divergence. The operator has previously been utilized in other areas, such as power engineering.[3]
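A small Python sketch of mellowmax (the max-shift is a standard overflow guard for α > 0; the function name is mine):

```python
import math

def mellowmax(xs, alpha):
    """mm_alpha(x) = log((1/n) * sum_i exp(alpha * x_i)) / alpha.
    Shifting by the maximum keeps exp() from overflowing (assumes alpha > 0)."""
    m = max(xs)
    return m + math.log(sum(math.exp(alpha * (x - m)) for x in xs) / len(xs)) / alpha
```

Unlike plain LogSumExp, the 1/n normalization keeps mm_α between the minimum and maximum of the inputs, so it interpolates between min, mean, and max as α varies.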
Main article: P-norm
Another smooth maximum is the p-norm:

‖(x_1, …, x_n)‖_p = (∑_{i=1}^{n} |x_i|^p)^{1/p}
which converges to ‖(x_1, …, x_n)‖_∞ = max_{1 ≤ i ≤ n} |x_i| as p → ∞.
An advantage of the p-norm is that it is a norm. As such it is scale invariant (homogeneous): ‖(λx_1, …, λx_n)‖_p = |λ| · ‖(x_1, …, x_n)‖_p, and it satisfies the triangle inequality.
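As an illustrative Python sketch (factoring out the largest magnitude is a common overflow guard; the function name is mine):

```python
def p_norm(xs, p):
    """(sum_i |x_i|^p)^(1/p); tends to the max-norm max_i |x_i| as p grows."""
    m = max(abs(x) for x in xs)
    if m == 0.0:
        return 0.0
    # factor out the largest magnitude so (|x|/m)**p stays in [0, 1]
    return m * sum((abs(x) / m) ** p for x in xs) ** (1.0 / p)
```

Homogeneity is easy to check numerically: scaling every input by λ scales the result by |λ|.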
The following binary operator is called the Smooth Maximum Unit (SMU):[4]

max_ε(a, b) = (a + b + |a − b|_ε) / 2, with |x|_ε = √(x² + ε),
where ε ≥ 0 is a parameter. As ε → 0, |·|_ε → |·| and thus max_ε → max.
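A Python sketch of SMU as a binary smooth maximum, assuming the common smoothing |x|_ε = √(x² + ε) (the function name is mine):

```python
import math

def smu_max(a, b, eps):
    """Smooth Maximum Unit: (a + b + |a - b|_eps) / 2,
    where |x|_eps = sqrt(x*x + eps) smooths the absolute value.
    eps = 0 recovers the exact maximum."""
    return (a + b + math.sqrt((a - b) ** 2 + eps)) / 2.0
```

For ε > 0 the smoothed value slightly exceeds the true maximum, since √(x² + ε) > |x|; the gap shrinks as ε → 0.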
https://www.johndcook.com/soft_maximum.pdf
M. Lange, D. Zühlke, O. Holz, and T. Villmann, "Applications of lp-norms and their smooth approximations for gradient based learning vector quantization," in Proc. ESANN, Apr. 2014, pp. 271-276. (https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2014-153.pdf)
Asadi, Kavosh; Littman, Michael L. (2017). "An Alternative Softmax Operator for Reinforcement Learning". PMLR. 70: 243–252. arXiv:1612.05628. Retrieved January 6, 2023.
Safak, Aysel (February 1993). "Statistical analysis of the power sum of multiple correlated log-normal components". IEEE Transactions on Vehicular Technology. 42 (1): 58–61. doi:10.1109/25.192387. Retrieved January 6, 2023. https://ieeexplore.ieee.org/document/192387
Biswas, Koushik; Kumar, Sandeep; Banerjee, Shilpak; Pandey, Ashish Kumar (2021). "SMU: Smooth activation function for deep networks using smoothing maximum technique". arXiv:2111.04682 [cs.LG].