Subgradient method

<h2 id="classical-subgradient-rules">Classical subgradient rules</h2>
<p>Let \(f:\mathbb{R}^{n}\to\mathbb{R}\) be a <a href="/facts/Convex_function/IbzG5SLF">convex function</a> with domain \(\mathbb{R}^{n}\).
A classical subgradient method iterates

\[x^{(k+1)}=x^{(k)}-\alpha_{k}g^{(k)}\]

where \(g^{(k)}\) denotes <i>any</i> <a href="/facts/Subgradient/IK3s3Cqw">subgradient</a> of \(f\) at \(x^{(k)}\), and \(x^{(k)}\) is the \(k\)-th iterate of \(x\).
If \(f\) is differentiable, then its only subgradient is the gradient vector \(\nabla f\) itself.
It may happen that \(-g^{(k)}\) is not a descent direction for \(f\) at \(x^{(k)}\). We therefore maintain a running value \(f_{\rm best}\) that keeps track of the lowest objective function value found so far, i.e.
\[f_{\rm best}^{(k)}=\min\{f_{\rm best}^{(k-1)},f(x^{(k)})\}.\]

</p>
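<p>The iteration and the best-value bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name <code>subgradient_method</code>, the test objective \(f(x)=|x_1|+2|x_2|\), the starting point, and the step size \(1/k\) are all choices made for the example and do not come from the sources cited in this article.</p>

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def subgradient_method(
    f: Callable[[Sequence[float]], float],
    subgrad: Callable[[Sequence[float]], Vector],
    x0: Sequence[float],
    step: Callable[[int], float],
    iterations: int,
) -> Tuple[Vector, float]:
    """Iterate x <- x - alpha_k * g and remember the best point seen so far."""
    x = list(x0)
    x_best, f_best = list(x), f(x)
    for k in range(1, iterations + 1):
        g = subgrad(x)            # any subgradient of f at the current iterate
        alpha = step(k)           # step size chosen in advance as a function of k
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
        fx = f(x)
        if fx < f_best:           # f_best^(k) = min(f_best^(k-1), f(x^(k)))
            x_best, f_best = list(x), fx
    return x_best, f_best


# Illustrative objective (an assumption): f(x) = |x_1| + 2|x_2|, minimized at the origin.
def l1_objective(x: Sequence[float]) -> float:
    return abs(x[0]) + 2.0 * abs(x[1])


def l1_subgradient(x: Sequence[float]) -> Vector:
    def sign(t: float) -> float:
        # sign(0) = 0 is one valid choice of subgradient of |t| at t = 0
        return float((t > 0) - (t < 0))
    return [sign(x[0]), 2.0 * sign(x[1])]


x_best, f_best = subgradient_method(
    l1_objective, l1_subgradient, [1.3, -0.7], lambda k: 1.0 / k, 2000
)
```

<p>Because individual steps may increase the objective, the sketch returns the best point found rather than the last iterate.</p>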
<h3>Step size rules</h3>
<p>Many different types of step-size rules are used by subgradient methods.  This article notes five classical step-size rules for which convergence <a href="/facts/Mathematical_proof/7ECDrU80">proofs</a> are known:
</p>
<ul><li>Constant step size, \(\alpha_{k}=\alpha\).</li>
<li>Constant step length, \(\alpha_{k}=\gamma/\lVert g^{(k)}\rVert_{2}\), which gives \(\lVert x^{(k+1)}-x^{(k)}\rVert_{2}=\gamma\).</li>
<li>Square summable but not summable step size, i.e. any step sizes satisfying \(\alpha_{k}\geq 0,\qquad \sum_{k=1}^{\infty}\alpha_{k}^{2}<\infty,\qquad \sum_{k=1}^{\infty}\alpha_{k}=\infty\).</li>
<li>Nonsummable diminishing, i.e. any step sizes satisfying \(\alpha_{k}\geq 0,\qquad \lim_{k\to\infty}\alpha_{k}=0,\qquad \sum_{k=1}^{\infty}\alpha_{k}=\infty\).</li>
<li>Nonsummable diminishing step lengths, i.e. \(\alpha_{k}=\gamma_{k}/\lVert g^{(k)}\rVert_{2}\), where \(\gamma_{k}\geq 0,\qquad \lim_{k\to\infty}\gamma_{k}=0,\qquad \sum_{k=1}^{\infty}\gamma_{k}=\infty\).</li></ul>
<p>For all five rules, the step-sizes are determined "off-line", before the method is iterated; the step-sizes do not depend on preceding iterations. (For the two step-length rules, the lengths \(\gamma\) and \(\gamma_{k}\) are fixed in advance, and only the normalization by the norm of the current subgradient is computed during the iteration.) This "off-line" property of subgradient methods differs from the "on-line" step-size rules used for descent methods for differentiable functions: many methods for minimizing differentiable functions satisfy Wolfe's sufficient conditions for convergence, where step-sizes typically depend on the current point and the current search-direction. An extensive discussion of step-size rules for subgradient methods, including incremental versions, is given in the books by Bertsekas<a class="footnote-ref" id="fnref:1" href="#fn:1"><sup>1</sup></a> and by Bertsekas, Nedic, and Ozdaglar.<a class="footnote-ref" id="fnref:2" href="#fn:2"><sup>2</sup></a>
</p>
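<p>The five rules listed above can be written as small step-size functions of the iteration index \(k\) and the current subgradient \(g\). The following is a minimal sketch; all function names are hypothetical, and the concrete sequences \(a/k\) and \(a/\sqrt{k}\) are merely common examples that satisfy the stated conditions, not choices prescribed by the cited references.</p>

```python
import math
from typing import Callable, Sequence

StepRule = Callable[[int, Sequence[float]], float]


def norm2(g: Sequence[float]) -> float:
    """Euclidean norm of a vector."""
    return math.sqrt(sum(gi * gi for gi in g))


def constant_step_size(alpha: float) -> StepRule:
    # alpha_k = alpha
    return lambda k, g: alpha


def constant_step_length(gamma: float) -> StepRule:
    # alpha_k = gamma / ||g||_2, so every step has length gamma (requires g != 0)
    return lambda k, g: gamma / norm2(g)


def square_summable(a: float) -> StepRule:
    # Example choice alpha_k = a / k: the sum diverges, the sum of squares converges
    return lambda k, g: a / k


def nonsummable_diminishing(a: float) -> StepRule:
    # Example choice alpha_k = a / sqrt(k): tends to zero, the sum diverges
    return lambda k, g: a / math.sqrt(k)


def diminishing_step_length(a: float) -> StepRule:
    # alpha_k = gamma_k / ||g||_2 with the example gamma_k = a / sqrt(k) (requires g != 0)
    return lambda k, g: (a / math.sqrt(k)) / norm2(g)


# Usage: with g = (3, 4) the constant step length rule moves exactly gamma = 0.5.
g_example = [3.0, 4.0]
alpha_example = constant_step_length(0.5)(1, g_example)
step_length = norm2([alpha_example * gi for gi in g_example])
```

<p>Passing \(g\) to every rule keeps a single calling convention, even though only the step-length rules use it.</p>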
<h3>Convergence results</h3>
<p>For constant step-length and scaled subgradients having <a href="/facts/Euclidean_norm/R2UbzmzM">Euclidean norm</a> equal to one, the subgradient method converges to an arbitrarily close approximation to the minimum value, that is
</p>

\[\lim_{k\to\infty}f_{\rm best}^{(k)}-f^{*}<\epsilon\]
  
 by a result of <a href="/facts/Naum_Z._Shor/ksPSYFCH">Shor</a>.<a class="footnote-ref" id="fnref:3" href="#fn:3"><sup>3</sup></a>
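<p>The size of the gap can be made explicit under two standard assumptions that are not stated above: all subgradients satisfy \(\lVert g^{(i)}\rVert_{2}\leq G\), and the starting point satisfies \(\lVert x^{(1)}-x^{*}\rVert_{2}\leq R\) for some minimizer \(x^{*}\). The constants \(G\) and \(R\) are introduced here only for illustration. A sketch of the usual estimate, specialized to the constant step length rule \(\alpha_{i}=\gamma/\lVert g^{(i)}\rVert_{2}\):</p>

```latex
% General bound, obtained by expanding \|x^{(k+1)} - x^*\|_2^2 and using the subgradient inequality
f_{\rm best}^{(k)} - f^{*}
  \;\le\; \frac{R^{2} + \sum_{i=1}^{k} \alpha_{i}^{2}\,\lVert g^{(i)}\rVert_{2}^{2}}
               {2 \sum_{i=1}^{k} \alpha_{i}}

% Constant step length: \alpha_i \lVert g^{(i)} \rVert_2 = \gamma and \alpha_i \ge \gamma / G
f_{\rm best}^{(k)} - f^{*}
  \;\le\; \frac{G\,(R^{2} + k\gamma^{2})}{2k\gamma}
  \;\xrightarrow[k\to\infty]{}\; \frac{G\gamma}{2}
```

<p>Under these assumptions, choosing the step length \(\gamma\) small enough makes the limiting gap smaller than any prescribed \(\epsilon\).</p>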
<p>These classical subgradient methods have poor performance and are no longer recommended for general use.<a class="footnote-ref" id="fnref:4" href="#fn:4"><sup>4</sup></a><a class="footnote-ref" id="fnref:5" href="#fn:5"><sup>5</sup></a> However, they are still used widely in specialized applications because they are simple and they can be easily adapted to take advantage of the special structure of the problem at hand.
</p>
<h2 id="subgradient-projection-and-bundle-methods">Subgradient-projection and bundle methods</h2>
<p>During the 1970s, <a href="/facts/Claude_Lemar%25C3%25A9chal/aMcoIXVJ">Claude Lemaréchal</a> and Phil Wolfe proposed "bundle methods" of descent for problems of convex minimization.<a class="footnote-ref" id="fnref:6" href="#fn:6"><sup>6</sup></a> The meaning of the term "bundle methods" has changed significantly since that time. Modern versions and full convergence analysis were provided by Kiwiel.
<a class="footnote-ref" id="fnref:7" href="#fn:7"><sup>7</sup></a> Contemporary bundle-methods often use "<a href="/facts/Level_set/JUmNbdNh">level</a> control" rules for choosing step-sizes, developing techniques from the "subgradient-projection" method of Boris T. Polyak (1969). However, there are problems on which bundle methods offer little advantage over subgradient-projection methods.<a class="footnote-ref" id="fnref:8" href="#fn:8"><sup>8</sup></a><a class="footnote-ref" id="fnref:9" href="#fn:9"><sup>9</sup></a>
</p>
<h2 id="constrained-optimization">Constrained optimization</h2>
<h3>Projected subgradient</h3>
<p>One extension of the subgradient method is the projected subgradient method, which solves the constrained <a href="/facts/Mathematical_optimization/oRn8Iv5I">optimization</a> problem
</p>
minimize \(f(x)\) subject to \(x\in\mathcal{C}\)

<p>where \(\mathcal{C}\) is a <a href="/facts/Convex_set/vdAuJRJl">convex set</a>.
The projected subgradient method uses the iteration

\[x^{(k+1)}=P\left(x^{(k)}-\alpha_{k}g^{(k)}\right)\]

where \(P\) is projection on \(\mathcal{C}\) and \(g^{(k)}\) is any subgradient of \(f\) at \(x^{(k)}\).

</p>
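<p>A minimal sketch of the projected iteration follows, assuming that \(\mathcal{C}\) is a box \([0,1]^{n}\), for which the Euclidean projection is a coordinate-wise clip. The function names, the objective \(f(x)=|x_1-2|+|x_2+1|\), and the step size \(1/k\) are assumptions made for the example.</p>

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def project_box(x: Sequence[float], lo: float, hi: float) -> Vector:
    """Euclidean projection onto the box [lo, hi]^n (coordinate-wise clipping)."""
    return [min(max(xi, lo), hi) for xi in x]


def projected_subgradient(
    f: Callable[[Sequence[float]], float],
    subgrad: Callable[[Sequence[float]], Vector],
    project: Callable[[Sequence[float]], Vector],
    x0: Sequence[float],
    step: Callable[[int], float],
    iterations: int,
) -> Tuple[Vector, float]:
    """Iterate x <- P(x - alpha_k * g), tracking the best feasible point."""
    x = project(x0)               # start from a feasible point
    x_best, f_best = list(x), f(x)
    for k in range(1, iterations + 1):
        g = subgrad(x)            # any subgradient of f at x
        alpha = step(k)
        x = project([xi - alpha * gi for xi, gi in zip(x, g)])
        fx = f(x)
        if fx < f_best:
            x_best, f_best = list(x), fx
    return x_best, f_best


def sign(t: float) -> float:
    return float((t > 0) - (t < 0))


# Illustrative problem (an assumption): minimize |x_1 - 2| + |x_2 + 1| over [0, 1]^2.
def objective(x: Sequence[float]) -> float:
    return abs(x[0] - 2.0) + abs(x[1] + 1.0)


def objective_subgrad(x: Sequence[float]) -> Vector:
    return [sign(x[0] - 2.0), sign(x[1] + 1.0)]


x_best, f_best = projected_subgradient(
    objective,
    objective_subgrad,
    lambda x: project_box(x, 0.0, 1.0),
    [0.5, 0.5],
    lambda k: 1.0 / k,
    100,
)
```

<p>In this example the unconstrained minimizer \((2,-1)\) lies outside the box, so the method settles at the corner \((1,0)\), where the objective equals 2.</p>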
<h3>General constraints</h3>
<p>The subgradient method can be extended to solve the inequality constrained problem
</p>
minimize \(f_{0}(x)\) subject to \(f_{i}(x)\leq 0,\quad i=1,\ldots,m\)

<p>where \(f_{i}\) are convex. The algorithm takes the same form as the unconstrained case
\[x^{(k+1)}=x^{(k)}-\alpha_{k}g^{(k)}\]
where \(\alpha_{k}>0\) is a step size, and \(g^{(k)}\) is a subgradient of the objective or one of the constraint functions at \(x\). Take

\[g^{(k)}={\begin{cases}\partial f_{0}(x)&{\text{ if }}f_{i}(x)\leq 0\;\forall i=1\dots m\\\partial f_{j}(x)&{\text{ for some }}j{\text{ such that }}f_{j}(x)>0\end{cases}}\]

where \(\partial f\) denotes the <a href="/facts/Subdifferential/IK3s3Cqw">subdifferential</a> of \(f\). If the current point is feasible, the algorithm uses an objective subgradient; if the current point is infeasible, the algorithm chooses a subgradient of any violated constraint.
</p>
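<p>The selection rule above can be sketched as follows. The sketch picks the first violated constraint (the rule allows any violated one) and records the best objective value over feasible iterates only; both choices, like the function names and the toy problem, are assumptions made for illustration.</p>

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]
Func = Callable[[Sequence[float]], float]
Subgrad = Callable[[Sequence[float]], Vector]


def choose_subgradient(
    x: Sequence[float],
    objective_subgrad: Subgrad,
    constraints: Sequence[Tuple[Func, Subgrad]],
) -> Tuple[Vector, bool]:
    """Return (g, feasible): a constraint subgradient if some f_i(x) > 0, else an objective one."""
    for f_i, subgrad_i in constraints:
        if f_i(x) > 0:                      # first violated constraint
            return subgrad_i(x), False
    return objective_subgrad(x), True       # x is feasible


def constrained_subgradient_method(
    f0: Func,
    objective_subgrad: Subgrad,
    constraints: Sequence[Tuple[Func, Subgrad]],
    x0: Sequence[float],
    step: Callable[[int], float],
    iterations: int,
) -> Tuple[Optional[Vector], float]:
    x = list(x0)
    x_best: Optional[Vector] = None
    f_best = float("inf")
    for k in range(1, iterations + 1):
        g, feasible = choose_subgradient(x, objective_subgrad, constraints)
        if feasible and f0(x) < f_best:     # only feasible points count as candidates
            x_best, f_best = list(x), f0(x)
        alpha = step(k)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x_best, f_best


def sign(t: float) -> float:
    return float((t > 0) - (t < 0))


# Illustrative problem (an assumption): minimize |x_1| + |x_2| subject to 1 - x_1 <= 0.
f0 = lambda x: abs(x[0]) + abs(x[1])
g0 = lambda x: [sign(x[0]), sign(x[1])]
f1 = lambda x: 1.0 - x[0]
g1 = lambda x: [-1.0, 0.0]

x_best, f_best = constrained_subgradient_method(
    f0, g0, [(f1, g1)], [2.5, 1.5], lambda k: 1.0 / k, 200
)
```

<p>For this toy problem the constrained minimum is attained at \((1,0)\) with objective value 1; iterates that drift outside the feasible set are pushed back by the constraint subgradient.</p>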
<h2 id="see-also">See also</h2>
<ul><li><a href="/facts/Stochastic_gradient_descent/HbcaYqQP">Stochastic gradient descent</a> – Optimization algorithm</li></ul>

<h2 id="further-reading">Further reading</h2>
<ul><li>Bertsekas, Dimitri P. (1999). <i>Nonlinear Programming</i>. Belmont, MA.: Athena Scientific. <a href="/facts/ISBN_(identifier)/15AdSPa9">ISBN</a> 1-886529-00-0.</li>
<li>Bertsekas, Dimitri P.; Nedic, Angelia; Ozdaglar, Asuman (2003). <i>Convex Analysis and Optimization</i> (Second ed.). Belmont, MA.: Athena Scientific. <a href="/facts/ISBN_(identifier)/15AdSPa9">ISBN</a> 1-886529-45-0.</li>
<li>Bertsekas, Dimitri P. (2015). <i>Convex Optimization Algorithms</i>. Belmont, MA.: Athena Scientific. <a href="/facts/ISBN_(identifier)/15AdSPa9">ISBN</a> 978-1-886529-28-1.</li>
<li>Shor, Naum Z. (1985). <i>Minimization Methods for Non-differentiable Functions</i>. <a href="/facts/Springer-Verlag/nAesf6nT">Springer-Verlag</a>. <a href="/facts/ISBN_(identifier)/15AdSPa9">ISBN</a> 0-387-12763-1.</li>
<li><a href="/facts/Andrzej_Piotr_Ruszczy%25C5%2584ski/doHj9lgC">Ruszczyński, Andrzej</a> (2006). <i>Nonlinear Optimization</i>. Princeton, NJ: <a href="/facts/Princeton_University_Press/kwvFlKEj">Princeton University Press</a>. pp. xii+454. <a href="/facts/ISBN_(identifier)/15AdSPa9">ISBN</a> 978-0691119151. <a href="/facts/MR_(identifier)/uP137L11">MR</a> <a href="https://mathscinet.ams.org/mathscinet-getitem?mr=2199043">2199043</a>.</li></ul>
<h2 id="external-links">External links</h2>
<ul><li><a href="http://www.stanford.edu/class/ee364a/">EE364A</a> and <a href="http://www.stanford.edu/class/ee364b/">EE364B</a>, Stanford's convex optimization course sequence.</li></ul>

<h2 id="references">References</h2>

<ol>
<li id="fn:1"><p>Bertsekas, Dimitri P. (2015). Convex Optimization Algorithms (Second ed.). Belmont, MA.: Athena Scientific. ISBN 978-1-886529-28-1. <a href="#fnref:1" class="footnote-back-ref">↩</a></p></li>
<li id="fn:2"><p>Bertsekas, Dimitri P.; Nedic, Angelia; Ozdaglar, Asuman (2003). Convex Analysis and Optimization (Second ed.). Belmont, MA.: Athena Scientific. ISBN 1-886529-45-0. <a href="#fnref:2" class="footnote-back-ref">↩</a></p></li>
<li id="fn:3"><p>The approximate convergence of the constant step-size (scaled) subgradient method is stated as Exercise 6.3.14(a) in Bertsekas (page 636): Bertsekas, Dimitri P. (1999). Nonlinear Programming (Second ed.). Cambridge, MA.: Athena Scientific. ISBN 1-886529-00-0. On page 636, Bertsekas attributes this result to Shor: Shor, Naum Z. (1985). Minimization Methods for Non-differentiable Functions. Springer-Verlag. ISBN 0-387-12763-1. <a href="#fnref:3" class="footnote-back-ref">↩</a></p></li>
<li id="fn:4"><p>Lemaréchal, Claude (2001). "Lagrangian relaxation". In Michael Jünger and Denis Naddef (ed.). Computational combinatorial optimization: Papers from the Spring School held in Schloß Dagstuhl, May 15–19, 2000. Lecture Notes in Computer Science. Vol. 2241. Berlin: Springer-Verlag. pp. 112–156. doi:10.1007/3-540-45586-8_4. ISBN 3-540-42877-1. MR 1900016. S2CID 9048698. <a href="#fnref:4" class="footnote-back-ref">↩</a></p></li>
<li id="fn:5"><p>Kiwiel, Krzysztof C.; Larsson, Torbjörn; Lindberg, P. O. (August 2007). "Lagrangian relaxation via ballstep subgradient methods" (PDF). Mathematics of Operations Research. 32 (3): 669–686. doi:10.1287/moor.1070.0261. MR 2348241. <a href="http://rcin.org.pl/Content/139438/PDF/RB-2002-76.pdf" target="_blank">http://rcin.org.pl/Content/139438/PDF/RB-2002-76.pdf</a> <a href="#fnref:5" class="footnote-back-ref">↩</a></p></li>
<li id="fn:6"><p>Bertsekas, Dimitri P. (1999). Nonlinear Programming (Second ed.). Cambridge, MA.: Athena Scientific. ISBN 1-886529-00-0. <a href="#fnref:6" class="footnote-back-ref">↩</a></p></li>
<li id="fn:7"><p>Kiwiel, Krzysztof (1985). Methods of Descent for Nondifferentiable Optimization. Berlin: Springer Verlag. p. 362. ISBN 978-3540156420. MR 0797754. <a href="#fnref:7" class="footnote-back-ref">↩</a></p></li>
<li id="fn:8"><p>Lemaréchal, Claude (2001). "Lagrangian relaxation". In Michael Jünger and Denis Naddef (ed.). Computational combinatorial optimization: Papers from the Spring School held in Schloß Dagstuhl, May 15–19, 2000. Lecture Notes in Computer Science. Vol. 2241. Berlin: Springer-Verlag. pp. 112–156. doi:10.1007/3-540-45586-8_4. ISBN 3-540-42877-1. MR 1900016. S2CID 9048698. <a href="#fnref:8" class="footnote-back-ref">↩</a></p></li>
<li id="fn:9"><p>Kiwiel, Krzysztof C.; Larsson, Torbjörn; Lindberg, P. O. (August 2007). "Lagrangian relaxation via ballstep subgradient methods" (PDF). Mathematics of Operations Research. 32 (3): 669–686. doi:10.1287/moor.1070.0261. MR 2348241. <a href="http://rcin.org.pl/Content/139438/PDF/RB-2002-76.pdf" target="_blank">http://rcin.org.pl/Content/139438/PDF/RB-2002-76.pdf</a> <a href="#fnref:9" class="footnote-back-ref">↩</a></p></li>
</ol>
