Consider the optimization problem

$$\min_{x \in \mathbb{R}^n} f(x),$$

where $f$ is a convex and smooth function.
Smoothness: By smoothness we mean the following: we assume the gradient of $f$ is coordinate-wise Lipschitz continuous with constants $L_1, L_2, \dots, L_n$. That is, we assume that

$$|\nabla_i f(x + h e_i) - \nabla_i f(x)| \leq L_i |h|$$

for all $x \in \mathbb{R}^n$ and $h \in \mathbb{R}$, where $\nabla_i$ denotes the partial derivative with respect to variable $x^{(i)}$ and $e_i$ is the $i$-th standard basis vector.
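For instance, for a convex quadratic $f(x) = \tfrac{1}{2}x^{T}Ax - b^{T}x$ with $A$ symmetric positive semidefinite, one has $\nabla_i f(x + h e_i) - \nabla_i f(x) = h A_{ii}$, so the coordinate Lipschitz constants are simply the diagonal entries $L_i = A_{ii}$.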
Nesterov,[1] and Richtárik and Takáč,[2] showed that the following algorithm converges to the optimal point: starting from an initial iterate $x_0$, at each step $k$ choose a coordinate $i_k \in \{1, \dots, n\}$ uniformly at random and update

$$x_{k+1} = x_k - \frac{1}{L_{i_k}} \nabla_{i_k} f(x_k)\, e_{i_k}.$$
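As a concrete illustration, here is a minimal Python sketch of this update rule. The quadratic objective, the helper `grad_i`, and all the numbers are assumptions made purely for the example, not part of the cited papers.

```python
import numpy as np

def random_coordinate_descent(grad_i, lipschitz, x0, num_iters, rng=None):
    """Random coordinate descent: at each step pick a coordinate i uniformly
    at random and take a gradient step of length 1/L_i along e_i."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    n = x.size
    for _ in range(num_iters):
        i = rng.integers(n)                    # i_k chosen uniformly from {0, ..., n-1}
        x[i] -= grad_i(x, i) / lipschitz[i]    # x^(i) <- x^(i) - (1/L_i) * grad_i f(x)
    return x

# Example objective (an assumption for illustration): f(x) = 1/2 x^T A x - b^T x,
# for which grad_i f(x) = (A x - b)[i] and L_i = A[i, i].
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(A, b)                 # exact minimizer, for comparison
x = random_coordinate_descent(lambda x, i: A[i] @ x - b[i], np.diag(A),
                              np.zeros(2), num_iters=500)
print(x, x_star)
```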
Since the iterates of this algorithm are random vectors, a complexity result gives a bound on the number of iterations needed for the method to output an approximate solution with high probability. It was shown in [2] that if

$$k \geq \frac{2nR_L(x_0)}{\epsilon} \log\left(\frac{f(x_0) - f^*}{\epsilon\rho}\right),$$

where $R_L(x) = \max_y \max_{x^* \in X^*} \{\|y - x^*\|_L : f(y) \leq f(x)\}$, $f^*$ is the optimal value ($f^* = \min_{x \in \mathbb{R}^n} f(x)$), $\rho \in (0,1)$ is a confidence level and $\epsilon > 0$ is the target accuracy, then $\text{Prob}(f(x_k) - f^* > \epsilon) \leq \rho$.
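To get a feel for the bound, the following snippet evaluates it for some made-up values of $n$, $R_L(x_0)$, the initial gap $f(x_0) - f^*$, $\epsilon$ and $\rho$; every number here is hypothetical.

```python
import math

def iteration_bound(n, R_L, initial_gap, eps, rho):
    """k >= (2 n R_L(x0) / eps) * log((f(x0) - f*) / (eps * rho))."""
    return math.ceil(2 * n * R_L / eps * math.log(initial_gap / (eps * rho)))

# All numbers below are hypothetical, purely to illustrate the formula.
print(iteration_bound(n=1000, R_L=1.0, initial_gap=10.0, eps=1e-3, rho=0.1))
```

Note that the bound depends on the accuracy $\epsilon$ both through the leading factor and inside the logarithm, and only logarithmically on the confidence level $\rho$.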
[Figure omitted: evolution of the iterates $x_k$ during the iterations, illustrated in principle on an example problem.]
One can naturally extend this algorithm from single coordinates to blocks of coordinates. Assume that we have the space $\mathbb{R}^5$. This space has 5 coordinate directions, namely

$$e_1 = (1,0,0,0,0)^T,\; e_2 = (0,1,0,0,0)^T,\; e_3 = (0,0,1,0,0)^T,\; e_4 = (0,0,0,1,0)^T,\; e_5 = (0,0,0,0,1)^T,$$

in which the random coordinate descent method can move. However, one can group some coordinate directions into blocks, so that instead of those 5 coordinate directions we have 3 block coordinate directions; the method then updates all coordinates of a randomly chosen block at once, as in the sketch below.
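As an illustration, here is a minimal Python sketch of one such grouping; the particular blocks $\{1,2\}$, $\{3\}$, $\{4,5\}$ and the block Lipschitz constants are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical grouping of the 5 coordinate directions of R^5 into 3 blocks
# (0-based indices: block 0 = {e1, e2}, block 1 = {e3}, block 2 = {e4, e5}).
blocks = [np.array([0, 1]), np.array([2]), np.array([3, 4])]

def random_block_coordinate_step(x, grad, block_lipschitz, rng):
    """One step of randomized block coordinate descent: pick a block uniformly
    at random and update all of its coordinates together with step 1/L_block."""
    j = rng.integers(len(blocks))                  # block chosen uniformly at random
    idx = blocks[j]
    x[idx] -= grad(x)[idx] / block_lipschitz[j]    # gradient step restricted to the block
    return x
```

Updating a whole block per step amortizes the per-iteration overhead and can exploit structure shared by the coordinates within a block.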
1. Nesterov, Yurii (2010), "Efficiency of coordinate descent methods on huge-scale optimization problems", SIAM Journal on Optimization, 22 (2): 341–362, CiteSeerX 10.1.1.332.3336, doi:10.1137/100802001
2. Richtárik, Peter; Takáč, Martin (2011), "Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function", Mathematical Programming, Series A, 144 (1–2): 1–38, arXiv:1107.2848, doi:10.1007/s10107-012-0614-z