Denote by

$b_n = E[\hat{g}_n \mid u_n] - \nabla J(u_n)$

the bias in the estimator $\hat{g}_n$. Assume that $\{(\Delta_n)_i\}$ are all mutually independent with zero mean and bounded second moments, and that $E(|(\Delta_n)_i|^{-1})$ is uniformly bounded. Then $b_n \to 0$ w.p. 1.
The main idea is to condition on $\Delta_n$ to express $E[(\hat{g}_n)_i]$ and then to use a second-order Taylor expansion of $J(u_n + c_n\Delta_n)$ and $J(u_n - c_n\Delta_n)$. After algebraic manipulations using the zero mean and the independence of $\{(\Delta_n)_i\}$, we get

$E[(\hat{g}_n)_i \mid u_n] = (\nabla J(u_n))_i + O(c_n^2).$

The result follows from the hypothesis that $c_n \to 0$.
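For the interested reader, a minimal sketch of the intermediate Taylor step (noise-free case; measurement-noise terms, assumed to have zero conditional mean, are omitted, and $\bar{u}_{\pm}$ denote remainder points introduced here only for the sketch):

$J(u_n \pm c_n\Delta_n) = J(u_n) \pm c_n\,\Delta_n^{\top}\nabla J(u_n) + \tfrac{1}{2}c_n^{2}\,\Delta_n^{\top}\nabla^{2}J(u_n)\,\Delta_n \pm \tfrac{1}{6}c_n^{3}\,J^{(3)}(\bar{u}_{\pm})[\Delta_n,\Delta_n,\Delta_n],$

so that the even-order terms cancel in the difference and

$(\hat{g}_n)_i = \dfrac{J(u_n + c_n\Delta_n) - J(u_n - c_n\Delta_n)}{2 c_n (\Delta_n)_i} = (\nabla J(u_n))_i + \sum_{j\neq i}\dfrac{(\Delta_n)_j}{(\Delta_n)_i}\,(\nabla J(u_n))_j + O(c_n^{2}).$

Taking expectations, the cross terms vanish because the $(\Delta_n)_j$ have zero mean and are independent of $(\Delta_n)_i$, while the remainder is controlled by the bounded third derivative and the bounded inverse moment $E(|(\Delta_n)_i|^{-1})$, giving $b_n = O(c_n^2)$.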
Next we summarize some of the hypotheses under which $u_n$ converges in probability to the set of global minima of $J(u)$. The efficiency of the method depends on the shape of $J(u)$, the values of the parameters $a_n$ and $c_n$, and the distribution of the perturbation terms $(\Delta_n)_i$. First, the algorithm parameters must satisfy the following conditions: $a_n > 0$ and $c_n > 0$ for all $n$; $a_n \to 0$ and $c_n \to 0$; $\sum_{n=1}^{\infty} a_n = \infty$; and $\sum_{n=1}^{\infty} (a_n/c_n)^2 < \infty$.
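As a quick check, gain sequences of the form $a_n = a/(n+A)^{0.602}$ and $c_n = c/n^{0.101}$, which are often suggested in the SPSA literature, satisfy all four conditions; the helper below and its default constants are assumptions for illustration only.

```python
import numpy as np

def spsa_gains(n, a=0.1, A=10.0, c=0.1, alpha=0.602, gamma=0.101):
    """Gain sequences for iterations 1..n: both positive and decaying to zero,
    with sum(a_n) divergent (alpha < 1) and sum((a_n / c_n)**2) convergent
    (since 2 * (alpha - gamma) = 1.002 > 1)."""
    k = np.arange(1, n + 1)
    return a / (k + A) ** alpha, c / k ** gamma
```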
A good choice for $(\Delta_n)_i$ is the Rademacher distribution, i.e. Bernoulli ±1 with probability 0.5. Other choices are possible too, but note that the uniform and normal distributions cannot be used because they do not satisfy the finite inverse moment condition.
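Putting the conditions together, the following is a minimal sketch of the basic SPSA recursion with Rademacher perturbations; the function name spsa_minimize, the quadratic test loss, and all numeric constants are illustrative assumptions, not a definitive implementation.

```python
import numpy as np

def spsa_minimize(J, u0, n_iterations=1000, a=0.1, A=10.0, c=0.1,
                  alpha=0.602, gamma=0.101, seed=None):
    """Minimal SPSA sketch: each iteration uses two noisy measurements of J
    and perturbs all coordinates simultaneously with a Rademacher vector."""
    rng = np.random.default_rng(seed)
    u = np.array(u0, dtype=float)
    for n in range(1, n_iterations + 1):
        a_n = a / (n + A) ** alpha                       # step size
        c_n = c / n ** gamma                             # perturbation size
        delta = rng.choice([-1.0, 1.0], size=u.shape)    # Rademacher perturbation
        # Simultaneous-perturbation gradient estimate (elementwise division by delta)
        g_hat = (J(u + c_n * delta) - J(u - c_n * delta)) / (2.0 * c_n * delta)
        u = u - a_n * g_hat
    return u

# Illustrative usage: noisy quadratic loss with its minimum at the origin.
if __name__ == "__main__":
    noise = np.random.default_rng(0)
    J = lambda u: float(np.sum(u ** 2)) + 0.01 * noise.normal()
    print(spsa_minimize(J, u0=np.ones(5), seed=1))
```

Note that each iteration uses only two measurements of $J$, regardless of the dimension of $u$.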
The loss function $J(u)$ must be three times continuously differentiable and the individual elements of the third derivative must be bounded: $|J^{(3)}(u)| < a_3 < \infty$. Also, $|J(u)| \to \infty$ as $\|u\| \to \infty$.
In addition, $\nabla J$ must be Lipschitz continuous and bounded, and the ODE $\dot{u} = g(u)$ must have a unique solution for each initial condition. Under these conditions and a few others, $u_n$ converges in probability to the set of global minima of $J(u)$ (see Maryak and Chin, 2008).
It has been shown that differentiability is not required: continuity and convexity are sufficient for convergence.[1]
It is known that a stochastic version of the standard (deterministic) Newton-Raphson algorithm (a “second-order” method) provides an asymptotically optimal or near-optimal form of stochastic approximation. SPSA can also be used to efficiently estimate the Hessian matrix of the loss function based on either noisy loss measurements or noisy gradient measurements (stochastic gradients). As with the basic SPSA method, only a small fixed number of loss measurements or gradient measurements are needed at each iteration, regardless of the problem dimension p. See the brief discussion in Stochastic gradient descent.
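As an illustration of how the same perturbation idea yields a Hessian estimate when noisy gradient measurements are available, here is a minimal sketch; the function name sp_hessian_estimate, the argument names, and the simple symmetrized outer-product form are assumptions for this example rather than a full statement of the adaptive (second-order) SPSA algorithm.

```python
import numpy as np

def sp_hessian_estimate(grad, u, c_tilde=0.01, seed=None):
    """One-sample simultaneous-perturbation Hessian estimate at u, built from
    two (possibly noisy) gradient measurements."""
    rng = np.random.default_rng(seed)
    d = rng.choice([-1.0, 1.0], size=np.shape(u))        # Rademacher perturbation
    delta_g = grad(u + c_tilde * d) - grad(u - c_tilde * d)
    # delta_g is approximately 2 * c_tilde * H @ d, so dividing elementwise by
    # 2 * c_tilde * d and symmetrizing gives a matrix whose mean is close to H.
    h = np.outer(delta_g, 1.0 / (2.0 * c_tilde * d))
    return 0.5 * (h + h.T)
```

In an adaptive second-order scheme, such per-iteration estimates are typically smoothed (averaged) across iterations before being used in a Newton-type update, so only two gradient measurements are needed per iteration regardless of p.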
1. He, Ying; Fu, Michael C.; Marcus, Steven I. (August 2003). "Convergence of simultaneous perturbation stochastic approximation for nondifferentiable optimization". IEEE Transactions on Automatic Control. 48 (8): 1459–1463. doi:10.1109/TAC.2003.815008. Retrieved March 6, 2022. https://ieeexplore.ieee.org/document/1220767