Denote by

$b_n = E[\hat{g}_n \mid u_n] - \nabla J(u_n)$

the bias in the estimator $\hat{g}_n$. Assume that $\{(\Delta_n)_i\}$ are all mutually independent with zero mean and bounded second moments, and that $E(|(\Delta_n)_i|^{-1})$ is uniformly bounded. Then $b_n \to 0$ w.p. 1.
The main idea is to condition on $\Delta_n$ to express $E[(\hat{g}_n)_i]$ and then to use a second-order Taylor expansion of $J(u_n + c_n\Delta_n)$ and $J(u_n - c_n\Delta_n)$. After algebraic manipulations using the zero mean and the independence of $\{(\Delta_n)_i\}$, we get

$E[(\hat{g}_n)_i \mid u_n] = (\nabla J(u_n))_i + O(c_n^2).$

The result follows from the hypothesis that $c_n \to 0$.
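For the interested reader, a minimal sketch of the intermediate Taylor step (noise-free case; measurement-noise terms, assumed to have zero conditional mean, are omitted, and $\bar{u}_{\pm}$ denote remainder points introduced here only for the sketch):

$J(u_n \pm c_n\Delta_n) = J(u_n) \pm c_n\,\Delta_n^{\top}\nabla J(u_n) + \tfrac{1}{2}c_n^{2}\,\Delta_n^{\top}\nabla^{2}J(u_n)\,\Delta_n \pm \tfrac{1}{6}c_n^{3}\,J^{(3)}(\bar{u}_{\pm})[\Delta_n,\Delta_n,\Delta_n],$

so that the even-order terms cancel in the difference and

$(\hat{g}_n)_i = \dfrac{J(u_n + c_n\Delta_n) - J(u_n - c_n\Delta_n)}{2 c_n (\Delta_n)_i} = (\nabla J(u_n))_i + \sum_{j\neq i}\dfrac{(\Delta_n)_j}{(\Delta_n)_i}\,(\nabla J(u_n))_j + O(c_n^{2}).$

Taking expectations, the cross terms vanish because the $(\Delta_n)_j$ have zero mean and are independent of $(\Delta_n)_i$, while the remainder is controlled by the bounded third derivative and the bounded inverse moment $E(|(\Delta_n)_i|^{-1})$, giving $b_n = O(c_n^2)$.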
Next we summarize some of the hypotheses under which $u_n$ converges in probability to the set of global minima of $J(u)$. The efficiency of the method depends on the shape of $J(u)$, the values of the parameters $a_n$ and $c_n$, and the distribution of the perturbation terms $(\Delta_n)_i$. First, the algorithm parameters must satisfy the following conditions: $a_n > 0$ and $c_n > 0$ for all $n$; $a_n \to 0$ and $c_n \to 0$; $\sum_{n=1}^{\infty} a_n = \infty$; and $\sum_{n=1}^{\infty} (a_n/c_n)^2 < \infty$.
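As a quick check, gain sequences of the form $a_n = a/(n+A)^{0.602}$ and $c_n = c/n^{0.101}$, which are often suggested in the SPSA literature, satisfy all four conditions; the helper below and its default constants are assumptions for illustration only.

```python
import numpy as np

def spsa_gains(n, a=0.1, A=10.0, c=0.1, alpha=0.602, gamma=0.101):
    """Gain sequences for iterations 1..n: both positive and decaying to zero,
    with sum(a_n) divergent (alpha < 1) and sum((a_n / c_n)**2) convergent
    (since 2 * (alpha - gamma) = 1.002 > 1)."""
    k = np.arange(1, n + 1)
    return a / (k + A) ** alpha, c / k ** gamma
```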
A good choice for $(\Delta_n)_i$ is the Rademacher distribution, i.e. Bernoulli ±1 with probability 0.5. Other choices are possible too, but note that the uniform and normal distributions cannot be used because they do not satisfy the finite inverse moment condition.
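Putting the conditions together, the following is a minimal sketch of the basic SPSA recursion with Rademacher perturbations; the function name spsa_minimize, the quadratic test loss, and all numeric constants are illustrative assumptions, not a definitive implementation.

```python
import numpy as np

def spsa_minimize(J, u0, n_iterations=1000, a=0.1, A=10.0, c=0.1,
                  alpha=0.602, gamma=0.101, seed=None):
    """Minimal SPSA sketch: each iteration uses two noisy measurements of J
    and perturbs all coordinates simultaneously with a Rademacher vector."""
    rng = np.random.default_rng(seed)
    u = np.array(u0, dtype=float)
    for n in range(1, n_iterations + 1):
        a_n = a / (n + A) ** alpha                       # step size
        c_n = c / n ** gamma                             # perturbation size
        delta = rng.choice([-1.0, 1.0], size=u.shape)    # Rademacher perturbation
        # Simultaneous-perturbation gradient estimate (elementwise division by delta)
        g_hat = (J(u + c_n * delta) - J(u - c_n * delta)) / (2.0 * c_n * delta)
        u = u - a_n * g_hat
    return u

# Illustrative usage: noisy quadratic loss with its minimum at the origin.
if __name__ == "__main__":
    noise = np.random.default_rng(0)
    J = lambda u: float(np.sum(u ** 2)) + 0.01 * noise.normal()
    print(spsa_minimize(J, u0=np.ones(5), seed=1))
```

Note that each iteration uses only two measurements of $J$, regardless of the dimension of $u$.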
The loss function $J(u)$ must be three times continuously differentiable and the individual elements of the third derivative must be bounded: $|J^{(3)}(u)| < a_3 < \infty$. Also, $|J(u)| \to \infty$ as $\|u\| \to \infty$.
In addition, $\nabla J$ must be Lipschitz continuous and bounded, and the ODE $\dot{u} = g(u)$ must have a unique solution for each initial condition. Under these conditions and a few others, $u_n$ converges in probability to the set of global minima of $J(u)$ (see Maryak and Chin, 2008).
It has been shown that differentiability is not required: continuity and convexity are sufficient for convergence.[1]
It is known that a stochastic version of the standard (deterministic) Newton-Raphson algorithm (a “second-order” method) provides an asymptotically optimal or near-optimal form of stochastic approximation. SPSA can also be used to efficiently estimate the Hessian matrix of the loss function based on either noisy loss measurements or noisy gradient measurements (stochastic gradients). As with the basic SPSA method, only a small fixed number of loss measurements or gradient measurements are needed at each iteration, regardless of the problem dimension p. See the brief discussion in Stochastic gradient descent.
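As an illustration of how the same perturbation idea yields a Hessian estimate when noisy gradient measurements are available, here is a minimal sketch; the function name sp_hessian_estimate, the argument names, and the simple symmetrized outer-product form are assumptions for this example rather than a full statement of the adaptive (second-order) SPSA algorithm.

```python
import numpy as np

def sp_hessian_estimate(grad, u, c_tilde=0.01, seed=None):
    """One-sample simultaneous-perturbation Hessian estimate at u, built from
    two (possibly noisy) gradient measurements."""
    rng = np.random.default_rng(seed)
    d = rng.choice([-1.0, 1.0], size=np.shape(u))        # Rademacher perturbation
    delta_g = grad(u + c_tilde * d) - grad(u - c_tilde * d)
    # delta_g is approximately 2 * c_tilde * H @ d, so dividing elementwise by
    # 2 * c_tilde * d and symmetrizing gives a matrix whose mean is close to H.
    h = np.outer(delta_g, 1.0 / (2.0 * c_tilde * d))
    return 0.5 * (h + h.T)
```

In an adaptive second-order scheme, such per-iteration estimates are typically smoothed (averaged) across iterations before being used in a Newton-type update, so only two gradient measurements are needed per iteration regardless of p.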
1. He, Ying; Fu, Michael C.; Marcus, Steven I. (August 2003). "Convergence of simultaneous perturbation stochastic approximation for nondifferentiable optimization". IEEE Transactions on Automatic Control. 48 (8): 1459–1463. doi:10.1109/TAC.2003.815008. Retrieved March 6, 2022. https://ieeexplore.ieee.org/document/1220767