The Robbins–Monro algorithm, introduced in 1951 by Herbert Robbins and Sutton Monro,3 presented a methodology for solving a root finding problem, where the function is represented as an expected value. Assume that we have a function M ( θ ) {\textstyle M(\theta )} , and a constant α {\textstyle \alpha } , such that the equation M ( θ ) = α {\textstyle M(\theta )=\alpha } has a unique root at θ ∗ . {\textstyle \theta ^{*}.} It is assumed that while we cannot directly observe the function M ( θ ) , {\textstyle M(\theta ),} we can instead obtain measurements of the random variable N ( θ ) {\textstyle N(\theta )} where E [ N ( θ ) ] = M ( θ ) {\textstyle \operatorname {E} [N(\theta )]=M(\theta )} . The structure of the algorithm is to then generate iterates of the form:
θ n + 1 = θ n − a n ( N ( θ n ) − α ) {\displaystyle \theta _{n+1}=\theta _{n}-a_{n}(N(\theta _{n})-\alpha )}
Here, a 1 , a 2 , … {\displaystyle a_{1},a_{2},\dots } is a sequence of positive step sizes. Robbins and Monro proved4, Theorem 2 that θ n {\displaystyle \theta _{n}} converges in L 2 {\displaystyle L^{2}} (and hence also in probability) to θ ∗ {\displaystyle \theta ^{*}} , and Blum5 later proved the convergence is actually with probability one, provided that:
∑ n = 0 ∞ a n = ∞ and ∑ n = 0 ∞ a n 2 < ∞ {\displaystyle \qquad \sum _{n=0}^{\infty }a_{n}=\infty \quad {\mbox{ and }}\quad \sum _{n=0}^{\infty }a_{n}^{2}<\infty \quad } A particular sequence of steps which satisfy these conditions, and was suggested by Robbins–Monro, have the form: a n = a / n {\textstyle a_{n}=a/n} , for a > 0 {\textstyle a>0} . Other series, such as a n = 1 n ln n , 1 n ln n ln ln n , … {\displaystyle a_{n}={\frac {1}{n\ln n}},{\frac {1}{n\ln n\ln \ln n}},\dots } are possible but in order to average out the noise in N ( θ ) {\textstyle N(\theta )} , the above condition must be met.
Consider the problem of estimating the mean θ ∗ {\displaystyle \theta ^{*}} of a probability distribution from a stream of independent samples X 1 , X 2 , … {\displaystyle X_{1},X_{2},\dots } .
Let N ( θ ) := θ − X {\displaystyle N(\theta ):=\theta -X} , then the unique solution to E [ N ( θ ) ] = 0 {\textstyle \operatorname {E} [N(\theta )]=0} is the desired mean θ ∗ {\displaystyle \theta ^{*}} . The RM algorithm gives us θ n + 1 = θ n − a n ( θ n − X n ) {\displaystyle \theta _{n+1}=\theta _{n}-a_{n}(\theta _{n}-X_{n})} This is equivalent to stochastic gradient descent with loss function L ( θ ) = 1 2 ‖ X − θ ‖ 2 {\displaystyle L(\theta )={\frac {1}{2}}\|X-\theta \|^{2}} . It is also equivalent to a weighted average: θ n + 1 = ( 1 − a n ) θ n + a n X n {\displaystyle \theta _{n+1}=(1-a_{n})\theta _{n}+a_{n}X_{n}} In general, if there exists some function L {\displaystyle L} such that ∇ L ( θ ) = N ( θ ) − α {\displaystyle \nabla L(\theta )=N(\theta )-\alpha } , then the Robbins–Monro algorithm is equivalent to stochastic gradient descent with loss function L ( θ ) {\displaystyle L(\theta )} . However, the RM algorithm does not require L {\displaystyle L} to exist in order to converge.
While the Robbins–Monro algorithm is theoretically able to achieve O ( 1 / n ) {\textstyle O(1/n)} under the assumption of twice continuous differentiability and strong convexity, it can perform quite poorly upon implementation. This is primarily due to the fact that the algorithm is very sensitive to the choice of the step size sequence, and the supposed asymptotically optimal step size policy can be quite harmful in the beginning.910
Chung (1954)11 and Fabian (1968)12 showed that we would achieve optimal convergence rate O ( 1 / n ) {\textstyle O(1/{\sqrt {n}})} with a n = ▽ 2 f ( θ ∗ ) − 1 / n {\textstyle a_{n}=\bigtriangledown ^{2}f(\theta ^{*})^{-1}/n} (or a n = 1 ( n M ′ ( θ ∗ ) ) {\textstyle a_{n}={\frac {1}{(nM'(\theta ^{*}))}}} ). Lai and Robbins1314 designed adaptive procedures to estimate M ′ ( θ ∗ ) {\textstyle M'(\theta ^{*})} such that θ n {\textstyle \theta _{n}} has minimal asymptotic variance. However the application of such optimal methods requires much a priori information which is hard to obtain in most situations. To overcome this shortfall, Polyak (1991)15 and Ruppert (1988)16 independently developed a new optimal algorithm based on the idea of averaging the trajectories. Polyak and Juditsky17 also presented a method of accelerating Robbins–Monro for linear and non-linear root-searching problems through the use of longer steps, and averaging of the iterates. The algorithm would have the following structure: θ n + 1 − θ n = a n ( α − N ( θ n ) ) , θ ¯ n = 1 n ∑ i = 0 n − 1 θ i {\displaystyle \theta _{n+1}-\theta _{n}=a_{n}(\alpha -N(\theta _{n})),\qquad {\bar {\theta }}_{n}={\frac {1}{n}}\sum _{i=0}^{n-1}\theta _{i}} The convergence of θ ¯ n {\displaystyle {\bar {\theta }}_{n}} to the unique root θ ∗ {\displaystyle \theta ^{*}} relies on the condition that the step sequence { a n } {\displaystyle \{a_{n}\}} decreases sufficiently slowly. That is
A1) a n → 0 , a n − a n + 1 a n = o ( a n ) {\displaystyle a_{n}\rightarrow 0,\qquad {\frac {a_{n}-a_{n+1}}{a_{n}}}=o(a_{n})}
Therefore, the sequence a n = n − α {\textstyle a_{n}=n^{-\alpha }} with 0 < α < 1 {\textstyle 0<\alpha <1} satisfies this restriction, but α = 1 {\textstyle \alpha =1} does not, hence the longer steps. Under the assumptions outlined in the Robbins–Monro algorithm, the resulting modification will result in the same asymptotically optimal convergence rate O ( 1 / n ) {\textstyle O(1/{\sqrt {n}})} yet with a more robust step size policy.18 Prior to this, the idea of using longer steps and averaging the iterates had already been proposed by Nemirovski and Yudin19 for the cases of solving the stochastic optimization problem with continuous convex objectives and for convex-concave saddle point problems. These algorithms were observed to attain the nonasymptotic rate O ( 1 / n ) {\textstyle O(1/{\sqrt {n}})} .
A more general result is given in Chapter 11 of Kushner and Yin20 by defining interpolated time t n = ∑ i = 0 n − 1 a i {\textstyle t_{n}=\sum _{i=0}^{n-1}a_{i}} , interpolated process θ n ( ⋅ ) {\textstyle \theta ^{n}(\cdot )} and interpolated normalized process U n ( ⋅ ) {\textstyle U^{n}(\cdot )} as
θ n ( t ) = θ n + i , U n ( t ) = ( θ n + i − θ ∗ ) / a n + i for t ∈ [ t n + i − t n , t n + i + 1 − t n ) , i ≥ 0 {\displaystyle \theta ^{n}(t)=\theta _{n+i},\quad U^{n}(t)=(\theta _{n+i}-\theta ^{*})/{\sqrt {a_{n+i}}}\quad {\mbox{for}}\quad t\in [t_{n+i}-t_{n},t_{n+i+1}-t_{n}),i\geq 0} Let the iterate average be Θ n = a n t ∑ i = n n + t / a n − 1 θ i {\displaystyle \Theta _{n}={\frac {a_{n}}{t}}\sum _{i=n}^{n+t/a_{n}-1}\theta _{i}} and the associate normalized error to be U ^ n ( t ) = a n t ∑ i = n n + t / a n − 1 ( θ i − θ ∗ ) {\displaystyle {\hat {U}}^{n}(t)={\frac {\sqrt {a_{n}}}{t}}\sum _{i=n}^{n+t/a_{n}-1}(\theta _{i}-\theta ^{*})} .
With assumption A1) and the following A2)
A2) There is a Hurwitz matrix A {\textstyle A} and a symmetric and positive-definite matrix Σ {\textstyle \Sigma } such that { U n ( ⋅ ) } {\textstyle \{U^{n}(\cdot )\}} converges weakly to U ( ⋅ ) {\textstyle U(\cdot )} , where U ( ⋅ ) {\textstyle U(\cdot )} is the statisolution to d U = A U d t + Σ 1 / 2 d w {\displaystyle dU=AU\,dt+\Sigma ^{1/2}\,dw} where w ( ⋅ ) {\textstyle w(\cdot )} is a standard Wiener process.
satisfied, and define V ¯ = ( A − 1 ) ′ Σ ( A ′ ) − 1 {\textstyle {\bar {V}}=(A^{-1})'\Sigma (A')^{-1}} . Then for each t {\textstyle t} ,
U ^ n ( t ) ⟶ D N ( 0 , V t ) , where V t = V ¯ / t + O ( 1 / t 2 ) . {\displaystyle {\hat {U}}^{n}(t){\stackrel {\mathcal {D}}{\longrightarrow }}{\mathcal {N}}(0,V_{t}),\quad {\text{where}}\quad V_{t}={\bar {V}}/t+O(1/t^{2}).}
The success of the averaging idea is because of the time scale separation of the original sequence { θ n } {\textstyle \{\theta _{n}\}} and the averaged sequence { Θ n } {\textstyle \{\Theta _{n}\}} , with the time scale of the former one being faster.
Suppose we want to solve the following stochastic optimization problem g ( θ ∗ ) = min θ ∈ Θ E [ Q ( θ , X ) ] , {\displaystyle g(\theta ^{*})=\min _{\theta \in \Theta }\operatorname {E} [Q(\theta ,X)],} where g ( θ ) = E [ Q ( θ , X ) ] {\textstyle g(\theta )=\operatorname {E} [Q(\theta ,X)]} is differentiable and convex, then this problem is equivalent to find the root θ ∗ {\displaystyle \theta ^{*}} of ∇ g ( θ ) = 0 {\displaystyle \nabla g(\theta )=0} . Here Q ( θ , X ) {\displaystyle Q(\theta ,X)} can be interpreted as some "observed" cost as a function of the chosen θ {\displaystyle \theta } and random effects X {\displaystyle X} . In practice, it might be hard to get an analytical form of ∇ g ( θ ) {\displaystyle \nabla g(\theta )} , Robbins–Monro method manages to generate a sequence ( θ n ) n ≥ 0 {\displaystyle (\theta _{n})_{n\geq 0}} to approximate θ ∗ {\displaystyle \theta ^{*}} if one can generate ( X n ) n ≥ 0 {\displaystyle (X_{n})_{n\geq 0}} , in which the conditional expectation of X n {\displaystyle X_{n}} given θ n {\displaystyle \theta _{n}} is exactly ∇ g ( θ n ) {\displaystyle \nabla g(\theta _{n})} , i.e. X n {\displaystyle X_{n}} is simulated from a conditional distribution defined by
E [ H ( θ , X ) | θ = θ n ] = ∇ g ( θ n ) . {\displaystyle \operatorname {E} [H(\theta ,X)|\theta =\theta _{n}]=\nabla g(\theta _{n}).}
Here H ( θ , X ) {\displaystyle H(\theta ,X)} is an unbiased estimator of ∇ g ( θ ) {\displaystyle \nabla g(\theta )} . If X {\displaystyle X} depends on θ {\displaystyle \theta } , there is in general no natural way of generating a random outcome H ( θ , X ) {\displaystyle H(\theta ,X)} that is an unbiased estimator of the gradient. In some special cases when either IPA or likelihood ratio methods are applicable, then one is able to obtain an unbiased gradient estimator H ( θ , X ) {\displaystyle H(\theta ,X)} . If X {\displaystyle X} is viewed as some "fundamental" underlying random process that is generated independently of θ {\displaystyle \theta } , and under some regularization conditions for derivative-integral interchange operations so that E [ ∂ ∂ θ Q ( θ , X ) ] = ∇ g ( θ ) {\displaystyle \operatorname {E} {\Big [}{\frac {\partial }{\partial \theta }}Q(\theta ,X){\Big ]}=\nabla g(\theta )} , then H ( θ , X ) = ∂ ∂ θ Q ( θ , X ) {\displaystyle H(\theta ,X)={\frac {\partial }{\partial \theta }}Q(\theta ,X)} gives the fundamental gradient unbiased estimate. However, for some applications we have to use finite-difference methods in which H ( θ , X ) {\displaystyle H(\theta ,X)} has a conditional expectation close to ∇ g ( θ ) {\displaystyle \nabla g(\theta )} but not exactly equal to it.
We then define a recursion analogously to Newton's Method in the deterministic algorithm:
The following result gives sufficient conditions on θ n {\displaystyle \theta _{n}} for the algorithm to converge:21
C1) ε n ≥ 0 , ∀ n ≥ 0. {\displaystyle \varepsilon _{n}\geq 0,\forall \;n\geq 0.}
C2) ∑ n = 0 ∞ ε n = ∞ {\displaystyle \sum _{n=0}^{\infty }\varepsilon _{n}=\infty }
C3) ∑ n = 0 ∞ ε n 2 < ∞ {\displaystyle \sum _{n=0}^{\infty }\varepsilon _{n}^{2}<\infty }
C4) | X n | ≤ B , for a fixed bound B . {\displaystyle |X_{n}|\leq B,{\text{ for a fixed bound }}B.}
C5) g ( θ ) is strictly convex, i.e. {\displaystyle g(\theta ){\text{ is strictly convex, i.e.}}}
Then θ n {\displaystyle \theta _{n}} converges to θ ∗ {\displaystyle \theta ^{*}} almost surely.
Here are some intuitive explanations about these conditions. Suppose H ( θ n , X n + 1 ) {\displaystyle H(\theta _{n},X_{n+1})} is a uniformly bounded random variables. If C2) is not satisfied, i.e. ∑ n = 0 ∞ ε n < ∞ {\displaystyle \sum _{n=0}^{\infty }\varepsilon _{n}<\infty } , then θ n − θ 0 = − ∑ i = 0 n − 1 ε i H ( θ i , X i + 1 ) {\displaystyle \theta _{n}-\theta _{0}=-\sum _{i=0}^{n-1}\varepsilon _{i}H(\theta _{i},X_{i+1})} is a bounded sequence, so the iteration cannot converge to θ ∗ {\displaystyle \theta ^{*}} if the initial guess θ 0 {\displaystyle \theta _{0}} is too far away from θ ∗ {\displaystyle \theta ^{*}} . As for C3) note that if θ n {\displaystyle \theta _{n}} converges to θ ∗ {\displaystyle \theta ^{*}} then
θ n + 1 − θ n = − ε n H ( θ n , X n + 1 ) → 0 , as n → ∞ . {\displaystyle \theta _{n+1}-\theta _{n}=-\varepsilon _{n}H(\theta _{n},X_{n+1})\rightarrow 0,{\text{ as }}n\rightarrow \infty .} so we must have ε n ↓ 0 {\displaystyle \varepsilon _{n}\downarrow 0} ,and the condition C3) ensures it. A natural choice would be ε n = 1 / n {\displaystyle \varepsilon _{n}=1/n} . Condition C5) is a fairly stringent condition on the shape of g ( θ ) {\displaystyle g(\theta )} ; it gives the search direction of the algorithm.
Suppose Q ( θ , X ) = f ( θ ) + θ T X {\displaystyle Q(\theta ,X)=f(\theta )+\theta ^{T}X} , where f {\displaystyle f} is differentiable and X ∈ R p {\displaystyle X\in \mathbb {R} ^{p}} is a random variable independent of θ {\displaystyle \theta } . Then g ( θ ) = E [ Q ( θ , X ) ] = f ( θ ) + θ T E X {\displaystyle g(\theta )=\operatorname {E} [Q(\theta ,X)]=f(\theta )+\theta ^{T}\operatorname {E} X} depends on the mean of X {\displaystyle X} , and the stochastic gradient method would be appropriate in this problem. We can choose H ( θ , X ) = ∂ ∂ θ Q ( θ , X ) = ∂ ∂ θ f ( θ ) + X . {\displaystyle H(\theta ,X)={\frac {\partial }{\partial \theta }}Q(\theta ,X)={\frac {\partial }{\partial \theta }}f(\theta )+X.}
The Kiefer–Wolfowitz algorithm was introduced in 1952 by Jacob Wolfowitz and Jack Kiefer,23 and was motivated by the publication of the Robbins–Monro algorithm. However, the algorithm was presented as a method which would stochastically estimate the maximum of a function.
Let M ( x ) {\displaystyle M(x)} be a function which has a maximum at the point θ {\displaystyle \theta } . It is assumed that M ( x ) {\displaystyle M(x)} is unknown; however, certain observations N ( x ) {\displaystyle N(x)} , where E [ N ( x ) ] = M ( x ) {\displaystyle \operatorname {E} [N(x)]=M(x)} , can be made at any point x {\displaystyle x} . The structure of the algorithm follows a gradient-like method, with the iterates being generated as
where N ( x n + c n ) {\displaystyle N(x_{n}+c_{n})} and N ( x n − c n ) {\displaystyle N(x_{n}-c_{n})} are independent. At every step, the gradient of M ( x ) {\displaystyle M(x)} is approximated akin to a central difference method with h = 2 c n {\displaystyle h=2c_{n}} . So the sequence { c n } {\displaystyle \{c_{n}\}} specifies the sequence of finite difference widths used for the gradient approximation, while the sequence { a n } {\displaystyle \{a_{n}\}} specifies a sequence of positive step sizes taken along that direction.
Kiefer and Wolfowitz proved that, if M ( x ) {\displaystyle M(x)} satisfied certain regularity conditions, then x n {\displaystyle x_{n}} will converge to θ {\displaystyle \theta } in probability as n → ∞ {\displaystyle n\to \infty } , and later Blum24 in 1954 showed x n {\displaystyle x_{n}} converges to θ {\displaystyle \theta } almost surely, provided that:
A suitable choice of sequences, as recommended by Kiefer and Wolfowitz, would be a n = 1 / n {\displaystyle a_{n}=1/n} and c n = n − 1 / 3 {\displaystyle c_{n}=n^{-1/3}} .
An extensive theoretical literature has grown up around these algorithms, concerning conditions for convergence, rates of convergence, multivariate and other generalizations, proper choice of step size, possible noise models, and so on.2627 These methods are also applied in control theory, in which case the unknown function which we wish to optimize or find the zero of may vary in time. In this case, the step size a n {\displaystyle a_{n}} should not converge to zero but should be chosen so as to track the function.28, 2nd ed., chapter 3
C. Johan Masreliez and R. Douglas Martin were the first to apply stochastic approximation to robust estimation.29
The main tool for analyzing stochastic approximations algorithms (including the Robbins–Monro and the Kiefer–Wolfowitz algorithms) is a theorem by Aryeh Dvoretzky published in 1956.30
Toulis, Panos; Airoldi, Edoardo (2015). "Scalable estimation strategies based on stochastic approximations: classical results and new insights". Statistics and Computing. 25 (4): 781–795. doi:10.1007/s11222-015-9560-y. PMC 4484776. PMID 26139959. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4484776 ↩
Le Ny, Jerome. "Introduction to Stochastic Approximation Algorithms" (PDF). Polytechnique Montreal. Teaching Notes. Retrieved 16 November 2016. http://www.professeurs.polymtl.ca/jerome.le-ny/teaching/DP_fall09/notes/lec11_SA.pdf ↩
Robbins, H.; Monro, S. (1951). "A Stochastic Approximation Method". The Annals of Mathematical Statistics. 22 (3): 400. doi:10.1214/aoms/1177729586. /wiki/Herbert_Robbins ↩
Blum, Julius R. (1954-06-01). "Approximation Methods which Converge with Probability one". The Annals of Mathematical Statistics. 25 (2): 382–386. doi:10.1214/aoms/1177728794. ISSN 0003-4851. https://doi.org/10.1214%2Faoms%2F1177728794 ↩
Sacks, J. (1958). "Asymptotic Distribution of Stochastic Approximation Procedures". The Annals of Mathematical Statistics. 29 (2): 373–405. doi:10.1214/aoms/1177706619. JSTOR 2237335. https://doi.org/10.1214%2Faoms%2F1177706619 ↩
Nemirovski, A.; Juditsky, A.; Lan, G.; Shapiro, A. (2009). "Robust Stochastic Approximation Approach to Stochastic Programming". SIAM Journal on Optimization. 19 (4): 1574. doi:10.1137/070704277. /wiki/Arkadi_Nemirovski ↩
Problem Complexity and Method Efficiency in Optimization, A. Nemirovski and D. Yudin, Wiley -Intersci. Ser. Discrete Math 15 John Wiley New York (1983) . ↩
Introduction to Stochastic Search and Optimization: Estimation, Simulation and Control, J.C. Spall, John Wiley Hoboken, NJ, (2003). https://books.google.com/books?id=f66OIvvkKnAC&q=%22Robbins-Monro%22 ↩
Chung, K. L. (1954-09-01). "On a Stochastic Approximation Method". The Annals of Mathematical Statistics. 25 (3): 463–483. doi:10.1214/aoms/1177728716. ISSN 0003-4851. https://doi.org/10.1214%2Faoms%2F1177728716 ↩
Fabian, Vaclav (1968-08-01). "On Asymptotic Normality in Stochastic Approximation". The Annals of Mathematical Statistics. 39 (4): 1327–1332. doi:10.1214/aoms/1177698258. ISSN 0003-4851. https://doi.org/10.1214%2Faoms%2F1177698258 ↩
Lai, T. L.; Robbins, Herbert (1979-11-01). "Adaptive Design and Stochastic Approximation". The Annals of Statistics. 7 (6): 1196–1221. doi:10.1214/aos/1176344840. ISSN 0090-5364. https://doi.org/10.1214%2Faos%2F1176344840 ↩
Lai, Tze Leung; Robbins, Herbert (1981-09-01). "Consistency and asymptotic efficiency of slope estimates in stochastic approximation schemes". Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete. 56 (3): 329–360. doi:10.1007/BF00536178. ISSN 0044-3719. S2CID 122109044. https://doi.org/10.1007%2FBF00536178 ↩
Polyak, B T (1991). "New stochastic approximation type procedures. (In Russian.)". Automation and Remote Control. 7 (7). https://www.researchgate.net/publication/236736759 ↩
Ruppert, David (1988). Efficient estimators from a slowly converging robbins-monro process (Technical Report 781). Cornell University School of Operations Research and Industrial Engineering. https://www.researchgate.net/publication/242608650 ↩
Polyak, B. T.; Juditsky, A. B. (1992). "Acceleration of Stochastic Approximation by Averaging". SIAM Journal on Control and Optimization. 30 (4): 838. doi:10.1137/0330046. /wiki/Doi_(identifier) ↩
On Cezari's convergence of the steepest descent method for approximating saddle points of convex-concave functions, A. Nemirovski and D. Yudin, Dokl. Akad. Nauk SSR 2939, (1978 (Russian)), Soviet Math. Dokl. 19 (1978 (English)). ↩
Kushner, Harold; George Yin, G. (2003-07-17). Stochastic Approximation and Recursive Algorithms and | Harold Kushner | Springer. www.springer.com. ISBN 9780387008943. Retrieved 2016-05-16. 9780387008943 ↩
Bouleau, N.; Lepingle, D. (1994). Numerical Methods for stochastic Processes. New York: John Wiley. ISBN 9780471546412. 9780471546412 ↩
Kiefer, J.; Wolfowitz, J. (1952). "Stochastic Estimation of the Maximum of a Regression Function". The Annals of Mathematical Statistics. 23 (3): 462. doi:10.1214/aoms/1177729392. https://doi.org/10.1214%2Faoms%2F1177729392 ↩
Spall, J. C. (2000). "Adaptive stochastic approximation by the simultaneous perturbation method". IEEE Transactions on Automatic Control. 45 (10): 1839–1853. doi:10.1109/TAC.2000.880982. /wiki/Doi_(identifier) ↩
Kushner, H. J.; Yin, G. G. (1997). Stochastic Approximation Algorithms and Applications. doi:10.1007/978-1-4899-2696-8. ISBN 978-1-4899-2698-2. 978-1-4899-2698-2 ↩
Stochastic Approximation and Recursive Estimation, Mikhail Borisovich Nevel'son and Rafail Zalmanovich Has'minskiĭ, translated by Israel Program for Scientific Translations and B. Silver, Providence, RI: American Mathematical Society, 1973, 1976. ISBN 0-8218-1597-0. /wiki/ISBN_(identifier) ↩
Martin, R.; Masreliez, C. (1975). "Robust estimation via stochastic approximation". IEEE Transactions on Information Theory. 21 (3): 263. doi:10.1109/TIT.1975.1055386. /wiki/Doi_(identifier) ↩
Dvoretzky, Aryeh (1956). "On stochastic approximation". In Neyman, Jerzy (ed.). Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1954–1955, vol. I. University of California Press. pp. 39–55. MR 0084911. /wiki/Aryeh_Dvoretzky ↩