Scoring algorithm

<h2 id="sketch-of-derivation">Sketch of derivation</h2>
<p>Let \(Y_{1},\ldots ,Y_{n}\) be independent and identically distributed <a href="/facts/Random_variable/TwTBXnLT">random variables</a> with twice differentiable <a href="/facts/Probability_density_function/zvfybna4">p.d.f.</a> \(f(y;\theta )\), and suppose we wish to calculate the <a href="/facts/Maximum_likelihood_estimator/0Yq2dpQD">maximum likelihood estimator</a> (M.L.E.) \(\theta ^{*}\) of \(\theta \). First, suppose we have a starting point for our algorithm, \(\theta _{0}\), and consider a first-order <a href="/facts/Taylor_series/M0aTuftH">Taylor expansion</a> of the <a href="/facts/Score_(statistics)/gnBI7IEy">score function</a>, \(V(\theta )\), about \(\theta _{0}\):
</p>

<p>\[V(\theta )\approx V(\theta _{0})-{\mathcal {J}}(\theta _{0})(\theta -\theta _{0}),\]</p>

<p>where 
</p>

<p>\[{\mathcal {J}}(\theta _{0})=-\sum _{i=1}^{n}\left.\nabla \nabla ^{\top }\right|_{\theta =\theta _{0}}\log f(Y_{i};\theta )\]</p>

<p>is the <a href="/facts/Observed_information/S2gDCn8V">observed information matrix</a> at \(\theta _{0}\). Now, setting \(\theta =\theta ^{*}\), using that \(V(\theta ^{*})=0\) and rearranging gives us:
</p>

<p>\[\theta ^{*}\approx \theta _{0}+{\mathcal {J}}^{-1}(\theta _{0})V(\theta _{0}).\]</p>
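<p>The rearrangement can be spelled out line by line. This is only a restatement of the step above, and it assumes that \({\mathcal {J}}(\theta _{0})\) is invertible, which the text leaves implicit:</p>

```latex
\begin{aligned}
0 = V(\theta^{*}) &\approx V(\theta_{0}) - \mathcal{J}(\theta_{0})\,(\theta^{*} - \theta_{0}) \\
\mathcal{J}(\theta_{0})\,(\theta^{*} - \theta_{0}) &\approx V(\theta_{0}) \\
\theta^{*} &\approx \theta_{0} + \mathcal{J}^{-1}(\theta_{0})\,V(\theta_{0})
\end{aligned}
```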

<p>We therefore use the algorithm
</p>

<p>\[\theta _{m+1}=\theta _{m}+{\mathcal {J}}^{-1}(\theta _{m})V(\theta _{m}),\]</p>

<p>and under certain regularity conditions, it can be shown that \(\theta _{m}\rightarrow \theta ^{*}\).
</p>
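<p>As a minimal sketch of the iteration above, the following Python code applies it to a scalar parameter. The model is an assumption chosen purely for illustration (a Cauchy location model with unit scale, which the article does not mention), and the function names <code>score</code>, <code>observed_information</code> and <code>newton_scoring</code> are hypothetical. For this model \(\log f(y;\theta )=-\log \pi -\log(1+(y-\theta )^{2})\), so both \(V\) and \({\mathcal {J}}\) have closed forms.</p>

```python
import numpy as np

def score(theta, y):
    """Score V(theta): derivative of the log-likelihood for an
    assumed Cauchy location model with unit scale."""
    u = y - theta
    return np.sum(2.0 * u / (1.0 + u**2))

def observed_information(theta, y):
    """Observed information J(theta): minus the second derivative
    of the log-likelihood for the same assumed model."""
    u = y - theta
    return np.sum(2.0 * (1.0 - u**2) / (1.0 + u**2) ** 2)

def newton_scoring(y, theta0, tol=1e-10, max_iter=100):
    """Iterate theta_{m+1} = theta_m + J(theta_m)^{-1} V(theta_m)
    until the step size falls below tol or max_iter is reached."""
    theta = float(theta0)
    for _ in range(max_iter):
        step = score(theta, y) / observed_information(theta, y)
        theta += step
        if abs(step) < tol:
            break
    return theta

# Small illustrative data set, symmetric about zero.
y = np.array([-1.0, 0.0, 1.0])
theta_hat = newton_scoring(y, theta0=0.5)
print(theta_hat)
```

<p>The observed information of this model can be negative far from the maximum, so the sketch relies on a starting point reasonably close to \(\theta ^{*}\); it makes no attempt to safeguard the step.</p>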
<h2 id="fisher-scoring">Fisher scoring</h2>
<p>In practice, \({\mathcal {J}}(\theta )\) is usually replaced by \({\mathcal {I}}(\theta )=\mathrm {E} [{\mathcal {J}}(\theta )]\), the <a href="/facts/Fisher_information/Q2JLexN9">Fisher information</a>, thus giving us the Fisher scoring algorithm:
</p>

<p>\[\theta _{m+1}=\theta _{m}+{\mathcal {I}}^{-1}(\theta _{m})V(\theta _{m}).\]</p>
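<p>The Fisher scoring update can be sketched in the same way. The example below again assumes a Cauchy location model with unit scale (an illustrative choice, not taken from the article), for which the Fisher information of \(n\) observations is \(n/2\) and does not depend on \(\theta \). The names <code>score</code>, <code>fisher_information</code> and <code>fisher_scoring</code> are hypothetical.</p>

```python
import numpy as np

def score(theta, y):
    """Score V(theta) for an assumed Cauchy location model, unit scale."""
    u = y - theta
    return np.sum(2.0 * u / (1.0 + u**2))

def fisher_information(n):
    """Expected information I(theta) for n unit-scale Cauchy
    observations; it equals n / 2 for every theta."""
    return n / 2.0

def fisher_scoring(y, theta0, tol=1e-10, max_iter=200):
    """Iterate theta_{m+1} = theta_m + I(theta_m)^{-1} V(theta_m)."""
    theta = float(theta0)
    info = fisher_information(len(y))  # constant, so computed once
    for _ in range(max_iter):
        step = score(theta, y) / info
        theta += step
        if abs(step) < tol:
            break
    return theta

# Small illustrative data set, symmetric about zero.
y = np.array([-1.0, 0.0, 1.0])
theta_fisher = fisher_scoring(y, theta0=0.5)
print(theta_fisher)
```

<p>Because the expected information is positive and free of the data, each step moves in the direction of the score, at the cost of slower (linear rather than quadratic) convergence in this sketch.</p>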
<p>Under some regularity conditions, if \(\theta _{m}\) is a <a href="/facts/Consistent_estimator/HAKDWYEM">consistent estimator</a>, then \(\theta _{m+1}\) (the correction after a single step) is 'optimal' in the sense that its error distribution is asymptotically identical to that of the true maximum likelihood estimate.<a class="footnote-ref" id="fnref:2" href="#fn:2"><sup>2</sup></a>
</p>
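<p>A single correction step of this kind might look as follows. This is a sketch under stated assumptions: the data are simulated from a Cauchy location model with unit scale, the sample median serves as the consistent starting estimator, and <code>true_location</code>, <code>score</code> and <code>one_step_estimate</code> are hypothetical names introduced only for the illustration.</p>

```python
import numpy as np

def score(theta, y):
    """Score V(theta) for an assumed Cauchy location model, unit scale."""
    u = y - theta
    return np.sum(2.0 * u / (1.0 + u**2))

def one_step_estimate(y):
    """Apply one Fisher scoring step to the sample median.

    The median is a consistent estimator of the Cauchy location,
    and the Fisher information of n observations is n / 2.
    """
    theta_init = np.median(y)
    info = len(y) / 2.0
    return theta_init + score(theta_init, y) / info

rng = np.random.default_rng(0)
true_location = 3.0  # hypothetical value used to simulate data
y = true_location + rng.standard_cauchy(size=1000)
theta_one_step = one_step_estimate(y)
print(np.median(y), theta_one_step)
```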
<h2 id="see-also">See also</h2>
<ul><li><a href="/facts/Score_(statistics)/gnBI7IEy">Score (statistics)</a></li>
<li><a href="/facts/Score_test/paM5A8pV">Score test</a></li>
<li><a href="/facts/Fisher_information/Q2JLexN9">Fisher information</a></li></ul>

<h2 id="further-reading">Further reading</h2>
<ul><li>Jennrich, R. I. &amp; Sampson, P. F. (1976). <a href="https://www.tandfonline.com/doi/abs/10.1080/00401706.1976.10489395">"Newton-Raphson and Related Algorithms for Maximum Likelihood Variance Component Estimation"</a>. <i><a href="/facts/Technometrics/J5j3J6Lj">Technometrics</a></i>. 18 (1): 11–17. <a href="/facts/Doi_(identifier)/muM9Etpq">doi</a>:<a href="https://doi.org/10.1080%2F00401706.1976.10489395">10.1080/00401706.1976.10489395</a> (inactive 1 November 2024). <a href="/facts/JSTOR_(identifier)/YTeVmaJ7">JSTOR</a> <a href="https://www.jstor.org/stable/1267911">1267911</a>.</li></ul>

<h2 id="references">References</h2>

<ol>
<li id="fn:1"><p>Longford, Nicholas T. (1987). "A fast scoring algorithm for maximum likelihood estimation in unbalanced mixed models with nested random effects". Biometrika. 74 (4): 817–827. doi:10.1093/biomet/74.4.817. <a href="#fnref:1" class="footnote-back-ref">↩</a></p></li>
<li id="fn:2"><p>Li, Bing; Babu, G. Jogesh (2019), "Bayesian Inference", Springer Texts in Statistics, New York, NY: Springer New York, Theorem 9.4, doi:10.1007/978-1-4939-9761-9_6, ISBN 978-1-4939-9759-6, S2CID 239322258, retrieved 2023-01-03 <a href="#fnref:2" class="footnote-back-ref">↩</a></p></li>
</ol>
