Backpropagation through time

<h2 id="algorithm">Algorithm</h2>

<p>The training data for a recurrent neural network is an ordered sequence of \(k\) input-output pairs, \(\langle \mathbf{a}_{0},\mathbf{y}_{0}\rangle ,\langle \mathbf{a}_{1},\mathbf{y}_{1}\rangle ,\langle \mathbf{a}_{2},\mathbf{y}_{2}\rangle ,\ldots ,\langle \mathbf{a}_{k-1},\mathbf{y}_{k-1}\rangle\). An initial value must be specified for the hidden state \(\mathbf{x}_{0}\), typically chosen to be a <a href="/facts/Zero_vector/PvH2qhxh">zero vector</a>.
</p><p>BPTT begins by unfolding a recurrent neural network in time. The unfolded network contains \(k\) inputs and outputs, but every copy of the network shares the same parameters. Then, the <a href="/facts/Backpropagation/lCsIdKHc">backpropagation</a> algorithm is used to find the gradient of the <a href="/facts/Loss_function/xv5ozuhl">loss function</a> with respect to all the network parameters.
</p><p>Consider an example of a neural network that contains a <a href="/facts/Recurrent_neural_network/bx7hBVB1">recurrent</a> layer \(f\) and a <a href="/facts/Feedforward_neural_network/CP0pPGDF">feedforward</a> layer \(g\). There are different ways to define the training cost, but the aggregated cost is always the average of the costs of each of the time steps. The cost of each time step can be computed separately. The figure above shows how the cost at time \(t+3\) can be computed by unfolding the recurrent layer \(f\) for three time steps and adding the feedforward layer \(g\). Each instance of \(f\) in the unfolded network shares the same parameters. Thus, the weight updates in each instance (\(f_{1},f_{2},f_{3}\)) are summed together.
</p>
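<p>As a concrete illustration of summing over shared parameters, the following minimal Python sketch unfolds a scalar recurrent layer for three steps and compares the summed gradient with a finite-difference estimate. The scalar tanh form of \(f\) (weights <code>w</code> and <code>u</code>), the linear output layer \(g\) (weight <code>v</code>), the squared-error cost, and all numeric values are assumptions chosen for this example, not details taken from the text above.</p>

```python
import math

def forward(w, u, v, x0, inputs):
    """Unfold f once per input, then apply g to the last hidden state."""
    xs = [x0]
    for a in inputs:
        # Every unfolded instance of f uses the same w and u.
        xs.append(math.tanh(w * xs[-1] + u * a))
    return xs, v * xs[-1]

def loss(w, u, v, x0, inputs, target):
    """Squared-error cost at the final time step (assumed cost function)."""
    _, p = forward(w, u, v, x0, inputs)
    return 0.5 * (p - target) ** 2

def bptt_gradients(w, u, v, x0, inputs, target):
    """Back-propagate through the unfolded network and sum per-instance terms."""
    xs, p = forward(w, u, v, x0, inputs)
    err = p - target
    grad_v = err * xs[-1]
    dx = err * v                        # gradient w.r.t. the last hidden state
    per_instance_w = [0.0] * len(inputs)
    grad_u = 0.0
    for i in reversed(range(len(inputs))):
        dpre = dx * (1.0 - xs[i + 1] ** 2)   # tanh'(z) = 1 - tanh(z)**2
        per_instance_w[i] = dpre * xs[i]     # contribution of instance i to dL/dw
        grad_u += dpre * inputs[i]
        dx = dpre * w                        # pass the gradient one step back in time
    return sum(per_instance_w), grad_u, grad_v, per_instance_w

w, u, v, x0 = 0.5, -0.3, 0.8, 0.0
inputs, target = [1.0, 0.5, -1.0], 0.2
grad_w, grad_u, grad_v, per_instance_w = bptt_gradients(w, u, v, x0, inputs, target)

# Central finite difference on the shared weight w as an independent check.
eps = 1e-6
numeric_w = (loss(w + eps, u, v, x0, inputs, target)
             - loss(w - eps, u, v, x0, inputs, target)) / (2 * eps)
print(per_instance_w, grad_w, numeric_w)
```

<p>The list <code>per_instance_w</code> holds one term per unfolded instance of \(f\); their sum is the gradient used to update the single shared weight.</p>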
<h2 id="pseudocode">Pseudocode</h2>
<p>Below is pseudocode for a truncated version of BPTT, where the training data contains \(n\) input-output pairs, and the network is unfolded for \(k\) time steps:
</p>
<pre>
Back_Propagation_Through_Time(a, y)   // a[t] is the input at time t. y[t] is the output
    Unfold the network to contain <i>k</i> instances of <i>f</i>
    do until stopping criterion is met:
        x := the zero-magnitude vector // x is the current context
        for t from 0 to n − k do      // t is time. n is the length of the training sequence
            Set the network inputs to x, a[t], a[t+1], ..., a[t+k−1]
            p := forward-propagate the inputs over the whole unfolded network
            e := y[t+k] − p;           // error = target − prediction
            Back-propagate the error, e, back across the whole unfolded network
            Sum the weight changes in the k instances of f together.
            Update all the weights in f and g.
            x := f(x, a[t]);           // compute the context for the next time-step
</pre>
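<p>The pseudocode above can be sketched in runnable Python as follows. This is only an illustration under stated assumptions: the network is reduced to a scalar tanh recurrent unit with weights <code>w</code> and <code>u</code> and a linear output weight <code>v</code>, the cost is squared error, a fixed number of epochs stands in for the stopping criterion, and the loop stops at <code>n − k − 1</code> so that <code>y[t + k]</code> stays inside the sequence. The toy data, learning rate, and helper names are hypothetical.</p>

```python
import math

def f(x, a, w, u):
    """Recurrent layer, assumed here to be a scalar tanh unit."""
    return math.tanh(w * x + u * a)

def window_forward(x, a, t, k, w, u):
    """Unfold f over a[t], ..., a[t+k-1], starting from context x."""
    xs = [x]
    for i in range(k):
        xs.append(f(xs[-1], a[t + i], w, u))
    return xs

def mean_squared_error(a, y, k, params):
    """Average squared prediction error over all length-k windows."""
    w, u, v = params
    x, total = 0.0, 0.0
    for t in range(len(a) - k):
        xs = window_forward(x, a, t, k, w, u)
        total += (y[t + k] - v * xs[-1]) ** 2
        x = f(x, a[t], w, u)
    return total / (len(a) - k)

def truncated_bptt(a, y, k, params, lr=0.05, epochs=200):
    w, u, v = params
    for _ in range(epochs):                  # fixed epoch count replaces the stopping criterion
        x = 0.0                              # zero initial context
        for t in range(len(a) - k):          # keeps y[t + k] inside the sequence
            xs = window_forward(x, a, t, k, w, u)
            e = y[t + k] - v * xs[-1]        # error = target - prediction
            grad_v = -e * xs[-1]             # gradients of the cost 0.5 * e**2
            dx = -e * v
            grad_w = grad_u = 0.0
            for i in reversed(range(k)):     # back-propagate through the k instances of f
                dpre = dx * (1.0 - xs[i + 1] ** 2)
                grad_w += dpre * xs[i]       # sum the contributions of all instances
                grad_u += dpre * a[t + i]
                dx = dpre * w
            w, u, v = w - lr * grad_w, u - lr * grad_u, v - lr * grad_v
            x = f(x, a[t], w, u)             # context for the next time step
    return w, u, v

# Toy task (hypothetical): the target is half of the most recent input.
a = [math.sin(0.7 * t) for t in range(30)]
y = [0.0] + [0.5 * a[t - 1] for t in range(1, 30)]
initial = (0.1, 0.1, 0.1)
trained = truncated_bptt(a, y, k=3, params=initial)
error_before = mean_squared_error(a, y, 3, initial)
error_after = mean_squared_error(a, y, 3, trained)
print(error_before, error_after)
```

<p>As in the pseudocode, the weights are updated after every window, and the context <code>x</code> is advanced by a single step with the freshly updated weights.</p>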

<h2 id="advantages">Advantages</h2>
<p>BPTT tends to be significantly faster for training recurrent neural networks than general-purpose optimization techniques such as <a href="/facts/Evolutionary_programming/L2k2xPdI">evolutionary optimization</a>.<a class="footnote-ref" id="fnref:4" href="#fn:4"><sup>4</sup></a>
</p>
<h2 id="disadvantages">Disadvantages</h2>
<p>BPTT has difficulty with local optima. With recurrent neural networks, local optima are a much more significant problem than with feed-forward neural networks.<a class="footnote-ref" id="fnref:5" href="#fn:5"><sup>5</sup></a> The recurrent feedback in such networks tends to create chaotic responses in the error surface, which cause local optima to occur frequently and in poor locations.
</p>
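<p>One rough way to probe an error surface of this kind is to scan the cost along a single recurrent weight and count the grid points that lie below both neighbours. The sketch below does this for a scalar tanh recurrent unit; the model, the input and target sequences, and the grid are all assumptions made for illustration. The count it prints depends on those choices, so it is a diagnostic aid and not evidence for or against the claim above.</p>

```python
import math

def sequence_loss(w, inputs, targets, u=1.0, v=1.0):
    """Average squared error of a scalar tanh recurrent unit over one sequence."""
    x, total = 0.0, 0.0
    for a, y in zip(inputs, targets):
        x = math.tanh(w * x + u * a)        # recurrent feedback through weight w
        total += 0.5 * (v * x - y) ** 2
    return total / len(inputs)

def count_local_minima(values):
    """Count interior grid points strictly lower than both neighbours."""
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i] < values[i - 1] and values[i] < values[i + 1]
    )

# Hypothetical data and scan range for the recurrent weight w.
inputs = [math.sin(0.7 * t) for t in range(40)]
targets = [math.cos(1.3 * t) for t in range(40)]
grid = [-8.0 + 0.01 * i for i in range(1601)]
losses = [sequence_loss(w, inputs, targets) for w in grid]
print(count_local_minima(losses))
```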

<h2 id="see-also">See also</h2>
<ul><li><a href="/facts/Backpropagation_through_structure/09nPDkQx">Backpropagation through structure</a></li></ul>

<h2 id="references">References</h2>

<ol>
<li id="fn:1"><p>Mozer, M. C. (1995). "A Focused Backpropagation Algorithm for Temporal Pattern Recognition". In Chauvin, Y.; Rumelhart, D. (eds.). Backpropagation: Theory, architectures, and applications. Hillsdale, NJ: Lawrence Erlbaum Associates. pp. 137–169. Retrieved 2017-08-21. <a href="https://www.researchgate.net/publication/243781476" target="_blank">https://www.researchgate.net/publication/243781476</a> <a href="#fnref:1" class="footnote-back-ref">↩</a></p></li>
<li id="fn:2"><p>Robinson, A. J.; Fallside, F. (1987). The utility driven dynamic error propagation network (Technical report). Cambridge University, Engineering Department. CUED/F-INFENG/TR.1. <a href="https://www.bibsonomy.org/bibtex/269a88ecbac9a51cbf0b4be189c412820/idsia" target="_blank">https://www.bibsonomy.org/bibtex/269a88ecbac9a51cbf0b4be189c412820/idsia</a> <a href="#fnref:2" class="footnote-back-ref">↩</a></p></li>
<li id="fn:3"><p>Werbos, Paul J. (1988). "Generalization of backpropagation with application to a recurrent gas market model". Neural Networks. 1 (4): 339–356. doi:10.1016/0893-6080(88)90007-x. <a href="https://zenodo.org/record/1258627" target="_blank">https://zenodo.org/record/1258627</a> <a href="#fnref:3" class="footnote-back-ref">↩</a></p></li>
<li id="fn:4"><p>Sjöberg, Jonas; Zhang, Qinghua; Ljung, Lennart; Benveniste, Albert; Delyon, Bernard; Glorennec, Pierre-Yves; Hjalmarsson, Håkan; Juditsky, Anatoli (1995). "Nonlinear black-box modeling in system identification: a unified overview". Automatica. 31 (12): 1691–1724. CiteSeerX 10.1.1.27.81. doi:10.1016/0005-1098(95)00120-8. <a href="#fnref:4" class="footnote-back-ref">↩</a></p></li>
<li id="fn:5"><p>Cuéllar, M. P.; Delgado, M.; Pegalajar, M. C. (2006). "An Application of Non-Linear Programming to Train Recurrent Neural Networks in Time Series Prediction Problems". Enterprise Information Systems VII. Springer Netherlands. pp. 95–102. doi:10.1007/978-1-4020-5347-4_11. ISBN 978-1-4020-5323-8. <a href="#fnref:5" class="footnote-back-ref">↩</a></p></li>
</ol>
