Let $z$ be a random variable with distribution $q_\phi(z)$, where $\phi$ is a vector containing the parameters of the distribution.
Consider an objective function of the form:

$$L(\phi) = \mathbb{E}_{z \sim q_\phi(z)}[f(z)]$$

Without the reparameterization trick, estimating the gradient $\nabla_\phi L(\phi)$ can be challenging, because the parameter appears in the distribution of the random variable itself. In more detail, we have to statistically estimate:

$$\nabla_\phi L(\phi) = \nabla_\phi \int dz\, q_\phi(z) f(z)$$

The REINFORCE estimator, widely used in reinforcement learning and especially policy gradient methods,[4] uses the following equality:

$$\nabla_\phi L(\phi) = \int dz\, q_\phi(z) \nabla_\phi(\ln q_\phi(z)) f(z) = \mathbb{E}_{z \sim q_\phi(z)}[\nabla_\phi(\ln q_\phi(z)) f(z)]$$

This allows the gradient to be estimated by Monte Carlo sampling:

$$\nabla_\phi L(\phi) \approx \frac{1}{N} \sum_{i=1}^N \nabla_\phi(\ln q_\phi(z_i)) f(z_i)$$

The REINFORCE estimator has high variance, and many methods have been developed to reduce it.[5]
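As a concrete illustration, the following is a minimal sketch of the REINFORCE (score-function) estimator for a Gaussian $q_\phi$ with $\phi = (\mu, \log\sigma)$; the test function $f(z) = z^2$, the NumPy implementation, and the parameter values are illustrative assumptions rather than part of the derivation above.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(z):
    # Hypothetical test function whose expected value we want to differentiate.
    return z ** 2

def reinforce_grad(mu, log_sigma, n_samples=10_000):
    """Score-function (REINFORCE) estimate of the gradient of E_{z~N(mu, sigma^2)}[f(z)]."""
    sigma = np.exp(log_sigma)
    z = rng.normal(mu, sigma, size=n_samples)
    # d/d(mu) ln q(z)        = (z - mu) / sigma^2
    # d/d(log sigma) ln q(z) = (z - mu)^2 / sigma^2 - 1
    grad_mu = np.mean((z - mu) / sigma**2 * f(z))
    grad_log_sigma = np.mean(((z - mu)**2 / sigma**2 - 1) * f(z))
    return grad_mu, grad_log_sigma

print(reinforce_grad(mu=1.0, log_sigma=0.0))
```

For $f(z) = z^2$ we have $L(\phi) = \mu^2 + \sigma^2$, so both gradient components should come out close to 2 at $\mu = 1$, $\sigma = 1$, although individual estimates are noisy.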
The reparameterization trick expresses $z$ as:

$$z = g_\phi(\epsilon), \quad \epsilon \sim p(\epsilon)$$

Here, $g_\phi$ is a deterministic function parameterized by $\phi$, and $\epsilon$ is a noise variable drawn from a fixed distribution $p(\epsilon)$. This gives:

$$L(\phi) = \mathbb{E}_{\epsilon \sim p(\epsilon)}[f(g_\phi(\epsilon))]$$

Now, the gradient can be estimated as:

$$\nabla_\phi L(\phi) = \mathbb{E}_{\epsilon \sim p(\epsilon)}[\nabla_\phi f(g_\phi(\epsilon))] \approx \frac{1}{N} \sum_{i=1}^N \nabla_\phi f(g_\phi(\epsilon_i))$$
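Under the same illustrative assumptions as the score-function sketch above (a Gaussian with $z = \mu + \sigma\epsilon$ and $f(z) = z^2$), a pathwise estimate of the same gradient might look as follows; the chain rule through $g_\phi$ is written out by hand rather than left to an autodiff library.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(z):
    return z ** 2

def f_prime(z):
    return 2 * z

def reparam_grad(mu, log_sigma, n_samples=10_000):
    """Pathwise (reparameterization) estimate of the same gradient, using z = mu + sigma * eps."""
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal(n_samples)
    z = mu + sigma * eps
    # Chain rule through z = mu + sigma * eps:  dz/d(mu) = 1,  dz/d(log sigma) = sigma * eps
    grad_mu = np.mean(f_prime(z))
    grad_log_sigma = np.mean(f_prime(z) * sigma * eps)
    return grad_mu, grad_log_sigma

print(reparam_grad(mu=1.0, log_sigma=0.0))
```

In this toy problem the pathwise estimate targets the same two quantities as the REINFORCE sketch, and in practice it typically exhibits much lower variance for the same number of samples.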
For some common distributions, the reparameterization trick takes specific forms:
Normal distribution: For $z \sim \mathcal{N}(\mu, \sigma^2)$, we can use: $z = \mu + \sigma\epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)$
Exponential distribution: For $z \sim \text{Exp}(\lambda)$, we can use: $z = -\frac{1}{\lambda}\log(\epsilon), \quad \epsilon \sim \text{Uniform}(0, 1)$. Discrete distributions can be reparameterized using the Gumbel distribution (the Gumbel-softmax trick, also known as the "concrete distribution").[6]
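A brief sketch of these samplers, together with a Gumbel-softmax relaxation of a categorical sample, is shown below; the temperature value and the use of NumPy are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_normal(mu, sigma, size):
    # z = mu + sigma * eps,  eps ~ N(0, 1)
    return mu + sigma * rng.standard_normal(size)

def sample_exponential(lam, size):
    # z = -(1/lambda) * log(eps),  eps ~ Uniform(0, 1)  (inverse-CDF reparameterization)
    return -np.log(rng.uniform(size=size)) / lam

def sample_gumbel_softmax(logits, temperature=0.5):
    # Continuous relaxation of a categorical sample (Gumbel-softmax / concrete distribution)
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / temperature
    return np.exp(y - np.logaddexp.reduce(y))  # softmax of the perturbed, tempered logits

print(sample_normal(0.0, 1.0, 3))
print(sample_exponential(2.0, 3))
print(sample_gumbel_softmax(np.array([1.0, 0.0, -1.0])))
```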
In general, any distribution that is differentiable with respect to its parameters can be reparameterized by inverting its multivariable CDF and then applying the implicit function method. See [7] for an exposition and applications to the Gamma, Beta, Dirichlet, and von Mises distributions.
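As a one-dimensional illustration of this idea, for a scalar $z$ with CDF $F(z; \phi)$, implicit differentiation of $F(z; \phi) = u$ gives $\nabla_\phi z = -\nabla_\phi F(z; \phi) / q_\phi(z)$. The sketch below applies this to the exponential distribution, where the result $-z/\lambda$ can be checked against the explicit inverse-CDF form; the choice of distribution and of the test function $f(z) = z$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def implicit_grad_exponential(lam, n_samples=10_000):
    """One-dimensional illustration of implicit reparameterization gradients:
    dz/dlambda = -(dF/dlambda) / q(z), with the exponential CDF F(z) = 1 - exp(-lambda * z)."""
    z = rng.exponential(scale=1.0 / lam, size=n_samples)
    dF_dlam = z * np.exp(-lam * z)      # partial derivative of the CDF w.r.t. lambda at the sample
    density = lam * np.exp(-lam * z)    # q_lambda(z)
    dz_dlam = -dF_dlam / density        # simplifies to -z / lambda
    # Estimate of the gradient of E[f(z)] for f(z) = z, i.e. E[dz/dlambda] = -1 / lambda^2
    return np.mean(dz_dlam)

print(implicit_grad_exponential(lam=2.0))   # close to -0.25
```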
In Variational Autoencoders (VAEs), the objective function, known as the Evidence Lower Bound (ELBO), is given by:
$$\text{ELBO}(\phi, \theta) = \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \,\|\, p(z))$$
where $q_\phi(z|x)$ is the encoder (recognition model), $p_\theta(x|z)$ is the decoder (generative model), and $p(z)$ is the prior distribution over latent variables. The gradient of the ELBO with respect to $\theta$ is simply

$$\mathbb{E}_{z \sim q_\phi(z|x)}[\nabla_\theta \log p_\theta(x|z)] \approx \frac{1}{L} \sum_{l=1}^L \nabla_\theta \log p_\theta(x|z_l),$$

but the gradient with respect to $\phi$ requires the trick. Express the sampling operation $z \sim q_\phi(z|x)$ as:

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

where $\mu_\phi(x)$ and $\sigma_\phi(x)$ are the outputs of the encoder network, and $\odot$ denotes element-wise multiplication. Since $D_{\text{KL}}(q_\phi(z|x) \,\|\, p(z)) = \mathbb{E}_{z \sim q_\phi(z|x)}[\log q_\phi(z|x) - \log p(z)]$, we then have

$$\nabla_\phi \text{ELBO}(\phi, \theta) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}[\nabla_\phi \log p_\theta(x|z) - \nabla_\phi \log q_\phi(z|x) + \nabla_\phi \log p(z)]$$

where $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$. This allows us to estimate the gradient using Monte Carlo sampling:

$$\nabla_\phi \text{ELBO}(\phi, \theta) \approx \frac{1}{L} \sum_{l=1}^L [\nabla_\phi \log p_\theta(x|z_l) - \nabla_\phi \log q_\phi(z_l|x) + \nabla_\phi \log p(z_l)]$$

where $z_l = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon_l$ and $\epsilon_l \sim \mathcal{N}(0, I)$ for $l = 1, \ldots, L$.
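A minimal sketch of this reparameterized sampling step and a single-sample Monte Carlo estimate of the ELBO is given below, assuming PyTorch; the layer sizes, the unit-variance Gaussian decoder likelihood (written up to an additive constant), and the random input batch are illustrative assumptions, not part of the original VAE specification.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, h_dim=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_log_sigma = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

    def elbo(self, x):
        h = self.enc(x)
        mu, log_sigma = self.enc_mu(h), self.enc_log_sigma(h)
        sigma = log_sigma.exp()
        eps = torch.randn_like(mu)
        z = mu + sigma * eps                          # reparameterized sample, differentiable in phi
        q = torch.distributions.Normal(mu, sigma)
        prior = torch.distributions.Normal(torch.zeros_like(mu), torch.ones_like(mu))
        log_px_z = -((self.dec(z) - x) ** 2).sum(-1)  # Gaussian decoder log-likelihood up to a constant
        log_qz_x = q.log_prob(z).sum(-1)
        log_pz = prior.log_prob(z).sum(-1)
        return (log_px_z - log_qz_x + log_pz).mean()  # single-sample (L = 1) ELBO estimate

x = torch.rand(32, 784)        # hypothetical data batch
model = TinyVAE()
loss = -model.elbo(x)
loss.backward()                # gradients reach the encoder parameters through z
```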
This formulation enables backpropagation through the sampling process, allowing for end-to-end training of the VAE model using stochastic gradient descent or its variants.
More generally, the trick allows using stochastic gradient descent for variational inference. Let the variational objective (ELBO) be of the form:

$$\text{ELBO}(\phi) = \mathbb{E}_{z \sim q_\phi(z)}[\log p(x, z) - \log q_\phi(z)]$$

Using the reparameterization trick, we can estimate the gradient of this objective with respect to $\phi$:

$$\nabla_\phi \text{ELBO}(\phi) \approx \frac{1}{L} \sum_{l=1}^L \nabla_\phi [\log p(x, g_\phi(\epsilon_l)) - \log q_\phi(g_\phi(\epsilon_l))], \quad \epsilon_l \sim p(\epsilon)$$
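For instance, a sketch of this procedure for a Gaussian $q_\phi$ fitted by stochastic gradient ascent on the ELBO might look as follows, assuming PyTorch; the target `log_joint`, the sample size $L = 64$, and the optimizer settings are illustrative assumptions.

```python
import torch

def log_joint(z):
    # Hypothetical unnormalized log p(x, z); here a standard Gaussian target for illustration.
    return -0.5 * (z ** 2).sum(-1)

mu = torch.zeros(2, requires_grad=True)
log_sigma = torch.zeros(2, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

for step in range(500):
    eps = torch.randn(64, 2)                  # epsilon_l ~ p(epsilon), with L = 64 samples
    z = mu + log_sigma.exp() * eps            # z_l = g_phi(epsilon_l)
    q = torch.distributions.Normal(mu, log_sigma.exp())
    elbo = (log_joint(z) - q.log_prob(z).sum(-1)).mean()
    loss = -elbo                              # maximize the ELBO by minimizing its negative
    opt.zero_grad()
    loss.backward()                           # pathwise gradient with respect to mu and log_sigma
    opt.step()
```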
The reparameterization trick has been applied to reduce the variance in dropout, a regularization technique in neural networks. The original dropout can be reparameterized with Bernoulli distributions:

$$y = (W \odot \epsilon) x, \quad \epsilon_{ij} \sim \text{Bernoulli}(\alpha_{ij})$$

where $W$ is the weight matrix, $x$ is the input, and $\alpha_{ij}$ are the (fixed) dropout rates.
More generally, distributions other than the Bernoulli can be used, such as Gaussian noise:

$$y_i = \mu_i + \sigma_i \odot \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, I)$$

where $\mu_i = \mathbf{m}_i^\top x$ and $\sigma_i^2 = \mathbf{v}_i^\top x^2$, so that $\mu_i$ and $\sigma_i^2$ are the mean and variance of the $i$-th output neuron, with $\mathbf{m}_i$ and $\mathbf{v}_i$ the corresponding vectors of weight means and variances. The reparameterization trick can be applied to all such cases, resulting in the variational dropout method.[8]
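A minimal sketch of this Gaussian reparameterization for a single linear layer is given below, assuming PyTorch; the tensor shapes, the parameterization of the weight variances through `weight_log_var`, and the initial values are illustrative assumptions.

```python
import torch

def gaussian_dropout_layer(x, weight_mean, weight_log_var):
    """x: (batch, in_features); weight_mean, weight_log_var: (in_features, out_features).
    Samples each pre-activation y_i ~ N(mu_i, sigma_i^2) with mu_i = m_i^T x, sigma_i^2 = v_i^T x^2."""
    mu = x @ weight_mean                      # mu_i = m_i^T x
    var = (x ** 2) @ weight_log_var.exp()     # sigma_i^2 = v_i^T x^2
    eps = torch.randn_like(mu)                # eps_i ~ N(0, I)
    return mu + var.sqrt() * eps              # reparameterized pre-activation

x = torch.rand(32, 100)
m = torch.randn(100, 50, requires_grad=True)
log_v = torch.full((100, 50), -5.0, requires_grad=True)
y = gaussian_dropout_layer(x, m, log_v)
y.sum().backward()                            # gradients reach m and log_v through the noise
```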
Figurnov, Mikhail; Mohamed, Shakir; Mnih, Andriy (2018). "Implicit Reparameterization Gradients". Advances in Neural Information Processing Systems. 31. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2018/hash/92c8c96e4c37100777c7190b76d28233-Abstract.html
Fu, Michael C. (2006). "Gradient Estimation". Handbooks in Operations Research and Management Science. 13: 575–616.
Kingma, Diederik P.; Welling, Max (2022-12-10). "Auto-Encoding Variational Bayes". arXiv:1312.6114 [stat.ML].
Williams, Ronald J. (1992-05-01). "Simple statistical gradient-following algorithms for connectionist reinforcement learning". Machine Learning. 8 (3): 229–256. doi:10.1007/BF00992696. ISSN 1573-0565. https://link.springer.com/article/10.1007/bf00992696
Greensmith, Evan; Bartlett, Peter L.; Baxter, Jonathan (2004). "Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning". Journal of Machine Learning Research. 5 (Nov): 1471–1530. ISSN 1533-7928. https://jmlr.org/papers/v5/greensmith04a.html
Maddison, Chris J.; Mnih, Andriy; Teh, Yee Whye (2017-03-05). "The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables". arXiv:1611.00712 [cs.LG].
Kingma, Durk P.; Salimans, Tim; Welling, Max (2015). "Variational Dropout and the Local Reparameterization Trick". Advances in Neural Information Processing Systems. 28. arXiv:1506.02557. https://proceedings.neurips.cc/paper/2015/hash/bc7316929fe1545bf0b98d114ee3ecb8-Abstract.html