Cross-entropy method

<h2 id="estimation-via-importance-sampling">Estimation via importance sampling</h2>
<p>Consider the general problem of estimating the quantity
</p><p>
  
    
      
        ℓ
        =
        
          
            E
          
          
            
              u
            
          
        
        [
        H
        (
        
          X
        
        )
        ]
        =
        ∫
        H
        (
        
          x
        
        )
        
        f
        (
        
          x
        
        ;
        
          u
        
        )
        
        
          
            d
          
        
        
          x
        
      
    
    {\displaystyle \ell =\mathbb {E} _{\mathbf {u} }[H(\mathbf {X} )]=\int H(\mathbf {x} )\,f(\mathbf {x} ;\mathbf {u} )\,{\textrm {d}}\mathbf {x} }
  
,
</p><p>where 
  
    
      
        H
      
    
    {\displaystyle H}
  
 is some <i>performance function</i> and 
  
    
      
        f
        (
        
          x
        
        ;
        
          u
        
        )
      
    
    {\displaystyle f(\mathbf {x} ;\mathbf {u} )}
  
 is a member of some <a href="/facts/Parametric_family/YoEWw0I8">parametric family</a> of distributions. Using <a href="/facts/Importance_sampling/jrbUnMXa">importance sampling</a> this quantity can be estimated as
</p><p>
  
    
      
        
          
            
              ℓ
              ^
            
          
        
        =
        
          
            1
            N
          
        
        
          ∑
          
            i
            =
            1
          
          
            N
          
        
        H
        (
        
          
            X
          
          
            i
          
        
        )
        
          
            
              f
              (
              
                
                  X
                
                
                  i
                
              
              ;
              
                u
              
              )
            
            
              g
              (
              
                
                  X
                
                
                  i
                
              
              )
            
          
        
      
    
    {\displaystyle {\hat {\ell }}={\frac {1}{N}}\sum _{i=1}^{N}H(\mathbf {X} _{i}){\frac {f(\mathbf {X} _{i};\mathbf {u} )}{g(\mathbf {X} _{i})}}}
  
,
</p><p>where 
  
    
      
        
          
            X
          
          
            1
          
        
        ,
        …
        ,
        
          
            X
          
          
            N
          
        
      
    
    {\displaystyle \mathbf {X} _{1},\dots ,\mathbf {X} _{N}}
  
 is a random sample from 
  
    
      
        g
        
      
    
    {\displaystyle g\,}
  
. For positive 
  
    
      
        H
      
    
    {\displaystyle H}
  
, the theoretically <i>optimal</i> importance sampling <a href="/facts/Probability_density_function/zvfybna4">density</a> (PDF) is given by
</p><p>
  
    
      
        
          g
          
            ∗
          
        
        (
        
          x
        
        )
        =
        H
        (
        
          x
        
        )
        f
        (
        
          x
        
        ;
        
          u
        
        )
        
          /
        
        ℓ
      
    
    {\displaystyle g^{*}(\mathbf {x} )=H(\mathbf {x} )f(\mathbf {x} ;\mathbf {u} )/\ell }
  
.
</p><p>This, however, depends on the unknown 
  
    
      
        ℓ
      
    
    {\displaystyle \ell }
  
. The CE method aims to approximate the optimal PDF by adaptively selecting members of the parametric family that are closest (in the <a href="/facts/Kullback%E2%80%93Leibler_divergence/nh7SjlPE">Kullback–Leibler</a> sense) to the optimal PDF 
  
    
      
        
          g
          
            ∗
          
        
      
    
    {\displaystyle g^{*}}
  
.
</p>
<h2 id="generic-ce-algorithm">Generic CE algorithm</h2>
<ol><li>Choose initial parameter vector 
  
    
      
        
          
            v
          
          
            (
            0
            )
          
        
      
    
    {\displaystyle \mathbf {v} ^{(0)}}
  
; set t = 1.</li>
<li>Generate a random sample 
  
    
      
        
          
            X
          
          
            1
          
        
        ,
        …
        ,
        
          
            X
          
          
            N
          
        
      
    
    {\displaystyle \mathbf {X} _{1},\dots ,\mathbf {X} _{N}}
  
 from 
  
    
      
        f
        (
        ⋅
        ;
        
          
            v
          
          
            (
            t
            −
            1
            )
          
        
        )
      
    
    {\displaystyle f(\cdot ;\mathbf {v} ^{(t-1)})}
  
</li>
<li>Solve for 
  
    
      
        
          
            v
          
          
            (
            t
            )
          
        
      
    
    {\displaystyle \mathbf {v} ^{(t)}}
  
, where
  
    
      
        
          
            v
          
          
            (
            t
            )
          
        
        =
        
          
            
              argmax
            
          
          
            
              v
            
          
        
        ⁡
        
          
            1
            N
          
        
        
          ∑
          
            i
            =
            1
          
          
            N
          
        
        H
        (
        
          
            X
          
          
            i
          
        
        )
        
          
            
              f
              (
              
                
                  X
                
                
                  i
                
              
              ;
              
                u
              
              )
            
            
              f
              (
              
                
                  X
                
                
                  i
                
              
              ;
              
                
                  v
                
                
                  (
                  t
                  −
                  1
                  )
                
              
              )
            
          
        
        log
        ⁡
        f
        (
        
          
            X
          
          
            i
          
        
        ;
        
          v
        
        )
      
    
    {\displaystyle \mathbf {v} ^{(t)}=\mathop {\textrm {argmax}} _{\mathbf {v} }{\frac {1}{N}}\sum _{i=1}^{N}H(\mathbf {X} _{i}){\frac {f(\mathbf {X} _{i};\mathbf {u} )}{f(\mathbf {X} _{i};\mathbf {v} ^{(t-1)})}}\log f(\mathbf {X} _{i};\mathbf {v} )}
  
</li>
<li>If convergence is reached then stop; otherwise, increase t by 1 and reiterate from step 2.</li></ol>
<p>In several cases, the solution to step 3 can be found <i>analytically</i>.  Situations in which this occurs are
</p>
<ul><li>When 
  
    
      
        f
        
      
    
    {\displaystyle f\,}
  
 belongs to the <a href="/facts/Exponential_family/1LkkqEIf">natural exponential family</a></li>
<li>When 
  
    
      
        f
        
      
    
    {\displaystyle f\,}
  
 is <a href="/facts/Discrete_space/hpvXA23u">discrete</a> with finite <a href="/facts/Support_(mathematics)/BTuMGbsb">support</a></li>
<li>When 
  
    
      
        H
        (
        
          X
        
        )
        =
        
          
            I
          
          
            {
            
              x
            
            ∈
            A
            }
          
        
      
    
    {\displaystyle H(\mathbf {X} )=\mathrm {I} _{\{\mathbf {x} \in A\}}}
  
 and 
  
    
      
        f
        (
        
          
            X
          
          
            i
          
        
        ;
        
          u
        
        )
        =
        f
        (
        
          
            X
          
          
            i
          
        
        ;
        
          
            v
          
          
            (
            t
            −
            1
            )
          
        
        )
      
    
    {\displaystyle f(\mathbf {X} _{i};\mathbf {u} )=f(\mathbf {X} _{i};\mathbf {v} ^{(t-1)})}
  
, then 
  
    
      
        
          
            v
          
          
            (
            t
            )
          
        
      
    
    {\displaystyle \mathbf {v} ^{(t)}}
  
 corresponds to the <a href="/facts/Maximum_likelihood/0Yq2dpQD">maximum likelihood estimator</a> based on those 
  
    
      
        
          
            X
          
          
            k
          
        
        ∈
        A
      
    
    {\displaystyle \mathbf {X} _{k}\in A}
  
.</li></ul>
<h2 id="continuous-optimizationexample">Continuous optimization—example</h2>
<p>The same CE algorithm can be used for optimization, rather than estimation. 
Suppose the problem is to maximize some function 
  
    
      
        S
      
    
    {\displaystyle S}
  
, for example,

S
        (
        x
        )
        =
        
          
            
              e
            
          
          
            −
            (
            x
            −
            2
            
              )
              
                2
              
            
          
        
        +
        0.8
        
        
          
            
              e
            
          
          
            −
            (
            x
            +
            2
            
              )
              
                2
              
            
          
        
      
    
    {\displaystyle S(x)={\textrm {e}}^{-(x-2)^{2}}+0.8\,{\textrm {e}}^{-(x+2)^{2}}}
  
. 
To apply CE, one considers first the <i>associated stochastic problem</i> of estimating

P
          
          
            θ
          
        
        (
        S
        (
        X
        )
        ≥
        γ
        )
      
    
    {\displaystyle \mathbb {P} _{\boldsymbol {\theta }}(S(X)\geq \gamma )}

for a given <i>level</i> 
  
    
      
        γ
        
      
    
    {\displaystyle \gamma \,}
  
, and parametric family 
  
    
      
        
          {
          
            f
            (
            ⋅
            ;
            
              θ
            
            )
          
          }
        
      
    
    {\displaystyle \left\{f(\cdot ;{\boldsymbol {\theta }})\right\}}
  
, for example the 1-dimensional 
<a href="/facts/Gaussian_distribution/UapjjPyQ">Gaussian distribution</a>,
parameterized by its mean 
  
    
      
        
          μ
          
            t
          
        
        
      
    
    {\displaystyle \mu _{t}\,}
  
 and variance 
  
    
      
        
          σ
          
            t
          
          
            2
          
        
      
    
    {\displaystyle \sigma _{t}^{2}}
  
 (so 
  
    
      
        
          θ
        
        =
        (
        μ
        ,
        
          σ
          
            2
          
        
        )
      
    
    {\displaystyle {\boldsymbol {\theta }}=(\mu ,\sigma ^{2})}
  
 here).
Hence, for a given 
  
    
      
        γ
        
      
    
    {\displaystyle \gamma \,}
  
, the goal is to find 
  
    
      
        
          θ
        
      
    
    {\displaystyle {\boldsymbol {\theta }}}
  
 so that

D
          
            
              K
              L
            
          
        
        (
        
          
            
              I
            
          
          
            {
            S
            (
            x
            )
            ≥
            γ
            }
          
        
        ‖
        
          f
          
            θ
          
        
        )
      
    
    {\displaystyle D_{\mathrm {KL} }({\textrm {I}}_{\{S(x)\geq \gamma \}}\|f_{\boldsymbol {\theta }})}

is minimized. This is done by solving the sample version (stochastic counterpart) of the KL divergence minimization problem, as in step 3 above.
It turns out that parameters that minimize the stochastic counterpart for this choice of target distribution and
parametric family are the sample mean and sample variance corresponding to the <i>elite samples</i>, which are those samples that have objective function value 
  
    
      
        ≥
        γ
      
    
    {\displaystyle \geq \gamma }
  
.
The worst of the elite samples is then used as the level parameter for the next iteration.
This yields the following randomized algorithm that happens to coincide with the so-called Estimation of Multivariate Normal Algorithm (EMNA), an <a href="/facts/Estimation_of_distribution_algorithm/oeGEPIv4">estimation of distribution algorithm</a>.
</p>
<h3>Pseudocode</h3>
<i>// Initialize parameters</i>
μ := −6
σ2 := 100
t := 0
maxits := 100
N := 100
Ne := 10
<i>// While maxits not exceeded and not converged</i>
while t < maxits and σ2 > ε do
    <i>// Obtain N samples from current sampling distribution</i>
    X := SampleGaussian(μ, σ2, N)
    <i>// Evaluate objective function at sampled points</i>
    S := exp(−(X − 2) ^ 2) + 0.8 exp(−(X + 2) ^ 2)
    <i>// Sort X by objective function values in descending order</i>
    X := sort(X, S)
    <i>// Update parameters of sampling distribution via elite samples</i>                  
    μ := mean(X(1:Ne))
    σ2 := var(X(1:Ne))
    t := t + 1
<i>// Return mean of final sampling distribution as solution</i>
return μ

<h2 id="related-methods">Related methods</h2>
<ul><li><a href="/facts/Simulated_annealing/JCmHqd3t">Simulated annealing</a></li>
<li><a href="/facts/Genetic_algorithms/WP2AFWuW">Genetic algorithms</a></li>
<li><a href="/facts/Harmony_search/6zoAh906">Harmony search</a></li>
<li><a href="/facts/Estimation_of_distribution_algorithm/oeGEPIv4">Estimation of distribution algorithm</a></li>
<li><a href="/facts/Tabu_search/olw8qTSl">Tabu search</a></li>
<li><a href="/facts/Natural_Evolution_Strategy/xINxM3k4">Natural Evolution Strategy</a></li>
<li><a href="/facts/Ant_colony_optimization_algorithms/SXGYs630">Ant colony optimization algorithms</a></li></ul>
<h2 id="see-also">See also</h2>
<ul><li><a href="/facts/Cross_entropy/IG1WD4jb">Cross entropy</a></li>
<li><a href="/facts/Kullback%E2%80%93Leibler_divergence/nh7SjlPE">Kullback–Leibler divergence</a></li>
<li><a href="/facts/Randomized_algorithm/Fngl1zgk">Randomized algorithm</a></li>
<li><a href="/facts/Importance_sampling/jrbUnMXa">Importance sampling</a></li></ul>
<h2 id="journal-papers">Journal papers</h2>
<ul><li>De Boer, P.-T., Kroese, D.P., Mannor, S. and Rubinstein, R.Y. (2005). A Tutorial on the Cross-Entropy Method. <i>Annals of Operations Research</i>, 134 (1), 19–67.<a href="http://www.maths.uq.edu.au/~kroese/ps/aortut.pdf">[1]</a></li>
<li>Rubinstein, R.Y. (1997). Optimization of Computer Simulation Models with Rare Events, <i>European Journal of Operational Research</i>, 99, 89–112.</li></ul>
<h2 id="software-implementations">Software implementations</h2>
<ul><li><a href="https://cran.r-project.org/web/packages/CEoptim/index.html">CEoptim R package</a></li>
<li><a href="https://www.nuget.org/packages/Novacta.Analytics">Novacta.Analytics .NET library</a></li></ul>

<h2 id="references">References</h2>

<ol>
<li id="fn:1"><p>Rubinstein, R.Y. and  Kroese, D.P. (2004), The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning, Springer-Verlag, New York ISBN 978-0-387-21240-1. <a href="/wiki/ISBN_(identifier)" target="_blank">/wiki/ISBN_(identifier)</a> <a href="#fnref:1" class="footnote-back-ref">↩</a></p></li>
</ol>

Cross-entropy method open-in-new

Cross-entropy method