Variational message passing

<h2 id="likelihood-lower-bound">Likelihood lower bound</h2>
<p class="note">Main article: <a href="/facts/Evidence_lower_bound/cANNQv7y">Evidence lower bound</a></p>
<p>Given some set of hidden variables \(H\) and observed variables \(V\), the goal of approximate inference is to maximize a lower bound on the probability that a graphical model is in the configuration \(V\). Over some probability distribution \(Q\) (to be defined later),
</p>

<p>\[\ln P(V)=\sum_{H}Q(H)\ln\frac{P(H,V)}{P(H|V)}=\sum_{H}Q(H)\Bigg[\ln\frac{P(H,V)}{Q(H)}-\ln\frac{P(H|V)}{Q(H)}\Bigg].\]</p>
<p>So, if we define our lower bound to be
</p>

<p>\[L(Q)=\sum_{H}Q(H)\ln\frac{P(H,V)}{Q(H)},\]</p>
<p>then the log likelihood is simply this bound plus the <a href="/facts/Relative_entropy/nh7SjlPE">relative entropy</a> between \(Q\) and the posterior \(P(H|V)\). Because the relative entropy is non-negative, the function \(L\) defined above is indeed a lower bound of the log likelihood of our observation \(V\). The distribution \(Q\) will have a simpler character than that of \(P\) because marginalizing over \(P\) is intractable for all but the simplest of <a href="/facts/Graphical_models/XxfmKhmM">graphical models</a>. In particular, VMP uses a factorized distribution
</p>

<p>\[Q(H)=\prod_{i}Q_{i}(H_{i}),\]</p>

<p>where \(H_{i}\) is a disjoint part of the graphical model.
</p>
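<p>The decomposition of the log likelihood into the bound plus a relative entropy can be checked numerically. The following is a minimal sketch, assuming a hypothetical toy model with a single binary hidden variable; the probabilities in <code>joint</code> and the choice of <code>q</code> are made up for illustration and do not come from any particular graphical model.</p>

```python
import math

# Hypothetical toy model: one binary hidden variable H and a fixed observation V.
# joint[h] stores P(H=h, V=v_obs); the numbers are illustrative only.
joint = [0.12, 0.28]

evidence = sum(joint)                       # P(V) = sum_H P(H, V)
posterior = [p / evidence for p in joint]   # P(H | V)


def lower_bound(q):
    """L(Q) = sum_H Q(H) ln(P(H, V) / Q(H))."""
    return sum(qh * math.log(ph / qh) for qh, ph in zip(q, joint) if qh > 0)


def relative_entropy(q, p):
    """KL(Q || P) = sum_H Q(H) ln(Q(H) / P(H))."""
    return sum(qh * math.log(qh / ph) for qh, ph in zip(q, p) if qh > 0)


q = [0.5, 0.5]                              # an arbitrary approximating distribution
bound = lower_bound(q)
gap = relative_entropy(q, posterior)
log_evidence = math.log(evidence)

# The bound plus the gap should reproduce ln P(V) up to rounding error.
print(bound, gap, log_evidence)
```

<p>In this sketch the gap vanishes only when <code>q</code> equals the exact posterior, in which case the bound coincides with the log likelihood.</p>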
<h2 id="determining-the-update-rule">Determining the update rule</h2>
<p>The lower bound should be made as large as possible: because it is a lower bound, moving it closer to \(\ln P(V)\) improves the approximation of the log likelihood. Substituting in the factorized version of \(Q\) shows that \(L(Q)\), with \(Q\) factorized over the hidden nodes \(H_{i}\) as above, is simply the negative <a href="/facts/Relative_entropy/nh7SjlPE">relative entropy</a> between \(Q_{j}\) and \(Q_{j}^{*}\) plus other terms independent of \(Q_{j}\), if \(Q_{j}^{*}\) is defined as
</p>

<p>\[Q_{j}^{*}(H_{j})=\frac{1}{Z}e^{\mathbb{E}_{-j}\{\ln P(H,V)\}},\]</p>
<p>where \(Z\) is a normalizing constant and \(\mathbb{E}_{-j}\{\ln P(H,V)\}\) is the expectation over all distributions \(Q_{i}\) except \(Q_{j}\). Thus, if we set \(Q_{j}\) to be \(Q_{j}^{*}\), the relative entropy term vanishes and the bound \(L\) is maximized with respect to \(Q_{j}\).
</p>
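<p>The update rule can be illustrated on a small discrete model. The sketch below assumes a hypothetical joint table over two binary hidden variables (the numbers are invented) and alternates the update \(Q_{j}\leftarrow Q_{j}^{*}\) between the two factors, recording the bound after each step. It is a generic mean-field coordinate update, not an implementation taken from any VMP library.</p>

```python
import math

# Hypothetical joint P(H1=a, H2=b, V=v_obs) over two binary hidden variables.
joint = [[0.30, 0.05],
         [0.10, 0.15]]


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def bound(q1, q2):
    """L(Q) for the factorized Q(H) = Q1(H1) * Q2(H2)."""
    return sum(q1[a] * q2[b] * math.log(joint[a][b] / (q1[a] * q2[b]))
               for a in range(2) for b in range(2))


def update_q1(q2):
    """Q1*(a) is proportional to exp(E_{Q2}[ln P(H1=a, H2, V)])."""
    return normalize([
        math.exp(sum(q2[b] * math.log(joint[a][b]) for b in range(2)))
        for a in range(2)
    ])


def update_q2(q1):
    """Q2*(b) is proportional to exp(E_{Q1}[ln P(H1, H2=b, V)])."""
    return normalize([
        math.exp(sum(q1[a] * math.log(joint[a][b]) for a in range(2)))
        for b in range(2)
    ])


q1, q2 = [0.5, 0.5], [0.5, 0.5]
history = [bound(q1, q2)]
for _ in range(20):
    q1 = update_q1(q2)
    history.append(bound(q1, q2))
    q2 = update_q2(q1)
    history.append(bound(q1, q2))

log_evidence = math.log(sum(sum(row) for row in joint))
print(history[0], history[-1], log_evidence)
```

<p>Each single-factor update can only raise the bound or leave it unchanged, so the recorded values form a non-decreasing sequence that stays below the log likelihood.</p>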
<h2 id="messages-in-variational-message-passing">Messages in variational message passing</h2>
<p>Parents send their children the expectation of their <a href="/facts/Sufficient_statistic/eClRXHpd">sufficient statistic</a> while children send their parents their <a href="/facts/Natural_parameter/1LkkqEIf">natural parameter</a>, which also requires messages to be sent from the co-parents of the node.
</p>
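<p>As a worked illustration, consider an observed Gaussian node \(x\) with a mean parent \(\mu\) and a precision parent \(\gamma\). The message names \(m_{a\to b}\) and the angle-bracket notation for expectations under \(Q\) are notational assumptions introduced here; the forms follow from writing the Gaussian log density in exponential-family form.</p>

```latex
\begin{aligned}
\ln P(x \mid \mu, \gamma) &=
  \begin{bmatrix} \gamma\mu \\ -\gamma/2 \end{bmatrix}^{\mathsf T}
  \begin{bmatrix} x \\ x^{2} \end{bmatrix}
  + \tfrac{1}{2}\left(\ln\gamma - \gamma\mu^{2} - \ln 2\pi\right) \\
m_{\mu \to x} &= \begin{bmatrix} \langle\mu\rangle \\ \langle\mu^{2}\rangle \end{bmatrix},
\qquad
m_{\gamma \to x} = \begin{bmatrix} \langle\gamma\rangle \\ \langle\ln\gamma\rangle \end{bmatrix} \\
m_{x \to \mu} &= \begin{bmatrix} \langle\gamma\rangle\, x \\ -\langle\gamma\rangle/2 \end{bmatrix},
\qquad
m_{x \to \gamma} = \begin{bmatrix} -\tfrac{1}{2}\left(x^{2} - 2x\langle\mu\rangle + \langle\mu^{2}\rangle\right) \\ \tfrac{1}{2} \end{bmatrix}
\end{aligned}
```

<p>The child-to-parent message to \(\mu\) depends on \(\langle\gamma\rangle\), and the one to \(\gamma\) depends on \(\langle\mu\rangle\) and \(\langle\mu^{2}\rangle\), which is why the child must first receive messages from the co-parent.</p>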
<h2 id="relationship-to-exponential-families">Relationship to exponential families</h2>
<p>Because all nodes in VMP come from <a href="/facts/Exponential_family/1LkkqEIf">exponential families</a> and all parents of nodes are <a href="/facts/Conjugate_prior/ScCFcs8b">conjugate</a> to their children nodes, the expectation of the <a href="/facts/Sufficient_statistic/eClRXHpd">sufficient statistic</a> can be computed from the <a href="/facts/Normalization_factor/KKTfybqr">normalization factor</a>.
</p>
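<p>For an exponential-family density written as \(\exp(\eta\,u(x) - A(\eta))\), the expected sufficient statistic is the derivative of the log normalizer \(A\) with respect to the natural parameter \(\eta\). The sketch below checks this numerically for a Bernoulli variable, chosen here only because its log normalizer is simple; the function names and the value of <code>eta</code> are illustrative assumptions.</p>

```python
import math


def log_partition(eta):
    """A(eta) = ln(1 + e^eta), the log normalizer of a Bernoulli variable."""
    return math.log1p(math.exp(eta))


def expected_statistic(eta, step=1e-6):
    """Central-difference estimate of dA/d(eta), which equals E[u(x)] = E[x]."""
    return (log_partition(eta + step) - log_partition(eta - step)) / (2 * step)


eta = 0.8                                    # arbitrary natural parameter
from_gradient = expected_statistic(eta)
direct = 1.0 / (1.0 + math.exp(-eta))        # P(x = 1), computed directly

print(from_gradient, direct)
```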
<h2 id="vmp-algorithm">VMP algorithm</h2>
<p>The algorithm begins by computing the expected value of the sufficient statistics at each node. Then, until the lower bound on the likelihood converges to a stable value (this is usually accomplished by setting a small threshold value and running the algorithm until the bound increases by less than that threshold), do the following at each node:
</p>
<ol><li>Get all messages from parents.</li>
<li>Get all messages from children (this might require the children to get messages from the co-parents).</li>
<li>Compute the expected value of the node's sufficient statistics.</li></ol>
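<p>The steps above can be sketched for a small conjugate model: observations drawn from a Gaussian whose mean \(\mu\) has a Gaussian prior and whose precision \(\gamma\) has a gamma prior. The function name <code>vmp_gaussian</code>, the prior hyperparameters, and the data are hypothetical. As a simplification, the loop stops when the expected statistics change by less than a threshold rather than evaluating the bound itself.</p>

```python
def vmp_gaussian(data, m0=0.0, beta0=1e-3, a0=1e-3, b0=1e-3,
                 tol=1e-10, max_sweeps=100):
    """Mean-field updates for x_i ~ N(mu, 1/gamma),
    with mu ~ N(m0, 1/beta0) and gamma ~ Gamma(shape=a0, rate=b0)."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(x * x for x in data)

    # Initial expected sufficient statistics of the two hidden nodes.
    e_mu, e_mu2 = m0, m0 * m0 + 1.0 / beta0
    e_gamma = a0 / b0

    for sweep in range(1, max_sweeps + 1):
        # Node mu: the prior supplies natural parameters [beta0*m0, -beta0/2];
        # each child x_i sends [E[gamma]*x_i, -E[gamma]/2], which relies on
        # the co-parent gamma's expected statistic.
        nat1 = beta0 * m0 + e_gamma * sum_x
        nat2 = -0.5 * beta0 - 0.5 * e_gamma * n
        beta = -2.0 * nat2                   # posterior precision of mu
        m = nat1 / beta                      # posterior mean of mu
        new_e_mu, new_e_mu2 = m, m * m + 1.0 / beta

        # Node gamma: the prior supplies [-b0, a0 - 1]; each child x_i sends
        # [-0.5 * E[(x_i - mu)^2], 0.5], which relies on mu's statistics.
        b = b0 + 0.5 * (sum_x2 - 2.0 * new_e_mu * sum_x + n * new_e_mu2)
        a = a0 + 0.5 * n
        new_e_gamma = a / b

        change = abs(new_e_mu - e_mu) + abs(new_e_gamma - e_gamma)
        e_mu, e_mu2, e_gamma = new_e_mu, new_e_mu2, new_e_gamma
        if change < tol:
            break

    return {"m": m, "beta": beta, "a": a, "b": b, "sweeps": sweep}


# Hypothetical measurements clustered around 5.
data = [4.8, 5.1, 5.3, 4.9, 5.2, 5.0, 4.7, 5.4]
result = vmp_gaussian(data)
print(result)
```

<p>Because the priors in this sketch are nearly flat, the posterior mean <code>m</code> ends up close to the sample mean of the data.</p>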
<h2 id="constraints">Constraints</h2>
<p>Because every child must be conjugate to its parent, the types of distributions that can be used in the model are limited. For example, the parents of a <a href="/facts/Gaussian_distribution/UapjjPyQ">Gaussian distribution</a> must be a <a href="/facts/Gaussian_distribution/UapjjPyQ">Gaussian distribution</a> (corresponding to the <a href="/facts/Mean/swcEd4Pg">mean</a>) and a <a href="/facts/Gamma_distribution/lczcdJmw">gamma distribution</a> (corresponding to the precision, or \(1/\sigma^{2}\) in more common parameterizations). Discrete variables can have <a href="/facts/Dirichlet_distribution/Qq13f5g2">Dirichlet</a> parents, and <a href="/facts/Poisson_distribution/CvCzXkHr">Poisson</a> and <a href="/facts/Exponential_distribution/yY3EdV0u">exponential</a> nodes must have <a href="/facts/Gamma_distribution/lczcdJmw">gamma</a> parents. More recently, VMP has been extended to handle models that violate this conditional conjugacy constraint.<a class="footnote-ref" id="fnref:1" href="#fn:1"><sup>1</sup></a>
</p>
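<p>Conjugacy matters because it keeps the updated distribution in the same family as the prior. The sketch below illustrates this for a gamma parent of Poisson children: the prior times the likelihood differs from a gamma density with updated parameters only by a constant. The prior parameters and counts are invented for illustration.</p>

```python
import math


def log_gamma_pdf(lam, shape, rate):
    """Log density of a gamma distribution in the shape/rate parameterization."""
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1) * math.log(lam) - rate * lam)


def log_poisson_pmf(k, lam):
    """Log probability of count k under a Poisson distribution with rate lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


shape0, rate0 = 2.0, 1.0        # hypothetical gamma prior on the Poisson rate
counts = [3, 0, 4, 2]           # hypothetical observed counts

# Conjugate update: add the total count to the shape, the sample size to the rate.
shape_n = shape0 + sum(counts)
rate_n = rate0 + len(counts)


def log_ratio(lam):
    """ln(prior * likelihood) minus ln(updated gamma density) at rate lam."""
    log_unnormalized = log_gamma_pdf(lam, shape0, rate0) + sum(
        log_poisson_pmf(k, lam) for k in counts)
    return log_unnormalized - log_gamma_pdf(lam, shape_n, rate_n)


# If the posterior is gamma with the updated parameters, this is constant in lam.
ratios = [log_ratio(lam) for lam in (0.5, 1.0, 2.0, 5.0)]
print(ratios)
```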

<ul><li>Winn, J.M.; Bishop, C. (2005). <a href="http://www.johnwinn.org/Publications/papers/VMP2004.pdf">"Variational Message Passing"</a> (PDF). <i>Journal of Machine Learning Research</i>. 6: 661–694.</li>
<li>Beal, M.J. (2003). <a href="https://web.archive.org/web/20050428173705/http://www.cs.toronto.edu/~beal/thesis/beal03.pdf"><i>Variational Algorithms for Approximate Bayesian Inference</i></a> (PDF) (PhD). Gatsby Computational Neuroscience Unit, University College London. Archived from <a href="http://www.cs.toronto.edu/~beal/thesis/beal03.pdf">the original</a> (PDF) on 2005-04-28. Retrieved 2007-02-15.</li></ul>
<h2 id="external-links">External links</h2>
<ul><li><a href="http://research.microsoft.com/infernet">Infer.NET</a>: an inference framework which includes an implementation of VMP with examples.</li>
<li><a href="http://dimple.probprog.org">dimple</a>: an open-source inference system supporting VMP.</li>
<li>An <a href="http://vibes.sourceforge.net">older implementation</a> of VMP with usage examples.</li></ul>

<h2 id="references">References</h2>

<ol>
<li id="fn:1"><p>Knowles, David A.; Minka, Thomas P. (2011). "Non-conjugate Variational Message Passing for Multinomial and Binary Regression" (PDF). NeurIPS. <a href="https://proceedings.neurips.cc/paper/2011/file/5c936263f3428a40227908d5a3847c0b-Paper.pdf" target="_blank">https://proceedings.neurips.cc/paper/2011/file/5c936263f3428a40227908d5a3847c0b-Paper.pdf</a> <a href="#fnref:1" class="footnote-back-ref">↩</a></p></li>
</ol>
