Given a set of independent identically distributed data points $\mathbf{X} = (x_1, \ldots, x_n)$, where $x_i \sim p(x \mid \theta)$ according to some probability distribution parameterized by $\theta$, and where $\theta$ itself is a random variable described by a distribution, i.e. $\theta \sim p(\theta \mid \alpha)$, the marginal likelihood in general asks what the probability $p(\mathbf{X} \mid \alpha)$ is, where $\theta$ has been marginalized out (integrated out):

$$p(\mathbf{X} \mid \alpha) = \int_\theta p(\mathbf{X} \mid \theta)\, p(\theta \mid \alpha)\, d\theta .$$
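As a minimal sketch of this definition (not part of the article), consider Bernoulli coin-flip data with a Beta prior on the success probability; the hyperparameters `a, b` play the role of $\alpha$, and the integral over $\theta$ is done numerically:

```python
# Sketch: marginal likelihood p(X | alpha) for Bernoulli data under a Beta(a, b) prior,
# computed by numerically integrating theta out. Data and hyperparameters are illustrative.
import numpy as np
from scipy import integrate, stats

x = np.array([1, 0, 1, 1, 0, 1])   # observed Bernoulli data X
a, b = 2.0, 2.0                    # hyperparameters of the prior p(theta | alpha)

def integrand(theta):
    likelihood = theta**x.sum() * (1 - theta)**(len(x) - x.sum())  # p(X | theta)
    prior = stats.beta.pdf(theta, a, b)                            # p(theta | alpha)
    return likelihood * prior

marginal, _ = integrate.quad(integrand, 0.0, 1.0)                  # p(X | alpha)
print(marginal)
```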
The above definition is phrased in the context of Bayesian statistics, in which case $p(\theta \mid \alpha)$ is called the prior density and $p(\mathbf{X} \mid \theta)$ is the likelihood. Recognizing that the marginal likelihood is the normalizing constant of the Bayesian posterior density $p(\theta \mid \mathbf{X}, \alpha)$, one also has the alternative expression[2]

$$p(\mathbf{X} \mid \alpha) = \frac{p(\mathbf{X} \mid \theta)\, p(\theta \mid \alpha)}{p(\theta \mid \mathbf{X}, \alpha)},$$
which is an identity in $\theta$. The marginal likelihood quantifies the agreement between data and prior in a geometric sense made precise in de Carvalho et al. (2019). In classical (frequentist) statistics, the concept of marginal likelihood occurs instead in the context of a joint parameter $\theta = (\psi, \lambda)$, where $\psi$ is the actual parameter of interest and $\lambda$ is a non-interesting nuisance parameter. If there exists a probability distribution for $\lambda$, it is often desirable to consider the likelihood function only in terms of $\psi$, by marginalizing out $\lambda$:

$$\mathcal{L}(\psi; \mathbf{X}) = p(\mathbf{X} \mid \psi) = \int_\lambda p(\mathbf{X} \mid \psi, \lambda)\, p(\lambda \mid \psi)\, d\lambda .$$
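The identity stated above can be checked numerically. A minimal sketch, reusing the illustrative Beta-Bernoulli setup from before (where the posterior is Beta$(a+k,\, b+n-k)$ by conjugacy): evaluating likelihood $\times$ prior $/$ posterior at any value of $\theta$ returns the same marginal likelihood, which is the idea exploited by Chib (1995).

```python
# Sketch: the identity p(X | alpha) = p(X | theta) p(theta | alpha) / p(theta | X, alpha)
# holds at every theta; here it is evaluated at two arbitrary points for a conjugate model.
import numpy as np
from scipy import stats

x = np.array([1, 0, 1, 1, 0, 1])
a, b = 2.0, 2.0
k, n = x.sum(), len(x)

def log_marginal_at(theta):
    log_lik = k * np.log(theta) + (n - k) * np.log(1 - theta)   # log p(X | theta)
    log_prior = stats.beta.logpdf(theta, a, b)                   # log p(theta | alpha)
    log_post = stats.beta.logpdf(theta, a + k, b + n - k)        # log p(theta | X, alpha)
    return log_lik + log_prior - log_post

print(log_marginal_at(0.3), log_marginal_at(0.7))                # identical up to rounding
```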
Unfortunately, marginal likelihoods are generally difficult to compute. Exact solutions are known for a small class of distributions, particularly when the prior placed on the marginalized-out parameter is conjugate to the distribution of the data. In other cases, some kind of numerical integration method is needed, either a general method such as Gaussian integration or a Monte Carlo method, or a method specialized to statistical problems such as the Laplace approximation, Gibbs/Metropolis sampling, or the EM algorithm.
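A minimal sketch of the two situations just described, again assuming the illustrative Beta-Bernoulli model: a simple Monte Carlo estimate that averages the likelihood over draws from the prior, compared with the exact conjugate answer $B(a+k,\, b+n-k)/B(a, b)$.

```python
# Sketch: Monte Carlo estimate of the marginal likelihood (average of p(X | theta_s) over
# prior draws theta_s) versus the closed-form conjugate result for a Beta-Bernoulli model.
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(0)
x = np.array([1, 0, 1, 1, 0, 1])
a, b = 2.0, 2.0
k, n = x.sum(), len(x)

theta_draws = rng.beta(a, b, size=100_000)                             # theta_s ~ p(theta | alpha)
mc_estimate = np.mean(theta_draws**k * (1 - theta_draws)**(n - k))     # mean of p(X | theta_s)

exact = np.exp(betaln(a + k, b + n - k) - betaln(a, b))                # B(a+k, b+n-k) / B(a, b)
print(mc_estimate, exact)
```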
It is also possible to apply the above considerations to a single random variable (data point) x {\displaystyle x} , rather than a set of observations. In a Bayesian context, this is equivalent to the prior predictive distribution of a data point.
In Bayesian model comparison, the marginalized variables $\theta$ are parameters for a particular type of model, and the remaining variable $M$ is the identity of the model itself. In this case, the marginalized likelihood is the probability of the data given the model type, not assuming any particular model parameters. Writing $\theta$ for the model parameters, the marginal likelihood for the model $M$ is

$$p(\mathbf{X} \mid M) = \int p(\mathbf{X} \mid \theta, M)\, p(\theta \mid M)\, d\theta .$$
It is in this context that the term model evidence is normally used. This quantity is important because the posterior odds ratio for a model $M_1$ against another model $M_2$ involves a ratio of marginal likelihoods, called the Bayes factor:

$$\frac{p(M_1 \mid \mathbf{X})}{p(M_2 \mid \mathbf{X})} = \frac{p(M_1)}{p(M_2)} \, \frac{p(\mathbf{X} \mid M_1)}{p(\mathbf{X} \mid M_2)},$$
which can be stated schematically as

posterior odds = prior odds × Bayes factor.
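A minimal sketch of a Bayes factor computation (assumptions: the same illustrative coin-flip data; $M_1$ fixes a fair coin, $M_2$ places a uniform Beta(1, 1) prior on $\theta$ so that its evidence is $B(k+1,\, n-k+1)$):

```python
# Sketch: Bayes factor p(X | M1) / p(X | M2) for two simple coin models, and the resulting
# posterior odds under equal prior model probabilities.
import numpy as np
from scipy.special import betaln

x = np.array([1, 0, 1, 1, 0, 1])
k, n = x.sum(), len(x)

evidence_m1 = 0.5**n                                             # p(X | M1): theta fixed at 1/2
evidence_m2 = np.exp(betaln(1 + k, 1 + n - k) - betaln(1, 1))    # p(X | M2): theta integrated out

bayes_factor = evidence_m1 / evidence_m2
prior_odds = 1.0                                                 # equal prior model probabilities
posterior_odds = prior_odds * bayes_factor
print(bayes_factor, posterior_odds)
```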
1. Šmídl, Václav; Quinn, Anthony (2006). "Bayesian Theory". The Variational Bayes Method in Signal Processing. Springer. pp. 13–23. doi:10.1007/3-540-28820-1_2.
2. Chib, Siddhartha (1995). "Marginal likelihood from the Gibbs output". Journal of the American Statistical Association. 90 (432): 1313–1321. doi:10.1080/01621459.1995.10476635.