Statistical methods and models commonly involve multiple parameters that can be regarded as related or connected in such a way that the problem implies a dependence structure in the joint probability model for these parameters.[6] Individual degrees of belief, expressed in the form of probabilities, come with uncertainty,[7] and they change over time as new information arrives. As José M. Bernardo and Adrian F. Smith put it, "The actuality of the learning process consists in the evolution of individual and subjective beliefs about the reality." These subjective probabilities describe states of mind rather than physical propensities.[8] It is this need to update beliefs that leads Bayesians to formulate an alternative statistical model which takes into account the prior occurrence of a particular event.[9]
The assumed occurrence of a real-world event typically modifies preferences between certain options, because it modifies the degrees of belief that an individual attaches to the events defining those options.[10]
Suppose, in a study of the effectiveness of cardiac treatments, that the patients in hospital $j$ have survival probability $\theta_j$. The survival probability will be updated given the occurrence of $y$, the event that a controversial serum is created which, as some believe, increases survival in cardiac patients.
In order to make updated probability statements about $\theta_j$ given the occurrence of event $y$, we must begin with a model providing a joint probability distribution for $\theta_j$ and $y$. This can be written as a product of two distributions, often referred to as the prior distribution $P(\theta)$ and the sampling distribution $P(y \mid \theta)$ respectively:

$$P(\theta, y) = P(\theta)\, P(y \mid \theta).$$
Using the basic property of conditional probability, the posterior distribution is given by:

$$P(\theta \mid y) = \frac{P(\theta, y)}{P(y)} = \frac{P(y \mid \theta)\, P(\theta)}{P(y)}.$$
This equation, relating the conditional probability to the individual events, is known as Bayes' theorem. This simple expression encapsulates the technical core of Bayesian inference, which aims to deconstruct the probability $P(\theta \mid y)$ in terms of solvable subsets of its supporting evidence.[11]
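As an illustration of this update rule, the following sketch approximates a posterior on a grid by multiplying a prior by a likelihood and renormalizing. It is a hypothetical example, not drawn from the cited sources: the Beta(2, 2) prior and the binomial survival data are assumptions chosen purely for demonstration.

```python
import numpy as np
from scipy import stats

# Grid of candidate survival probabilities theta in (0, 1).
theta = np.linspace(0.001, 0.999, 999)

# Assumed prior P(theta): a Beta(2, 2) density (hypothetical choice).
prior = stats.beta.pdf(theta, 2, 2)

# Assumed data y: 15 survivors out of 20 patients (hypothetical numbers).
# Sampling distribution P(y | theta) evaluated at the observed y.
likelihood = stats.binom.pmf(15, 20, theta)

# Bayes' theorem on the grid: posterior is proportional to prior * likelihood.
unnormalized = prior * likelihood
posterior = unnormalized / np.trapz(unnormalized, theta)  # normalize so it integrates to 1

print("Posterior mean of theta:", np.trapz(theta * posterior, theta))
```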
The usual starting point of a statistical analysis is the assumption that the $n$ values $y_1, y_2, \ldots, y_n$ are exchangeable. If no information other than the data $y$ is available to distinguish any of the $\theta_j$'s from any others, and no ordering or grouping of the parameters can be made, one must assume symmetry among the parameters in the prior distribution.[12] This symmetry is represented probabilistically by exchangeability. Generally, it is useful and appropriate to model data from an exchangeable distribution as independently and identically distributed, given some unknown parameter vector $\theta$ with distribution $P(\theta)$.
For a fixed number $n$, the set $y_1, y_2, \ldots, y_n$ is exchangeable if the joint probability $P(y_1, y_2, \ldots, y_n)$ is invariant under permutations of the indices. That is, for every permutation $\pi = (\pi_1, \pi_2, \ldots, \pi_n)$ of $(1, 2, \ldots, n)$,

$$P(y_1, y_2, \ldots, y_n) = P(y_{\pi_1}, y_{\pi_2}, \ldots, y_{\pi_n}).$$[13]
The following is an example that is exchangeable but not independent and identically distributed (iid): Consider an urn containing a red ball and a blue ball, each with probability $\tfrac{1}{2}$ of being drawn. Balls are drawn without replacement; that is, after one ball is drawn from the $n$ balls, $n - 1$ balls remain for the next draw.
The probability of selecting a red ball on the first draw and a blue ball on the second is equal to the probability of selecting a blue ball on the first draw and a red on the second, both of which are 1/2:

$$P(y_1 = \text{red},\, y_2 = \text{blue}) = P(y_1 = \text{blue},\, y_2 = \text{red}) = \tfrac{1}{2}.$$
This makes $y_1$ and $y_2$ exchangeable.
But the probability of selecting a red ball on the second draw, given that a red ball has already been selected on the first, is 0. This is not equal to the marginal probability that a red ball is selected on the second draw, which is 1/2:

$$P(y_2 = \text{red} \mid y_1 = \text{red}) = 0 \neq P(y_2 = \text{red}) = \tfrac{1}{2}.$$
Thus, $y_1$ and $y_2$ are not independent.
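A minimal simulation sketch of the urn example (not from the cited sources; the number of trials and the seed are arbitrary) makes both properties visible: the two orderings are equally likely, yet the second draw is completely determined by the first.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 100_000

# Simulate drawing both balls without replacement from an urn {red, blue}.
draws = np.array([rng.permutation(["red", "blue"]) for _ in range(n_trials)])
y1, y2 = draws[:, 0], draws[:, 1]

# Exchangeability: P(y1=red, y2=blue) matches P(y1=blue, y2=red), both about 1/2.
print("P(y1=red, y2=blue) ~", np.mean((y1 == "red") & (y2 == "blue")))
print("P(y1=blue, y2=red) ~", np.mean((y1 == "blue") & (y2 == "red")))

# Lack of independence: P(y2=red | y1=red) is 0, unlike P(y2=red) = 1/2.
print("P(y2=red) ~", np.mean(y2 == "red"))
print("P(y2=red | y1=red) ~", np.mean(y2[y1 == "red"] == "red"))
```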
If $x_1, \ldots, x_n$ are independent and identically distributed, then they are exchangeable, but the converse is not necessarily true.[14]
Infinite exchangeability is the property that every finite subset of an infinite sequence $y_1, y_2, \ldots$ is exchangeable; that is, for any $n$, the sequence $y_1, y_2, \ldots, y_n$ is exchangeable.[15]
Bayesian hierarchical modeling makes use of two important concepts in deriving the posterior distribution,[16] namely hyperparameters (parameters of a prior distribution) and hyperpriors (distributions of hyperparameters).
Suppose a random variable $Y$ follows a normal distribution with mean $\theta$ and variance 1, that is $Y \mid \theta \sim N(\theta, 1)$. The tilde relation $\sim$ can be read as "has the distribution of" or "is distributed as". Suppose also that the parameter $\theta$ has a distribution given by a normal distribution with mean $\mu$ and variance 1, i.e. $\theta \mid \mu \sim N(\mu, 1)$. Furthermore, $\mu$ follows another distribution given, for example, by the standard normal distribution $N(0, 1)$. The parameter $\mu$ is called the hyperparameter, while its distribution, $N(0, 1)$, is an example of a hyperprior distribution. The notation of the distribution of $Y$ changes as another parameter is added, i.e. $Y \mid \theta, \mu \sim N(\theta, 1)$. If there is another stage, say $\mu$ following another normal distribution with mean $\beta$ and variance $\epsilon$, so that $\mu \sim N(\beta, \epsilon)$, then $\beta$ and $\epsilon$ can also be called hyperparameters, with hyperprior distributions of their own.[17]
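A minimal simulation sketch of this three-level normal hierarchy (the sample size and the random seed are assumptions chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws = 10_000

# Hyperprior: mu ~ N(0, 1).
mu = rng.normal(loc=0.0, scale=1.0, size=n_draws)

# Prior: theta | mu ~ N(mu, 1).
theta = rng.normal(loc=mu, scale=1.0)

# Likelihood level: Y | theta ~ N(theta, 1).
y = rng.normal(loc=theta, scale=1.0)

# Marginally, Y is the sum of three independent N(0, 1) terms, so Var(Y) = 3.
print("sample variance of Y:", y.var())  # close to 3
```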
Let $y_j$ be an observation and $\theta_j$ a parameter governing the data-generating process for $y_j$. Assume further that the parameters $\theta_1, \theta_2, \ldots, \theta_j$ are generated exchangeably from a common population, with a distribution governed by a hyperparameter $\phi$. The Bayesian hierarchical model contains the following stages:

Stage I: $y_j \mid \theta_j, \phi \sim P(y_j \mid \theta_j, \phi)$
Stage II: $\theta_j \mid \phi \sim P(\theta_j \mid \phi)$
Stage III: $\phi \sim P(\phi)$
The likelihood, as seen in stage I, is $P(y_j \mid \theta_j, \phi)$, with $P(\theta_j, \phi)$ as its prior distribution. Note that the likelihood depends on $\phi$ only through $\theta_j$.
The prior distribution from stage I can be broken down into:

$$P(\theta_j, \phi) = P(\theta_j \mid \phi)\, P(\phi),$$

with $\phi$ as the hyperparameter with hyperprior distribution $P(\phi)$.
Thus, the posterior distribution is proportional to:

$$P(\phi, \theta_j \mid y) \propto P(y_j \mid \theta_j, \phi)\, P(\theta_j, \phi) = P(y_j \mid \theta_j)\, P(\theta_j \mid \phi)\, P(\phi),$$

using the fact that the likelihood depends on $\phi$ only through $\theta_j$.
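A sketch of this factorization, evaluated on a discrete grid for the normal hierarchy of the previous example (the single observed value $y = 1.5$ and the grid limits are assumptions chosen for illustration):

```python
import numpy as np
from scipy import stats

y_obs = 1.5  # assumed single observation

# Grids for the parameter theta and the hyperparameter phi.
theta = np.linspace(-5, 5, 401)
phi = np.linspace(-5, 5, 401)
T, F = np.meshgrid(theta, phi, indexing="ij")

# Posterior proportional to P(y | theta) * P(theta | phi) * P(phi).
likelihood = stats.norm.pdf(y_obs, loc=T, scale=1.0)   # P(y | theta)
prior = stats.norm.pdf(T, loc=F, scale=1.0)            # P(theta | phi)
hyperprior = stats.norm.pdf(F, loc=0.0, scale=1.0)     # P(phi)

joint_post = likelihood * prior * hyperprior
joint_post /= joint_post.sum()                         # normalize over the grid

# Marginal posterior of theta, obtained by summing out phi.
theta_marginal = joint_post.sum(axis=1)
print("posterior mean of theta:", (theta * theta_marginal).sum())  # close to 1.0 for y = 1.5
```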
As an example, a teacher wants to estimate how well a student did on the SAT. The teacher uses the student's current grade point average (GPA) for the estimate. The GPA, denoted by $Y$, has a likelihood given by some probability function with parameter $\theta$, i.e. $Y \mid \theta \sim P(Y \mid \theta)$. This parameter $\theta$ is the student's SAT score. The SAT score is viewed as a sample coming from a common population distribution indexed by another parameter $\phi$, the student's high school grade level (freshman, sophomore, junior or senior),[19] that is, $\theta \mid \phi \sim P(\theta \mid \phi)$. Moreover, the hyperparameter $\phi$ follows its own distribution, $P(\phi)$, a hyperprior.
These relationships can be combined, via Bayes' theorem, to compute the probability of a specific SAT score given a particular GPA:

$$P(\theta, \phi \mid Y) = \frac{P(Y \mid \theta, \phi)\, P(\theta, \phi)}{P(Y)} = \frac{P(Y \mid \theta)\, P(\theta \mid \phi)\, P(\phi)}{P(Y)}.$$
All information in the problem is used to solve for the posterior distribution. Instead of using only the prior distribution and the likelihood function, the use of hyperpriors allows a more nuanced treatment of the relationships between the given variables.[20]
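One way to make this example concrete is the following simulation sketch; every distributional choice and numeric value here (the grade-level SAT means, the spread parameters, and the GPA scaling) is a hypothetical assumption introduced purely for illustration, not part of the cited sources.

```python
import numpy as np

rng = np.random.default_rng(7)
n_students = 10_000

# Hyperprior P(phi): grade level, drawn uniformly (hypothetical).
grades = np.array(["freshman", "sophomore", "junior", "senior"])
mean_sat_by_grade = {"freshman": 1000, "sophomore": 1050, "junior": 1100, "senior": 1150}
phi = rng.choice(grades, size=n_students)

# Prior P(theta | phi): SAT score centered at the grade-level mean (hypothetical spread).
theta = rng.normal([mean_sat_by_grade[g] for g in phi], 100.0)

# Likelihood P(Y | theta): GPA as a noisy rescaling of the SAT score (hypothetical link).
y = rng.normal(theta / 400.0, 0.3)

print("average simulated GPA:", y.mean())
```

Given an observed GPA, posterior inference for the SAT score would then combine these three levels exactly as in the grid sketch above.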
In general, the joint posterior distribution of interest in 2-stage hierarchical models is:

$$P(\theta, \phi \mid Y) \propto P(Y \mid \theta)\, P(\theta \mid \phi)\, P(\phi).$$
For 3-stage hierarchical models, where the hyperparameter $\phi$ itself has a hyperprior governed by a further parameter, say $X$, the posterior distribution is given by:

$$P(\theta, \phi, X \mid Y) \propto P(Y \mid \theta)\, P(\theta \mid \phi)\, P(\phi \mid X)\, P(X).$$
A three-stage version of Bayesian hierarchical modeling could be used to model probability at 1) the individual level, 2) the population level, and 3) the prior, an assumed probability distribution specified before any evidence is acquired:
Stage 1: Individual-Level Model
$$y_{ij} = f(t_{ij}; \theta_{1i}, \theta_{2i}, \ldots, \theta_{li}, \ldots, \theta_{Ki}) + \epsilon_{ij}, \quad \epsilon_{ij} \sim N(0, \sigma^2), \quad i = 1, \ldots, N, \; j = 1, \ldots, M_i.$$
Stage 2: Population Model
$$\theta_{li} = \alpha_l + \sum_{b=1}^{P} \beta_{lb} x_{ib} + \eta_{li}, \quad \eta_{li} \sim N(0, \omega_l^2), \quad i = 1, \ldots, N, \; l = 1, \ldots, K.$$
Stage 3: Prior
$$\sigma^2 \sim \pi(\sigma^2), \quad \alpha_l \sim \pi(\alpha_l), \quad (\beta_{l1}, \ldots, \beta_{lb}, \ldots, \beta_{lP}) \sim \pi(\beta_{l1}, \ldots, \beta_{lb}, \ldots, \beta_{lP}), \quad \omega_l^2 \sim \pi(\omega_l^2), \quad l = 1, \ldots, K.$$
Here, $y_{ij}$ denotes the continuous response of the $i$-th subject at time point $t_{ij}$, and $x_{ib}$ is the $b$-th covariate of the $i$-th subject. Parameters involved in the model are written in Greek letters. The function $f(t; \theta_1, \ldots, \theta_K)$ is a known function parameterized by the $K$-dimensional vector $(\theta_1, \ldots, \theta_K)$.
Typically, $f$ is a nonlinear function that describes the temporal trajectory of individuals. In the model, $\epsilon_{ij}$ and $\eta_{li}$ describe within-individual variability and between-individual variability, respectively. If the prior is not considered, the relationship reduces to a frequentist nonlinear mixed-effect model.
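The following sketch simulates data from the three stages above, using a hypothetical exponential-decay trajectory $f(t; \theta_1, \theta_2) = \theta_1 e^{-\theta_2 t}$, one covariate, and arbitrary prior draws; all numeric settings are assumptions chosen for illustration, not part of the cited model.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, K, P = 20, 10, 2, 1           # subjects, time points, parameters, covariates
t = np.linspace(0.0, 5.0, M)         # common measurement times (hypothetical)
x = rng.normal(size=(N, P))          # covariates x_ib

# Stage 3 (prior): draw population-level quantities (hypothetical hyperprior values).
sigma2 = 0.05
alpha = np.array([2.0, 0.8])                     # alpha_l, l = 1..K
beta = rng.normal(0.0, 0.1, size=(K, P))         # beta_lb
omega2 = np.array([0.04, 0.01])                  # omega_l^2

# Stage 2 (population model): individual parameters theta_li.
eta = rng.normal(0.0, np.sqrt(omega2), size=(N, K))
theta = alpha + x @ beta.T + eta                 # shape (N, K)

# Stage 1 (individual-level model): y_ij = f(t_ij; theta_i) + eps_ij.
f_vals = theta[:, [0]] * np.exp(-theta[:, [1]] * t)   # hypothetical nonlinear trajectory
y = f_vals + rng.normal(0.0, np.sqrt(sigma2), size=(N, M))

print("simulated response matrix shape:", y.shape)    # (N, M)
```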
A central task in the application of Bayesian nonlinear mixed-effects models is to evaluate the posterior density:
$$\pi\left(\{\theta_{li}\}_{i=1,l=1}^{N,K},\, \sigma^2,\, \{\alpha_l\}_{l=1}^{K},\, \{\beta_{lb}\}_{l=1,b=1}^{K,P},\, \{\omega_l\}_{l=1}^{K} \,\middle|\, \{y_{ij}\}_{i=1,j=1}^{N,M_i}\right)$$

$$\propto \pi\left(\{y_{ij}\}_{i=1,j=1}^{N,M_i},\, \{\theta_{li}\}_{i=1,l=1}^{N,K},\, \sigma^2,\, \{\alpha_l\}_{l=1}^{K},\, \{\beta_{lb}\}_{l=1,b=1}^{K,P},\, \{\omega_l\}_{l=1}^{K}\right)$$

$$= \underbrace{\pi\left(\{y_{ij}\}_{i=1,j=1}^{N,M_i} \,\middle|\, \{\theta_{li}\}_{i=1,l=1}^{N,K},\, \sigma^2\right)}_{\text{Stage 1: Individual-Level Model}} \times \underbrace{\pi\left(\{\theta_{li}\}_{i=1,l=1}^{N,K} \,\middle|\, \{\alpha_l\}_{l=1}^{K},\, \{\beta_{lb}\}_{l=1,b=1}^{K,P},\, \{\omega_l\}_{l=1}^{K}\right)}_{\text{Stage 2: Population Model}} \times \underbrace{\pi\left(\sigma^2,\, \{\alpha_l\}_{l=1}^{K},\, \{\beta_{lb}\}_{l=1,b=1}^{K,P},\, \{\omega_l\}_{l=1}^{K}\right)}_{\text{Stage 3: Prior}}$$
A research cycle using the Bayesian nonlinear mixed-effects model comprises two steps: (a) a standard research cycle and (b) a Bayesian-specific workflow.[23]
A standard research cycle involves 1) literature review, 2) defining a problem, and 3) specifying the research question and hypothesis. The Bayesian-specific workflow comprises three sub-steps: (b)–(i) formalizing prior distributions based on background knowledge and prior elicitation; (b)–(ii) determining the likelihood function based on a nonlinear function $f$; and (b)–(iii) making a posterior inference. The resulting posterior inference can be used to start a new research cycle.
Allenby, Rossi, McCulloch (January 2005). "Hierarchical Bayes Model: A Practitioner's Guide". Journal of Bayesian Applications in Marketing. pp. 1–4, p. 3. Retrieved 26 April 2014. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=655541
Gelman, Andrew; Carlin, John B.; Stern, Hal S. & Rubin, Donald B. (2004). Bayesian Data Analysis (second ed.). Boca Raton, Florida: CRC Press. pp. 4–5. ISBN 1-58488-388-X.
Lee, Se Yoon; Lei, Bowen; Mallick, Bani (2020). "Estimation of COVID-19 spread curves integrating global data and borrowing information". PLOS ONE. 15 (7): e0236860. arXiv:2005.00662. doi:10.1371/journal.pone.0236860. PMC 7390340. PMID 32726361. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7390340
Lee, Se Yoon; Mallick, Bani (2021). "Bayesian Hierarchical Modeling: Application Towards Production Results in the Eagle Ford Shale of South Texas". Sankhya B. 84: 1–43. doi:10.1007/s13571-020-00245-8. https://doi.org/10.1007%2Fs13571-020-00245-8
Gelman et al. 2004, p. 6.
Gelman et al. 2004, p. 117.
Good, I.J. (1980). "Some history of the hierarchical Bayesian methodology". Trabajos de Estadistica y de Investigacion Operativa. 31: 489–519. doi:10.1007/BF02888365. S2CID 121270218. http://dialnet.unirioja.es/servlet/oaiart?codigo=2368428
Bernardo, Smith (1994). Bayesian Theory. Chichester, England: John Wiley & Sons. p. 23. ISBN 0-471-92416-4. https://books.google.com/books?id=11nSgIcd7xQC&dq=bernardo+degroot+lindley&pg=PA497
Gelman et al. 2004, pp. 6–8.
Bernardo, DeGroot, Lindley (September 1983). "Proceedings of the Second Valencia International Meeting". Bayesian Statistics 2. Amsterdam: Elsevier Science Publishers B.V. pp. 167–168. ISBN 0-444-87746-0. https://books.google.com/books?id=myfRtgAACAAJ&q=Proceedings+of+the+Second+Valencia+International+Meeting
Gelman et al. 2004, pp. 121–125.
Diaconis, Freedman (1980). "Finite exchangeable sequences". Annals of Probability. pp. 745–747. http://projecteuclid.org/download/pdf_1/euclid.aop/1176994663
Bernardo, DeGroot, Lindley (September 1983). "Proceedings of the Second Valencia International Meeting". Bayesian Statistics 2. Amsterdam: Elsevier Science Publishers B.V. pp. 371–372. ISBN 0-444-87746-0. https://books.google.com/books?id=wYj-_uFLOe4C&q=Proceedings%20of%20the%20Second%20Valencia%20International%20Meeting
Gelman et al. 2004, pp. 120–121.
Box, G. E. P.; Tiao, G. C. (1965). "Multiparameter problems from a Bayesian point of view". Annals of Mathematical Statistics. 36 (5).
Lee, Se Yoon (2022). "Bayesian Nonlinear Models for Repeated Measurement Data: An Overview, Implementation, and Applications". Mathematics. 10 (6): 898. arXiv:2201.12430. doi:10.3390/math10060898. https://doi.org/10.3390%2Fmath10060898