In probability theory and statistics, the Bernoulli distribution, named after Jacob Bernoulli, is a discrete probability distribution describing a random variable with two possible outcomes: 1 with probability p and 0 with probability q = 1 − p. It models a single experiment that answers a yes–no question, producing Boolean outcomes such as success/failure or true/false. For example, it can represent a (possibly biased) coin toss with heads as 1 and tails as 0, where p is the probability of heads. The Bernoulli distribution is the special case of the binomial distribution with a single trial; it is also a special case of the two-point distribution, whose two possible outcomes need not be 0 and 1.
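As a concrete illustration, the following Python sketch simulates repeated tosses of such a biased coin, mapping heads to 1 and tails to 0. It assumes NumPy is available; the probability p = 0.3 and the sample size are arbitrary values chosen only for the example.

```python
# Minimal sketch: simulate Bernoulli(p) outcomes for a biased coin.
# p = 0.3 and the sample size are illustrative values only.
import numpy as np

rng = np.random.default_rng(seed=0)
p = 0.3
tosses = (rng.random(10_000) < p).astype(int)  # 1 = heads (prob. p), 0 = tails

print(tosses[:10])      # first ten simulated outcomes
print(tosses.mean())    # relative frequency of 1s, close to p
```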
Properties
If $X$ is a random variable with a Bernoulli distribution, then:
$$\Pr(X=1)=p, \qquad \Pr(X=0)=q=1-p.$$
The probability mass function $f$ of this distribution, over possible outcomes $k$, is
$$f(k;p)=\begin{cases}p & \text{if } k=1,\\ q=1-p & \text{if } k=0.\end{cases}$$ [3]
This can also be expressed as
$$f(k;p)=p^{k}(1-p)^{1-k} \quad \text{for } k\in\{0,1\},$$
or as
$$f(k;p)=pk+(1-p)(1-k) \quad \text{for } k\in\{0,1\}.$$
The Bernoulli distribution is a special case of the binomial distribution with $n=1$.[4]
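The two closed forms of the probability mass function agree for $k\in\{0,1\}$; the short Python sketch below checks this numerically, with an arbitrary illustrative value of p.

```python
# Sketch: the power form p^k (1-p)^(1-k) and the linear form pk + (1-p)(1-k)
# of the Bernoulli pmf agree on k in {0, 1}.  p = 0.3 is illustrative only.
import math

p = 0.3
for k in (0, 1):
    pmf_power = p**k * (1 - p)**(1 - k)
    pmf_linear = p*k + (1 - p)*(1 - k)
    assert math.isclose(pmf_power, pmf_linear)
    print(k, pmf_power)
```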
The kurtosis goes to infinity as $p$ approaches 0 or 1, but for $p=1/2$ the two-point distributions, including the Bernoulli distribution, have a lower excess kurtosis, namely $-2$, than any other probability distribution.
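A quick numerical check of this behaviour is sketched below. It assumes SciPy, whose stats method reports excess kurtosis; the probed values of p are arbitrary. The closed form for the excess kurtosis of a Bernoulli distribution is $(1-6pq)/(pq)$.

```python
# Sketch: excess kurtosis of Bernoulli(p) is (1 - 6pq)/(pq); it equals -2 at
# p = 1/2 and diverges as p approaches 0 or 1.  The p values are illustrative.
from scipy.stats import bernoulli

for p in (0.5, 0.9, 0.99):
    q = 1 - p
    print(p, (1 - 6*p*q) / (p*q), float(bernoulli.stats(p, moments='k')))
```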
The Bernoulli distributions for $0\leq p\leq 1$ form an exponential family.
The maximum likelihood estimator of $p$ based on a random sample is the sample mean.
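A minimal sketch of this estimator follows, assuming NumPy; the true value of p and the sample size are arbitrary illustrative choices.

```python
# Sketch: for an i.i.d. Bernoulli sample the MLE of p is the sample mean.
# p_true = 0.65 and the sample size are illustrative values only.
import numpy as np

rng = np.random.default_rng(1)
p_true = 0.65
x = (rng.random(5_000) < p_true).astype(int)

p_hat = x.mean()   # maximum likelihood estimate of p
print(p_hat)       # close to 0.65
```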
Mean
The expected value of a Bernoulli random variable $X$ is
$$\operatorname{E}[X]=p.$$
This is because for a Bernoulli distributed random variable $X$ with $\Pr(X=1)=p$ and $\Pr(X=0)=q$ we find
$$\operatorname{E}[X]=\Pr(X=1)\cdot 1+\Pr(X=0)\cdot 0=p\cdot 1+q\cdot 0=p.$$ [5]
Variance
The variance of a Bernoulli distributed $X$ is
$$\operatorname{Var}[X]=pq=p(1-p).$$
We first find
$$\operatorname{E}[X^{2}]=\Pr(X=1)\cdot 1^{2}+\Pr(X=0)\cdot 0^{2}=p\cdot 1^{2}+q\cdot 0^{2}=p=\operatorname{E}[X].$$
From this follows
$$\operatorname{Var}[X]=\operatorname{E}[X^{2}]-\operatorname{E}[X]^{2}=\operatorname{E}[X]-\operatorname{E}[X]^{2}=p-p^{2}=p(1-p)=pq.$$ [6]
With this result it is easy to prove that, for any Bernoulli distribution, the variance lies in $[0,1/4]$: the quadratic $p(1-p)$ is nonnegative on $[0,1]$ and attains its maximum value $1/4$ at $p=1/2$.
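A short numerical sketch of this bound, assuming NumPy and using an arbitrary grid of p values:

```python
# Sketch: Var[X] = p(1-p) stays within [0, 1/4], with the maximum at p = 1/2.
import numpy as np

p = np.linspace(0, 1, 101)
var = p * (1 - p)
print(var.max(), p[var.argmax()])   # 0.25 attained at p = 0.5
assert np.all(var <= 0.25)
```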
Skewness
The skewness is $\dfrac{q-p}{\sqrt{pq}}=\dfrac{1-2p}{\sqrt{pq}}$. When we take the standardized Bernoulli distributed random variable $\dfrac{X-\operatorname{E}[X]}{\sqrt{\operatorname{Var}[X]}}$ we find that this random variable attains $\dfrac{q}{\sqrt{pq}}$ with probability $p$ and attains $-\dfrac{p}{\sqrt{pq}}$ with probability $q$. Thus we get
$$\begin{aligned}\gamma_{1}&=\operatorname{E}\left[\left(\frac{X-\operatorname{E}[X]}{\sqrt{\operatorname{Var}[X]}}\right)^{3}\right]\\&=p\cdot\left(\frac{q}{\sqrt{pq}}\right)^{3}+q\cdot\left(-\frac{p}{\sqrt{pq}}\right)^{3}\\&=\frac{1}{\sqrt{pq}^{\,3}}\left(pq^{3}-qp^{3}\right)\\&=\frac{pq}{\sqrt{pq}^{\,3}}\left(q^{2}-p^{2}\right)\\&=\frac{(1-p)^{2}-p^{2}}{\sqrt{pq}}\\&=\frac{1-2p}{\sqrt{pq}}=\frac{q-p}{\sqrt{pq}}.\end{aligned}$$
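The closed form can be compared with the sample skewness of simulated draws; the sketch below assumes NumPy and SciPy, with p = 0.2 and the number of draws chosen only for illustration.

```python
# Sketch: the sample skewness of Bernoulli(p) draws approaches (1 - 2p)/sqrt(pq).
# p = 0.2 and the number of draws are illustrative values only.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(2)
p = 0.2
x = (rng.random(200_000) < p).astype(int)

print(skew(x))                          # Monte Carlo estimate
print((1 - 2*p) / np.sqrt(p*(1 - p)))   # exact skewness, here 1.5
```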
Higher moments and cumulants
The raw moments are all equal, because $1^{k}=1$ and $0^{k}=0$ for $k\geq 1$:
$$\operatorname{E}[X^{k}]=\Pr(X=1)\cdot 1^{k}+\Pr(X=0)\cdot 0^{k}=p\cdot 1+q\cdot 0=p=\operatorname{E}[X].$$
The central moment of order $k$ is given by
$$\mu_{k}=(1-p)(-p)^{k}+p(1-p)^{k}.$$
The first six central moments are
$$\begin{aligned}\mu_{1}&=0,\\\mu_{2}&=p(1-p),\\\mu_{3}&=p(1-p)(1-2p),\\\mu_{4}&=p(1-p)(1-3p(1-p)),\\\mu_{5}&=p(1-p)(1-2p)(1-2p(1-p)),\\\mu_{6}&=p(1-p)(1-5p(1-p)(1-p(1-p))).\end{aligned}$$
The higher central moments can be expressed more compactly in terms of $\mu_{2}$ and $\mu_{3}$:
$$\begin{aligned}\mu_{4}&=\mu_{2}(1-3\mu_{2}),\\\mu_{5}&=\mu_{3}(1-2\mu_{2}),\\\mu_{6}&=\mu_{2}(1-5\mu_{2}(1-\mu_{2})).\end{aligned}$$
The first six cumulants are
$$\begin{aligned}\kappa_{1}&=p,\\\kappa_{2}&=\mu_{2},\\\kappa_{3}&=\mu_{3},\\\kappa_{4}&=\mu_{2}(1-6\mu_{2}),\\\kappa_{5}&=\mu_{3}(1-12\mu_{2}),\\\kappa_{6}&=\mu_{2}(1-30\mu_{2}(1-4\mu_{2})).\end{aligned}$$
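These closed forms can be reproduced symbolically from the cumulant generating function $K(t)=\ln(q+pe^{t})$. The sketch below is a minimal check assuming SymPy; it verifies a few of the identities above.

```python
# Sketch: derive Bernoulli cumulants from the cumulant generating function
# K(t) = ln(q + p*exp(t)) and compare with the closed forms above.
import sympy as sp

p, t = sp.symbols('p t', positive=True)
K = sp.log(1 - p + p*sp.exp(t))
kappa = [sp.simplify(sp.diff(K, t, n).subs(t, 0)) for n in range(1, 5)]

mu2 = p*(1 - p)
print(sp.simplify(kappa[0] - p))                  # kappa_1 - p, expected 0
print(sp.simplify(kappa[1] - mu2))                # kappa_2 - mu_2, expected 0
print(sp.simplify(kappa[3] - mu2*(1 - 6*mu2)))    # kappa_4 - mu_2(1 - 6 mu_2), expected 0
```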
Entropy and Fisher information
Entropy
Entropy is a measure of uncertainty or randomness in a probability distribution. For a Bernoulli random variable $X$ with success probability $p$ and failure probability $q=1-p$, the entropy $H(X)$ is defined as:
$$\begin{aligned}H(X)&=\operatorname{E}_{p}\left[\ln\frac{1}{P(X)}\right]=-[P(X=0)\ln P(X=0)+P(X=1)\ln P(X=1)]\\&=-(q\ln q+p\ln p),\qquad q=P(X=0),\ p=P(X=1).\end{aligned}$$
The entropy is maximized when $p=0.5$, indicating the highest level of uncertainty when both outcomes are equally likely. The entropy is zero when $p=0$ or $p=1$, where one outcome is certain.
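A small Python sketch of this behaviour, assuming NumPy and the usual convention $0\ln 0=0$ at the endpoints; the probed values of p are arbitrary.

```python
# Sketch: Bernoulli entropy H(p) = -(q ln q + p ln p) in nats; it peaks at
# p = 0.5 (value ln 2) and is zero at p = 0 and p = 1.
import numpy as np

def bernoulli_entropy(p):
    # use the convention 0 * ln(0) = 0 so the endpoints are well defined
    return -sum(x * np.log(x) for x in (p, 1 - p) if x > 0)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, bernoulli_entropy(p))
```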
Fisher information
Fisher information measures the amount of information that an observable random variable $X$ carries about an unknown parameter $p$ upon which the probability of $X$ depends. For the Bernoulli distribution, the Fisher information with respect to the parameter $p$ is given by:
$$I(p)=\frac{1}{pq}$$
Proof:
- The Likelihood Function for a Bernoulli random variable $X$ is
$$L(p;X)=p^{X}(1-p)^{1-X}.$$
This represents the probability of observing $X$ given the parameter $p$.
- The Log-Likelihood Function is
$$\ln L(p;X)=X\ln p+(1-X)\ln(1-p).$$
- The Score Function (the first derivative of the log-likelihood with respect to $p$) is
$$\frac{\partial}{\partial p}\ln L(p;X)=\frac{X}{p}-\frac{1-X}{1-p}.$$
- The second derivative of the log-likelihood function is
$$\frac{\partial^{2}}{\partial p^{2}}\ln L(p;X)=-\frac{X}{p^{2}}-\frac{1-X}{(1-p)^{2}}.$$
- Fisher information is calculated as the negative expected value of the second derivative of the log-likelihood:
$$I(p)=-\operatorname{E}\left[\frac{\partial^{2}}{\partial p^{2}}\ln L(p;X)\right]=\frac{\operatorname{E}[X]}{p^{2}}+\frac{1-\operatorname{E}[X]}{(1-p)^{2}}=\frac{1}{p}+\frac{1}{1-p}=\frac{1}{p(1-p)}=\frac{1}{pq}.$$
Because $I(p)=1/(pq)$ is the reciprocal of the variance, it is minimized when $p=0.5$, where the uncertainty about each outcome is greatest, and it grows without bound as $p$ approaches 0 or 1, where a single observation is most informative about the parameter $p$.
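As a numerical cross-check, the variance of the score over simulated draws should match $1/(pq)$. The sketch below assumes NumPy, with p = 0.3 and the number of draws chosen only for illustration.

```python
# Sketch: Monte Carlo check that Var[score] = I(p) = 1/(pq) for Bernoulli(p).
# p = 0.3 and the number of draws are illustrative values only.
import numpy as np

rng = np.random.default_rng(3)
p = 0.3
x = (rng.random(500_000) < p).astype(int)

score = x / p - (1 - x) / (1 - p)    # d/dp log f(X; p), has mean zero
print(score.var())                   # Monte Carlo estimate of I(p)
print(1 / (p * (1 - p)))             # exact value, about 4.76
```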
Related distributions
- If $X_{1},\dots,X_{n}$ are independent, identically distributed (i.i.d.) random variables, all Bernoulli trials with success probability $p$, then their sum is distributed according to a binomial distribution with parameters $n$ and $p$: $\sum_{k=1}^{n}X_{k}\sim\operatorname{B}(n,p)$.[7] A simulation check of this relationship is sketched after this list.
- The categorical distribution is the generalization of the Bernoulli distribution for variables with any constant number of discrete values.
- The Beta distribution is the conjugate prior of the Bernoulli distribution.[8]
- The geometric distribution models the number of independent and identical Bernoulli trials needed to get one success.
- If $Y\sim\mathrm{Bernoulli}\left(\tfrac{1}{2}\right)$, then $2Y-1$ has a Rademacher distribution.
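The binomial relationship in the first item above can be checked by simulation. The sketch assumes NumPy and SciPy; n, p and the number of replications are arbitrary illustrative values.

```python
# Sketch: the sum of n i.i.d. Bernoulli(p) variables follows Binomial(n, p).
# n = 10, p = 0.4 and the number of replications are illustrative values only.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(4)
n, p, reps = 10, 0.4, 100_000

sums = (rng.random((reps, n)) < p).sum(axis=1)           # row-wise counts of successes
empirical = np.bincount(sums, minlength=n + 1) / reps

print(np.round(empirical, 3))                            # simulated frequencies
print(np.round(binom.pmf(np.arange(n + 1), n, p), 3))    # binomial pmf, close match
```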
See also
- Bernoulli process, a random process consisting of a sequence of independent Bernoulli trials
- Bernoulli sampling
- Binary entropy function
- Binary decision diagram
Further reading
- Johnson, N. L.; Kotz, S.; Kemp, A. (1993). Univariate Discrete Distributions (2nd ed.). Wiley. ISBN 0-471-54897-9.
- Peatman, John G. (1963). Introduction to Applied Statistics. New York: Harper & Row. pp. 162–171.
External links
- "Binomial distribution", Encyclopedia of Mathematics, EMS Press, 2001 [1994].
- Weisstein, Eric W. "Bernoulli Distribution". MathWorld.
- Interactive graphic: Univariate Distribution Relationships.
References
1. Uspensky, James Victor (1937). Introduction to Mathematical Probability. New York: McGraw-Hill. p. 45. OCLC 996937.
2. Dekking, Frederik; Kraaikamp, Cornelis; Lopuhaä, Hendrik; Meester, Ludolf (9 October 2010). A Modern Introduction to Probability and Statistics (1st ed.). Springer London. pp. 43–48. ISBN 9781849969529.
3. Bertsekas, Dimitri P.; Tsitsiklis, John N. (2002). Introduction to Probability. Belmont, Mass.: Athena Scientific. ISBN 188652940X. OCLC 51441829.
4. McCullagh, Peter; Nelder, John (1989). Generalized Linear Models (2nd ed.). Boca Raton: Chapman and Hall/CRC. Section 4.2.2. ISBN 0-412-31760-5.
5. Bertsekas, Dimitri P.; Tsitsiklis, John N. (2002). Introduction to Probability. Belmont, Mass.: Athena Scientific. ISBN 188652940X. OCLC 51441829.
6. Bertsekas, Dimitri P.; Tsitsiklis, John N. (2002). Introduction to Probability. Belmont, Mass.: Athena Scientific. ISBN 188652940X. OCLC 51441829.
7. Bertsekas, Dimitri P.; Tsitsiklis, John N. (2002). Introduction to Probability. Belmont, Mass.: Athena Scientific. ISBN 188652940X. OCLC 51441829.
8. Orloff, Jeremy; Bloom, Jonathan. "Conjugate priors: Beta and normal" (PDF). math.mit.edu. Retrieved October 20, 2023. https://math.mit.edu/~dav/05.dir/class15-prep.pdf