In statistical classification, two main approaches to building classifiers are the generative approach and the discriminative approach. A generative model models the joint probability distribution P(X, Y), and can therefore be used to generate outcomes, while a discriminative model models the conditional probability P(Y ∣ X = x) of the target given an observation, in order to discriminate between target values. Classifiers that do not use a probability model at all are also loosely called discriminative. Examples of generative classifiers include the naive Bayes classifier and linear discriminant analysis, while logistic regression is an example of a discriminative model. These methods differ in their probabilistic assumptions and typical applications, and are often combined to leverage domain knowledge.
Definition
An alternative division defines these symmetrically as:
- a generative model is a model of the conditional probability of the observable X, given a target y; symbolically, P(X ∣ Y = y)[7]
- a discriminative model is a model of the conditional probability of the target Y, given an observation x; symbolically, P(Y ∣ X = x)[8]
Regardless of precise definition, the terminology is constitutive because a generative model can be used to "generate" random instances (outcomes), either of an observation and target (x, y), or of an observation x given a target value y,[9] while a discriminative model or discriminative classifier (without a model) can be used to "discriminate" the value of the target variable Y, given an observation x.[10] The difference between "discriminate" (distinguish) and "classify" is subtle, and the two terms are not consistently distinguished. (The term "discriminative classifier" becomes a pleonasm when "discrimination" is equivalent to "classification".)
The term "generative model" is also used to describe models that generate instances of output variables in a way that has no clear relationship to probability distributions over potential samples of input variables. Generative adversarial networks are examples of this class of generative models, and are judged primarily by the similarity of particular outputs to potential inputs. Such models are not classifiers.
Relationships between models
In application to classification, the observable X is frequently a continuous variable, the target Y is generally a discrete variable consisting of a finite set of labels, and the conditional probability P(Y ∣ X) can also be interpreted as a (non-deterministic) target function f: X → Y, considering X as inputs and Y as outputs.
Given a finite set of labels, the two definitions of "generative model" are closely related. A model of the conditional distribution P(X ∣ Y = y) is a model of the distribution of observations for each label, and a model of the joint distribution is equivalent to a model of the distribution of label values P(Y) together with the distribution of observations given a label, P(X ∣ Y); symbolically, P(X, Y) = P(X ∣ Y) P(Y). Thus, while a model of the joint probability distribution is more informative than a model of the per-label distributions alone (which omits the labels' relative frequencies), the step from one to the other is relatively small, so they are not always distinguished.
Given a model of the joint distribution P(X, Y), the distribution of the individual variables can be computed as the marginal distributions P(X) = Σ_y P(X, Y = y) and P(Y) = ∫_x P(Y, X = x) dx (considering X as continuous, hence integrating over it, and Y as discrete, hence summing over it), and either conditional distribution can be computed from the definition of conditional probability: P(X ∣ Y) = P(X, Y) / P(Y) and P(Y ∣ X) = P(X, Y) / P(X).
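These relationships can be checked numerically. The sketch below assumes, purely for illustration, that both X and Y are discrete (so both marginals are sums rather than an integral) and uses a made-up joint table rather than any data from the article:

```python
import numpy as np

# Illustrative joint distribution P(X, Y) for a discrete X with three values
# and a binary Y, stored as a 3x2 table whose entries sum to 1.
joint = np.array([[0.10, 0.20],
                  [0.25, 0.15],
                  [0.05, 0.25]])

# Marginals: sum the joint over the other variable.
p_x = joint.sum(axis=1)   # P(X)
p_y = joint.sum(axis=0)   # P(Y)

# Conditionals, from the definition of conditional probability.
p_x_given_y = joint / p_y            # P(X | Y) = P(X, Y) / P(Y); each column sums to 1
p_y_given_x = joint / p_x[:, None]   # P(Y | X) = P(X, Y) / P(X); each row sums to 1

print(p_x, p_y)
print(p_y_given_x)
```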
Given a model of one conditional probability, and estimated probability distributions for the variables X and Y, denoted P(X) and P(Y), one can estimate the opposite conditional probability using Bayes' rule:
P(X ∣ Y) P(Y) = P(Y ∣ X) P(X).
For example, given a generative model for P(X ∣ Y), one can estimate
P(Y ∣ X) = P(X ∣ Y) P(Y) / P(X),
and given a discriminative model for P(Y ∣ X), one can estimate
P(X ∣ Y) = P(Y ∣ X) P(X) / P(Y).
Note that Bayes' rule (computing one conditional probability in terms of the other) and the definition of conditional probability (computing a conditional probability in terms of the joint distribution) are frequently conflated as well.
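As a minimal sketch of this inversion, the following assumes a small discrete problem with an illustrative class prior P(Y) and class-conditional model P(X ∣ Y) (the numbers are invented for demonstration), and recovers P(Y ∣ X) by Bayes' rule:

```python
import numpy as np

# Illustrative generative model: class prior P(Y) and class-conditional
# observation model P(X | Y) for a binary Y and a discrete X with three values.
p_y = np.array([0.6, 0.4])                 # P(Y = y)
p_x_given_y = np.array([[0.7, 0.2],        # rows index x, columns index y
                        [0.2, 0.3],
                        [0.1, 0.5]])       # each column sums to 1

# Bayes' rule: P(Y | X) = P(X | Y) P(Y) / P(X),
# with P(X) = sum_y P(X | Y = y) P(Y = y).
p_x = p_x_given_y @ p_y                           # marginal P(X)
p_y_given_x = (p_x_given_y * p_y) / p_x[:, None]  # each row sums to 1

print(p_y_given_x)
```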
Contrast with discriminative classifiers
A generative algorithm models how the data was generated in order to categorize a signal. It asks the question: based on my generation assumptions, which category is most likely to generate this signal? A discriminative algorithm does not care about how the data was generated; it simply categorizes a given signal. Thus, discriminative algorithms try to learn p(y | x) directly from the data and then classify the data. Generative algorithms, on the other hand, try to learn p(x, y), which can later be transformed into p(y | x) to classify the data. One advantage of generative algorithms is that p(x, y) can be used to generate new data similar to the existing data. On the other hand, it has been shown that some discriminative algorithms outperform some generative algorithms in classification tasks.[11]
Although discriminative models do not need to model the distribution of the observed variables, they cannot generally express complex relationships between the observed and target variables. Nor do they necessarily perform better than generative models at classification and regression tasks. The two classes are seen as complementary or as different views of the same procedure.[12]
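The Ng & Jordan comparison can be reproduced in spirit with off-the-shelf tools. The sketch below assumes scikit-learn is available, uses Gaussian naive Bayes as the generative classifier and logistic regression as the discriminative one, and runs on synthetic data; its accuracies depend on the data and should not be read as a general verdict:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, split into train and test sets.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Generative classifier: models p(x | y) and p(y), classifies via Bayes' rule.
generative = GaussianNB().fit(X_train, y_train)

# Discriminative classifier: models p(y | x) directly.
discriminative = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("naive Bayes accuracy:        ", generative.score(X_test, y_test))
print("logistic regression accuracy:", discriminative.score(X_test, y_test))
```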
Deep generative models
With the rise of deep learning, a new family of methods called deep generative models (DGMs)[13][14] has emerged, formed by combining generative models and deep neural networks. An increase in the scale of the neural networks is typically accompanied by an increase in the scale of the training data, both of which are required for good performance.[15]
Popular DGMs include variational autoencoders (VAEs), generative adversarial networks (GANs), and auto-regressive models. There has recently been a trend toward building very large deep generative models.[16] For example, GPT-3 and its precursor GPT-2[17] are auto-regressive neural language models containing billions of parameters; BigGAN[18] and VQ-VAE,[19] used for image generation, can have hundreds of millions of parameters; and Jukebox is a very large generative model for musical audio that contains billions of parameters.[20]
Types
Generative models
Types of generative models are:
- Gaussian mixture model (and other types of mixture model)
- Hidden Markov model
- Probabilistic context-free grammar
- Bayesian network (e.g. naive Bayes, autoregressive model)
- Averaged one-dependence estimators
- Latent Dirichlet allocation
- Boltzmann machine (e.g. Restricted Boltzmann machine, Deep belief network)
- Variational autoencoder
- Generative adversarial network
- Flow-based generative model
- Energy-based model
- Diffusion model
If the observed data are truly sampled from the generative model, then fitting the parameters of the generative model to maximize the data likelihood is a common method. However, since most statistical models are only approximations to the true distribution, if the model's application is to infer about a subset of variables conditional on known values of others, then it can be argued that the approximation makes more assumptions than are necessary to solve the problem at hand. In such cases, it can be more accurate to model the conditional density functions directly using a discriminative model (see below), although application-specific details will ultimately dictate which approach is most suitable in any particular case.
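As an illustration of maximum-likelihood fitting of a generative classifier, the following sketch (a hypothetical, minimal implementation, not a reference one) fits class priors and per-class diagonal Gaussians in closed form and classifies by maximizing log P(x ∣ y) + log P(y):

```python
import numpy as np

def fit_generative(X, y):
    """Maximum-likelihood fit of a simple generative classifier:
    class priors P(Y) and a per-class Gaussian (diagonal covariance) for P(X | Y)."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])           # MLE of P(Y = c)
    means = np.array([X[y == c].mean(axis=0) for c in classes])     # MLE of class means
    variances = np.array([X[y == c].var(axis=0) for c in classes])  # MLE of class variances
    return classes, priors, means, variances

def predict(X, classes, priors, means, variances):
    """Pick, for each row of X, the class maximizing log P(x | y = c) + log P(y = c)."""
    scores = []
    for p, m, v in zip(priors, means, variances):
        log_lik = -0.5 * (np.log(2 * np.pi * v) + (X - m) ** 2 / v).sum(axis=1)
        scores.append(log_lik + np.log(p))
    return classes[np.argmax(np.stack(scores, axis=1), axis=1)]

# Tiny usage example on made-up data: two well-separated Gaussian clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
params = fit_generative(X, y)
print((predict(X, *params) == y).mean())   # training accuracy of the fitted model
```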
Discriminative models
- k-nearest neighbors algorithm
- Logistic regression
- Support vector machine
- Decision tree learning
- Random forest
- Maximum-entropy Markov models
- Conditional random fields
Examples
Simple example
Suppose the input data is x ∈ {1, 2}, the set of labels for x is y ∈ {0, 1}, and there are the following 4 data points: (x, y) = {(1, 0), (1, 1), (2, 0), (2, 1)}.
For the above data, estimating the joint probability distribution p(x, y) from the empirical measure gives the following:

| | y = 0 | y = 1 |
|---|---|---|
| x = 1 | 1/4 | 1/4 |
| x = 2 | 1/4 | 1/4 |
while p(y | x) will be the following:

| | y = 0 | y = 1 |
|---|---|---|
| x = 1 | 1/2 | 1/2 |
| x = 2 | 1/2 | 1/2 |
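Both tables can be reproduced directly from the four data points. The short sketch below computes the empirical joint p(x, y) and the conditional p(y | x) by counting:

```python
from collections import Counter

# The article's four data points.
data = [(1, 0), (1, 1), (2, 0), (2, 1)]

counts = Counter(data)
n = len(data)

# Empirical joint distribution p(x, y): each cell is 1/4, as in the first table.
p_xy = {xy: c / n for xy, c in counts.items()}

# Conditional p(y | x) = p(x, y) / p(x): each cell is 1/2, as in the second table.
p_x = Counter(x for x, _ in data)
p_y_given_x = {(x, y): p_xy[(x, y)] / (p_x[x] / n) for (x, y) in p_xy}

print(p_xy)         # {(1, 0): 0.25, (1, 1): 0.25, (2, 0): 0.25, (2, 1): 0.25}
print(p_y_given_x)  # every entry 0.5
```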
Text generation
Shannon (1948) gives an example in which a table of frequencies of English word pairs is used to generate a sentence beginning with "representing and speedily is an good", which is not proper English but which approximates it increasingly well as the table is extended from word pairs to word triplets and so on.
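A minimal sketch of this idea, assuming a toy word list in place of Shannon's word-pair frequency tables, builds a bigram table and generates text by repeatedly sampling a successor of the current word:

```python
import random
from collections import defaultdict

# Toy corpus standing in for a table of English word-pair frequencies.
corpus = ("the head and in frontal attack on an english writer that "
          "the character of this point is therefore another method").split()

# Bigram table: every observed successor of each word.
bigrams = defaultdict(list)
for w1, w2 in zip(corpus, corpus[1:]):
    bigrams[w1].append(w2)

def generate(seed, length=10):
    """Sample a word sequence by repeatedly drawing an observed successor of the current word."""
    words = [seed]
    for _ in range(length - 1):
        successors = bigrams.get(words[-1])
        if not successors:   # dead end: the current word has no observed successor
            break
        words.append(random.choice(successors))
    return " ".join(words)

print(generate("the"))
```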
External links
- Shannon, C. E. (1948). "A Mathematical Theory of Communication" (PDF). Bell System Technical Journal. 27 (July, October): 379–423, 623–656. doi:10.1002/j.1538-7305.1948.tb01338.x. hdl:10338.dmlcz/101429. Archived from the original (PDF) on 2016-06-06. Retrieved 2016-01-09.
- Mitchell, Tom M. (2015). "3. Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" (PDF). Machine Learning.
- Ng, Andrew Y.; Jordan, Michael I. (2002). "On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes" (PDF). Advances in Neural Information Processing Systems.
- Jebara, Tony (2004). Machine Learning: Discriminative and Generative. The Springer International Series in Engineering and Computer Science. Kluwer Academic (Springer). ISBN 978-1-4020-7647-3.
- Jebara, Tony (2002). Discriminative, generative, and imitative learning (PhD thesis). Massachusetts Institute of Technology. hdl:1721.1/8323. Published as the book above.
References
1. Three leading sources, Ng & Jordan 2002, Jebara 2004, and Mitchell 2015, give different divisions and definitions. Ng, Andrew Y.; Jordan, Michael I. (2002). "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes" (PDF). Advances in Neural Information Processing Systems. http://robotics.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf
2. Ng & Jordan 2002: "Generative classifiers learn a model of the joint probability, p(x, y), of the inputs x and the label y, and make their predictions by using Bayes rules to calculate p(y | x), and then picking the most likely label y."
3. Mitchell 2015: "We can use Bayes rule as the basis for designing learning algorithms (function approximators), as follows: Given that we wish to learn some target function f: X → Y, or equivalently, P(Y | X), we use the training data to learn estimates of P(X | Y) and P(Y). New X examples can then be classified using these estimated probability distributions, plus Bayes rule. This type of classifier is called a generative classifier, because we can view the distribution P(X | Y) as describing how to generate random instances X conditioned on the target attribute Y." Mitchell, Tom M. (2015). "3. Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" (PDF). Machine Learning. https://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
4. Mitchell 2015: "Logistic Regression is a function approximation algorithm that uses training data to directly estimate P(Y | X), in contrast to Naive Bayes. In this sense, Logistic Regression is often referred to as a discriminative classifier because we can view the distribution P(Y | X) as directly discriminating the value of the target value Y for any given instance X."
5. Jebara 2004, 2.4 Discriminative Learning: "This distinction between conditional learning and discriminative learning is not currently a well-established convention in the field." Jebara, Tony (2004). Machine Learning: Discriminative and Generative. The Springer International Series in Engineering and Computer Science. Kluwer Academic (Springer). ISBN 978-1-4020-7647-3. https://www.springer.com/us/book/9781402076473
6. Ng & Jordan 2002: "Discriminative classifiers model the posterior p(y | x) directly, or learn a direct map from inputs x to the class labels."
7. Mitchell 2015; see note 3.
8. Mitchell 2015; see note 4.
9. Mitchell 2015; see note 3.
10. Mitchell 2015; see note 4.
11. Ng & Jordan 2002.
12. Bishop, C. M.; Lasserre, J. (24 September 2007). "Generative or Discriminative? Getting the best of both worlds". In Bernardo, J. M. (ed.). Bayesian Statistics 8: Proceedings of the Eighth Valencia International Meeting, June 2-6, 2006. Oxford University Press. pp. 3–23. ISBN 978-0-19-921465-5.
13. "Scaling up—researchers advance large-scale deep generative models". Microsoft. April 9, 2020. https://www.microsoft.com/en-us/research/blog/a-deep-generative-model-trifecta-three-advances-that-work-towards-harnessing-large-scale-power/
14. "Generative Models". OpenAI. June 16, 2016. https://openai.com/blog/generative-models/
15. Kaplan, Jared; McCandlish, Sam; Henighan, Tom; Brown, Tom B.; Chess, Benjamin; Child, Rewon; Gray, Scott; Radford, Alec; Wu, Jeffrey; Amodei, Dario (2020). "Scaling Laws for Neural Language Models". arXiv:2001.08361 [stat.ML].
16. "Scaling up—researchers advance large-scale deep generative models". Microsoft. April 9, 2020; see note 13.
17. "Better Language Models and Their Implications". OpenAI. February 14, 2019. https://openai.com/blog/better-language-models/
18. Brock, Andrew; Donahue, Jeff; Simonyan, Karen (2018). "Large Scale GAN Training for High Fidelity Natural Image Synthesis". arXiv:1809.11096 [cs.LG].
19. Razavi, Ali; van den Oord, Aaron; Vinyals, Oriol (2019). "Generating Diverse High-Fidelity Images with VQ-VAE-2". arXiv:1906.00446 [cs.LG].
20. "Jukebox". OpenAI. April 30, 2020. https://openai.com/blog/jukebox/