Let there be two spaces, the space of images Ω X {\displaystyle \Omega _{X}} and the space of labels Ω Y {\displaystyle \Omega _{Y}} . The space of labels is finite.
Let p g e n {\displaystyle p_{gen}} be a probability distribution over Ω X {\displaystyle \Omega _{X}} that we wish to judge.
Let a discriminator be a function of type p d i s : Ω X → M ( Ω Y ) {\displaystyle p_{dis}:\Omega _{X}\to M(\Omega _{Y})} where M ( Ω Y ) {\displaystyle M(\Omega _{Y})} is the set of all probability distributions on Ω Y {\displaystyle \Omega _{Y}} . For any image x {\displaystyle x} , and any label y {\displaystyle y} , let p d i s ( y | x ) {\displaystyle p_{dis}(y|x)} be the probability that image x {\displaystyle x} has label y {\displaystyle y} , according to the discriminator. It is usually implemented as an Inception-v3 network trained on ImageNet.
The Inception Score of p g e n {\displaystyle p_{gen}} relative to p d i s {\displaystyle p_{dis}} is I S ( p g e n , p d i s ) := exp ( E x ∼ p g e n [ D K L ( p d i s ( ⋅ | x ) ‖ ∫ p d i s ( ⋅ | x ) p g e n ( x ) d x ) ] ) {\displaystyle IS(p_{gen},p_{dis}):=\exp \left(\mathbb {E} _{x\sim p_{gen}}\left[D_{KL}\left(p_{dis}(\cdot |x)\|\int p_{dis}(\cdot |x)p_{gen}(x)dx\right)\right]\right)} Equivalent rewrites include ln I S ( p g e n , p d i s ) := E x ∼ p g e n [ D K L ( p d i s ( ⋅ | x ) ‖ E x ∼ p g e n [ p d i s ( ⋅ | x ) ] ) ] {\displaystyle \ln IS(p_{gen},p_{dis}):=\mathbb {E} _{x\sim p_{gen}}\left[D_{KL}\left(p_{dis}(\cdot |x)\|\mathbb {E} _{x\sim p_{gen}}[p_{dis}(\cdot |x)]\right)\right]} ln I S ( p g e n , p d i s ) := H [ E x ∼ p g e n [ p d i s ( ⋅ | x ) ] ] − E x ∼ p g e n [ H [ p d i s ( ⋅ | x ) ] ] {\displaystyle \ln IS(p_{gen},p_{dis}):=H[\mathbb {E} _{x\sim p_{gen}}[p_{dis}(\cdot |x)]]-\mathbb {E} _{x\sim p_{gen}}[H[p_{dis}(\cdot |x)]]} ln I S {\displaystyle \ln IS} is nonnegative by Jensen's inequality.
Pseudocode:
INPUT discriminator p d i s {\displaystyle p_{dis}} . INPUT generator g {\displaystyle g} . Sample images x i {\displaystyle x_{i}} from generator. Compute p d i s ( ⋅ | x i ) {\displaystyle p_{dis}(\cdot |x_{i})} , the probability distribution over labels conditional on image x i {\displaystyle x_{i}} . Sum up the results to obtain p ^ {\displaystyle {\hat {p}}} , an empirical estimate of ∫ p d i s ( ⋅ | x ) p g e n ( x ) d x {\displaystyle \int p_{dis}(\cdot |x)p_{gen}(x)dx} . Sample more images x i {\displaystyle x_{i}} from generator, and for each, compute D K L ( p d i s ( ⋅ | x i ) ‖ p ^ ) {\displaystyle D_{KL}\left(p_{dis}(\cdot |x_{i})\|{\hat {p}}\right)} . Average the results, and take its exponential. RETURN the result.
INPUT discriminator p d i s {\displaystyle p_{dis}} .
INPUT generator g {\displaystyle g} .
Sample images x i {\displaystyle x_{i}} from generator.
Compute p d i s ( ⋅ | x i ) {\displaystyle p_{dis}(\cdot |x_{i})} , the probability distribution over labels conditional on image x i {\displaystyle x_{i}} .
Sum up the results to obtain p ^ {\displaystyle {\hat {p}}} , an empirical estimate of ∫ p d i s ( ⋅ | x ) p g e n ( x ) d x {\displaystyle \int p_{dis}(\cdot |x)p_{gen}(x)dx} .
Sample more images x i {\displaystyle x_{i}} from generator, and for each, compute D K L ( p d i s ( ⋅ | x i ) ‖ p ^ ) {\displaystyle D_{KL}\left(p_{dis}(\cdot |x_{i})\|{\hat {p}}\right)} .
Average the results, and take its exponential.
RETURN the result.
A higher inception score is interpreted as "better", as it means that p g e n {\displaystyle p_{gen}} is a "sharp and distinct" collection of pictures.
ln I S ( p g e n , p d i s ) ∈ [ 0 , ln N ] {\displaystyle \ln IS(p_{gen},p_{dis})\in [0,\ln N]} , where N {\displaystyle N} is the total number of possible labels.
ln I S ( p g e n , p d i s ) = 0 {\displaystyle \ln IS(p_{gen},p_{dis})=0} iff for almost all x ∼ p g e n {\displaystyle x\sim p_{gen}} p d i s ( ⋅ | x ) = ∫ p d i s ( ⋅ | x ) p g e n ( x ) d x {\displaystyle p_{dis}(\cdot |x)=\int p_{dis}(\cdot |x)p_{gen}(x)dx} That means p g e n {\displaystyle p_{gen}} is completely "indistinct". That is, for any image x {\displaystyle x} sampled from p g e n {\displaystyle p_{gen}} , discriminator returns exactly the same label predictions p d i s ( ⋅ | x ) {\displaystyle p_{dis}(\cdot |x)} .
The highest inception score N {\displaystyle N} is achieved if and only if the two conditions are both true:
Salimans, Tim; Goodfellow, Ian; Zaremba, Wojciech; Cheung, Vicki; Radford, Alec; Chen, Xi; Chen, Xi (2016). "Improved Techniques for Training GANs". Advances in Neural Information Processing Systems. 29. Curran Associates, Inc. arXiv:1606.03498. https://proceedings.neurips.cc/paper/2016/hash/8a3363abe792db2d8761d6403605aeb7-Abstract.html ↩
Frolov, Stanislav; Hinz, Tobias; Raue, Federico; Hees, Jörn; Dengel, Andreas (December 2021). "Adversarial text-to-image synthesis: A review". Neural Networks. 144: 187–209. arXiv:2101.09983. doi:10.1016/j.neunet.2021.07.019. PMID 34500257. S2CID 231698782. https://doi.org/10.1016%2Fj.neunet.2021.07.019 ↩
Borji, Ali (2022). "Pros and cons of GAN evaluation measures: New developments". Computer Vision and Image Understanding. 215: 103329. arXiv:2103.09396. doi:10.1016/j.cviu.2021.103329. S2CID 232257836. https://linkinghub.elsevier.com/retrieve/pii/S1077314221001685 ↩