Let $\mathcal{C}$ be a family of sets (also called a set family, a collection of sets, or a set of sets) and let $X$ be a set. Their intersection is defined as the following set family:

$$\mathcal{C} \cap X := \{\, C \cap X \mid C \in \mathcal{C} \,\}.$$

Here, typically, $X$ and each $C \in \mathcal{C}$ are subsets of a common "universe" of possibilities $U$ in which the intersections take place.

We say that a set $X$ is shattered by $\mathcal{C}$ if $\mathcal{P}(X) = \mathcal{C} \cap X$, i.e. the set of intersections contains (and hence is equal to) all the subsets of $X$. For finite sets $X$ this is equivalent to

$$|\mathcal{C} \cap X| = 2^{|X|}.$$

The VC dimension $D$ of $\mathcal{C}$ is the cardinality of the largest set that is shattered by $\mathcal{C}$. If arbitrarily large sets can be shattered, the VC dimension of $\mathcal{C}$ is $\infty$.
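For finite families these definitions can be checked directly by enumeration. The following Python sketch (the function names are illustrative, not from any standard library) brute-forces shattering and the VC dimension:

```python
from itertools import chain, combinations

def powerset(xs):
    """All subsets of xs, as frozensets."""
    xs = list(xs)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))]

def is_shattered(family, X):
    """X is shattered by `family` iff {C ∩ X : C in family} equals the power set of X."""
    intersections = {frozenset(C) & frozenset(X) for C in family}
    return intersections == set(powerset(X))

def vc_dimension(family, universe):
    """Cardinality of the largest subset of `universe` shattered by `family` (brute force)."""
    return max(len(X) for X in powerset(universe) if is_shattered(family, X))

# Example: the family of prefixes {}, {0}, {0,1}, {0,1,2}, {0,1,2,3}.
# Every singleton is shattered, but no pair {i, j} with i < j is
# (no prefix contains j without also containing i), so the VC dimension is 1.
family = [set(range(k)) for k in range(5)]
print(vc_dimension(family, {0, 1, 2, 3}))  # -> 1
```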
A binary classification model $f$ with some parameter vector $\theta$ is said to shatter a set of generally positioned data points $(x_1, x_2, \ldots, x_n)$ if, for every assignment of labels to those points, there exists a $\theta$ such that the model $f$ makes no errors when evaluating that set of data points.

The VC dimension of a model $f$ is the maximum number of points that can be arranged so that $f$ shatters them. More formally, it is the maximum cardinality $D$ such that there exists a generally positioned set of $D$ data points that can be shattered by $f$.
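As a concrete illustration (a standard textbook example, not taken from the text above), consider oriented threshold classifiers on the real line, $h_{\theta,s}(x) = s \cdot \operatorname{sign}(x - \theta)$ with $s \in \{-1, +1\}$: they shatter any two distinct points but no set of three, so their VC dimension is 2. A brute-force Python check:

```python
import itertools
import numpy as np

def predict(x, theta, s):
    """Oriented threshold classifier on the real line: h(x) = s * sign(x - theta)."""
    return s * np.sign(x - theta)

def shatters(points, thetas, signs=(+1, -1)):
    """True if every +/-1 labeling of `points` is realized by some (theta, s)."""
    points = np.asarray(points, dtype=float)
    for labels in itertools.product([-1, 1], repeat=len(points)):
        realizable = any(
            np.array_equal(predict(points, th, s), np.array(labels))
            for th in thetas for s in signs)
        if not realizable:
            return False
    return True

def candidates(pts):
    """Thresholds covering every distinct behaviour: midpoints plus values outside the range."""
    pts = sorted(pts)
    mids = [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    return [pts[0] - 1] + mids + [pts[-1] + 1]

pts2 = [0.0, 1.0]
pts3 = [0.0, 1.0, 2.0]
print(shatters(pts2, candidates(pts2)))  # True  -> two points can be shattered
print(shatters(pts3, candidates(pts3)))  # False -> the labeling (+, -, +) is unreachable
```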
The VC dimension can predict a probabilistic upper bound on the test error of a classification model. Vapnik[3] proved that, with probability $1 - \eta$, the test error (i.e., the risk with the 0–1 loss function) on data drawn i.i.d. from the same distribution as the training set satisfies

$$\text{test error} \;\leq\; \text{training error} + \sqrt{\frac{1}{N}\left[D\left(\log\frac{2N}{D} + 1\right) - \log\frac{\eta}{4}\right]},$$

where $D$ is the VC dimension of the classification model, $0 < \eta \leq 1$, and $N$ is the size of the training set. The formula is valid only when $D \ll N$; when $D$ is larger, the test error may be much higher than the training error, due to overfitting.
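For example, plugging representative numbers into the bound above (the specific values of D, N and η here are only illustrative):

```python
import math

def vapnik_bound(train_error, D, N, eta):
    """Upper bound on the test error that holds with probability 1 - eta
    (only meaningful in the regime D << N)."""
    return train_error + math.sqrt((D * (math.log(2 * N / D) + 1)
                                    - math.log(eta / 4)) / N)

# D = 10, N = 10,000 training samples, confidence 95% (eta = 0.05):
print(vapnik_bound(train_error=0.05, D=10, N=10_000, eta=0.05))  # ≈ 0.145
```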
The VC dimension also appears in sample-complexity bounds. A space of binary functions with VC dimension $D$ can be learned with[4]: 73 

$$N = \Theta\!\left(\frac{D + \ln\frac{1}{\delta}}{\varepsilon}\right)$$

samples, where $\varepsilon$ is the learning error and $\delta$ is the failure probability. Thus, the sample complexity is a linear function of the VC dimension of the hypothesis space.
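A small numeric sketch of this linear dependence (the constant hidden by the Θ(·) notation is unspecified; c = 1 below is a placeholder, not a value from the cited source):

```python
import math

def sample_size(D, epsilon, delta, c=1.0):
    """Sample size of order (D + ln(1/delta)) / epsilon, up to the unknown constant c."""
    return math.ceil(c * (D + math.log(1 / delta)) / epsilon)

for D in (5, 10, 20):
    print(D, sample_size(D, epsilon=0.01, delta=0.05))
# prints: 5 800, 10 1300, 20 2300 -- N grows linearly in D
```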
The VC dimension is one of the critical parameters in the size of ε-nets, which determines the complexity of approximation algorithms based on them; range sets without finite VC dimension may not have finite ε-nets at all.
A finite projective plane of order n is a collection of n² + n + 1 sets (called "lines") over n² + n + 1 elements (called "points"), for which:

- Each line contains exactly n + 1 points.
- Each point is contained in exactly n + 1 lines.
- Every two distinct points are contained together in exactly one line.
- Every two distinct lines intersect in exactly one point.
The VC dimension of a finite projective plane is 2.[8]
Proof: (a) For each pair of distinct points, there is one line that contains both of them, there are lines that contain only one of them, and there are lines that contain neither of them, so every set of size 2 is shattered. (b) For any three distinct points, if there is a line x that contains all three, then there is no line y that contains exactly two of them (since x and y would then intersect in two points, contrary to the definition of a projective plane). Hence, no set of size 3 is shattered.
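The smallest case, the projective plane of order 2 (the Fano plane), can also be checked by brute force; the following Python sketch builds its seven lines from the difference set {0, 1, 3} modulo 7 and confirms a VC dimension of 2:

```python
from itertools import combinations

# Fano plane: 7 points, 7 lines of 3 points each, built from the
# perfect difference set {0, 1, 3} modulo 7.
points = range(7)
lines = [frozenset((i % 7, (i + 1) % 7, (i + 3) % 7)) for i in range(7)]

def is_shattered(family, X):
    """True if the intersections {L ∩ X} cover every subset of X."""
    X = frozenset(X)
    subsets = {frozenset(c) for r in range(len(X) + 1) for c in combinations(X, r)}
    return {line & X for line in family} >= subsets

vc = max(len(X) for r in range(8) for X in combinations(points, r)
         if is_shattered(lines, frozenset(X)))
print(vc)  # -> 2: every pair of points is shattered, no triple is
```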
Suppose we have a base class $B$ of simple classifiers, whose VC dimension is $D$.
We can construct a more powerful classifier by combining several different classifiers from $B$; this technique is called boosting. Formally, given $T$ classifiers $h_1, \ldots, h_T \in B$ and a weight vector $w \in \mathbb{R}^T$, we can define the following classifier:

$$f(x) = \operatorname{sign}\left(\sum_{t=1}^{T} w_t\, h_t(x)\right).$$
The VC dimension of the set of all such classifiers (over all selections of $T$ classifiers from $B$ and all weight vectors in $\mathbb{R}^T$), assuming $T, D \geq 3$, is at most[9]: 108–109 

$$T(D+1)\bigl(3\log(T(D+1)) + 2\bigr).$$
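A minimal Python sketch of such a combined classifier, using decision stumps as an illustrative base class B (the thresholds and weights below are arbitrary examples):

```python
def boosted_classifier(base_classifiers, weights):
    """f(x) = sign( sum_t w_t * h_t(x) ), where each h_t maps x to +1 or -1."""
    def f(x):
        total = sum(w * h(x) for h, w in zip(base_classifiers, weights))
        return 1 if total >= 0 else -1
    return f

# Base classifiers: decision stumps h_t(x) = +1 if x > threshold else -1.
stumps = [lambda x, t=t: 1 if x > t else -1 for t in (0.2, 0.5, 0.8)]
weights = [0.4, 0.35, 0.25]
f = boosted_classifier(stumps, weights)
print([f(x) for x in (0.1, 0.3, 0.6, 0.9)])  # -> [-1, -1, 1, 1]
```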
A neural network is described by a directed acyclic graph G(V,E), where:

- V is the set of nodes; each node is a simple computation cell.
- E is the set of edges; each edge has a weight.
- The input of the network is represented by the sources of the graph, i.e. the nodes with no incoming edges.
- The output of the network is represented by the sinks of the graph, i.e. the nodes with no outgoing edges.
- Each intermediate node receives as input a weighted sum of the outputs of the nodes at its incoming edges, and outputs an increasing function of this input, called its activation function (for example, the sign function or the sigmoid function).
The VC dimension of a neural network is bounded as follows:[10]: 234–235 

- If the activation function is the sign function and the weights are general, then the VC dimension is at most O(|E| · log |E|).
- If the activation function is the sigmoid function and the weights are general, then the VC dimension is at least Ω(|E|²) and at most O(|E|² · |V|²).
- If the weights come from a finite family (e.g., the weights are binary), then the VC dimension is at most O(|E|).
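A minimal Python sketch of this graph description, with illustrative node names, weights, and a sign activation (none of which come from the cited source):

```python
import math

# A tiny feedforward network as a DAG: sources x1, x2 are inputs, "out" is the sink.
edges = {                      # (source, target): weight
    ("x1", "h1"): 1.0, ("x2", "h1"): -1.0,
    ("x1", "h2"): 0.5, ("x2", "h2"): 0.5,
    ("h1", "out"): 1.0, ("h2", "out"): 1.0,
}

def sign(v):
    return 1.0 if v >= 0 else -1.0

def forward(inputs, topo_order=("h1", "h2", "out")):
    """Evaluate each non-input node, in topological order, as sign(weighted sum of its in-edges)."""
    values = dict(inputs)
    for node in topo_order:
        values[node] = sign(sum(w * values[src]
                                for (src, dst), w in edges.items() if dst == node))
    return values["out"]

print(forward({"x1": 1.0, "x2": -1.0}))   # -> 1.0

# With sign activations, the VC dimension is at most of order |E| log |E|.
E = len(edges)
print(E, round(E * math.log(E), 1))       # -> 6 10.8
```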
The VC dimension is defined for spaces of binary functions (functions to {0,1}). Several generalizations have been suggested for spaces of non-binary functions.
Vapnik, V. N.; Chervonenkis, A. Ya. (1971). "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Theory of Probability & Its Applications. 16 (2): 264. doi:10.1137/1116025. This is an English translation, by B. Seckler, of the Russian paper: "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Dokl. Akad. Nauk. 181 (4): 781. 1968. The translation was reproduced as: Vapnik, V. N.; Chervonenkis, A. Ya. (2015). "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Measures of Complexity. p. 11. doi:10.1007/978-3-319-21852-6_3. ISBN 978-3-319-21851-9.
Mohri, Mehryar; Rostamizadeh, Afshin; Talwalkar, Ameet (2012). Foundations of Machine Learning. MIT Press. ISBN 9780262018258.
Vapnik, Vladimir (2000). The Nature of Statistical Learning Theory. Springer.
Shalev-Shwartz, Shai; Ben-David, Shai (2014). Understanding Machine Learning – from Theory to Algorithms. Cambridge University Press. ISBN 9781107057135.
Alon, N.; Haussler, D.; Welzl, E. (1987). "Partitioning and geometric embedding of range spaces of finite Vapnik-Chervonenkis dimension". Proceedings of the Third Annual Symposium on Computational Geometry – SCG '87. p. 331. doi:10.1145/41958.41994. ISBN 978-0897912310. S2CID 7394360.
Natarajan, B. K. (1989). "On Learning Sets and Functions". Machine Learning. 4: 67–97. doi:10.1007/BF00114804.
Brukhim, Nataly; Carmon, Daniel; Dinur, Irit; Moran, Shay; Yehudayoff, Amir (2022). "A Characterization of Multiclass Learnability". 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS). arXiv:2203.01550.
Pollard, D. (1984). Convergence of Stochastic Processes. Springer. ISBN 9781461252542.
Anthony, Martin; Bartlett, Peter L. (2009). Neural Network Learning: Theoretical Foundations. ISBN 9780521118620.
Morgenstern, Jamie H.; Roughgarden, Tim (2015). "On the Pseudo-Dimension of Nearly Optimal Auctions". NIPS. arXiv:1506.03684. Bibcode:2015arXiv150603684M. http://papers.nips.cc/paper/5766-on-the-pseudo-dimension-of-nearly-optimal-auctions
Karpinski, Marek; Macintyre, Angus (February 1997). "Polynomial Bounds for VC Dimension of Sigmoidal and General Pfaffian Neural Networks". Journal of Computer and System Sciences. 54 (1): 169–176. doi:10.1006/jcss.1997.1477.