Let $\mathcal{C}$ be a family of sets (also called a set family, a collection of sets, or a set of sets) and let $X$ be a set. Their intersection is defined as the following set family:

$$\mathcal{C} \cap X := \{\, C \cap X \mid C \in \mathcal{C} \,\}.$$

Here, typically, $X$ and each $C \in \mathcal{C}$ are subsets of a common "universe" of possibilities $U$ in which the intersections take place.

We say that a set $X$ is shattered by $\mathcal{C}$ if $\mathcal{P}(X) = \mathcal{C} \cap X$, i.e. the set of intersections contains (and hence is equal to) all the subsets of $X$. For finite sets $X$ this is equivalent to

$$|\mathcal{C} \cap X| = 2^{|X|}.$$

The VC dimension $D$ of $\mathcal{C}$ is the cardinality of the largest set that is shattered by $\mathcal{C}$. If arbitrarily large sets can be shattered, the VC dimension of $\mathcal{C}$ is $\infty$.
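For finite families these definitions can be checked directly by enumeration. The following Python sketch (the function names are illustrative, not from any standard library) brute-forces shattering and the VC dimension:

```python
from itertools import chain, combinations

def powerset(xs):
    """All subsets of xs, as frozensets."""
    xs = list(xs)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))]

def is_shattered(family, X):
    """X is shattered by `family` iff {C ∩ X : C in family} equals the power set of X."""
    intersections = {frozenset(C) & frozenset(X) for C in family}
    return intersections == set(powerset(X))

def vc_dimension(family, universe):
    """Cardinality of the largest subset of `universe` shattered by `family` (brute force)."""
    return max(len(X) for X in powerset(universe) if is_shattered(family, X))

# Example: the family of prefixes {}, {0}, {0,1}, {0,1,2}, {0,1,2,3}.
# Every singleton is shattered, but no pair {i, j} with i < j is
# (no prefix contains j without also containing i), so the VC dimension is 1.
family = [set(range(k)) for k in range(5)]
print(vc_dimension(family, {0, 1, 2, 3}))  # -> 1
```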
A binary classification model $f$ with some parameter vector $\theta$ is said to shatter a set of generally positioned data points $(x_1, x_2, \ldots, x_n)$ if, for every assignment of labels to those points, there exists a $\theta$ such that the model $f$ makes no errors when evaluating that set of data points.

The VC dimension of a model $f$ is the maximum number of points that can be arranged so that $f$ shatters them. More formally, it is the maximum cardinality $D$ such that there exists a generally positioned set of $D$ data points that can be shattered by $f$.
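As a concrete illustration (a standard textbook example, not taken from the text above), consider oriented threshold classifiers on the real line, $h_{\theta,s}(x) = s \cdot \operatorname{sign}(x - \theta)$ with $s \in \{-1, +1\}$: they shatter any two distinct points but no set of three, so their VC dimension is 2. A brute-force Python check:

```python
import itertools
import numpy as np

def predict(x, theta, s):
    """Oriented threshold classifier on the real line: h(x) = s * sign(x - theta)."""
    return s * np.sign(x - theta)

def shatters(points, thetas, signs=(+1, -1)):
    """True if every +/-1 labeling of `points` is realized by some (theta, s)."""
    points = np.asarray(points, dtype=float)
    for labels in itertools.product([-1, 1], repeat=len(points)):
        realizable = any(
            np.array_equal(predict(points, th, s), np.array(labels))
            for th in thetas for s in signs)
        if not realizable:
            return False
    return True

def candidates(pts):
    """Thresholds covering every distinct behaviour: midpoints plus values outside the range."""
    pts = sorted(pts)
    mids = [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    return [pts[0] - 1] + mids + [pts[-1] + 1]

pts2 = [0.0, 1.0]
pts3 = [0.0, 1.0, 2.0]
print(shatters(pts2, candidates(pts2)))  # True  -> two points can be shattered
print(shatters(pts3, candidates(pts3)))  # False -> the labeling (+, -, +) is unreachable
```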
The VC dimension can predict a probabilistic upper bound on the test error of a classification model. Vapnik[3] proved that, with probability $1 - \eta$, the test error (i.e., the risk with the 0–1 loss function) on data drawn i.i.d. from the same distribution as the training set satisfies

$$\text{test error} \;\leq\; \text{training error} + \sqrt{\frac{1}{N}\left[D\left(\log\frac{2N}{D} + 1\right) - \log\frac{\eta}{4}\right]},$$

where $D$ is the VC dimension of the classification model, $0 < \eta \leq 1$, and $N$ is the size of the training set. The formula is valid only when $D \ll N$; when $D$ is larger, the test error may be much higher than the training error, due to overfitting.
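For example, plugging representative numbers into the bound above (the specific values of D, N and η here are only illustrative):

```python
import math

def vapnik_bound(train_error, D, N, eta):
    """Upper bound on the test error that holds with probability 1 - eta
    (only meaningful in the regime D << N)."""
    return train_error + math.sqrt((D * (math.log(2 * N / D) + 1)
                                    - math.log(eta / 4)) / N)

# D = 10, N = 10,000 training samples, confidence 95% (eta = 0.05):
print(vapnik_bound(train_error=0.05, D=10, N=10_000, eta=0.05))  # ≈ 0.145
```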
The VC dimension also appears in sample-complexity bounds. A space of binary functions with VC dimension $D$ can be learned with[4]: 73 

$$N = \Theta\!\left(\frac{D + \ln\frac{1}{\delta}}{\varepsilon}\right)$$

samples, where $\varepsilon$ is the learning error and $\delta$ is the failure probability. Thus, the sample complexity is a linear function of the VC dimension of the hypothesis space.
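A small numeric sketch of this linear dependence (the constant hidden by the Θ(·) notation is unspecified; c = 1 below is a placeholder, not a value from the cited source):

```python
import math

def sample_size(D, epsilon, delta, c=1.0):
    """Sample size of order (D + ln(1/delta)) / epsilon, up to the unknown constant c."""
    return math.ceil(c * (D + math.log(1 / delta)) / epsilon)

for D in (5, 10, 20):
    print(D, sample_size(D, epsilon=0.01, delta=0.05))
# prints: 5 800, 10 1300, 20 2300 -- N grows linearly in D
```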
The VC dimension is one of the critical parameters in the size of ε-nets, which determines the complexity of approximation algorithms based on them; range sets without finite VC dimension may not have finite ε-nets at all.
A finite projective plane of order n is a collection of n² + n + 1 sets (called "lines") over n² + n + 1 elements (called "points"), for which:

- Each line contains exactly n + 1 points.
- Each point is contained in exactly n + 1 lines.
- Every two distinct points are contained together in exactly one line.
- Every two distinct lines intersect in exactly one point.
The VC dimension of a finite projective plane is 2.[8]
Proof: (a) For each pair of distinct points, there is one line that contains both of them, there are lines that contain only one of them, and there are lines that contain neither of them, so every set of size 2 is shattered. (b) For any three distinct points, if there is a line x that contains all three, then there is no line y that contains exactly two of them (since x and y would then intersect in two points, contrary to the definition of a projective plane). Hence, no set of size 3 is shattered.
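The smallest case, the projective plane of order 2 (the Fano plane), can also be checked by brute force; the following Python sketch builds its seven lines from the difference set {0, 1, 3} modulo 7 and confirms a VC dimension of 2:

```python
from itertools import combinations

# Fano plane: 7 points, 7 lines of 3 points each, built from the
# perfect difference set {0, 1, 3} modulo 7.
points = range(7)
lines = [frozenset((i % 7, (i + 1) % 7, (i + 3) % 7)) for i in range(7)]

def is_shattered(family, X):
    """True if the intersections {L ∩ X} cover every subset of X."""
    X = frozenset(X)
    subsets = {frozenset(c) for r in range(len(X) + 1) for c in combinations(X, r)}
    return {line & X for line in family} >= subsets

vc = max(len(X) for r in range(8) for X in combinations(points, r)
         if is_shattered(lines, frozenset(X)))
print(vc)  # -> 2: every pair of points is shattered, no triple is
```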
Suppose we have a base class $B$ of simple classifiers, whose VC dimension is $D$.
We can construct a more powerful classifier by combining several different classifiers from $B$; this technique is called boosting. Formally, given $T$ classifiers $h_1, \ldots, h_T \in B$ and a weight vector $w \in \mathbb{R}^T$, we can define the following classifier:

$$f(x) = \operatorname{sign}\left(\sum_{t=1}^{T} w_t\, h_t(x)\right).$$
The VC dimension of the set of all such classifiers (over all selections of $T$ classifiers from $B$ and all weight vectors in $\mathbb{R}^T$), assuming $T, D \geq 3$, is at most[9]: 108–109 

$$T(D+1)\bigl(3\log(T(D+1)) + 2\bigr).$$
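A minimal Python sketch of such a combined classifier, using decision stumps as an illustrative base class B (the thresholds and weights below are arbitrary examples):

```python
def boosted_classifier(base_classifiers, weights):
    """f(x) = sign( sum_t w_t * h_t(x) ), where each h_t maps x to +1 or -1."""
    def f(x):
        total = sum(w * h(x) for h, w in zip(base_classifiers, weights))
        return 1 if total >= 0 else -1
    return f

# Base classifiers: decision stumps h_t(x) = +1 if x > threshold else -1.
stumps = [lambda x, t=t: 1 if x > t else -1 for t in (0.2, 0.5, 0.8)]
weights = [0.4, 0.35, 0.25]
f = boosted_classifier(stumps, weights)
print([f(x) for x in (0.1, 0.3, 0.6, 0.9)])  # -> [-1, -1, 1, 1]
```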
A neural network is described by a directed acyclic graph G(V,E), where:

- V is the set of nodes; each node is a simple computation cell.
- E is the set of edges; each edge has a weight.
- The input of the network is represented by the sources of the graph, i.e. the nodes with no incoming edges.
- The output of the network is represented by the sinks of the graph, i.e. the nodes with no outgoing edges.
- Each intermediate node receives as input a weighted sum of the outputs of the nodes at its incoming edges, and outputs an increasing function of this input, called its activation function (for example, the sign function or the sigmoid function).
The VC dimension of a neural network is bounded as follows:[10]: 234–235 

- If the activation function is the sign function and the weights are general, then the VC dimension is at most O(|E| · log |E|).
- If the activation function is the sigmoid function and the weights are general, then the VC dimension is at least Ω(|E|²) and at most O(|E|² · |V|²).
- If the weights come from a finite family (e.g., the weights are binary), then the VC dimension is at most O(|E|).
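A minimal Python sketch of this graph description, with illustrative node names, weights, and a sign activation (none of which come from the cited source):

```python
import math

# A tiny feedforward network as a DAG: sources x1, x2 are inputs, "out" is the sink.
edges = {                      # (source, target): weight
    ("x1", "h1"): 1.0, ("x2", "h1"): -1.0,
    ("x1", "h2"): 0.5, ("x2", "h2"): 0.5,
    ("h1", "out"): 1.0, ("h2", "out"): 1.0,
}

def sign(v):
    return 1.0 if v >= 0 else -1.0

def forward(inputs, topo_order=("h1", "h2", "out")):
    """Evaluate each non-input node, in topological order, as sign(weighted sum of its in-edges)."""
    values = dict(inputs)
    for node in topo_order:
        values[node] = sign(sum(w * values[src]
                                for (src, dst), w in edges.items() if dst == node))
    return values["out"]

print(forward({"x1": 1.0, "x2": -1.0}))   # -> 1.0

# With sign activations, the VC dimension is at most of order |E| log |E|.
E = len(edges)
print(E, round(E * math.log(E), 1))       # -> 6 10.8
```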
The VC dimension is defined for spaces of binary functions (functions to {0,1}). Several generalizations have been suggested for spaces of non-binary functions.
Vapnik, V. N.; Chervonenkis, A. Ya. (1971). "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Theory of Probability & Its Applications. 16 (2): 264. doi:10.1137/1116025. This is an English translation, by B. Seckler, of the Russian paper: "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Dokl. Akad. Nauk. 181 (4): 781. 1968. The translation was reproduced as: Vapnik, V. N.; Chervonenkis, A. Ya. (2015). "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Measures of Complexity. p. 11. doi:10.1007/978-3-319-21852-6_3. ISBN 978-3-319-21851-9.
Mohri, Mehryar; Rostamizadeh, Afshin; Talwalkar, Ameet (2012). Foundations of Machine Learning. MIT Press. ISBN 9780262018258.
Vapnik, Vladimir (2000). The Nature of Statistical Learning Theory. Springer.
Shalev-Shwartz, Shai; Ben-David, Shai (2014). Understanding Machine Learning – from Theory to Algorithms. Cambridge University Press. ISBN 9781107057135.
Alon, N.; Haussler, D.; Welzl, E. (1987). "Partitioning and geometric embedding of range spaces of finite Vapnik-Chervonenkis dimension". Proceedings of the Third Annual Symposium on Computational Geometry – SCG '87. p. 331. doi:10.1145/41958.41994. ISBN 978-0897912310. S2CID 7394360.
Natarajan, B. K. (1989). "On Learning Sets and Functions". Machine Learning. 4: 67–97. doi:10.1007/BF00114804.
Brukhim, Nataly; Carmon, Daniel; Dinur, Irit; Moran, Shay; Yehudayoff, Amir (2022). "A Characterization of Multiclass Learnability". 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS). arXiv:2203.01550.
Pollard, D. (1984). Convergence of Stochastic Processes. Springer. ISBN 9781461252542.
Anthony, Martin; Bartlett, Peter L. (2009). Neural Network Learning: Theoretical Foundations. ISBN 9780521118620.
Morgenstern, Jamie H.; Roughgarden, Tim (2015). "On the Pseudo-Dimension of Nearly Optimal Auctions". NIPS. arXiv:1506.03684. Bibcode:2015arXiv150603684M. http://papers.nips.cc/paper/5766-on-the-pseudo-dimension-of-nearly-optimal-auctions
Karpinski, Marek; Macintyre, Angus (February 1997). "Polynomial Bounds for VC Dimension of Sigmoidal and General Pfaffian Neural Networks". Journal of Computer and System Sciences. 54 (1): 169–176. doi:10.1006/jcss.1997.1477.