Kernel method

<h2 id="motivation-and-informal-explanation">Motivation and informal explanation</h2>
<p>Kernel methods can be thought of as <a href="/facts/Instance-based_learning/8dLpDSjL">instance-based learners</a>: rather than learning some fixed set of parameters corresponding to the features of their inputs, they instead "remember" the \(i\)-th training example \((\mathbf{x}_i, y_i)\) and learn for it a corresponding weight \(w_i\). Prediction for unlabeled inputs, i.e., those not in the training set, is made by applying a <a href="/facts/Similarity_function/FmmFco8c">similarity function</a> \(k\), called a kernel, between the unlabeled input \(\mathbf{x'}\) and each of the training inputs \(\mathbf{x}_i\). For instance, a kernelized <a href="/facts/Binary_classifier/MeYYAMwp">binary classifier</a> typically computes a weighted sum of similarities

\[\hat{y} = \operatorname{sgn} \sum_{i=1}^{n} w_i y_i k(\mathbf{x}_i, \mathbf{x'}),\]

where
</p>
<ul><li>\(\hat{y} \in \{-1, +1\}\) is the kernelized binary classifier's predicted label for the unlabeled input \(\mathbf{x'}\) whose hidden true label \(y\) is of interest;</li>
<li>\(k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is the kernel function that measures similarity between any pair of inputs \(\mathbf{x}, \mathbf{x'} \in \mathcal{X}\);</li>
<li>the sum ranges over the \(n\) labeled examples \(\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}\) in the classifier's training set, with \(y_i \in \{-1, +1\}\);</li>
<li>the \(w_i \in \mathbb{R}\) are the weights for the training examples, as determined by the learning algorithm;</li>
<li>the <a href="/facts/Sign_function/VADy9zQq">sign function</a> \(\operatorname{sgn}\) determines whether the predicted classification \(\hat{y}\) comes out positive or negative.</li></ul>
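<p>The prediction rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the Gaussian (RBF) kernel, the value of <code>gamma</code>, the toy data, the uniform weights, and the function names <code>rbf_kernel</code> and <code>predict</code> are all assumptions made for the example. In practice the weights \(w_i\) would come from a learning algorithm.</p>

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) similarity between two points; gamma is an illustrative choice."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def predict(x_new, train_x, train_y, weights, kernel=rbf_kernel):
    """Return sgn(sum_i w_i * y_i * k(x_i, x_new)) as -1 or +1."""
    score = sum(w * y * kernel(x, x_new)
                for x, y, w in zip(train_x, train_y, weights))
    # Assumption: a score of exactly zero is mapped to +1.
    return 1 if score >= 0 else -1


# Toy training set: two points per class.
train_x = [(0.0, 0.0), (0.0, 1.0), (3.0, 3.0), (3.0, 4.0)]
train_y = [-1, -1, +1, +1]
# Uniform weights stand in for weights found by a learning algorithm.
weights = [1.0, 1.0, 1.0, 1.0]

label_near_negative = predict((0.2, 0.5), train_x, train_y, weights)
label_near_positive = predict((2.8, 3.5), train_x, train_y, weights)
```

<p>Each query point is assigned the label of the training examples it is most similar to, because the nearby examples dominate the weighted sum.</p>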
<p>Kernel classifiers were described as early as the 1960s, with the invention of the <a href="/facts/Kernel_perceptron/esW7fKh4">kernel perceptron</a>.<a class="footnote-ref" id="fnref:3" href="#fn:3"><sup>3</sup></a> They rose to great prominence with the popularity of the <a href="/facts/Support-vector_machine/XobxpdBG">support-vector machine</a> (SVM) in the 1990s, when the SVM was found to be competitive with <a href="/facts/Artificial_neural_network/6V1jMlkx">neural networks</a> on tasks such as <a href="/facts/Handwriting_recognition/RgdyWtDQ">handwriting recognition</a>.
</p>
<h2 id="mathematics-the-kernel-trick">Mathematics: the kernel trick</h2>

<p>The kernel trick avoids the explicit mapping that is needed to get linear <a href="/facts/Learning_algorithms/e0w0XJTu">learning algorithms</a> to learn a nonlinear function or <a href="/facts/Decision_boundary/dp8dApoX">decision boundary</a>. For all \(\mathbf{x}\) and \(\mathbf{x'}\) in the input space \(\mathcal{X}\), certain functions \(k(\mathbf{x}, \mathbf{x'})\) can be expressed as an <a href="/facts/Inner_product/HoyjElEy">inner product</a> in another space \(\mathcal{V}\). The function \(k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is often referred to as a <i>kernel</i> or a <i><a href="/facts/Kernel_function/Ws3zkxfl">kernel function</a></i>. The word "kernel" is used in mathematics to denote a weighting function for a weighted sum or <a href="/facts/Integral/V7lyV997">integral</a>.
</p><p>Certain problems in machine learning have more structure than an arbitrary weighting function \(k\). The computation is made much simpler if the kernel can be written in the form of a "feature map" \(\varphi \colon \mathcal{X} \to \mathcal{V}\) which satisfies

\[k(\mathbf{x}, \mathbf{x'}) = \langle \varphi(\mathbf{x}), \varphi(\mathbf{x'}) \rangle_{\mathcal{V}}.\]
  
The key restriction is that \(\langle \cdot, \cdot \rangle_{\mathcal{V}}\) must be a proper inner product. On the other hand, an explicit representation for \(\varphi\) is not necessary, as long as \(\mathcal{V}\) is an <a href="/facts/Inner_product_space/HoyjElEy">inner product space</a>. The alternative follows from <a href="/facts/Mercer%27s_theorem/6CmBHnHc">Mercer's theorem</a>: an implicitly defined function \(\varphi\) exists whenever the space \(\mathcal{X}\) can be equipped with a suitable <a href="/facts/Measure_(mathematics)/yCq7zlI4">measure</a> ensuring the function \(k\) satisfies <a href="/facts/Mercer%27s_condition/6CmBHnHc">Mercer's condition</a>.
</p><p>Mercer's theorem is similar to a generalization of the result from linear algebra that <a href="/facts/Positive-definite_matrix/Hbr6AuPS">associates an inner product to any positive-definite matrix</a>. In fact, Mercer's condition can be reduced to this simpler case. If we choose as our measure the <a href="/facts/Counting_measure/TA4u39yF">counting measure</a> \(\mu(T) = |T|\) for all \(T \subset \mathcal{X}\), which counts the number of points inside the set \(T\), then the integral in Mercer's theorem reduces to a summation
  
    
      
        
\[\sum_{i=1}^{n} \sum_{j=1}^{n} k(\mathbf{x}_i, \mathbf{x}_j) c_i c_j \geq 0.\]
  
If this inequality holds for all finite sequences of points \((\mathbf{x}_1, \dotsc, \mathbf{x}_n)\) in \(\mathcal{X}\) and all choices of \(n\) real-valued coefficients \((c_1, \dots, c_n)\) (cf. <a href="/facts/Positive_definite_kernel/Ws3zkxfl">positive definite kernel</a>), then the function \(k\) satisfies Mercer's condition.
</p><p>Some algorithms that depend on arbitrary relationships in the native space \(\mathcal{X}\) would, in fact, have a linear interpretation in a different setting: the range space of \(\varphi\). The linear interpretation gives us insight about the algorithm. Furthermore, there is often no need to compute \(\varphi\) directly during computation, as is the case with <a href="/facts/Support-vector_machine/XobxpdBG">support-vector machines</a>. Some cite this running time shortcut as the primary benefit. Researchers also use it to justify the meanings and properties of existing algorithms.
</p><p>Theoretically, a <a href="/facts/Gram_matrix/SGb2VGTU">Gram matrix</a> \(\mathbf{K} \in \mathbb{R}^{n \times n}\) with respect to \(\{\mathbf{x}_1, \dotsc, \mathbf{x}_n\}\) (sometimes also called a "kernel matrix"<a class="footnote-ref" id="fnref:4" href="#fn:4"><sup>4</sup></a>), where \(K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)\), must be <a href="/facts/Positive-definite_matrix/Hbr6AuPS">positive semi-definite (PSD)</a>.<a class="footnote-ref" id="fnref:5" href="#fn:5"><sup>5</sup></a> Empirically, for machine learning heuristics, choices of a function \(k\) that do not satisfy Mercer's condition may still perform reasonably if \(k\) at least approximates the intuitive idea of similarity.<a class="footnote-ref" id="fnref:6" href="#fn:6"><sup>6</sup></a> Regardless of whether \(k\) is a Mercer kernel, \(k\) may still be referred to as a "kernel".
</p><p>If the kernel function \(k\) is also a <a href="/facts/Covariance_function/pqUTwANG">covariance function</a> as used in <a href="/facts/Gaussian_processes/MrBq7kYW">Gaussian processes</a>, then the Gram matrix \(\mathbf{K}\) can also be called a <a href="/facts/Covariance_matrix/kLhgjHC6">covariance matrix</a>.<a class="footnote-ref" id="fnref:7" href="#fn:7"><sup>7</sup></a>
</p>
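<p>As a concrete illustration of the feature-map identity and of the finite-sum form of Mercer's condition discussed in this section, the sketch below uses the homogeneous degree-2 polynomial kernel \(k(\mathbf{x}, \mathbf{x'}) = (\mathbf{x} \cdot \mathbf{x'})^2\) on \(\mathbb{R}^2\), for which an explicit feature map is \(\varphi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)\). The choice of kernel, the random sample points, and all function names are assumptions made for the example. The random-coefficient check is only a numerical spot check of the quadratic form, not a proof of positive semi-definiteness.</p>

```python
import math
import random


def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel k(x, z) = (x . z)^2 on R^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def feature_map(x):
    """Explicit map phi: R^2 -> R^3 with <phi(x), phi(z)> = (x . z)^2."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    """Standard inner product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))


def gram_matrix(points, kernel):
    """Gram matrix K with K[i][j] = kernel(points[i], points[j])."""
    return [[kernel(xi, xj) for xj in points] for xi in points]


def quadratic_form(K, c):
    """Double sum over i, j of K[i][j] * c[i] * c[j]."""
    n = len(c)
    return sum(K[i][j] * c[i] * c[j] for i in range(n) for j in range(n))


rng = random.Random(0)  # fixed seed so the run is repeatable
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(5)]

# 1) The kernel value matches the inner product in feature space,
#    so phi never has to be evaluated to obtain k.
max_gap = max(abs(poly2_kernel(x, z) - dot(feature_map(x), feature_map(z)))
              for x in points for z in points)

# 2) The quadratic form is non-negative (up to rounding) for random coefficients.
K = gram_matrix(points, poly2_kernel)
min_form = min(quadratic_form(K, [rng.uniform(-1, 1) for _ in points])
               for _ in range(200))
```

<p>Here <code>max_gap</code> should be at the level of floating-point rounding error, and <code>min_form</code> should not be meaningfully below zero. Evaluating the kernel costs one dot product in \(\mathbb{R}^2\), whereas the explicit route first builds vectors in \(\mathbb{R}^3\); for higher degrees or dimensions that gap grows quickly.</p>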
<h2 id="applications">Applications</h2>
<p>Application areas of kernel methods are diverse and include <a href="/facts/Geostatistics/UeChfgNG">geostatistics</a>,<a class="footnote-ref" id="fnref:8" href="#fn:8"><sup>8</sup></a> <a href="/facts/Kriging/Rk7X4qP5">kriging</a>, <a href="/facts/Inverse_distance_weighting/v4mmcFlP">inverse distance weighting</a>, <a href="/facts/3D_reconstruction/MlnAUm3l">3D reconstruction</a>, <a href="/facts/Bioinformatics/D5x2L8ee">bioinformatics</a>, <a href="/facts/Cheminformatics/BwBkT11c">cheminformatics</a>, <a href="/facts/Information_extraction/DL9nMD2X">information extraction</a> and <a href="/facts/Handwriting_recognition/RgdyWtDQ">handwriting recognition</a>.
</p>
<h2 id="popular-kernels">Popular kernels</h2>
<ul><li><a href="/facts/Fisher_kernel/wIpRJErI">Fisher kernel</a></li>
<li><a href="/facts/Graph_kernel/26cg65kb">Graph kernels</a></li>
<li><a href="/facts/Kernel_smoother/JaGaXr6s">Kernel smoother</a></li>
<li><a href="/facts/Polynomial_kernel/X6jvPUyU">Polynomial kernel</a></li>
<li><a href="/facts/Radial_basis_function_kernel/A88JLBJF">Radial basis function kernel</a> (RBF)</li>
<li><a href="/facts/String_kernel/22x4jvva">String kernels</a></li>
<li><a href="/facts/Neural_tangent_kernel/ybcmu6Nr">Neural tangent kernel</a></li>
<li><a href="/facts/Neural_network_Gaussian_process/q8Adj9Co">Neural network Gaussian process</a> (NNGP) kernel</li></ul>
<h2 id="see-also">See also</h2>
<ul><li><a href="/facts/Kernel_methods_for_vector_output/p3gzGg6S">Kernel methods for vector output</a></li>
<li><a href="/facts/Kernel_density_estimation/NmznyEW8">Kernel density estimation</a></li>
<li><a href="/facts/Representer_theorem/6bXmlSzK">Representer theorem</a></li>
<li><a href="/facts/Similarity_learning/peRWFSl4">Similarity learning</a></li>
<li><a href="/facts/Cover%27s_theorem/2MK5JlEP">Cover's theorem</a></li></ul>

<h2 id="further-reading">Further reading</h2>
<ul><li><a href="/facts/John_Shawe-Taylor/m0GHWgB8">Shawe-Taylor, J.</a>; <a href="/facts/Nello_Cristianini/j3xW52wj">Cristianini, N.</a> (2004). <i>Kernel Methods for Pattern Analysis</i>. Cambridge University Press. <a href="/facts/ISBN_(identifier)/15AdSPa9">ISBN</a> 9780511809682.</li>
<li>Liu, W.; Principe, J.; Haykin, S. (2010). <a href="https://books.google.com/books?id=eWUwB_P5pW0C"><i>Kernel Adaptive Filtering: A Comprehensive Introduction</i></a>. Wiley. <a href="/facts/ISBN_(identifier)/15AdSPa9">ISBN</a> 9781118211212.</li>
<li><a href="/facts/Bernhard_Sch%C3%B6lkopf/LCexnwav">Schölkopf, B.</a>; Smola, A. J.; Bach, F. (2018). <a href="https://books.google.com/books?id=ZQxiuAEACAAJ"><i>Learning with Kernels : Support Vector Machines, Regularization, Optimization, and Beyond</i></a>. MIT Press. <a href="/facts/ISBN_(identifier)/15AdSPa9">ISBN</a> 978-0-262-53657-8.</li></ul>
<h2 id="external-links">External links</h2>
<ul><li><a href="http://www.kernel-machines.org">Kernel-Machines Org</a>—community website</li>
<li><a href="http://onlineprediction.net/?n=Main.KernelMethods">onlineprediction.net Kernel Methods Article</a></li></ul>

<h2 id="references">References</h2>

<ol>
<li id="fn:1"><p>"Kernel method". Engati. Retrieved 2023-04-04. <a href="https://www.engati.com/glossary/kernel-method" target="_blank">https://www.engati.com/glossary/kernel-method</a> <a href="#fnref:1" class="footnote-back-ref">↩</a></p></li>
<li id="fn:2"><p>Theodoridis, Sergios (2008). Pattern Recognition. Elsevier B.V. p. 203. ISBN 9780080949123. <a href="#fnref:2" class="footnote-back-ref">↩</a></p></li>
<li id="fn:3"><p>Aizerman, M. A.; Braverman, Emmanuel M.; Rozonoer, L. I. (1964). "Theoretical foundations of the potential function method in pattern recognition learning". Automation and Remote Control. 25: 821–837. Cited in Guyon, Isabelle; Boser, B.; Vapnik, Vladimir (1993). Automatic capacity tuning of very large VC-dimension classifiers. Advances in neural information processing systems. CiteSeerX 10.1.1.17.7215. <a href="#fnref:3" class="footnote-back-ref">↩</a></p></li>
<li id="fn:4"><p>Hofmann, Thomas; Schölkopf, Bernhard; Smola, Alexander J. (2008). "Kernel Methods in Machine Learning". The Annals of Statistics. 36 (3). arXiv:math/0701907. doi:10.1214/009053607000000677. S2CID 88516979. <a href="https://projecteuclid.org/download/pdfview_1/euclid.aos/1211819561" target="_blank">https://projecteuclid.org/download/pdfview_1/euclid.aos/1211819561</a> <a href="#fnref:4" class="footnote-back-ref">↩</a></p></li>
<li id="fn:5"><p>Mohri, Mehryar; Rostamizadeh, Afshin; Talwalkar, Ameet (2012). Foundations of Machine Learning. US, Massachusetts: MIT Press. ISBN 9780262018258. <a href="#fnref:5" class="footnote-back-ref">↩</a></p></li>
<li id="fn:6"><p>Sewell, Martin. "Support Vector Machines: Mercer's Condition". Support Vector Machines. Archived from the original on 2018-10-15. Retrieved 2014-05-30. <a href="https://web.archive.org/web/20181015031456/http://www.svms.org/mercer/" target="_blank">https://web.archive.org/web/20181015031456/http://www.svms.org/mercer/</a> <a href="#fnref:6" class="footnote-back-ref">↩</a></p></li>
<li id="fn:7"><p>Rasmussen, Carl Edward; Williams, Christopher K. I. (2006). Gaussian Processes for Machine Learning. MIT Press. ISBN 0-262-18253-X. <a href="#fnref:7" class="footnote-back-ref">↩</a></p></li>
<li id="fn:8"><p>Honarkhah, M.; Caers, J. (2010). "Stochastic Simulation of Patterns Using Distance-Based Pattern Modeling". Mathematical Geosciences. 42 (5): 487–517. Bibcode:2010MaGeo..42..487H. doi:10.1007/s11004-010-9276-7. S2CID 73657847. <a href="#fnref:8" class="footnote-back-ref">↩</a></p></li>
</ol>
