The training set is defined as $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$, where $X$ is the $n \times d$ input matrix and $Y = (y_1, \dots, y_n)$ is the output vector. Where applicable, the kernel function is denoted by $k$, and the $n \times n$ kernel matrix is denoted by $K$, with entries $K_{ij} = k(x_i, x_j)$; $\mathcal{H}$ denotes the reproducing kernel Hilbert space (RKHS) with kernel $k$. The regularization parameter is denoted by $\lambda$.
(Note: For $g \in G$ and $f \in F$, with $G$ and $F$ Hilbert spaces, given a linear, continuous operator $L$, assume that $g = Lf$ holds. In this setting, the direct problem would be to solve for $g$ given $f$, and the inverse problem would be to solve for $f$ given $g$. If the solution exists, is unique, and is stable, the inverse problem (i.e. the problem of solving for $f$) is well-posed; otherwise, it is ill-posed.)
The connection between regularized least squares (RLS) estimation (the Tikhonov regularization setting) and the theory of ill-posed inverse problems illustrates how spectral regularization algorithms relate to that theory.
The RLS estimator solves
$$\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2,$$
and the RKHS allows this estimator to be expressed as $f_S^{\lambda}(x) = \sum_{i=1}^{n} c_i k(x, x_i)$, where the coefficients $c = (c_1, \dots, c_n)$ satisfy $(K + n\lambda I)c = Y$.[6] The penalization term is used for controlling smoothness and preventing overfitting. Since the solution of empirical risk minimization $\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2$ can be written in the same form $f_S(x) = \sum_{i=1}^{n} c_i k(x, x_i)$ but with $Kc = Y$, adding the penalty term amounts to the following change in the system that needs to be solved:[7]
$$\left\{ \min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 \;\rightarrow\; \min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2 \right\} \;\equiv\; \Bigl\{ Kc = Y \;\rightarrow\; (K + n\lambda I)c = Y \Bigr\}.$$
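For concreteness, the regularized system above can be sketched in a few lines of Python. The Gaussian kernel, its bandwidth, the synthetic data, and all variable names below are illustrative assumptions rather than anything prescribed by the setting.

```python
# Minimal sketch of the RLS / Tikhonov system (K + n*lam*I) c = Y on
# synthetic 1-D data with an assumed Gaussian kernel.
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2)
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
n, d = 50, 1
X = rng.uniform(-3, 3, size=(n, d))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

K = gaussian_kernel(X, X)
lam = 1e-2

# Regularized system (K + n*lam*I) c = Y, as opposed to the unregularized K c = Y
c = np.linalg.solve(K + n * lam * np.eye(n), Y)

def f_lambda(x_new):
    # f_S^lambda(x) = sum_i c_i k(x, x_i)
    return gaussian_kernel(np.atleast_2d(x_new), X) @ c

print(f_lambda(np.array([[0.5]])))
```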
In this learning setting, the kernel matrix can be decomposed as $K = Q \Sigma Q^{T}$, with $\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_n)$, $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n \geq 0$, and with $q_1, \dots, q_n$ (the columns of $Q$) being the corresponding eigenvectors. Therefore, in the initial (unregularized) learning setting, the following holds:
$$c = K^{-1} Y = Q \Sigma^{-1} Q^{T} Y = \sum_{i=1}^{n} \frac{1}{\sigma_i} \langle q_i, Y \rangle q_i.$$
Thus, for small eigenvalues, even small perturbations in the data can lead to considerable changes in the solution. Hence, the problem is ill-conditioned, and solving this RLS problem amounts to stabilizing a possibly ill-conditioned matrix inversion problem, which is studied in the theory of ill-posed inverse problems; in both problems, a main concern is to deal with the issue of numerical stability.
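This instability can be illustrated numerically. The sketch below (again with an assumed Gaussian kernel and an arbitrary perturbation size) compares how a tiny change in $Y$ propagates through the unregularized coefficients $c = \sum_i \sigma_i^{-1} \langle q_i, Y \rangle q_i$ versus the regularized system $(K + n\lambda I)c = Y$.

```python
# Illustrative sketch (assumed Gaussian kernel, synthetic data): a rapidly
# decaying kernel spectrum amplifies a tiny perturbation of Y when K is
# inverted directly, while the regularized solve stays stable.
import numpy as np

rng = np.random.default_rng(1)
n = 12
X = np.sort(rng.uniform(-1, 1, size=(n, 1)), axis=0)
K = np.exp(-(X - X.T) ** 2)                   # Gaussian kernel Gram matrix
Y = np.sin(3 * X[:, 0])
Y_pert = Y + 1e-6 * rng.standard_normal(n)    # tiny perturbation of the outputs

sigma, Q = np.linalg.eigh(K)                  # K = Q diag(sigma) Q^T
print("smallest eigenvalue:", sigma.min())

def unregularized(Yv):
    # c = sum_i (1/sigma_i) <q_i, Y> q_i  -- small sigma_i blow up the noise
    return Q @ ((Q.T @ Yv) / sigma)

def regularized(Yv, lam=1e-3):
    return np.linalg.solve(K + n * lam * np.eye(n), Yv)

print("unregularized change:", np.linalg.norm(unregularized(Y) - unregularized(Y_pert)))
print("regularized change:  ", np.linalg.norm(regularized(Y) - regularized(Y_pert)))
```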
Each algorithm in the class of spectral regularization algorithms is defined by a suitable filter function, denoted here by $G_{\lambda}(\cdot)$. If the kernel matrix is denoted by $K$, then $\lambda$ should control the magnitude of the smaller eigenvalues of $G_{\lambda}(K)$. In a filtering setup, the goal is to find estimators $f_S^{\lambda}(x) := \sum_{i=1}^{n} c_i k(x, x_i)$, where $c = G_{\lambda}(K) Y$. To do so, a scalar filter function $G_{\lambda}(\sigma)$ is defined and applied to the kernel matrix through its eigen-decomposition:
$$G_{\lambda}(K) = Q G_{\lambda}(\Sigma) Q^{T},$$
which yields
$$G_{\lambda}(K) Y = \sum_{i=1}^{n} G_{\lambda}(\sigma_i) \langle q_i, Y \rangle q_i.$$
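The decomposition above translates directly into code. The following is a generic, illustrative sketch of spectral filtering (the function and variable names are assumptions): it eigendecomposes $K$, applies a user-supplied scalar filter to the eigenvalues, and returns $c = G_{\lambda}(K) Y$.

```python
# Generic spectral filtering: c = G_lambda(K) Y = Q G_lambda(Sigma) Q^T Y.
import numpy as np

def spectral_filter_coefficients(K, Y, g):
    """Return c = G_lambda(K) Y for a scalar filter g: sigma -> G_lambda(sigma)."""
    sigma, Q = np.linalg.eigh(K)              # K = Q diag(sigma) Q^T
    return Q @ (g(sigma) * (Q.T @ Y))         # sum_i G_lambda(sigma_i) <q_i, Y> q_i

rng = np.random.default_rng(2)
n = 20
X = rng.standard_normal((n, 2))
K = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
Y = rng.standard_normal(n)
lam = 0.1

# Example filter: the Tikhonov choice G_lambda(sigma) = 1/(sigma + n*lam)
c = spectral_filter_coefficients(K, Y, lambda s: 1.0 / (s + n * lam))
print(c[:3])
```

Different regularization methods correspond to different choices of the scalar filter, as described below.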
Typically, an appropriate filter function should have the following properties:[8]

- As $\lambda$ goes to zero, $G_{\lambda}(\sigma) \to 1/\sigma$, so that in the limit the unregularized solution $c = K^{-1}Y$ is recovered (provided it exists).
- The magnitude of $G_{\lambda}(\sigma)$ on the smaller eigenvalues is controlled by $\lambda$, so that the corresponding components of the solution remain bounded.
While the above items give a rough characterization of the general properties of filter functions for all spectral regularization algorithms, the derivation of the filter function (and hence its exact form) varies depending on the specific regularization method that spectral filtering is applied to.
In the Tikhonov regularization setting, the filter function for RLS is described below. As shown in [9], in this setting, $c = (K + n\lambda I)^{-1} Y$. Thus,
$$c = (K + n\lambda I)^{-1} Y = Q (\Sigma + n\lambda I)^{-1} Q^{T} Y = \sum_{i=1}^{n} \frac{1}{\sigma_i + n\lambda} \langle q_i, Y \rangle q_i.$$
The undesired components are filtered out using regularization:

- If $\sigma_i \gg n\lambda$, then $\frac{1}{\sigma_i + n\lambda} \approx \frac{1}{\sigma_i}$, so the component along $q_i$ is essentially unaffected.
- If $\sigma_i \ll n\lambda$, then $\frac{1}{\sigma_i + n\lambda} \approx \frac{1}{n\lambda}$, so the contribution of the component along $q_i$ stays bounded instead of being amplified by $1/\sigma_i$.
The filter function for Tikhonov regularization is therefore defined as:[10]
$$G_{\lambda}(\sigma) = \frac{1}{\sigma + n\lambda}.$$
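As a quick consistency check (illustrative, not part of the original derivation), applying this filter eigenvalue-wise reproduces the direct solution of $(K + n\lambda I)c = Y$; the kernel and data below are arbitrary.

```python
# The Tikhonov filter applied to the spectrum of K reproduces the direct
# solution of (K + n*lam*I) c = Y.
import numpy as np

rng = np.random.default_rng(3)
n = 25
X = rng.standard_normal((n, 3))
K = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
Y = rng.standard_normal(n)
lam = 0.05

sigma, Q = np.linalg.eigh(K)
G = 1.0 / (sigma + n * lam)                   # G_lambda(sigma) = 1/(sigma + n*lam)
c_filter = Q @ (G * (Q.T @ Y))                # sum_i G_lambda(sigma_i) <q_i, Y> q_i
c_direct = np.linalg.solve(K + n * lam * np.eye(n), Y)

print(np.allclose(c_filter, c_direct))        # expected: True
```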
The idea behind the Landweber iteration is gradient descent:[11]
$$c^0 = 0, \qquad c^i = c^{i-1} + \eta \left( Y - K c^{i-1} \right), \quad i = 1, \dots, t.$$
In this setting, if $n$ is larger than $K$'s largest eigenvalue, the above iteration converges by choosing $\eta = 2/n$ as the step size.[12] The above iteration is equivalent to minimizing $\frac{1}{n}\|Y - Kc\|_2^2$ (i.e. the empirical risk) via gradient descent; using induction, it can be proved that at the $t$-th iteration, the solution is given by[13]
$$c = \eta \sum_{i=0}^{t-1} (I - \eta K)^i Y.$$
Thus, the appropriate filter function is defined by:
$$G_{\lambda}(\sigma) = \eta \sum_{i=0}^{t-1} (1 - \eta \sigma)^i.$$
It can be shown that this filter function corresponds to a truncated power expansion of $K^{-1}$.[14] To see this, note that the relation $\sum_{i \geq 0} x^i = 1/(1-x)$ still holds when $x$ is replaced by a matrix of norm less than one; thus, considering $K$ (the kernel matrix), or rather $I - \eta K$, the following holds:
$$K^{-1} = \eta \sum_{i=0}^{\infty} (I - \eta K)^i \sim \eta \sum_{i=0}^{t-1} (I - \eta K)^i.$$
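The equivalence between running the iteration for $t$ steps and the truncated-sum closed form can be checked numerically; the sketch below uses an assumed Gaussian kernel, synthetic data, and an arbitrary number of iterations.

```python
# t steps of the Landweber iteration c^i = c^{i-1} + eta*(Y - K c^{i-1})
# match the closed form c = eta * sum_{i=0}^{t-1} (I - eta*K)^i Y.
import numpy as np

rng = np.random.default_rng(4)
n, t = 15, 50
X = rng.standard_normal((n, 2))
K = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
Y = rng.standard_normal(n)
eta = 2.0 / n                                  # step size; valid if n exceeds K's largest eigenvalue

# Iterative form
c_iter = np.zeros(n)
for _ in range(t):
    c_iter = c_iter + eta * (Y - K @ c_iter)

# Closed form: eta * sum_{i=0}^{t-1} (I - eta*K)^i Y
M = np.eye(n) - eta * K
c_closed = np.zeros(n)
power_Y = Y.copy()
for _ in range(t):
    c_closed += eta * power_Y
    power_Y = M @ power_Y

print(np.allclose(c_iter, c_closed))           # expected: True
```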
In this setting, the number of iterations gives the regularization parameter; roughly speaking, $t \sim 1/\lambda$.[15] If $t$ is large, overfitting may be a concern. If $t$ is small, oversmoothing may be a concern. Thus, choosing an appropriate time for early stopping of the iterations provides a regularization effect.
In the TSVD setting, given the eigen-decomposition $K = Q \Sigma Q^{T}$ and using a prescribed threshold $\lambda n$, a regularized inverse can be formed for the kernel matrix by discarding all the eigenvalues that are smaller than this threshold.[16] Thus, the filter function for TSVD can be defined as
$$G_{\lambda}(\sigma) = \begin{cases} 1/\sigma, & \text{if } \sigma \geq \lambda n \\ 0, & \text{otherwise.} \end{cases}$$
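A minimal sketch of the TSVD filter follows, assuming the eigen-decomposition is computed with numpy; the helper name and the threshold value are hypothetical.

```python
# TSVD: discard eigenvalues below the threshold lam*n and invert the rest,
# i.e. G_lambda(sigma) = 1/sigma if sigma >= lam*n, and 0 otherwise.
import numpy as np

def tsvd_coefficients(K, Y, lam):
    n = K.shape[0]
    sigma, Q = np.linalg.eigh(K)              # K = Q diag(sigma) Q^T
    keep = sigma >= lam * n                   # components kept for the projection
    G = np.zeros_like(sigma)
    G[keep] = 1.0 / sigma[keep]               # regularized inverse of the spectrum
    return Q @ (G * (Q.T @ Y)), int(keep.sum())

rng = np.random.default_rng(5)
n = 20
X = rng.standard_normal((n, 2))
K = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
Y = rng.standard_normal(n)

c, kept = tsvd_coefficients(K, Y, lam=1e-3)
print(kept, "components kept out of", n)
```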
It can be shown that TSVD is equivalent to the (unsupervised) projection of the data using (kernel) principal component analysis (PCA), and that it is also equivalent to minimizing the empirical risk on the projected data (without regularization).[17] Note that the number of components kept for the projection is the only free parameter here.
H. W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems. Kluwer, 1996.
L. Lo Gerfo, L. Rosasco, F. Odone, E. De Vito, and A. Verri. Spectral Algorithms for Supervised Learning. Neural Computation, 20(7), 2008.
P. C. Hansen, J. G. Nagy, and D. P. O'Leary. Deblurring Images: Matrices, Spectra, and Filtering. Fundamentals of Algorithms 3, SIAM, Philadelphia, 2006.
L. Rosasco. Lecture 6 of the Lecture Notes for 9.520: Statistical Learning Theory and Applications. Massachusetts Institute of Technology, Fall 2013. Available at https://www.mit.edu/~9.520/fall13/slides/class06/class06_RLSSVM.pdf
L. Rosasco. Lecture 7 of the Lecture Notes for 9.520: Statistical Learning Theory and Applications. Massachusetts Institute of Technology, Fall 2013. Available at https://www.mit.edu/~9.520/fall13/slides/class07/class07_spectral.pdf