Additive models are a class of non-parametric regression models of the form:
$$Y_i = \alpha + \sum_{j=1}^{p} f_j(X_{ij}) + \epsilon_i$$
where each $X_1, X_2, \ldots, X_p$ is a variable in our $p$-dimensional predictor $X$, and $Y$ is our outcome variable. $\epsilon$ represents our inherent error, which is assumed to have mean zero. The $f_j$ represent unspecified smooth functions of a single $X_j$. Given the flexibility in the $f_j$, we typically do not have a unique solution: $\alpha$ is left unidentifiable, as one can add any constant to any of the $f_j$ and subtract this value from $\alpha$. It is common to rectify this by constraining
$$\sum_{i=1}^{n} f_j(X_{ij}) = 0 \quad \text{for all } j$$
leaving
$$\alpha = \frac{1}{n} \sum_{i=1}^{n} y_i$$
necessarily.
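The backfitting algorithm is then:
1. Initialize $\hat{\alpha} = \frac{1}{n} \sum_{i=1}^{n} y_i$ and $\hat{f}_j \equiv 0$ for all $j$.
2. Repeat until the $\hat{f}_j$ converge; for each predictor $j = 1, \ldots, p$:
(a) $\hat{f}_j \leftarrow \text{Smooth}\left[\left\{ y_i - \hat{\alpha} - \sum_{k \neq j} \hat{f}_k(x_{ik}) \right\}_{1}^{n}\right]$ (backfitting step)
(b) $\hat{f}_j \leftarrow \hat{f}_j - \frac{1}{n} \sum_{i=1}^{n} \hat{f}_j(x_{ij})$ (mean centering of the estimated function)
where $\text{Smooth}$ is our smoothing operator. This is typically chosen to be a cubic spline smoother but can be any other appropriate fitting operation, such as local polynomial regression or a kernel smoother.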
In theory, step (b) in the algorithm is not needed, as the function estimates are constrained to sum to zero. In practice, however, numerical rounding can let the estimates drift away from zero mean, so the centering step is advisable.[1]
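The two steps of the algorithm can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `running_mean_smoother` and `backfit`, the window size `k`, the tolerance, and the simulated data are all assumptions made for the example, and a simple running mean stands in for the cubic spline smoother.

```python
import math
import random


def running_mean_smoother(x, r, k=7):
    """Smooth r against x by averaging over a window of up to k points.

    The window is taken in sorted order of x. This is a hypothetical,
    deliberately crude stand-in for a cubic spline smoother.
    """
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    half = k // 2
    fitted = [0.0] * n
    for pos, i in enumerate(order):
        window = order[max(0, pos - half):min(n, pos + half + 1)]
        fitted[i] = sum(r[j] for j in window) / len(window)
    return fitted


def backfit(X, y, smooth=running_mean_smoother, tol=1e-6, max_iter=100):
    """Backfitting for an additive model.

    X is a list of p columns (each a list of n values); returns (alpha, f),
    where f[j][i] is the estimate of f_j at observation i.
    """
    n, p = len(y), len(X)
    alpha = sum(y) / n                    # initialize alpha to the mean of y
    f = [[0.0] * n for _ in range(p)]     # initialize every f_j to zero
    for _ in range(max_iter):
        change = 0.0
        for j in range(p):
            # (a) smooth the partial residuals against predictor j
            partial = [
                y[i] - alpha - sum(f[k][i] for k in range(p) if k != j)
                for i in range(n)
            ]
            new_fj = smooth(X[j], partial)
            # (b) re-center the estimate so that it sums to zero
            mean_fj = sum(new_fj) / n
            new_fj = [v - mean_fj for v in new_fj]
            change = max(change, max(abs(a - b) for a, b in zip(new_fj, f[j])))
            f[j] = new_fj
        if change < tol:                  # stop once no estimate moves much
            break
    return alpha, f


# Simulated data (an assumption for the demo): y = 1 + sin(x1) + x2^2 + noise
random.seed(0)
n = 200
x1 = [random.uniform(-2, 2) for _ in range(n)]
x2 = [random.uniform(-2, 2) for _ in range(n)]
y = [1.0 + math.sin(a) + b * b + random.gauss(0, 0.1) for a, b in zip(x1, x2)]

alpha, f = backfit([x1, x2], y)
fitted = [alpha + f[0][i] + f[1][i] for i in range(n)]
```

Any function with the same `(x, r) -> fitted values` signature can be passed as `smooth`, which mirrors the remark above that the smoothing operator is interchangeable.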
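If we consider the problem of minimizing the expected squared error:
$$\min E\left[ Y - \left( \alpha + \sum_{j=1}^{p} f_j(X_j) \right) \right]^2$$
There exists a unique solution by the theory of projections given by:
$$f_i(X_i) = E\left[ Y - \left( \alpha + \sum_{j \neq i} f_j(X_j) \right) \,\Big|\, X_i \right]$$
for $i = 1, 2, \ldots, p$.
Absorbing $\alpha$ into $Y$ (that is, taking $Y$ to be centered), this gives the matrix interpretation:
$$\begin{pmatrix} I & P_1 & \cdots & P_1 \\ P_2 & I & \cdots & P_2 \\ \vdots & & \ddots & \vdots \\ P_p & \cdots & P_p & I \end{pmatrix} \begin{pmatrix} f_1(X_1) \\ f_2(X_2) \\ \vdots \\ f_p(X_p) \end{pmatrix} = \begin{pmatrix} P_1 Y \\ P_2 Y \\ \vdots \\ P_p Y \end{pmatrix}$$
where $P_i(\cdot) = E(\cdot \mid X_i)$. In this context we can imagine a smoother matrix, $S_i$, which approximates our $P_i$ and gives an estimate, $S_i Y$, of $E(Y \mid X_i)$:
$$\begin{pmatrix} I & S_1 & \cdots & S_1 \\ S_2 & I & \cdots & S_2 \\ \vdots & & \ddots & \vdots \\ S_p & \cdots & S_p & I \end{pmatrix} \begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_p \end{pmatrix} = \begin{pmatrix} S_1 Y \\ S_2 Y \\ \vdots \\ S_p Y \end{pmatrix}$$
or in abbreviated form
$$\hat{S} f = Q Y$$
where $\hat{S}$ is the $np \times np$ block matrix on the left, $f$ stacks the $p$ vectors $f_j$, and $Q$ stacks the smoother matrices $S_1, \ldots, S_p$.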
An exact solution of this $np \times np$ system is infeasible to calculate for large $np$, so the iterative technique of backfitting is used. We take initial guesses $f_j^{(0)}$ and update each $f_j^{(\ell)}$ in turn to be the smoothed fit for the residuals of all the others:
$$\hat{f}_j^{(\ell)} \leftarrow \text{Smooth}\left[\left\{ y_i - \hat{\alpha} - \sum_{k \neq j} \hat{f}_k(x_{ik}) \right\}_{1}^{n}\right]$$
Looking at the abbreviated form it is easy to see the backfitting algorithm as equivalent to the Gauss–Seidel method for linear smoothing operators S.
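The Gauss–Seidel equivalence can be checked numerically on a toy problem. The sketch below assembles the block system (identity blocks on the diagonal, $S_j$ in every off-diagonal block of block row $j$) and compares ordinary Gauss–Seidel sweeps with backfitting cycles. The two $3 \times 3$ smoother matrices and the response vector are made-up values chosen only so that the iteration converges; centering and the intercept are omitted for brevity.

```python
def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def gauss_seidel_sweep(A, b, x):
    """One in-place Gauss-Seidel sweep for the linear system A x = b."""
    for r in range(len(b)):
        s = sum(A[r][c] * x[c] for c in range(len(b)) if c != r)
        x[r] = (b[r] - s) / A[r][r]
    return x


def backfit_sweep(smoothers, y, f):
    """One backfitting cycle: f_j <- S_j (y - sum over k != j of f_k)."""
    n = len(y)
    for j, S in enumerate(smoothers):
        partial = [
            y[i] - sum(f[k][i] for k in range(len(f)) if k != j)
            for i in range(n)
        ]
        f[j] = matvec(S, partial)
    return f


# Hypothetical linear smoothers and response, for illustration only.
S1 = [[0.5, 0.3, 0.0], [0.3, 0.4, 0.1], [0.0, 0.1, 0.6]]
S2 = [[0.4, 0.0, 0.2], [0.0, 0.5, 0.2], [0.2, 0.2, 0.3]]
y = [1.0, -2.0, 0.5]
smoothers = [S1, S2]
n, p = len(y), len(smoothers)

# Assemble S_hat: identity diagonal blocks, S_j in the off-diagonal blocks.
S_hat = [[0.0] * (n * p) for _ in range(n * p)]
for j in range(p):
    for k in range(p):
        for r in range(n):
            for c in range(n):
                if j == k:
                    S_hat[j * n + r][k * n + c] = 1.0 if r == c else 0.0
                else:
                    S_hat[j * n + r][k * n + c] = smoothers[j][r][c]
QY = [v for S in smoothers for v in matvec(S, y)]   # stacked S_j y

x = [0.0] * (n * p)                  # Gauss-Seidel iterate
f = [[0.0] * n for _ in range(p)]    # backfitting iterate
for _ in range(25):
    gauss_seidel_sweep(S_hat, QY, x)
    backfit_sweep(smoothers, y, f)

flat_f = [v for fj in f for v in fj]
max_gap = max(abs(a - b) for a, b in zip(x, flat_f))
```

Because the diagonal blocks are identities, a row-by-row Gauss–Seidel sweep touches only the other blocks' current values, which is exactly the backfitting update; `max_gap` should therefore differ from zero only by floating-point rounding.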
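Following Härdle et al.,[2] we can formulate the backfitting algorithm explicitly for the two-dimensional case. We have:
$$f_1 = S_1(Y - f_2), \qquad f_2 = S_2(Y - f_1)$$
If we denote $\hat{f}_1^{(i)}$ as the estimate of $f_1$ in the $i$th updating step, the backfitting steps are
$$\hat{f}_1^{(i)} = S_1\left[ Y - \hat{f}_2^{(i-1)} \right], \qquad \hat{f}_2^{(i)} = S_2\left[ Y - \hat{f}_1^{(i)} \right]$$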
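By induction we get
$$\hat{f}_1^{(i)} = Y - \sum_{\alpha=0}^{i-1} (S_1 S_2)^{\alpha} (I - S_1) Y - (S_1 S_2)^{i-1} S_1 \hat{f}_2^{(0)}$$
and
$$\hat{f}_2^{(i)} = S_2 \sum_{\alpha=0}^{i-1} (S_1 S_2)^{\alpha} (I - S_1) Y + S_2 (S_1 S_2)^{i-1} S_1 \hat{f}_2^{(0)}$$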
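If we set $\hat{f}_2^{(0)} = 0$ then we get
$$\hat{f}_1^{(i)} = Y - S_2^{-1} \hat{f}_2^{(i)} = \left[ I - \sum_{\alpha=0}^{i-1} (S_1 S_2)^{\alpha} (I - S_1) \right] Y$$
$$\hat{f}_2^{(i)} = \left[ S_2 \sum_{\alpha=0}^{i-1} (S_1 S_2)^{\alpha} (I - S_1) \right] Y$$
where we have solved for $\hat{f}_1^{(i)}$ directly from $f_2 = S_2(Y - f_1)$. The middle expression is formal, since it presumes that $S_2$ is invertible.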
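We have convergence if $\|S_1 S_2\| < 1$, since the geometric series $\sum_{\alpha=0}^{\infty} (S_1 S_2)^{\alpha}$ then converges to $(I - S_1 S_2)^{-1}$. In this case, letting $\hat{f}_1^{(i)}, \hat{f}_2^{(i)} \to \hat{f}_1^{(\infty)}, \hat{f}_2^{(\infty)}$:
$$\hat{f}_1^{(\infty)} = Y - S_2^{-1} \hat{f}_2^{(\infty)} = Y - (I - S_1 S_2)^{-1} (I - S_1) Y$$
$$\hat{f}_2^{(\infty)} = S_2 (I - S_1 S_2)^{-1} (I - S_1) Y$$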
We can check this is a solution to the problem, i.e. that $\hat{f}_1^{(i)}$ and $\hat{f}_2^{(i)}$ converge to $f_1$ and $f_2$ correspondingly, by plugging these expressions into the original equations.
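The two-dimensional derivation can also be checked numerically. The sketch below runs the backfitting steps with $\hat{f}_2^{(0)} = 0$ and, at every step $i$, compares the iterates with the partial-sum expressions $Y - \sum_{\alpha=0}^{i-1}(S_1S_2)^{\alpha}(I - S_1)Y$ and $S_2\sum_{\alpha=0}^{i-1}(S_1S_2)^{\alpha}(I - S_1)Y$. The matrices and the vector `Y` are made-up values, picked so that the row sums give $\|S_1 S_2\|_\infty \le 0.56 < 1$.

```python
def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def vadd(u, v):
    return [a + b for a, b in zip(u, v)]


def vsub(u, v):
    return [a - b for a, b in zip(u, v)]


# Hypothetical smoother matrices and response, for illustration only.
S1 = [[0.5, 0.3, 0.0], [0.3, 0.4, 0.1], [0.0, 0.1, 0.6]]
S2 = [[0.4, 0.0, 0.2], [0.0, 0.5, 0.2], [0.2, 0.2, 0.3]]
Y = [1.0, -2.0, 0.5]

f2 = [0.0] * len(Y)                  # f2^(0) = 0
term = vsub(Y, matvec(S1, Y))        # (S1 S2)^0 (I - S1) Y
partial_sum = [0.0] * len(Y)
max_gap = 0.0
for i in range(1, 41):
    # Backfitting steps for iteration i
    f1 = matvec(S1, vsub(Y, f2))
    f2 = matvec(S2, vsub(Y, f1))
    # Partial sum over alpha = 0 .. i-1 of (S1 S2)^alpha (I - S1) Y
    partial_sum = vadd(partial_sum, term)
    term = matvec(S1, matvec(S2, term))
    closed_f1 = vsub(Y, partial_sum)
    closed_f2 = matvec(S2, partial_sum)
    gap = max(abs(a - b) for a, b in zip(f1 + f2, closed_f1 + closed_f2))
    max_gap = max(max_gap, gap)

# At convergence the iterates should satisfy f1 = S1 (Y - f2) as well.
fixed_point_gap = max(
    abs(a - b) for a, b in zip(f1, matvec(S1, vsub(Y, f2)))
)
```

Here `max_gap` measures the disagreement between the iterates and the induction formulas over all 40 steps, and `fixed_point_gap` measures how closely the final iterates satisfy the original pair of equations.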
The choice of when to stop the algorithm is arbitrary and it is hard to know a priori how long reaching a specific convergence threshold will take. Also, the final model depends on the order in which the predictor variables $X_i$ are fit.
Moreover, the solution found by the backfitting procedure is not unique. If $b$ is a vector such that $\hat{S} b = 0$, with $\hat{S}$ as above, and $\hat{f}$ is a solution, then $\hat{f} + \alpha b$ is also a solution for any $\alpha \in \mathbb{R}$. A modification of the backfitting algorithm involving projections onto the eigenspaces of the $S_i$ can remedy this problem.
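A small example makes the non-uniqueness concrete. In the sketch below both smoothers are the same least-squares projection onto a single predictor vector `x` (an extreme, made-up case in which the two predictors coincide), so $b = (x, -x)$ satisfies $\hat{S} b = 0$. Backfitting started from two different initial values of $\hat{f}_2$ then ends at different component estimates with the same sum. The function names and data are assumptions for illustration.

```python
def project(x, v):
    """Least-squares projection of v onto span{x}.

    This is a linear smoother whose eigenvalue along x equals 1.
    """
    coef = sum(a * b for a, b in zip(x, v)) / sum(a * a for a in x)
    return [coef * a for a in x]


def backfit_two(x, y, f2_start, sweeps=50):
    """Two-component backfitting in which both smoothers project onto x."""
    f2 = list(f2_start)
    for _ in range(sweeps):
        f1 = project(x, [yi - b for yi, b in zip(y, f2)])
        f2 = project(x, [yi - a for yi, a in zip(y, f1)])
    return f1, f2


# Made-up data for the demonstration.
x = [-1.0, 0.0, 1.0, 2.0]
y = [-2.1, 0.2, 1.9, 4.1]

f1_a, f2_a = backfit_two(x, y, [0.0] * len(x))            # start at f2 = 0
f1_b, f2_b = backfit_two(x, y, [3.0 * v for v in x])      # start at f2 = 3x

total_a = [a + b for a, b in zip(f1_a, f2_a)]
total_b = [a + b for a, b in zip(f1_b, f2_b)]
```

The two runs agree on the fitted sum $\hat{f}_1 + \hat{f}_2$ but differ in the individual components by a multiple of $b$. Note that $\|S_1 S_2\| = 1$ in this example, so the convergence condition of the two-dimensional derivation does not hold.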
We can modify the backfitting algorithm to make it easier to provide a unique solution. Let $\mathcal{V}_1(S_i)$ be the space spanned by all the eigenvectors of $S_i$ that correspond to eigenvalue 1. Then any $b$ satisfying $\hat{S} b = 0$ has $b_i \in \mathcal{V}_1(S_i)$ for all $i = 1, \dots, p$ and $\sum_{i=1}^{p} b_i = 0$. Now if we take $A$ to be a matrix that projects orthogonally onto $\mathcal{V}_1(S_1) + \dots + \mathcal{V}_1(S_p)$, we get the following modified backfitting algorithm:
1. Initialize $\hat{\alpha} = \frac{1}{n} \sum_{i=1}^{n} y_i$, $\hat{f}_j \equiv 0$ for all $j$, and $\hat{f}_+ = \hat{\alpha} + \hat{f}_1 + \dots + \hat{f}_p$.
2. Repeat until the $\hat{f}_j$ converge:
(a) Regress $Y - \hat{f}_+$ onto the space $\mathcal{V}_1(S_1) + \dots + \mathcal{V}_1(S_p)$, setting $a = A(Y - \hat{f}_+)$.
(b) For each predictor $j$, apply the backfitting update to $(Y - a)$ using the smoothing operator $(I - A_j) S_j$, where $A_j$ projects orthogonally onto $\mathcal{V}_1(S_j)$, yielding new estimates for $\hat{f}_j$.
[1] Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. ISBN 0-387-95284-5.
[2] Härdle, Wolfgang; et al. (June 9, 2004). "Backfitting". Archived from the original on 2015-05-10. Retrieved 2015-08-19.