Backfitting algorithm

<h2 id="algorithm">Algorithm</h2>
Additive models are a class of <a href="/facts/Nonparametric_regression/3MuUSUFV">non-parametric regression</a> models of the form:

Y
          
            i
          
        
        =
        α
        +
        
          ∑
          
            j
            =
            1
          
          
            p
          
        
        
          f
          
            j
          
        
        (
        
          X
          
            i
            j
          
        
        )
        +
        
          ϵ
          
            i
          
        
      
    
    {\displaystyle Y_{i}=\alpha +\sum _{j=1}^{p}f_{j}(X_{ij})+\epsilon _{i}}

where each 
 
 
 
 
 X
 
 1
 
 
 ,
 
 X
 
 2
 
 
 ,
 …
 ,
 
 X
 
 p
 
 
 
 
 {\displaystyle X_{1},X_{2},\ldots ,X_{p}}
 
 is a variable in our 
 
 
 
 p
 
 
 {\displaystyle p}
 
-dimensional predictor 
 
 
 
 X
 
 
 {\displaystyle X}
 
, and 
 
 
 
 Y
 
 
 {\displaystyle Y}
 
 is our outcome variable. 
 
 
 
 ϵ
 
 
 {\displaystyle \epsilon }
 
 represents our inherent error, which is assumed to have mean zero. The 
 
 
 
 
 f
 
 j
 
 
 
 
 {\displaystyle f_{j}}
 
 represent unspecified smooth functions of a single 
 
 
 
 
 X
 
 j
 
 
 
 
 {\displaystyle X_{j}}
 
. Given the flexibility in the 
 
 
 
 
 f
 
 j
 
 
 
 
 {\displaystyle f_{j}}
 
, we typically do not have a unique solution: 
 
 
 
 α
 
 
 {\displaystyle \alpha }
 
 is left unidentifiable as one can add any constants to any of the 
 
 
 
 
 f
 
 j
 
 
 
 
 {\displaystyle f_{j}}
 
 and subtract this value from 
 
 
 
 α
 
 
 {\displaystyle \alpha }
 
. It is common to rectify this by constraining

∑
          
            i
            =
            1
          
          
            N
          
        
        
          f
          
            j
          
        
        (
        
          X
          
            i
            j
          
        
        )
        =
        0
      
    
    {\displaystyle \sum _{i=1}^{N}f_{j}(X_{ij})=0}
  
 for all 
  
    
      
        j
      
    
    {\displaystyle j}

leaving

α
        =
        1
        
          /
        
        N
        
          ∑
          
            i
            =
            1
          
          
            N
          
        
        
          y
          
            i
          
        
      
    
    {\displaystyle \alpha =1/N\sum _{i=1}^{N}y_{i}}

necessarily.
The backfitting algorithm is then:

 Initialize 
 
 
 
 
 
 
 α
 ^
 
 
 
 =
 1
 
 /
 
 N
 
 ∑
 
 i
 =
 1
 
 
 N
 
 
 
 y
 
 i
 
 
 ,
 
 
 
 
 f
 
 j
 
 
 ^
 
 
 
 ≡
 0
 
 
 {\displaystyle {\hat {\alpha }}=1/N\sum _{i=1}^{N}y_{i},{\hat {f_{j}}}\equiv 0}
 
,
 
 
 
 ∀
 j
 
 
 {\displaystyle \forall j}

Do until 
 
 
 
 
 
 
 
 f
 
 j
 
 
 ^
 
 
 
 
 
 {\displaystyle {\hat {f_{j}}}}
 
 converge:
 For each predictor j:
 (a) 
 
 
 
 
 
 
 
 f
 
 j
 
 
 ^
 
 
 
 ←
 
 Smooth
 
 [
 {
 
 y
 
 i
 
 
 −
 
 
 
 α
 ^
 
 
 
 −
 
 ∑
 
 k
 ≠
 j
 
 
 
 
 
 
 f
 
 k
 
 
 ^
 
 
 
 (
 
 x
 
 i
 k
 
 
 )
 
 }
 
 1
 
 
 N
 
 
 ]
 
 
 {\displaystyle {\hat {f_{j}}}\leftarrow {\text{Smooth}}[\lbrace y_{i}-{\hat {\alpha }}-\sum _{k\neq j}{\hat {f_{k}}}(x_{ik})\rbrace _{1}^{N}]}
 
 (backfitting step)
 (b) 
 
 
 
 
 
 
 
 f
 
 j
 
 
 ^
 
 
 
 ←
 
 
 
 
 f
 
 j
 
 
 ^
 
 
 
 −
 1
 
 /
 
 N
 
 ∑
 
 i
 =
 1
 
 
 N
 
 
 
 
 
 
 f
 
 j
 
 
 ^
 
 
 
 (
 
 x
 
 i
 j
 
 
 )
 
 
 {\displaystyle {\hat {f_{j}}}\leftarrow {\hat {f_{j}}}-1/N\sum _{i=1}^{N}{\hat {f_{j}}}(x_{ij})}
 
 (mean centering of estimated function)

where 
 
 
 
 
 Smooth
 
 
 
 {\displaystyle {\text{Smooth}}}
 
 is our smoothing operator. This is typically chosen to be a <a href="/facts/Smoothing_spline/oHeBXmE8">cubic spline smoother</a> but can be any other appropriate fitting operation, such as:

<ul><li>local <a href="/facts/Polynomial_regression/5nL7tBoc">polynomial regression</a></li>
<li><a href="/facts/Kernel_smoothing/JaGaXr6s">kernel smoothing</a> methods</li>
<li>more complex operators, such as surface smoothers for second and higher-order interactions</li></ul>
In theory, step (b) in the algorithm is not needed as the function estimates are constrained to sum to zero. However, due to numerical issues this might become a problem in practice.<a class="footnote-ref" id="fnref:1" href="#fn:1">1</a>

<h2 id="motivation">Motivation</h2>
If we consider the problem of minimizing the expected squared error:

min
        E
        [
        Y
        −
        (
        α
        +
        
          ∑
          
            j
            =
            1
          
          
            p
          
        
        
          f
          
            j
          
        
        (
        
          X
          
            j
          
        
        )
        )
        
          ]
          
            2
          
        
      
    
    {\displaystyle \min E[Y-(\alpha +\sum _{j=1}^{p}f_{j}(X_{j}))]^{2}}

There exists a unique solution by the theory of projections given by:

f
          
            i
          
        
        (
        
          X
          
            i
          
        
        )
        =
        E
        [
        Y
        −
        (
        α
        +
        
          ∑
          
            j
            ≠
            i
          
          
            p
          
        
        
          f
          
            j
          
        
        (
        
          X
          
            j
          
        
        )
        )
        
          |
        
        
          X
          
            i
          
        
        ]
      
    
    {\displaystyle f_{i}(X_{i})=E[Y-(\alpha +\sum _{j\neq i}^{p}f_{j}(X_{j}))|X_{i}]}

for i = 1, 2, ..., p.
This gives the matrix interpretation:

(
            
              
                
                  I
                
                
                  
                    P
                    
                      1
                    
                  
                
                
                  ⋯
                
                
                  
                    P
                    
                      1
                    
                  
                
              
              
                
                  
                    P
                    
                      2
                    
                  
                
                
                  I
                
                
                  ⋯
                
                
                  
                    P
                    
                      2
                    
                  
                
              
              
                
                  ⋮
                
                
                
                  ⋱
                
                
                  ⋮
                
              
              
                
                  
                    P
                    
                      p
                    
                  
                
                
                  ⋯
                
                
                  
                    P
                    
                      p
                    
                  
                
                
                  I
                
              
            
            )
          
        
        
          
            (
            
              
                
                  
                    f
                    
                      1
                    
                  
                  (
                  
                    X
                    
                      1
                    
                  
                  )
                
              
              
                
                  
                    f
                    
                      2
                    
                  
                  (
                  
                    X
                    
                      2
                    
                  
                  )
                
              
              
                
                  ⋮
                
              
              
                
                  
                    f
                    
                      p
                    
                  
                  (
                  
                    X
                    
                      p
                    
                  
                  )
                
              
            
            )
          
        
        =
        
          
            (
            
              
                
                  
                    P
                    
                      1
                    
                  
                  Y
                
              
              
                
                  
                    P
                    
                      2
                    
                  
                  Y
                
              
              
                
                  ⋮
                
              
              
                
                  
                    P
                    
                      p
                    
                  
                  Y
                
              
            
            )
          
        
      
    
    {\displaystyle {\begin{pmatrix}I&P_{1}&\cdots &P_{1}\\P_{2}&I&\cdots &P_{2}\\\vdots &&\ddots &\vdots \\P_{p}&\cdots &P_{p}&I\end{pmatrix}}{\begin{pmatrix}f_{1}(X_{1})\\f_{2}(X_{2})\\\vdots \\f_{p}(X_{p})\end{pmatrix}}={\begin{pmatrix}P_{1}Y\\P_{2}Y\\\vdots \\P_{p}Y\end{pmatrix}}}

where 
 
 
 
 
 P
 
 i
 
 
 (
 ⋅
 )
 =
 E
 (
 ⋅
 
 |
 
 
 X
 
 i
 
 
 )
 
 
 {\displaystyle P_{i}(\cdot )=E(\cdot |X_{i})}
 
. In this context we can imagine a smoother matrix, 
 
 
 
 
 S
 
 i
 
 
 
 
 {\displaystyle S_{i}}
 
, which approximates our 
 
 
 
 
 P
 
 i
 
 
 
 
 {\displaystyle P_{i}}
 
 and gives an estimate, 
 
 
 
 
 S
 
 i
 
 
 Y
 
 
 {\displaystyle S_{i}Y}
 
, of 
 
 
 
 E
 (
 Y
 
 |
 
 X
 )
 
 
 {\displaystyle E(Y|X)}

(
            
              
                
                  I
                
                
                  
                    S
                    
                      1
                    
                  
                
                
                  ⋯
                
                
                  
                    S
                    
                      1
                    
                  
                
              
              
                
                  
                    S
                    
                      2
                    
                  
                
                
                  I
                
                
                  ⋯
                
                
                  
                    S
                    
                      2
                    
                  
                
              
              
                
                  ⋮
                
                
                
                  ⋱
                
                
                  ⋮
                
              
              
                
                  
                    S
                    
                      p
                    
                  
                
                
                  ⋯
                
                
                  
                    S
                    
                      p
                    
                  
                
                
                  I
                
              
            
            )
          
        
        
          
            (
            
              
                
                  
                    f
                    
                      1
                    
                  
                
              
              
                
                  
                    f
                    
                      2
                    
                  
                
              
              
                
                  ⋮
                
              
              
                
                  
                    f
                    
                      p
                    
                  
                
              
            
            )
          
        
        =
        
          
            (
            
              
                
                  
                    S
                    
                      1
                    
                  
                  Y
                
              
              
                
                  
                    S
                    
                      2
                    
                  
                  Y
                
              
              
                
                  ⋮
                
              
              
                
                  
                    S
                    
                      p
                    
                  
                  Y
                
              
            
            )
          
        
      
    
    {\displaystyle {\begin{pmatrix}I&S_{1}&\cdots &S_{1}\\S_{2}&I&\cdots &S_{2}\\\vdots &&\ddots &\vdots \\S_{p}&\cdots &S_{p}&I\end{pmatrix}}{\begin{pmatrix}f_{1}\\f_{2}\\\vdots \\f_{p}\end{pmatrix}}={\begin{pmatrix}S_{1}Y\\S_{2}Y\\\vdots \\S_{p}Y\end{pmatrix}}}

or in abbreviated form

S
              ^
            
          
        
        f
        =
        Q
        Y
        
      
    
    {\displaystyle {\hat {S}}f=QY\,}

An exact solution of this is infeasible to calculate for large np, so the iterative technique of backfitting is used. We take initial guesses 
 
 
 
 
 f
 
 j
 
 
 (
 0
 )
 
 
 
 
 {\displaystyle f_{j}^{(0)}}
 
 and update each 
 
 
 
 
 f
 
 j
 
 
 (
 ℓ
 )
 
 
 
 
 {\displaystyle f_{j}^{(\ell )}}
 
 in turn to be the smoothed fit for the residuals of all the others:

f
                  
                    j
                  
                
                ^
              
            
          
          
            (
            ℓ
            )
          
        
        ←
        
          Smooth
        
        [
        {
        
          y
          
            i
          
        
        −
        
          
            
              α
              ^
            
          
        
        −
        
          ∑
          
            k
            ≠
            j
          
        
        
          
            
              
                f
                
                  k
                
              
              ^
            
          
        
        (
        
          x
          
            i
            k
          
        
        )
        
          }
          
            1
          
          
            N
          
        
        ]
      
    
    {\displaystyle {\hat {f_{j}}}^{(\ell )}\leftarrow {\text{Smooth}}[\lbrace y_{i}-{\hat {\alpha }}-\sum _{k\neq j}{\hat {f_{k}}}(x_{ik})\rbrace _{1}^{N}]}

Looking at the abbreviated form it is easy to see the backfitting algorithm as equivalent to the <a href="/facts/Gauss%E2%80%93Seidel/eK94QeFt">Gauss–Seidel method</a> for linear smoothing operators S.

<h2 id="explicit-derivation-for-two-dimensions">Explicit derivation for two dimensions</h2>
Following,<a class="footnote-ref" id="fnref:2" href="#fn:2">2</a> we can formulate the backfitting algorithm explicitly for the two dimensional case. We have:

f
          
            1
          
        
        =
        
          S
          
            1
          
        
        (
        Y
        −
        
          f
          
            2
          
        
        )
        ,
        
          f
          
            2
          
        
        =
        
          S
          
            2
          
        
        (
        Y
        −
        
          f
          
            1
          
        
        )
      
    
    {\displaystyle f_{1}=S_{1}(Y-f_{2}),f_{2}=S_{2}(Y-f_{1})}

If we denote 
 
 
 
 
 
 
 
 f
 ^
 
 
 
 
 1
 
 
 (
 i
 )
 
 
 
 
 {\displaystyle {\hat {f}}_{1}^{(i)}}
 
 as the estimate of 
 
 
 
 
 f
 
 1
 
 
 
 
 {\displaystyle f_{1}}
 
 in the ith updating step, the backfitting steps are

f
                ^
              
            
          
          
            1
          
          
            (
            i
            )
          
        
        =
        
          S
          
            1
          
        
        [
        Y
        −
        
          
            
              
                f
                ^
              
            
          
          
            2
          
          
            (
            i
            −
            1
            )
          
        
        ]
        ,
        
          
            
              
                f
                ^
              
            
          
          
            2
          
          
            (
            i
            )
          
        
        =
        
          S
          
            2
          
        
        [
        Y
        −
        
          
            
              
                f
                ^
              
            
          
          
            1
          
          
            (
            i
            )
          
        
        ]
      
    
    {\displaystyle {\hat {f}}_{1}^{(i)}=S_{1}[Y-{\hat {f}}_{2}^{(i-1)}],{\hat {f}}_{2}^{(i)}=S_{2}[Y-{\hat {f}}_{1}^{(i)}]}

By induction we get

f
                ^
              
            
          
          
            1
          
          
            (
            i
            )
          
        
        =
        Y
        −
        
          ∑
          
            α
            =
            0
          
          
            i
            −
            1
          
        
        (
        
          S
          
            1
          
        
        
          S
          
            2
          
        
        
          )
          
            α
          
        
        (
        I
        −
        
          S
          
            1
          
        
        )
        Y
        −
        (
        
          S
          
            1
          
        
        
          S
          
            2
          
        
        
          )
          
            i
            −
            1
          
        
        
          S
          
            1
          
        
        
          
            
              
                f
                ^
              
            
          
          
            2
          
          
            (
            0
            )
          
        
      
    
    {\displaystyle {\hat {f}}_{1}^{(i)}=Y-\sum _{\alpha =0}^{i-1}(S_{1}S_{2})^{\alpha }(I-S_{1})Y-(S_{1}S_{2})^{i-1}S_{1}{\hat {f}}_{2}^{(0)}}

and

f
                ^
              
            
          
          
            2
          
          
            (
            i
            )
          
        
        =
        
          S
          
            2
          
        
        
          ∑
          
            α
            =
            0
          
          
            i
            −
            1
          
        
        (
        
          S
          
            1
          
        
        
          S
          
            2
          
        
        
          )
          
            α
          
        
        (
        I
        −
        
          S
          
            1
          
        
        )
        Y
        +
        
          S
          
            2
          
        
        (
        
          S
          
            1
          
        
        
          S
          
            2
          
        
        
          )
          
            i
            −
            1
          
        
        
          S
          
            1
          
        
        
          
            
              
                f
                ^
              
            
          
          
            2
          
          
            (
            0
            )
          
        
      
    
    {\displaystyle {\hat {f}}_{2}^{(i)}=S_{2}\sum _{\alpha =0}^{i-1}(S_{1}S_{2})^{\alpha }(I-S_{1})Y+S_{2}(S_{1}S_{2})^{i-1}S_{1}{\hat {f}}_{2}^{(0)}}

If we set 
 
 
 
 
 
 
 
 f
 ^
 
 
 
 
 2
 
 
 (
 0
 )
 
 
 =
 0
 
 
 {\displaystyle {\hat {f}}_{2}^{(0)}=0}
 
 then we get

f
                ^
              
            
          
          
            1
          
          
            (
            i
            )
          
        
        =
        Y
        −
        
          S
          
            2
          
          
            −
            1
          
        
        
          
            
              
                f
                ^
              
            
          
          
            2
          
          
            (
            i
            )
          
        
        =
        [
        I
        −
        
          ∑
          
            α
            =
            0
          
          
            i
            −
            1
          
        
        (
        
          S
          
            1
          
        
        
          S
          
            2
          
        
        
          )
          
            α
          
        
        (
        I
        −
        
          S
          
            1
          
        
        )
        ]
        Y
      
    
    {\displaystyle {\hat {f}}_{1}^{(i)}=Y-S_{2}^{-1}{\hat {f}}_{2}^{(i)}=[I-\sum _{\alpha =0}^{i-1}(S_{1}S_{2})^{\alpha }(I-S_{1})]Y}

f
                ^
              
            
          
          
            2
          
          
            (
            i
            )
          
        
        =
        [
        
          S
          
            2
          
        
        
          ∑
          
            α
            =
            0
          
          
            i
            −
            1
          
        
        (
        
          S
          
            1
          
        
        
          S
          
            2
          
        
        
          )
          
            α
          
        
        (
        I
        −
        
          S
          
            1
          
        
        )
        ]
        Y
      
    
    {\displaystyle {\hat {f}}_{2}^{(i)}=[S_{2}\sum _{\alpha =0}^{i-1}(S_{1}S_{2})^{\alpha }(I-S_{1})]Y}

Where we have solved for 
 
 
 
 
 
 
 
 f
 ^
 
 
 
 
 1
 
 
 (
 i
 )
 
 
 
 
 {\displaystyle {\hat {f}}_{1}^{(i)}}
 
 by directly plugging out from 
 
 
 
 
 f
 
 2
 
 
 =
 
 S
 
 2
 
 
 (
 Y
 −
 
 f
 
 1
 
 
 )
 
 
 {\displaystyle f_{2}=S_{2}(Y-f_{1})}
 
.
We have convergence if 
 
 
 
 ‖
 
 S
 
 1
 
 
 
 S
 
 2
 
 
 ‖
 <
 1
 
 
 {\displaystyle \|S_{1}S_{2}\|<1}
 
. In this case, letting 
 
 
 
 
 
 
 
 f
 ^
 
 
 
 
 1
 
 
 (
 i
 )
 
 
 ,
 
 
 
 
 f
 ^
 
 
 
 
 2
 
 
 (
 i
 )
 
 
 
 
 →
 
 
 
 
 
 
 
 f
 ^
 
 
 
 
 1
 
 
 (
 ∞
 )
 
 
 ,
 
 
 
 
 f
 ^
 
 
 
 
 2
 
 
 (
 ∞
 )
 
 
 
 
 {\displaystyle {\hat {f}}_{1}^{(i)},{\hat {f}}_{2}^{(i)}{\xrightarrow {}}{\hat {f}}_{1}^{(\infty )},{\hat {f}}_{2}^{(\infty )}}
 
:

f
                ^
              
            
          
          
            1
          
          
            (
            ∞
            )
          
        
        =
        Y
        −
        
          S
          
            2
          
          
            −
            1
          
        
        
          
            
              
                f
                ^
              
            
          
          
            2
          
          
            (
            ∞
            )
          
        
        =
        Y
        −
        (
        I
        −
        
          S
          
            1
          
        
        
          S
          
            2
          
        
        
          )
          
            −
            1
          
        
        (
        I
        −
        
          S
          
            1
          
        
        )
        Y
      
    
    {\displaystyle {\hat {f}}_{1}^{(\infty )}=Y-S_{2}^{-1}{\hat {f}}_{2}^{(\infty )}=Y-(I-S_{1}S_{2})^{-1}(I-S_{1})Y}

f
                ^
              
            
          
          
            2
          
          
            (
            ∞
            )
          
        
        =
        
          S
          
            2
          
        
        (
        I
        −
        
          S
          
            1
          
        
        
          S
          
            2
          
        
        
          )
          
            −
            1
          
        
        (
        I
        −
        
          S
          
            1
          
        
        )
        Y
      
    
    {\displaystyle {\hat {f}}_{2}^{(\infty )}=S_{2}(I-S_{1}S_{2})^{-1}(I-S_{1})Y}

We can check this is a solution to the problem, i.e. that 
 
 
 
 
 
 
 
 f
 ^
 
 
 
 
 1
 
 
 (
 i
 )
 
 
 
 
 {\displaystyle {\hat {f}}_{1}^{(i)}}
 
 and 
 
 
 
 
 
 
 
 f
 ^
 
 
 
 
 2
 
 
 (
 i
 )
 
 
 
 
 {\displaystyle {\hat {f}}_{2}^{(i)}}
 
 converge to 
 
 
 
 
 f
 
 1
 
 
 
 
 {\displaystyle f_{1}}
 
 and 
 
 
 
 
 f
 
 2
 
 
 
 
 {\displaystyle f_{2}}
 
 correspondingly, by plugging these expressions into the original equations.

<h2 id="issues">Issues</h2>
The choice of when to stop the algorithm is arbitrary and it is hard to know a priori how long reaching a specific convergence threshold will take. Also, the final model depends on the order in which the predictor variables 
 
 
 
 
 X
 
 i
 
 
 
 
 {\displaystyle X_{i}}
 
 are fit.
As well, the solution found by the backfitting procedure is non-unique. If 
 
 
 
 b
 
 
 {\displaystyle b}
 
 is a vector such that 
 
 
 
 
 
 
 S
 ^
 
 
 
 b
 =
 0
 
 
 {\displaystyle {\hat {S}}b=0}
 
 from above, then if 
 
 
 
 
 
 
 f
 ^
 
 
 
 
 
 {\displaystyle {\hat {f}}}
 
 is a solution then so is 
 
 
 
 
 
 
 f
 ^
 
 
 
 +
 α
 b
 
 
 {\displaystyle {\hat {f}}+\alpha b}
 
 is also a solution for any 
 
 
 
 α
 ∈
 
 R
 
 
 
 {\displaystyle \alpha \in \mathbb {R} }
 
. A modification of the backfitting algorithm involving projections onto the eigenspace of S can remedy this problem.

<h2 id="modified-algorithm">Modified algorithm</h2>
We can modify the backfitting algorithm to make it easier to provide a unique solution. Let 
 
 
 
 
 
 
 V
 
 
 
 1
 
 
 (
 
 S
 
 i
 
 
 )
 
 
 {\displaystyle {\mathcal {V}}_{1}(S_{i})}
 
 be the space spanned by all the eigenvectors of Si that correspond to eigenvalue 1. Then any b satisfying 
 
 
 
 
 
 
 S
 ^
 
 
 
 b
 =
 0
 
 
 {\displaystyle {\hat {S}}b=0}
 
 has 
 
 
 
 
 b
 
 i
 
 
 ∈
 
 
 
 V
 
 
 
 1
 
 
 (
 
 S
 
 i
 
 
 )
 ∀
 i
 =
 1
 ,
 …
 ,
 p
 
 
 {\displaystyle b_{i}\in {\mathcal {V}}_{1}(S_{i})\forall i=1,\dots ,p}
 
 and 
 
 
 
 
 ∑
 
 i
 =
 1
 
 
 p
 
 
 
 b
 
 i
 
 
 =
 0.
 
 
 {\displaystyle \sum _{i=1}^{p}b_{i}=0.}
 
 Now if we take 
 
 
 
 A
 
 
 {\displaystyle A}
 
 to be a matrix that projects orthogonally onto 
 
 
 
 
 
 
 V
 
 
 
 1
 
 
 (
 
 S
 
 1
 
 
 )
 +
 ⋯
 +
 
 
 
 V
 
 
 
 1
 
 
 (
 
 S
 
 p
 
 
 )
 
 
 {\displaystyle {\mathcal {V}}_{1}(S_{1})+\dots +{\mathcal {V}}_{1}(S_{p})}
 
, we get the following modified backfitting algorithm:

 Initialize 
 
 
 
 
 
 
 α
 ^
 
 
 
 =
 1
 
 /
 
 N
 
 ∑
 
 1
 
 
 N
 
 
 
 y
 
 i
 
 
 ,
 
 
 
 
 f
 
 j
 
 
 ^
 
 
 
 ≡
 0
 
 
 {\displaystyle {\hat {\alpha }}=1/N\sum _{1}^{N}y_{i},{\hat {f_{j}}}\equiv 0}
 
,
 
 
 
 ∀
 i
 ,
 j
 
 
 {\displaystyle \forall i,j}
 
, 
 
 
 
 
 
 
 
 f
 
 +
 
 
 ^
 
 
 
 =
 α
 +
 
 
 
 
 f
 
 1
 
 
 ^
 
 
 
 +
 ⋯
 +
 
 
 
 
 f
 
 p
 
 
 ^
 
 
 
 
 
 {\displaystyle {\hat {f_{+}}}=\alpha +{\hat {f_{1}}}+\dots +{\hat {f_{p}}}}

Do until 
  
    
      
        
          
            
              
                f
                
                  j
                
              
              ^
            
          
        
      
    
    {\displaystyle {\hat {f_{j}}}}
  
 converge:
       Regress 
  
    
      
        y
        −
        
          
            
              
                f
                
                  +
                
              
              ^
            
          
        
      
    
    {\displaystyle y-{\hat {f_{+}}}}
  
 onto the space 
  
    
      
        
          
            
              V
            
          
          
            1
          
        
        (
        
          S
          
            i
          
        
        )
        +
        ⋯
        +
        
          
            
              V
            
          
          
            1
          
        
        (
        
          S
          
            p
          
        
        )
      
    
    {\displaystyle {\mathcal {V}}_{1}(S_{i})+\dots +{\mathcal {V}}_{1}(S_{p})}
  
, setting 
  
    
      
        a
        =
        A
        (
        Y
        −
        
          
            
              
                f
                
                  +
                
              
              ^
            
          
        
        )
      
    
    {\displaystyle a=A(Y-{\hat {f_{+}}})}

For each predictor j:
 Apply backfitting update to 
 
 
 
 (
 Y
 −
 a
 )
 
 
 {\displaystyle (Y-a)}
 
 using the smoothing operator 
 
 
 
 (
 I
 −
 
 A
 
 i
 
 
 )
 
 S
 
 i
 
 
 
 
 {\displaystyle (I-A_{i})S_{i}}
 
, yielding new estimates for 
 
 
 
 
 
 
 
 f
 
 j
 
 
 ^
 
 
 
 
 
 {\displaystyle {\hat {f_{j}}}}

<ul><li>Breiman, L. & Friedman, J. H. (1985). "Estimating optimal transformations for multiple regression and correlations (with discussion)". Journal of the American Statistical Association. 80 (391): 580–619. <a href="/facts/Doi_(identifier)/muM9Etpq">doi</a>:<a href="https://doi.org/10.2307%2F2288473">10.2307/2288473</a>. <a href="/facts/JSTOR_(identifier)/YTeVmaJ7">JSTOR</a> <a href="https://www.jstor.org/stable/2288473">2288473</a>.</li>
<li>Hastie, T. J. & Tibshirani, R. J. (1990). "Generalized Additive Models". Monographs on Statistics and Applied Probability. 43.</li>
<li>Härdle, Wolfgang; et al. (June 9, 2004). <a href="https://web.archive.org/web/20150510225240/http://sfb649.wiwi.hu-berlin.de/fedc_homepage/xplore/ebooks/html/spm/spmhtmlnode37.html">"Backfitting"</a>. Archived from <a href="http://sfb649.wiwi.hu-berlin.de/fedc_homepage/xplore/ebooks/html/spm/spmhtmlnode37.html">the original</a> on 2015-05-10. Retrieved 2015-08-19.</li></ul>
<h2 id="external-links">External links</h2>
<ul><li><a href="https://archive.today/20121211125906/http://rss.acs.unt.edu/Rdoc/library/gam/html/gam.html">R Package for GAM backfitting</a></li>
<li><a href="https://web.archive.org/web/20061121130651/http://pbil.univ-lyon1.fr/library/mda/html/bruto.html">R Package for BRUTO backfitting</a></li></ul>

<h2 id="references">References</h2>

<ol>
<li id="fn:1">Hastie, Trevor, Robert Tibshirani and Jerome Friedman (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, ISBN 0-387-95284-5. <a href="/wiki/Trevor_Hastie" target="_blank">/wiki/Trevor_Hastie</a> <a href="#fnref:1" class="footnote-back-ref">↩</a></li>
<li id="fn:2">Härdle, Wolfgang; et al. (June 9, 2004). "Backfitting". Archived from the original on 2015-05-10. Retrieved 2015-08-19. <a href="#fnref:2" class="footnote-back-ref">↩</a></li>
</ol>

Backfitting algorithm open-in-new

Backfitting algorithm