Hyper basis function network

<h2 id="network-architecture">Network Architecture</h2>
<p>A typical HyperBF network consists of a real input vector \(x\in \mathbb {R} ^{n}\), a hidden layer of activation functions, and a linear output layer. The output of the network is a scalar function of the input vector, \(\phi :\mathbb {R} ^{n}\to \mathbb {R} \), given by
</p>

\[\phi (x)=\sum _{j=1}^{N}a_{j}\rho _{j}(\|x-\mu _{j}\|)\]

<p>where \(N\) is the number of neurons in the hidden layer, and \(\mu _{j}\) and \(a_{j}\) are the center and weight of neuron \(j\). The <a href="/facts/Activation_function/S4NImL6L">activation function</a> \(\rho _{j}(\|x-\mu _{j}\|)\) in the HyperBF network takes the following form
</p>

\[\rho _{j}(\|x-\mu _{j}\|)=e^{-(x-\mu _{j})^{T}R_{j}(x-\mu _{j})}\]

<p>where \(R_{j}\) is a positive definite \(d\times d\) matrix, with \(d=n\) the dimension of the input vector. Depending on the application, the following types of matrices \(R_{j}\) are usually considered:<a class="footnote-ref" id="fnref:3" href="#fn:3"><sup>3</sup></a>
</p>
<ul><li>\(R_{j}={\frac {1}{2\sigma ^{2}}}\mathbb {I} _{d\times d}\), where \(\sigma >0\). This case corresponds to the regular RBF network.</li>
<li>\(R_{j}={\frac {1}{2\sigma _{j}^{2}}}\mathbb {I} _{d\times d}\), where \(\sigma _{j}>0\). In this case, the basis functions are radially symmetric, but each has its own width.</li>
<li>\(R_{j}=\operatorname {diag} \left({\frac {1}{2\sigma _{j1}^{2}}},\ldots ,{\frac {1}{2\sigma _{jd}^{2}}}\right)\), where \(\sigma _{ji}>0\). Each basis function has axis-aligned elliptical level sets, with a separate width for every input dimension.</li>
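<li>\(R_{j}\) is a general positive definite matrix that is not diagonal. The level sets of the basis function are then ellipsoids with arbitrary orientation.</li></ul>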
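<p>The four cases can be illustrated with a short NumPy sketch. It is not taken from the cited references: the function name <code>hyperbf_forward</code>, the example centers, weights, and widths are all assumptions chosen for illustration, and the activation is taken to be the Gaussian form \(e^{-(x-\mu _{j})^{T}R_{j}(x-\mu _{j})}\).</p>

```python
import numpy as np

def hyperbf_forward(x, centers, weights, shapes):
    """Evaluate phi(x) = sum_j a_j * exp(-(x - mu_j)^T R_j (x - mu_j)).

    x: (n,) input, centers: (N, n) rows mu_j,
    weights: (N,) entries a_j, shapes: (N, n, n) matrices R_j.
    """
    diffs = x - centers  # (N, n): row j holds x - mu_j
    # Quadratic form (x - mu_j)^T R_j (x - mu_j) for every neuron j.
    quad = np.einsum("ji,jik,jk->j", diffs, shapes, diffs)
    return float(weights @ np.exp(-quad))

n = 2
centers = np.array([[0.0, 0.0], [1.0, 1.0]])  # illustrative mu_j
weights = np.array([1.0, 2.0])                # illustrative a_j

# Case 1: one shared width sigma (regular RBF network).
sigma = 1.0
shared_width = np.stack([np.eye(n) / (2 * sigma**2)] * 2)

# Case 2: one width sigma_j per neuron.
per_neuron = np.stack([np.eye(n) / (2 * s**2) for s in (0.5, 2.0)])

# Case 3: one width sigma_ji per neuron and per input dimension.
per_axis = np.stack(
    [np.diag(1.0 / (2 * np.array(s) ** 2)) for s in ([0.5, 1.0], [1.0, 2.0])]
)

# Case 4: full positive definite matrix, built here as A^T A with A invertible.
A = np.array([[1.0, 0.5], [0.0, 1.0]])
full = np.stack([A.T @ A, A.T @ A])

x = np.array([0.0, 0.0])
for name, shapes in [("shared", shared_width), ("per neuron", per_neuron),
                     ("per axis", per_axis), ("full", full)]:
    print(name, hyperbf_forward(x, centers, weights, shapes))
```

<p>Moving down the list, each case adds free parameters per neuron: one shared scalar, one scalar per neuron, \(d\) scalars per neuron, and finally a full symmetric matrix per neuron.</p>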
<h2 id="training">Training</h2>
<p>Training a HyperBF network involves estimating the weights \(a_{j}\), the shapes \(R_{j}\), and the centers \(\mu _{j}\) of the neurons. Poggio and Girosi (1990) describe a training method with moving centers and adaptable neuron shapes. An outline of the method is given below.
</p><p>Consider the quadratic loss of the network \(H[\phi ^{*}]=\sum _{i=1}^{N}(y_{i}-\phi ^{*}(x_{i}))^{2}\), where the sum runs over the training pairs \((x_{i},y_{i})\). The following conditions must be satisfied at the optimum:
</p>

\[{\frac {\partial H[\phi ^{*}]}{\partial a_{j}}}=0,\quad {\frac {\partial H[\phi ^{*}]}{\partial \mu _{j}}}=0,\quad {\frac {\partial H[\phi ^{*}]}{\partial W}}=0\]

<p>where \(R_{j}=W^{T}W\). In the gradient descent method, the values of \(a_{j},\mu _{j},W\) that minimize \(H[\phi ^{*}]\) can then be found as a stable fixed point of the following dynamical system:
</p>

\[{\dot {a}}_{j}=-\omega {\frac {\partial H[\phi ^{*}]}{\partial a_{j}}},\quad {\dot {\mu }}_{j}=-\omega {\frac {\partial H[\phi ^{*}]}{\partial \mu _{j}}},\quad {\dot {W}}=-\omega {\frac {\partial H[\phi ^{*}]}{\partial W}}\]

<p>where \(\omega \) determines the rate of convergence.
</p><p>Overall, training HyperBF networks can be computationally challenging. Moreover, the many degrees of freedom of a HyperBF network can lead to overfitting and poor generalization. However, HyperBF networks have the important advantage that a small number of neurons is enough to learn complex functions.<a class="footnote-ref" id="fnref:4" href="#fn:4"><sup>4</sup></a>
</p>
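<p>A minimal sketch of this gradient scheme, assuming a Gaussian activation \(e^{-(x-\mu _{j})^{T}W^{T}W(x-\mu _{j})}\) with a single shared \(W\), is given below. It replaces the continuous-time system by explicit Euler steps, so \(\omega \) plays the role of a step size. Writing \(e_{i}=y_{i}-\phi ^{*}(x_{i})\), \(d_{ij}=x_{i}-\mu _{j}\) and \(g_{ij}\) for the activation of neuron \(j\) on sample \(i\), the chain rule gives \(\partial H/\partial a_{j}=-2\sum _{i}e_{i}g_{ij}\), \(\partial H/\partial \mu _{j}=-4\sum _{i}e_{i}a_{j}g_{ij}W^{T}Wd_{ij}\) and \(\partial H/\partial W=4W\sum _{i,j}e_{i}a_{j}g_{ij}d_{ij}d_{ij}^{T}\). The function names, the toy data, and the values of the step size and iteration count are assumptions made for illustration, not part of the method of Poggio and Girosi.</p>

```python
import numpy as np

def forward(X, a, mu, W):
    """Return phi(x_i) for all samples, plus the intermediates d_ij and g_ij."""
    D = X[:, None, :] - mu[None, :, :]   # (M, N, n): d_ij = x_i - mu_j
    WD = D @ W.T                         # (M, N, n): W d_ij
    G = np.exp(-np.sum(WD**2, axis=2))   # (M, N): g_ij = exp(-d_ij^T W^T W d_ij)
    return G @ a, D, G

def loss(X, y, a, mu, W):
    """Quadratic loss H = sum_i (y_i - phi(x_i))^2."""
    phi, _, _ = forward(X, a, mu, W)
    return float(np.sum((y - phi) ** 2))

def gradients(X, y, a, mu, W):
    """Analytic gradients of H with respect to a, mu and W (with R = W^T W)."""
    phi, D, G = forward(X, a, mu, W)
    e = y - phi                          # (M,): residuals e_i
    C = e[:, None] * G * a[None, :]      # (M, N): e_i * a_j * g_ij
    R = W.T @ W
    grad_a = -2.0 * (G.T @ e)
    grad_mu = -4.0 * np.einsum("ij,ijk->jk", C, D @ R)
    grad_W = 4.0 * W @ np.einsum("ij,ijk,ijl->kl", C, D, D)
    return grad_a, grad_mu, grad_W

def train(X, y, a, mu, W, omega=0.002, steps=2000):
    """Explicit Euler steps of the gradient flow; omega acts as the step size."""
    a, mu, W = a.copy(), mu.copy(), W.copy()
    for _ in range(steps):
        g_a, g_mu, g_W = gradients(X, y, a, mu, W)
        a -= omega * g_a
        mu -= omega * g_mu
        W -= omega * g_W
    return a, mu, W

# Toy data (an assumption for illustration): fit sin(x) on [-2, 2] with N = 3 neurons.
X = np.linspace(-2.0, 2.0, 30).reshape(-1, 1)
y = np.sin(X[:, 0])
a0 = np.full(3, 0.1)
mu0 = np.array([[-1.0], [0.0], [1.0]])
W0 = np.eye(1)

a1, mu1, W1 = train(X, y, a0, mu0, W0)
print("loss before:", loss(X, y, a0, mu0, W0))
print("loss after: ", loss(X, y, a1, mu1, W1))
```

<p>A fixed step size is the simplest discretization. If \(\omega \) is chosen too large, the iteration can diverge, which is one practical face of the computational difficulty mentioned above.</p>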

<h2 id="references">References</h2>

<ol>
<li id="fn:1"><p>T. Poggio and F. Girosi (1990). "Networks for Approximation and Learning". Proc. IEEE 78(9):1481–1497. <a href="#fnref:1" class="footnote-back-ref">↩</a></p></li>
<li id="fn:2"><p>R.N. Mahdi and E.C. Rouchka (2011). "Reduced HyperBF Networks: Regularization by Explicit Complexity Reduction and Scaled Rprop-Based Training". IEEE Transactions on Neural Networks 22:673–686. <a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5733426" target="_blank">https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5733426</a> <a href="#fnref:2" class="footnote-back-ref">↩</a></p></li>
<li id="fn:3"><p>F. Schwenker, H.A. Kestler and G. Palm (2001). "Three Learning Phases for Radial-Basis-Function Networks". Neural Networks 14:439–458. <a href="#fnref:3" class="footnote-back-ref">↩</a></p></li>
<li id="fn:4"><p>R.N. Mahdi and E.C. Rouchka (2011). "Reduced HyperBF Networks: Regularization by Explicit Complexity Reduction and Scaled Rprop-Based Training". IEEE Transactions on Neural Networks 22:673–686. <a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5733426" target="_blank">https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5733426</a> <a href="#fnref:4" class="footnote-back-ref">↩</a></p></li>
</ol>
