Ward's method

<h2 id="the-minimum-variance-criterion">The minimum variance criterion</h2>
<p>Ward's minimum variance criterion minimizes the total within-cluster variance. To implement this method, at each step find the pair of clusters whose merger leads to the minimum increase in total within-cluster variance. This increase is a weighted squared distance between the cluster centers. At the initial step, all clusters are singletons (clusters containing a single point). To apply a <a href="/facts/Recursive_algorithm/2uXo18Gv">recursive algorithm</a> under this <a href="/facts/Objective_function/xv5ozuhl">objective function</a>, the initial distance between individual objects must be (proportional to) squared <a href="/facts/Euclidean_distance/9qDoQKQe">Euclidean distance</a>.
</p><p>The initial cluster distances in Ward's minimum variance method are therefore defined to be the squared Euclidean distance between points:
</p>

<p>\[ d_{ij} = d(\{X_i\}, \{X_j\}) = \|X_i - X_j\|^2. \]</p>

<p>Note: In software that implements Ward's method, it is important to check whether the function arguments should specify Euclidean distances or squared Euclidean distances.
</p>
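<p>As a minimal sketch of the initial step, the singleton distances can be computed directly from the definition above. The helper names <code>squared_euclidean</code> and <code>initial_ward_distances</code> and the sample points are illustrative assumptions, not part of any particular library.</p>

```python
from itertools import combinations


def squared_euclidean(x, y):
    """Squared Euclidean distance ||x - y||^2 between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def initial_ward_distances(points):
    """Map each index pair (i, j) with i < j to d_ij = ||X_i - X_j||^2."""
    return {
        (i, j): squared_euclidean(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
    }


# Hypothetical sample data: three collinear points in the plane.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
d = initial_ward_distances(points)
```

<p>For these points the plain Euclidean distance between the first two is 5, while the value stored here is 25. That gap is exactly what the note about function arguments warns about.</p>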
<h2 id="lancewilliams-algorithms">Lance–Williams algorithms</h2>
<p>Ward's minimum variance method can be defined and implemented recursively by a Lance–Williams algorithm. The Lance–Williams algorithms are an infinite family of agglomerative hierarchical clustering algorithms which are represented by a recursive formula for updating cluster distances at each step (each time a pair of clusters is merged). At each step, it is necessary to optimize the objective function (find the optimal pair of clusters to merge). The recursive formula simplifies finding the optimal pair.
</p><p>Suppose that clusters \(C_i\) and \(C_j\) were next to be merged. At this point all of the current pairwise cluster distances are known. The recursive formula gives the updated cluster distances following the pending merge of clusters \(C_i\) and \(C_j\). Let
</p>
<ul><li>\(d_{ij}\), \(d_{ik}\), and \(d_{jk}\) be the pairwise distances between clusters \(C_i\), \(C_j\), and \(C_k\), respectively,</li>
<li>\(d_{(ij)k}\) be the distance between the new cluster \(C_i \cup C_j\) and \(C_k\).</li></ul>
<p>An algorithm belongs to the Lance–Williams family if the updated cluster distance \(d_{(ij)k}\) can be computed recursively by
</p>

<p>\[ d_{(ij)k} = \alpha_i d_{ik} + \alpha_j d_{jk} + \beta d_{ij} + \gamma |d_{ik} - d_{jk}|, \]</p>

<p>where \(\alpha_i\), \(\alpha_j\), \(\beta\), and \(\gamma\) are parameters, which may depend on cluster sizes, that together with the cluster distance function \(d_{ij}\) determine the clustering algorithm. Several standard clustering algorithms such as <a href="/facts/Single-linkage_clustering/suwgF2yb">single linkage</a>, <a href="/facts/Complete-linkage_clustering/Ir4PLgKm">complete linkage</a>, and group average method have a recursive formula of the above type. A table of parameters for standard methods is given by several authors.<a class="footnote-ref" id="fnref:2" href="#fn:2"><sup>2</sup></a><a class="footnote-ref" id="fnref:3" href="#fn:3"><sup>3</sup></a><a class="footnote-ref" id="fnref:4" href="#fn:4"><sup>4</sup></a>
</p><p>Ward's minimum variance method can be implemented by the Lance–Williams formula. For disjoint clusters \(C_i\), \(C_j\), and \(C_k\) with sizes \(n_i\), \(n_j\), and \(n_k\) respectively:
</p>

<p>\[ d(C_i \cup C_j, C_k) = \frac{n_i + n_k}{n_i + n_j + n_k}\, d(C_i, C_k) + \frac{n_j + n_k}{n_i + n_j + n_k}\, d(C_j, C_k) - \frac{n_k}{n_i + n_j + n_k}\, d(C_i, C_j). \]</p>
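<p>The recursion can be checked numerically on a small example. The sketch below assumes the closed form \(d(A, B) = \frac{2\,n_A n_B}{n_A + n_B}\,\|c_A - c_B\|^2\), with \(c_A\) and \(c_B\) the cluster centroids. This form reduces to the squared Euclidean distance for singletons. The function names and sample clusters are illustrative assumptions.</p>

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(p[t] for p in cluster) / n for t in range(len(cluster[0])))


def ward_direct(a, b):
    """Assumed closed form: 2 * |A||B| / (|A| + |B|) * ||c_A - c_B||^2."""
    na, nb = len(a), len(b)
    return 2.0 * na * nb / (na + nb) * sq_dist(centroid(a), centroid(b))


def ward_update(d_ik, d_jk, d_ij, n_i, n_j, n_k):
    """Distance from the merged cluster C_i u C_j to C_k via the recursion."""
    total = n_i + n_j + n_k
    return ((n_i + n_k) * d_ik + (n_j + n_k) * d_jk - n_k * d_ij) / total


# Hypothetical disjoint clusters of sizes 2, 1 and 3.
c_i = [(0.0, 0.0), (1.0, 0.0)]
c_j = [(4.0, 3.0)]
c_k = [(0.0, 5.0), (2.0, 6.0), (1.0, 7.0)]

recursive = ward_update(
    ward_direct(c_i, c_k), ward_direct(c_j, c_k), ward_direct(c_i, c_j),
    len(c_i), len(c_j), len(c_k),
)
direct = ward_direct(c_i + c_j, c_k)
```

<p>Here <code>recursive</code> and <code>direct</code> agree up to floating-point rounding, so the update needs only the three known distances and the cluster sizes, not the points themselves.</p>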

<p>Hence Ward's method can be implemented as a Lance–Williams algorithm with
</p>

<p>\[ \alpha_i = \frac{n_i + n_k}{n_i + n_j + n_k}, \qquad \alpha_j = \frac{n_j + n_k}{n_i + n_j + n_k}, \qquad \beta = \frac{-n_k}{n_i + n_j + n_k}, \qquad \gamma = 0. \]</p>
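<p>Putting the pieces together, a naive agglomerative loop with these parameters might look like the following. This is a sketch under stated assumptions rather than a reference implementation: <code>ward_linkage</code> is a hypothetical name, ties are broken by iteration order, and the pair search is a plain scan with no attempt at efficiency.</p>

```python
from itertools import combinations


def ward_linkage(points):
    """Naive agglomerative clustering using the Ward Lance-Williams update.

    Returns a list of merges as (members_a, members_b, distance) tuples.
    """
    clusters = {i: [i] for i in range(len(points))}  # cluster id -> member indices
    # Initial cluster distances: squared Euclidean distance between points.
    dist = {
        frozenset((i, j)): sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
        for i, j in combinations(clusters, 2)
    }
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        # Find the closest pair under the current cluster distances.
        i, j = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        d_ij = dist[frozenset((i, j))]
        n_i, n_j = len(clusters[i]), len(clusters[j])
        # Lance-Williams update with alpha_i, alpha_j, beta as above, gamma = 0.
        for k in clusters:
            if k in (i, j):
                continue
            n_k = len(clusters[k])
            total = n_i + n_j + n_k
            dist[frozenset((next_id, k))] = (
                (n_i + n_k) * dist[frozenset((i, k))]
                + (n_j + n_k) * dist[frozenset((j, k))]
                - n_k * d_ij
            ) / total
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d_ij))
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return merges


# Hypothetical data: two tight pairs of points, five units apart.
merges = ward_linkage([(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)])
for members_a, members_b, distance in merges:
    print(members_a, members_b, distance)
```

<p>On this data the two tight pairs are merged first, each at distance 1, and the final merge joins the two resulting clusters. The point coordinates are used only to build the initial distances; every later distance comes from the recursive formula.</p>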

<h2 id="variations">Variations</h2>
<p>The popularity of Ward's method has led to variations of it. For instance, Ward<sub>p</sub> introduces cluster-specific feature weights, following the intuitive idea that features can have different degrees of relevance in different clusters.<a class="footnote-ref" id="fnref:5" href="#fn:5"><sup>5</sup></a>
</p>

<h2 id="further-reading">Further reading</h2>
<ul><li>Everitt, B. S., Landau, S. and Leese, M. (2001), <i>Cluster Analysis, 4th Edition</i>, Oxford University Press, Inc., New York; Arnold, London. <a href="/facts/ISBN_(identifier)/15AdSPa9">ISBN</a> 0340761199</li>
<li>Hartigan, J. A. (1975), <i>Clustering Algorithms</i>, New York: Wiley.</li>
<li><a href="/facts/Anil_K._Jain_(computer_scientist%2c_born_1948)/3zVJJa8r">Jain, A. K.</a> and Dubes, R. C. (1988), <i>Algorithms for Clustering Data</i>, New Jersey: Prentice–Hall.</li>
<li>Kaufman, L. and Rousseeuw, P. J. (1990), <i>Finding Groups in Data: An Introduction to Cluster Analysis</i>, New York: Wiley.</li></ul>

<h2 id="references">References</h2>

<ol>
<li id="fn:1"><p>Ward, J. H., Jr. (1963), "Hierarchical Grouping to Optimize an Objective Function", <a href="/wiki/Journal_of_the_American_Statistical_Association" target="_blank">Journal of the American Statistical Association</a>, 58, 236–244. <a href="#fnref:1" class="footnote-back-ref">↩</a></p></li>
<li id="fn:2"><p>Cormack, R. M. (1971), "A Review of Classification", <a href="/wiki/Journal_of_the_Royal_Statistical_Society" target="_blank">Journal of the Royal Statistical Society</a>, Series A, 134(3), 321–367. <a href="#fnref:2" class="footnote-back-ref">↩</a></p></li>
<li id="fn:3"><p>Gordon, A. D. (1999), Classification, 2nd Edition, Chapman and Hall, Boca Raton. <a href="#fnref:3" class="footnote-back-ref">↩</a></p></li>
<li id="fn:4"><p>Milligan, G. W. (1979), "Ultrametric Hierarchical Clustering Algorithms", Psychometrika, 44(3), 343–346. <a href="#fnref:4" class="footnote-back-ref">↩</a></p></li>
<li id="fn:5"><p>R.C. de Amorim (2015). "Feature Relevance in Ward's Hierarchical Clustering Using the Lp Norm" (PDF). Journal of Classification. 32 (1): 46–62. doi:10.1007/s00357-015-9167-1. S2CID 18099326. <a href="http://repository.essex.ac.uk/20365/1/MW_Ward.pdf" target="_blank">http://repository.essex.ac.uk/20365/1/MW_Ward.pdf</a> <a href="#fnref:5" class="footnote-back-ref">↩</a></p></li>
</ol>
