The algorithm proceeds as follows, using two thresholds T₁ (the loose distance) and T₂ (the tight distance), where T₁ > T₂:[2][3]

1. Begin with the set of data points to be clustered.
2. Remove a point from the set, beginning a new "canopy" containing this point.
3. For each point left in the set, assign it to the new canopy if its distance to the first point of the canopy is less than the loose distance T₁.
4. If the distance of the point is additionally less than the tight distance T₂, remove it from the original set.
5. Repeat from step 2 until there are no more data points in the set to cluster.
An important note is that individual data points may be part of several canopies. As an additional speed-up, an approximate and fast distance metric can be used for step 3, while a more accurate and slow distance metric can be used for step 4.
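The canopy-creation procedure can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the choice of each canopy center and the distance metric in the usage example are arbitrary.

```python
def canopy_clustering(points, t1, t2, distance):
    """Group points into (possibly overlapping) canopies.

    t1 is the loose distance, t2 the tight distance, with t1 > t2.
    A point within t1 of a canopy center joins that canopy; a point
    within t2 is also removed from further consideration, so only
    points in the t2..t1 ring can appear in several canopies.
    """
    assert t1 > t2
    remaining = list(points)
    canopies = []
    while remaining:
        # the choice of center is arbitrary; here we take the first point
        center = remaining.pop(0)
        canopy = [center]
        still_available = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:                  # loose threshold: join the canopy
                canopy.append(p)
            if d >= t2:                 # outside tight threshold: keep for later canopies
                still_available.append(p)
        remaining = still_available
        canopies.append(canopy)
    return canopies

# usage: 1-D points with absolute difference as the distance metric
groups = canopy_clustering([0.0, 0.1, 0.2, 5.0, 5.1], t1=1.0, t2=0.5,
                           distance=lambda a, b: abs(a - b))
```

With these thresholds the two tight clusters around 0 and 5 end up in separate canopies.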
Since the algorithm uses distance functions and requires the specification of distance thresholds, its applicability to high-dimensional data is limited by the curse of dimensionality. Only when a cheap and approximate (for example, low-dimensional) distance function is available will the produced canopies preserve the clusters produced by K-means.
Its benefits include:

- The number of instances of training data that must be compared at each step is reduced.
- There is some evidence that the resulting clusters are improved.
McCallum, A.; Nigam, K.; Ungar, L. H. (2000). "Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching". Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 169–178. doi:10.1145/347090.347123. http://www.kamalnigam.com/papers/canopy-kdd00.pdf
"The Canopies Algorithm". courses.cs.washington.edu. Retrieved 2014-09-06. http://courses.cs.washington.edu/courses/cse590q/04au/slides/DannyMcCallumKDD00.ppt
Mahout description of Canopy-Clustering. Retrieved 2022-07-02. https://mahout.apache.org/docs/latest/algorithms/clustering/canopy/