Assume the data have been clustered via any technique, such as k-medoids or k-means, into $k$ clusters.
For data point $i \in C_I$ (data point $i$ in the cluster $C_I$), let

$$a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I,\, j \neq i} d(i, j)$$

be the mean distance between $i$ and all other data points in the same cluster, where $|C_I|$ is the number of points belonging to cluster $C_I$, and $d(i, j)$ is the distance between data points $i$ and $j$ in the cluster $C_I$ (we divide by $|C_I| - 1$ because we do not include the distance $d(i, i)$ in the sum). We can interpret $a(i)$ as a measure of how well $i$ is assigned to its cluster (the smaller the value, the better the assignment).
We then define the mean dissimilarity of point $i$ to some cluster $C_J$ as the mean of the distance from $i$ to all points in $C_J$ (where $C_J \neq C_I$).
For each data point $i \in C_I$, we now define

$$b(i) = \min_{J \neq I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i, j)$$

to be the smallest (hence the $\min$ operator in the formula) mean distance of $i$ to all points in any other cluster (i.e., in any cluster of which $i$ is not a member). The cluster with this smallest mean dissimilarity is said to be the "neighboring cluster" of $i$ because it is the next-best-fit cluster for point $i$.
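The following is a minimal numpy sketch (not part of the original text) of these two quantities, assuming a precomputed pairwise distance matrix `D` and an integer array `labels` of cluster assignments; all names are illustrative.

```python
import numpy as np

def mean_intra_and_inter_distances(D, labels):
    """For each point i, compute a(i) (mean distance to the other members of its
    own cluster) and b(i) (smallest mean distance to the points of any other
    cluster), given a full pairwise distance matrix D and cluster labels."""
    n = len(labels)
    clusters = np.unique(labels)
    a = np.zeros(n)
    b = np.full(n, np.inf)
    for i in range(n):
        own = labels[i]
        for c in clusters:
            members = np.where(labels == c)[0]
            if c == own:
                if len(members) > 1:
                    # d(i, i) = 0 contributes nothing, so dividing by |C_I| - 1
                    # excludes it from the mean
                    a[i] = D[i, members].sum() / (len(members) - 1)
                else:
                    a[i] = np.nan  # a(i) is undefined for singleton clusters
            else:
                b[i] = min(b[i], D[i, members].mean())
    return a, b
```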
We now define a silhouette (value) of one data point $i$

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}, \quad \text{if } |C_I| > 1,$$

and

$$s(i) = 0, \quad \text{if } |C_I| = 1,$$

which can also be written as:

$$s(i) = \begin{cases} 1 - a(i)/b(i), & \text{if } a(i) < b(i) \\ 0, & \text{if } a(i) = b(i) \\ b(i)/a(i) - 1, & \text{if } a(i) > b(i). \end{cases}$$

From the above definition it is clear that

$$-1 \le s(i) \le 1.$$
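Continuing the sketch above, the per-point silhouette can be combined from $a(i)$ and $b(i)$ as follows; the singleton-cluster convention discussed next is applied where $a(i)$ is undefined. This is an illustrative sketch, not a reference implementation.

```python
import numpy as np

def silhouette_values(a, b):
    """Combine a(i) and b(i) into s(i) = (b - a) / max(a, b); points in
    singleton clusters (a undefined, i.e. NaN) receive s(i) = 0 by convention."""
    s = np.zeros(len(a))
    for i in range(len(a)):
        if np.isnan(a[i]):              # |C_I| = 1: set s(i) = 0
            s[i] = 0.0
        else:
            s[i] = (b[i] - a[i]) / max(a[i], b[i])
    return s                             # every s(i) lies in [-1, 1]
```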
Note that $a(i)$ is not clearly defined for clusters with size = 1, in which case we set $s(i) = 0$. This choice is arbitrary, but neutral in the sense that it is at the midpoint of the bounds, -1 and 1.[4]
For $s(i)$ to be close to 1 we require $a(i) \ll b(i)$. As $a(i)$ is a measure of how dissimilar $i$ is to its own cluster, a small value means it is well matched. Furthermore, a large $b(i)$ implies that $i$ is badly matched to its neighbouring cluster. Thus an $s(i)$ close to 1 means that the data point is appropriately clustered. If $s(i)$ is close to -1, then by the same logic we see that $i$ would be more appropriately placed in its neighbouring cluster. An $s(i)$ near zero means that the datum is on the border of two natural clusters.
The mean $s(i)$ over all points of a cluster is a measure of how tightly grouped all the points in the cluster are. Thus the mean $s(i)$ over all data of the entire dataset is a measure of how appropriately the data have been clustered. If there are too many or too few clusters, as may occur when a poor choice of $k$ is used in the clustering algorithm (e.g., k-means), some of the clusters will typically display much narrower silhouettes than the rest. Thus silhouette plots and means may be used to determine the natural number of clusters within a dataset. One can also increase the likelihood of the silhouette being maximized at the correct number of clusters by re-scaling the data using feature weights that are cluster specific.[5]
Kaufman et al. introduced the term silhouette coefficient for the maximum value of the mean $s(i)$ over all data of the entire dataset,[6] i.e.,

$$SC = \max_k \tilde{s}(k),$$

where $\tilde{s}(k)$ represents the mean $s(i)$ over all data of the entire dataset for a specific number of clusters $k$.
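As an illustration of this selection procedure, the following hedged sketch uses scikit-learn (assumed to be available) to cluster a data matrix `X` with k-means for a range of $k$, record the mean silhouette $\tilde{s}(k)$, and return the silhouette coefficient as the maximum over $k$; the function name and parameters are illustrative, not standard API.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_coefficient(X, k_range=range(2, 11), random_state=0):
    """Cluster X with k-means for each k, record the mean silhouette s~(k),
    and return SC = max_k s~(k) together with the k at which it is attained."""
    mean_s = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit_predict(X)
        mean_s[k] = silhouette_score(X, labels)   # mean s(i) over all points
    best_k = max(mean_s, key=mean_s.get)
    return mean_s[best_k], best_k, mean_s
```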
Computing the silhouette coefficient needs all $\mathcal{O}(N^2)$ pairwise distances, making this evaluation much more costly than clustering with k-means. For a clustering with centers $\mu_{C_I}$ for each cluster $C_I$, we can use the following simplified silhouette for each point $i \in C_I$ instead, which can be computed using only $\mathcal{O}(Nk)$ distances:

$$a'(i) = d(i, \mu_{C_I}) \quad \text{and} \quad b'(i) = \min_{C_J \neq C_I} d(i, \mu_{C_J}),$$

which has the additional benefit that $a'(i)$ is always defined. We then define accordingly the simplified silhouette and simplified silhouette coefficient[7]

$$s'(i) = \frac{b'(i) - a'(i)}{\max\{a'(i), b'(i)\}} \quad \text{and} \quad SC' = \max_k \tilde{s}'(k),$$

where $\tilde{s}'(k)$ is the mean $s'(i)$ over all data of the entire dataset for $k$ clusters.
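A possible vectorized sketch of the simplified silhouette, assuming a data matrix `X` of shape (N, d), cluster `centers` of shape (k, d), and integer `labels`; this is illustrative only, not a reference implementation.

```python
import numpy as np

def simplified_silhouette(X, centers, labels):
    """Simplified silhouette: a'(i) is the distance to the own cluster center,
    b'(i) the distance to the nearest other center; only N*k distances are
    computed instead of all N^2 pairwise distances."""
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # (N, k)
    n = len(X)
    a = dist[np.arange(n), labels]            # a'(i) = d(i, mu_{C_I})
    d_other = dist.copy()
    d_other[np.arange(n), labels] = np.inf    # mask the own center
    b = d_other.min(axis=1)                   # b'(i) = min over other centers
    s = (b - a) / np.maximum(a, b)
    return s.mean()                           # mean simplified silhouette
```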
If the cluster centers are medoids (as in k-medoids clustering) instead of arithmetic means (as in k-means clustering), this is also called the medoid-based silhouette[8] or medoid silhouette.[9]
If every object is assigned to the nearest medoid (as in k-medoids clustering), we know that $a'(i) \leq b'(i)$, and hence $s'(i) = \frac{b'(i) - a'(i)}{b'(i)} = 1 - \frac{a'(i)}{b'(i)}$.[10]
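In that case the medoid silhouette can be computed directly from the distances to the two nearest medoids, as sketched below (assuming a full pairwise distance matrix `D` and a list of at least two medoid indices; names are illustrative).

```python
import numpy as np

def medoid_silhouette(D, medoids):
    """Mean medoid silhouette, assuming each point is assigned to its nearest
    medoid: s'(i) = 1 - a'(i)/b'(i), where a'(i) and b'(i) are the distances
    to the nearest and second-nearest medoid."""
    dist = D[:, medoids]                         # distances to every medoid
    order = np.argsort(dist, axis=1)
    idx = np.arange(len(D))
    a = dist[idx, order[:, 0]]                   # nearest medoid
    b = dist[idx, order[:, 1]]                   # second-nearest medoid
    with np.errstate(divide="ignore", invalid="ignore"):
        s = np.where(b > 0, 1.0 - a / b, 0.0)    # convention s'(i)=0 if b'(i)=0
    return s.mean()
```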
Instead of using the average silhouette to evaluate a clustering obtained from, e.g., k-medoids or k-means, we can try to directly find a solution that maximizes the silhouette. We do not have a closed-form solution to maximize this, but it will usually be best to assign points to the nearest cluster, as done by these methods. Van der Laan et al.[11] proposed to adapt the standard algorithm for k-medoids, PAM, for this purpose and call this algorithm PAMSIL (a code sketch of the swap loop follows the steps below):

1. Choose initial medoids by using PAM.
2. Compute the average silhouette of this initial solution.
3. For each pair of a medoid m and a non-medoid x: swap m and x, compute the average silhouette of the resulting solution, remember the best swap, and undo the swap.
4. Perform the best swap and return to step 3; stop if no swap improved the average silhouette.
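Below is a hedged Python sketch of this swap loop, not the authors' implementation; `average_silhouette` recomputes the full $\mathcal{O}(N^2)$ silhouette for every candidate swap, as in the complexity discussion that follows.

```python
import numpy as np

def average_silhouette(D, medoids):
    """Average silhouette of the partition induced by assigning every point
    to its nearest medoid; a full O(N^2) computation."""
    labels = np.argmin(D[:, medoids], axis=1)
    n = len(D)
    s = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        if same.sum() > 1:
            a = D[i, same].sum() / (same.sum() - 1)
            b = min(D[i, labels == c].mean()
                    for c in np.unique(labels) if c != labels[i])
            s[i] = (b - a) / max(a, b)
    return s.mean()

def pamsil(D, initial_medoids, max_iter=100):
    """PAMSIL-style optimization: try every (medoid, non-medoid) swap, keep
    the swap that most improves the average silhouette, and repeat."""
    medoids = list(initial_medoids)
    best_score = average_silhouette(D, medoids)
    for _ in range(max_iter):
        best_swap, best_new = None, best_score
        for mi in range(len(medoids)):           # step 3: all medoid/non-medoid pairs
            for x in range(len(D)):
                if x in medoids:
                    continue
                candidate = medoids.copy()
                candidate[mi] = x                # tentative swap
                score = average_silhouette(D, candidate)
                if score > best_new:
                    best_swap, best_new = (mi, x), score
        if best_swap is None:                    # step 4: stop if no improvement
            break
        medoids[best_swap[0]] = best_swap[1]     # otherwise perform the best swap
        best_score = best_new
    return medoids, best_score
```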
The loop in step 3 is executed for $\mathcal{O}(Nk)$ pairs, and involves computing the silhouette in $\mathcal{O}(N^2)$, hence this algorithm needs $\mathcal{O}(N^3 k i)$ time, where $i$ is the number of iterations.
Because this is a fairly expensive operation, the authors propose to also use the medoid-based silhouette, and call the resulting algorithm PAMMEDSIL.[12] It needs $\mathcal{O}(N^2 k^2 i)$ time.
Batool et al. propose a similar algorithm under the name OSil, together with a CLARA-like sampling strategy for larger data sets that solves the problem only for a sub-sample.[13]
By adopting recent improvements to the PAM algorithm, FastMSC reduces the runtime using the medoid silhouette to just $\mathcal{O}(N^2 i)$.[14]
By starting with a maximum number of clusters $k_{\max}$ and iteratively removing the worst center (in terms of a change in silhouette) and re-optimizing, the best (highest medoid silhouette) clustering can be automatically determined. As data structures can be reused, this reduces the computation cost substantially over repeatedly running the algorithm for different numbers of clusters.[15] This algorithm needs pairwise distances and is typically implemented with a pairwise distance matrix. The $\mathcal{O}(N^2)$ memory requirement is the main limiting factor for applying this to very large data sets.
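A rough, hedged sketch of this top-down strategy is given below; it reuses the `medoid_silhouette` helper from the earlier sketch, the initialization is hypothetical, and the re-optimization after each removal performed by the actual algorithm is omitted here for brevity.

```python
import numpy as np

def shrink_medoids(D, k_max):
    """Start from k_max medoids, repeatedly drop the center whose removal
    gives the best medoid silhouette, and keep the best clustering seen."""
    # hypothetical initialization; a real implementation would use a proper
    # k-medoids initialization such as PAM BUILD
    medoids = list(np.argsort(D.sum(axis=1))[:k_max])
    best_medoids, best_score = list(medoids), medoid_silhouette(D, medoids)
    while len(medoids) > 2:
        scores = [medoid_silhouette(D, medoids[:j] + medoids[j + 1:])
                  for j in range(len(medoids))]
        j = int(np.argmax(scores))               # remove the "worst" center
        medoids = medoids[:j] + medoids[j + 1:]
        if scores[j] > best_score:
            best_medoids, best_score = list(medoids), scores[j]
    return best_medoids, best_score
```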
Peter J. Rousseeuw (1987). "Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis". Journal of Computational and Applied Mathematics. 20: 53–65. doi:10.1016/0377-0427(87)90125-7.
Beyer, Kevin; Goldstein, Jonathan; Ramakrishnan, Raghu; Shaft, Uri (1999). "When is "nearest neighbor" meaningful?". Database Theory—ICDT'99: 7th International Conference, Jerusalem, Israel, January 10–12, 1999, Proceedings.
Monshizadeh, Mehrnoosh; Khatri, Vikramajeet; Kantola, Raimo; Yan, Zheng (2022-11-01). "A deep density based and self-determining clustering approach to label unknown traffic". Journal of Network and Computer Applications. 207: 103513. doi:10.1016/j.jnca.2022.103513. ISSN 1084-8045. "However, both measures [silhouette coefficient and edge correlation] prefer convex-shaped clusters and cannot adapt to all cluster shapes produced by DBSCAN."
R.C. de Amorim, C. Hennig (2015). "Recovering the number of clusters in data sets with noise features using feature rescaling factors". Information Sciences. 324: 126–145. arXiv:1602.06989. doi:10.1016/j.ins.2015.06.039. S2CID 315803.
Leonard Kaufman; Peter J. Rousseeuw (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Hoboken, NJ: Wiley-Interscience. p. 87. doi:10.1002/9780470316801. ISBN 9780471878766.
Hruschka, E.R.; de Castro, L.N.; Campello, R.J.G.B. (2004). Evolutionary Algorithms for Clustering Gene-Expression Data. Fourth IEEE International Conference on Data Mining (ICDM'04). IEEE. pp. 403–406. doi:10.1109/ICDM.2004.10073.
Van der Laan, Mark; Pollard, Katherine; Bryan, Jennifer (2003). "A new partitioning around medoids algorithm". Journal of Statistical Computation and Simulation. 73 (8): 575–584. doi:10.1080/0094965031000136012. ISSN 0094-9655. S2CID 17437463.
Lenssen, Lars; Schubert, Erich (2022). Clustering by Direct Optimization of the Medoid Silhouette. International Conference on Similarity Search and Applications. pp. 190–204. arXiv:2209.12553. doi:10.1007/978-3-031-17849-8_15.
Batool, Fatima; Hennig, Christian (2021). "Clustering with the Average Silhouette Width". Computational Statistics & Data Analysis. 158: 107190. arXiv:1910.11339. doi:10.1016/j.csda.2021.107190. S2CID 219260336.
Lenssen, Lars; Schubert, Erich (2024). "Medoid Silhouette clustering with automatic cluster number selection". Information Systems. 120: 102290. arXiv:2309.03751. doi:10.1016/j.is.2023.102290. ISSN 0306-4379.