Qi HE (何奇)
There are many well-known measures of clustering quality, some relying on the ground truth and some not. In the following I introduce only the tip of the iceberg.

Rand Index

The Rand Index (or Rand Measure) measures how close the clustering results are to the original classes. As an example, take a contingency table H(C,K) in which data from three classes C1, C2 and C3 are clustered into three clusters K1, K2 and K3.
Define a as the number of pairs of objects that originate from the same class and are placed in the same cluster, b as the number of pairs that originate from the same class but are placed in different clusters, c as the number of pairs that originate from different classes but are placed in the same cluster, and d as the number of pairs that originate from different classes and are also placed in different clusters. Let x be the total number of possible pairs of objects in the data, so that x = a + b + c + d. The Rand Index is defined as (a + d) / x. In the above example, a = 7, b = 6, c = 7, d = 25 and x = 45, so the Rand Index is (7 + 25) / 45 = 0.711.

The Rand Index ranges from 0 to 1. A value of 1 indicates a perfect clustering, while a value of 0 indicates a completely wrong clustering. The advantage of the Rand Index is its simple form. Intuitively, it works as a compromise between cluster and class entropy/purity. A variation of the Rand Index is introduced in "Details of the Adjusted Rand index and Clustering algorithms" (Bioinformatics); it widens the range of the index in order to make the index value more sensitive.

normalized Mutual Information (nMI)

nMI is another external measure of how close the clustering results are to the latent classes. If the clustering tends to split large classes into small pieces, i.e., with high entropy, the Rand Index still keeps a large absolute value, which is not what we expect. We define the nMI as
where
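Note that n is the number of documents to be clustered. For the above example (see Rand Index), we have H(C) = 0.4581, H(K) = 0.4472, H(C|K) = 0.2830, I(C;K) = 0.1752, and nMI = 0.3824. These figures are consistent with I(C;K) = H(C) - H(C|K) up to rounding, and with nMI = I(C;K) / H(C), since 0.1752 / 0.4581 ≈ 0.3824. There are actually three measures of cluster quality we could use.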
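Returning to the Rand Index defined earlier, here is a minimal Python sketch of how the pair counts a, b, c and d could be obtained from a contingency table. The table used here is hypothetical: it is simply one 3×3 table (rows are classes, columns are clusters) whose pair counts match the figures quoted in the text, and it is not necessarily the table from the original example. The function names `pair_counts` and `rand_index` are likewise only illustrative.

```python
from math import comb  # comb(n, 2) = number of unordered pairs among n objects


def pair_counts(table):
    """Return (a, b, c, d) for a contingency table.

    table[i][j] is the number of objects from class i placed in cluster j.
    """
    n = sum(sum(row) for row in table)
    x = comb(n, 2)  # total number of pairs

    # a: pairs sharing both class and cluster (both objects in the same cell)
    a = sum(comb(cell, 2) for row in table for cell in row)

    # pairs sharing a class (row sums) and pairs sharing a cluster (column sums)
    same_class = sum(comb(sum(row), 2) for row in table)
    same_cluster = sum(comb(sum(col), 2) for col in zip(*table))

    b = same_class - a    # same class, different clusters
    c = same_cluster - a  # different classes, same cluster
    d = x - a - b - c     # different classes, different clusters
    return a, b, c, d


def rand_index(table):
    a, b, c, d = pair_counts(table)
    return (a + d) / (a + b + c + d)


# Hypothetical table (an assumption, chosen only to reproduce the quoted counts)
table = [
    [4, 0, 0],  # class C1
    [1, 2, 1],  # class C2
    [0, 1, 1],  # class C3
]

print(pair_counts(table))           # (7, 6, 7, 25)
print(round(rand_index(table), 3))  # 0.711
```

Counting pairs through binomial coefficients on the cells, row sums and column sums avoids enumerating all x pairs explicitly, which matters once n grows beyond a toy example.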