Qi HE (何奇) Home Research Publication Software Technology Coding Life Miscellaneous

When the going gets tough, the tough gets going

  • Caogn.com
  • Clustering Measurements
  • Caogn.com(草根博书网)

    Chief Operating Officer

    The first self-startup company, launched recently, has developed a Web based platform to support the automatic designing, printing, and binding of blog-based books, whose content are automatically extracted from user's online blogs.

    • Currently, only the BlogDriver(博客动力), a Chinese blog Website, is supported.
    • The product was funded by the Guangzhou Oversea Student Startup plan for 130K RMB in 2007.

    Clustering Measurements

    There are a bunch of well-known clustering measurements, relying or not relying on the ground-truth. In this following I shall introduce a tip of iceberg.

    Rand Index

    Rand Index or Rand Measure is a measure of how the clustering results are close to the original classes.

    Given an example as shown in the below contingency table H(C,K), data from three classes C1, C2 and C3 are clustered into three clusters K1, K2 and K3.

    class    cluster

    K1

    K2

    K3

    sum

    C1

    1

    1

    0

    2

    C2

    1

    2

    1

    4

    C3

    0

    0

    4

    4

    sum

    2

    3

    5

    10

    Define a as the number of pairs of objects which originate from the same class and are clustered into the same cluster, b as the number of pairs of objects which originate from the same class yet are clustered into various clusters, c as the number of pairs of objects which originate from various classes yet are clustered into the same cluster, d as the number of pairs of objects which originate from various classes and are also clustered into various clusters. Let x the total number of possible pairs of objects from the data. Apparently, x = a + b + c + d.

    Rand Index is defined as (a + d) / x. In the above example, a = 7, b = 6, c = 7, d = 25, x = 45, and Rand Index is 0.711. Rand Index ranges from 0 to 1. The value of 1 indicates a perfect clustering, yet the value of 0 tells a total wrong clustering. The advantage of Rand Index is its simple form. Intuitively, Rand Index works as a comprise of cluster or class entropy/purity.

    Some variation of Rand Index is also introduced in Details of the Adjusted Rand index and Clustering algorithms(Bioinformatics) by increasing the index range, which is used for increasing the sensitivity of the index value.

    normalized Mutual Information (nMI)

    nMI is another external measure of how the clustering results are close to the latent classes. If the clustering tends to separate large classes into tiles,  i.e., with high entropy, the Rand Index still remains a large absolute value, which is not what we expect.

    We shall define the nMI as

    ,

    where

    ,

    ,

    ,

    ,

    .

    Note that n is the number of documents pending on clustering. Given the above example (see Rand Index), we have H(C)=0.4581, H(K)=0.4472, H(C|K)=0.2830, I(C;K)=0.1752, and nMI=0.3824. There are three measurements for cluster quality we could use actually.

    • H(C|K) is equivalent to asking why we should expect it to be a good measure of the usefulness of the elements of K as predictors of the corresponding elements of C. The worst-case value of H(C|K) is just H(C), the number of bits it takes to encode C with no help from K. This case corresponds to I(C;K)=0. The best-case value H(C|K) is 0, which corresponds to perfect correspondence between C and K. This corresponds to I(C;K)=H(C).

    • I(C;K) is equivalent to H(C|K) because H(C) is fixed for a given ground-truth set.

    • nMI is a normalized version of I(C;K) by leveraging the cluster entropy and class entropy. The best value is 1 and the worst value is 0.