Information theoretic measures for clusterings comparison: is a correction for chance necessary?
Source: ACM International Conference Proceeding Series, Vol. 382
Proceedings of the 26th Annual International Conference on Machine Learning
Montreal, Quebec, Canada
Pages: 1073-1080
Year of Publication: 2009
ISBN: 978-1-60558-516-1
Authors
Nguyen Xuan Vinh  The University of New South Wales, Sydney, Australia & ATP Laboratory, National ICT Australia (NICTA)
Julien Epps  The University of New South Wales, Sydney, Australia & ATP Laboratory, National ICT Australia (NICTA)
James Bailey  The University of Melbourne, Australia & Victoria Research Laboratory, National ICT Australia
Sponsors
MITACS
NSF
Microsoft Research
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1553374.1553511

ABSTRACT

Information-theoretic measures form a fundamental class of similarity measures for comparing clusterings, alongside the pair-counting and set-matching classes. In this paper, we discuss whether information-theoretic measures for clusterings comparison need a correction for chance. We observe that the baseline for such measures, i.e. the average value between random partitions of a data set, is not constant, and tends to vary more when the ratio of the number of data points to the number of clusters is small. A similar effect occurs in some non-information-theoretic measures, such as the well-known Rand Index. Assuming a hypergeometric model of randomness, we derive an analytical formula for the expected mutual information between a pair of clusterings, and then propose adjusted versions of several popular information-theoretic measures. Some examples are given to demonstrate the need for, and usefulness of, the adjusted measures.
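The baseline effect described in the abstract can be illustrated with a minimal Python sketch. It does not reproduce the paper's analytical formula, which is not given on this page. Instead it estimates the baseline by Monte Carlo: cluster sizes are held fixed and labels are shuffled, which is one assumed way to sample under a fixed-marginals (hypergeometric-style) model. The function names, the equal cluster sizes, and the generic correction form (index - expected) / (maximum - expected) are illustrative assumptions, and the choice of `maximum` is left to the caller.

```python
import math
import random
from collections import Counter


def mutual_information(u, v):
    """Mutual information (in nats) between two equal-length label sequences."""
    n = len(u)
    joint = Counter(zip(u, v))   # contingency table counts n_ab
    count_u = Counter(u)         # cluster sizes in u
    count_v = Counter(v)         # cluster sizes in v
    mi = 0.0
    for (a, b), n_ab in joint.items():
        mi += (n_ab / n) * math.log(n * n_ab / (count_u[a] * count_v[b]))
    return mi


def baseline_mi(n_points, n_clusters, trials=200, seed=0):
    """Monte Carlo estimate of the average MI between two random clusterings.

    Assumption: both clusterings have (near-)equal cluster sizes, and
    randomness comes from shuffling one label vector with sizes held fixed.
    """
    rng = random.Random(seed)
    u = [i % n_clusters for i in range(n_points)]
    v = list(u)
    total = 0.0
    for _ in range(trials):
        rng.shuffle(v)
        total += mutual_information(u, v)
    return total / trials


def adjusted(index, expected, maximum):
    """Generic chance-corrected form: 0 at the baseline, 1 at the maximum."""
    return (index - expected) / (maximum - expected)


if __name__ == "__main__":
    # With 100 points, the estimated baseline grows as clusters are added,
    # i.e. as the ratio of data points to clusters shrinks.
    for k in (2, 5, 10, 20):
        print(k, round(baseline_mi(100, k), 4))
```

Under these assumptions, the printed baseline is close to zero for two clusters and clearly larger for twenty, which is the kind of non-constant baseline the adjusted measures are meant to remove.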


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1. Albatineh, A. N., Niewiadomska-Bugaj, M., & Mihalko, D. (2006). On similarity indices and correction for chance agreement. Journal of Classification, 23, 301-313.
4. Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 193-218.
5. Lancaster, H. (1969). The chi-squared distribution. New York: John Wiley.
8. Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66, 846-850.
10. Vinh, N. X., & Epps, J. (2009). A novel approach for automatic number of clusters detection in microarray data based on consensus clustering. BIBE'09: The IEEE International Conference on BioInformatics and BioEngineering, to appear.
11. Vinh, N. X., Epps, J., & Bailey, J. (2009). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. To be submitted.
12. Warrens, M. (2008). On similarity coefficients for 2x2 tables and correction for chance. Psychometrika, 73, 487-502.
