|
ABSTRACT
Problems of clustering data from pairwise similarity information are ubiquitous in Computer Science. Theoretical treatments typically view the similarity information as ground-truth and then design algorithms to (approximately) optimize various graph-based objective functions. However, in most applications, this similarity information is merely based on some heuristic; the ground truth is really the unknown correct clustering of the data points and the real goal is to achieve low error on the data. In this work, we develop a theoretical approach to clustering from this perspective. In particular, motivated by recent work in learning theory that asks "what natural properties of a similarity (or kernel) function are sufficient to be able to learn well?" we ask "what natural properties of a similarity function are sufficient to be able to cluster well?" To study this question we develop a theoretical framework that can be viewed as an analog of the PAC learning model for clustering, where the object of study, rather than being a concept class, is a class of (concept, similarity function) pairs, or equivalently, a property the similarity function should satisfy with respect to the ground truth clustering. We then analyze both algorithmic and information theoretic issues in our model. While quite strong properties are needed if the goal is to produce a single approximately-correct clustering, we find that a number of reasonable properties are sufficient under two natural relaxations: (a) list clustering: analogous to the notion of list-decoding, the algorithm can produce a small list of clusterings (which a user can select from) and (b) hierarchical clustering: the algorithm's goal is to produce a hierarchy such that desired clustering is some pruning of this tree (which a user could navigate). We develop a notion of the clustering complexity of a given property (analogous to notions of capacity in learning theory), that characterizes its information-theoretic usefulness for clustering. We analyze this quantity for several natural game-theoretic and learning-theoretic properties, as well as design new efficient algorithms that are able to take advantage of them. Our algorithms for hierarchical clustering combine recent learning-theoretic approaches with linkage-style methods. We also show how our algorithms can be extended to the inductive case, i.e., by using just a constant-sized sample, as in property testing. The analysis here uses regularity-type results of [FK] and [AFKK].
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
D. Achlioptas and F. McSherry. On spectral learning of mixtures of distributions. In COLT, 2005.
|
 |
2
|
|
| |
3
|
|
| |
4
|
|
| |
5
|
S. Arora and R. Kannan. Learning mixtures of arbitrary gaussians. In ACM Symposium on Theory of Computing, 2005.
|
 |
6
|
|
| |
7
|
|
| |
8
|
M.-F. Balcan, A. Blum, and S. Vempala. A theory of similarity functions for clustering. Technical Report, CMU-CS-07-142, 2007.
|
| |
9
|
|
| |
10
|
|
| |
11
|
S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of recent advances. ESAIM: Probability and Statistics, 9:9:323--375, 2005.
|
 |
12
|
Moses Charikar , Sudipto Guha , Éva Tardos , David B. Shmoys, A constant-factor approximation algorithm for the k-median problem (extended abstract), Proceedings of the thirty-first annual ACM symposium on Theory of computing, p.1-10, May 01-04, 1999, Atlanta, Georgia, United States
[doi> 10.1145/301250.301257]
|
| |
13
|
|
 |
14
|
|
| |
15
|
|
| |
16
|
Anirban Dasgupta , John Hopcroft , Ravi Kannan , Pradipta Mitra, Spectral clustering by recursive partitioning, Proceedings of the 14th conference on Annual European Symposium, p.256-267, September 11-13, 2006, Zurich, Switzerland
[doi> 10.1007/11841036_25]
|
| |
17
|
|
 |
18
|
W. Fernandez de la Vega , Marek Karpinski , Claire Kenyon , Yuval Rabani, Approximation schemes for clustering problems, Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, June 09-11, 2003, San Diego, CA, USA
[doi> 10.1145/780542.780550]
|
| |
19
|
L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.
|
| |
20
|
A. Frieze and R. Kannan. Quick approximation to matrices and applications. Combinatorica, 19(2):175--220, 1999.
|
| |
21
|
|
 |
22
|
|
| |
23
|
R. Kannan, H. Salmasian, and S. Vempala. The spectral method for general mixture models. In Proc. COLT, 2005.
|
 |
24
|
|
| |
25
|
J. Kleinberg. An impossibility theorem for clustering. In NIPS, 2002.
|
| |
26
|
|
| |
27
|
R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete structures. In Proc. ICML, 2002.
|
| |
28
|
|
| |
29
|
B. Scholkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology. MIT Press, 2004.
|
| |
30
|
J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926--1940, 1998.
|
| |
31
|
|
| |
32
|
N. Srebro. How good is a kernel as a similarity function? In Proc. 20th Annual Conference on Learning Theory, 2007.
|
| |
33
|
|
 |
34
|
|
| |
35
|
V. N. Vapnik. Statistical Learning Theory. Wiley and Sons, 1998.
|
| |
36
|
|
|