|
ABSTRACT
The Johnson--Lindenstrauss lemma states that n points in a
high-dimensional Hilbert space can be embedded with small
distortion of the distances into an O(log n)
dimensional space by applying a random linear transformation. We
show that similar (though weaker) properties hold for certain
random linear transformations over the Hamming cube. We use these
transformations to solve NP-hard clustering problems in the cube as
well as in geometric settings.More specifically, we address the
following clustering problem. Given n points in a larger set
(e.g., ℝd) endowed with a distance function (e.g.,
L2 distance), we would like to partition the data
set into k disjoint clusters, each with a "cluster center,"
so as to minimize the sum over all data points of the distance
between the point and the center of the cluster containing the
point. The problem is provably NP-hard in some high-dimensional
geometric settings, even for k = 2. We give polynomial-time
approximation schemes for this problem in several settings,
including the binary cube {0,1}d with Hamming distance,
and ℝd either with L1 distance,
or with L2 distance, or with the square of
L2 distance. In all these settings, the best
previous results were constant factor approximation guarantees.We
note that our problem is similar in flavor to the k-median
problem (and the related facility location problem), which has been
considered in graph-theoretic and fixed dimensional geometric
settings, where it becomes hard when k is part of the input.
In contrast, we study the problem when k is fixed, but the
dimension is part of the input.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
 |
2
|
|
| |
3
|
Alon, N., and Spencer, J. 1992. The Probabilistic Method. Wiley. New York.
|
| |
4
|
|
 |
5
|
Sanjeev Arora , David Karger , Marek Karpinski, Polynomial time approximation schemes for dense instances of NP-hard problems, Proceedings of the twenty-seventh annual ACM symposium on Theory of computing, p.284-293, May 29-June 01, 1995, Las Vegas, Nevada, United States
[doi> 10.1145/225058.225140]
|
 |
6
|
Sanjeev Arora , Prabhakar Raghavan , Satish Rao, Approximation schemes for Euclidean k-medians and related problems, Proceedings of the thirtieth annual ACM symposium on Theory of computing, p.106-113, May 24-26, 1998, Dallas, Texas, United States
[doi> 10.1145/276698.276718]
|
 |
7
|
Allan Borodin , Rafail Ostrovsky , Yuval Rabani, Subquadratic approximation algorithms for clustering problems in high dimensional spaces, Proceedings of the thirty-first annual ACM symposium on Theory of computing, p.435-444, May 01-04, 1999, Atlanta, Georgia, United States
[doi> 10.1145/301250.301367]
|
| |
8
|
Andrei Z. Broder , Steven C. Glassman , Mark S. Manasse , Geoffrey Zweig, Syntactic clustering of the Web, Selected papers from the sixth international conference on World Wide Web, p.1157-1166, September 1997, Santa Clara, California, United States
|
| |
9
|
|
| |
10
|
|
| |
11
|
|
| |
12
|
|
 |
13
|
Moses Charikar , Sudipto Guha , Éva Tardos , David B. Shmoys, A constant-factor approximation algorithm for the k-median problem (extended abstract), Proceedings of the thirty-first annual ACM symposium on Theory of computing, p.1-10, May 01-04, 1999, Atlanta, Georgia, United States
[doi> 10.1145/301250.301257]
|
 |
14
|
Douglass R. Cutting , David R. Karger , Jan O. Pedersen, Constant interaction-time scatter/gather browsing of very large document collections, Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval, p.126-134, June 27-July 01, 1993, Pittsburgh, Pennsylvania, United States
[doi> 10.1145/160688.160706]
|
 |
15
|
Douglass R. Cutting , David R. Karger , Jan O. Pedersen , John W. Tukey, Scatter/Gather: a cluster-based approach to browsing large document collections, Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, p.318-329, June 21-24, 1992, Copenhagen, Denmark
[doi> 10.1145/133160.133214]
|
| |
16
|
|
| |
17
|
|
| |
18
|
Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W., and Harshman, R. A. 1990. Indexing by latent semantic analysis. J. Soc. Inf. Sci. 41, 6, 391--407.
|
| |
19
|
P. Drineas , Alan Frieze , Ravi Kannan , Santosh Vempala , V. Vinay, Clustering in large graphs and matrices, Proceedings of the tenth annual ACM-SIAM symposium on Discrete algorithms, p.291-299, January 17-19, 1999, Baltimore, Maryland, United States
|
| |
20
|
Dumais, S. T. 1991. Improving the retrieval of information from external sources. Behav. Res. Meth. Instrum. Comput. 23, 2, 229--236.
|
 |
21
|
S. T. Dumais , G. W. Furnas , T. K. Landauer , S. Deerwester , R. Harshman, Using latent semantic analysis to improve access to textual information, Proceedings of the SIGCHI conference on Human factors in computing systems, p.281-285, May 15-19, 1988, Washington, D.C., United States
[doi> 10.1145/57167.57214]
|
| |
22
|
|
| |
23
|
|
| |
24
|
|
| |
25
|
|
| |
26
|
|
 |
27
|
|
| |
28
|
|
| |
29
|
Johnson, W. B., and Lindenstrauss, J. 1984. Extensions of Lipschitz mappings into Hilbert space. Contemp. Math. 26, 189--206.
|
| |
30
|
|
| |
31
|
Kannan, R., and Vinay, V.2001. The Manjara Meta-Search Engine. http://cluster.cs. yale.edu/about.html.
|
| |
32
|
Karp, R. M. 2000. The genomics revolution and its challenges for algorithmic research. Bull. EATCS 71 (June), 151--159.
|
 |
33
|
|
 |
34
|
|
 |
35
|
Jon Kleinberg , Christos Papadimitriou , Prabhakar Raghavan, Segmentation problems, Proceedings of the thirtieth annual ACM symposium on Theory of computing, p.473-482, May 24-26, 1998, Dallas, Texas, United States
[doi> 10.1145/276698.276860]
|
| |
36
|
|
| |
37
|
Linial, N., London, E., and Rabinovich, Y. 1995. The geometry of graphs and some of its algorithmic applications. Combinatorica 15, 2, 215--245.
|
| |
38
|
|
| |
39
|
|
 |
40
|
|
| |
41
|
Zamir, O., Etzioni, O., Madani, O., and Karp, R. M. 1997. Fast and intuitive clustering of web documents. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining. (Newport Beach, Calif., Aug.), D. Heckerman, H. Mannila, D. Pregibon, and R. Uthurusamy, Eds. AAAI Press, pp. 287--290.
|
CITED BY 5
|
|
W. Fernandez de la Vega , Marek Karpinski , Claire Kenyon , Yuval Rabani, Approximation schemes for clustering problems, Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, June 09-11, 2003, San Diego, CA, USA
|
|
|
|
|
|
Nir Ailon , Bernard Chazelle , Seshadhri Comandur , Ding Liu, Self-improving algorithms, Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, p.261-270, January 22-26, 2006, Miami, Florida
|
|
|
Amit Deshpande , Luis Rademacher , Santosh Vempala , Grant Wang, Matrix approximation and projective clustering via volume sampling, Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, p.1117-1126, January 22-26, 2006, Miami, Florida
|
|
|
|
|