|
ABSTRACT
We study clustering problems in the streaming model, where the goal is to cluster a set of points by making one pass (or a few passes) over the data using a small amount of storage space. Our main result is a randomized algorithm for the k--Median problem which produces a constant factor approximation in one pass using storage space O(k poly log n). This is a significant improvement of the previous best algorithm which yielded a 2O(1/ε) approximation using O(nε) space. Next we give a streaming algorithm for the k--Median problem with an arbitrary distance function. We also study algorithms for clustering problems with outliers in the streaming model. Here, we give bicriterion guarantees, producing constant factor approximations by increasing the allowed fraction of outliers slightly.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Miklós Ajtai , T. S. Jayram , Ravi Kumar , D. Sivakumar, Approximate counting of inversions in a data stream, Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, May 19-21, 2002, Montreal, Quebec, Canada
[doi> 10.1145/509907.509964]
|
| |
2
|
|
| |
3
|
|
 |
4
|
Sanjeev Arora , Prabhakar Raghavan , Satish Rao, Approximation schemes for Euclidean k-medians and related problems, Proceedings of the thirtieth annual ACM symposium on Theory of computing, p.106-113, May 24-26, 1998, Dallas, Texas, United States
[doi> 10.1145/276698.276718]
|
 |
5
|
Vijay Arya , Naveen Garg , Rohit Khandekar , Kamesh Munagala , Vinayaka Pandit, Local search heuristic for k-median and facility location problems, Proceedings of the thirty-third annual ACM symposium on Theory of computing, p.21-29, July 2001, Hersonissos, Greece
[doi> 10.1145/380752.380755]
|
| |
6
|
Ziv Bar-Yosseff , Ravi Kumar , D. Sivakumar, Reductions in streaming algorithms, with an application to counting triangles in graphs, Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, p.623-632, January 06-08, 2002, San Francisco, California
|
 |
7
|
Moses Charikar , Chandra Chekuri , Tomás Feder , Rajeev Motwani, Incremental clustering and dynamic information retrieval, Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, p.626-635, May 04-06, 1997, El Paso, Texas, United States
[doi> 10.1145/258533.258657]
|
| |
8
|
|
 |
9
|
Moses Charikar , Sudipto Guha , Éva Tardos , David B. Shmoys, A constant-factor approximation algorithm for the k-median problem (extended abstract), Proceedings of the thirty-first annual ACM symposium on Theory of computing, p.1-10, May 01-04, 1999, Atlanta, Georgia, United States
[doi> 10.1145/301250.301257]
|
| |
10
|
Moses Charikar , Samir Khuller , David M. Mount , Giri Narasimhan, Algorithms for facility location problems with outliers, Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms, p.642-651, January 07-09, 2001, Washington, D.C., United States
|
 |
11
|
|
| |
12
|
M. Dyer and A. M. Frieze. A simple heuristic for the p--center problem. Operations Research Letters, v. 3, 1985.
|
| |
13
|
J. Feigenbaum, S. Kannan, M. Strauss, and M. Viswanathan. An approximate L1-difference algorithm for massive data streams. Proc. FOCS, 2000.
|
 |
14
|
Anna C. Gilbert , Sudipto Guha , Piotr Indyk , Yannis Kotidis , S. Muthukrishnan , Martin J. Strauss, Fast, small-space algorithms for approximate histogram maintenance, Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, May 19-21, 2002, Montreal, Quebec, Canada
[doi> 10.1145/509907.509966]
|
 |
15
|
|
| |
16
|
|
| |
17
|
O. Kariv and S. L. Hakimi. An algorithmic approach to network location problems, part ii: p-medians. SIAM Journal of Appl. Math, v. 37, 1979.
|
| |
18
|
M. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams. In DIMACS series in Discrete Mathematics and Theoretical Computer Science, v. 50, 1999.
|
| |
19
|
D. S. Hochbaum and D. B. Shmoys. A best possible approximation algorithm for the k--Center problem. Mathematics of Operations Research, v. 10, 1985.
|
 |
20
|
|
| |
21
|
|
| |
22
|
|
 |
23
|
|
| |
24
|
Madhukar R. Korupolu , C. Greg Plaxton , Rajmohan Rajaraman, Analysis of a local search heuristic for facility location problems, Proceedings of the ninth annual ACM-SIAM symposium on Discrete algorithms, p.1-10, January 25-27, 1998, San Francisco, California, United States
|
| |
25
|
|
 |
26
|
|
 |
27
|
Gurmeet Singh Manku , Sridhar Rajagopalan , Bruce G. Lindsay, Random sampling techniques for space efficient online computation of order statistics of large datasets, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.251-262, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
| |
28
|
|
| |
29
|
R. Mettu and C. G. Plaxton. Optimal time bounds for approximate clustering. Proc. UAI, 2002.
|
| |
30
|
|
| |
31
|
A. Meyerson, L. O'Callaghan and S. Plotkin. Approximating k--Median: A sampling-based approach. Unpublished manuscript, 2001.
|
| |
32
|
Nina Mishra , Dan Oblinger , Leonard Pitt, Sublinear time approximate clustering, Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms, p.439-447, January 07-09, 2001, Washington, D.C., United States
|
| |
33
|
|
| |
34
|
Neal E. Young, K-medians, facility location, and the Chernoff-Wald bound, Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms, p.86-95, January 09-11, 2000, San Francisco, California, United States
|
CITED BY 27
|
|
Brain Babcock , Mayur Datar , Rajeev Motwani , Liadan O'Callaghan, Maintaining variance and k-medians over data stream windows, Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.234-243, June 09-11, 2003, San Diego, California
|
|
|
|
|
|
|
|
|
|
|
|
Mohamed Medhat Gaber , Shonali Krishnaswamy , Arkady Zaslavsky, Cost-efficient mining techniques for data streams, Proceedings of the second workshop on Australasian information security, Data Mining and Web Intelligence, and Software Internationalisation, p.109-114, January 01, 2004, Dunedin, New Zealand
|
|
|
|
|
|
|
|
|
|
|
|
Michael Elkin , Jian Zhang, Efficient algorithms for constructing (1+,ε, β)-spanners in the distributed and streaming models, Proceedings of the twenty-third annual ACM symposium on Principles of distributed computing, July 25-28, 2004, St. John's, Newfoundland, Canada
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|