ACM Home Page
Please provide us with feedback. Feedback
Summarizing spatial data streams using ClusterHulls
Full text PdfPdf (10.91 MB)
Source Journal of Experimental Algorithmics (JEA) archive
Volume 13 ,  (February 2009) table of contents
SECTION: SECTION: 2 - Selected Papers from ALENEX 2006 table of contents
Article No. 4  
Year of Publication: 2009
ISSN:1084-6654
Authors
John Hershberger  Mentor Graphics Corp., Wilsonville, Ohio
Nisheeth Shrivastava  Bell-Labs Research, Bangalore, India
Subhash Suri  University of California, Santa Barbara, Santa Barbara, California
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 22,   Downloads (12 Months): 168,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1412228.1412238
What is a DOI?

ABSTRACT

We consider the following problem: given an on-line, possibly unbounded stream of two-dimensional (2D) points, how can we summarize its spatial distribution or shape using a small, bounded amount of memory? We propose a novel scheme, called ClusterHull, which represents the shape of the stream as a dynamic collection of convex hulls, with a total of at most m vertices, where m is the size of the memory. The algorithm dynamically adjusts both the number of hulls and the number of vertices in each hull to best represent the stream using its fixed-memory budget. This algorithm addresses a problem whose importance is increasingly recognized, namely, the problem of summarizing real-time data streams to enable on-line analytical processing. As a motivating example, consider habitat monitoring using wireless sensor networks. The sensors produce a steady stream of geographic data, namely, the locations of objects being tracked. In order to conserve their limited resources (power, bandwidth, and storage), the sensors can compute, store, and exchange ClusterHull summaries of their data, without losing important geometric information. We are not aware of other schemes specifically designed for capturing shape information in geometric data streams and so we compare ClusterHull with some of the best general-purpose clustering schemes, such as CURE, k-medians, and LSEARCH. We show through experiments that ClusterHull is able to represent the shape of two-dimensional data streams more faithfully and flexibly than the stream versions of these clustering algorithms.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Agarwal, P. and Procopiuc, C. 2000. Approximation algorithms for projective clustering. In Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA. 538--547.
 
2
Agarwal, P. K., Har-Peled, S., and Varadarajan, K. R. 2004. Approximating extent measures of points. J. ACM 51, 4, 606--635.
 
3
Aggarwal, C. C., Han, J., Wang, J., and Yu., P. S. 2003. A framework for clustering evolving data streams. In Proceedings of the 29th International Conference on Very Large Databases. Morgan Kaufmann, San Francisco, CA.
 
4
Alon, N., Matias, Y., and Szegedy, M. 1996. The space complexity of approximating the frequency moments. In Proceedings of the 28th Annual ACM Symposium on Theory of Computing. ACM Press, New York. 20--29.
 
5
Aluru, S. 2005. Quadtrees and octrees. In Handbook of Data Structures and Applications, D. P. Mehta and S. Sahni, Eds. CRC Press LLC, Boca Raton, FL. Chapter 19, 1--26.
 
6
Amenta, N., Bern, M., and Eppstein, D. 1998. The crust and the β-skeleton: Combinatorial curve reconstruction. Graphical Models and Image Processing 60, 125--135.
 
7
Arya, S., Mount, D. M., Netanyahu, N. S., Silverman, R., and Wu, A. 1998. An optimal algorithm for approximate nearest neighbor searching. J. ACM 45, 891--923.
 
8
Böhm, C., Faloutsos, C., Pan, J.-Y., and Plant, C. 2006. Robust information-theoretic clustering. In Proceedings of the 12th International Conference on Knowledge Discovery and Data Mining. ACM Press, New York. 65--75.
 
9
Bradley, P. S., Fayyad, U. M., and Reina, C. 1998. Scaling clustering algorithms to large databases. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining. AAAI, New York. 9--15.
 
10
Chan, T. M. 2004. Faster core-set constructions and data stream algorithms in fixed dimensions. In Proceedings of the 20th Annual Symposium on Computational Geometry. ACM Press, New York. 152--159.
 
11
Charikar, M., Chekuri, C., Feder, T., and Motwani, R. 1997. Incremental clustering and dynamic information retrieval. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing. ACM Press, New York. 626--635.
 
12
Charikar, M., O'Callaghan, L., and Panigrahy, R. 2003. Better streaming algorithms for clustering problems. In Proceedings of the 25th Annual ACM Symposium on Theory of Computing. ACM Press, New York. 30--39.
 
13
Chen, K. 2006. On k-median clustering in high dimensions. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA. 1177--1185.
 
14
Cheng, D., Vempala, S., Kannan, R., and Wang, G. 2005. A divide-and-merge methodology for clustering. In Proceedings of the 24th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM Press, New York. 196--205.
 
15
Cormode, G. and Muthukrishnan, S. 2003. Radial histogram for spatial streams. Technical Report 2003-11, DIMACS.
 
16
Curless, B. and Levoy, M. 1996. A volumetric method for building complex models from range images. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques. ACM Press, New York. 303--312.
 
17
Dey, T. K., Mehlhorn, K., and Ramos, E. 2000. Curve reconstruction: Connecting dots with good reason. Comput Geom. Theory Appl. 15, 229--244.
 
18
Domingos, P. and Hulten, G. 2001a. A general method for scaling up machine learning algorithms and its application to clustering. In Proceedings of the 18th International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA. 106--113.
 
19
Domingos, P. and Hulten, G. 2001b. Learning from infinite data in finite time. In Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, MA. 673--680.
 
20
Drineas, P., Frieze, A., Kannan, R., Vempala, S., and Vinay, V. 1999. Clustering in large graphs and matrices. In Proceedings of the 10th Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA. 291--299.
 
21
Edelsbrunner, H. 1987. Algorithms in Combinatorial Geometry. EATCS Monographs on Theoretical Computer Science, vol. 10. Springer-Verlag, New York.
 
22
Ester, M., Kriegel, H.-P., and Xu, X. 1995. A database interface for clustering in large spatial databases. In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining. AAAI, New York. 94--99.
 
23
Feigenbaum, J., Kannan, S., and Zhang, J. 2002. Computing diameter in the streaming and sliding-window models. Manuscript.
 
24
Frahling, G. and Sohler, C. 2005. Coresets in dynamic geometric data streams. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing. ACM Press, New York. 209--217.
 
25
Frieze, A., Kannan, R., and Vempala, S. 1998. Fast Monte-Carlo algorithms for finding low-rank approximations. In Proceedings of the 39th Symposium on Foundations of Computer Science. IEEE Computer Society, Los Alamitos, CA.
 
26
Greenwald, M. and Khanna, S. 2001. Space-efficient online computation of quantile summaries. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM Press, New York. 58--66.
 
27
Guha, S., Rastogi, R., and Shim, K. 1998. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM Press, New York. 73--84.
 
28
Guha, S., Meyerson, A., Mishra, N., Motwani, R., and O'Callaghan, L. 2003. Clustering data streams: Theory and practice. IEEE Trans. Knowledge Data Engi. 15, 3, 515--528.
 
29
Har-Peled, S. and Varadarajan, K. 2002. Projective clustering in high dimensions using core-sets. In Proceedings of the 18th Annual Symposium on Computational Geometry. ACM Press, New York. 312--318.
 
30
Har-Peled, S. and Kushal, A. 2005. Smaller coresets for k-median and k-means clustering. In Proceedings of the 21st Annual Symposium on Computational Geometry. ACM Press, New York. 126--134.
 
31
Hershberger, J. and Suri, S. 2004. Adaptive sampling for geometric problems over data streams. In Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM Press, New York. 252--262.
 
32
Kannan, R., Vempala, S., and Veta, A. 2000. On clusterings—good, bad and spectral. In Proceedings of the 41st Symposium on Foundations of Computer Science. IEEE Computer Society, Los Alamitos, CA. 367.
 
33
Manku, G. S. and Motwani, R. 2002. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Databases. Morgan Kaufmann, San Francisco, CA. 346--357.
 
34
Muthukrishnan, S. 2005. Data Streams: Algorithms and Applications. Foundations and Trends in Theoretical Computer Science, vol. 1, issue 2. Now Publishers, Delft, Netherlands.
 
35
Ng, R. T. and Han, J. 1994. Efficient and effective clustering methods for spatial data mining. In Proceedings of the 20th International Conference on Very Large Databases. Morgan Kaufmann, San Francisco, CA. 144--155.
 
36
O'Rourke, J. and Toussaint, G. T. 1997. Pattern recognition. In Handbook of Discrete and Computational Geometry, J. E. Goodman and J. O'Rourke, Eds. CRC Press LLC, Boca Raton, FL. Chapter 43, 797--814.
 
37
Pelleg, D. and Moore, A. W. 2000. X-means: Extending K-means with efficient estimation of the number of clusters. In Proceedings of the Seventeenth International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA. 727--734.
 
38
USGS. 2003. West Nile virus maps - 2002. http://cindi.usgs.gov/hazard/event/west_nile/.
 
39
Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM Press, New York. 103--114.

Collaborative Colleagues:
John Hershberger: colleagues
Nisheeth Shrivastava: colleagues
Subhash Suri: colleagues