ABSTRACT
With continued advances in communication network technology and sensing technology, there is astounding growth in the amount of data produced and made available through cyberspace. Efficient and high-quality clustering of large datasets continues to be one of the most important problems in large-scale data analysis. A commonly used methodology for cluster analysis on large datasets is the three-phase framework of sampling/summarization, iterative cluster analysis, and disk-labeling. There are three known problems with this framework which demand effective solutions. The first problem is how to effectively define and validate irregularly shaped clusters, especially in large datasets. Automated algorithms and statistical methods are typically not effective in handling these particular clusters. The second problem is how to effectively label the entire dataset on disk (disk-labeling) without introducing additional errors, including the solutions for dealing with outliers, irregular clusters, and cluster boundary extension. The third problem is the lack of research on how to effectively integrate the three phases. In this article, we describe iVIBRATE, an interactive visualization-based three-phase framework for clustering large datasets. The two main components of iVIBRATE are its VISTA visual cluster-rendering subsystem, which brings human interaction into the large-scale iterative clustering process through interactive visualization, and its adaptive ClusterMap labeling subsystem, which offers visualization-guided disk-labeling solutions that are effective in dealing with outliers, irregular clusters, and cluster boundary extension.
Another important contribution of the iVIBRATE development is the identification of the special issues that arise in integrating the two components and the sampling approach into a coherent framework, as well as the solutions for improving the reliability of the framework and for minimizing the number of errors generated within the cluster analysis process. We study the effectiveness of the iVIBRATE framework through a walkthrough of an example dataset of a million records, and we experimentally evaluate the iVIBRATE approach using both real-life and synthetic datasets. Our results show that iVIBRATE can efficiently involve the user in the clustering process and generate high-quality clustering results for large datasets.