ABSTRACT
XML has become a popular method of data representation both on the web and in databases in recent years. One of the reasons for the popularity of XML has been its ability to encode structural information about data records. However, this structural characteristic of data sets also makes XML data challenging for a variety of data mining problems. One such problem is clustering, in which the structural aspects of the data result in a high implicit dimensionality of the data representation. As a result, it becomes more difficult to cluster the data in a meaningful way. In this paper, we propose an effective clustering algorithm for XML data which uses substructures of the documents in order to gain insights about the important underlying structures. We propose new ways of using multiple sub-structural information in XML documents to evaluate the quality of intermediate cluster solutions, and guide the algorithms to a final solution which reflects the true structural behavior in individual partitions. We test the algorithm on a variety of real and synthetic data sets.