ABSTRACT
Extensible Markup Language (XML) is a simple and flexible text format derived from SGML [1]. It has been widely accepted as a crucial component in many information retrieval applications, such as XML databases and web services. One reason for its wide acceptance is that its format can be customized for data transmission or data storage. Classification is an important data mining task that aims to assign unknown objects to the classes that best characterize them. In this paper, we propose a method to classify XML documents under the assumption that they do not share a common schema, and that a schema may or may not be available. Our method is similarity-based. Its main characteristic is the way it handles the roles played by the text and the structural information. Unlike most existing methods, we use a bottom-up approach: we start from the text and then embed the structural information. This is based on the observation that, in XML documents with diversified tag structures, the most informative content is carried by the terms in the text. Our experiments show that this strategy can achieve better performance than existing methods for documents from sources that exhibit heterogeneous structures.
REFERENCES
[1]
[2] A. Bratko and B. Filipič. Exploiting structural information in semi-structured document classification. In Proc. 13th International Electrotechnical and Computer Science Conference, ERK'2004, vol. B, pages 145--149, Portorož, Slovenia, 2004.
[3] Anirban Dasgupta, Petros Drineas, Boulos Harb, Vanja Josifovski, and Michael W. Mahoney. Feature selection methods for text classification. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, USA, August 12-15, 2007. doi:10.1145/1281192.1281220
[4] L. Denoyer and P. Gallinari. A belief networks-based generative model for structured documents. An application to the XML categorization. In Machine Learning and Data Mining in Pattern Recognition, volume 2734, pages 277--302, 2003.
[5]
[6] U. M. Fayyad and K. B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In International Joint Conferences on Artificial Intelligence, pages 1022--1029, 1993.
[7]
[8] A. Kurt and T. Engin. Classification of XSLT-generated web documents with support vector machines. In Knowledge Discovery from XML Documents, volume 3915, pages 33--42, 2006.
[9] P. F. Marteau, G. Menie, and E. Popovic. Weighted naïve Bayes model for semi-structured document categorization. In 1st International Conference on Multidisciplinary Information Sciences and Technologies.
[10]
[11] Dou Shen, Zheng Chen, Qiang Yang, Hua-Jun Zeng, Benyu Zhang, Yuchang Lu, and Wei-Ying Ma. Web-page classification through summarization. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, United Kingdom, July 25-29, 2004. doi:10.1145/1008992.1009035
[12]
[13] M. Theobald, R. Schenkel, and G. Weikum. Exploiting structure, annotation and ontological knowledge for automatic classification of XML data. In WebDB, San Diego, CA, 2003.
[14] X. J. Wu. XML document classification with co-training. In ILP2007, 2007.
[15] G. Xing et al. Classifying XML documents based on structure/content similarity. In Lecture Notes in Computer Science, volume 4518, pages 444--457, 2007.
[16]
[17]