ABSTRACT
It has been shown that storing documents with similar structures together can reduce fragmentation and improve query efficiency. Unlike flat text documents, Web documents have no standard vectorial representation, which most existing classification algorithms require. In this paper, we propose a vectorization method for XML documents based on multidimensional scaling (MDS), so that Web documents can be fed into an existing classification algorithm. Classical MDS embeds data points into a Euclidean space if the similarity matrix constructed from the data points is semidefinite. The semidefiniteness condition, however, may not hold because of the inference technique used in practice. We therefore find the semidefinite matrix that is closest to the distance matrix in the Euclidean space. Based on recent developments on strongly semismooth matrix-valued functions, we solve the nearest semidefinite matrix problem with a Newton-type method. Experimental studies show that the classification accuracy can be improved.
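As a rough illustration of the embedding step described in the abstract, the sketch below runs classical MDS on a pairwise distance matrix and repairs a failed semidefiniteness condition by clipping negative eigenvalues to zero. Clipping is the plain Frobenius-norm projection onto the semidefinite cone and is used here only as a simple stand-in; it is not the Newton-type method the paper proposes. The names `classical_mds`, `dist`, and `k` are hypothetical and chosen for this example.

```python
import numpy as np


def classical_mds(dist, k=2):
    """Embed n items in R^k from an n x n symmetric distance matrix.

    Illustrative sketch only: negative eigenvalues are clipped to zero,
    which is a simple substitute for the paper's Newton-type solver.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]

    # Double-center the squared distances: B = -1/2 * J * D^2 * J,
    # where J = I - (1/n) * 1 1^T is the centering matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (dist ** 2) @ J
    B = (B + B.T) / 2.0  # remove round-off asymmetry

    # Eigen-decomposition of the symmetric matrix B.
    eigvals, eigvecs = np.linalg.eigh(B)

    # If the distances are not Euclidean, B has negative eigenvalues.
    # Clipping them yields the nearest semidefinite matrix in Frobenius norm.
    eigvals = np.clip(eigvals, 0.0, None)

    # Keep the k largest eigenpairs and scale eigenvectors into coordinates.
    top = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, top] * np.sqrt(eigvals[top])


if __name__ == "__main__":
    # Toy dissimilarities that violate the triangle inequality (5 > 1 + 1),
    # standing in for structural distances between three XML documents.
    toy = np.array([[0.0, 1.0, 5.0],
                    [1.0, 0.0, 1.0],
                    [5.0, 1.0, 0.0]])
    print(classical_mds(toy, k=2))
```

Each row of the returned array is a feature vector for one document, so the result could be passed to a vector-based classifier. When the input distances are exactly Euclidean, no eigenvalue is clipped and the embedding reproduces them.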