Supervised learning for the legacy document conversion
Source
Document Engineering archive
Proceedings of the 2004 ACM symposium on Document engineering
Milwaukee, Wisconsin, USA
SESSION: Theory and models II
Pages: 220 - 228
Year of Publication: 2004
ISBN: 1-58113-938-1
ABSTRACT
We consider the problem of converting documents from rendering-oriented HTML markup into semantic-oriented XML annotation defined by user-specific DTDs or XML Schema descriptions. We represent both source and target documents as rooted ordered trees, so that the conversion can be achieved by applying a set of tree transformations. We cast the conversion task in the supervised learning framework, in which the tree transformations are learned from a set of training examples. Because of the complexity of tree-to-tree transformations, we develop a two-step approach to the conversion problem that first labels the leaves of the source trees and then recomposes target trees from the leaf labels. We present two solutions, based on classifying the leaves with target terminals and with target paths. Moreover, we develop three methods for the leaf classification. All methods and solutions have been tested on two real collections.
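The two-step idea in the abstract (label the source leaves, then recompose a target tree from the labels) can be illustrated with a toy sketch. This is not the paper's method: the abstract does not describe the classifiers or the recomposition procedure, so the majority-vote "classifier" keyed on the enclosing HTML tag, the function names, the sample HTML, and the target paths such as `article/front/title` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def source_leaves(html):
    """Return (enclosing tag, text) for each text-bearing element, in document order.

    Simplification: only element.text is read; tail text is ignored.
    """
    root = ET.fromstring(html)
    return [(el.tag, el.text.strip())
            for el in root.iter() if el.text and el.text.strip()]


def train(examples):
    """Step 1 (toy stand-in): learn the most frequent target path per source tag.

    examples is a list of (source tag, target path) pairs.
    """
    votes = defaultdict(Counter)
    for tag, path in examples:
        votes[tag][path] += 1
    return {tag: counts.most_common(1)[0][0] for tag, counts in votes.items()}


def recompose(labeled):
    """Step 2: build one target tree from (target path, text) pairs.

    Assumes every path has at least two components and the same first component.
    """
    root = None
    for path, text in labeled:
        parts = path.split("/")
        if root is None:
            root = ET.Element(parts[0])
        node = root
        for name in parts[1:-1]:
            child = node.find(name)          # reuse an existing inner node
            if child is None:
                child = ET.SubElement(node, name)
            node = child
        # Always create a new leaf so repeated labels (several authors) are kept.
        ET.SubElement(node, parts[-1]).text = text
    return root


# Hypothetical training pairs and source document.
training = [("h1", "article/front/title"),
            ("i", "article/front/author"),
            ("i", "article/front/author"),
            ("p", "article/body/para")]
html = ("<html><body><h1>Tree learning</h1><i>A. Author</i>"
        "<i>B. Author</i><p>We convert.</p></body></html>")

model = train(training)
labeled = [(model[tag], text) for tag, text in source_leaves(html)]
result = ET.tostring(recompose(labeled), encoding="unicode")
print(result)
```

Labeling leaves with full target paths, as here, lets the recomposition step merge shared path prefixes directly; labeling with target terminals only would leave the inner structure to be inferred from the target DTD or schema, which this sketch does not attempt.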