Supervised learning for the legacy document conversion
Source Document Engineering archive
Proceedings of the 2004 ACM symposium on Document engineering
Milwaukee, Wisconsin, USA
SESSION: Theory and models II
Pages: 220 - 228  
Year of Publication: 2004
ISBN:1-58113-938-1
Authors
Boris Chidlovskii  Xerox Research Centre Europe, Meylan, France
Jérôme Fuselier  Xerox Research Centre Europe, Meylan, France
Sponsors
ACM: Association for Computing Machinery
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1030397.1030439

ABSTRACT

We consider the problem of converting documents from rendering-oriented HTML markup into semantic-oriented XML annotation defined by user-specific DTDs or XML Schema descriptions. We represent both source and target documents as rooted ordered trees, so that the conversion can be achieved by applying a set of tree transformations. We apply the supervised learning framework to the conversion task, in which the tree transformations are learned from a set of training examples. Because of the complexity of tree-to-tree transformations, we develop a two-step approach to the conversion problem that first labels the leaves of the source trees and then recomposes target trees from the leaf labels. We present two solutions, based on classifying leaves with target terminals and with target paths. Moreover, we develop three methods for leaf classification. All methods and solutions have been tested on two real collections.
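
The two-step idea in the abstract (label source leaves, then recompose a target tree from the labels) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the slash-separated path labels, the rule-based `toy_classifier` standing in for a learned classifier, and the merge heuristic (reuse the most recent sibling with the same tag) are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def label_leaves(leaves, classify):
    """Step 1: give each source leaf a target path label (None drops the leaf)."""
    return [(classify(text), text) for text in leaves]

def recompose(labeled_leaves, root_tag):
    """Step 2: rebuild a target tree from (path, text) pairs in document order.

    Assumed heuristic: an inner element is reused when the most recently
    added child of the current node has the same tag; otherwise a new
    element is opened. The final path step always creates a new leaf.
    """
    root = ET.Element(root_tag)
    for path, text in labeled_leaves:
        if path is None:
            continue  # leaf carries no semantic content (e.g. navigation text)
        node = root
        *inner, leaf_tag = path.split("/")
        for tag in inner:
            last = node[-1] if len(node) else None
            if last is not None and last.tag == tag:
                node = last
            else:
                node = ET.SubElement(node, tag)
        ET.SubElement(node, leaf_tag).text = text
    return root

def toy_classifier(text):
    """Hypothetical stand-in for a classifier learned from training examples."""
    if "@" in text:
        return "author/email"
    if text.istitle():
        return "author/name"
    return None

# Leaves of a source HTML tree, already flattened to text in document order.
leaves = ["Jane Doe", "jane@example.org", "click here to print"]
tree = recompose(label_leaves(leaves, toy_classifier), "paper")
result = ET.tostring(tree, encoding="unicode")
print(result)
```

Labeling leaves with full paths rather than terminal names alone lets the recomposition step recover the nesting directly from the labels, at the cost of a larger label set for the classifier.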


