|
ABSTRACT
Research on information extraction from Web pages (wrapping) has seen much activity in recent times (particularly systems implementations), but little work has been done on formally studying the expressiveness of the formalisms proposed or on the theoretical foundations of wrapping.In this paper, we first study monadic datalog as a wrapping language (over ranked or unranked tree structures). Using previous work by Neven and Schwentick, we show that this simple language is equivalent to full monadic second order logic (MSO) in its ability to specify wrappers. We believe that MSO has the right expressiveness required for Web information extraction and thus propose MSO as a yardstick for evaluating and comparing wrappers.Using the above result, we study the kernel fragment Elog- of the Elog wrapping language used in the Lixto system (a visual wrapper generator). The striking fact here is that Elog- exactly captures MSO, yet is easier to use. Indeed, programs in this language can be entirely visually specified. We also formally compare Elog to other wrapping languages proposed in the literature.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web. Morgan Kaufmann Publishers, 2000.
|
 |
2
|
|
| |
3
|
|
| |
4
|
|
 |
5
|
|
| |
6
|
F. Azavant. Personal communication, Oct. 2001.
|
| |
7
|
|
| |
8
|
|
| |
9
|
A. Brüggemann-Klein and D. Wood. "Regular Tree Languages over Non-ranked Alphabets". Unpublished manuscript, 1998.
|
| |
10
|
|
| |
11
|
|
 |
12
|
|
| |
13
|
W. F. Dowling and J. H. Gallier. "Linear-Time Algorithms for Testing the Satisfiability of Propositional Horn Formulae". Journal of Logic Programming,1(3):267-284, 1984.
|
| |
14
|
M. Fernandez, J. Siméon, P. Wadler (eds.), S. Cluet, A. Deutsch, D. F. A. Levy, D. Maier, J. M. J. Robie, D. Suciu, and J. Widom. "XML Query Languages: Experiences and Exemplars", 1999. http://www-db.research.bell-labs.com/user/simeon/xquery.html.
|
| |
15
|
|
| |
16
|
|
| |
17
|
|
| |
18
|
|
| |
19
|
|
| |
20
|
L. Liu, C. Pu, and W. Han. "XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources", In Proceedings of the 16th International Conference on Data Engineering, 1998.
|
| |
21
|
|
| |
22
|
|
| |
23
|
I. Muslea, S. Minton, and C. Knoblock. "STALKER: Learning Extraction Rules for Semistructured, Web-based Information Sources", 1998.
|
 |
24
|
|
| |
25
|
|
| |
26
|
C. H. Papadimitriou. Computational Complexity. Addison-Wesley, 1994.
|
| |
27
|
|
| |
28
|
|
| |
29
|
|
| |
30
|
Wolfgang Thomas, Languages, automata, and logic, Handbook of formal languages, vol. 3: beyond words, Springer-Verlag New York, Inc., New York, NY, 1997
|
| |
31
|
|
| |
32
|
World Wide Web Consortium. http://www.w3c.org/tr/xpath/.
|
| |
33
|
Sheng Yu, Regular languages, Handbook of formal languages, vol. 1: word, language, grammar, Springer-Verlag New York, Inc., New York, NY, 1997
|
|