ABSTRACT
We present the Lixto project, which is both a research project in database theory and a commercial enterprise that develops Web data extraction (wrapping) and Web service definition software. We discuss the project's main motivations and ideas, in particular the use of a logic-based framework for wrapping. We then present theoretical results on monadic datalog over trees and on Elog, its close relative, which is used as the internal wrapper language in the Lixto system. These results characterize both the expressive power and the complexity of these languages. We describe the visual wrapper specification process in Lixto and various practical aspects of wrapping. We discuss work on the complexity of query languages for trees that was inspired by our theoretical study of logic-based languages for wrapping. We then return to the practice of wrapping and to the Lixto Transformation Server, which allows for streaming integration of data extracted from Web pages, a natural requirement in complex services based on Web wrapping. Finally, we discuss industrial applications of Lixto and point to open problems for future study.