ABSTRACT
Recently, the Web has been rapidly "deepened" by many searchable databases online, where data are hidden behind query forms. For modelling and integrating Web databases, the very first challenge is to understand what a query interface says, or what query capabilities a source supports. Automatically extracting such interface semantics is challenging, as query forms are created autonomously. Our approach builds on the observation that, across myriad sources, query forms seem to reveal some "concerted structure" by sharing common building blocks. Building on this insight, we hypothesize the existence of a hidden syntax that guides the creation of query interfaces, even across different sources. This hypothesis effectively transforms query interfaces into a visual language with a non-prescribed grammar and, thus, turns their semantic understanding into a parsing problem. Such a paradigm enables principled solutions for both declaratively representing common patterns, by a derived grammar, and systematically interpreting query forms, by a global parsing mechanism. To realize this paradigm, we must address the challenges of a hypothetical syntax: that it must be derived, and that it is secondary to the input. At the heart of our form extractor, we thus develop a 2P grammar and a best-effort parser, which together realize a parsing mechanism for a hypothetical syntax. Our experiments show the promise of this approach: it achieves above 85% accuracy for extracting query conditions across random sources.
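The idea of reading a query form as a visual language can be made concrete with a toy sketch. The code below is not the paper's 2P grammar or its best-effort parser; it is a minimal illustration under stated assumptions. All names (`Token`, `left_of`, `above`, `PATTERNS`, `extract_conditions`) and the grid-based layout model are hypothetical. It shows three things the abstract alludes to: layout patterns declared once and applied to any form, an ordering among patterns used to settle conflicting interpretations, and a partial result returned when some tokens fit no pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One visual element of a form, placed on a coarse grid (an assumption)."""
    kind: str  # "label" or "textbox"
    text: str  # visible text for labels, field name for textboxes
    row: int
    col: int


def left_of(label, box):
    # The label sits immediately to the left of the box, on the same row.
    return label.row == box.row and label.col + 1 == box.col


def above(label, box):
    # The label sits immediately above the box, in the same column.
    return label.col == box.col and label.row + 1 == box.row


# Declarative patterns, listed in priority order: earlier patterns win conflicts.
PATTERNS = [("left", left_of), ("above", above)]


def extract_conditions(tokens):
    """Pair labels with textboxes; return (conditions, unmatched tokens)."""
    labels = [t for t in tokens if t.kind == "label"]
    boxes = [t for t in tokens if t.kind == "textbox"]
    used = set()
    conditions = []
    for name, relation in PATTERNS:
        for box in boxes:
            if box in used:
                continue
            for label in labels:
                if label not in used and relation(label, box):
                    conditions.append((label.text, box.text, name))
                    used.update((label, box))
                    break
    # Best effort: tokens that fit no pattern are reported, not treated as errors.
    leftover = [t for t in tokens if t not in used]
    return conditions, leftover


FORM = [
    Token("label", "Author", 0, 0), Token("textbox", "au", 0, 1),
    Token("label", "Keywords", 1, 0), Token("textbox", "kw", 2, 0),
    Token("label", "(optional)", 2, 1),
    Token("label", "Price", 3, 0), Token("textbox", "pr", 3, 1),
]

conditions, leftover = extract_conditions(FORM)
print(conditions)
print([t.text for t in leftover])
```

In this sample form, the textbox `pr` could be claimed either by "Price" on its left or by "(optional)" above it. Because the "left" pattern is listed first, "Price" wins and "(optional)" is returned as an unmatched token instead of causing the whole interpretation to fail.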