Understanding Web query interfaces: best-effort parsing with hidden syntax
Source: International Conference on Management of Data
Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data
Paris, France
SESSION: Research sessions: Web, XML and IR
Pages: 107 - 118  
Year of Publication: 2004
ISBN: 1-58113-859-8
Authors
Zhen Zhang  University of Illinois at Urbana-Champaign
Bin He  University of Illinois at Urbana-Champaign
Kevin Chen-Chuan Chang  University of Illinois at Urbana-Champaign
Sponsor
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1007568.1007583

ABSTRACT

Recently, the Web has been rapidly "deepened" by many searchable databases online, where data are hidden behind query forms. For modelling and integrating Web databases, the very first challenge is to understand what a query interface says, or what query capabilities a source supports. Such automatic extraction of interface semantics is challenging, as query forms are created autonomously. Our approach builds on the observation that, across myriad sources, query forms seem to reveal some "concerted structure" by sharing common building blocks. Building on this insight, we hypothesize the existence of a hidden syntax that guides the creation of query interfaces, albeit from different sources. This hypothesis effectively transforms query interfaces into a visual language with a non-prescribed grammar and, thus, turns their semantic understanding into a parsing problem. Such a paradigm enables principled solutions for both declaratively representing common patterns, by a derived grammar, and systematically interpreting query forms, by a global parsing mechanism. To realize this paradigm, we must address the challenges of a hypothetical syntax: that it is to be derived, and that it is secondary to the input. At the heart of our form extractor, we thus develop a 2P grammar and a best-effort parser, which together realize a parsing mechanism for a hypothetical syntax. Our experiments show the promise of this approach: it achieves above 85% accuracy for extracting query conditions across random sources.


