ACM Home Page
Please provide us with feedback. Feedback
On the complexity of schema inference from web pages in the presence of nullable data attributes
Full text PdfPdf (180 KB)
Source Conference on Information and Knowledge Management archive
Proceedings of the twelfth international conference on Information and knowledge management table of contents
New Orleans, LA, USA
SESSION: Database session 4: heterogeneous and distributed systems table of contents
Pages: 224 - 231  
Year of Publication: 2003
ISBN:1-58113-723-0
Authors
Guizhen Yang  University at Buffalo, Buffalo, NY
I. V. Ramakrishnan  Stony Brook University, Stony Brook, NY
Michael Kifer  Stony Brook University, Stony Brook, NY
Sponsors
ACM: Association for Computing Machinery
SIGMIS: ACM Special Interest Group on Management Information Systems
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 4,   Downloads (12 Months): 27,   Citation Count: 3
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/956863.956907
What is a DOI?

ABSTRACT

An increasingly large number of Web pages are machine-generated by filling in templates with data stored in backend databases. These templates can be viewed as the implicit schemas of those Web pages. The ability to infer the implicit schema from a collection of Web pages is important for scalable data extraction, since the inferred schema can be used to automatically identify schema attributes that are "encoded" in Web pages.However, the task of inferring a "good" schema is complicated due to the existence of nullable (missing) data attributes. Usually if an attribute contains a null value, then it will be omitted in the generated Web page, giving rise to different variations and permutations of layout structures in Web pages that are generated from the same template.In this paper we investigate the complexity of schema inference from Web pages in the presence of nullable data attributes. We introduce the notion of unambiguity as a quality measure for inferred schemas and prove that the problem of inferring "good" (unambiguous) schemas is NP-complete. Our complexity results imply that ambiguity resolution is one of the root causes of the computational difficulty underlying schema inference from Web pages.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
D. Angluin. On the complexity of minimum inference of regular sets. Information and Control, 39(3):337--350, 1978.
 
2
3
4
 
5
 
6
7
 
8
9
10
 
11
E. M. Gold. Language identication in the limit. Information and Control, 10(5):447--474, 1967.
 
12
E. M. Gold. Complexity of automaton identication from given data. Information and Control, 37(3):302--320, 1978.
 
13
 
14
 
15
 
16
N. Kushmerick, D. S. Weld, and R. B. Doorenbos. Wrapper induction for information extraction. In International Joint Conference on Articial Intelligence (IJCAI), 1997.
17
 
18
 
19
20
 
21
 
22
K.-J. R.aih. a and E. Ukkonen. The shortest common supersequence problem over binary alphabet is NP-complete. Theoretical Computer Science, 16:187--198, 1981.
 
23


Collaborative Colleagues:
Guizhen Yang: colleagues
I. V. Ramakrishnan: colleagues
Michael Kifer: colleagues