| On the complexity of schema inference from web pages in the presence of nullable data attributes |
| Full text |
Pdf
(180 KB)
|
| Source
|
Conference on Information and Knowledge Management
archive
Proceedings of the twelfth international conference on Information and knowledge management
table of contents
New Orleans, LA, USA
SESSION: Database session 4: heterogeneous and distributed systems
table of contents
Pages: 224 - 231
Year of Publication: 2003
ISBN:1-58113-723-0
|
|
Authors
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 4, Downloads (12 Months): 27, Citation Count: 3
|
|
|
ABSTRACT
An increasingly large number of Web pages are machine-generated by filling in templates with data stored in backend databases. These templates can be viewed as the implicit schemas of those Web pages. The ability to infer the implicit schema from a collection of Web pages is important for scalable data extraction, since the inferred schema can be used to automatically identify schema attributes that are "encoded" in Web pages.However, the task of inferring a "good" schema is complicated due to the existence of nullable (missing) data attributes. Usually if an attribute contains a null value, then it will be omitted in the generated Web page, giving rise to different variations and permutations of layout structures in Web pages that are generated from the same template.In this paper we investigate the complexity of schema inference from Web pages in the presence of nullable data attributes. We introduce the notion of unambiguity as a quality measure for inferred schemas and prove that the problem of inferring "good" (unambiguous) schemas is NP-complete. Our complexity results imply that ambiguity resolution is one of the root causes of the computational difficulty underlying schema inference from Web pages.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
D. Angluin. On the complexity of minimum inference of regular sets. Information and Control, 39(3):337--350, 1978.
|
| |
2
|
|
 |
3
|
|
 |
4
|
|
| |
5
|
|
| |
6
|
|
 |
7
|
|
| |
8
|
|
 |
9
|
Hasan Davulcu , Guizhen Yang , Michael Kifer , I. V. Ramakrishnan, Computational aspects of resilient data extraction from semistructured sources (extended abstract), Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.136-144, May 15-18, 2000, Dallas, Texas, United States
[doi> 10.1145/335168.335216]
|
 |
10
|
Minos Garofalakis , Aristides Gionis , Rajeev Rastogi , S. Seshadri , Kyuseok Shim, XTRACT: a system for extracting document type descriptors from XML documents, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.165-176, May 15-18, 2000, Dallas, Texas, United States
|
| |
11
|
E. M. Gold. Language identication in the limit. Information and Control, 10(5):447--474, 1967.
|
| |
12
|
E. M. Gold. Complexity of automaton identication from given data. Information and Control, 37(3):302--320, 1978.
|
| |
13
|
|
| |
14
|
|
| |
15
|
|
| |
16
|
N. Kushmerick, D. S. Weld, and R. B. Doorenbos. Wrapper induction for information extraction. In International Joint Conference on Articial Intelligence (IJCAI), 1997.
|
 |
17
|
|
| |
18
|
|
| |
19
|
|
 |
20
|
|
| |
21
|
|
| |
22
|
K.-J. R.aih. a and E. Ukkonen. The shortest common supersequence problem over binary alphabet is NP-complete. Theoretical Computer Science, 16:187--198, 1981.
|
| |
23
|
|
CITED BY 5
|
|
D. C. Reis , P. B. Golgher , A. S. Silva , A. F. Laender, Automatic web news extraction using tree edit distance, Proceedings of the 13th international conference on World Wide Web, May 17-20, 2004, New York, NY, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|