Extracting structured data from Web pages
Source: International Conference on Management of Data
Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data
San Diego, California
Session: Data integration and sharing II
Pages: 337-348
Year of Publication: 2003
ISBN: 1-58113-634-X
Authors
Arvind Arasu  Stanford University, Palo Alto, CA
Hector Garcia-Molina  Stanford University, Palo Alto, CA
Sponsor
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 41,   Downloads (12 Months): 327,   Citation Count: 82
DOI: http://doi.acm.org/10.1145/872757.872799

ABSTRACT

Many web sites contain large sets of pages generated using a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way in all its book pages. The values used to generate the pages (e.g., the author, title, ...) typically come from a database. In this paper, we study the problem of automatically extracting the database values from such template-generated web pages without any learning examples or other similar human input. We formally define a template, and propose a model that describes how values are encoded into pages using a template. We present an algorithm that takes, as input, a set of template-generated pages, deduces the unknown template used to generate the pages, and extracts, as output, the values encoded in the pages. Experimental evaluation on a large number of real input page collections indicates that our algorithm correctly extracts data in most cases.
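
To make the problem statement concrete, the following is a minimal Python sketch of the input/output contract described above: a set of template-generated pages goes in, and a guessed template plus the per-page values come out. It is a toy heuristic, not the algorithm presented in the paper. The function names `deduce_template` and `extract_values`, the whitespace tokenization, and the sample pages are all assumptions made for illustration. The heuristic treats any token that occurs exactly once in every page as part of the template and everything else as data.

```python
from typing import List


def deduce_template(pages: List[List[str]]) -> List[str]:
    """Guess the template: tokens occurring exactly once in every page.

    Assumption (toy heuristic): template tokens appear once per page and
    in the same order on all pages; the order is taken from the first page.
    """
    first = pages[0]
    return [tok for tok in first if all(p.count(tok) == 1 for p in pages)]


def extract_values(page: List[str], template: List[str]) -> List[str]:
    """Collect the runs of non-template tokens between template tokens."""
    fixed = set(template)
    values: List[str] = []
    current: List[str] = []
    for tok in page:
        if tok in fixed:
            # A template token closes the value run collected so far.
            if current:
                values.append(" ".join(current))
                current = []
        else:
            current.append(tok)
    if current:
        values.append(" ".join(current))
    return values


# Hypothetical sample pages generated from one layout (whitespace-tokenized).
pages = [
    "<b> Title: </b> Dune <i> Author: </i> Frank Herbert".split(),
    "<b> Title: </b> Emma <i> Author: </i> Jane Austen".split(),
]

template = deduce_template(pages)
rows = [extract_values(page, template) for page in pages]
print(template)
print(rows)
```

This sketch breaks as soon as a data value happens to be identical on every page, or a template token repeats within a page, which is exactly why a formal template definition and a more careful deduction procedure are needed.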


