Mining templates from search result records of search engines
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
San Jose, California, USA
SESSION: Research track papers
Pages: 884 - 893
Year of Publication: 2007
ISBN: 978-1-59593-609-7
Authors
Hongkun Zhao  State University of New York at Binghamton
Weiyi Meng  State University of New York at Binghamton
Clement Yu  University of Illinois at Chicago
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 12,   Downloads (12 Months): 148,   Citation Count: 5
DOI: http://doi.acm.org/10.1145/1281192.1281286

ABSTRACT

Metasearch engine, comparison-shopping, and Deep Web crawling applications need to extract the search result records embedded in the result pages that search engines return in response to user queries. The search result records from a given search engine are usually formatted according to a template. Precisely identifying this template greatly helps in correctly extracting and annotating the data units within each record. In this paper, we propose a graph model to represent the record template and develop a domain-independent statistical method to automatically mine the record template for any search engine from sample search result records. Our approach identifies both template tags (HTML tags) and template texts (non-tag texts), and it explicitly addresses mismatches between the tag structures and the data structures of search result records. Our experimental results indicate that this approach is very effective.
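
To make the distinction between template tags and template texts concrete, the following is a minimal sketch of a much simpler idea: counting how many sample records each token occurs in and treating frequent tokens as template candidates. It is not the graph model or the statistical method proposed in the paper, and it ignores token order and tag/data mismatches. The function names, the sample records, and the `min_support` threshold are all assumptions made for illustration.

```python
import re
from collections import Counter

# A token is either a whole HTML tag or a run of non-tag text.
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")
TAG_NAME_RE = re.compile(r"<\s*(/?)\s*([A-Za-z0-9]+)")


def tokenize(record):
    """Split one record into normalized tag tokens and stripped text tokens."""
    tokens = []
    for piece in TOKEN_RE.findall(record):
        piece = piece.strip()
        if not piece:
            continue
        if piece.startswith("<"):
            # Drop attributes so '<a href="/p/1">' and '<a href="/p/2">' match.
            m = TAG_NAME_RE.match(piece)
            if m:
                tokens.append("<" + m.group(1) + m.group(2).lower() + ">")
        else:
            tokens.append(piece)
    return tokens


def mine_template_tokens(records, min_support=0.8):
    """Return (tags, texts) that occur in at least min_support of the records.

    Hypothetical heuristic: record-level frequency only, no structure.
    """
    record_freq = Counter()
    for record in records:
        record_freq.update(set(tokenize(record)))  # count each token once per record
    threshold = min_support * len(records)
    frequent = [tok for tok, count in record_freq.items() if count >= threshold]
    tags = sorted(tok for tok in frequent if tok.startswith("<"))
    texts = sorted(tok for tok in frequent if not tok.startswith("<"))
    return tags, texts


# Made-up sample records standing in for one search engine's result records.
records = [
    '<div><a href="/p/1">Red Kettle</a><span>Price:</span><b>$19.99</b></div>',
    '<div><a href="/p/2">Blue Mug</a><span>Price:</span><b>$4.50</b></div>',
    '<div><a href="/p/3">Green Bowl</a><span>Price:</span><b>$7.25</b></div>',
]
tags, texts = mine_template_tokens(records)
print("template tags:", tags)
print("template texts:", texts)
```

In this toy input, the label `Price:` recurs in every record and is reported as a template text, while product names and prices vary and are treated as data. A frequency count of this kind cannot tell a template text from a data value that happens to repeat, which is one reason a structural model such as the one the abstract describes is needed.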


