Extracting data records from the web using tag path clustering
Source
International World Wide Web Conference
Proceedings of the 18th international conference on World wide web
Madrid, Spain
Session: XML and web data / XML extraction and crawling
Pages 981-990
Year of Publication: 2009
ISBN: 978-1-60558-487-4
Authors
Gengxin Miao, University of California, Santa Barbara, Santa Barbara, CA, USA
Junichi Tatemura, NEC Laboratories America, Cupertino, CA, USA
Wang-Pin Hsiung, NEC Laboratories America, Cupertino, CA, USA
Arsany Sawires, NEC Laboratories America, Cupertino, CA, USA
Louise E. Moser, University of California, Santa Barbara, Santa Barbara, CA, USA
ABSTRACT
Fully automatic methods that extract lists of objects from the Web have been studied extensively. Record extraction, the first step of this object extraction process, identifies a set of Web page segments, each of which represents an individual object (e.g., a product). State-of-the-art methods suffice for simple search, but they often fail to handle more complicated or noisy Web page structures due to a key limitation -- their greedy manner of identifying a list of records through pairwise comparison (i.e., similarity match) of consecutive segments. This paper introduces a new method for record extraction that captures a list of objects in a more robust way based on a holistic analysis of a Web page. The method focuses on how a distinct tag path appears repeatedly in the DOM tree of the Web document. Instead of comparing a pair of individual segments, it compares a pair of tag path occurrence patterns (called visual signals) to estimate how likely these two tag paths represent the same list of objects. The paper introduces a similarity measure that captures how closely the visual signals appear and interleave. Clustering of tag paths is then performed based on this similarity measure, and sets of tag paths that form the structure of data records are extracted. Experiments show that this method achieves higher accuracy than previous methods.
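To make the idea of a tag path occurrence pattern concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a tag path is the slash-joined chain of tag names from the root to an element, and that a visual signal can be approximated by a binary vector marking where each path occurs in the sequence of opening tags. The names `TagPathCollector`, `tag_path_signals`, and `VOID_TAGS` are hypothetical, and the paper's similarity measure and clustering step are not reproduced here.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, assumed for this sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Records, for every distinct tag path, the positions at which it occurs."""

    def __init__(self):
        super().__init__()
        self.stack = []                       # currently open tags, root first
        self.position = 0                     # index of the next opening tag
        self.occurrences = defaultdict(list)  # tag path -> occurrence positions

    def handle_starttag(self, tag, attrs):
        path = "/".join(self.stack + [tag])
        self.occurrences[path].append(self.position)
        self.position += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate sloppy markup: unwind the stack to the matching open tag.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_path_signals(html):
    """Return {tag path: binary occurrence vector} for an HTML string."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    signals = {}
    for path, positions in collector.occurrences.items():
        vector = [0] * collector.position
        for p in positions:
            vector[p] = 1
        signals[path] = vector
    return signals


if __name__ == "__main__":
    page = "<ul><li><a>x</a></li><li><a>y</a></li></ul>"
    for path, vector in tag_path_signals(page).items():
        print(path, vector)
```

In this toy page, the paths `ul/li` and `ul/li/a` each occur twice and alternate with one another, which is the kind of repeated, interleaved pattern the abstract describes as evidence that two tag paths belong to the same list of records.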