ABSTRACT
Many Web sites, especially those that dynamically generate HTML pages to display the results of a user's query, present information in the form of lists or tables. Current tools that allow applications to programmatically extract this information rely heavily on user input, often in the form of labeled extracted records. The sheer size and rate of growth of the Web make any solution that relies primarily on user input infeasible in the long term. Fortunately, many Web sites contain much explicit and implicit structure, both in layout and content, that we can exploit for the purpose of information extraction. This paper describes an approach to automatic extraction and segmentation of records from Web tables. Automatic methods do not require any user input, but rely solely on the layout and content of the Web source. Our approach relies on the common structure of many Web sites, which present information as a list or a table, with a link in each entry leading to a detail page containing additional information about that item. We describe two algorithms that use redundancies in the content of table and detail pages to aid in information extraction. The first algorithm encodes additional information provided by detail pages as constraints and finds the segmentation by solving a constraint satisfaction problem. The second algorithm uses probabilistic inference to find the record segmentation. We show how each approach can exploit the Web site structure in a general, domain-independent manner, and we demonstrate the effectiveness of each algorithm on a set of twelve Web sites.
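To make the constraint-based idea concrete, the following is a minimal illustrative sketch, not the formulation used in the paper. It assumes the table page has already been reduced to an ordered list of text extracts and that each detail page is available as plain text. The function name `segment_records`, the three constraints, and the sample data are all assumptions introduced for illustration; a plain backtracking search stands in for whatever solver an actual system would use.

```python
from typing import List, Optional, Sequence


def segment_records(extracts: Sequence[str],
                    detail_pages: Sequence[str]) -> Optional[List[int]]:
    """Assign each table extract to a record index, or return None.

    Constraints assumed for this sketch:
      1. an extract may go to record r only if its text occurs on detail page r;
      2. record indices never decrease along the table (records are contiguous);
      3. a text is assigned to record r no more often than it occurs on page r.
    """
    # Candidate records for each extract: detail pages that contain its text.
    domains = [[r for r, page in enumerate(detail_pages) if text in page]
               for text in extracts]
    assignment: List[int] = []

    def backtrack(i: int) -> bool:
        if i == len(extracts):
            return True
        lowest = assignment[-1] if assignment else 0
        for r in domains[i]:
            if r < lowest:
                continue  # would break the ordering constraint
            used = sum(1 for j, a in enumerate(assignment)
                       if a == r and extracts[j] == extracts[i])
            if used >= detail_pages[r].count(extracts[i]):
                continue  # page r has no unused occurrence of this text
            assignment.append(r)
            if backtrack(i + 1):
                return True
            assignment.pop()
        return False

    return assignment if backtrack(0) else None


# Hypothetical data: two listings that share the same name.
extracts = ["John Smith", "New Holland", "(740) 335-5555",
            "John Smith", "Washington", "(740) 335-1234"]
pages = ["John Smith 221 Main St New Holland (740) 335-5555",
         "John Smith 89 Oak Ave Washington (740) 335-1234"]
result = segment_records(extracts, pages)
print(result)  # [0, 0, 0, 1, 1, 1]
```

The repeated name is ambiguous on its own, since it appears on both detail pages; the ordering and occurrence-count constraints are what force the second occurrence into the second record.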