ABSTRACT
Recent work has shown the feasibility and promise of template-independent Web data extraction. However, existing approaches use decoupled strategies, performing data record detection and attribute labeling in two separate phases. In this paper, we show that extracting data records and attributes separately is highly ineffective, and we propose a probabilistic model that performs the two tasks simultaneously. In our approach, record detection benefits from the semantics made available by attribute labeling and, at the same time, the accuracy of attribute labeling improves when data records are labeled collectively. The proposed model is called Hierarchical Conditional Random Fields. It can efficiently integrate all useful features by learning their importance, and it can also incorporate hierarchical interactions, which are very important for Web data extraction. We empirically compare the proposed model with existing decoupled approaches on product information extraction, and the results show significant improvements in both record detection and attribute labeling.
CITED BY 17
Zaiqing Nie, Yunxiao Ma, Shuming Shi, Ji-Rong Wen, Wei-Ying Ma, Web object retrieval, Proceedings of the 16th international conference on World Wide Web, May 08-12, 2007, Banff, Alberta, Canada
Jun Zhu, Zaiqing Nie, Bo Zhang, Ji-Rong Wen, Dynamic hierarchical Markov random fields and their application to web data extraction, Proceedings of the 24th international conference on Machine learning, p.1175-1182, June 20-24, 2007, Corvallis, Oregon
Xin Xin, Juanzi Li, Jie Tang, Qiong Luo, Academic conference homepage understanding using constrained hierarchical conditional random fields, Proceedings of the 17th ACM conference on Information and knowledge management, October 26-30, 2008, Napa Valley, California, USA
Jun Zhu, Bo Zhang, Zaiqing Nie, Ji-Rong Wen, Hsiao-Wuen Hon, Webpage understanding: an integrated approach, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
Chunyu Yang, Yong Cao, Zaiqing Nie, Jie Zhou, Ji-Rong Wen, Closing the loop in webpage understanding, Proceedings of the 17th ACM conference on Information and knowledge management, October 26-30, 2008, Napa Valley, California, USA
Gengxin Miao, Junichi Tatemura, Wang-Pin Hsiung, Arsany Sawires, Louise E. Moser, Extracting data records from the web using tag path clustering, Proceedings of the 18th international conference on World Wide Web, April 20-24, 2009, Madrid, Spain
Jiang-Ming Yang, Rui Cai, Yida Wang, Jun Zhu, Lei Zhang, Wei-Ying Ma, Incorporating site-level knowledge to extract structured data from web forums, Proceedings of the 18th international conference on World Wide Web, April 20-24, 2009, Madrid, Spain
Jun Zhu, Zaiqing Nie, Xiaojiang Liu, Bo Zhang, Ji-Rong Wen, StatSnowball: a statistical approach to extracting entity relationships, Proceedings of the 18th international conference on World Wide Web, April 20-24, 2009, Madrid, Spain
Ping Luo, Fen Lin, Yuhong Xiong, Yong Zhao, Zhongzhi Shi, Towards combining web classification and web information extraction: a case study, Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, June 28-July 01, 2009, Paris, France