| Incorporating site-level knowledge to extract structured data from web forums |
| Full text |
Pdf
(1.69 MB)
|
Source
|
International World Wide Web Conference
archive
Proceedings of the 18th international conference on World wide web
table of contents
Madrid, Spain
SESSION: Data mining/session: learning
table of contents
Pages 181-190
Year of Publication: 2009
ISBN:978-1-60558-487-4
|
|
Authors
|
|
Jiang-Ming Yang
|
Microsoft Research Asia, Beijing, China
|
|
Rui Cai
|
Microsoft Research Asia, Beijing, China
|
|
Yida Wang
|
CSSAR, Chinese Academy of Sciences, Beijing, China
|
|
Jun Zhu
|
Dept. Computer Science and Technology, Tsinghua University, Beijing, China
|
|
Lei Zhang
|
Microsoft Research Asia, Beijing, China
|
|
Wei-Ying Ma
|
Microsoft Research Asia, Beijing, China
|
|
| Sponsor |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 46, Downloads (12 Months): 174, Citation Count: 0
|
|
|
ABSTRACT
Web forums have become an important data resource for many web applications, but extracting structured data from unstructured web forum pages is still a challenging task due to both complex page layout designs and unrestricted user created posts. In this paper, we study the problem of structured data extraction from various web forum sites. Our target is to find a solution as general as possible to extract structured data, such as post title, post author, post time, and post content from any forum site. In contrast to most existing information extraction methods, which only leverage the knowledge inside an individual page, we incorporate both page-level and site-level knowledge and employ Markov logic networks (MLNs) to effectively integrate all useful evidence by learning their importance automatically. Site-level knowledge includes (1) the linkages among different object pages, such as list pages and post pages, and (2) the interrelationships of pages belonging to the same object. The experimental results on 20 forums show a very encouraging information extraction performance, and demonstrate the ability of the proposed approach on various forums. We also show that the performance is limited if only page-level knowledge is used, while when incorporating the site-level knowledge both precision and recall can be significantly improved.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Big boards. http://directory.big--boards.com/, 2008.
|
| |
2
|
|
 |
3
|
|
 |
4
|
Rui Cai , Jiang-Ming Yang , Wei Lai , Yida Wang , Lei Zhang, iRobot: an intelligent crawler for web forums, Proceeding of the 17th international conference on World Wide Web, April 21-25, 2008, Beijing, China
[doi> 10.1145/1367497.1367558]
|
 |
5
|
Gao Cong , Long Wang , Chin-Yew Lin , Young-In Song , Yueheng Sun, Finding question-answer pairs from online forums, Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, July 20-24, 2008, Singapore, Singapore
[doi> 10.1145/1390334.1390415]
|
 |
6
|
Natalie Glance , Matthew Hurst , Kamal Nigam , Matthew Siegler , Robert Stockton , Takashi Tomokiyo, Deriving marketing intelligence from online discussion, Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, August 21-24, 2005, Chicago, Illinois, USA
[doi> 10.1145/1081870.1081919]
|
| |
7
|
|
| |
8
|
|
 |
9
|
|
| |
10
|
K. Lerman, S. Minton, and C. Knoblock. Wrapper maintenance: A machine learning approach. Journal of Artificial Intelligence Research, 18:149--181, 2003.
|
 |
11
|
|
| |
12
|
H. Poon and P. Domingos. Joint inference in information extraction. In Proc. 22nd AAAI, pages 913--918, Vancouver, Canada, July 2007.
|
| |
13
|
|
| |
14
|
P. Singla and P. Domingos. Discriminative training of markov logic networks. In Proc. 20nd AAAI, 2005.
|
| |
15
|
|
 |
16
|
Yida Wang , Jiang-Ming Yang , Wei Lai , Rui Cai , Lei Zhang , Wei-Ying Ma, Exploring traversal strategy for web forum crawling, Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, July 20-24, 2008, Singapore, Singapore
[doi> 10.1145/1390334.1390413]
|
 |
17
|
|
 |
18
|
|
 |
19
|
Shuyi Zheng , Ruihua Song , Ji-Rong Wen , Di Wu, Joint optimization of wrapper generation and template detection, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
[doi> 10.1145/1281192.1281287]
|
 |
20
|
Jun Zhu , Zaiqing Nie , Ji-Rong Wen , Bo Zhang , Wei-Ying Ma, 2D Conditional Random Fields for Web information extraction, Proceedings of the 22nd international conference on Machine learning, p.1044-1051, August 07-11, 2005, Bonn, Germany
[doi> 10.1145/1102351.1102483]
|
 |
21
|
Jun Zhu , Zaiqing Nie , Ji-Rong Wen , Bo Zhang , Wei-Ying Ma, Simultaneous record detection and attribute labeling in web data extraction, Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, August 20-23, 2006, Philadelphia, PA, USA
[doi> 10.1145/1150402.1150457]
|
|