Incorporating site-level knowledge to extract structured data from web forums
Source
International World Wide Web Conference archive
Proceedings of the 18th International Conference on World Wide Web
Madrid, Spain
SESSION: Data mining/session: learning
Pages 181-190  
Year of Publication: 2009
ISBN:978-1-60558-487-4
Authors
Jiang-Ming Yang  Microsoft Research Asia, Beijing, China
Rui Cai  Microsoft Research Asia, Beijing, China
Yida Wang  CSSAR, Chinese Academy of Sciences, Beijing, China
Jun Zhu  Dept. Computer Science and Technology, Tsinghua University, Beijing, China
Lei Zhang  Microsoft Research Asia, Beijing, China
Wei-Ying Ma  Microsoft Research Asia, Beijing, China
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1526709.1526735

ABSTRACT

Web forums have become an important data resource for many web applications, but extracting structured data from unstructured web forum pages remains challenging because of both complex page layout designs and unrestricted user-created posts. In this paper, we study the problem of structured data extraction from various web forum sites. Our goal is a solution general enough to extract structured data, such as post title, post author, post time, and post content, from any forum site. In contrast to most existing information extraction methods, which leverage only the knowledge inside an individual page, we incorporate both page-level and site-level knowledge and employ Markov logic networks (MLNs) to integrate all useful evidence by learning its importance automatically. Site-level knowledge includes (1) the linkages among different object pages, such as list pages and post pages, and (2) the interrelationships of pages belonging to the same object. Experimental results on 20 forums show very encouraging information extraction performance and demonstrate that the proposed approach works across various forums. We also show that performance is limited when only page-level knowledge is used, whereas incorporating site-level knowledge significantly improves both precision and recall.
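
To make the idea of combining weighted evidence concrete, the following is a minimal sketch of MLN-style scoring for a single DOM node, not the authors' implementation. It assumes the standard MLN form, in which the probability of a labeling is proportional to the exponential of the summed weights of the formulas it satisfies. The formulas, the evidence keys (`looks_like_date`, `path_matches_site`), and the weights are all hypothetical; in an MLN the weights would be learned from labeled data rather than set by hand.

```python
import math

# Hypothetical weighted formulas (weights are made up for illustration).
# Each formula takes the evidence dict and a candidate label y
# (True = "this node is the post time") and returns whether it is satisfied.
PAGE_LEVEL = (1.5, lambda ev, y: (not ev["looks_like_date"]) or y)
SITE_LEVEL = (2.0, lambda ev, y: (not ev["path_matches_site"]) or y)
PRIOR = (1.0, lambda ev, y: not y)  # most nodes are not the post time

def score(ev, y, formulas):
    """Sum of the weights of satisfied formulas (unnormalized log-probability)."""
    return sum(w for w, f in formulas if f(ev, y))

def prob_post_time(ev, formulas):
    """P(y = True | evidence), normalizing over the two possible labelings."""
    s_true = math.exp(score(ev, True, formulas))
    s_false = math.exp(score(ev, False, formulas))
    return s_true / (s_true + s_false)

# A node that looks like a date and shares its DOM path with
# post-time nodes on other pages of the same site (assumed evidence).
evidence = {"looks_like_date": True, "path_matches_site": True}

page_only = prob_post_time(evidence, [PAGE_LEVEL, PRIOR])
page_and_site = prob_post_time(evidence, [PAGE_LEVEL, SITE_LEVEL, PRIOR])
print(f"page-level only: {page_only:.3f}")
print(f"page + site:     {page_and_site:.3f}")
```

With one query atom per node this reduces to a logistic model; a full MLN grounds its formulas over many nodes and pages and performs inference over all of them jointly, which this sketch does not attempt.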


