Incorporating site-level knowledge for incremental crawling of web forums: a list-wise strategy
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Paris, France
SESSION: Industrial track papers
Pages 1375-1384  
Year of Publication: 2009
ISBN: 978-1-60558-495-9
Authors
Jiang-Ming Yang  Microsoft Research, Asia, Beijing, China
Rui Cai  Microsoft Research, Asia, Beijing, China
Chunsong Wang  University of Wisconsin-Madison, Madison, WI, USA
Hua Huang  Beijing University of Posts and Telecommunications, Beijing, China
Lei Zhang  Microsoft Research, Asia, Beijing, China
Wei-Ying Ma  Microsoft Research, Asia, Beijing, China
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1557019.1557166

ABSTRACT

This paper studies the incremental crawling of web forums, a fundamental yet challenging step in many web applications. Traditional approaches mainly focus on scheduling the revisiting strategy of each individual page. However, simply assigning different weights to individual pages is usually inefficient for crawling forum sites, because forum sites differ in character from general websites. Instead of treating each page independently, we propose a list-wise strategy that takes site-level knowledge into account. This site-level knowledge is mined by reconstructing the linking structure, called the sitemap, of a given forum site. With the sitemap, posts that belong to the same thread but are distributed over several pages can be concatenated according to their timestamps. For each thread, we then employ a regression model to predict when the next post will arrive. Based on this model, we develop an efficient crawler that is 260% faster than some state-of-the-art methods at fetching newly generated content, while also ensuring a high coverage ratio. Experimental results on 18 different forums show promising performance in terms of coverage, bandwidth utilization, and timeliness.
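
The thread-level idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say which features or regression model the paper uses, so the code below assumes the simplest possible stand-in, an ordinary least-squares line fitted to the gaps between consecutive posts. The `Post` record and the function names `concatenate_threads`, `predict_next_post_time`, and `revisit_order` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Post:
    thread_id: str    # thread the post belongs to (assumed known from the sitemap)
    page_no: int      # page of the thread on which the post was found
    timestamp: float  # posting time in seconds, e.g. Unix time


def concatenate_threads(posts: List[Post]) -> Dict[str, List[float]]:
    """Group posts by thread and sort each thread's timestamps,
    regardless of which page each post was crawled from."""
    threads: Dict[str, List[float]] = {}
    for post in posts:
        threads.setdefault(post.thread_id, []).append(post.timestamp)
    for stamps in threads.values():
        stamps.sort()
    return threads


def predict_next_post_time(stamps: List[float]) -> float:
    """Predict the arrival time of the next post in one thread.

    Assumption: a least-squares line over the inter-post gaps stands in
    for the paper's regression model, whose details are not in the abstract.
    """
    if len(stamps) < 2:
        raise ValueError("need at least two posts to estimate a gap")
    gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
    n = len(gaps)
    if n == 1:
        # A single gap gives no trend; reuse it as the forecast.
        return stamps[-1] + gaps[0]
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(gaps) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, gaps))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    # Extrapolate one step ahead; a negative gap makes no sense, so clamp at 0.
    next_gap = max(intercept + slope * n, 0.0)
    return stamps[-1] + next_gap


def revisit_order(threads: Dict[str, List[float]]) -> List[str]:
    """Order threads by predicted next-post time, soonest first.
    Threads with fewer than two posts are skipped in this sketch."""
    predicted = {
        thread_id: predict_next_post_time(stamps)
        for thread_id, stamps in threads.items()
        if len(stamps) >= 2
    }
    return sorted(predicted, key=predicted.get)
```

A crawler built along these lines would revisit threads in the order returned by `revisit_order`, spending bandwidth on threads expected to receive a new post soonest instead of rescheduling every page independently.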

