| Incorporating site-level knowledge for incremental crawling of web forums: a list-wise strategy |
| Full text |
Mov
(25:04),
Pdf
(1.73 MB)
|
Source
|
International Conference on Knowledge Discovery and Data Mining
archive
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
table of contents
Paris, France
SESSION: Industrial track papers
table of contents
Pages 1375-1384
Year of Publication: 2009
ISBN:978-1-60558-495-9
|
|
Authors
|
|
Jiang-Ming Yang
|
Microsoft Research, Asia, Beijing, China
|
|
Rui Cai
|
Microsoft Research, Asia, Beijing, China
|
|
Chunsong Wang
|
University of Wisconsin-Madison, Madison, WI, USA
|
|
Hua Huang
|
Beijing University of Posts and Telecommunications, Beijing, China
|
|
Lei Zhang
|
Microsoft Research, Asia, Beijing, China
|
|
Wei-Ying Ma
|
Microsoft Research, Asia, Beijing, China
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 31, Downloads (12 Months): 96, Citation Count: 0
|
|
|
ABSTRACT
We study in this paper the problem of incremental crawling of web forums, which is a very fundamental yet challenging step in many web applications. Traditional approaches mainly focus on scheduling the revisiting strategy of each individual page. However, simply assigning different weights for different individual pages is usually inefficient in crawling forum sites because of the different characteristics between forum sites and general websites. Instead of treating each individual page independently, we propose a list-wise strategy by taking into account the site-level knowledge. Such site-level knowledge is mined through reconstructing the linking structure, called sitemap, for a given forum site. With the sitemap, posts from the same thread but distributed on various pages can be concatenated according to their timestamps. After that, for each thread, we employ a regression model to predict the time when the next post arrives. Based on this model, we develop an efficient crawler which is 260% faster than some state-of-the-art methods in terms of fetching new generated content; and meanwhile our crawler also ensure a high coverage ratio. Experimental results show promising performance of Coverage, Bandwidth utilization, and Timeliness of our crawler on 18 various forums.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
|
| |
3
|
|
 |
4
|
Rui Cai , Jiang-Ming Yang , Wei Lai , Yida Wang , Lei Zhang, iRobot: an intelligent crawler for web forums, Proceeding of the 17th international conference on World Wide Web, April 21-25, 2008, Beijing, China
[doi> 10.1145/1367497.1367558]
|
| |
5
|
|
 |
6
|
|
| |
7
|
E. Coffman, Z. Liu, and R. R. Weber. Optimal robot scheduling of web search engines. Journal of scheduling, 1, 1998.
|
 |
8
|
Gao Cong , Long Wang , Chin-Yew Lin , Young-In Song , Yueheng Sun, Finding question-answer pairs from online forums, Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, July 20-24, 2008, Singapore, Singapore
[doi> 10.1145/1390334.1390415]
|
| |
9
|
|
 |
10
|
|
 |
11
|
Filippo Menczer , Gautam Pant , Padmini Srinivasan , Miguel E. Ruiz, Evaluating topic-driven web crawlers, Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, p.241-249, September 2001, New Orleans, Louisiana, United States
[doi> 10.1145/383952.383995]
|
 |
12
|
|
 |
13
|
|
 |
14
|
Márcio L. A. Vidal , Altigran S. da Silva , Edleno S. de Moura , João M. B. Cavalcanti, Structure-driven crawler generation by example, Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, August 06-11, 2006, Seattle, Washington, USA
[doi> 10.1145/1148170.1148223]
|
 |
15
|
Yida Wang , Jiang-Ming Yang , Wei Lai , Rui Cai , Lei Zhang , Wei-Ying Ma, Exploring traversal strategy for web forum crawling, Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, July 20-24, 2008, Singapore, Singapore
[doi> 10.1145/1390334.1390413]
|
 |
16
|
J. L. Wolf , M. S. Squillante , P. S. Yu , J. Sethuraman , L. Ozsen, Optimal crawling strategies for web search engines, Proceedings of the 11th international conference on World Wide Web, May 07-11, 2002, Honolulu, Hawaii, USA
[doi> 10.1145/511446.511465]
|
| |
17
|
|
 |
18
|
Shuyi Zheng , Ruihua Song , Ji-Rong Wen , Di Wu, Joint optimization of wrapper generation and template detection, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
[doi> 10.1145/1281192.1281287]
|
|