Topical web crawlers: Evaluating adaptive algorithms
Source: ACM Transactions on Internet Technology (TOIT)
Volume 4, Issue 4 (November 2004)
Pages: 378-419
Year of Publication: 2004
ISSN:1533-5399
Authors
Filippo Menczer  Indiana University, Bloomington, IN
Gautam Pant  University of Utah, Salt Lake City, UT
Padmini Srinivasan  University of Iowa, Iowa City, IA
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 37,   Downloads (12 Months): 311,   Citation Count: 18
DOI: http://doi.acm.org/10.1145/1031114.1031117

ABSTRACT

Topical crawlers are increasingly seen as a way to address the scalability limitations of universal search engines by distributing the crawling process across users, queries, or even client computers. The context available to such crawlers can guide the navigation of links with the goal of efficiently locating highly relevant target pages. We developed a framework to fairly evaluate topical crawling algorithms under a number of performance metrics. This framework is employed here to evaluate different algorithms that have proven highly competitive among those proposed in the literature and in our own previous research. In particular, we focus on the tradeoff between exploration and exploitation of the cues available to a crawler, and on adaptive crawlers that use machine learning techniques to guide their search. We find that the best performance is achieved by a novel combination of explorative and exploitative bias, and we introduce an evolutionary crawler that surpasses the performance of the best nonadaptive crawler after sufficiently long crawls. We also analyze the computational complexity of the various crawlers and discuss how performance and complexity scale with available resources. Evolutionary crawlers achieve high efficiency and scalability by distributing the work across concurrent agents, resulting in the best performance/cost ratio.
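
The abstract does not spell out how a crawler trades exploration against exploitation. As a rough illustration only, and not the authors' implementation, the sketch below shows one way such a bias could be exposed: a best-first frontier from which the N highest-scored links are visited before the frontier is re-ranked. The function name best_n_first, the toy link graph, and the fixed relevance scores are all hypothetical stand-ins for real page fetching and link scoring.

```python
import heapq

# Hypothetical toy data: an in-memory link graph and fixed relevance
# estimates, standing in for real page fetches and a link-scoring function.
TOY_GRAPH = {"seed": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
TOY_SCORES = {"seed": 1.0, "a": 0.9, "b": 0.5, "c": 0.6, "d": 0.8}


def best_n_first(graph, scores, seeds, n, budget):
    """Visit up to `budget` pages, taking the n best frontier links per step.

    n = 1 is greedy best-first (more exploitative); a larger n commits to
    lower-ranked links before re-ranking (more explorative).
    """
    # heapq is a min-heap, so scores are negated to pop the best link first.
    frontier = [(-scores.get(url, 0.0), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        # Pop the n best links before any newly found links are considered.
        batch = [heapq.heappop(frontier) for _ in range(min(n, len(frontier)))]
        for _, url in batch:
            if len(visited) >= budget:
                break
            visited.append(url)  # "fetch" the page
            for out in graph.get(url, []):
                if out not in seen:
                    seen.add(out)
                    heapq.heappush(frontier, (-scores.get(out, 0.0), out))
    return visited


if __name__ == "__main__":
    print(best_n_first(TOY_GRAPH, TOY_SCORES, ["seed"], n=1, budget=4))
    print(best_n_first(TOY_GRAPH, TOY_SCORES, ["seed"], n=2, budget=4))
```

On this toy graph, n=1 follows the top-ranked branch and never reaches page d within the budget, while n=2 also visits the lower-ranked page b early and finds d behind it. Which setting performs better on real crawls is an empirical question of the kind the evaluation framework described above is meant to answer.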


