|
ABSTRACT
Topical crawlers are increasingly seen as a way to address the scalability limitations of universal search engines, by distributing the crawling process across users, queries, or even client computers. The context available to such crawlers can guide the navigation of links with the goal of efficiently locating highly relevant target pages. We developed a framework to fairly evaluate topical crawling algorithms under a number of performance metrics. Such a framework is employed here to evaluate different algorithms that have proven highly competitive among those proposed in the literature and in our own previous research. In particular we focus on the tradeoff between exploration and exploitation of the cues available to a crawler, and on adaptive crawlers that use machine learning techniques to guide their search. We find that the best performance is achieved by a novel combination of explorative and exploitative bias, and introduce an evolutionary crawler that surpasses the performance of the best nonadaptive crawler after sufficiently long crawls. We also analyze the computational complexity of the various crawlers and discuss how performance and complexity scale with available resources. Evolutionary crawlers achieve high efficiency and scalability by distributing the work across concurrent agents, resulting in the best performance/cost ratio.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
Israel Ben-Shaul , Michael Herscovici , Michal Jacovi , Yoelle S. Maarek , Dan Pelleg , Menachem Shtalhaim , Vladimir Soroka , Sigalit Ur, Adding support for dynamic and focused search with Fetuccino, Computer Networks: The International Journal of Computer and Telecommunications Networking, v.31 n.11-16, p.1653-1665, May 17, 1999
|
| |
3
|
|
| |
4
|
|
| |
5
|
|
| |
6
|
|
| |
7
|
|
| |
8
|
Cyveillance. 2000. Sizing the internet. White paper. http://www.cyveillance.com/.
|
| |
9
|
|
| |
10
|
|
| |
11
|
|
 |
12
|
David Gibson , Jon Kleinberg , Prabhakar Raghavan, Inferring Web communities from link topology, Proceedings of the ninth ACM conference on Hypertext and hypermedia : links, objects, time and space---structure in hypermedia systems: links, objects, time and space---structure in hypermedia systems, p.225-234, June 20-24, 1998, Pittsburgh, Pennsylvania, United States
[doi> 10.1145/276627.276652]
|
| |
13
|
Haveliwala, T. 1999. Efficient computation of pagerank. Tech. rep., Stanford Database Group.
|
| |
14
|
|
| |
15
|
Michael Hersovici , Michal Jacovi , Yoelle S. Maarek , Dan Pelleg , Menanchem Shtalhaim , Sigalit Ur, The shark-search algorithm. An application: tailored Web site mapping, Proceedings of the seventh international conference on World Wide Web 7, p.317-326, April 1998, Brisbane, Australia
|
 |
16
|
|
| |
17
|
Kleinberg, J. and Lawrence, S. 2001. The structure of the Web. Science 294, 5548, 1849--1850.
|
| |
18
|
R. Kumar , P. Raghavan , S. Rajagopalan , D. Sivakumar , A. Tomkins , E. Upfal, Stochastic models for the Web graph, Proceedings of the 41st Annual Symposium on Foundations of Computer Science, p.57, November 12-14, 2000
|
| |
19
|
|
| |
20
|
Lawrence, S. and Giles, C. 1998. Searching the World Wide Web. Science 280, 98--100.
|
| |
21
|
Lawrence, S. and Giles, C. 1999. Accessibility of information on the Web. Nature 400, 107--109.
|
| |
22
|
|
| |
23
|
|
| |
24
|
|
| |
25
|
|
 |
26
|
|
| |
27
|
|
| |
28
|
Menczer, F. and Monge, A. 1999. Scalable Web search by adaptive online agents: An InfoSpiders case study. In Intelligent Information Agents: Agent-Based Information Discovery and Management on the Internet, M. Klusch, Ed. Springer, Berlin, 323--347.
|
 |
29
|
Filippo Menczer , Gautam Pant , Padmini Srinivasan , Miguel E. Ruiz, Evaluating topic-driven web crawlers, Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, p.241-249, September 2001, New Orleans, Louisiana, United States
[doi> 10.1145/383952.383995]
|
 |
30
|
|
 |
31
|
|
| |
32
|
|
| |
33
|
|
| |
34
|
Pant, G., Bradshaw, S., and Menczer, F. 2003. Search engine - crawler symbiosis. In Proceedings of the 7th European Conference on Research and Advanced Technology for Digital Libraries (ECDL), T. Koch and I. Solvberg, Eds. Lecture Notes in Computer Science, Vol. 2769. Springer Verlag, Berlin.
|
| |
35
|
|
| |
36
|
Pant, G. and Menczer, F. 2003. Topical crawling for business intelligence. In Proceedings of the 7th European Conference on Research and Advanced Technology for Digital Libraries (ECDL), T. Koch and I. Solvberg, Eds. Lecture Notes in Computer Science, Vol. 2769. Berlin.
|
| |
37
|
Pant, G., Srinivasan, P., and Menczer, F. 2002. Exploration versus exploitation in topic driven crawlers. In Proceedings of the WWW-02 Workshop on Web Dynamics.
|
| |
38
|
Pinkerton, B. 1994. Finding what people want: Experiences with the WebCrawler. In Proceedings of the 2nd International World Wide Web Conference (Chicago).
|
| |
39
|
Porter, M. 1980. An algorithm for suffix stripping. Program 14, 3, 130--137.
|
| |
40
|
|
| |
41
|
|
| |
42
|
|
| |
43
|
|
CITED BY 18
|
|
|
Guilherme T. de Assis , Alberto H. F. Laender , Altigran S. da Silva , Marcos André Gonçalves, The impact of term selection in genre-aware focused crawling, Proceedings of the 2008 ACM symposium on Applied computing, March 16-20, 2008, Fortaleza, Ceara, Brazil
|
|
|
Zhumin Chen , Jun Ma , Jingsheng Lei , Bo Yuan , Li Lian , Ling Song, A cross-language focused crawling algorithm based on multiple relevance prediction strategies, Computers & Mathematics with Applications, v.57 n.6, p.1057-1072, March, 2009
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Panagiotis G. Ipeirotis , Eugene Agichtein , Pranay Jain , Luis Gravano, To search or to crawl?: towards a query optimizer for text-centric tasks, Proceedings of the 2006 ACM SIGMOD international conference on Management of data, June 27-29, 2006, Chicago, IL, USA
|
|
|
|
|
|
|
|
INDEX TERMS
Primary Classification:
H.
Information Systems
H.3
INFORMATION STORAGE AND RETRIEVAL
H.3.3
Information Search and Retrieval
Subjects:
Search process
Additional Classification:
H.
Information Systems
H.3
INFORMATION STORAGE AND RETRIEVAL
H.3.4
Systems and Software
Subjects:
Information networks;
Performance evaluation (efficiency and effectiveness);
Distributed systems
I.
Computing Methodologies
I.2
ARTIFICIAL INTELLIGENCE
I.2.11
Distributed Artificial Intelligence
Subjects:
Multiagent systems;
Intelligent agents
I.2.8
Problem Solving, Control Methods, and Search
Subjects:
Graph and tree search strategies;
Heuristic methods
General Terms:
Algorithms,
Design,
Measurement,
Performance
Keywords:
Efficiency,
evaluation,
evolution,
exploitation,
exploration,
reinforcement learning,
topical crawlers
|