ABSTRACT
We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance. The most common design and implementation techniques for each of these components are presented. For this presentation we draw from the literature and from our own experimental search engine testbed. The emphasis is on introducing the fundamental concepts and on presenting the results of several performance analyses we conducted to compare different designs.
REVIEW
"Jun Lin : Reviewer"
Are you curious about the way Web search engines provide users with a list of URLs after just a few keywords are entered? This article gives an overview on the core engine that makes this possible. The authors start by discussing the challen