ABSTRACT
Most web pages are linked to others with related content. This idea, combined with the idea that text in, and possibly around, HTML anchors describes the pages to which they point, is the foundation for a usable World-Wide Web. In this paper, we examine to what extent these ideas hold by empirically testing whether topical locality mirrors spatial locality of pages on the Web. In particular, we find that linked pages are likely to have similar textual content; that the similarity of sibling pages increases when the links from the parent are close together; that titles, descriptions, and anchor text represent at least part of the target page; and that anchor text may be a useful discriminator among unseen child pages. These results show the foundations necessary for the success of many web systems, including search engines, focused crawlers, linkage analyzers, and intelligent web agents.