ABSTRACT
We seek to gain improved insight into how Web search engines should cope with the evolving Web, in an attempt to provide users with the most up-to-date results possible. For this purpose we collected weekly snapshots of some 150 Web sites over the course of one year, and measured the evolution of content and link structure. Our measurements focus on aspects of potential interest to search engine designers: the evolution of link structure over time, the rate of creation of new pages and new distinct content on the Web, and the rate of change of the content of existing pages under search-centric measures of degree of change.

Our findings indicate a rapid turnover rate of Web pages, i.e., high rates of birth and death, coupled with an even higher rate of turnover in the hyperlinks that connect them. For pages that persist over time we found that, perhaps surprisingly, the degree of content shift as measured using TF.IDF cosine distance does not appear to be consistently correlated with the frequency of content updating. Despite this apparent non-correlation, the rate of content shift of a given page is likely to remain consistent over time. That is, pages that change a great deal in one week will likely change by a similarly large degree in the following week. Conversely, pages that experience little change will continue to experience little change. We conclude the paper with a discussion of the potential implications of our results for the design of effective Web search engines.
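As an illustration of the kind of measure the abstract refers to, the following is a minimal sketch of a TF.IDF cosine distance between two snapshots of the same page. The abstract does not specify the tokenization, the IDF formula, or the corpus used for document frequencies, so the whitespace tokenizer, the smoothed IDF, and all function names here are assumptions made for illustration and should not be read as the authors' implementation.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the abstract names no tokenizer.
    return text.lower().split()


def idf_weights(corpus):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))
    # Assumption: log((1 + N) / (1 + df)) + 1 smoothing, chosen so no weight is zero.
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse TF.IDF vector (term -> weight) for one page snapshot."""
    counts = Counter(tokenize(text))
    # Terms missing from the IDF table fall back to weight 1.0 (assumption).
    return {t: tf * idf.get(t, 1.0) for t, tf in counts.items()}


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors; 0 means no content shift."""
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 and norm_v == 0.0:
        return 0.0  # two empty snapshots are treated as identical
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # one empty snapshot shares nothing with a non-empty one
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical snapshots of one page taken a week apart.
week_1 = "search engines crawl the web to refresh their index"
week_2 = "search engines crawl the evolving web to refresh a stale index"
idf = idf_weights([week_1, week_2])
shift = cosine_distance(tfidf_vector(week_1, idf), tfidf_vector(week_2, idf))
```

Under this sketch, a page whose text is unchanged between snapshots has a distance of 0, and a page whose vocabulary is entirely replaced has a distance of 1. Because the measure depends only on term weights, it says how much the content shifted but nothing about how many times the page was edited in between.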