Finding replicated Web collections
Source: ACM SIGMOD Record
Volume 29, Issue 2 (June 2000)
Pages: 355 - 366
Year of Publication: 2000
ISSN: 0163-5808
Bibliometrics: Downloads (6 Weeks): 7, Downloads (12 Months): 92, Citation Count: 29
ABSTRACT
Many web documents (such as JAVA FAQs) are being replicated on the Internet. Often entire document collections (such as hyperlinked Linux manuals) are being replicated many times. In this paper, we make the case for identifying replicated documents and collections to improve web crawlers, archivers, and ranking functions used in search engines. The paper describes how to efficiently identify replicated documents and hyperlinked document collections. The challenge is to identify these replicas from an input data set of several tens of millions of web pages and several hundreds of gigabytes of textual data. We also present two real-life case studies where we used replication information to improve a crawler and a search engine. We report these results for a data set of 25 million web pages (about 150 gigabytes of HTML data) crawled from the web.
CITED BY 29
Junghoo Cho, Hector Garcia-Molina, Taher Haveliwala, Wang Lam, Andreas Paepcke, Sriram Raghavan, Gary Wesley, Stanford WebBase components and applications, ACM Transactions on Internet Technology (TOIT), v.6 n.2, p.153-186, May 2006
André Luiz da Costa Carvalho, Edleno Silva de Moura, Altigran Soares da Silva, Klessius Berlt, Allan Bezerra, A cost-effective method for detecting web site replicas on search engine databases, Data & Knowledge Engineering, v.62 n.3, p.421-437, September 2007
Atsuyuki Morishima, Akiyoshi Nakamizo, Toshinari Iida, Shigeo Sugimoto, Hiroyuki Kitagawa, Bringing your dead links back to life: a comprehensive approach and lessons learned, Proceedings of the 20th ACM conference on Hypertext and hypermedia, June 29-July 01, 2009, Torino, Italy