Finding replicated Web collections
Source: ACM SIGMOD Record
Volume 29, Issue 2 (June 2000)
Pages: 355-366
Year of Publication: 2000
ISSN: 0163-5808
Authors
Junghoo Cho  Department of Computer Science, Stanford, CA
Narayanan Shivakumar  Department of Computer Science, Stanford, CA
Hector Garcia-Molina  Department of Computer Science, Stanford, CA
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 7, Downloads (12 Months): 92, Citation Count: 29
DOI: http://doi.acm.org/10.1145/335191.335429

ABSTRACT

Many web documents (such as Java FAQs) are replicated on the Internet, and entire document collections (such as hyperlinked Linux manuals) are often replicated many times. In this paper, we make the case for identifying replicated documents and collections to improve web crawlers, archivers, and the ranking functions used in search engines. We describe how to efficiently identify replicated documents and hyperlinked document collections. The challenge is to identify these replicas in an input data set of several tens of millions of web pages and several hundred gigabytes of textual data. We also present two real-life case studies in which we used replication information to improve a crawler and a search engine. We report results for a data set of 25 million web pages (about 150 gigabytes of HTML data) crawled from the web.


