ACM Home Page
Please provide us with feedback. Feedback
Detecting near-duplicates for web crawling
Full text PdfPdf (170 KB)
Source
International World Wide Web Conference archive
Proceedings of the 16th international conference on World Wide Web table of contents
Banff, Alberta, Canada
SESSION: Similarity search table of contents
Pages: 141 - 150  
Year of Publication: 2007
ISBN:978-1-59593-654-7
Authors
Gurmeet Singh Manku  Google Inc., Mountain View, CA
Arvind Jain  Google Inc., Mountain View, CA
Anish Das Sarma  Stanford University, Stanford, CA
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 45,   Downloads (12 Months): 337,   Citation Count: 11
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1242572.1242592
What is a DOI?

ABSTRACT

Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar's fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and all batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
2
 
3
4
 
5
 
6
 
7
8
 
9
 
10
 
11
 
12
13
 
14
15
 
16
17
18
 
19
20
 
21
22
23
 
24
 
25
 
26
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. J American Society for Information Science, 41(6):391--407, 1990.
 
27
 
28
29
30
 
31
D. Greene, MParnas, and FYao. Multi-index hashing for information retrieval. In Proc. 35th Annual Symposium on Foundations of Computer Science (FOCS 1994), pages 722--731, 1994.
 
32
 
33
T. H. Haveliwala, A. Gionis, and P. Indyk. Scalable techniques for clustering the web. In Proc. 3rd Intl. Workshop on the Web and Databases (WebDB 2000), pages 129--134, May 2000.
34
35
 
36
 
37
D. A. Huffman. A method for the construction of minimum-redundancy codes. In Proc. Institute of Radio Engineering, volume 40, pages 1098--1102, Sept. 1952.
38
39
40
 
41
 
42
43
 
44
M. Minsky and S. Papert. Perceptrons. MIT Press, 1969.
45
46
 
47
W. Pugh and M. R. Henzinger. Detecting duplicate and near-duplicate files. United States Patent 6,658,423, granted on Dec 2, 2003, 2003.
 
48
 
49
M. O. Rabin. Fingerprinting by random polynomials. Technical Report Report TR-15-81, Center for Research in Computing Techonlogy, Harvard University, 1981.
50
 
51
N. Shivakumar and H. Garcia-Molina. SCAM: A copy detection mechanism for digital documents. In Proceedings of the 2nd International Conference on Theory and Practice of Digital Libraries, 1995.
 
52
A. Yao and F. Yao. The complexity of searching in an ordered random table. In Proc. 17th Annual Symposium on Foundations of Computer Science (FOCS 1976), pages 173--177, 1976.
 
53

CITED BY  11

Collaborative Colleagues:
Gurmeet Singh Manku: colleagues
Arvind Jain: colleagues
Anish Das Sarma: colleagues