|
ABSTRACT
Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar's fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and all batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
 |
2
|
|
| |
3
|
|
 |
4
|
|
| |
5
|
|
| |
6
|
|
| |
7
|
|
 |
8
|
Sergey Brin , James Davis , Héctor García-Molina, Copy detection mechanisms for digital documents, Proceedings of the 1995 ACM SIGMOD international conference on Management of data, p.398-409, May 22-25, 1995, San Jose, California, United States
|
| |
9
|
|
| |
10
|
|
| |
11
|
|
| |
12
|
|
 |
13
|
Andrei Z. Broder , Moses Charikar , Alan M. Frieze , Michael Mitzenmacher, Min-wise independent permutations (extended abstract), Proceedings of the thirtieth annual ACM symposium on Theory of computing, p.327-336, May 24-26, 1998, Dallas, Texas, United States
[doi> 10.1145/276698.276781]
|
| |
14
|
|
 |
15
|
|
| |
16
|
|
 |
17
|
|
 |
18
|
Zhiyuan Chen , Nick Koudas , Flip Korn , S. Muthukrishnan, Selectively estimation for Boolean queries, Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.216-225, May 15-18, 2000, Dallas, Texas, United States
[doi> 10.1145/335168.335225]
|
| |
19
|
|
 |
20
|
|
| |
21
|
|
 |
22
|
|
 |
23
|
|
| |
24
|
|
| |
25
|
|
| |
26
|
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. J American Society for Information Science, 41(6):391--407, 1990.
|
| |
27
|
|
| |
28
|
Danny Dolev , Yuval Harari , Nathan Linial , Noam Nisan , Michal Parnas, Neighborhood preserving hashing and approximate queries, Proceedings of the fifth annual ACM-SIAM symposium on Discrete algorithms, p.251-259, January 23-25, 1994, Arlington, Virginia, United States
|
 |
29
|
|
 |
30
|
Aristides Gionis , Dimitrios Gunopulos , Nick Koudas, Efficient and tumble similar set retrieval, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.247-258, May 21-24, 2001, Santa Barbara, California, United States
|
| |
31
|
D. Greene, MParnas, and FYao. Multi-index hashing for information retrieval. In Proc. 35th Annual Symposium on Foundations of Computer Science (FOCS 1994), pages 722--731, 1994.
|
| |
32
|
|
| |
33
|
T. H. Haveliwala, A. Gionis, and P. Indyk. Scalable techniques for clustering the web. In Proc. 3rd Intl. Workshop on the Web and Databases (WebDB 2000), pages 129--134, May 2000.
|
 |
34
|
Taher H. Haveliwala , Aristides Gionis , Dan Klein , Piotr Indyk, Evaluating strategies for similarity search on the web, Proceedings of the 11th international conference on World Wide Web, May 07-11, 2002, Honolulu, Hawaii, USA
[doi> 10.1145/511446.511502]
|
 |
35
|
|
| |
36
|
|
| |
37
|
D. A. Huffman. A method for the construction of minimum-redundancy codes. In Proc. Institute of Radio Engineering, volume 40, pages 1098--1102, Sept. 1952.
|
 |
38
|
|
 |
39
|
|
 |
40
|
|
| |
41
|
|
| |
42
|
|
 |
43
|
Filippo Menczer , Gautam Pant , Padmini Srinivasan , Miguel E. Ruiz, Evaluating topic-driven web crawlers, Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, p.241-249, September 2001, New Orleans, Louisiana, United States
[doi> 10.1145/383952.383995]
|
| |
44
|
M. Minsky and S. Papert. Perceptrons. MIT Press, 1969.
|
 |
45
|
|
 |
46
|
|
| |
47
|
W. Pugh and M. R. Henzinger. Detecting duplicate and near-duplicate files. United States Patent 6,658,423, granted on Dec 2, 2003, 2003.
|
| |
48
|
|
| |
49
|
M. O. Rabin. Fingerprinting by random polynomials. Technical Report Report TR-15-81, Center for Research in Computing Techonlogy, Harvard University, 1981.
|
 |
50
|
|
| |
51
|
N. Shivakumar and H. Garcia-Molina. SCAM: A copy detection mechanism for digital documents. In Proceedings of the 2nd International Conference on Theory and Practice of Digital Libraries, 1995.
|
| |
52
|
A. Yao and F. Yao. The complexity of searching in an ordered random table. In Proc. 17th Annual Symposium on Foundations of Computer Science (FOCS 1976), pages 173--177, 1976.
|
| |
53
|
|
CITED BY 11
|
|
Yida Wang , Jiang-Ming Yang , Wei Lai , Rui Cai , Lei Zhang , Wei-Ying Ma, Exploring traversal strategy for web forum crawling, Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, July 20-24, 2008, Singapore, Singapore
|
|
|
Paolo Ferragina , Roberto Grossi , Ankur Gupta , Rahul Shah , Jeffrey Scott Vitter, On searching compressed string collections cache-obliviously, Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 09-12, 2008, Vancouver, Canada
|
|
|
|
|
|
|
|
|
Rui Cai , Jiang-Ming Yang , Wei Lai , Yida Wang , Lei Zhang, iRobot: an intelligent crawler for web forums, Proceeding of the 17th international conference on World Wide Web, April 21-25, 2008, Beijing, China
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|