Improved robustness of signature-based near-replica detection via lexicon randomization
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Seattle, WA, USA
POSTER SESSION: Research track posters
Pages: 605 - 610
Year of Publication: 2004
ISBN: 1-58113-888-1
ABSTRACT
Detection of near-duplicate documents is an important problem in many data mining and information filtering applications. When faced with massive quantities of data, traditional duplicate detection techniques relying on direct inter-document similarity computation (e.g., using the cosine measure) are often not feasible given the time and memory constraints. On the other hand, fingerprint-based methods, such as I-Match, are very attractive computationally but may be brittle with respect to small changes to document content. We focus on approaches to near-replica detection that are based upon large-collection statistics and present a general technique of increasing their robustness via multiple lexicon randomization. In experiments with large web-page and spam-email datasets the proposed method is shown to consistently outperform traditional I-Match, with the relative improvement in duplicate-document recall reaching as high as 40-60%. The large gains in detection accuracy are offset by only small increases in computational requirements.
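The abstract does not spell out the mechanics, so the following is only a minimal sketch of how a lexicon-filtered signature and its randomized extension might look. It assumes that a signature is a hash of the sorted set of document terms that also appear in a lexicon, that the extra lexicons are random subsets of the base lexicon, and that two documents count as near-replicas when they agree on at least one signature. The function names, the whitespace tokenizer, the SHA-1 hash, and the `extra` and `drop_fraction` parameters are all illustrative assumptions, not details taken from the paper.

```python
import hashlib
import random


def make_lexicons(base_lexicon, extra=3, drop_fraction=0.3, seed=0):
    """Return the base lexicon plus `extra` randomly thinned copies of it.

    Assumption: each extra lexicon drops a fixed fraction of the base terms.
    """
    rng = random.Random(seed)
    terms = sorted(base_lexicon)
    keep = max(1, round(len(terms) * (1 - drop_fraction)))
    lexicons = [frozenset(terms)]
    for _ in range(extra):
        lexicons.append(frozenset(rng.sample(terms, keep)))
    return lexicons


def signature(text, lexicon):
    """Hash the sorted set of document terms that also occur in the lexicon."""
    # Simplification: lowercase whitespace tokens, no punctuation handling.
    kept = sorted(set(text.lower().split()) & lexicon)
    if not kept:
        return None  # no lexicon terms survive, so there is nothing to hash
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


def signatures(text, lexicons):
    """One signature per lexicon, in lexicon order."""
    return [signature(text, lexicon) for lexicon in lexicons]


def near_duplicates(doc_a, doc_b, lexicons):
    """True if the two documents share a signature under the same lexicon."""
    pairs = zip(signatures(doc_a, lexicons), signatures(doc_b, lexicons))
    return any(sa is not None and sa == sb for sa, sb in pairs)
```

Under these assumptions, a single inserted or altered lexicon term changes the base signature, but the pair can still match under any thinned lexicon that happens to omit that term. The cost is one additional hash per extra lexicon for each document.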