ABSTRACT
We present a new algorithm for duplicate document detection that
uses collection statistics. We compare our approach with the
state-of-the-art approach on four collections: a 30 MB collection of
18,577 web documents developed by Excite@Home, and three NIST
collections. The first NIST collection consists of 100 MB of LA Times
documents (18,232 documents), roughly the same number of documents
as the Excite@Home collection. The other two collections are each
2 GB: a 247,491-document web collection and the 528,023-document
collection from TREC disks 4 and 5. We show that our approach,
called I-Match, scales in terms of the number of documents and
works well for documents of all sizes. Compared with the state of
the art, our approach improved detection accuracy and executed in
roughly one-fifth the time.
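The abstract does not spell out how collection statistics enter the detection step. The sketch below shows one plausible reading and should not be taken as the authors' exact method. It assumes that terms are filtered by inverse document frequency (idf), that the surviving unique terms are sorted and hashed with SHA-1, and that documents sharing a hash are reported as duplicates. The function names, the tokenizer, and the `min_idf` threshold are all hypothetical choices made for illustration.

```python
import hashlib
import math
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_table(docs):
    """Collection statistic: idf(t) = log(N / df(t)) over a dict of id -> text."""
    n = len(docs)
    df = defaultdict(int)
    for text in docs.values():
        for term in set(tokenize(text)):
            df[term] += 1
    return {term: math.log(n / count) for term, count in df.items()}


def fingerprint(text, idf, min_idf):
    """Hash the sorted set of terms whose idf is at least min_idf.

    Returns None when no term survives the filter, so that empty or
    all-common-word documents are not grouped together by accident.
    """
    kept = sorted({t for t in tokenize(text) if idf.get(t, 0.0) >= min_idf})
    if not kept:
        return None
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


def find_duplicates(docs, min_idf):
    """Group document ids that share a fingerprint; one pass, one hash each."""
    idf = idf_table(docs)
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        fp = fingerprint(text, idf, min_idf)
        if fp is not None:
            groups[fp].append(doc_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


docs = {
    "a": "the quick brown fox jumps",
    "b": "the quick and brown fox jumps",
    "c": "the lazy dog and cat",
    "d": "the bird and fish",
}
print(find_duplicates(docs, min_idf=0.5))
```

In this toy collection, "the" and "and" occur in most documents, so their idf falls below the threshold and they are ignored. Documents "a" and "b" then reduce to the same term set and receive the same hash. Because each document yields a single fingerprint, grouping needs one hash-table lookup per document rather than pairwise comparisons.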