Collection statistics for fast duplicate document detection
Source: ACM Transactions on Information Systems (TOIS)
Volume 20, Issue 2 (April 2002)
Pages: 171-191
Year of Publication: 2002
ISSN: 1046-8188
Authors
Abdur Chowdhury  Illinois Institute of Technology, Chicago, IL
Ophir Frieder  Illinois Institute of Technology, Chicago, IL
David Grossman  Illinois Institute of Technology, Chicago, IL
Mary Catherine McCabe  Illinois Institute of Technology, Chicago, IL
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/506309.506311

ABSTRACT

We present a new algorithm for duplicate document detection that uses collection statistics. We compare our approach with the state-of-the-art approach on multiple collections: a 30 MB collection of 18,577 web documents developed by Excite@Home and three NIST collections. The first NIST collection consists of 18,232 LA-Times documents (100 MB), roughly the same number of documents as the Excite@Home collection. The other two collections are each 2 GB: a collection of 247,491 web documents and TREC disks 4 and 5, a collection of 528,023 documents. We show that our approach, called I-Match, scales in terms of the number of documents and works well for documents of all sizes. Compared with the state of the art, our approach improved detection accuracy and executed in roughly one-fifth the time.
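
The abstract does not spell out how collection statistics drive detection, so the following is only a rough sketch of one way the idea could work, not the authors' implementation. It assumes that terms are filtered by inverse document frequency (idf), that the surviving unique terms are sorted and hashed with SHA-1, and that documents sharing a hash are reported as duplicates. The function names, the whitespace tokenizer, and the `min_idf` threshold are all hypothetical choices made for illustration.

```python
import hashlib
import math
from collections import defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real parsing.
    return text.lower().split()


def idf_table(docs):
    # Collection statistic: idf(t) = log(N / df(t)) over the whole collection.
    n = len(docs)
    df = defaultdict(int)
    for text in docs.values():
        for term in set(tokenize(text)):
            df[term] += 1
    return {term: math.log(n / count) for term, count in df.items()}


def fingerprint(text, idf, min_idf):
    # Keep only unique terms whose idf meets the (hypothetical) threshold,
    # sort them so word order and repetition do not matter, then hash.
    kept = sorted({t for t in tokenize(text) if idf.get(t, 0.0) >= min_idf})
    if not kept:
        return None  # nothing distinctive left to hash
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


def duplicate_groups(docs, min_idf):
    # Documents whose filtered term sets hash to the same value are grouped.
    idf = idf_table(docs)
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        fp = fingerprint(text, idf, min_idf)
        if fp is not None:
            buckets[fp].append(doc_id)
    return [sorted(ids) for ids in buckets.values() if len(ids) > 1]


if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps",
        "b": "the quick the brown fox jumps",
        "c": "the lazy dog sleeps",
        "d": "the cat",
    }
    print(duplicate_groups(docs, min_idf=0.5))
```

In this toy collection, "the" occurs in every document and so has an idf of zero; once it is filtered out, documents "a" and "b" reduce to the same term set and fall into one bucket. Each document is hashed once and compared only through its bucket, which suggests why a scheme of this kind could scale with collection size.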


