Online duplicate document detection: signature reliability in a dynamic retrieval environment
Source: Conference on Information and Knowledge Management
Proceedings of the twelfth international conference on Information and knowledge management
New Orleans, LA, USA
SESSION: Information retrieval session 8: efficiency
Pages: 443 - 452
Year of Publication: 2003
ISBN: 1-58113-723-0
Authors
Jack G. Conrad  Thomson Legal & Regulatory
Xi S. Guo  Thomson Legal & Regulatory
Cindy P. Schriber  Thomson Legal & Regulatory
Sponsors
ACM: Association for Computing Machinery
SIGMIS: ACM Special Interest Group on Management Information Systems
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 8,   Downloads (12 Months): 85,   Citation Count: 13
DOI: http://doi.acm.org/10.1145/956863.956946

ABSTRACT

As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. Few users wish to retrieve search results consisting of sets of duplicate documents, whether identical duplicates or close matches. Our goal in this work is to investigate the phenomenon and determine one or more approaches that minimize its impact on search results. Recent work has focused on using some form of signature to characterize a document in order to reduce the complexity of document comparisons. A representative technique constructs a 'fingerprint' of the rarest or richest features in a document using collection statistics as criteria for feature selection. One of the challenges of this approach, however, arises from the fact that in production environments, collections of documents are always changing, with new documents, or new versions of documents, arriving frequently, and other documents periodically removed. When an enterprise proceeds to freeze a training collection in order to stabilize the underlying repository of such features and its associated collection statistics, issues of coverage and completeness arise. We show that even with very large training collections possessing extremely high feature correlations before and after updates, underlying fingerprints remain sensitive to subtle changes. We explore alternative solutions that benefit from the development of massive meta-collections made up of sizable components from multiple domains. This technique appears to offer a practical foundation for fingerprint stability. We also consider mechanisms for updating training collections while mitigating signature instability.

Our research is divided into three parts. We begin with a study of the distribution of duplicate types in two broad-ranging news collections consisting of approximately 50 million documents. We then examine the utility of document signatures in addressing identical or nearly identical duplicate documents and their sensitivity to collection updates. Finally, we investigate a flexible method of characterizing and comparing documents in order to permit the identification of non-identical duplicates. This method has produced promising results following an extensive evaluation using a production-based test collection created by domain experts.
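
The fingerprinting idea mentioned in the abstract (selecting a document's rarest features according to collection statistics, then reducing them to a signature) can be illustrated with a minimal sketch. This is not the authors' implementation: whitespace tokenization, the cutoff `k`, alphabetical tie-breaking, skipping terms unseen in the training collection, and the use of SHA-1 are all assumptions made for illustration, as are the function names and the toy collections.

```python
import hashlib
from collections import Counter


def document_frequencies(collection):
    """Count, for each term, how many documents in the collection contain it."""
    df = Counter()
    for text in collection:
        df.update(set(text.lower().split()))
    return df


def fingerprint(text, df, k=3):
    """Return (digest, terms): a SHA-1 digest over the k rarest terms of `text`.

    Rarity is judged by document frequency in the training collection `df`.
    Terms unseen in the collection are skipped (an assumed policy).
    """
    terms = {t for t in text.lower().split() if t in df}
    # Sort by (document frequency, term) so that ties break deterministically.
    rarest = sorted(terms, key=lambda t: (df[t], t))[:k]
    digest = hashlib.sha1(" ".join(rarest).encode("utf-8")).hexdigest()
    return digest, rarest


# Hypothetical "frozen" training collection and a later, updated version of it.
frozen = [
    "court rules on merger appeal",
    "merger appeal delayed by court",
    "court hears antitrust merger case",
]
updated = frozen + ["judge rules on bail", "senate rules on budget"]

doc = "court rules on antitrust merger appeal"

before = fingerprint(doc, document_frequencies(frozen))
after = fingerprint(doc, document_frequencies(updated))

print(before[1])  # rarest terms under the frozen statistics
print(after[1])   # rarest terms once "rules" and "on" have become more common
print(before[0] == after[0])
```

In this toy case the two added documents make "rules" and "on" less rare, so a different set of terms is selected and the same document receives a different signature. That is the kind of sensitivity to collection updates the abstract describes, although on a scale far smaller than the collections studied in the paper.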


