Detecting similar documents using salient terms
Source: Conference on Information and Knowledge Management
Proceedings of the Eleventh International Conference on Information and Knowledge Management
McLean, Virginia, USA
SESSION: Information retrieval models
Pages: 245-251
Year of Publication: 2002
ISBN: 1-58113-492-4
Authors
James W. Cooper, IBM T. J. Watson Research Center, Yorktown Heights, NY
Anni R. Coden, IBM T. J. Watson Research Center, Yorktown Heights, NY
Eric W. Brown, IBM T. J. Watson Research Center, Yorktown Heights, NY
Sponsors
SIGMIS: ACM Special Interest Group on Management Information Systems
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/584792.584835

ABSTRACT

We describe a system for rapidly determining document similarity among a set of documents obtained from an information retrieval (IR) system. We obtain a ranked list of the most important terms in each document using a rapid phrase recognizer system. We store these terms in a database and compute document similarity using a simple database query. If the number of terms that are not contained in both documents is below a predetermined threshold relative to the total number of terms in the document, the two documents are judged to be very similar. We compare this method with the shingles approach.
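
The comparison described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the table name `doc_terms`, the function names, the use of an in-memory SQLite database, and the threshold value of 0.2 are all assumptions made for the example. The abstract also leaves the exact normalization open, so the sketch assumes that the count of unshared terms is divided by the number of distinct terms across both documents. Salient-term extraction is not shown; the terms for each document are taken as given.

```python
import sqlite3


def build_term_db(doc_terms):
    """Store each document's salient terms as (doc_id, term) rows.

    doc_terms is a hypothetical input: a dict mapping a document id
    to an iterable of terms already extracted by some recognizer.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE doc_terms ("
        "doc_id TEXT, term TEXT, PRIMARY KEY (doc_id, term))"
    )
    rows = [
        (doc_id, term)
        for doc_id, terms in doc_terms.items()
        for term in set(terms)  # drop duplicate terms within a document
    ]
    conn.executemany("INSERT INTO doc_terms VALUES (?, ?)", rows)
    return conn


def unshared_fraction(conn, doc_a, doc_b):
    """Fraction of terms that do not appear in both documents.

    Assumption: the denominator is the number of distinct terms
    found in either document.
    """
    shared = conn.execute(
        "SELECT COUNT(*) FROM doc_terms a "
        "JOIN doc_terms b ON a.term = b.term "
        "WHERE a.doc_id = ? AND b.doc_id = ?",
        (doc_a, doc_b),
    ).fetchone()[0]
    total = conn.execute(
        "SELECT COUNT(DISTINCT term) FROM doc_terms "
        "WHERE doc_id IN (?, ?)",
        (doc_a, doc_b),
    ).fetchone()[0]
    if total == 0:
        return 0.0
    return (total - shared) / total


def very_similar(conn, doc_a, doc_b, threshold=0.2):
    """True when the unshared fraction is below the (assumed) threshold."""
    return unshared_fraction(conn, doc_a, doc_b) < threshold


if __name__ == "__main__":
    base = [f"t{i}" for i in range(10)]
    conn = build_term_db({
        "d1": base,
        "d2": base[:9] + ["u"],   # differs from d1 in one term each way
        "d3": ["x", "y"],         # shares nothing with d1
    })
    print(very_similar(conn, "d1", "d2"))
    print(very_similar(conn, "d1", "d3"))
```

Keeping the terms in a table means each pairwise comparison reduces to two counting queries, which is in the spirit of the "simple database query" the abstract mentions; a real system would need to choose its own threshold and term-ranking cutoff.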


