ABSTRACT
We describe a system for rapidly determining document similarity among a set of documents returned by an information retrieval (IR) system. A fast phrase recognizer produces a ranked list of the most important terms in each document. We store these terms in a database and compute document similarity with a simple database query. If the number of terms not shared by two documents is below a predetermined threshold relative to the total number of terms in the document, the two documents are judged to be very similar. We compare this method with the shingles approach.
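The term-overlap test described above can be illustrated with a minimal sketch. It is not the authors' implementation: the table name `doc_terms`, the functions `build_term_table` and `very_similar`, the threshold value of 0.2, and the choice to measure unshared terms against the distinct terms of both documents combined are all assumptions made for illustration. The sketch uses SQLite from the Python standard library in place of whatever database the system actually used.

```python
import sqlite3


def build_term_table(doc_terms):
    """Load {doc_id: [term, ...]} into an in-memory table, one row per (doc, term)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE doc_terms (doc_id TEXT, term TEXT, PRIMARY KEY (doc_id, term))"
    )
    # Lower-case and de-duplicate so each term appears at most once per document.
    rows = {(doc_id, term.lower()) for doc_id, terms in doc_terms.items() for term in terms}
    conn.executemany("INSERT INTO doc_terms VALUES (?, ?)", sorted(rows))
    return conn


def very_similar(conn, doc_a, doc_b, threshold=0.2):
    """Return True if the fraction of unshared terms is below the threshold."""
    if doc_a == doc_b:
        return True  # a document is trivially similar to itself
    unshared, total = conn.execute(
        """
        SELECT COALESCE(SUM(n = 1), 0), COUNT(*)
        FROM (SELECT term, COUNT(*) AS n
              FROM doc_terms
              WHERE doc_id IN (?, ?)
              GROUP BY term)
        """,
        (doc_a, doc_b),
    ).fetchone()
    # Assumption: "total" is the number of distinct terms across both documents.
    return total > 0 and unshared / total < threshold


# Hypothetical term lists standing in for phrase-recognizer output.
conn = build_term_table({
    "a": ["document similarity", "phrase recognizer", "database query", "shingles", "ranked terms"],
    "b": ["document similarity", "phrase recognizer", "database query", "shingles", "ranked terms", "threshold"],
    "c": ["plagiarism", "fingerprinting"],
})
print(very_similar(conn, "a", "b"))
print(very_similar(conn, "a", "c"))
```

Because each (document, term) pair is stored once, a term with a group count of 1 appears in only one of the two documents, so a single grouped query yields both the unshared count and the total.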