ABSTRACT
We propose a method for locating duplicates and plagiarism in a text collection using the R-measure, defined as the normalized sum of the lengths of all suffixes of a text that are repeated in other documents of the collection. The R-measure can be computed efficiently using the suffix array data structure, and the computation procedure can be extended to locate the sets of duplicate or plagiarised documents. We applied the technique to several standard text collections and found that they contained a significant number of duplicate and plagiarised documents. A reformulation of the method leads to an algorithm for supervised multi-class categorization. We illustrate the approach using the recently available Reuters Corpus Volume 1 (RCV1). The results show that the method outperforms SVM at multi-class categorization and, interestingly, that its results correlate strongly with those of compression-based methods.
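To make the definition concrete, the following is a minimal Python sketch of one possible reading of the R-measure. It rests on several assumptions that the abstract does not state: for each suffix of the text we count the length of its longest prefix that occurs in some other document, and we normalize by l(l+1)/2 for a text of length l so that the value lies in [0, 1]. The function names (longest_repeated_prefix, r_measure, classify) are hypothetical, and the brute-force substring search stands in for the suffix array that an efficient implementation would use. The classify helper illustrates only one plausible way the measure could drive categorization, by picking the class whose training documents give the highest score.

```python
def longest_repeated_prefix(suffix, others):
    """Length of the longest prefix of `suffix` found as a substring of any document in `others`."""
    best = 0
    for doc in others:
        # If a prefix of length k+1 occurs in doc, every shorter prefix does too,
        # so we only need to try to extend beyond the best length found so far.
        k = best
        while k < len(suffix) and suffix[:k + 1] in doc:
            k += 1
        best = max(best, k)
    return best


def r_measure(text, others):
    """Assumed form: 2 / (l * (l + 1)) times the sum of repeated lengths over all suffixes."""
    n = len(text)
    if n == 0:
        return 0.0
    total = sum(longest_repeated_prefix(text[i:], others) for i in range(n))
    return 2.0 * total / (n * (n + 1))


def classify(text, training):
    """Hypothetical categorizer: `training` maps a class label to its list of documents."""
    return max(training, key=lambda label: r_measure(text, training[label]))


if __name__ == "__main__":
    print(r_measure("abc", ["abc"]))
    print(r_measure("abc", ["xyz"]))
    print(classify("stock prices fell", {
        "finance": ["stock prices rose sharply"],
        "sport": ["the team won the match"],
    }))
```

Under these assumptions a text that is copied verbatim from another document scores 1, and a text that shares no characters with the rest of the collection scores 0. The naive search is quadratic or worse per document, which is why a suffix array over the collection matters in practice.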