ABSTRACT
Digital content is for copying: quotation, revision, plagiarism, and file sharing all create copies. Document fingerprinting is concerned with accurately identifying copying, including small partial copies, within large sets of documents. We introduce the class of local document fingerprinting algorithms, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies. We prove a novel lower bound on the performance of any local algorithm. We also develop winnowing, an efficient local fingerprinting algorithm, and show that winnowing's performance is within 33% of the lower bound. Finally, we give experimental results on Web data and report experience with MOSS, a widely used plagiarism detection service.
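The abstract does not spell out the winnowing procedure, so the following is only a minimal sketch under the usual formulation: hash every k-gram of the text, slide a window over w consecutive hashes, and keep the minimum hash of each window (the rightmost one on ties) together with its position. The function names kgram_hashes and winnow, the parameters k and w, and the use of CRC32 as the hash are illustrative assumptions, not details taken from this page.

```python
import zlib


def kgram_hashes(text, k):
    """Hash every k-gram (length-k substring) of text.

    CRC32 is an assumed stand-in; any reasonable hash function would do.
    """
    data = text.encode("utf-8")
    return [zlib.crc32(data[i:i + k]) for i in range(len(data) - k + 1)]


def winnow(hashes, w):
    """Return (hash, position) fingerprints chosen by window minima.

    For each window of w consecutive hashes, select the minimum value,
    breaking ties in favour of the rightmost occurrence. A position is
    recorded only once, even if it is the minimum of several windows.
    """
    fingerprints = []
    last_pos = -1
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Offset of the rightmost occurrence of the minimum in this window.
        offset = w - 1 - window[::-1].index(smallest)
        pos = start + offset
        if pos != last_pos:
            fingerprints.append((smallest, pos))
            last_pos = pos
    return fingerprints


# Illustrative hash sequence, window size 4.
sample = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
selected = winnow(sample, 4)
# selected == [(17, 3), (17, 6), (8, 8), (39, 11), (17, 15)]
```

Because every window contributes a selected hash, any two documents that share a run of at least w consecutive k-gram hashes share at least one fingerprint. The choice of each fingerprint depends only on the contents of its window, which is the sense in which such an algorithm is local.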