ACM Home Page
Please provide us with feedback. Feedback
Winnowing: local algorithms for document fingerprinting
Full text PdfPdf (180 KB)
Source International Conference on Management of Data archive
Proceedings of the 2003 ACM SIGMOD international conference on Management of data table of contents
San Diego, California
SESSION: Data security and protection table of contents
Pages: 76 - 85  
Year of Publication: 2003
ISBN:1-58113-634-X
Authors
Saul Schleimer  University of Illinois, Chicago
Daniel S. Wilkerson  Computer Science Division, UC Berkeley
Alex Aiken  Computer Science Division, UC Berkeley
Sponsor
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 12,   Downloads (12 Months): 157,   Citation Count: 51
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/872757.872770
What is a DOI?

ABSTRACT

Digital content is for copying: quotation, revision, plagiarism, and file sharing all create copies. Document fingerprinting is concerned with accurately identifying copying, including small partial copies, within large sets of documents.We introduce the class of local document fingerprinting algorithms, which seems to capture an essential property of any finger-printing technique guaranteed to detect copies. We prove a novel lower bound on the performance of any local algorithm. We also develop winnowing, an efficient local fingerprinting algorithm, and show that winnowing's performance is within 33% of the lower bound. Finally, we also give experimental results on Web data, and report experience with MOSS, a widely-used plagiarism detection service.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
 
3
Brenda S. Baker and Udi Manber. Deducing similarities in java sources from bytecodes. In Proc. of Usenix Annual Technical Conf., pages 179--190, 1998.
4
 
5
Andrei Broder. On the resemblance and containment of documents. In SEQS: Sequences '91, 1998.
 
6
 
7
The Crystals. Da do run run, 1963.
 
8
Nevin Heintze. Scalable document fingerprinting. In 1996 USENIX Workshop on Electronic Commerce, November 1996.
 
9
James Joyce. Finnegans wake {1st trade ed.}. Faber and Faber (London), 1939.
 
10
 
11
Sergio Leone, Clint Eastwood, Eli Wallach, and Lee Van Cleef. The Good, the Bad and the Ugly / Il Buono, Il Brutto, Il Cattivo (The Man with No Name). Produzioni Europee Associate (Italy) Production, Distributed by United Artists (USA), 1966.
 
12
Udi Manber. Finding similar files in a large file system. In Proceedings of the USENIX Winter 1994 Technical Conference, pages 1--10, San Fransisco, CA, USA, 17--21 1994.
 
13
Peter Mork, Beitao Li, Edward Chang, Junghoo Cho, Chen Li, and James Wang. Indexing tamper resistant features for image copy detection, 1999. URL: citeseer.nj.nec.com/mork99indexing.html.
 
14
Narayanan Shivakumar and Héctor García-Molina. SCAM: A copy detection mechanism for digital documents. In Proceedings of the Second Annual Conference on the Theory and Practice of Digital Libraries, 1995.
 
15
Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249--260, 1995.
 
16
George K. Zipf. The Psychobiology of Language. Houghton Mifltm Co., 1935.

CITED BY  51

Collaborative Colleagues:
Saul Schleimer: colleagues
Daniel S. Wilkerson: colleagues
Alex Aiken: colleagues