Content and expression-based copy recognition for intellectual property protection
Source: ACM Workshop on Digital Rights Management
Proceedings of the 3rd ACM Workshop on Digital Rights Management
Washington, DC, USA
SESSION: Copyrights and access-rights
Pages: 103 - 110  
Year of Publication: 2003
ISBN:1-58113-786-9
Authors
Özlem Uzuner, Massachusetts Institute of Technology, Cambridge, MA
Randall Davis, Massachusetts Institute of Technology, Cambridge, MA
Sponsors
ACM: Association for Computing Machinery
SIGSAC: ACM Special Interest Group on Security, Audit, and Control
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/947380.947393

ABSTRACT

Protection of the copyrights and revenues of content owners in the digital world has been gaining importance in recent years. This paper presents a way of fingerprinting text documents that can be used to identify content and expression similarities between documents. Such fingerprints can facilitate tracking of digital copies of works and help ensure proper compensation to content owners.

The fingerprints we collected consist of surface, syntactic, and semantic features of documents. Because they reflect mostly how things are said, we call these features stylistic fingerprints. However, how things are said is not independent of what is said; therefore, these features have predictive power with respect to both content and expression.

We tested the ability of these stylistic fingerprints to identify content and expression similarities between documents using a corpus of translated novels. On this corpus, using decision trees and ten-fold cross-validation, these fingerprints identified the source of a given book chapter (content) successfully 90% of the time and the translator of the chapter (expression) 67% of the time.

In comparison, fingerprints based on the vocabularies of documents recognized the source of a given book chapter accurately 93% of the time and the expression of a particular translator 61% of the time.

We believe that the right fingerprints can identify modified and literal copies of works, securing revenues for content owners. Enabling content owners to secure revenues from the distribution of their works can alleviate the digital copyright problem and reduce the need to prevent distribution, giving a chance to solutions that promote uninhibited distribution and use of works by the public.
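
To make the idea of a stylistic fingerprint more concrete, the following is a minimal sketch in Python, not the authors' implementation. It computes only a few surface features per chapter and shows how ten-fold cross-validation splits could be produced. The function names, the specific features, and the short function-word list are all assumptions made for illustration; the abstract does not enumerate the features used, and the syntactic and semantic features as well as the decision-tree classifier are omitted here.

```python
import re
from collections import Counter

# Hypothetical function-word list; the abstract does not specify one.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that", "but", "not")


def surface_fingerprint(text):
    """Return a small vector of surface features for one chapter.

    Features (all illustrative assumptions): mean sentence length in
    words, mean word length in characters, type-token ratio, and the
    relative frequency of each word in FUNCTION_WORDS.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words or not sentences:
        return [0.0] * (3 + len(FUNCTION_WORDS))
    counts = Counter(words)
    features = [
        len(words) / len(sentences),               # mean sentence length
        sum(len(w) for w in words) / len(words),   # mean word length
        len(counts) / len(words),                  # type-token ratio
    ]
    features += [counts[w] / len(words) for w in FUNCTION_WORDS]
    return features


def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists for k-fold cross-validation.

    Item i is assigned to fold i % k, so every item is used for
    testing exactly once across the k splits.
    """
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for test_idx, test in enumerate(folds):
        train = [j for f_idx, fold in enumerate(folds)
                 if f_idx != test_idx for j in fold]
        yield train, test
```

In an experiment of the kind described above, each chapter would be mapped to such a vector and labeled with its source book or its translator; a classifier would then be trained on the training indices of each split and scored on the held-out indices, with accuracy averaged over the ten splits.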



