Mining e-mail content for author identification forensics
Source: ACM SIGMOD Record
Volume 30, Issue 4 (December 2001)
SPECIAL ISSUE: Special section on data mining for intrusion detection and threat analysis
Pages: 55-64
Year of Publication: 2001
ISSN: 0163-5808
Authors
O. de Vel, Defence Science and Technology Organisation, Salisbury, Australia
A. Anderson, Queensland University of Technology, Brisbane, Australia
M. Corney, Queensland University of Technology, Brisbane, Australia
G. Mohay, Queensland University of Technology, Brisbane, Australia
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/604264.604272

ABSTRACT

We describe an investigation into e-mail content mining for author identification, or authorship attribution, for the purpose of forensic investigation. We focus our discussion on the ability to discriminate between authors both when e-mail topics are aggregated and across different e-mail topics. An extended set of e-mail document features, including structural characteristics and linguistic patterns, was derived and, together with a Support Vector Machine learning algorithm, was used for mining the e-mail content. Experiments using a number of e-mail documents generated by different authors on a set of topics gave promising results for both aggregated and multi-topic author categorisation.
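
The kind of feature derivation the abstract mentions can be illustrated with a minimal sketch. It is not the authors' feature set, which the abstract does not list: the function name `extract_features`, the function-word list, and the particular structural and linguistic measures are all assumptions chosen for illustration. Each e-mail body is mapped to a dictionary of numeric style features; vectors like these are what would then be passed to a Support Vector Machine, the classifier the abstract names.

```python
import re
from collections import Counter

# Hypothetical function-word list; the abstract does not give the real one.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is", "i", "it")


def extract_features(email_body: str) -> dict:
    """Map one e-mail body to a small dictionary of numeric style features."""
    lines = email_body.splitlines()
    words = re.findall(r"[A-Za-z']+", email_body.lower())
    n_lines = len(lines) or 1   # avoid division by zero on empty bodies
    n_words = len(words) or 1
    counts = Counter(words)

    features = {
        # Structural characteristics (illustrative choices)
        "num_lines": float(len(lines)),
        "blank_line_ratio": sum(1 for ln in lines if not ln.strip()) / n_lines,
        "has_greeting": float(
            bool(re.match(r"\s*(hi|hello|dear)\b", email_body, re.IGNORECASE))
        ),
        # Linguistic patterns (illustrative choices)
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "vocab_richness": len(counts) / n_words,
    }
    # Relative frequency of each function word, a common stylometric signal
    for fw in FUNCTION_WORDS:
        features[f"fw_{fw}"] = counts[fw] / n_words
    return features


if __name__ == "__main__":
    sample = "Hi Bob,\n\nThe report is in the mail.\n"
    for name, value in extract_features(sample).items():
        print(f"{name}: {value:.3f}")
```

Features of this sort depend on how an author writes rather than on what the message is about, which is why they are a plausible basis for discriminating between authors across different topics.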


