ABSTRACT
We describe an investigation into e-mail content mining for author identification, or authorship attribution, for the purpose of forensic investigation. We focus our discussion on the ability to discriminate between authors both when e-mail topics are aggregated and across different e-mail topics. An extended set of e-mail document features, including structural characteristics and linguistic patterns, was derived and, together with a Support Vector Machine learning algorithm, was used for mining the e-mail content. Experiments using a number of e-mail documents generated by different authors on a set of topics gave promising results for both aggregated and multi-topic author categorisation.
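As a rough illustration of what "structural characteristics and linguistic patterns" could look like as numeric features, the sketch below turns one e-mail body into a small feature dictionary. The abstract does not list the actual features, so every feature name here, the `FUNCTION_WORDS` list, and the `extract_features` helper are assumptions made for illustration, not the authors' feature set. Vectors of this kind would then be passed to a Support Vector Machine classifier.

```python
import re
from collections import Counter

# Hypothetical function-word list, chosen only for illustration;
# the abstract does not say which words (if any) were used.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is")


def extract_features(body: str) -> dict[str, float]:
    """Map one e-mail body to a few illustrative stylometric features."""
    lines = body.splitlines()
    words = re.findall(r"[A-Za-z']+", body.lower())
    n_words = len(words) or 1  # guard against division by zero
    n_lines = len(lines) or 1
    counts = Counter(words)

    features = {
        # Linguistic patterns (assumed examples).
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "vocabulary_richness": len(counts) / n_words,
        # Structural characteristics (assumed examples).
        "blank_line_ratio": sum(1 for ln in lines if not ln.strip()) / n_lines,
        # Crude prefix check on the first line; a real system would be stricter.
        "has_greeting": float(
            bool(lines) and lines[0].lower().startswith(("hi", "hello", "dear"))
        ),
    }
    # Relative frequency of each function word.
    for fw in FUNCTION_WORDS:
        features[f"fw_{fw}"] = counts[fw] / n_words
    return features
```

Each e-mail yields one such dictionary; stacking the values in a fixed key order gives the per-document vector that a classifier would consume.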
CITED BY 24
Hsinchun Chen, Wingyan Chung, Jennifer Jie Xu, Gang Wang, Yi Qin, Michael Chau, Crime Data Mining: A General Framework and Some Examples, Computer, v.37 n.4, p.50-56, April 2004
Hsinchun Chen, Wingyan Chung, Yi Qin, Michael Chau, Jennifer Jie Xu, Gang Wang, Rong Zheng, Homa Atabakhsh, Crime data mining: an overview and case studies, Proceedings of the 2003 annual national conference on Digital government research, p.1-5, May 18-21, 2003, Boston, MA
Chun Wei, Alan Sprague, Gary Warner, Anthony Skjellum, Mining spam email to identify common origins for forensic application, Proceedings of the 2008 ACM symposium on Applied computing, March 16-20, 2008, Fortaleza, Ceara, Brazil