Short text authorship attribution via sequence kernels, Markov chains and author unmasking: An investigation

Sanderson, Conrad and Guenter, Simon (2006). Short text authorship attribution via sequence kernels, Markov chains and author unmasking: An investigation. In: 2006 Conference on Empirical Methods in Natural Language Processing: Proceedings of the Conference. 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), University of Technology, Sydney, Australia, (482-491). 22-23 July 2006.


Author: Sanderson, Conrad; Guenter, Simon
Title of paper: Short text authorship attribution via sequence kernels, Markov chains and author unmasking: An investigation
Conference name: 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006)
Conference location: University of Technology, Sydney, Australia
Conference dates: 22-23 July 2006
Convener: SIGDAT (Association for Computational Linguistics' Special Interest Group for linguistic data and corpus-based approaches to NLP)
Proceedings title: 2006 Conference on Empirical Methods in Natural Language Processing: Proceedings of the Conference
Place of Publication: Stroudsburg, PA, U.S.A.
Publisher: Association for Computational Linguistics (ACL)
Publication Year: 2006
Year available: 2006
Sub-type: Fully published paper
ISBN: 1-932432-73-6
Start page: 482
End page: 491
Total pages: 10
Language: eng
Abstract/Summary: We present an investigation of recently proposed character and word sequence kernels for the task of authorship attribution based on relatively short texts. Performance is compared with two corresponding probabilistic approaches based on Markov chains. Several configurations of the sequence kernels are studied on a relatively large dataset (50 authors), where each author covered several topics. Utilising Moffat smoothing, the two probabilistic approaches obtain similar performance, which in turn is comparable to that of character sequence kernels and is better than that of word sequence kernels. The results further suggest that when using a realistic setup that takes into account the case of texts which are not written by any hypothesised authors, the amount of training material has more influence on discrimination performance than the amount of test material. Moreover, we show that the recently proposed author unmasking approach is less useful when dealing with short texts.
Subjects: 080107 Natural Language Processing; 080109 Pattern Recognition and Data Mining
Q-Index Code: E1
Additional Notes: Paper # W06-1657 and full text available via conference website (pdf).

 
Created: Thu, 02 Apr 2009, 12:44:55 EST by Mary-Anne Marrington on behalf of School of Information Technol and Elec Engineering