Exploring linguistic features for web spam detection: a preliminary study
Source: AIRWeb; Vol. 295
Proceedings of the 4th international workshop on Adversarial information retrieval on the web
Beijing, China
SESSION: Text analysis
Pages 25-28  
Year of Publication: 2008
ISBN: 978-1-60558-159-0
Authors
Jakub Piskorski  Joint Research Centre of the European Commission, Ispra, VA, Italy
Marcin Sydow  Polish-Japanese Institute of Information Technology, Koszykowa, Warsaw, Poland
Dawid Weiss  Poznań University of Technology, Piotrowo, Poznań, Poland
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1451983.1451990
ABSTRACT

We study the usability of linguistic features in the Web spam classification task. The features were computed on two Web spam corpora, Webspam-Uk2006 and Webspam-Uk2007, and we make them publicly available for other researchers. Preliminary analysis seems to indicate that certain linguistic features may be useful for the spam-detection task when combined with features studied elsewhere.


