| Exploring linguistic features for web spam detection: a preliminary study |
| Source
|
AIRWeb; Vol. 295
Proceedings of the 4th international workshop on Adversarial information retrieval on the web
Beijing, China
SESSION: Text analysis
Pages 25-28
Year of Publication: 2008
ISBN:978-1-60558-159-0
|
|
Authors
|
|
Jakub Piskorski
|
Joint Research Centre of the European Commission, Ispra, VA, Italy
|
|
Marcin Sydow
|
Polish-Japanese Institute of Information Technology, Koszykowa, Warsaw, Poland
|
|
Dawid Weiss
|
Poznań University of Technology, Piotrowo, Poznań, Poland
|
|
|
|
|
ABSTRACT
We study the usability of linguistic features in the Web spam classification task. The features were computed on two Web spam corpora, Webspam-Uk2006 and Webspam-Uk2007, and we make them publicly available to other researchers. Preliminary analysis suggests that certain linguistic features may be useful for the spam-detection task when combined with features studied elsewhere.