Looking into the past to better classify web spam
ACM International Conference Proceeding Series
Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Madrid, Spain
SESSION: Temporal analysis
Pages 1-8
Year of Publication: 2009
ISBN: 978-1-60558-438-6
ABSTRACT
Web spamming techniques aim to achieve undeserved rankings in search results. Research has been widely conducted on identifying such spam and neutralizing its influence. However, existing spam detection work only considers current information. We argue that historical web page information may also be important in spam classification. In this paper, we use content features from historical versions of web pages to improve spam classification. We use supervised learning techniques to combine classifiers based on current page content with classifiers based on temporal features. Experiments on the WEBSPAM-UK2007 dataset show that our approach improves spam classification F-measure performance by 30% compared to a baseline classifier which only considers current page content.
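The combination idea in the abstract can be illustrated with a minimal sketch. This is not the paper's method: a weighted average of two classifier scores, with the weight picked by grid search on labeled data, stands in for the supervised combiner, which the abstract does not specify. The function names, the 0.5 decision threshold, and the toy scores and labels are all assumptions made up for illustration.

```python
def f_measure(labels, predictions):
    """F1 for the spam class from true labels and boolean predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def combine(current, temporal, weight):
    """Weighted average of current-content and temporal-feature scores."""
    return [weight * c + (1 - weight) * t for c, t in zip(current, temporal)]


def fit_weight(current, temporal, labels, threshold=0.5, steps=10):
    """Grid-search the mixing weight that maximizes F1 on labeled pages."""
    best_weight, best_f = 1.0, -1.0
    for i in range(steps + 1):
        w = i / steps
        preds = [s >= threshold for s in combine(current, temporal, w)]
        f = f_measure(labels, preds)
        if f > best_f:  # keep the first weight that reaches the best F1
            best_weight, best_f = w, f
    return best_weight, best_f


# Toy data (hypothetical): 1 = spam, 0 = non-spam.
labels = [1, 1, 1, 0, 0, 0]
current_scores = [0.9, 0.8, 0.2, 0.1, 0.7, 0.2]   # classifier on current content
temporal_scores = [0.6, 0.3, 0.9, 0.7, 0.1, 0.2]  # classifier on historical features

baseline_f = f_measure(labels, [s >= 0.5 for s in current_scores])
best_weight, combined_f = fit_weight(current_scores, temporal_scores, labels)
print(f"baseline F1={baseline_f:.3f}, combined F1={combined_f:.3f}, weight={best_weight}")
```

In this toy setup each classifier alone misclassifies two pages, while the combined score separates the classes. A real evaluation would fit the weight on training data and report F-measure on held-out pages.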