ABSTRACT
While Web spam targets the high commercial value of top-ranked search-engine results, Web archives suffer quality deterioration and wasted resources as a side effect. So far, Web archivists rarely use Web spam filtering technologies but plan to adopt them in the future, as indicated by a survey with responses from more than 20 institutions worldwide. These archives typically operate on modest budgets that prohibit running standalone Web spam filtering, but collaborative efforts could lead to a high-quality solution for them. In this paper we illustrate the spam filtering needs, opportunities, and blockers for Internet archives by analyzing several crawl snapshots, and we show the difficulty of migrating filter models across different crawls using the example of the 13 .uk snapshots performed by UbiCrawler, which include WEBSPAM-UK2006 and WEBSPAM-UK2007.