Web spam filtering in internet archives
Source: ACM International Conference Proceeding Series
Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Madrid, Spain
SESSION: Temporal analysis
Pages 17-20  
Year of Publication: 2009
ISBN: 978-1-60558-438-6
Authors
Miklós Erdélyi, University of Pannonia and Computer and Automation Research Institute of the Hungarian Academy of Sciences
András A. Benczúr, Computer and Automation Research Institute of the Hungarian Academy of Sciences
Julien Masanés, European Archive Foundation, France
Dávid Siklósi, Computer and Automation Research Institute of the Hungarian Academy of Sciences
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1531914.1531918

ABSTRACT

Web spam targets the high commercial value of top-ranked search-engine results, but Web archives suffer quality deterioration and wasted resources as a side effect. So far Web archivists rarely use Web spam filtering technologies, although a survey with responses from more than 20 institutions worldwide indicates that they plan to adopt them. These archives typically operate on modest budgets that prohibit running standalone Web spam filtering, but collaborative efforts could lead to a high-quality solution for them.

In this paper we illustrate the spam filtering needs, opportunities and blockers for Internet archives by analyzing several crawl snapshots. We also illustrate the difficulty of migrating filter models across different crawls, using as an example the 13 .uk snapshots performed by UbiCrawler, which include WEBSPAM-UK2006 and WEBSPAM-UK2007.


