Know your neighbors: web spam detection using the web topology
Source
Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Amsterdam, The Netherlands
SESSION: Spam spam spam
Pages: 423-430
Year of Publication: 2007
ISBN: 978-1-59593-597-7
Authors
Carlos Castillo (Yahoo! Research Barcelona, Catalunya, Spain)
Debora Donato (Yahoo! Research Barcelona, Catalunya, Spain)
Aristides Gionis (Yahoo! Research Barcelona, Catalunya, Spain)
Vanessa Murdock (Yahoo! Research Barcelona, Catalunya, Spain)
Fabrizio Silvestri (ISTI-CNR, Pisa, Italy)
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1277741.1277814

ABSTRACT

Web spam can significantly deteriorate the quality of search engine results. Thus there is a large incentive for commercial search engines to detect spam pages efficiently and accurately. In this paper we present a spam detection system that combines link-based and content-based features, and uses the topology of the Web graph by exploiting the link dependencies among the Web pages. We find that linked hosts tend to belong to the same class: either both are spam or both are non-spam. We demonstrate three methods of incorporating the Web graph topology into the predictions obtained by our base classifier: (i) clustering the host graph, and assigning the label of all hosts in the cluster by majority vote, (ii) propagating the predicted labels to neighboring hosts, and (iii) using the predicted labels of neighboring hosts as new features and retraining the classifier. The result is an accurate system for detecting Web spam, tested on a large and public dataset, using algorithms that can be applied in practice to large-scale Web data.
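
The three ways of using the host graph named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy hosts, scores, clusters, function names, the 0.5 threshold, and the mixing weight `alpha` are all hypothetical, and method (ii) is shown here as plain neighbor averaging rather than whatever propagation scheme the paper uses. The base classifier is assumed to have already produced a spam score in [0, 1] for each host.

```python
from collections import defaultdict

def build_neighbors(edges):
    """Build an undirected adjacency map from (host, host) link pairs."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return nbrs

def neighbor_mean(pred, nbrs, host):
    """Mean predicted spam score of a host's neighbors (own score if isolated)."""
    ns = nbrs.get(host, ())
    return sum(pred[n] for n in ns) / len(ns) if ns else pred[host]

def cluster_majority_vote(pred, clusters, threshold=0.5):
    """(i) Give every host in a cluster the cluster's majority label."""
    relabeled = {}
    for hosts in clusters:
        spam_votes = sum(1 for h in hosts if pred[h] >= threshold)
        label = 1.0 if 2 * spam_votes > len(hosts) else 0.0
        for h in hosts:
            relabeled[h] = label
    return relabeled

def propagate(pred, nbrs, alpha=0.5, rounds=1):
    """(ii) Mix each host's score with the mean score of its neighbors."""
    current = dict(pred)
    for _ in range(rounds):
        current = {h: (1 - alpha) * p + alpha * neighbor_mean(current, nbrs, h)
                   for h, p in current.items()}
    return current

def stacked_features(features, pred, nbrs):
    """(iii) Append the neighbors' mean predicted score as an extra feature.
    The classifier would then be retrained on these augmented vectors."""
    return {h: list(f) + [neighbor_mean(pred, nbrs, h)]
            for h, f in features.items()}

# Hypothetical toy data: base-classifier scores (1.0 = spam) on five hosts.
pred = {"a": 0.9, "b": 0.8, "c": 0.3, "d": 0.1, "e": 0.2}
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e")]
clusters = [["a", "b", "c"], ["d", "e"]]
features = {h: [float(i)] for i, h in enumerate(pred)}  # placeholder features

nbrs = build_neighbors(edges)
voted = cluster_majority_vote(pred, clusters)
smoothed = propagate(pred, nbrs)
augmented = stacked_features(features, pred, nbrs)
print(voted)
print(smoothed)
print(augmented)
```

In this toy graph, host `c` has a low base score but links only to hosts scored as spam, so all three variants pull it toward the spam class. That is the kind of neighbor agreement the abstract reports for linked hosts.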


