A large-scale study of link spam detection by graph algorithms
Source
AIRWeb; Vol. 215
Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
Banff, Alberta, Canada
SESSION: Link farms
Pages: 45 - 48
Year of Publication: 2007
ISBN: 978-1-59593-732-2

Authors
Hiroo Saito
Aihara Complexity Modelling Project, ERATO, JST, Tokyo, Japan and University of Tokyo, Tokyo, Japan

Masashi Toyoda
University of Tokyo, Tokyo, Japan

Masaru Kitsuregawa
University of Tokyo, Tokyo, Japan

Kazuyuki Aihara
Aihara Complexity Modelling Project, ERATO, JST, Tokyo, Japan and University of Tokyo, Tokyo, Japan

ABSTRACT
Link spam refers to attempts to promote the ranking of spammers' web sites by deceiving link-based ranking algorithms in search engines. Spammers often create densely connected link structures of sites, so-called "link farms". In this paper, we study the overall structure and distribution of link farms in a large-scale graph of the Japanese Web with 5.8 million sites and 283 million links. To examine the spam structure, we apply three graph algorithms to the web graph. First, the web graph is decomposed into strongly connected components (SCCs). Apart from the largest SCC (the core) at the center of the web, we observed that most of the large components consist of link farms. Next, to extract spam sites in the core, we enumerate maximal cliques as seeds of link farms. Finally, treating these link farms as a reliable spam seed set, we expand them with a minimum cut technique that separates the links between spam and non-spam sites. We found about 0.6 million spam sites in SCCs around the core, and extracted an additional 8 thousand and 49 thousand spam sites with high precision in the core by maximal clique enumeration and by the minimum cut technique, respectively.
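
The three steps named in the abstract can be illustrated on a toy graph. The following is a minimal Python sketch, not the authors' implementation: the abstract does not say which algorithms or parameters were used, so Kosaraju's algorithm, Bron-Kerbosch with pivoting, and Edmonds-Karp are stand-ins chosen here. The toy graph, the restriction of cliques to mutual links, the clique size threshold of 3, the unit link capacities, the `trusted` whitelist, and the virtual `SOURCE` and `SINK` nodes are all assumptions made for illustration.

```python
from collections import defaultdict, deque

def strongly_connected_components(graph):
    """Kosaraju's algorithm; graph maps each site to the set of sites it links to."""
    reverse = defaultdict(set)
    for u, vs in graph.items():
        for v in vs:
            reverse[v].add(u)
    def dfs(adj, start, seen, out):         # iterative DFS, appends in finishing order
        seen.add(start)
        stack = [(start, iter(adj.get(start, ())))]
        while stack:
            node, successors = stack[-1]
            nxt = next((v for v in successors if v not in seen), None)
            if nxt is None:
                out.append(stack.pop()[0])
            else:
                seen.add(nxt)
                stack.append((nxt, iter(adj.get(nxt, ()))))
    order, seen = [], set()
    for node in set(graph) | set(reverse):  # pass 1: finishing order on the graph
        if node not in seen:
            dfs(graph, node, seen, order)
    components, seen = [], set()
    for node in reversed(order):            # pass 2: sweep the reversed graph
        if node not in seen:
            component = []
            dfs(reverse, node, seen, component)
            components.append(set(component))
    return components

def maximal_cliques(adj):
    """Bron-Kerbosch with pivoting; adj maps each node to its undirected neighbours."""
    cliques = []
    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        pivot = max(p | x, key=lambda u: len(adj[u] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p, x = p - {v}, x | {v}
    expand(set(), set(adj), set())
    return cliques

def min_cut_source_side(capacity, source, sink):
    """Edmonds-Karp max flow; returns the nodes still reachable from source."""
    residual = defaultdict(lambda: defaultdict(int))
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            residual[u][v] += c
            residual[v][u] += 0             # make sure the reverse arc exists
    while True:
        parent, queue = {source: None}, deque([source])
        while queue:                        # BFS over arcs with spare capacity
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:              # no augmenting path: flow is maximal
            return set(parent)
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(residual[a][b] for a, b in path)
        for a, b in path:
            residual[a][b] -= flow
            residual[b][a] += flow

# Hypothetical toy web: a, b, c are normal sites; f1..f4 and g1, g2 are spam.
graph = {
    "a": {"b", "f1"}, "b": {"c"}, "c": {"a"},
    "f1": {"f2", "f3", "f4"}, "f2": {"f1", "f3", "a"}, "f3": {"f1", "f2"},
    "f4": {"f1", "f2"}, "g1": {"g2", "a"}, "g2": {"g1"},
}
components = strongly_connected_components(graph)      # step 1
core = max(components, key=len)
# Step 2: cliques over mutual links inside the core (size >= 3 is an assumption).
mutual = {u: {v for v in graph[u] & core if v != u and u in graph[v]} for u in core}
seeds = set().union(*(c for c in maximal_cliques(mutual) if len(c) >= 3))
# Step 3: capacity 1 per link in both directions (an assumption).
trusted = {"b", "c"}                        # assumed whitelist of non-spam sites
capacity = defaultdict(lambda: defaultdict(int))
for u in core:
    for v in graph[u] & core:
        capacity[u][v] += 1
        capacity[v][u] += 1
for s in seeds:
    capacity["SOURCE"][s] = float("inf")
for t in trusted:
    capacity[t]["SINK"] = float("inf")
expanded = min_cut_source_side(capacity, "SOURCE", "SINK") - {"SOURCE"}

print([sorted(c) for c in components if c != core])    # [['g1', 'g2']]
print(sorted(seeds))                                    # ['f1', 'f2', 'f3']
print(sorted(expanded - seeds))                         # ['f4']
```

In this toy example the small component {g1, g2} outside the core plays the role of a link farm found by SCC decomposition, the clique {f1, f2, f3} is the seed inside the core, and the minimum cut adds f4, which links heavily into the farm but belongs to no clique of size 3. The sketch assumes the seed set and the trusted set are disjoint.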