ABSTRACT
Link spam is the deliberate manipulation of hyperlinks between web pages in order to unduly boost the search engine ranking of one or more target pages. Link-based ranking algorithms such as PageRank, HITS, and their derivatives are especially vulnerable to link spam. Link farms and link exchanges are two common forms of link spam that produce spam communities, i.e., clusters in the web graph. In this paper, we present a directed approach to extracting a link spam community when given one or more of its members. In contrast to previous, completely automated approaches to finding link spam, our method is specifically designed to be used interactively. Our approach starts with a small spam seed set provided by the user and simulates a random walk on the web graph. The random walk is biased to explore the local neighborhood around the seed set through the use of decay probabilities. Truncation is used to retain only the most frequently visited nodes. After termination, the nodes are sorted in decreasing order of their final probabilities and presented to the user. Experiments using manually labeled link spam data sets and random walks from a single seed domain show that the approach achieves over 95.12% precision in extracting large link farms and 80.46% precision in extracting link exchange centroids.
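The steps described in the abstract (seeded random walk, decay, truncation, final ranking) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how decay is applied, how truncation is thresholded, or how final probabilities are computed. The sketch assumes that a fixed `decay` factor scales the probability mass at every step, that only the `keep` most probable nodes survive each step, and that a node's final score is its accumulated probability mass. The function name, parameters, default values, and graph format are all hypothetical.

```python
from collections import defaultdict


def decayed_truncated_walk(graph, seeds, steps=10, decay=0.85, keep=100):
    """Rank nodes by the mass of a decayed, truncated random walk from `seeds`.

    graph: dict mapping each node to a list of nodes it links to
           (assumed adjacency-list format).
    seeds: non-empty list of known spam nodes supplied by the user.
    decay: fraction of probability mass that survives each step (assumed).
    keep:  number of most probable nodes retained after each step (assumed).
    """
    # Start with the probability mass spread uniformly over the seed set.
    current = {s: 1.0 / len(seeds) for s in seeds}
    scores = defaultdict(float)
    for node, p in current.items():
        scores[node] += p

    for _ in range(steps):
        nxt = defaultdict(float)
        for node, p in current.items():
            out_links = graph.get(node, [])
            if not out_links:
                continue  # mass at a dangling node is dropped in this sketch
            # Decay shrinks the mass, which keeps the walk near the seeds.
            share = decay * p / len(out_links)
            for neighbor in out_links:
                nxt[neighbor] += share

        # Truncation: retain only the `keep` most probable nodes.
        top = sorted(nxt.items(), key=lambda kv: kv[1], reverse=True)[:keep]
        current = dict(top)
        for node, p in current.items():
            scores[node] += p

    # Present nodes in decreasing order of accumulated probability.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling the function with a small adjacency dictionary and a single seed returns a list of (node, score) pairs, highest score first, which a user could then review interactively. A smaller `decay` or `keep` confines the walk more tightly to the neighborhood of the seed set.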