Spam, damn spam, and statistics: using statistical analysis to locate spam web pages
Source
WebDB; Vol. 67
Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004
Paris, France
SESSION: Paper session 1: web querying and mining
Pages: 1 - 6
Year of Publication: 2004
Bibliometrics
Downloads (6 Weeks): 27, Downloads (12 Months): 158, Citation Count: 53
ABSTRACT
The increasing importance of search engines to commercial web sites has given rise to a phenomenon we call "web spam", that is, web pages that exist only to mislead search engines into (mis)leading users to certain web sites. Web spam is a nuisance to users as well as search engines: users have a harder time finding the information they need, and search engines have to cope with an inflated corpus, which in turn causes their cost per query to increase. Therefore, search engines have a strong incentive to weed out spam web pages from their index.

We propose that some spam web pages can be identified through statistical analysis: Certain classes of spam pages, in particular those that are machine-generated, diverge in some of their properties from the properties of web pages at large. We have examined a variety of such properties, including linkage structure, page content, and page evolution, and have found that outliers in the statistical distribution of these properties are highly likely to be caused by web spam.

This paper describes the properties we have examined, gives the statistical distributions we have observed, and shows which kinds of outliers are highly correlated with web spam.
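The abstract does not say how outliers are detected, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a single numeric property per page (here a made-up list of out-link counts) and flags values that lie far from the median, using the median absolute deviation (MAD). The function name `mad_outliers`, the 3.5 threshold, and the sample data are all illustrative assumptions.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds `threshold`.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation. The threshold of 3.5 is a common rule of
    thumb, assumed here for illustration only.
    """
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad == 0:
        # At least half the values equal the median; flag anything that differs.
        return [i for i, d in enumerate(abs_dev) if d > 0]
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]


# Hypothetical per-page property: number of out-links on each page.
out_links = [12, 9, 15, 11, 10, 14, 13, 950, 8, 12]
suspects = mad_outliers(out_links)
print(suspects)  # indices of pages worth a closer look
```

A page flagged this way is only a candidate: the abstract claims that outliers are highly likely to be spam, not that every outlier is, so a flagged page would still need inspection or a further classifier.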
REFERENCES
1. Einat Amitay, David Carmel, Adam Darlow, Ronny Lempel, Aya Soffer, The connectivity sonar: detecting site functionality by structural patterns, Proceedings of the fourteenth ACM conference on Hypertext and hypermedia, August 26-30, 2003, Nottingham, UK. doi:10.1145/900051.900060
3. Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, Geoffrey Zweig, Syntactic clustering of the Web, Selected papers from the sixth international conference on World Wide Web, p.1157-1166, September 1997, Santa Clara, California, United States
4. Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, Janet Wiener, Graph structure in the Web, Proceedings of the 9th international World Wide Web conference on Computer networks: the international journal of computer and telecommunications networking, p.309-320, June 2000, Amsterdam, The Netherlands
7. B. Davison. Recognizing Nepotistic Links on the Web. In AAAI-2000 Workshop on Artificial Intelligence for Web Search, July 2000.
11. L. Page, S. Brin, R. Motwani and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report, Stanford Digital Libraries Technologies Project, Jan. 1998.
CITED BY 53
André Luiz da Costa Carvalho, Paul-Alexandru Chirita, Edleno Silva de Moura, Pável Calado, Wolfgang Nejdl, Site level noise removal for search engines, Proceedings of the 15th international conference on World Wide Web, May 23-26, 2006, Edinburgh, Scotland
Krysta M. Svore, Qiang Wu, Chris J. C. Burges, Aaswath Raman, Improving web spam classification using rank-time features, Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, May 08-08, 2007, Banff, Alberta, Canada
Yi-Min Wang, Ming Ma, Yuan Niu, Hao Chen, Spam double-funnel: connecting web spammers with advertisers, Proceedings of the 16th international conference on World Wide Web, May 08-12, 2007, Banff, Alberta, Canada
András Benczúr, István Bíró, Károly Csalogány, Tamás Sarlós, Web spam detection via commercial intent analysis, Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, May 08-08, 2007, Banff, Alberta, Canada
Fabricio Benevenuto, Tiago Rodrigues, Virgilio Almeida, Jussara Almeida, Chao Zhang, Keith Ross, Identifying video spammers in online social networks, Proceedings of the 4th international workshop on Adversarial information retrieval on the web, April 22-22, 2008, Beijing, China
Fabiano Atalla, Daniel Miranda, Jussara Almeida, Marcos André Gonçalves, Virgilio Almeida, Analyzing the impact of churn and malicious behavior on the quality of peer-to-peer web search, Proceedings of the 2008 ACM symposium on Applied computing, March 16-20, 2008, Fortaleza, Ceara, Brazil
Reid Andersen, Christian Borgs, Jennifer Chayes, John Hopcroft, Kamal Jain, Vahab Mirrokni, Shanghua Teng, Robust PageRank and locally computable spam detection features, Proceedings of the 4th international workshop on Adversarial information retrieval on the web, April 22-22, 2008, Beijing, China
Yiqun Liu, Rongwei Cen, Min Zhang, Shaoping Ma, Liyun Ru, Identifying web spam with user behavior analysis, Proceedings of the 4th international workshop on Adversarial information retrieval on the web, April 22-22, 2008, Beijing, China
Luca Becchetti, Paolo Boldi, Carlos Castillo, Aristides Gionis, Efficient semi-streaming algorithms for local triangle counting in massive graphs, Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2008, Las Vegas, Nevada, USA
Fabrício Benevenuto, Tiago Rodrigues, Virgílio Almeida, Jussara Almeida, Marcos Gonçalves, Detecting spammers and content promoters in online video social networks, Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, July 19-23, 2009, Boston, MA, USA
Pranam Kolari, Akshay Java, Tim Finin, Tim Oates, Anupam Joshi, Detecting spam blogs: a machine learning approach, Proceedings of the 21st national conference on Artificial intelligence, p.1351-1356, July 16-20, 2006, Boston, Massachusetts