Spam, damn spam, and statistics: using statistical analysis to locate spam web pages
Source WebDB; Vol. 67 archive
Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004
Paris, France
SESSION: Paper session 1: web querying and mining
Pages: 1 - 6  
Year of Publication: 2004
Authors
Dennis Fetterly  Microsoft Research, Mountain View, CA
Mark Manasse  Microsoft Research, Mountain View, CA
Marc Najork  Microsoft Research, Mountain View, CA
Sponsor
INRIA
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1017074.1017077

ABSTRACT

The increasing importance of search engines to commercial web sites has given rise to a phenomenon we call "web spam", that is, web pages that exist only to mislead search engines into (mis)leading users to certain web sites. Web spam is a nuisance to users as well as search engines: users have a harder time finding the information they need, and search engines have to cope with an inflated corpus, which in turn causes their cost per query to increase. Therefore, search engines have a strong incentive to weed out spam web pages from their index.

We propose that some spam web pages can be identified through statistical analysis: Certain classes of spam pages, in particular those that are machine-generated, diverge in some of their properties from the properties of web pages at large. We have examined a variety of such properties, including linkage structure, page content, and page evolution, and have found that outliers in the statistical distribution of these properties are highly likely to be caused by web spam.

This paper describes the properties we have examined, gives the statistical distributions we have observed, and shows which kinds of outliers are highly correlated with web spam.
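
The abstract does not say which statistical test the authors apply, so the following is only an illustrative sketch of the general idea of flagging pages whose value for one property lies far from the bulk of the distribution. The modified z-score (median and median absolute deviation), the 3.5 threshold, the function name flag_outliers, and the example URLs and out-link counts are all assumptions made for illustration; none of them is taken from the paper.

```python
from statistics import median


def flag_outliers(prop_by_page, threshold=3.5):
    """Return pages whose property value is a robust outlier.

    prop_by_page maps a page identifier to one numeric property
    (for example, a hypothetical out-link count). The modified
    z-score used here is an assumed stand-in for whatever test
    the paper actually applies.
    """
    values = list(prop_by_page.values())
    med = median(values)
    # Median absolute deviation: a spread estimate that a few
    # extreme values cannot inflate the way a standard deviation can.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the values equal the median; this score is undefined.
        return []
    return sorted(
        page
        for page, v in prop_by_page.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )


# Hypothetical out-link counts for six pages.
out_links = {
    "a.example/1": 10,
    "a.example/2": 12,
    "b.example/1": 11,
    "b.example/2": 9,
    "c.example/1": 13,
    "spam.example/x": 500,
}
flagged = flag_outliers(out_links)
```

A flagged page is only a candidate: the abstract claims that outliers are highly likely to be spam, not that every outlier is, so a real pipeline would presumably combine several properties or add a review step.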


