ABSTRACT
Research in the area of adversarial information retrieval (AIR) has been facilitated by the availability of the UK-2006/UK-2007 collections, comprising crawl data, link graph, and spam labels. However, these resources do not adequately support research into nullifying the negative effect of spam or excessive search engine optimisation (SEO) on the ranking of non-spam pages, nor the study of cloaking techniques or of click spam. Finally, the domain-restricted nature of a .uk crawl means that only parts of link-farm icebergs may be visible in these crawls. We introduce the term nullification, which we define as "preventing problem pages from negatively affecting search results". We show some important differences between properties of current .uk-restricted crawls and those previously reported for the Web as a whole. We identify a need for an adversarial IR collection that is not domain-restricted and that is supported by a set of appropriate query sets and (optimistically) user-behaviour data. The billion-page unrestricted crawl being conducted by CMU (web09-bst), which will be used in the 2009 TREC Web Track, is assessed as a possible basis for a new AIR test collection. We discuss the pros and cons of its scale, and the feasibility of adding resources such as query lists to enhance the utility of the collection for AIR research.