Compressed collections for simulated crawling
Source
ACM SIGIR Forum
Volume 42, Issue 2 (December 2008)
COLUMN: Papers
Pages 39-44
Year of Publication: 2008
ISSN: 0163-5840
Authors
Alessio Orlandi  Università di Pisa, Italy
Sebastiano Vigna  Università degli Studi di Milano, Italy
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1480506.1480512

ABSTRACT

Collections are a fundamental tool for reproducible evaluation of information retrieval techniques. We describe a new method for distributing the document lengths and term counts (a.k.a. within-document frequencies) of a web snapshot in a highly compressed yet quickly accessible form. Our main application is reproducing the behaviour of focused crawlers: by coupling our collection with the corresponding web graph, compressed with WebGraph [3], we make it possible to apply text-based machine learning tools to the collection while keeping the data set footprint small. We describe a collection based on a crawl of 100 million pages of the .uk domain, publicly available bundled with an open-source Java implementation of our techniques.
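
To give a feel for why term counts compress well, here is a minimal sketch that packs a list of within-document frequencies with Elias gamma codes. This is an assumption made purely for illustration: the abstract does not say which codes the collection uses, and the sketch ignores the fast-access side of the method. The function names and the sample counts are hypothetical.

```python
def gamma_encode(n):
    """Return the Elias gamma code of a positive integer as a '0'/'1' string."""
    if n < 1:
        raise ValueError("gamma codes are defined for positive integers only")
    binary = bin(n)[2:]  # binary digits, most significant bit first
    # Unary length prefix (one '0' per extra digit), then the digits themselves.
    return "0" * (len(binary) - 1) + binary


def gamma_decode_all(bits):
    """Decode a concatenation of gamma codes back into a list of integers."""
    values = []
    pos = 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":  # count the unary prefix
            zeros += 1
            pos += 1
        # The next zeros + 1 bits are the binary digits of the value.
        values.append(int(bits[pos:pos + zeros + 1], 2))
        pos += zeros + 1
    return values


# Hypothetical term counts for one document (illustrative data only).
term_counts = [1, 1, 3, 1, 2, 7, 1]
encoded = "".join(gamma_encode(c) for c in term_counts)
decoded = gamma_decode_all(encoded)
```

Because most term counts in a document are small, short codes dominate: the seven counts above take 15 bits in total, against 224 bits as 32-bit integers.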


REFERENCES

[1]
[2]
[3]
[4] P. Boldi and S. Vigna. Codes for the World Wide Web. Internet Mathematics 2(4):405–427, 2005.
[5]
[6]
[7]
[8] R. M. Fano. On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, Project MAC, MIT, Cambridge, Mass., n.d., 1971.
[9] S. Vigna. Broadword implementation of rank/select queries. Proc. of the 7th International Workshop on Experimental Algorithms, pp. 154–168. LNCS 5038, Springer Verlag, 2008.
[10] WARC file format, ISO/DIS 28500, 2007.