Query-based partitioning of documents and indexes for information lifecycle management
International Conference on Management of Data
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Vancouver, Canada
SESSION: Research Session 14: Ordered Data
Pages 623-636
Year of Publication: 2008
ISBN: 978-1-60558-102-6
ABSTRACT
Regulations require businesses to archive many electronic documents for extended periods of time. Given the sheer volume of documents and the response time requirements, documents that are unlikely to ever be accessed should be stored on an inexpensive device (such as tape), while documents that are likely to be accessed should be placed on a more expensive, higher-performance device. Unfortunately, traditional data partitioning techniques either require substantial manual involvement, or are not suitable for read-rarely workloads. In this paper, we present a novel technique to address this problem. We estimate the future access likelihood for a document based on past workloads of keyword queries and the click-through behavior for top-K query answers, then use this information to drive partitioning decisions. Our overall best scheme, the document-split inverted index, does not require any parameter tuning and yet performs close to the optimal partitioning strategy. Experiments show that document-split partitioning improves performance on a large intranet query workload by a factor of 4 when we add a fast storage server that holds 20% of the data.
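The core idea in the abstract, using past accesses to decide which documents go on the fast device, can be illustrated with a minimal sketch. This is not the paper's document-split inverted index. It assumes a hypothetical click log (a list of clicked document ids), treats each document's past click count as a crude stand-in for its future access likelihood, and places the top-scoring fraction on the fast tier. The names `partition_by_clicks` and `fast_fraction` are illustrative and do not come from the paper.

```python
from collections import Counter


def partition_by_clicks(doc_ids, click_log, fast_fraction=0.2):
    """Split doc_ids into (fast, slow) tiers by past click counts.

    Assumption: past clicks approximate future access likelihood.
    This is a simplification for illustration only.
    """
    clicks = Counter(click_log)
    # Rank by click count, highest first; break ties by id for determinism.
    ranked = sorted(doc_ids, key=lambda d: (-clicks[d], d))
    # Number of documents that fit on the fast tier.
    n_fast = int(len(ranked) * fast_fraction)
    return set(ranked[:n_fast]), set(ranked[n_fast:])


# Hypothetical data: five documents and six recorded clicks.
docs = ["d1", "d2", "d3", "d4", "d5"]
log = ["d3", "d3", "d1", "d3", "d5", "d1"]
fast, slow = partition_by_clicks(docs, log)
```

With `fast_fraction=0.2`, the fast tier holds one of the five documents, mirroring the 20% fast-storage setting mentioned in the abstract. A real system would also have to partition the index itself, which this sketch does not attempt.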
CITED BY
Andrew W. Leung, Minglong Shao, Timothy Bisson, Shankar Pasupathy, Ethan L. Miller, Spyglass: fast, scalable metadata search for large-scale storage systems, Proceedings of the 7th conference on File and storage technologies, p.153-166, February 24-27, 2009, San Francisco, California