Query-based partitioning of documents and indexes for information lifecycle management
Source
International Conference on Management of Data
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Vancouver, Canada
SESSION: Research Session 14: Ordered Data
Pages 623-636
Year of Publication: 2008
ISBN:978-1-60558-102-6
Authors
Soumyadeb Mitra  University of Illinois at Urbana Champaign, Urbana, IL, USA
Marianne Winslett  University of Illinois at Urbana Champaign, Urbana, IL, USA
Windsor W. Hsu  Data Domain Inc, Santa Clara, CA, USA
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 20,   Downloads (12 Months): 268,   Citation Count: 1
DOI: http://doi.acm.org/10.1145/1376616.1376680

ABSTRACT

Regulations require businesses to archive many electronic documents for extended periods of time. Given the sheer volume of documents and the response time requirements, documents that are unlikely to ever be accessed should be stored on an inexpensive device (such as tape), while documents that are likely to be accessed should be placed on a more expensive, higher-performance device. Unfortunately, traditional data partitioning techniques either require substantial manual involvement, or are not suitable for read-rarely workloads. In this paper, we present a novel technique to address this problem. We estimate the future access likelihood for a document based on past workloads of keyword queries and the click-through behavior for top-K query answers, then use this information to drive partitioning decisions. Our overall best scheme, the document-split inverted index, does not require any parameter tuning and yet performs close to the optimal partitioning strategy. Experiments show that document-split partitioning improves performance on a large intranet query workload by a factor of 4 when we add a fast storage server that holds 20% of the data.
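
The general idea in the abstract (use past click-through behavior to decide which documents go on the fast device) can be illustrated with a minimal sketch. This is not the paper's document-split inverted index. It assumes that a document's past click count is a reasonable proxy for its future access likelihood, and that fast-tier capacity is measured as a fraction of the document count. The function name, the log format, and the sample data are all hypothetical.

```python
from collections import Counter


def partition_by_clicks(click_log, all_docs, fast_fraction=0.2):
    """Split documents into (fast, slow) tiers using past click counts.

    click_log: iterable of (query, clicked_doc_id) pairs (hypothetical format).
    all_docs: iterable of every archived document id.
    fast_fraction: share of documents the fast tier can hold (assumed 20%).
    """
    # Count how often each document was clicked in past query results.
    clicks = Counter(doc for _query, doc in click_log)
    # Rank by click count, highest first; break ties by id for determinism.
    ranked = sorted(set(all_docs), key=lambda d: (-clicks[d], d))
    n_fast = int(len(ranked) * fast_fraction)
    return set(ranked[:n_fast]), set(ranked[n_fast:])


# Hypothetical workload: ten documents, four recorded clicks.
log = [("budget", "d1"), ("budget", "d3"), ("audit", "d1"), ("policy", "d7")]
docs = [f"d{i}" for i in range(1, 11)]
fast, slow = partition_by_clicks(log, docs)
print(sorted(fast))  # the two most-clicked documents
```

Documents that were never clicked get a count of zero and fall to the slow tier first, which matches the read-rarely setting the abstract describes.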



