When one sample is not enough: improving text database selection using shrinkage
Source: International Conference on Management of Data
Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Paris, France
SESSION: Research sessions: text and DB
Pages: 767 - 778  
Year of Publication: 2004
ISBN:1-58113-859-8
Authors
Panagiotis G. Ipeirotis  Columbia University
Luis Gravano  Columbia University
Sponsor
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1007568.1007655

ABSTRACT

Database selection is an important step when searching over large numbers of distributed text databases. The database selection task relies on statistical summaries of the database contents, which are not typically exported by databases. Previous research has developed algorithms for constructing an approximate content summary of a text database from a small document sample extracted via querying. Unfortunately, Zipf's law practically guarantees that content summaries built this way for any relatively large database will fail to cover many low-frequency words. Incomplete content summaries might negatively affect the database selection process, especially for short queries with infrequent words. To improve the coverage of approximate content summaries, we build on the observation that topically similar databases tend to have related vocabularies. Therefore, the approximate content summaries of topically related databases can complement each other and increase their coverage. Specifically, we exploit a (given or derived) hierarchical categorization of the databases and adapt the notion of "shrinkage" (a form of smoothing that has been used successfully for document classification) to the content summary construction task. A thorough evaluation over 315 real web databases as well as over TREC data suggests that the shrinkage-based content summaries are substantially more complete than their "unshrunk" counterparts. We also describe how to modify existing database selection algorithms to adaptively decide, at run time, whether to apply shrinkage for a query. Our experiments, which rely on TREC data sets, queries, and the associated "relevance judgments," show that our shrinkage-based approach significantly improves state-of-the-art database selection algorithms, and also outperforms a recently proposed hierarchical strategy that exploits database classification as well.
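The shrinkage idea described in the abstract can be pictured as a weighted mixture of a database's sample-based summary with the summaries of the categories above it in the hierarchy. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the mixture weights are fixed by hand (the abstract does not say how they are chosen), word probabilities are estimated as the fraction of sampled documents containing the word, and the function names `summary_from_sample` and `shrink`, as well as the toy data, are hypothetical.

```python
from collections import Counter


def summary_from_sample(docs):
    """Approximate content summary: fraction of sampled docs containing each word."""
    doc_freq = Counter()
    for doc in docs:
        # Count each word once per document (document frequency).
        doc_freq.update(set(doc.lower().split()))
    n = len(docs)
    return {word: count / n for word, count in doc_freq.items()}


def shrink(db_summary, ancestor_summaries, weights):
    """Mix a database summary with its ancestor category summaries.

    weights[0] applies to the database itself, weights[1:] to the ancestors
    (nearest first). The weights are an assumption here and must sum to 1.
    """
    summaries = [db_summary, *ancestor_summaries]
    if len(weights) != len(summaries):
        raise ValueError("need one weight per summary")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    vocabulary = set().union(*summaries)
    return {
        word: sum(lam * s.get(word, 0.0) for lam, s in zip(weights, summaries))
        for word in vocabulary
    }


# Toy data (hypothetical): a tiny sample from one database and the summary
# of its parent category, built from topically related databases.
sample = ["heart surgery recovery", "heart disease risk"]
db_summary = summary_from_sample(sample)
health_category = {"heart": 0.4, "disease": 0.3, "hypertension": 0.1}

shrunk = shrink(db_summary, [health_category], weights=[0.7, 0.3])
```

In this toy case the word "hypertension" never appears in the sample, so the unshrunk summary gives it no probability at all, while the shrunk summary assigns it a small nonzero value borrowed from the category. That is the coverage gain the abstract refers to; whether to apply the shrunk summary for a given query is a separate, run-time decision that this sketch does not model.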

