A corpus analysis approach for automatic query expansion and its extension to multiple databases
Source: ACM Transactions on Information Systems (TOIS)
Volume 17, Issue 3 (July 1999)
Pages: 250-269
Year of Publication: 1999
ISSN:1046-8188
Authors
Susan Gauch, University of Kansas
Jianying Wang, University of Kansas
Satya Mahesh Rachakonda, University of Kansas
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/314516.314519

ABSTRACT

Searching online text collections can be both rewarding and frustrating. Valuable information can be found, but typically many irrelevant documents are retrieved and many relevant ones are missed. Terminology mismatches between the user's query and document contents are a main cause of retrieval failures. Expanding a user's query with related words can improve search performance, but finding and using related words is an open problem. This research uses corpus analysis techniques to automatically discover similar words directly from the contents of databases that are not tagged with part-of-speech labels. Using these similarities, user queries are automatically expanded, resulting in conceptual retrieval rather than requiring exact word matches between queries and documents. We achieve a 7.6% improvement for TREC 5 queries and up to a 28.5% improvement on the narrow-domain Cystic Fibrosis collection. This work has been extended to multidatabase collections in which each subdatabase has its own collection-specific similarity matrix. If the best matrix is selected, substantial search improvements are possible. Various techniques for selecting the appropriate matrix for a particular query are analyzed, and a 4.8% improvement in the results is validated.
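
The general idea of expanding a query with corpus-derived similar words can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how similarity is computed, so the sketch assumes a simple approach in which each word is represented by counts of its neighbors within a fixed window and words are compared by cosine similarity. The function names, the window size, and the `k` and `threshold` parameters are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(docs, window=2):
    """Map each word to counts of the words seen within `window` positions of it.

    Assumption: whitespace tokenization on untagged text; no stemming or stop list.
    """
    vectors = defaultdict(Counter)
    for doc in docs:
        tokens = doc.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def expand_query(query, vectors, k=1, threshold=0.5):
    """Append up to `k` corpus-similar words per query term (hypothetical policy)."""
    expanded = list(query)
    for term in query:
        if term not in vectors:
            continue  # term never seen in the corpus, nothing to expand with
        scored = sorted(
            (
                (cosine(vectors[term], vec), other)
                for other, vec in vectors.items()
                if other != term and other not in expanded
            ),
            reverse=True,
        )
        expanded += [other for score, other in scored[:k] if score >= threshold]
    return expanded
```

In this sketch, `context_vectors` would be run once per collection, which loosely mirrors the idea of a collection-specific similarity matrix; choosing which collection's vectors to use for a given query is the matrix selection problem the abstract mentions and is not addressed here.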


