ABSTRACT
A unified model for text categorization and text retrieval is introduced. We use a training set of manually categorized documents to learn word-category associations, and use these associations to predict the categories of arbitrary documents. Similarly, we use a training set of queries and their related documents to obtain empirical associations between query words and the indexing terms of documents, and use these associations to predict the related documents of arbitrary queries. A Linear Least Squares Fit (LLSF) technique is employed to estimate the likelihood of these associations. Document collections from the MEDLINE database and Mayo patient records are used to study the effectiveness of our approach and how much that effectiveness depends on the choice of training data, indexing language, word-weighting scheme, and morphological canonicalization. Alternative methods are also tested on these collections for comparison. The results indicate that the LLSF approach makes effective use of the relevance information contained in human decisions of categorization and retrieval, and achieves a semantic mapping of free texts to their representations in an indexing language. Such a semantic mapping leads to a significant improvement in categorization and retrieval compared to the alternative approaches.
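The abstract does not spell out the LLSF computation, so the following is only a minimal sketch of one common way to set up such a fit, not the authors' implementation. It assumes a term-by-document matrix `A` and a category-by-document matrix `B` built from the training set, and looks for a matrix `F` that minimizes the Frobenius norm of `F A - B`; the vocabulary, categories, binary weights, and function names are all hypothetical.

```python
import numpy as np

# Hypothetical toy vocabulary and category set (not from the paper).
TERMS = ["heart", "attack", "lung", "cancer"]
CATEGORIES = ["cardiology", "oncology"]

# Columns are four training documents:
# "heart attack", "lung cancer", "heart", "cancer".
A = np.array([
    [1, 0, 1, 0],   # heart
    [1, 0, 0, 0],   # attack
    [0, 1, 0, 0],   # lung
    [0, 1, 0, 1],   # cancer
], dtype=float)

# Manually assigned categories for the same four documents.
B = np.array([
    [1, 0, 1, 0],   # cardiology
    [0, 1, 0, 1],   # oncology
], dtype=float)


def fit_llsf(A, B):
    """Return F minimizing ||F @ A - B||_F (least squares fit)."""
    # lstsq solves A.T @ X ~= B.T, so F is the transpose of X.
    X, *_ = np.linalg.lstsq(A.T, B.T, rcond=None)
    return X.T


def vectorize(text):
    """Map free text to a term-count vector over TERMS (assumed weighting)."""
    words = text.lower().split()
    return np.array([words.count(t) for t in TERMS], dtype=float)


def rank_categories(F, x):
    """Score every category for document vector x, highest score first."""
    scores = F @ x
    order = np.argsort(-scores)
    return [(CATEGORIES[i], float(scores[i])) for i in order]


F = fit_llsf(A, B)
print(rank_categories(F, vectorize("heart attack")))
```

The same fit could in principle serve the retrieval case by letting the columns of `A` hold training queries and the columns of `B` hold the indexing terms of their related documents; that reading is an assumption drawn from the abstract's description of a unified model.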