An evaluation of phrasal and clustered representations on a text categorization task
Source Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval
Copenhagen, Denmark
Pages: 37 - 50  
Year of Publication: 1992
ISBN:0-89791-523-2
Author
David D. Lewis  Center for Information and Language Studies, University of Chicago, Chicago, IL
Sponsors
Royal School of Lib.
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/133160.133172

ABSTRACT

Syntactic phrase indexing and term clustering have been widely explored as text representation techniques for text retrieval. In this paper we study the properties of phrasal and clustered indexing languages on a text categorization task, which lets us examine them in isolation from query interpretation issues. We show that optimal effectiveness occurs when using only a small proportion of the indexing terms available, and that effectiveness peaks at a higher feature set size and lower effectiveness level for syntactic phrase indexing than for word-based indexing. We also present results suggesting that traditional term clustering methods are unlikely to provide significantly improved text representations. An improved probabilistic text categorization method is also presented.

