ABSTRACT
Syntactic phrase indexing and term clustering have been widely explored as text representation techniques for text retrieval. In this paper we study the properties of phrasal and clustered indexing languages on a text categorization task, enabling us to study their properties in isolation from query interpretation issues. We show that optimal effectiveness occurs when using only a small proportion of the indexing terms available, and that effectiveness peaks at a higher feature set size and lower effectiveness level for syntactic phrase indexing than for word-based indexing. We also present results suggesting that traditional term clustering methods are unlikely to provide significantly improved text representations. An improved probabilistic text categorization method is also presented.