ABSTRACT
This informal note was prompted by discussions and questions at the 1990 AAAI Spring Symposium on Text-Based Intelligent Systems (cf. Jacobs 1990). There is a growing interest in access to, and the use of, large scale full-text databases for a variety of purposes, and in the application of classification methods to organise the mass of data involved (see e.g. Church and Hanks 1990). A good deal of work has been done in this field in the past, but it is little known, and some of the early research literature is not very accessible. Classification is an area in which it is easy to make plausible but mistaken assumptions, and as this certainly holds for classification in retrieval, there is a good deal that can usefully be learnt from past experience, most of which was hard won from careful thought and grinding experiment. This paper is intended as an introduction to this initial work on automatic classification, to help those now becoming interested in classification to avoid unnecessarily repeating heavy effort or, more especially, reinventing square wheels. It should also be noted that automatic classification and related (e.g. seriation) methods have been extensively developed for biological applications in particular, but have been more variously applied, and that much of this work may be relevant in the broad area of machine learning.

It must be emphasised that as this paper is focussed on early work on automatic classification, particularly for information retrieval, and is designed primarily to lead into this research and its literature, it does not attempt a critical evaluation of the overall results established by now, or of the current state of the art. However, it should be pointed out that in the retrieval context in general, as opposed to the wider one of classification as a whole, there has been comparatively little work since the seventies, largely for the reasons indicated in the paper. More recent work in any case refers heavily to earlier research, so this note can be taken as an entry point to the research of the last decade, for which some references are given at the end of the note.
REFERENCES
[1] P. H. A. Sneath and R. R. Sokal. Numerical taxonomy, San Francisco: Freeman, 1973. This substantial book gives a very good and comprehensive, if somewhat biologically-oriented, picture of the area as a whole, and is also a very useful point of access into the literature. It is essential reading as an indication of the sophistication and scope of the field. (The book is not a simple updating of Sneath and Sokal's earlier Principles of numerical taxonomy, 1963: the difference reflects the growth in the field in the sixties.)

[2] R. M. Cormack 'A review of classification', Journal of the Royal Statistical Society Series A, 134, 1971, 321-367. A much shorter, but usefully information-packed introductory review.

[3] P. Macnaughton-Smith 'Some statistical and other numerical techniques for classifying individuals' (Home Office studies in the causes of delinquency and the treatment of offenders 6), London: Her Majesty's Stationery Office, 1965. Good, primarily discursive, presentation of issues.

[4] N. Jardine and R. Sibson Mathematical taxonomy, London: Wiley, 1971. This emphasises, and considers in detail, well-foundedness in classification, treating a range of problems and approaches from this point of view. Jardine and Sibson's work was notable for demonstrating the formal merits of single-link clustering.

[5] R. Sibson 'Order invariant methods for data analysis', Journal of the Royal Statistical Society Series B, 34, 1972, 311-349. A useful review focusing on an important general issue in relation to classification and data analysis, especially from a computational point of view.

[7] M. E. Stevens Automatic indexing: a state-of-the-art report, Monograph 91, National Bureau of Standards, Washington DC, 1965, revised edition 1970. A comprehensive review including classification work, produced when enthusiasm and hope for this area was at its height.

[8] M. E. Stevens, L. Heilprin and V. E. Giuliano (eds) Statistical association methods for mechanised documentation, Symposium proceedings (1964), National Bureau of Standards, Washington DC, 1965. This is also a 'peak' collection, directly presenting the work being done in the area and showing its variety.

[9] K. Sparck Jones 'Some thoughts on classification for retrieval', Journal of Documentation 26, 1970, 89-101. A short discussion of key issues, linking the retrieval application with work on automatic classification in general.

[12] K. Sparck Jones Automatic keyword classification for information retrieval, London: Butterworths, 1971. A monograph describing the motivation for, and experiments in, automatic retrieval thesaurus construction initiated with Needham's work on the theory of clumps (itself summarised for the retrieval context in K. Sparck Jones 'The theory of clumps' in The encyclopedia of library and information science (ed Kent and Lancour), 1971).

[13] K. Sparck Jones and R. G. Bates 'Research on automatic indexing 1974-1976' 2 vols, British Library R&D Report 5428, and Computer Laboratory, University of Cambridge, 1975. Describes a whole series of tests with different collections covering a wide range of indexing methods, including term classifications, and showing that term weighting is much more useful than term classification.

[14] N. Jardine and C. J. van Rijsbergen 'The use of hierarchic clustering in information retrieval', Information Storage and Retrieval 7, 1971, 217-240. Account of early document clustering experiments in the context of general theory and motivation for clustering for retrieval.

[16] K. Sparck Jones (ed) Information retrieval experiment, London: Butterworths, 1981. This includes a review chapter on retrieval system tests 1958-1978 which serves to place work on automatic classification for retrieval in a wider indexing context.

[17] R. M. Needham 'The application of digital computers to classification and grouping', PhD thesis, University of Cambridge, 1961; published as a report under the title 'Research on information retrieval, classification and grouping, 1957-1961', Cambridge Language Research Unit, 1961.

[18] E. L. Ivie 'Search procedures based on measures of relatedness between documents', PhD thesis, MIT, 1966.

[19] J. L. Rocchio 'Document retrieval system - optimisation and evaluation', PhD thesis, Harvard University, 1965; also as Report ISR-10, Computation Laboratory, Harvard University, 1966.

[21] W. B. Croft 'A model of cluster searching based on classification', Information Systems 5, 1980, 189-195.

[23] A. Griffiths, H. C. Luckhurst and P. Willett 'Using interdocument similarity information in document retrieval systems', Journal of the American Society for Information Science 37, 1986, 3-11.

[24] H. J. Peat and P. Willett 'The limitations of term co-occurrence data for query expansion in document retrieval systems', Journal of the American Society for Information Science, in press.

[25] C. J. van Rijsbergen, D. J. Harper and M. F. Porter 'The selection of good search terms', Information Processing and Management, 17, 1981, 77-91.

[26] S. E. Robertson, M. E. Maron and W. S. Cooper 'Probability of relevance: a unification of two competing models for document retrieval', Information Technology: Research and Development 1, 1982, 1-21.

[27] G. Salton and C. Buckley 'Improving retrieval performance by relevance feedback', Journal of the American Society for Information Science 41, 1990, 288-297.

[28] A. F. Smeaton and C. J. van Rijsbergen 'The retrieval effects of query expansion on a feedback document retrieval system', The Computer Journal 26, 1983, 239-246.

[32] P. S. Jacobs (ed) Text-based intelligent systems: current research in text analysis, information extraction, and retrieval, Report 90CRD198, General Electric Research and Development Centre, Schenectady, 1990.