ABSTRACT
A new algorithm for document clustering is introduced. The base concept of the algorithm, the cover coefficient (CC) concept, provides a means of estimating the number of clusters within a document database and relates indexing and clustering analytically. The CC concept is also used to identify the cluster seeds and to form clusters with these seeds. It is shown that the complexity of the clustering process is very low. The retrieval experiments show that the information-retrieval effectiveness of the algorithm is compatible with a very demanding complete linkage clustering method that is known to have good retrieval performance. The experiments also show that the algorithm is 15.1 to 63.5 percent (47.5 percent on average) better than four other clustering algorithms in cluster-based information retrieval. The experiments have validated the indexing-clustering relationships and the complexity of the algorithm and have shown improvements in retrieval effectiveness. Two document databases are used in the experiments: TODS214 and INSPEC. The latter is a common database with 12,684 documents.
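The abstract states the role of the CC concept but gives no formulas. As a rough illustration only, the sketch below assumes the definition usually associated with this method: for a document-by-term matrix D, the cover coefficient c_ij is the sum over terms k of (d_ik / row sum of i) * (d_jk / column sum of k), the estimated number of clusters is the sum of the diagonal entries c_ii, and a document's seed power is c_ii * (1 - c_ii) * (row sum of i). The function names and the toy matrix are hypothetical and are not taken from the paper.

```python
def cover_coefficients(D):
    """Return the m x m cover coefficient matrix for a document-term matrix D.

    Assumed definition (not stated in the abstract):
    c[i][j] = (1 / row_sum[i]) * sum_k D[i][k] * D[j][k] / col_sum[k]
    """
    m, n = len(D), len(D[0])
    row_sums = [sum(row) for row in D]
    col_sums = [sum(D[i][k] for i in range(m)) for k in range(n)]
    if any(s == 0 for s in row_sums) or any(s == 0 for s in col_sums):
        raise ValueError("every document and every term must have a nonzero weight")
    return [
        [
            sum(D[i][k] * D[j][k] / col_sums[k] for k in range(n)) / row_sums[i]
            for j in range(m)
        ]
        for i in range(m)
    ]


def estimate_cluster_count(D):
    """Estimate the number of clusters as the sum of the diagonal of C."""
    C = cover_coefficients(D)
    return sum(C[i][i] for i in range(len(D)))


def seed_powers(D):
    """Assumed seed power: c_ii * (1 - c_ii) * (sum of the document's weights)."""
    C = cover_coefficients(D)
    return [C[i][i] * (1.0 - C[i][i]) * sum(D[i]) for i in range(len(D))]


# Hypothetical toy collection: two pairs of documents with disjoint vocabularies.
toy = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
cluster_estimate = estimate_cluster_count(toy)
toy_seed_powers = seed_powers(toy)
```

Under these assumptions, each row of the coefficient matrix sums to 1, and a collection made of two mutually disjoint groups of identical documents yields an estimate of two clusters. Documents with the highest seed power would be the candidates for cluster seeds.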
REVIEWS
"Karen Sparck-Jones : Reviewer"
The authors give an exhaustive account of a method of clustering
and some experiments with it. The application is to document clustering;
the presumption is that clustering will both reduce search effort in
retrieving documents for a request a ...
"Fazli Can : Reviewer","Esen A. Ozkarahan : Reviewer"
In her review, K. Sparck-Jones claims that the cluster seed
selection process would fail catastrophically if two disjoint databases
with identical statistical properties were combined. Sparck-Jones states
that this was brought to her attention ...
"Karen Sparck-Jones : Reviewer"
I have two points in response to the authors. First, they were
correct in their statement about disjoint databases. R. Needham
misunderstood their paper and offers his apologies. My second point
concerns clustering for recall or precision. In ...