|
ABSTRACT
Several graph theoretic cluster techniques aimed at the automatic generation of thesauri for information retrieval systems are explored. Experimental cluster analysis is performed on a sample corpus of 2267 documents. A term-term similarity matrix is constructed for the 3950 unique terms used to index the documents. Various threshold values, T, are applied to the similarity matrix to provide a series of binary threshold matrices. The corresponding graph of each binary threshold matrix is used to obtain the term clusters.
Three definitions of a cluster are analyzed: (1) the connected components of the threshold matrix; (2) the maximal complete subgraphs of the connected components of the threshold matrix; (3) clusters of the maximal complete subgraphs of the threshold matrix, as described by Gotlieb and Kumar.
Algorithms are described and analyzed for obtaining each cluster type. The algorithms are designed to be useful for large document and index collections. Two algorithms have been tested that find maximal complete subgraphs. An algorithm developed by Bierstone offers a significant time improvement over one suggested by Bonner.
For threshold levels T ≥ 0.6, basically the same clusters are developed regardless of the cluster definition used. In such situations one need only find the connected components of the graph to develop the clusters.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
AUGUSTSON, J. G., .&ND MINKER, J. An analysis of some graph theoretical cluster techniques. Computer Science Center Tech. Rep. No. TR70-106, U. of Maryland, College Park, Md., Jan. 1970.
|
| |
2
|
--. Experiments with graph theoretical clustering techniques. Thesis submitted to the Faculty of the Graduate School of the University of Maryland in partial fulfillment of requirements for MS degree, U. of Maryland, College Park, Md., 1969.
|
 |
3
|
|
| |
4
|
BALL, G.H. Data analysis in the social sciences: What about the details? Proc. AFIPS 1965 Fall Joint Comput. Conf., Vol. 27, Pt. 1, pp. 533-559.
|
| |
5
|
BERGE, C., AND GHOUILx-HouRI, A. Programming, Games, and Networks. Wiley, New York, 1965.
|
| |
6
|
BIERSTONE, E. Cliques and generalized cliques in a finite linear graph. Unpublished rep.
|
| |
7
|
BONNER, R .E . On some clustering techniques. IBM J. Res. Develop. 8, 1 (Jan. 1964), 22-32.
|
| |
8
|
BoRxo, H. The construction of an empirically based mathematically derived classification system. Rep. No. SP-585, System Development Corp., Santa Monica, Calif., Oct. 26, 1961.
|
| |
9
|
--. Research in document classification and file organization. Rep. No. SP-1423, System Development Corp., Santa Monica, Calif., 1963.
|
| |
10
|
-- AND BERNICK, M. D. Automatic document classification. Part II--Additional experiments. Tech. Memo TM-771/001/00, System Development Corp., Santa Monica, Calif., Oct. 18, 1963.
|
| |
11
|
COSATI subject category list (DoD--modified). AD-624 000, Defense Documentation Center, Defense Supply Agency, Oct. 1965.
|
| |
12
|
DALE, A. G ., AND DALE, N. Some clumping experiments for information retrieval. LRC 64-WPI1, Linguistic Research Center, U. of Texas, Austin, Texas, Feb. 1964.
|
| |
13
|
D&TTOL&, R.T. A fast algorithm for automatic classification. In Information Storage and Retrieval series, Scientific Rep. No. ISR-14, Cornell U., Ithaca, N. Y.; Oct. 1968, Ch. V.
|
| |
14
|
DOYL, L.B. Breaking the cost barrier in automatic classification. Rep. No. SP-2516, System Development Corp., Santa Monica, Calif., July 1966.
|
| |
15
|
EURATOM-thesaurus: Keywords used within EURATOM's nuclear energy documentation project. Directorate "Dissemination of Information," Center for Information and Documentation, 1964.
|
 |
16
|
|
| |
17
|
GIULIANO, V. E ., AND JONES, P.E. Linear associative information retrieval. In Howerton, P. W., and Weeks, D. C. (Eds.), Vistas in Information Handling, Vol. 1, Spartan Books, Washington, D. C., 1963, Ch. 2, pp. 30-46.
|
| |
18
|
IvlE, E. L. Search procedures based on measures of relatedness between documents. Ph.D. thesis, MIT, Cambridge, Mass., May 1966.
|
| |
19
|
|
| |
20
|
KOCHEN, M., AND WONG, E. Concerning the possibility of a cooperative information exchange. IBM J. Res. Develop. 6, 2 (April 1962), 270-271.21. KUHNS, J .L . The continuum of coefficients of association. In Stevens, M. E., Giuliano,
|
| |
21
|
V. E., and Heilprin, L. B. (Eds.), Proc. Symposium in Statistical Association Methods for Mechanized Documentation, US Dep. of Commerce, Washington, D. C., Dec. 1965.
|
| |
22
|
--. Mathematical analysis of correlation clusters. In Word correlation and automatic indexing, Progress Rep. No. 2, C82-OU1, Ramo-Wooldridge, Canoga Park, Calif., Dec. 1959, Appendix D.
|
| |
23
|
LESK, M. E. Word-word association in document retrieval systems. In Information storage and retrieval, Scientific Rep. No. ISR-13, Cornell U., Ithaca, New York, Jan.. 1968, Section IX.
|
 |
24
|
|
| |
25
|
NEEDHAM, R.M. A method for using computers in information classification. In Popplewell, C. M. CEd0, Information Processing 1962, Proc. IFIP Congr. 62, North-Holland, Amsterdam, 1963, pp. 284--287.
|
| |
26
|
--. The termination of certain iterative processes. Memorandum RM-5188-PR, Rand Corp., Santa Monica, Calif., Nov. 1966.
|
| |
27
|
The theory of clumps II. Rep. No. ML 139, Cambridge Language Research Unit, Cambridge, England, March 1961.
|
| |
28
|
OGILVrE, J.C. The distribution of number and size of connected components in random graphs of medium size. In Morrell, A. J. H. (Ed.), Information Processing 68, Proc. IFIP Congress 68, Vol. 2---Hardware, Applications, North-Holland, Amsterdam, 1969, pp. 1527-1530.
|
| |
29
|
ORE, O. Graphs and Their Use. Random House, New York, 1963.
|
| |
30
|
PARKER-RHODES, A. F. Contributions to the theory of clumps: The usefulness and feasibility of the theory. Rep. No. ML 138, Cambridge Language Research Unit, Cambridge, England, March 1961.
|
| |
31
|
AND NEEDHAM, R.M. The theory of clumps. Rep. No. ML 126, Cambridge Language Research Unit, Cambridge, England, Feb. 1960.
|
| |
32
|
PRICE, N., XND SCHIINOVICH, S. A clustering experiment: First step towards a computer-generated classification scheme. Inform. Storage and Retrieval $, 3 (Aug. 1968), 271-280.
|
| |
33
|
ROCCHIO, J. J., JR. Document retrieval systems--optimization and evaluation. Scientific Rep. No. ISR-10, Computation Laboratory, Harvard U., Cambridge, Mass., 1966.
|
| |
34
|
ROGERS, D., AND TANIOTO, T. A computer program for classifying plants. Science 132 (Oct. 1960), 1115-1118.
|
| |
35
|
|
| |
36
|
SHEPXRD, M. J., AND WILLMOTT, A .J . Cluster analysis on the Atlas computer. Computer J. 11, 1 (May 1968), 57-62.
|
| |
37
|
SPXRCK-JONES, K. Automatic term classification and information retrieval. In Morrell, A. J. H. (Ed.), Information Processing 68, Proc. IFIP Congress 68, Vol. 2---Hardware, Applications, North-Holland, Amsterdam, 1969, pp. 1290-1295.
|
| |
38
|
Mechanized semantic classification. 1961 International Conf. on Machine Translation of Languages and Applied Language Analysis, National Physics Laboratory Symposium No. 13, Volume II, 1962, Paper 25, pp. 417--435.
|
| |
39
|
-- AND JACKSON, D. Current approaches to classification and clump-finding at Cambridge Language Research Unit. Computer J. 10, 1 (May 1967), 29-37.
|
| |
40
|
STEVENS, M.E. Automatic indexing: A state of the art report. NBS Monograph 91, US Dep. of Commerce, Washington, D. C., March 1965.
|
 |
41
|
|
| |
42
|
-- AND SALISBURY, B. A. The use of the B-coefficient in information retrieval. Unpublished rep., Sept. 1967.
|
| |
43
|
TANIOTO, T. An elementary mathematical theory of classification and prediction. Rep., IBM Corp., 1958.
|
CITED BY 29
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Raffay Hamid , Siddhartha Maddi , Aaron Bobick , Irfan Essa, Unsupervised analysis of activity sequences using event-motifs, Proceedings of the 4th ACM international workshop on Video surveillance and sensor networks, October 27-27, 2006, Santa Barbara, California, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Yan Wei , Ren Maosheng , Tong Zhao , LI Xiaoming, A bandwidth management scheme support for real-time applications in wireless mesh networks, Proceedings of the 2008 ACM symposium on Applied computing, March 16-20, 2008, Fortaleza, Ceara, Brazil
|
|
|
|
|
|
Matthew C. Schmidt , Nagiza F. Samatova , Kevin Thomas , Byung-Hoon Park, A scalable, parallel algorithm for maximal clique enumeration, Journal of Parallel and Distributed Computing, v.69 n.4, p.417-428, April, 2009
|
|
|
|
|
|
|
|
|
|
|
|
Raffay Hamid , Siddhartha Maddi , Amos Johnson , Aaron Bobick , Irfan Essa , Charles Isbell, A novel sequence representation for unsupervised analysis of human activities, Artificial Intelligence, v.173 n.14, p.1221-1244, September, 2009
|
|
|
Weiming Hu , Wei Hu , Nianhua Xie , Steve Maybank, Unsupervised active learning based on hierarchical graph-theoretic clustering, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, v.39 n.5, p.1147-1161, October 2009
|
|
|
|
|