ABSTRACT
This paper reports a study investigating the validity of the Multiple Poisson (nP) model of word distribution in document collections. An nP distribution is a mixture of n Poisson distributions with different means. We describe a practical algorithm for determining whether a given word is distributed according to an nP distribution and for computing the distribution parameters. The algorithm was applied to every word in four different document collections. It was found that over 70% of frequently occurring words and terms indeed behave according to nP distributions. The results indicate that the proportion of nP words depends on the collection size, the document length and the frequency of the individual words. Most of the nP words recognised are distributed according to a mixture of relatively few single Poisson distributions (two, three or four). There is an indication that the number of single Poisson components in the mixture depends on the collection frequency of words.
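To make the definition of an nP distribution concrete, the following is a minimal Python sketch of the mixture probability and of one common way to estimate its parameters, expectation-maximisation (EM). It is not the algorithm described in the paper, which the abstract does not specify. The function names (`poisson_pmf`, `np_pmf`, `fit_np_em`), the choice of EM, the fixed iteration count and the sample counts are all assumptions made for illustration.

```python
import math


def poisson_pmf(k, lam):
    """Probability of k occurrences under a single Poisson with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def np_pmf(k, weights, means):
    """nP mixture probability: sum over i of weights[i] * Poisson(k; means[i])."""
    return sum(w * poisson_pmf(k, m) for w, m in zip(weights, means))


def fit_np_em(counts, initial_means, iterations=200):
    """Illustrative EM fit (an assumption, not the paper's method).

    counts: within-document frequencies of one word, one entry per document.
    initial_means: starting guesses for the n component means.
    Returns (weights, means) after a fixed number of EM iterations.
    """
    n = len(initial_means)
    weights = [1.0 / n] * n
    means = list(initial_means)
    for _ in range(iterations):
        # E-step: responsibility of each component for each document.
        resp = []
        for k in counts:
            joint = [w * poisson_pmf(k, m) for w, m in zip(weights, means)]
            total = sum(joint) or 1e-300  # guard against numerical underflow
            resp.append([j / total for j in joint])
        # M-step: re-estimate the mixing weights and component means.
        for i in range(n):
            r_i = sum(r[i] for r in resp)
            weights[i] = r_i / len(counts)
            if r_i > 0:
                means[i] = sum(r[i] * k for r, k in zip(resp, counts)) / r_i
    return weights, means


# Hypothetical data: a word absent from most documents, frequent in a few.
sample_counts = [0, 0, 0, 0, 1, 0, 5, 6, 4, 0, 0, 7]
fitted_weights, fitted_means = fit_np_em(sample_counts, [0.5, 4.0])
```

In this sketch the number of components n is fixed by the length of `initial_means`. Deciding whether a word is nP at all, and with how many components, would need an additional goodness-of-fit step that is not shown here.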