|
ABSTRACT
Skewed distributions appear very often in practice. Unfortunately, the traditional Zipf distribution often fails to model them well. In this paper, we propose a new probability distribution, the Discrete Gaussian Exponential (DGX), to achieve excellent fits in a wide variety of settings; our new distribution includes the Zipf distribution as a special case. We present a statistically sound method for estimating the DGX parameters based on maximum likelihood estimation (MLE). We applied DGX to a wide variety of real world data sets, such as sales data from a large retailer chain, us-age data from AT&T, and Internet clickstream data; in all cases, DGX fits these distributions very well, with almost a 99% correlation coefficient in quantile-quantile plots. Our algorithm also scales very well because it requires only a single pass over the data. Finally, we illustrate the power of DGX as a new tool for data mining tasks, such as outlier detection.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
L. Adamic. Zipf, power-laws, and pareto - a ranking tutorial, http://www.parc.xerox-com/istl/groups/iea /papers/ranking/ranking_html.
|
| |
2
|
B. Berry and W. Garrison, Alternate explanations of urban rank size relations" Annals of the Association of American Geographers, 48:83-91, 1958.
|
| |
3
|
L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web caching and zipf-like distributions: Evidence and implications. In IEEE Infocom '9g, New York, NY, Mar. 1999.
|
| |
4
|
|
| |
5
|
|
| |
6
|
W. Fox and W. Lasker. The distribution of surname frequencies. International Statistical Review, 51:81-87, 1983.
|
| |
7
|
Galton. The geometric mean in vital and social statistics. Proceedings of the Royal Society of London, 29:365-367, 1879.
|
| |
8
|
|
| |
9
|
P. Halmos. Random Mms. Annals of Mathematical Statistics, 15:182-189, 1944.
|
| |
10
|
G. Herdan. Small Particle Statistics. Butterworth's, London, 2 edition, 1060.
|
| |
11
|
B. Hill. The rank-frequency form of zipf's law. Journal of the American Statistical Association, 69(348):1017--1026, 1974.
|
| |
12
|
The MathWorks Inc. Matlab user's guide.
|
| |
13
|
J. Laherr6re. "parabolic fraetal" distributions in nature, ht tp://www.hubber t peak.com/laherrere /fractal.htm.
|
| |
14
|
B. Laherrere and D. Sornette. Stretched exponential distributions in nature and economy: "fat tails" with characteristic scales. European Physical Journal, B(2):525-539, 1998.
|
| |
15
|
S. K. N.I. Johnson and N. Baiakrishnan. Continuous Univaviate Distribution, Volume 1. John Wiley & Sons, Inc., U.S.A, 1994.
|
| |
16
|
V. Pareto. Cours d'Economie Politique. Rouge and Cie, Lausanne and Paris, 1897.
|
| |
17
|
H. Simon. On a class of skew distribution functions. Biometrika, 42:425-440, 1955.
|
| |
18
|
Virgílio Almeida , Azer Bestavros , Mark Crovella , Adriana de Oliveira, Characterizing reference locality in the WWW, Proceedings of the fourth international conference on on Parallel and distributed information systems, p.92-107, December 18-20, 1996, Miami Beach, Florida, United States
|
| |
19
|
G. Yule. A mathematical theory of evolution, based on conclusions of dr. i.e. willis, Lr.s. PhilosophieaI dimnsaetions of the Royal Society of London, 213:21-87, 1923.
|
| |
20
|
G. Zipf. Human Behavior and Principle of Least Effort: An Introduction to Human Ecology. Addison Wesley, Cambridge, Massachusetts, 1949.
|
CITED BY 18
|
|
|
|
|
Jure Leskovec , Jon Kleinberg , Christos Faloutsos, Graphs over time: densification laws, shrinking diameters and possible explanations, Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, August 21-24, 2005, Chicago, Illinois, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Salvatore J. Stolfo , Shlomo Hershkop , Chia-Wei Hu , Wei-Jen Li , Olivier Nimeskern , Ke Wang, Behavior-based modeling and its application to Email analysis, ACM Transactions on Internet Technology (TOIT), v.6 n.2, p.187-221, May 2006
|
|
|
|
|
|
|
|
|
|
|
|
Mukund Seshadri , Sridhar Machiraju , Ashwin Sridharan , Jean Bolot , Christos Faloutsos , Jure Leskove, Mobile call graphs: beyond power-law and lognormal distributions, Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2008, Las Vegas, Nevada, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|