ACM Home Page
Please provide us with feedback. Feedback
The "DGX" distribution for mining massive, skewed data
Full text PdfPdf (578 KB)
Source International Conference on Knowledge Discovery and Data Mining archive
Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining table of contents
San Francisco, California
Pages: 17 - 26  
Year of Publication: 2001
ISBN:1-58113-391-X
Authors
Zhiqiang Bi  Carnegie Mellon University
Christos Faloutsos  Carnegie Mellon University
Flip Korn  AT&T Labs-Research
Sponsors
SIGMOD: ACM Special Interest Group on Management of Data
AAAI : American Association for Artificial Intelligence
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 10,   Downloads (12 Months): 56,   Citation Count: 18
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/502512.502521
What is a DOI?

ABSTRACT

Skewed distributions appear very often in practice. Unfortunately, the traditional Zipf distribution often fails to model them well. In this paper, we propose a new probability distribution, the Discrete Gaussian Exponential (DGX), to achieve excellent fits in a wide variety of settings; our new distribution includes the Zipf distribution as a special case. We present a statistically sound method for estimating the DGX parameters based on maximum likelihood estimation (MLE). We applied DGX to a wide variety of real world data sets, such as sales data from a large retailer chain, us-age data from AT&T, and Internet clickstream data; in all cases, DGX fits these distributions very well, with almost a 99% correlation coefficient in quantile-quantile plots. Our algorithm also scales very well because it requires only a single pass over the data. Finally, we illustrate the power of DGX as a new tool for data mining tasks, such as outlier detection.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
L. Adamic. Zipf, power-laws, and pareto - a ranking tutorial, http://www.parc.xerox-com/istl/groups/iea /papers/ranking/ranking_html.
 
2
B. Berry and W. Garrison, Alternate explanations of urban rank size relations" Annals of the Association of American Geographers, 48:83-91, 1958.
 
3
L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web caching and zipf-like distributions: Evidence and implications. In IEEE Infocom '9g, New York, NY, Mar. 1999.
 
4
 
5
 
6
W. Fox and W. Lasker. The distribution of surname frequencies. International Statistical Review, 51:81-87, 1983.
 
7
Galton. The geometric mean in vital and social statistics. Proceedings of the Royal Society of London, 29:365-367, 1879.
 
8
 
9
P. Halmos. Random Mms. Annals of Mathematical Statistics, 15:182-189, 1944.
 
10
G. Herdan. Small Particle Statistics. Butterworth's, London, 2 edition, 1060.
 
11
B. Hill. The rank-frequency form of zipf's law. Journal of the American Statistical Association, 69(348):1017--1026, 1974.
 
12
The MathWorks Inc. Matlab user's guide.
 
13
J. Laherr6re. "parabolic fraetal" distributions in nature, ht tp://www.hubber t peak.com/laherrere /fractal.htm.
 
14
B. Laherrere and D. Sornette. Stretched exponential distributions in nature and economy: "fat tails" with characteristic scales. European Physical Journal, B(2):525-539, 1998.
 
15
S. K. N.I. Johnson and N. Baiakrishnan. Continuous Univaviate Distribution, Volume 1. John Wiley & Sons, Inc., U.S.A, 1994.
 
16
V. Pareto. Cours d'Economie Politique. Rouge and Cie, Lausanne and Paris, 1897.
 
17
H. Simon. On a class of skew distribution functions. Biometrika, 42:425-440, 1955.
 
18
 
19
G. Yule. A mathematical theory of evolution, based on conclusions of dr. i.e. willis, Lr.s. PhilosophieaI dimnsaetions of the Royal Society of London, 213:21-87, 1923.
 
20
G. Zipf. Human Behavior and Principle of Least Effort: An Introduction to Human Ecology. Addison Wesley, Cambridge, Massachusetts, 1949.

CITED BY  18

Collaborative Colleagues:
Zhiqiang Bi: colleagues
Christos Faloutsos: colleagues
Flip Korn: colleagues