|
ABSTRACT
Data stream applications have made use of statistical summaries to reason about the data using nonparametric tools such as histograms, heavy hitters, and join sizes. However, relatively little attention has been paid to modeling stream data parametrically, despite the potential this approach has for mining the data. The challenges to do model fitting at streaming speeds are both technical -- how to continually find fast and reliable parameter estimates on high speed streams of skewed data using small space -- and conceptual -- how to validate the goodness-of-fit and stability of the model online.In this paper, we show how to fit hierarchical (binomial multifractal) and non-hierarchical (Pareto) power-law models on a data stream. We address the technical challenges using an approach that maintains a sketch of the data stream and fits least-squares straight lines; it yields algorithms that are fast, space-efficient, and provide approximations of parameter value estimates with a priori quality guarantees relative to those obtained offline. We address the conceptual challenge by designing fast methods for online goodness-of-fit measurements on a data stream; we adapt the statistical testing technique of examining the quantile-quantile (q-q) plot, to perform online model validation at streaming speeds.As a concrete application of our techniques, we focus on network traffic data which has been shown to exhibit skewed distributions. We complement our analytic and algorithmic results with experiments on IP traffic streams in AT&T's Gigascope® data stream management system, to demonstrate practicality of our methods at line speeds. We measured the stability and robustness of these models over weeks of operational packet data in an IP network. In addition, we study an intrusion detection application, and demonstrate the potential of online parametric modeling.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
L. Adamic. Zipf, power-law, pareto - a ranking tutorial. http://www.hpl.hp.com/research/idl/papers/ranking/, 2000.
|
 |
2
|
|
 |
3
|
Brian Babcock , Shivnath Babu , Mayur Datar , Rajeev Motwani , Jennifer Widom, Models and issues in data stream systems, Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 03-05, 2002, Madison, Wisconsin
[doi> 10.1145/543613.543615]
|
| |
4
|
|
| |
5
|
|
 |
6
|
Zhiqiang Bi , Christos Faloutsos , Flip Korn, The "DGX" distribution for mining massive, skewed data, Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, p.17-26, August 26-29, 2001, San Francisco, California
[doi> 10.1145/502512.502521]
|
| |
7
|
CIDR, http://www.webopedia.com/TERM/C/CIDR.html.
|
 |
8
|
|
 |
9
|
Graham Cormode , Theodore Johnson , Flip Korn , S. Muthukrishnan , Oliver Spatscheck , Divesh Srivastava, Holistic UDAFs at streaming speeds, Proceedings of the 2004 ACM SIGMOD international conference on Management of data, June 13-18, 2004, Paris, France
[doi> 10.1145/1007568.1007575]
|
| |
10
|
G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. In LATIN, 2004.
|
| |
11
|
G. Cormode and S. Muthukrishnan. Summarizing and mining skewed data streams. In SDM, 2005.
|
 |
12
|
|
| |
13
|
Mayur Datar , Aristides Gionis , Piotr Indyk , Rajeev Motwani, Maintaining stream statistics over sliding windows: (extended abstract), Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, p.635-644, January 06-08, 2002, San Francisco, California
|
| |
14
|
A. Deshpande, C. Guestrin, S. Madden, J.M. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. In VLDB, 2004.
|
| |
15
|
M. Evans, N. Hastings, and B. Peacock. Statistical Distributions. Wiley, New York, 3rd edition, 2000.
|
 |
16
|
|
| |
17
|
|
| |
18
|
P. Flajolet and G.N. Martin. Probabilistic counting. In FOCS, 1983.
|
| |
19
|
X. Gabaix, P. Gopikrishnan, V. Plerou, and H.E. Stanley. A theory of power law distributions in financial market fluctuations. 423:267--270, 2003.
|
 |
20
|
Johannes Gehrke , Flip Korn , Divesh Srivastava, On computing correlated aggregates over continual data streams, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.13-24, May 21-24, 2001, Santa Barbara, California, United States
|
| |
21
|
A.C. Gilbert, W. Willinger, and A. Feldmann. Scaling analysis of conservative cascades, with applications to network traffic. In IEEE Transaction on Information Theory, volume 45, pages 971--991, 1999.
|
| |
22
|
T. Krishnan G.J. McLachlan. The EM Algorithm and Extensions. Wiley-Interscience, 1996.
|
| |
23
|
C. Jr, A. Traina, L. Wu, and C. Faloutsos. Fast feature selection using the fractal dimension. In SBBD, 2000.
|
 |
24
|
|
 |
25
|
|
| |
26
|
Darpa intrusion detection evaluation. http://www.ll.mit.edu/IST/ideval/index.html.
|
| |
27
|
|
| |
28
|
B.B. Mandelbrot. Fractals and Scaling in Finance. Springer-Verlag, New York, 1997.
|
| |
29
|
|
| |
30
|
S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Falout-sos. Loci: Fast outlier detection using the local correlation integral. In ICDE, 2003.
|
| |
31
|
S.I. Resnick. Heavy tail modeling and teletraffic data. The Annals of Statistics, 25:1805--1869, 1997.
|
| |
32
|
M. Roughan and C. Kalmanek. Pragmatic modeling of broad-band access traffic. In Computer Communications 26(8), 2003.
|
| |
33
|
M. Schroeder. Fractals, Chaos, Power Laws: Minutes From an Infinite Paradise. W.H. Freeman and Company, New York, 1991.
|
| |
34
|
M. Wang, T. Madhyastha, N.H. Chan, S. Papadimitriou, and C. Faloutsos. Data mining meets performance evaluation: Fast algorithm for modeling bursty traffic. In ICDE, 2002.
|
 |
35
|
|
| |
36
|
W. Willinger and V. Paxson. Where mathematics meets the internet. Notices of the American Mathematical Society, 45(8):961--970, 1998.
|
| |
37
|
W. Willinger, V. Paxson, and M.S. Taqqu. Self-similarity and Heavy Tails: Structural Modeling of Network Traffic. Chapman & Hall, New York, 1998.
|
| |
38
|
|
|