|
ABSTRACT
Complex queries over high speed data streams often need to rely on approximations to keep up with their input. The research community has developed a rich literature on approximate streaming algorithms for this application. Many of these algorithms produce samples of the input stream, providing better properties than conventional random sampling. In this paper, we abstract the stream sampling process and design a new stream sample operator. We show how it can be used to implement a wide variety of algorithms that perform sampling and sampling-based aggregations. Also, we show how to implement the operator in Gigascope - a high speed stream database specialized for IP network monitoring applications. As an example study, we apply the operator within such an enhanced Gigascope to perform subset-sum sampling which is of great interest for IP network management. We evaluate this implemention on a live, high speed internet traffic data stream and find that (a) the operator is a flexible, versatile addition to Gigascope suitable for tuning and algorithm engineering, and (b) the operator imposes only a small evaluation overhead. This is the first operational implementation we know of, for a wide variety of stream sampling algorithms at line speed within a data stream management system.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
N. Duffield, C. Lund, M. Thorup. Learn more, sample less: control of volume and variance in network measurement. SIGCOMM 2001 Measurement workshop.
|
| |
3
|
G. Manku and R. Motwani. Approximate frequency counts over data streams. Proc. VLDB, 2002, 346--357.
|
 |
4
|
Brian Babcock , Shivnath Babu , Mayur Datar , Rajeev Motwani , Jennifer Widom, Models and issues in data stream systems, Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 03-05, 2002, Madison, Wisconsin
[doi> 10.1145/543613.543615]
|
| |
5
|
S. Muthukrishnan. Data stream algorithms and applications. http://www.cs,rutgers.edu/~stream-1-1.ps.
|
| |
6
|
A. Singh. http://www.cs.ucsb.edu/~ambuj/Courses/multimediaDB/sampling.pdf
|
 |
7
|
Amitabha Bagchi , Amitabh Chaudhary , David Eppstein , Michael T. Goodrich, Deterministic sampling and range counting in geometric data streams, Proceedings of the twentieth annual symposium on Computational geometry, June 08-11, 2004, Brooklyn, New York, USA
[doi> 10.1145/997817.997842]
|
 |
8
|
|
| |
9
|
|
| |
10
|
|
| |
11
|
SQL Server 2005. <u>http://www.microsoft.com/technet/prodtechnol/sql/2005/evaluate/dwsqlsy.mspx</u>
|
| |
12
|
P. Gulutzan and T. Pelzer, SQL-99 Complete, Really, CMP Books, 1999.
|
| |
13
|
D. Carney, U. Cetinternel, M. Cherniack, C. Coney, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul and S. Zdonik. Monitoring Streams - A New Class of Data Management Applications. VLDB 2002.
|
 |
14
|
Graham Cormode , Theodore Johnson , Flip Korn , S. Muthukrishnan , Oliver Spatscheck , Divesh Srivastava, Holistic UDAFs at streaming speeds, Proceedings of the 2004 ACM SIGMOD international conference on Management of data, June 13-18, 2004, Paris, France
[doi> 10.1145/1007568.1007575]
|
 |
15
|
Marc Gyssens , Laks V. S. Lakshmanan , Iyer N. Subramanian, Tables as a paradigm for querying and restructuring (extended abstract), Proceedings of the fifteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, p.93-103, June 04-06, 1996, Montreal, Quebec, Canada
[doi> 10.1145/237661.237688]
|
 |
16
|
Jianjun Chen , David J. DeWitt , Feng Tian , Yuan Wang, NiagaraCQ: a scalable continuous query system for Internet databases, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.379-390, May 15-18, 2000, Dallas, Texas, United States
|
| |
17
|
S. Chandrasekaran et al. TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. Proc. CIDR 2003.
|
 |
18
|
|
| |
19
|
|
| |
20
|
R. Motwani, J. Widom, A. Arasu. B. Babcock, S. Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein, R. Varma. Query Processing, Resource Management, and Approximation in a Data Stream management System. In CIDR, pages 245--256, Jan 2003.
|
 |
21
|
|
| |
22
|
H. Wang, C. Zaniolo and C. Luo, ATLAS: A Small but Complete SQL Extension for Data Mining and Data Streams, Proc. VLDB 2003 pg 5--20.
|
| |
23
|
Y.-N. Law, H. Wang and C. Zaniolo, Query Languages and Data Models for Database Sequences and Data Streams, Proc. VLDB 2004 pg 492--503.
|
| |
24
|
|
CITED BY 15
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Yijian Bai , Hetal Thakkar , Haixun Wang , Chang Luo , Carlo Zaniolo, A data stream language and system designed for power and extensibility, Proceedings of the 15th ACM international conference on Information and knowledge management, November 06-11, 2006, Arlington, Virginia, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Edith Cohen , Nick Duffield , Haim Kaplan , Carsten Lund , Mikkel Thorup, Stream sampling for variance-optimal estimation of subset sums, Proceedings of the Nineteenth Annual ACM -SIAM Symposium on Discrete Algorithms, p.1255-1264, January 04-06, 2009, New York, New York
|
|
|
|
|
|
|
|