ACM Home Page
Please provide us with feedback. Feedback
Small synopses for group-by query verification on outsourced data streams
Full text PdfPdf (402 KB)
Source
ACM Transactions on Database Systems (TODS) archive
Volume 34 ,  Issue 3  (August 2009) table of contents
Article No. 15  
Year of Publication: 2009
ISSN:0362-5915
Authors
Ke Yi  Hong Kong University of Science and Technology, Hong Kong, China
Feifei Li  Florida State University, Tallahassee, FL
Graham Cormode  AT&T Labs——Research, Florham Park, NJ
Marios Hadjieleftheriou  AT&T Labs——Research, Florham Park, NJ
George Kollios  Boston University, Boston, MA
Divesh Srivastava  AT&T Labs——Research, Florham Park, NJ
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 67,   Downloads (12 Months): 128,   Citation Count: 0
Additional Information:

abstract   references   index terms  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1567274.1567277
What is a DOI?

ABSTRACT

Due to the overwhelming flow of information in many data stream applications, data outsourcing is a natural and effective paradigm for individual businesses to address the issue of scale. In the standard data outsourcing model, the data owner outsources streaming data to one or more third-party servers, which answer queries posed by a potentially large number of clients on the data owner's behalf. Data outsourcing intrinsically raises issues of trust, making outsourced query assurance on data streams a problem with important practical implications. Existing solutions proposed in this model all build upon cryptographic primitives such as signatures and collision-resistant hash functions, which only work for certain types of queries, for example, simple selection/aggregation queries.

In this article, we consider another common type of queries, namely, “GROUP BY, SUM” queries, which previous techniques fail to support. Our new solutions are not based on cryptographic primitives, but instead use algebraic and probabilistic techniques to compute a small synopsis on the true query result, which is then communicated to the client so as to verify the correctness of the query result returned by the server. The synopsis uses a constant amount of space irrespective of the result size, has an extremely small probability of failure, and can be maintained using no extra space when the query result changes as elements stream by. We then generalize our synopsis to allow some tolerance on the number of erroneous groups, in order to support semantic load shedding on the server. When the number of erroneous groups is indeed tolerable, the synopsis can be strengthened so that we can locate and even correct these errors. Finally, we implement our techniques and perform an empirical evaluation using live network traffic.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Abadi, D., Ahmad, Y., Balazinska, M., Çetintemel, U., Cherniack, M., Hwang, J., Lindner, W., Maskey, A., Rasin, A., Ryvkina, E., Tatbul, N., Xing, Y., and Zdonik, S. 2005. The design of the Borealis stream processing engine. In Proceedings of the Biennial Conference on Innovative Data Systems Research.
 
2
Alon, N., Matias, Y., and Szegedy, M. 1996. The space complexity of approximating the frequency moments. In Proceedings of the ACM Symposium on Theory of Computation, 20--29.
 
3
Arasu, A., Babcock, B., Babu, S., Datar, M., Ito, K., Nishizawa, I., Rosenstein, J., and Widom, J. 2003. STREAM: The Stanford stream data manager. IEEE Data Engin. Bull. 26, 1, 19--26.
 
4
Arasu, A. and Manku, G. S. 2004. Approximate counts and quantiles over sliding windows. In Proceedings of the ACM Symposium on Principles of Database Systems, 286--296.
 
5
Arlitt, M. and Jin, T. ITA, 1998 world cup Web site access logs. http://www.acm.org/sigcomm/ITA/.
 
6
Babcock, B., Babu, S., Datar, M., Motwani, R., and Widom, J. 2002. Models and issues in data stream systems. In Proceedings of the ACM Symposium on Principles of Database Systems.
 
7
Babcock, B., Datar, M., and Motwani, R. 2004. Load shedding for aggregation queries over data streams. In Proceedings of the IEEE International Conference on Data Engineering, 350--361.
 
8
Babcock, B., Datar, M., Motwani, R., and O'Callaghan, L. 2003. Maintaining variance and k-medians over data stream windows. In Proceedings of the ACM Symposium on Principles of Database Systems, 234--243.
 
9
Bar-Yossef, Z., Jayram, T. S., Kumar, R., Sivakumar, D., and Trevisan, L. 2002. Counting distinct elements in a data stream. In Proceedings of the International Workshop on Randomization and Approximation Techniques (RANDOM), 1--10.
 
10
Bellare, M., Goldreich, O., and Goldwasser, S. 1994. Incremental cryptography: The case of hashing and signing. In Proceedings of the Conference on Advances in Cryptology (CRYPTO), 216--233.
 
11
Bellare, M., Guerin, R., and Rogaway, P. 1995. Xor macs: New methods for message authentication using finite pseudorandom functions. In Proceedings of the Conference on Advances in Cryptology (CRYPTO), 15--28.
 
12
Bertino, E., Carminati, B., Ferrari, E., Thuraisingham, B., and Gupta, A. 2004. Selective and authentic third-party distribution of XML documents. IEEE Trans. Knowl. Data Engin. 16, 10, 1263--1278.
 
13
Carney, D., Cetintemel, U., Cherniack, M., Convey, C., Lee, S., Seidman, G., Stonebraker, M., Tatbul, N., and Zdonik, S. 2003. Monitoring streams—A new class of data management applications. In Proceedings of the International Conference on Very Large Databases, 215--226.
 
14
Chan, H., Perrig, A., and Song, D. 2006. Secure hierarchical in-network aggregation in sensor networks. In Proceedings of the ACM Conference on Computer and Communications Security, 278--287.
 
15
Chandrasekaran, S., Cooper, O., Deshpande, A., Franklin, M. J., Hellerstein, J. M., Hong, W., Krishnamurthy, S., Madden, D., Raman, V., Reiss, F., and Shah, M. A. 2003. TelegraphCQ: Continuous dataflow processing for an uncertain world. In Proceedings of the Biennial Conference on Innovative Data Systems Research.
 
16
Cormode, G. and Muthukrishnan, S. 2003. What's hot and what's not: Tracking most frequent items dynamically. In Proceedings of the ACM Symposium on Principles of Database Systems, 296--306.
 
17
Cormode, G. and Muthukrishnan, S. 2005. An improved data stream summary: The count-min sketch and its applications. J. Algor. 55, 1, 58--75.
 
18
Cormode, G., Muthukrishnan, S., and Rozenbaum, I. 2005. Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. In Proceedings of the International Conference on Very Large Databases, 25--36.
 
19
Cranor, C., Johnson, T., Spatscheck, O., and Shkapenyuk, V. 2003. Gigascope: A stream database for internet databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 647--651.
 
20
Datar, M., Gionis, A., Indyk, P., and Motwani, R. 2002. Maintaining stream statistics over sliding windows. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 635--644.
 
21
Devanbu, P., Gertz, M., Martel, C., and Stubblebine, S. 2003. Authentic data publication over the internet. J. Comput. Secur. 11, 3, 291--314.
 
22
Flajolet, P. and Martin, G. N. 1985. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci. 31, 2, 182--209.
 
23
Freivalds, R. 1979. Fast probabilistic algorithms. In Proceedings of the International Symposium on Mathematical Foundations of Computer Science, 57--69.
 
24
Ganguly, S., Garofalakis, M., and Rastogi, R. 2004. Tracking set-expression cardinalities over continuous update streams. The VLDB J. 13, 4, 354--369.
 
25
Garofalakis, M., Hellerstein, J. M., and Maniatis, P. 2007. Proof sketches: Verifiable in-network aggregation. In Proceedings of the IEEE International Conference on Data Engineering.
 
26
Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., and Strauss, M. 2002. How to summarize the universe: Dynamic maintenance of quantiles. In Proceedings of the International Conference on Very Large Databases, 454--465.
 
27
Greenwald, M. and Khanna, S. 2001. Space-Efficient online computation of quantile summaries. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 58--66.
 
28
Hacigumus, H., Iyer, B. R., and Mehrotra, S. 2002. Providing database as a service. In Proceedings of the IEEE International Conference on Data Engineering, 29--40.
 
29
Hammad, M. A., Mokbel, M. F., Ali, M. H., Aref, W. G., Catlin, A. C., Elmagarmid, A. K., Eltabakh, M., Elfeky, M. G., Ghanem, T. M., Gwadera, R., Ilyas, I. F., Marzouk, M., and Xiong, X. 2004. Nile: A query processing engine for data streams. In Proceedings of the IEEE International Conference on Data Engineering, 851.
 
30
Indyk, P. and Woodruff, D. 2005. Optimal approximations of the frequency moments of data streams. In Proceedings of the ACM Symposium on Theory of Computation.
 
31
Karp, R. M., Shenker, S., and Papadimitriou, C. H. 2003. A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Datab. Syst. 28, 1, 51--55.
 
32
Knuth, D. E. 1997. The Art of Computer Programming. Addison-Wesley.
 
33
Kushilevitz, E. and Nisan, N. 1997. Communication Complexity. Cambridge University Press.
 
34
Li, F., Chang, Q., Kollios, G., and Bestavros, A. 2006a. Characterizing and exploiting reference locality in data stream applications. In Proceedings of the IEEE International Conference on Data Engineering.
 
35
Li, F., Hadjieleftheriou, M., Kollios, G., and Reyzin, L. 2006b. Dynamic authenticated index structures for outsourced databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 121--132.
 
36
Li, F., Yi, K., Hadjieleftheriou, M., and Kollios, G. 2007. Proof-Infused streams: Enabling authentication of sliding window queries on streams. In Proceedings of the International Conference on Very Large Databases.
 
37
Manku, G. S. and Motwani, R. 2002. Approximate frequency counts over data streams. In Proceedings of the International Conference on Very Large Databases, 346--357.
 
38
Martel, C., Nuckolls, G., Devanbu, P., Gertz, M., Kwong, A., and Stubblebine, S. 2004. A general model for authenticated data structures. Algorithmica 39, 1, 21--41.
 
39
Metwally, A., Agrawal, D., and Abbadi, A. E. 2006. An integrated efficient solution for computing frequent and top-k elements in data streams. ACM Trans. Datab. Syst. 31, 3, 1095--1133.
 
40
Motwani, R. and Raghavan, P. 1995. Randomized Algorithms. Cambridge University Press.
 
41
Muthukrishnan, S. 2003. Data Streams: Algorithms and Applications. www.cs.rutgers.edu/~mathu/stream-1-1.ps
 
42
Nagell, T. 1981. Introduction to Number Theory, 2nd ed. Chelsea Publishing Company.
 
43
Pang, H., Jain, A., Ramamritham, K., and Tan, K.-L. 2005. Verifying completeness of relational query results in data publishing. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 407--418.
 
44
Pang, H. and Tan, K.-L. 2004. Authenticating query results in edge computing. In Proceedings of the IEEE International Conference on Data Engineering, 560--571.
 
45
Papadopoulos, S., Yang, Y., and Papadias, D. 2007. CADS: Continuous authentication on data streams. In Proceedings of the International Conference on Very Large Databases, 135--146.
 
46
Rusu, F. and Dobra, A. 2007. Pseudo-Random number generation for sketch-based estimations. ACM Trans. Database Syst. 32, 2.
 
47
Schwartz, J. T. 1980. Fast probabilistic algorithms for verification of polynomial identities. J. ACM 27, 4, 701--717.
 
48
Tatbul, N., Cetintemel, U., Zdonik, S., Cherniack, M., and Stonebraker, M. 2003. Load shedding in a data stream manager. In Proceedings of the International Conference on Very Large Databases, 309--320.
 
49
Tatbul, N. and Zdonik, S. 2006. Window-Aware load shedding for aggregation queries over data streams. In Proceedings of the International Conference on Very Large Databases, 799--810.
 
50
Thorup, M. 2000. Even strongly universal hashing is pretty fast. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms.
 
51
Wegman, M. N. and Carter, J. L. 1981. New hash functions and their use in authentication and set equality. J. Comput. Syst. Sci. 22, 3.
 
52
Yi, K., Li, F., Hadjieleftheriou, M., Kollios, G., and Srivastava, D. 2008. Randomized synopses for query assurance on data streams. In Proceedings of the IEEE International Conference on Data Engineering.
 
53
Zhang, R., Koudas, N., Ooi, B. C., and Srivastava, D. 2005. Multiple aggregations over data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 299--310.
 
54
Zhao, Q., Xu, J., and Liu, Z. 2006. Design of a novel statistics counter architecture with optimal space and time efficiency. In Proceedings of the Conference on ACM SIGMETRICS, 323--334.