ABSTRACT
Data stream processing systems have become ubiquitous in academic and commercial sectors, with application areas that include financial services, network traffic analysis, battlefield monitoring, and traffic control. The append-only model of streams implies that input data is immutable and therefore always correct. In practice, however, streaming data sources often contend with noise (e.g., embedded sensors) or data entry errors (e.g., financial data feeds), resulting in erroneous inputs and, by implication, erroneous query results. Many data stream sources (e.g., Reuters ticker feeds) issue "revision tuples" (revisions) that amend previously issued tuples (e.g., erroneous share prices). A stream processing engine might reasonably respond to revision inputs by generating revision outputs that correct previously emitted query results. We know of no stream processing system that presently has this capability. In this paper, we describe how a stream processing engine can be extended to support revision processing via replay. Replay-based revision processing techniques assume that a stream engine maintains an archive of recent data seen on each of its input streams. These archives are queried in response to a revision, and the resulting tuples are replayed through the system to generate corrected query outputs. We first present the design and implementation of the revision processing engine for the Borealis stream processing engine [1]. We then compare techniques for archiving streams to support replay, and compare the performance and overhead of two revision processing techniques that replay input tuples to recompute, and thereby revise, previously output query results. These experiments reveal scalability issues due to the overhead required to maintain stream archives, and have motivated our current research on using sampling and data summarization (e.g., histograms) to reduce the data that must be stored in a stream archive.
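The replay idea summarized above can be illustrated with a minimal sketch. It is not the Borealis implementation: the class `ReplayingWindowAverage`, the `tuple_id` field, and the choice of a running tumbling-window average as the query are all assumptions made for illustration. The operator keeps a bounded archive of recent input tuples; when a revision arrives, it patches the archived tuple, replays the affected window, and emits a revision output. If the target tuple has already been evicted from the archive, no correction can be produced.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamTuple:
    tuple_id: int   # hypothetical unique id so a revision can name its target
    timestamp: int
    value: float


class ReplayingWindowAverage:
    """Running average per tumbling window, backed by a bounded input archive."""

    def __init__(self, window_size: int, archive_capacity: int):
        self.window_size = window_size
        # Bounded archive of recent inputs; the oldest tuples are evicted first.
        self.archive = deque(maxlen=archive_capacity)

    def _window_start(self, timestamp: int) -> int:
        return timestamp - timestamp % self.window_size

    def _average(self, window_start: int) -> float:
        # "Replay": recompute the result from the archived tuples of one window.
        values = [t.value for t in self.archive
                  if self._window_start(t.timestamp) == window_start]
        return sum(values) / len(values)

    def insert(self, tup: StreamTuple):
        """Archive a regular tuple and emit the current average of its window."""
        self.archive.append(tup)
        start = self._window_start(tup.timestamp)
        return ("result", start, self._average(start))

    def revise(self, tuple_id: int, new_value: float):
        """Apply a revision tuple; return a revision output, or None if the
        target is no longer archived and therefore cannot be replayed."""
        for i, old in enumerate(self.archive):
            if old.tuple_id == tuple_id:
                self.archive[i] = StreamTuple(old.tuple_id, old.timestamp, new_value)
                start = self._window_start(old.timestamp)
                return ("revision", start, self._average(start))
        return None


op = ReplayingWindowAverage(window_size=10, archive_capacity=100)
op.insert(StreamTuple(1, 0, 10.0))
op.insert(StreamTuple(2, 1, 30.0))   # erroneous value
corrected = op.revise(2, 20.0)       # revision tuple amends tuple 2
```

The `archive_capacity` parameter makes the trade-off visible: a larger archive allows older results to be revised, at the cost of more stored data, which is the overhead that sampling or summarization would aim to reduce.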
REFERENCES
[1] D. J. Abadi, Y. Ahmad, M. Balazinska, U. Çetintemel, M. Cherniack, J.-H. Hwang, W. Lindner, A. S. Maskey, A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, and S. Zdonik. The Design of the Borealis Stream Processing Engine. In Second Biennial Conference on Innovative Data Systems Research (CIDR 2005), Asilomar, CA, January 2005.
[2] Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Keith Ito, Itaru Nishizawa, Justin Rosenstein, Jennifer Widom. STREAM: the Stanford stream data manager (demonstration description). Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 09-12, 2003, San Diego, California. doi: 10.1145/872757.872854
[3] Arvind Arasu, Mitch Cherniack, Eduardo Galvez, David Maier, Anurag S. Maskey, Esther Ryvkina, Michael Stonebraker, Richard Tibbetts. Linear Road: a stream data management benchmark. Proceedings of the Thirtieth international conference on Very large data bases, p.480-491, August 31-September 03, 2004, Toronto, Canada.
[4] Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong, Sailesh Krishnamurthy, Samuel R. Madden, Fred Reiss, Mehul A. Shah. TelegraphCQ: continuous dataflow processing. Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 09-12, 2003, San Diego, California. doi: 10.1145/872757.872857
[5] S. Chandrasekaran, A. Deshpande, M. Franklin, J. Hellerstein, W. Hong, S. Krishnamurthy, S. Madden, V. Raman, F. Reiss, and M. Shah. TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. In CIDR Conference, January 2003.
[8] T. M. Ghanem, M. A. Hammad, M. F. Mokbel, W. G. Aref, and A. K. Elmagarmid. Query Processing using Negative Tuples in Stream Query Engines. Technical Report CSD 04-040, Purdue University, 2005.
[11] Jeong-Hyon Hwang, Magdalena Balazinska, Alexander Rasin, Ugur Cetintemel, Michael Stonebraker, Stan Zdonik. High-Availability Algorithms for Distributed Stream Processing. Proceedings of the 21st International Conference on Data Engineering, p.779-790, April 05-08, 2005. doi: 10.1109/ICDE.2005.72
[12] C. S. Jensen. Temporal Database Management. PhD thesis, Aalborg University, 2000.
[13] A. S. Maskey and M. Cherniack. Replay-Based Approaches to Revision Processing in Stream Query Engines. Technical report, Brandeis University, December 2007. URL: http://www.cs.brandeis.edu/%7Eanurag/revision-techreport-07.pdf.
[14] PostgreSQL Weekly News - June 17 2007. URL: http://people.planetpostgresql.org/dfetter/index.php?/archives/123-PostgreSQL-Weekly-News-June-17-2007.html.
[18] StreamBase Systems, Inc. URL: http://www.streambase.com/.
[19] J. Widom. Trio: A System for Integrated Management of Data, Accuracy, and Lineage. In CIDR Conference, pages 262-276, January 2005.