ABSTRACT
We present a code-generation-based optimization approach to bringing performance and scalability to distributed stream processing applications. We express stream processing applications using an operator-based, stream-centric language called SPADE, which supports composing distributed data flow graphs out of toolkits of type-generic operators. A major challenge in building such applications is to find an effective and flexible way of mapping the logical graph of operators into a physical one that can be deployed on a set of distributed nodes. This involves finding how best operators map to processes and how best processes map to computing nodes. In this paper, we take a two-stage optimization approach, where an instrumented version of the application is first generated by the SPADE compiler to profile and collect statistics about the processing and communication characteristics of the operators within the application. In the second stage, the profiling information is fed to an optimizer to come up with a physical data flow graph that is deployable across nodes in a computing cluster. This approach not only creates highly optimized applications that are tailored to the underlying computing and networking infrastructure, but also makes it possible to re-target the application to a different hardware setup by simply repeating the optimization step and re-compiling the application to match the physical flow graph produced by the optimizer. Using real-world applications, from diverse domains such as finance and radio-astronomy, we demonstrate the effectiveness of our approach on System S -- a large-scale, distributed stream processing platform.
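The two mapping decisions described above (operators to processes, then processes to nodes) can be illustrated with a small sketch. This is not the optimizer used in the paper, whose algorithm is not given in this abstract; it swaps in a simple greedy heuristic for illustration only. The `Profile` structure, the function names `fuse_operators` and `place_processes`, the operator names, and all numbers are hypothetical stand-ins for the statistics an instrumented run might collect.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    """Hypothetical statistics gathered from an instrumented first-stage run."""
    cpu: dict      # operator name -> fraction of one core it consumes
    traffic: dict  # (upstream operator, downstream operator) -> bytes per second


def fuse_operators(profile, capacity=1.0):
    """Greedy stand-in for the operator-to-process mapping.

    Visits streams from heaviest to lightest and merges the two groups a
    stream connects whenever their combined CPU load fits in `capacity`,
    so that the busiest streams stay inside a single process.
    Returns a deterministically ordered list of frozensets, one per process.
    """
    group_of = {op: frozenset([op]) for op in profile.cpu}
    load = {group: profile.cpu[next(iter(group))] for group in group_of.values()}
    by_rate = sorted(profile.traffic.items(), key=lambda item: -item[1])
    for (src, dst), _rate in by_rate:
        a, b = group_of[src], group_of[dst]
        if a == b or load[a] + load[b] > capacity:
            continue  # already fused, or the merged process would be overloaded
        merged = a | b
        load[merged] = load.pop(a) + load.pop(b)
        for op in merged:
            group_of[op] = merged
    return sorted(set(group_of.values()), key=lambda group: sorted(group))


def place_processes(processes, profile, nodes, node_capacity=1.0):
    """First-fit-decreasing stand-in for the process-to-node mapping."""
    def load(process):
        return sum(profile.cpu[op] for op in process)

    remaining = {node: node_capacity for node in nodes}
    placement = {}
    for process in sorted(processes, key=lambda p: (-load(p), sorted(p))):
        for node in nodes:
            if remaining[node] >= load(process):
                remaining[node] -= load(process)
                placement[process] = node
                break
        else:
            raise ValueError("not enough node capacity for %s" % sorted(process))
    return placement


# Made-up profile for a four-operator pipeline (all values are assumptions).
profile = Profile(
    cpu={"source": 0.25, "parse": 0.5, "aggregate": 0.5, "sink": 0.125},
    traffic={
        ("source", "parse"): 1000,
        ("parse", "aggregate"): 200,
        ("aggregate", "sink"): 50,
    },
)
processes = fuse_operators(profile)
placement = place_processes(processes, profile, nodes=["node1", "node2"])
for process, node in placement.items():
    print(sorted(process), "->", node)
```

Re-targeting to a different hardware setup, in this sketch, amounts to calling the two functions again with a different `capacity`, `node_capacity`, or `nodes` list while reusing the same profile.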