ACM Home Page
Please provide us with feedback. Feedback
Digital Library logoTake a look at the new version of this page: [ beta version ]. Tell us what you think.
A code generation approach to optimizing high-performance distributed data stream processing
Full text PdfPdf (1.94 MB)
Source
Conference on Information and Knowledge Management archive
Proceeding of the 18th ACM conference on Information and knowledge management table of contents
Hong Kong, China
SESSION: DB streams, network databases table of contents
Pages: 847-856  
Year of Publication: 2009
ISBN:978-1-60558-512-3
Authors
Buğra Gedik  IBM Research, Hawthorne, NY, USA
Henrique Andrade  IBM Research, Hawthorne, NY, USA
Kun-Lung Wu  IBM Research, Hawthorne, NY, USA
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 36,   Downloads (12 Months): 62,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1645953.1646061
What is a DOI?

ABSTRACT

We present a code-generation-based optimization approach to bringing performance and scalability to distributed stream processing applications. We express stream processing applications using an operator-based, stream-centric language called SPADE, which supports composing distributed data flow graphs out of toolkits of type-generic operators. A major challenge in building such applications is to find an effective and flexible way of mapping the logical graph of operators into a physical one that can be deployed on a set of distributed nodes. This involves finding how best operators map to processes and how best processes map to computing nodes. In this paper, we take a two-stage optimization approach, where an instrumented version of the application is first generated by the SPADE compiler to profile and collect statistics about the processing and communication characteristics of the operators within the application. In the second stage, the profiling information is fed to an optimizer to come up with a physical data flow graph that is deployable across nodes in a computing cluster. This approach not only creates highly optimized applications that are tailored to the underlying computing and networking infrastructure, but also makes it possible to re-target the application to a different hardware setup by simply repeating the optimization step and re-compiling the application to match the physical flow graph produced by the optimizer. Using real-world applications, from diverse domains such as finance and radio-astronomy, we demonstrate the effectiveness of our approach on System S -- a large-scale, distributed stream processing platform.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
D. J. Abadi, Y. Ahmad, M. Balazinska, U. Cetintemel, M. Cherniack, J.-H. Hwang, W. Lindner, A. S. Maskey, A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, and S. Zdonik. The design of the Borealis stream processing engine. In CIDR, 2005.
2
 
3
 
4
A. Arasu, B. Babcock, S. Babu, M. Datar, K. Ito, R. Motwani, I. Nishizawa, U. Srivastava, D. Thomas, R. Varma, and J. Widom. STREAM: The Stanford stream data manager. IEEE Data Engineering Bulletin, 26, 2003.
 
5
A. Arasu, S. Babu, and J. Widom. The CQL continuous query language: Semantic foundations and query execution. Technical report, InfoLab -- Stanford University, October 2003.
 
6
 
7
M. Beynon, R. Ferreira, T. Kurc, A. Sussman, and J. Saltz. DataCutter: Middleware for filtering very large scientific datasets on archival storage systems. In IEEE Symposium on Mass Storage Systems, MSST, 2000.
 
8
S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. R. Madden, V. Raman, F. Reiss, and M. A. Shah. TelegraphCQ: Continuous dataflow processing for an uncertain world. In CIDR, 2003.
 
9
Coral8, inc. http://www.coral8.com, May 2007.
10
 
11
 
12
Hadoop. http://hadoop.apache.org.
 
13
14
15
 
16
LOFAR. http://www.lofar.org/, June 2008.
 
17
LOIS. http://www.lois--space.net/, June 2008.
 
18
MATLAB. http://www.mathworks.com, October 2007.
19
 
20
StreamBase Systems. http://www.streambase.com, May 2008.
 
21
 
22
23
24
 
25
J. Wolf, N. Bansal, K. Hildrum, S. Parekh, D. Rajan, R. Wagle, K.-L. Wu, and L. Fleischer. SODA: An optimizing scheduler for large-scale stream-based distributed computer systems. Technical Report RC 24453, IBM Research, Dec 2007.

Collaborative Colleagues:
Buğra Gedik: colleagues
Henrique Andrade: colleagues
Kun-Lung Wu: colleagues