ABSTRACT
This paper presents our first experiences in mapping and optimizing genomic sequence search onto the massively parallel IBM Blue Gene/P (BG/P) platform. Specifically, we performed our work on mpiBLAST, a parallel sequence-search code that has been optimized on numerous supercomputing environments. In doing so, we identify several critical performance issues. Consequently, we propose and study different approaches for mapping sequence-search and parallel I/O tasks on such massively parallel architectures. We demonstrate that our optimizations can deliver nearly linear scaling (93% efficiency) on up to 32,768 cores of BG/P. In addition, we show that such scalability enables us to complete a large-scale bioinformatics problem (sequence searching a microbial genome database against itself to support the discovery of missing genes in genomes) in only a few hours on BG/P. Previously, this problem was viewed as computationally intractable in practice.