An efficient parallel approach for identifying protein families in large-scale metagenomic data sets
Source: Conference on High Performance Networking and Computing
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Austin, Texas
SECTION: Papers
Article No. 35  
Year of Publication: 2008
ISBN: 978-1-4244-2835-9
Authors
Changjun Wu  Washington State University, Pullman, WA
Ananth Kalyanaraman  Washington State University, Pullman, WA
Publisher
IEEE Press  Piscataway, NJ, USA

ABSTRACT

Metagenomics is the study of environmental microbial communities using state-of-the-art genomic tools. Recent advancements in high-throughput technologies have enabled the accumulation of large volumes of metagenomic data that, until a couple of years ago, were deemed impractical to generate. A primary bottleneck, however, is the lack of scalable algorithms and open source software for large-scale data processing. In this paper, we present the design and implementation of a novel parallel approach to identify protein families from large-scale metagenomic data. Given a set of peptide sequences, we reduce the problem to one of detecting arbitrarily sized dense subgraphs in bipartite graphs. Our approach efficiently parallelizes this task on a distributed memory machine through a combination of divide-and-conquer and combinatorial pattern matching heuristic techniques. We present performance and quality results from extensive testing of our implementation on 160K randomly sampled sequences from the CAMERA environmental sequence database using 512 nodes of a BlueGene/L supercomputer.
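
To make the reduction in the abstract concrete, the following is a minimal, serial sketch of what detecting dense subgraphs in a bipartite graph can look like. It is not the authors' algorithm: the abstract does not describe their heuristic, so the Jaccard-based grouping, the threshold tau, and the names jaccard and dense_subgraphs are all assumptions made purely for illustration. Left vertices stand for peptide sequences, and right vertices stand for whatever each sequence is linked to.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two neighbour sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dense_subgraphs(adj, tau=0.5):
    """Group left vertices whose neighbour sets overlap strongly.

    adj: dict mapping each left vertex to its set of right neighbours.
    tau: hypothetical Jaccard threshold, an assumption of this sketch.
    Returns a list of (left_set, right_set, density) tuples.
    """
    # Union-find over left vertices.
    parent = {v: v for v in adj}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    # Merge every pair of left vertices with similar neighbourhoods.
    # This all-pairs loop is quadratic; it is only meant to show the idea.
    for u, v in combinations(adj, 2):
        if jaccard(adj[u], adj[v]) >= tau:
            parent[find(u)] = find(v)

    groups = {}
    for v in adj:
        groups.setdefault(find(v), set()).add(v)

    result = []
    for left in groups.values():
        right = set().union(*(adj[v] for v in left))
        edges = sum(len(adj[v]) for v in left)
        # Density: fraction of possible left-right edges that are present.
        density = edges / (len(left) * len(right)) if right else 0.0
        result.append((left, right, density))
    return result


if __name__ == "__main__":
    toy = {
        "p1": {"a", "b", "c"},
        "p2": {"a", "b", "c"},
        "p3": {"b", "c"},
        "p4": {"x", "y"},
        "p5": {"x", "y"},
    }
    for left, right, density in dense_subgraphs(toy):
        print(sorted(left), sorted(right), round(density, 3))
```

On the toy input, p1, p2 and p3 end up in one group and p4 and p5 in another. A scalable version would have to avoid the all-pairs comparison, which is presumably where the divide-and-conquer and pattern matching heuristics mentioned in the abstract come in.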


