ABSTRACT
Metagenomics is the study of environmental microbial communities using state-of-the-art genomic tools. Recent advances in high-throughput technologies have enabled the accumulation of large volumes of metagenomic data that, until a couple of years ago, were deemed impractical to generate. A primary bottleneck, however, is the lack of scalable algorithms and open source software for large-scale data processing. In this paper, we present the design and implementation of a novel parallel approach to identify protein families from large-scale metagenomic data. Given a set of peptide sequences, we reduce the problem to one of detecting arbitrarily sized dense subgraphs in bipartite graphs. Our approach efficiently parallelizes this task on a distributed memory machine through a combination of divide-and-conquer and combinatorial pattern matching heuristic techniques. We present performance and quality results from extensive testing of our implementation on 160K randomly sampled sequences from the CAMERA environmental sequence database using 512 nodes of a BlueGene/L supercomputer.