ABSTRACT
In this paper we (1) describe state-of-the-art methods for identifying clusters in DNA sequence data for taxonomic analysis; (2) describe a new method, based on model-based clustering, with better scaling properties; and (3) present examples using the nucleoprotein and hemagglutinin regions of influenza and the env and gag regions of human immunodeficiency virus (HIV).