ABSTRACT
Databases containing very large amounts of SNP (single nucleotide polymorphism) data are now freely available to researchers working on medical or population genetics applications. Many of these SNP repositories provide data retrieval tools for general-purpose mining, but these tools alone cannot cover the broad range of needs of most medical and population genetics studies. To address this limitation, we propose building in-house customized data marts from the raw data provided by the largest public databases. In particular, for population genetics analysis based on genotypes, we propose a set of data processing scripts that take raw data from the major SNP variation databases (e.g. HapMap, Perlegen), split it into single genotypes, and group those genotypes into populations. This allows in-house standardization and normalization of genotyping data retrieved from different repositories, the calculation of statistical indices ranging from simple allele frequency estimates to elaborate genetic differentiation tests within populations, and the combination of population samples from different databases. This article describes the viability of implementing scripts that handle extensive datasets of SNP genotypes at low computational cost, and shows that updating these data marts is straightforward, permitting easy incorporation of new external data and the computation of new statistical indices.
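The processing steps outlined in the abstract (splitting raw data into single genotypes, grouping them into populations, and computing a simple index such as allele frequency) can be illustrated with a minimal sketch. This is not the authors' implementation: the input layout (a header row of sample identifiers followed by one row per SNP, with `NN` marking a missing call), the function names `split_genotypes` and `allele_frequencies`, and the sample-to-population mapping are all assumptions made for illustration, and real HapMap or Perlegen dumps carry additional columns.

```python
from collections import Counter, defaultdict


def split_genotypes(lines):
    """Yield (snp_id, sample_id, genotype) tuples from a genotype table.

    Assumed (hypothetical) layout: a header row 'snp_id s1 s2 ...' and then
    one whitespace-delimited row per SNP with a two-letter genotype per sample.
    """
    rows = (line.split() for line in lines if line.strip())
    header = next(rows)
    samples = header[1:]
    for fields in rows:
        snp_id, calls = fields[0], fields[1:]
        for sample_id, genotype in zip(samples, calls):
            yield snp_id, sample_id, genotype


def allele_frequencies(records, sample_to_population, missing="NN"):
    """Return {(population, snp_id): {allele: frequency}}.

    Samples without a population assignment and missing calls are skipped.
    """
    counts = defaultdict(Counter)
    for snp_id, sample_id, genotype in records:
        population = sample_to_population.get(sample_id)
        if population is None or genotype == missing:
            continue
        # Each character of the genotype string counts as one allele copy.
        counts[(population, snp_id)].update(genotype)
    return {
        key: {allele: n / sum(c.values()) for allele, n in c.items()}
        for key, c in counts.items()
    }


# Toy input; sample and population labels are invented for the example.
raw = [
    "snp_id s1 s2 s3 s4",
    "rs0001 AA AG GG NN",
    "rs0002 CC CT CC TT",
]
populations = {"s1": "POP_A", "s2": "POP_A", "s3": "POP_B", "s4": "POP_B"}
freqs = allele_frequencies(split_genotypes(raw), populations)
```

Because the population mapping is kept separate from the genotype records, samples drawn from different repositories could in principle be merged simply by extending the mapping, once their genotype encodings have been normalized to a common form.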
INDEX TERMS
Primary Classification:
J. Computer Applications
J.3 LIFE AND MEDICAL SCIENCES
Subjects: Biology and genetics
General Terms: Design, Management, Performance, Reliability, Standardization, Verification
Keywords: ceph, data mart, genotypes, hapmap, perlegen, population genetics, snps