Nucleic Acids Research Advance Access published online on November 15, 2006
Nucleic Acids Research, doi:10.1093/nar/gkl855
Database Issue
The HIV positive selection mutation database
Center for Computational Biology, University of California Los Angeles, CA, USA 1 Molecular Biology Institute, Institute for Genomics and Proteomics, University of California Los Angeles, CA, USA 2 Department of Chemistry and Biochemistry, University of California Los Angeles, CA, USA
*To whom correspondence should be addressed. Tel: +1 310 825 7374; Fax: +1 310 206 7286; Email: leec{at}chem.ucla.edu
Received August 16, 2006. Revised October 9, 2006. Accepted October 10, 2006.
| ABSTRACT |
|---|
The HIV positive selection mutation database is a large-scale database available at http://www.bioinformatics.ucla.edu/HIV/ that provides detailed selection pressure maps of HIV protease and reverse transcriptase, both of which are molecular targets of antiretroviral therapy. This database makes available for the first time a very large HIV sequence dataset (sequences from 50 000 clinical AIDS samples, generously contributed by Specialty Laboratories, Inc.), which makes possible high-resolution selection pressure mapping. It provides information about not only the selection pressure on individual sites but also how selection pressure at one site is affected by mutations at other sites. It also includes datasets from another public resource, the Stanford HIV database [S. Y. Rhee, M. J. Gonzales, R. Kantor, B. J. Betts, J. Ravela and R. W. Shafer (2003) Nucleic Acids Res., 31, 298–303]. Comparison between these datasets in the database enables cross-validation with independent datasets and also specific evaluation of the effect of drug treatment.

| INTRODUCTION |
|---|
The HIV-1 virus is the causative agent of AIDS, a growing worldwide epidemic, and also a fascinating system for studying fundamental scientific questions. For example, one major clinical problem in the treatment of AIDS is HIV's ability to develop resistance to antiviral drugs rapidly, often within weeks of introduction of a new drug (1–3). Foremost among the factors responsible for this are the virus's extremely high mutation rate (4,5) and replication rate (3,6–8). For this reason, there is great medical interest both in understanding the specific causes of drug resistance and in predicting fast versus slow evolutionary pathways to multiple drug resistance. At the same time, HIV provides an extraordinary wealth of data about fundamental scientific questions such as the fitness landscape for protein evolution (9,10).
Evolutionary biology has developed a powerful and general approach for investigating such problems: metrics of selection pressure that measure whether a particular genetic change is selected for or against during evolution. Such metrics can reveal important selection forces either constraining or driving evolution of a protein, directly from raw sequence variation data (11,12). One very widely used metric of selection pressure on amino acid mutations is known as Ka/Ks or dn/ds (13,14) and measures the ratio of observed amino acid mutations over observed synonymous mutations, normalized by the ratio expected under a neutral model. Thus a Ka/Ks = 1 value indicates neutral selection. Ordinarily Ka/Ks is << 1, indicating negative selection against amino acid mutations (far fewer observed than expected under a neutral model). Ka/Ks > 1 is referred to as positive selection (i.e. amino acid mutations increase reproductive fitness) and is observed in rare cases where new evolutionary challenges create strong pressure for rapid evolution of a protein (e.g. immune system genes like MHC that are involved in recognizing pathogenic antigens). Ordinarily, a single Ka/Ks value is calculated for a whole gene, but with very large datasets it becomes possible to estimate distinct Ka/Ks values for individual codon positions or amino acid mutations. This yields a selection pressure map of a gene, revealing its detailed functional constraints and in rare cases positive selection peaks that signal important new evolutionary pressures such as drug treatment. We used Ka/Ks because it provides a powerful tool for detecting positive selection. Phylogenetic analysis of our HIV sequence dataset using Phylip (15) shows a star-like topology (data are available at www.bioinformatics.ucla.edu/HIV/topo.png, but will be presented in detail elsewhere), in agreement with previous studies (16,17).
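As a minimal sketch of the ratio just described, the following Python functions divide the observed ratio of amino acid to synonymous mutations by the ratio expected under a neutral model, then label the result using the conventions above. The function names, argument names and example counts are illustrative assumptions; the neutral expectation actually used for the database is defined in (12) and is not reproduced here.

```python
def ka_ks(n_amino, n_syn, expected_ratio):
    """Observed amino-acid/synonymous ratio, normalized by the neutral expectation.

    n_amino        -- observed count of amino acid (nonsynonymous) mutations
    n_syn          -- observed count of synonymous mutations
    expected_ratio -- amino-acid/synonymous ratio expected under a neutral model
                      (hypothetical input; its derivation is not shown here)
    """
    if n_syn <= 0 or expected_ratio <= 0:
        raise ValueError("n_syn and expected_ratio must be positive")
    return (n_amino / n_syn) / expected_ratio


def classify(value):
    """Label a Ka/Ks value: above 1 is positive selection, below 1 is negative."""
    if value > 1:
        return "positive selection"
    if value < 1:
        return "negative selection"
    return "neutral"


# Hypothetical counts, for illustration only.
print(classify(ka_ks(n_amino=600, n_syn=50, expected_ratio=3.0)))
print(classify(ka_ks(n_amino=25, n_syn=100, expected_ratio=2.5)))
```

In practice an estimate close to 1 would need a significance test before being called neutral or selected; the exact comparison here is only meant to mirror the definitions in the text.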
We have assembled a large-scale database that provides researchers detailed selection pressure maps of HIV proteins involved in drug resistance. These data have many possible applications, including prediction of mutations contributing to drug resistance, distinguishing primary drug resistance mutations from accessory mutations, rate measurements of fast versus slow evolutionary pathways to multiple drug resistance, and the evolutionary dynamics of different types of mutations as the virus moves from untreated to drug-treated conditions and back. This database makes available for the first time a very large HIV sequence dataset (sequences from 50 000 clinical AIDS samples), which makes possible high-resolution selection pressure mapping, as well as smaller datasets from other public databases. The methods and most of the data described herein have been published previously (12,18).
| DATABASE CONTENT, INTERFACE AND APPLICATIONS |
|---|
Datasets
The primary dataset consists of sequences for HIV protease and reverse transcriptase (RT) for 50 000 clinical AIDS patient samples from the United States, collected during 1999–2003 (12), and mostly under drug treatment. These data cover 1.4 kb each [300 000 chromatograms; six overlapping reads per sample, including both strands; see (12) for details] and were generously contributed by Specialty Laboratories Inc. Owing to HIV's high mutation rate, on average each sequence contains 32 mutations/kb [with respect to the Los Alamos reference sequence (12)], for a total of more than 2 million mutation observations in the dataset (12). Over 5000 distinct codon mutations were observed, each with an average count of 364 samples (12). For comparison, this density of polymorphism information is equivalent to sequencing 1 million people. This very large dataset, made available publicly for the first time, has made detailed selection pressure mapping possible. Of the samples, 99.3% are subtype B; non-subtype-B samples were excluded from the analysis (12). The dataset is fully HIPAA-compliant; all information concerning the source patients was removed by Specialty.

The database currently includes two additional datasets, also covering HIV protease and RT. These datasets were obtained from the Stanford HIV database (19). The Stanford-Treated dataset consists of 1797 subtype B samples with known drug treatments. This dataset provides a useful comparison with the Specialty results, for validating whether a specific mutation is reproducibly selected by drug treatment. The Stanford-Untreated dataset consists of 2628 subtype B samples not under drug treatment. By comparing results from this dataset with Specialty and Stanford-Treated, users can assess whether a specific mutation is more likely to be associated with drug resistance or other types of phenotypic fitness effects (e.g. interactions with the immune system).
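The dataset totals quoted above can be cross-checked with simple arithmetic. The sketch below multiplies the sample count, the sequenced length and the average mutation density, and separately multiplies the number of distinct codon mutations by their average count. The function and variable names are illustrative; treating "over 5000" as exactly 5000 makes the second figure a lower bound, and whether the two figures count exactly the same kind of observation is an assumption.

```python
def total_mutation_observations(samples, kb_per_sample, mutations_per_kb):
    """Estimated observations = samples x sequenced length (kb) x mutations per kb."""
    return samples * kb_per_sample * mutations_per_kb


# Figures quoted in the text for the primary dataset.
estimate = total_mutation_observations(samples=50_000, kb_per_sample=1.4,
                                       mutations_per_kb=32)

# "Over 5000" distinct codon mutations with an average count of 364 samples each.
lower_bound = 5000 * 364

print(f"{estimate:,.0f} estimated from density; "
      f"at least {lower_bound:,} from codon mutation counts")
```

Both figures come out at roughly two million, which is consistent with the "more than 2 million mutation observations" stated for the dataset.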
The Specialty raw sequence data are available as a gzip'ed FASTA file at http://www.bioinformatics.ucla.edu/HIV/Specialty_sequences.fasta.gz.
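For readers who want to work with the raw sequences directly, a gzip-compressed FASTA file can be read with the Python standard library alone. The following is a minimal sketch: the function name is illustrative, and nothing is assumed about the header format used in the Specialty file beyond the standard `>` prefix.

```python
import gzip


def read_fasta(path):
    """Yield (header, sequence) pairs from a plain or gzip-compressed FASTA file."""
    # Choose the opener from the file extension; "rt" reads text in both cases.
    opener = gzip.open if str(path).endswith(".gz") else open
    header, chunks = None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new record starts; emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)
```

After downloading the file to the working directory, iterating over `read_fasta("Specialty_sequences.fasta.gz")` yields one header and sequence per record without loading the whole file into memory.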
Amino acid selection pressure mapping
The first aspect of the database is mapping of Ka/Ks selection pressure at each codon position in HIV protease and the first 381 codon positions of RT (Figure 1). Further positions in RT were not sequenced in this dataset. Codon-specific selection pressure (12) was calculated using the following formula:
[Equation: codon-specific Ka/Ks; image not preserved, see (12).]
Selection pressure interaction mapping
The massive size of the Specialty dataset makes it possible to measure how selection pressure for one amino acid mutation Y is affected by amino acid mutations at other sites X. Specifically, the database computes Ka/Ks for mutation Y conditioned on the presence of amino acid mutations at site X versus the absence of any mutation at site X. This conditional Ka/Ks (18) calculation is performed as follows:
[Equation: conditional Ka/Ks for mutation Y given amino acid mutations at site X; image not preserved, see (18).]

Here the two conditional counts are the numbers of amino acid mutations and synonymous mutations at site Y observed in the presence of amino acid mutations at site X, and all other variables retain their previous definitions. Dividing this result by the one obtained in the absence of any mutation at site X gives the conditional selection ratio (18):

[Equation: conditional selection ratio; image not preserved, see (18).]

Here the two additional counts are the numbers of samples containing either an amino acid mutation or a synonymous mutation at Y and no mutation at X. The LOD score by which we evaluated the significance of apparent positive conditional selection was calculated using the following:

[Equation: LOD score, expressed in terms of q and other quantities defined above; image not preserved, see (18).]

For experimental validation, this database correctly predicted 80 of 92 known mutation positive interaction pairs identified in HIV protease by independent experimental studies (P-value = 10^-70) (11,18). The database again provides a graphical interface (Figure 2) as a 2D heatmap showing all pairwise interactions, which users can click at any position to inspect detailed tabular results.
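The conditional selection ratio described in words above can be sketched from four sample counts. This is an illustration under a stated assumption, not the database's implementation: it assumes the neutral-model normalization for site Y is the same whether or not site X is mutated, so that it cancels when the two Ka/Ks values are divided. The function name, argument names and example counts are hypothetical, and the LOD score is not covered.

```python
def conditional_selection_ratio(n_ya_xa, n_ys_xa, n_ya_x0, n_ys_x0):
    """Ka/Ks at Y with X mutated, divided by Ka/Ks at Y with X unmutated.

    n_ya_xa -- samples with an amino acid mutation at Y and at X
    n_ys_xa -- samples with a synonymous mutation at Y and an amino acid mutation at X
    n_ya_x0 -- samples with an amino acid mutation at Y and no mutation at X
    n_ys_x0 -- samples with a synonymous mutation at Y and no mutation at X

    Assumption: the neutral expectation for Y is identical in both conditions
    and therefore cancels, leaving only the four observed counts.
    """
    counts = (n_ya_xa, n_ys_xa, n_ya_x0, n_ys_x0)
    if min(counts) <= 0:
        raise ValueError("all four counts must be positive for the ratio to be defined")
    return (n_ya_xa / n_ys_xa) / (n_ya_x0 / n_ys_x0)


# Hypothetical counts: Y is mutated far more often, relative to synonymous
# changes, when X carries an amino acid mutation.
ratio = conditional_selection_ratio(n_ya_xa=120, n_ys_xa=10, n_ya_x0=30, n_ys_x0=60)
print(ratio)
```

A ratio well above 1 suggests that amino acid mutations at X increase selection for the mutation at Y; with small counts the ratio is noisy, which is why the text pairs it with a significance score.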
These data can yield useful insights into HIV drug resistance. For example, the data show a significant interaction between protease site 90 (a known drug resistance mutation site) and site 10 (Figure 3). Amino acid mutations at 90 displayed strong, unconditional positive selection, indicating that they directly cause drug resistance. In contrast, mutations at 10 are negatively selected in the absence of the 90 mutation, but become positively selected in the presence of the 90 mutation (Figure 3). These results closely match previous experimental studies showing that mutations at 90 cause drug resistance, while mutations at 10 have an accessory effect of compensating for the destabilizing effect of mutations at 90 (21). Thus, our database can help users by providing information that can distinguish primary drug-resistance mutations from accessory mutations (18). Users can navigate through links on every result page, to see mutations that strongly select for a given mutation, mutations that are strongly selected for by this mutation, or links to the Stanford (22) and Los Alamos HIV databases (23) giving further information about mutations at this site.
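The reasoning used for sites 90 and 10 can be written as a small decision rule: compare Ka/Ks for a mutation in the absence and in the presence of a mutation at a partner site, using 1 as the boundary for positive selection. The function name, labels and example values below are illustrative assumptions, and a real analysis would also weigh statistical significance rather than the point estimates alone.

```python
def selection_pattern(ka_ks_without_partner, ka_ks_with_partner):
    """Describe how selection on a mutation depends on a mutation at a partner site."""
    positive_without = ka_ks_without_partner > 1
    positive_with = ka_ks_with_partner > 1
    if positive_without and positive_with:
        return "positively selected regardless of partner"
    if positive_with:
        return "positively selected only with partner (accessory-like)"
    if positive_without:
        return "positively selected only without partner"
    return "not positively selected"


# Hypothetical values echoing the pattern described for protease sites 90 and 10.
print(selection_pattern(5.0, 6.0))   # behaves like a primary resistance site
print(selection_pattern(0.4, 3.0))   # behaves like an accessory site
```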
Comparison between the independent datasets in the database can shed additional light on such questions. For example, users can assess whether positively selected mutations in the Specialty dataset are really due to drug resistance, by comparing with the Stanford-Treated and Stanford-Untreated datasets. As shown in Figure 3b and c, the Stanford-Treated data strongly corroborate the Specialty result, while the Stanford-Untreated data show that 90 is indeed involved in drug resistance; it becomes strongly negatively selected in the absence of drug treatment. These data can help users distinguish genuine drug-treatment mutations from those that affect phenotype in other ways, e.g. interactions with the host immune system. Detailed analysis of these datasets demonstrates that the Ka/Ks results are highly reproducible: independent datasets from different sets of patients show strong quantitative agreement (18).
| FUTURE ADDITIONS |
|---|
We are currently working to add new data and features to the database. First, we will add data for additional HIV genes, such as the env gene, which is important for HIV immune evasion (24); although these datasets have smaller numbers of sequences, our analysis has shown that useful Ka/Ks mapping information can be obtained from such counts. Second, we will analyze mutation data from patients under specific drug treatments to compare the selection pressures caused by different drugs. Third, we will add datasets for other HIV subtypes (e.g. subtype C) to reveal where selection pressure patterns are consistent with those seen in subtype B (allowing diagnostic criteria from subtype B to be applied to other subtypes) and where there are important differences. Fourth, we will add a new very large dataset for the Hepatitis C core gene, consisting of approximately 60 000 samples, generously donated by Specialty Laboratories. Lastly, we will add new analyses and graphical interfaces to the database, including phylogenetic analysis and clickable pathway diagrams.
| ACKNOWLEDGEMENTS |
|---|
Funding to pay the Open Access publication charges for this article was provided by NIH Grants U54 RR021813 entitled Center for Computational Biology (CCB) and T32-HG002536.
Conflict of interest statement. None declared.
| REFERENCES |
|---|
1. Dybul, M., Fauci, A.S., Bartlett, J.G., Kaplan, J.E., Pau, A.K. Panel on Clinical Practices for Treatment of HIV. (2002) Guidelines for using antiretroviral agents among HIV-infected adults and adolescents. Ann. Intern. Med., 137, 381–433.
2. Richman, D.D., Havlir, D., Corbeil, J., Looney, D., Ignacio, C., Spector, S.A., Sullivan, J., Cheeseman, S., Barringer, K., Pauletti, D., et al. (1994) Nevirapine resistance mutations of human immunodeficiency virus type 1 selected during therapy. J. Virol., 68, 1660–1666.
3. Wei, X., Ghosh, S.K., Taylor, M.E., Johnson, V.A., Emini, E.A., Deutsch, P., Lifson, J.D., Bonhoeffer, S., Nowak, M.A., Hahn, B.H., et al. (1995) Viral dynamics in human immunodeficiency virus type 1 infection. Nature, 373, 117–122.
4. Mansky, L.M. and Temin, H.M. (1995) Lower in vivo mutation rate of human immunodeficiency virus type 1 than that predicted from the fidelity of purified reverse transcriptase. J. Virol., 69, 5087–5094.
5. Korber, B., Theiler, J., Wolinsky, S. (1998) Limitations of a molecular clock applied to considerations of the origin of HIV-1. Science, 280, 1868–1871.
6. Ho, D.D., Neumann, A.U., Perelson, A.S., Chen, W., Leonard, J.M., Markowitz, M. (1995) Rapid turnover of plasma virions and CD4 lymphocytes in HIV-1 infection. Nature, 373, 123–126.
7. Perelson, A.S., Neumann, A.U., Markowitz, M., Leonard, J.M., Ho, D.D. (1996) HIV-1 dynamics in vivo: virion clearance rate, infected cell life-span and viral generation time. Science, 271, 1582–1586.
8. Coffin, J.M. (1995) HIV population dynamics in vivo: implications for genetic variation, pathogenesis, and therapy. Science, 267, 483–489.
9. Kuhmann, S.E., Pugach, P., Kunstman, K.J., Taylor, J., Stanfield, R.L., Snyder, A., Strizki, J.M., Riley, J., Baroudy, B.M., Wilson, I.A., et al. (2004) Genetic and phenotypic analyses of human immunodeficiency virus type 1 escape from a small-molecule CCR5 inhibitor. J. Virol., 78, 2790–2807.
10. Bonhoeffer, S., Chappey, C., Parkin, N.T., Whitcomb, J.M., Petropoulos, C.J. (2004) Evidence for positive epistasis in HIV-1. Science, 306, 1547–1550.
11. Wu, T.D., Schiffer, C.A., Gonzales, M.J., Taylor, J., Kantor, R., Chou, S., Israelski, D., Zolopa, A.R., Fessel, W.J., Shafer, R.W. (2003) Mutation patterns and structural correlates in human immunodeficiency virus type 1 protease following different protease inhibitor treatments. J. Virol., 77, 4836–4847.
12. Chen, L., Perlina, A., Lee, C.J. (2004) Positive selection detection in 40,000 human immunodeficiency virus (HIV) type 1 sequences automatically identifies drug resistance and positive fitness mutations in HIV protease and reverse transcriptase. J. Virol., 78, 3722–3732.
13. Nei, M. and Gojobori, T. (1986) Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol., 3, 418–426.
14. Li, W.H. (1993) Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J. Mol. Evol., 36, 96–99.
15. Felsenstein, J. (1989) PHYLIP - phylogeny inference package (version 3.2). Cladistics, 5, 164–166.
16. Grassly, N.C., Harvey, P.H., Holmes, E.C. (1999) Population dynamics of HIV-1 inferred from gene sequences. Genetics, 151, 427–438.
17. Anderson, J.P., Rodrigo, A.G., Learn, G.H., Wang, Y., Weinstock, H., Kalish, M.L., Robbins, K.E., Hood, L., Mullins, J.I. (2001) Substitution model of sequence evolution for the human immunodeficiency virus type 1 subtype B gp120 gene over the C2-V5 region. J. Mol. Evol., 53, 55–62.
18. Chen, L. and Lee, C. (2006) Distinguishing HIV-1 drug resistance, accessory, and viral fitness mutations using conditional selection pressure analysis of treated versus untreated patient samples. Biol. Direct, 1, 14.
19. Rhee, S.Y., Gonzales, M.J., Kantor, R., Betts, B.J., Ravela, J., Shafer, R.W. (2003) Human immunodeficiency virus reverse transcriptase and protease sequence database. Nucleic Acids Res., 31, 298–303.
20. Loeb, D.D., Hutchison, C.A., 3rd, Edgell, M.H., Farmerie, W.G., Swanstrom, R. (1989) Mutational analysis of human immunodeficiency virus type 1 protease suggests functional homology with aspartic proteinases. J. Virol., 63, 111–121.
21. Para, M.F., Glidden, D.V., Coombs, R.W., Collier, A.C., Condra, J.H., Craig, C., Bassett, R., Leavitt, R., Snyder, S., McAuliffe, V., et al. (2000) Baseline human immunodeficiency virus type 1 phenotype, genotype, and RNA response after switching from long-term hard-capsule saquinavir to indinavir or soft-gel-capsule saquinavir in AIDS clinical trials group protocol 333. J. Infect. Dis., 182, 733–743.
22. Rhee, S.Y., Gonzales, M.J., Kantor, R., Betts, B.J., Ravela, J., Shafer, R.W. (2003) Human immunodeficiency virus reverse transcriptase and protease sequence database. Nucleic Acids Res., 31, 298–303.
23. Kuiken, C., Korber, B., Shafer, R.W. (2003) HIV sequence databases. AIDS Rev., 5, 52–61.
24. Johnson, W.E. and Desrosiers, R.C. (2002) Viral persistance: HIV's strategies of immune system evasion. Annu. Rev. Med., 53, 499–518.