Nucleic Acids Research, 2001, Vol. 29, No. 1 133-136
© 2001 Oxford University Press
VIDA: a virus database system for the organization of animal virus genome open reading frames
Wohl Virion Centre, Department of Immunology and Molecular Pathology, Windeyer Institute of Medical Sciences and 1Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London, London, UK, 2Department of Computer Science and 3Department of Crystallography, Birkbeck College, University of London, London, UK
Received September 1, 2000; Revised and Accepted October 27, 2000.
| ABSTRACT |
|---|
VIDA is a new virus database that organizes open reading frames (ORFs) from partial and complete genomic sequences from animal viruses. Currently VIDA includes all sequences from GenBank for Herpesviridae, Coronaviridae and Arteriviridae. The ORFs are organized into homologous protein families, which are identified on the basis of sequence similarity relationships. Conserved sequence regions of potential functional importance are identified and can be retrieved as sequence alignments. We use a controlled taxonomical and functional classification for all the proteins and protein families in the database. When available, protein structures that are related to the families have also been included. The database is available for online search and sequence information retrieval at http://www.biochem.ucl.ac.uk/bsm/virus_database/VIDA.html.
| INTRODUCTION |
|---|
Knowledge of complete virus genome sequences has helped biologists develop a fundamental understanding of viral replication and host-pathogen interactions. However, viral proteins have largely been analysed without many of today's bioinformatics approaches, and the organization and annotation of viral genomic sequences is consequently poorer than that of other genome databases. The number of available viral genomic sequences continues to grow. For example, in the herpesvirus family the number of complete genomes has increased from 10 to 26 in the last 5 years, and partial genomic sequences encoding a range of open reading frames (ORFs) exist for at least another 25 herpesviruses. Because viral sequences in the primary databases are highly heterogeneous and redundant, there is a need for databases that facilitate the retrieval of relevant information. Existing virus sequence databases focus mainly on the visualization and interpretation of complete virus genomes (1–3) or on the detailed study of particular virus proteins (4).
We have developed a virus database, VIDA, that organizes information on viral ORFs from complete or partial genomic sequences derived from GenBank (5). The database is focused on animal viruses. It contains up-to-date compilations of virus-specific protein families, which are identified on the basis of sequence similarity relationships between the ORFs. The approach resembles that used by Montague and Hutchison (6) to construct clusters of orthologous groups (COGs) from the protein sequences of 13 complete herpesvirus genomes. However, in VIDA, homologous protein families include both orthologous and paralogous sequences, as long as they have significant sequence similarity. The families within VIDA are automatically derived for all ORFs from a given virus family, e.g. herpesviruses. Viral ORFs can exhibit high mutation rates and can diverge quickly. The identification of conserved sequence regions is therefore valuable both for identifying functionally important protein regions and for cross-comparing different virus genomes. For each protein family entry in VIDA it is possible to obtain alignments of the conserved regions. Additionally, complete protein sequences can be retrieved as single FASTA format files for easy import into other sequence analysis programs. Links to the original genomic entries and information on protein folds, when available, are also provided. A simple virus-specific functional classification has been developed and used to classify the protein families into typical viral processes. The procedures used to organize the virus protein sequences are shown in Figure 1a. In addition to keyword searches, facilities exist to browse the protein families in a controlled way using the virus name or functional class. The present release of VIDA includes all ORFs from the Herpesviridae family and the order Nidovirales, which contains the Coronaviridae and Arteriviridae families (7).
| CONSTRUCTION AND ORGANIZATION OF THE VIRUS DATABASE (VIDA) |
|---|
Homologous protein families (HPFs)
VIDA uses GenBank files as the source of virus sequences. Files relating to a virus family are downloaded using keywords (for example Herpesviridae) and a number of fields within the GenBank files are parsed out into sub-files. The parsed fields include the GenBank accession number, sequence length, protein GenBank identifiers (GI numbers), sequence source, gene name and gene product. Partial ORFs in GenBank entries are not parsed into the database, which should contain only complete protein sequences. The protein sequences are further filtered to eliminate 100% redundancy, and a list of synonymous GIs is created for further reference.
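The redundancy filter described above can be illustrated with a minimal Python sketch. It assumes that "100% redundancy" means exact identity of the full protein sequence; the function name `collapse_identical`, the record layout and the GI values are hypothetical and are not taken from the VIDA code.

```python
def collapse_identical(records):
    """Group (gi, sequence) pairs by exact sequence identity.

    Returns a list of (representative_gi, sequence, synonymous_gis) tuples
    in first-seen order. Assumption: two entries are redundant only when
    their sequences match exactly after upper-casing.
    """
    groups = {}  # sequence -> list of GIs, insertion-ordered
    for gi, seq in records:
        groups.setdefault(seq.upper(), []).append(gi)
    return [(gis[0], seq, gis[1:]) for seq, gis in groups.items()]


# Hypothetical input: two GenBank protein entries share one sequence.
records = [
    ("gi|1001", "MKTLLA"),
    ("gi|1002", "MKTLLA"),  # identical to gi|1001, so it becomes a synonym
    ("gi|1003", "MSTNPK"),
]
nonredundant = collapse_identical(records)
```

Here the first GI seen for each sequence is kept as the representative and the remaining GIs form its synonym list; any other choice of representative would serve equally well.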
Analysis of related proteins is often greatly enhanced by determining the sequence relationships between the individual proteins. This can identify regions of sequence similarity and of possible functional conservation. The clustering of homologous sequences therefore provides a rational way of organizing protein sequence data. We built the HPFs in two steps. First, we identified all regions of sequence similarity in the viral ORFs using the XDOM program with default parameters (8). XDOM is based on BLASTP (9) and had previously been used to identify regions of protein sequence similarity in different complete genomes from bacteria, archaea and eukarya (10). We then used our own C program (PSCbuilder) to build families of proteins by grouping those proteins that share at least one region of sequence similarity. These are defined as HPFs (Fig. 1b). PSCbuilder constructs families that contain as many protein members as possible, which prevents fragmentation of HPFs into subgroups of proteins and therefore maximizes the amount of functionally important information embedded within the protein families. While most HPFs are defined by one region of sequence similarity, others, for example the herpesvirus DNA-dependent DNA polymerase, contain several consecutive regions of sequence similarity that are present in all the proteins.
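One way to read the family-building step is as single-linkage clustering: proteins are nodes, a shared similarity region is an edge, and each connected component is a family. The Python sketch below follows that reading using a union-find structure. It is an assumption about the behaviour of PSCbuilder, not a port of it, and `build_families` and the protein labels are hypothetical.

```python
def build_families(proteins, similarity_pairs):
    """Cluster proteins into maximal families by transitive similarity.

    proteins: iterable of protein identifiers.
    similarity_pairs: iterable of (a, b) pairs that share at least one
    region of sequence similarity (e.g. as reported by a tool like XDOM).
    Returns a list of sorted member lists, ordered by first member.
    """
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the root, shortening the path as we go.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in similarity_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families = {}
    for p in proteins:
        families.setdefault(find(p), set()).add(p)
    return sorted((sorted(m) for m in families.values()), key=lambda m: m[0])


proteins = ["A", "B", "C", "D", "E", "F"]
pairs = [("A", "B"), ("B", "C"), ("D", "E")]
families = build_families(proteins, pairs)
```

In this toy input A and C share no region directly but land in the same family through B, which is the kind of merging that keeps families as large as possible; F shares nothing and remains a singleton.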
Protein structure
Whenever possible, we have included information on known viral protein structures or structures homologous to viral proteins and mapped them onto homologous protein families. This was achieved by searching all viral proteins against a library of structural domain profiles derived from the CATH database (11) using PSI-BLAST (12). The search was performed with the IMPALA software (13). In both these programs the expectation value (E) threshold was set at 0.0005. We could detect structural relatives for ~8% of the homologous protein families. Links are provided to the original PDB entries and/or CATH domain entries.
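The mapping of structural hits onto families can be sketched as a simple filter. The example below assumes that search results are already available as (protein, domain, E-value) tuples and that hits at or below the stated threshold of 0.0005 are accepted. It does not run PSI-BLAST or IMPALA, and all identifiers in it are hypothetical.

```python
E_THRESHOLD = 0.0005  # expectation value threshold stated in the text


def structural_links(hits, protein_to_family, e_threshold=E_THRESHOLD):
    """Collect, per family, the structural domains hit by any member.

    hits: iterable of (protein_id, domain_id, evalue) tuples.
    protein_to_family: dict mapping protein_id -> family_id.
    Assumption: a hit is kept when evalue <= e_threshold.
    """
    links = {}
    for protein, domain, evalue in hits:
        if evalue > e_threshold:
            continue  # not significant enough
        family = protein_to_family.get(protein)
        if family is not None:
            links.setdefault(family, set()).add(domain)
    return links


# Hypothetical hits and family assignments.
hits = [
    ("prot1", "domain_X", 1e-10),
    ("prot2", "domain_X", 0.01),   # above threshold, ignored
    ("prot3", "domain_Y", 0.0004),
]
protein_to_family = {"prot1": "HPF_1", "prot2": "HPF_1", "prot3": "HPF_2"}
links = structural_links(hits, protein_to_family)
```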
Functional annotation
We have developed a simple functional classification scheme to assign proteins to broad functional classes that reflect typical virus processes. So far we have defined the following classes: DNA replication, RNA replication, Virus structural proteins, Glycoproteins, Nucleotide and nucleic acid metabolism, Transcription and Others. The homologous protein families have been manually assigned to these classes and given a short functional description. We have mostly used the original ORF annotations in the GenBank files to assign the proteins to a functional class.
Virus taxonomy
We have used the nomenclature of the International Committee on Taxonomy of Viruses (ICTV) (14) to name the species, genus and subfamily of each virus in the database. In some cases we needed to create lists of synonyms to accommodate the different names used in GenBank entries for the same virus. This enables users to search with any of the synonymous names while consistency is maintained within the database. In addition, searching by virus name is facilitated by a list of standard virus names from which to choose. Taxon and virus name are integral parts of the HPF tables, making it easy for users to assess the phylogenetic distribution of each family across viruses.
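The synonym lists can be pictured as a lookup from any known name to one standard name. The following Python sketch assumes case-insensitive matching. The single table entry is illustrative only and is not copied from VIDA, and the helper names are hypothetical.

```python
# Illustrative synonym table: standard name -> alternative names.
SYNONYMS = {
    "Human herpesvirus 8": [
        "HHV-8",
        "Kaposi's sarcoma-associated herpesvirus",
        "KSHV",
    ],
}


def build_lookup(synonyms):
    """Map every name (standard or synonym, case-folded) to the standard name."""
    lookup = {}
    for standard, alternatives in synonyms.items():
        for name in [standard, *alternatives]:
            lookup[name.casefold()] = standard
    return lookup


def standard_name(name, lookup):
    """Return the standard virus name for a query, or None if unknown."""
    return lookup.get(name.strip().casefold())


lookup = build_lookup(SYNONYMS)
resolved = standard_name("kshv", lookup)
```

Resolving every query to the standard name before searching means the stored records need to carry only one name per virus.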
| CONTENT OF CURRENT RELEASE |
|---|
VIDA 1.0 contains entries for the Herpesviridae family and the order Nidovirales (Coronaviridae and Arteriviridae). Herpesviruses have large double-stranded DNA genomes, with between 80 and 220 ORFs per genome, while Nidovirales are single-stranded RNA viruses that encode 8–12 ORFs. In VIDA 1.0 there are 812 HPF entries (Table 1). The entries are presented as tables that include the length of the proteins, the start and end of conserved sequence regions, virus taxa and name, gene name and links to EMBL entries (Fig. 1c). In addition, it is possible to retrieve the protein sequences, DNA sequences and alignments of conserved sequence regions, and to follow links to structural folds (CATH and PDB entries) and functional class.
| DATABASE ACCESS |
|---|
VIDA 1.0 is accessed through an initial web page (http://www.biochem.ucl.ac.uk/bsm/virus_database/VIDA.html) that lists the current virus families within the database. Links to the complete genome sequences of viruses within the family are provided, as well as links to other virus sites and general sequence analysis resources. For each virus family, the 'Search homologous protein families' link enables searching of the HPFs by virus name, function (for example DNA polymerase), GI code (GenBank protein entry) or free text (for example UL6, ORF 20). It is also possible to retrieve all HPFs belonging to a defined functional class. The hits are presented as a list of HPFs, each with a short description covering the number of protein members and the function. Following the links leads to the specific HPF tables (Fig. 1c).
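The search options listed above can be modelled as optional filters over family records. The Python sketch below is a hypothetical in-memory model, not the VIDA web interface; the `HPF` record layout, the `search` function and the sample GI values are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class HPF:
    family_id: str
    function: str          # short functional description
    functional_class: str  # one of the controlled functional classes
    members: list          # list of (gi, virus_name, gene_name) tuples


def search(hpfs, virus=None, functional_class=None, gi=None, text=None):
    """Return the HPFs that satisfy every filter that is supplied."""
    results = []
    for hpf in hpfs:
        if functional_class and hpf.functional_class != functional_class:
            continue
        if virus and not any(v == virus for _, v, _ in hpf.members):
            continue
        if gi and not any(g == gi for g, _, _ in hpf.members):
            continue
        if text:
            # Free text is matched against the description and gene names.
            words = [hpf.function] + [gene for _, _, gene in hpf.members]
            if text.casefold() not in " ".join(words).casefold():
                continue
        results.append(hpf)
    return results


# Illustrative records with made-up GI values.
hpfs = [
    HPF("HPF_1", "DNA polymerase", "DNA replication",
        [("gi|0001", "Human herpesvirus 1", "UL30"),
         ("gi|0002", "Human herpesvirus 8", "ORF9")]),
    HPF("HPF_2", "portal protein", "Virus structural proteins",
        [("gi|0003", "Human herpesvirus 1", "UL6")]),
]
by_text = [h.family_id for h in search(hpfs, text="ul6")]
by_class = [h.family_id for h in search(hpfs, functional_class="DNA replication")]
```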
| CONCLUSIONS AND PERSPECTIVES |
|---|
VIDA has been developed to organize viral sequences and annotations from the main sequence database repositories. The sequences are extracted from different animal virus families. Non-redundant complete ORF sequences derived from GenBank are automatically clustered into homologous protein families. The protein families are a rich source of information for functional and evolutionary studies, and the alignments of the conserved sequence regions facilitate the direct study of important conserved amino acids or the construction of sequence profiles. VIDA 1.0 is particularly relevant to the herpesvirus, coronavirus and arterivirus communities. It will be updated with each new GenBank release and we are committed to incorporating other animal virus families. Its functionality will be gradually improved, for example by providing sequence-based search programs. We also plan to move from the present flat file format to an Oracle relational database format. Functional annotation should benefit from contributions and feedback from other experts in the field, which we strongly encourage.
| ACKNOWLEDGEMENTS |
|---|
We acknowledge the Medical Research Council for grants to D.L., N.M., C.A.O. and P.K. and the Biotechnology and Biological Sciences Research Council for grants to M.M.A., F.M.G.P. and A.J.S.
| FOOTNOTES |
|---|
* To whom correspondence should be addressed at: Wohl Virion Centre, Department of Immunology and Molecular Pathology, Windeyer Institute of Medical Sciences, University College London, 46 Cleveland Street, London W1T 4JF, UK. Tel: +44 20 7679 9559; Fax: +44 20 7679 9555; Email: p.kellam@ucl.ac.uk
| REFERENCES |
|---|
1 Tamames,J. and Tramontano,A. (2000) DANTE: A workbench for sequence analysis. Trends Biochem. Sci., 25, 402–403.
2 Hiscock,D. and Upton,C. (2000) Viral Genome DataBase: storing and analyzing genes and proteins from complete viral genomes. Bioinformatics, 16, 484–485.
3 Wheeler,D.L., Chappey,C., Lash,A.E., Leipe,D.D., Madden,T.L., Schuler,G.D., Tatusova,T.A. and Rapp,B.A. (2000) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 28, 10–14. Updated article in this issue: Nucleic Acids Res. (2001), 29, 11–16.
4 Shafer,R.W., Jung,D.R., Betts,B.J., Xi,Y. and Gonzales,M.J. (2000) Human immunodeficiency virus reverse transcriptase and protease sequence database. Nucleic Acids Res., 28, 346–348. Updated article in this issue: Nucleic Acids Res. (2001), 29, 296–299.
5 Benson,D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J., Rapp,B.A. and Wheeler,D.L. (2000) GenBank. Nucleic Acids Res., 28, 15–18.
6 Montague,M.G. and Hutchison,C.A.,III (2000) Gene content phylogeny of herpesviruses. Proc. Natl Acad. Sci. USA, 97, 5334–5339.
7 Snijder,E.J. and Meulenberg,J.J.M. (1998) The molecular biology of arteriviruses. J. Gen. Virol., 79, 961–979.
8 Gouzy,J., Eugene,P., Greene,E.A., Kahn,D. and Corpet,F. (1997) XDOM, a graphical tool to analyse domain arrangements in any set of protein sequences. Comput. Appl. Biosci., 13, 601–608.
9 Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
10 Gouzy,J., Corpet,F. and Kahn,D. (1999) Whole genome protein domain analysis using a new method for domain clustering. Comput. Chem., 23, 333–340.
11 Pearl,F.M.G., Lee,D., Bray,J.E., Sillitoe,I., Todd,A.E., Harrison,A.P., Thornton,J.M. and Orengo,C.A. (2000) Using the CATH domain database to assign structures and functions to the genome sequences. Nucleic Acids Res., 28, 277–282. Updated article in this issue: Nucleic Acids Res. (2001), 29, 223–227.
12 Altschul,S.F., Madden,T.L., Schäffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
13 Schäffer,A.A., Wolf,Y.I., Ponting,C.P., Koonin,E.V., Aravind,L. and Altschul,S.F. (1999) IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics, 15, 1000–1011.
14 Van Regenmortel,M.H.V., Fauquet,C.M., Bishop,D.H.L., Carstens,E.B., Estes,M.K., Lemon,S.M., Maniloff,J., Mayo,M.A., McGeoch,D.J., Pringle,C.R. and Wickner,R.B. (2000) Virus Taxonomy: The Classification and Nomenclature of Viruses. The Seventh Report of the International Committee on Taxonomy of Viruses. Academic Press, San Diego, CA, p. 1167.