Nucleic Acids Research, 2003, Vol. 31, No. 1 469-473
© 2003 Oxford University Press

Gene3D: structural assignments for the biologist and bioinformaticist alike

Daniel W. A. Buchan1, Stuart C. G. Rison1, James E. Bray1, David Lee1,2, Frances Pearl1, Janet M. Thornton1,3 and Christine A. Orengo*,1

1 Biomolecular Structure and Modelling Group, Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK 2 Department of Crystallography, Birkbeck College, Malet Street, Bloomsbury, London WC1E 7HX, UK 3 EMBL—European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

*To whom correspondence should be addressed. Tel: +44 2076797193; Email: c.orengo{at}ucl.ac.uk
Present address: Stuart C. G. Rison, Royal Veterinary College, Department of Pathology and Infectious Diseases, Royal College Street, London NW1 0TU
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors

Received August 15, 2002; Revised and Accepted October 2, 2002

ABSTRACT

The Gene3D database (http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D/) provides structural assignments for genes within complete genomes. These are available via the internet from either the World Wide Web or FTP. Assignments are made using PSI-BLAST and subsequently processed using the DRange protocol. The DRange protocol is an empirically benchmarked method for assessing the validity of structural assignments made using sequence searching methods; appropriate assignment statistics are collected and made available. Gene3D links assignments to their appropriate entries in relevant structural and classification resources (PDBsum, CATH database and the Dictionary of Homologous Superfamilies). Release 2.0 of Gene3D includes 62 genomes, 2 eukaryotes, 10 archaea and 40 bacteria. Currently, structural assignments can be made for between 30 and 40 percent of any given genome. In any genome, around half of those genes assigned a structural domain are assigned a single domain and the other half are assigned multiple structural domains. Gene3D is linked to the CATH database and is updated with each new update of CATH.

INTRODUCTION

Considerable progress has been made in the field of genome annotation in the past five years and it is now evident that some structural or functional annotation can be provided for most of the genes in any given organism (6–12). Currently, the state of the art allows up to 80% (7) of the genes in any given organism to be assigned functional or structural annotation. Most annotation methods rely almost solely on inheriting functional annotation via sequence comparison, but one must exercise a degree of caution when interpreting such results. This is particularly pertinent when considering the annotation of distant homologues [~30% sequence identity, (13)]. Structural annotation is often useful when assessing the functional annotations of these homologues, because structural data enable 3D models to be built to inform functional predictions (14,15). Gene3D aims to provide the biologist with reliable precalculated relationships to protein structures and, as a result, the relevant links to the functional and structural data curated within the CATH domain structure classification database. These data can then be used as the starting point for homology modelling or evolutionary studies. A related resource, SUPERFAMILY (16), is linked to the SCOP structural database (17).

METHODS

The Gene3D database is derived from data produced by the DomainFinder algorithm (18) and the DRange protocol (2). This resource is created by scanning the sequences from the CATH structural domains against a large database derived from the non-redundant sequence database from GenBank that contains the sequences from the completed genomes. The PSI-BLAST (1) iterative database search algorithm is used (19) to scan CATH database sequences against the GenBank sequences. Preprocessing is carried out by DomainFinder and the DRange protocol selects and validates the putative structural annotations suggested by DomainFinder. Gene3D and the associated DRange protocol are described below.

DomainFinder and DRange
The Gene3D population process is illustrated in Figure 1. The procedure starts with a dataset of non-identical sequences from the CATH database (the CATH S95Reps). This dataset is searched with PSI-BLAST (Fig. 1A) against a library of sequences, in this case the GenBank non-redundant database, which includes the sequences from the completed genomes. The aim is to produce a series of matches of the structural domains to the genomic sequences (Fig. 1B).
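As a rough illustration of the search step, the sketch below assembles a command line for the `psiblast` program from the current NCBI BLAST+ suite. This is an assumption made for illustration only: the text does not state which PSI-BLAST executable, iteration count or E-value thresholds were used, so the file names, the three iterations and the 0.001 thresholds are hypothetical placeholders. The command is only built and printed, not executed.

```python
from typing import List


def build_psiblast_command(query_fasta: str, database: str, out_path: str,
                           iterations: int = 3,
                           evalue: float = 1e-3) -> List[str]:
    """Assemble (but do not run) a BLAST+ psiblast command line.

    All default values are illustrative placeholders, not the settings
    used to build Gene3D.
    """
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    return [
        "psiblast",
        "-query", query_fasta,               # a CATH domain sequence (hypothetical file)
        "-db", database,                     # formatted sequence library (hypothetical name)
        "-num_iterations", str(iterations),  # number of PSI-BLAST rounds
        "-evalue", str(evalue),              # reporting threshold
        "-inclusion_ethresh", str(evalue),   # threshold for inclusion in the profile
        "-outfmt", "6",                      # tabular output
        "-out", out_path,
    ]


cmd = build_psiblast_command("cath_domain.fasta", "nr_plus_genomes", "hits.tsv")
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run` on a machine where BLAST+ and a formatted database are installed.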



Figure 1. Populating the Gene3D Database. (A) CATH Representative sequences (S95Reps) are scanned against the GenBank non-redundant database containing the sequences from the completed genomes using PSI-BLAST. Search results (B) are processed by DomainFinder to generate ‘Ranges’ (C). These are ‘cleaned-up’ by the DRange package (D) and final assignments are assimilated in the Gene3D database (E).

 
In the subsequent step, the DomainFinder algorithm is used to convert the ‘raw’ hits into ‘Ranges’ (18). These ‘Ranges’ act as descriptors indicating which regions of a gene putatively belong to which CATH Homologous Superfamilies (Fig. 1C). In the last data manipulation step, assignments are cleaned up using the DRange package (Fig. 1D) and the resulting assignments are stored in the Gene3D database (Fig. 1E). The DRange package is composed of three modules: Collapse, MultiParse and CleanAssign (2). These three modules are used to verify structural domain assignments. The ‘clean-up’ procedure is a triage procedure that distinguishes between probably correct and probably incorrect assignments.
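The idea of triaging putative ranges can be sketched in a few lines. The code below is not the Collapse, MultiParse or CleanAssign modules, whose rules are described in reference (2); it is a generic greedy filter, swapped in purely for illustration, that keeps the most significant hits and rejects hits overlapping an already accepted range. The `Hit` class, the E-value cutoff of 0.001, the 10-residue overlap tolerance and the example hits are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hit:
    superfamily: str  # CATH superfamily code (illustrative values below)
    start: int        # first residue of the match, 1-based inclusive
    end: int          # last residue of the match, inclusive
    evalue: float     # E-value reported by the sequence search


def overlap(a: Hit, b: Hit) -> int:
    """Number of residues shared by two ranges on the same gene."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def triage(hits: List[Hit], max_evalue: float = 1e-3,
           max_overlap: int = 10) -> List[Hit]:
    """Greedy filter: best E-value first, reject weak or overlapping hits."""
    accepted: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        if hit.evalue > max_evalue:
            break  # hits are sorted, so all remaining hits are weaker
        if all(overlap(hit, kept) <= max_overlap for kept in accepted):
            accepted.append(hit)
    return sorted(accepted, key=lambda h: h.start)


hits = [
    Hit("3.40.50.300", 5, 180, 1e-40),
    Hit("3.40.50.720", 20, 170, 1e-12),  # overlaps the stronger first hit
    Hit("2.60.40.10", 185, 290, 1e-8),
    Hit("1.10.10.10", 300, 360, 0.5),    # fails the E-value cutoff
]
kept = triage(hits)
for hit in kept:
    print(hit.superfamily, hit.start, hit.end)
```

Here two of the four hypothetical hits survive: the strongest match and the non-overlapping match that follows it along the gene.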

RESULTS

Gene3D is the repository for structural assignments verified using the DRange protocol and is available on-line at http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D/. This protocol is applied to all complete genomes released. In May 2002, Gene3D included whole genome structural assignments for 66 genomes. The data are also available via the CATH FTP site at ftp://ftp.biochem.ucl.ac.uk/pub/Gene3D/.

Typical assignment statistics for four genomes in the database, representing the major branches of life (one multicellular eukaryote, one unicellular eukaryote, an archaeon and a bacterium), are presented in Table 1. Between ~22% and ~55% of the genes in a given organism in the database receive annotation with at least one structural domain. Of these genes, usually around half are annotated with a single domain and the other half are assigned multiple domains (see Fig. 2). The figure also shows that the eukaryotic genomes have many more genes with a large number of domains. Closer inspection indicates that the largest of these genes are made of long strings of immunoglobulin-like domains, which are likely to be cell–cell signalling domains. The percentage of residues covered (see Table 1) is often around half (and frequently lower) of the percentage of genes with an annotation. This indicates that many genes that pick up a domain are not being fully annotated and could be annotated with further domains.
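The difference between the two coverage measures can be made concrete with a small calculation. The sketch below is a minimal illustration under assumed inputs: the function name, the gene identifiers, the gene lengths and the assigned ranges are invented, and ranges are assumed to lie within their gene.

```python
from typing import Dict, List, Tuple


def coverage_stats(gene_lengths: Dict[str, int],
                   assignments: Dict[str, List[Tuple[int, int]]]
                   ) -> Tuple[float, float]:
    """Return (% of genes with at least one domain, % of residues covered)."""
    total_genes = len(gene_lengths)
    total_residues = sum(gene_lengths.values())
    if total_genes == 0 or total_residues == 0:
        raise ValueError("at least one gene with non-zero length is required")
    annotated_genes = 0
    covered_residues = 0
    for gene, ranges in assignments.items():
        if gene not in gene_lengths or not ranges:
            continue
        annotated_genes += 1
        positions = set()
        for start, end in ranges:          # 1-based, inclusive ranges
            positions.update(range(start, end + 1))
        covered_residues += len(positions)  # overlapping ranges count once
    return (100.0 * annotated_genes / total_genes,
            100.0 * covered_residues / total_residues)


# Hypothetical four-gene genome: two genes receive a domain assignment.
gene_lengths = {"geneA": 400, "geneB": 300, "geneC": 200, "geneD": 100}
assignments = {"geneA": [(1, 150)], "geneB": [(1, 100)]}
gene_pct, residue_pct = coverage_stats(gene_lengths, assignments)
print(gene_pct, residue_pct)  # 50.0 25.0
```

In this toy genome half of the genes are annotated but only a quarter of the residues are covered, the same pattern of partial annotation described above.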


Table 1. Genome assignments statistics
 


Figure 2. Bar chart showing the distribution of domains assigned to genes in three typical organisms: Caenorhabditis elegans, Methanococcus jannaschii and Escherichia coli. The Y axis has been truncated.

 
Cursory inspection of the assignment data shows that bacterial and archaeal genomes pick up approximately the same ratios of the various types of CATH domains and that no single genome appears to be strongly biased in the type of CATH domains it utilises (see Fig. 3). The eukaryotes appear to make more use of the all-beta domains in the CATH database, which is probably due to their greater use of cell–cell signalling proteins that typically use immunoglobulin-like domains.



Figure 3. Bar chart showing the relative percentage of domain classes from the four major CATH classes for genes that have been assigned a CATH domain. The four classes are: Class 1: all alpha folds; Class 2: all beta folds; Class 3: mixed alpha and beta folds; and Class 4: folds with little secondary structure.
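A class breakdown of this kind can be derived directly from CATH codes, since the first number of a code identifies the class. The sketch below is a minimal illustration, assuming that assignments are available as dotted CATH superfamily codes; the function name and the short list of codes are hypothetical and are not data from Gene3D.

```python
from collections import Counter
from typing import Dict, Iterable

# Class labels as given in the Figure 3 caption.
CLASS_NAMES = {
    1: "all alpha",
    2: "all beta",
    3: "mixed alpha and beta",
    4: "little secondary structure",
}


def class_percentages(cath_codes: Iterable[str]) -> Dict[int, float]:
    """Percentage of assigned domains falling in each of the four classes."""
    counts = Counter(int(code.split(".")[0]) for code in cath_codes)
    total = sum(counts[cls] for cls in CLASS_NAMES)
    if total == 0:
        return {cls: 0.0 for cls in CLASS_NAMES}
    return {cls: 100.0 * counts[cls] / total for cls in CLASS_NAMES}


# Illustrative assignments for a handful of domains.
codes = ["3.40.50.300", "3.40.50.720", "2.60.40.10", "1.10.10.10"]
percentages = class_percentages(codes)
for cls, pct in percentages.items():
    print(f"Class {cls} ({CLASS_NAMES[cls]}): {pct:.1f}%")
```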

 
In the database, the eukaryotic genomes pick up the least annotation, which may be due to a prokaryotic bias in the structures that are deposited within the PDB (20).

The Gene3D Web Server
The Gene3D web server is made up of a number of interlinked web pages which allow the retrieval of data on specific genes within the represented genomes. Each genome features an entry page (Fig. 4A) with a summary of the assignment statistics and a CATH wheel (21). The CATH wheel is a pie plot indicating which folds in CATH are present in the organism. Those folds not detected in the genome are blacked out. The statistics are similar to those presented in Table 1. From this entry page, it is possible to search the genome in one of two ways. The first is a simple keyword or gene identifier search that returns a list of matching genes. The second is to browse the complete list of genes within an organism that have had a structural assignment made to them. By either route, once a gene is selected a results page is returned (Fig. 4B). This results page presents a schematic diagram of both the gene (hatched in green) and the structural domains assigned (colour coded by domain type). Presented alongside this are the ‘Ranges’ data for this assignment and the E-values from PSI-BLAST upon which this assignment was accepted. We recommend that for batch downloads users refer to the FTP site (ftp://ftp.biochem.ucl.ac.uk/pub/Gene3D/).



Figure 4. (A) A typical entry page for a given genome (e.g. Mycoplasma genitalium) with the statistics presented and (B) an example of the diagram and data that can be retrieved for a gene.

 
DISCUSSION

The data within Gene3D are there to provide biologists and bioinformaticists with an initial stepping stone from which structural, functional and evolutionary studies can begin. In future, we hope to integrate Pfam domain assignments (12) to maximise the annotated coverage of genomes and we also hope to provide alignments of the CATH or Pfam domains to the genes that they matched. It is our hope that by integrating Pfam domain assignments, we can provide the assignments for most, if not all, of the genes in the complete genomes.

That we can annotate so much of the complete genome sequences from the structure databases alone suggests that we may not need to solve structures for every sequence but rather for every sequence family containing relatives of high sequence identity (for example ~40%). In such families, homology modelling could then be used to predict the structures of all the relatives from one representative structure. This bodes well for the success of the structural genomics projects.

REFERENCES

  1. Altschul,S., Madden,T., Schaffer,A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

  2. Buchan,D., Shepherd,A., Lee,D., Pearl,F., Rison,S., Thornton,J. and Orengo,C. (2002) Gene3D: structural assignment for whole genes and genomes using the CATH domain structure database. Genome Res., 12, 503–514.

  3. Laskowski,R. (2001) PDBsum: summaries and analyses of PDB structures. Nucleic Acids Res., 29, 221–222.

  4. Pearl,F., Martin,N., Bray,J., Buchan,D., Harrison,A., Lee,D., Reeves,G., Shepherd,A., Sillitoe,I., Todd,A., Thornton,J. and Orengo,C. (2001) A rapid classification protocol for the CATH domain database to support structural genomics. Nucleic Acids Res., 29, 223–227.

  5. Bray,J., Todd,A., Pearl,F., Thornton,J. and Orengo,C. (2000) The CATH Dictionary of Homologous Superfamilies (DHS): a consensus approach for identifying distant structural homologues. Protein Eng., 13, 153–165.

  6. Gerstein,M. (1997) A structural census of genomes: comparing bacterial, eukaryotic and archaeal genomes in terms of protein structure. J. Mol. Biol., 274, 562–576.

  7. Teichmann,S., Chothia,C. and Gerstein,M. (1999) Advances in structural genomics. Curr. Opin. Struct. Biol., 9, 390–399.

  8. Muller,A., MacCallum,R. and Sternberg,M. (1999) Benchmarking PSI-BLAST in genome annotation. J. Mol. Biol., 293, 1257–1271.

  9. Iliopoulos,I., Tsoka,S., Andrade,M., Janssen,P., Audit,B., Tramontano,A., Valencia,A., Leroy,C., Sander,C. and Ouzounis,C. (2001) Genome sequences and great expectations. Genome Biol., 2, Interactions0001.

  10. Apweiler,R., Biswas,M., Fleischmann,W., Kanapin,A., Karavidopoulou,Y., Kersey,P., Kriventseva,E., Mittard,V., Mulder,N., Phan,I. and Zdobnov,E. (2001) Proteome Analysis Database: online application of InterPro and CluSTr for the functional classification of proteins in whole genomes. Nucleic Acids Res., 29, 44–48.

  11. Kanehisa,M., Goto,S., Kawashima,S. and Nakaya,A. (2002) The KEGG databases at GenomeNet. Nucleic Acids Res., 30, 42–46.

  12. Bateman,A., Birney,E., Cerruti,L., Durbin,R., Etwiller,L., Eddy,S., Griffiths-Jones,S., Howe,K., Marshall,M. and Sonnhammer,E. (2002) The Pfam protein families database. Nucleic Acids Res., 30, 276–280.

  13. Todd,A., Orengo,C. and Thornton,J. (2001) Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol., 307, 1113–1143.

  14. Laskowski,R., Luscombe,N., Swindells,M. and Thornton,J. (1996) Protein clefts in molecular recognition and function. Protein Sci., 5, 2438–2452.

  15. Luscombe,N., Laskowski,R. and Thornton,J. (1997) NUCPLOT: a program to generate schematic diagrams of protein–nucleic acid interactions. Nucleic Acids Res., 25, 4940–4945.

  16. Gough,J. and Chothia,C. (2002) SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments. Nucleic Acids Res., 30, 268–272.

  17. Lo Conte,L., Brenner,S., Hubbard,T., Chothia,C. and Murzin,A. (2002) SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res., 30, 264–272.

  18. Pearl,F., Lee,D., Bray,J., Buchan,D., Shepherd,A. and Orengo,C. (2002) The CATH extended protein-family database: providing structural annotations for genome sequences. Protein Sci., 11, 233–244.

  19. Altschul,S., Madden,T., Schaffer,A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

  20. Westbrook,J., Feng,Z., Jain,S., Bhat,T., Thanki,N., Ravichandran,V., Gilliland,G., Bluhm,W., Weissig,H., Greer,D., Bourne,P. and Berman,H. (2002) The Protein Data Bank: unifying the archive. Nucleic Acids Res., 30, 245–248.

  21. Michie,A., Orengo,C. and Thornton,J. (1996) Analysis of domain structural class using an automated class assignment protocol. J. Mol. Biol., 262, 168–185.




