Novel coding regions in four complete archaeal genomes
Nucleic Acids Research Pages 4405-4408


Sowmya Raghavan, Christos A. Ouzounis*

Computational Genomics Group, Research Programme, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge CB10 1SD, UK

Received August 2, 1999; Revised and Accepted September 28, 1999

ABSTRACT

In the process of analysing the four available complete archaeal genomes, we have noted that certain regions characterised as 'non-coding' exhibit significant sequence similarity to other protein sequences from Archaea and other species. Using established technology, we have identified a number of potential protein coding regions in these putative 'non-coding' regions. We have detected 524 such cases, of which 113 regions appear to code for proteins present in archaeal or other species, while the remaining 411 regions are mostly start/stop definition conflicts. Of the 113 protein coding regions, only 21 code for proteins with homologues of known function. The number of novel coding sequences identified herein amounts to 1.5% of the total genome entries, while the conflicting cases represent an additional 5%. The observed differences between the four complete archaeal genomes seem to reflect disparate approaches to genome annotation. Genome sequence collections should be regularly checked to improve gene prediction by sequence similarity and greater effort is required to make gene definitions consistent across related species.

INTRODUCTION

In the past few years, we have witnessed the successful completion of several whole-genome sequencing projects. However, the speed at which DNA sequences become available is not matched by functional information regarding the location of genes on the genome and their functional roles (1). Given the rate-limiting nature of experimental determination of new gene functions, it is important to develop fast yet reliable computational methods that 'predict' genes or potential coding regions (2). In addition, there is a need to constantly monitor the annotation of genome sequences to add new information either from experimental sources or from predictions based on characterised sequences from related organisms. The latter method is especially valuable in the case of archaeal genomes, which exhibit a large number of hitherto unknown functions. In this report, we have examined whether regions of archaeal genomes that were originally discarded as non-coding may in principle exhibit significant similarities to the rapidly growing DNA and protein sequence databases.

MATERIALS AND METHODS

We have scanned genome annotations to extract all gaps between open reading frames (ORFs). Gaps longer than 70 nt were then extracted from the genomic sequence. We thus obtained 2237 inter-ORF regions from the complete genomes of Methanococcus jannaschii (323 regions) [MJ] (3), Methanobacterium thermoautotrophicum (590 regions) [RTH/MTH] (4), Archaeoglobus fulgidus (661 regions) [AF] (5) and Pyrococcus horikoshii (663 regions) [PH] (6). Our approach consists of matching all such extracted non-coding regions against a non-redundant protein sequence database for the detection of significant sequence similarities that imply the presence of protein coding segments (7).
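The gap extraction step described above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure actually used: ORFs are assumed to be given as 1-based inclusive (start, end) coordinates, the genome is treated as linear, strand is ignored, and overlapping ORFs are merged implicitly. The function names inter_orf_gaps and extract are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive, start <= end


def inter_orf_gaps(orfs: List[Interval], genome_length: int,
                   min_length: int = 70) -> List[Interval]:
    """Return the gaps between annotated ORFs that are longer than min_length nt.

    Assumption: linear coordinates; a gap spanning the origin of a circular
    genome is reported as two separate pieces.
    """
    gaps = []
    covered_up_to = 0  # right-most position covered by any ORF seen so far
    for start, end in sorted(orfs):
        gap_start, gap_end = covered_up_to + 1, start - 1
        # Overlapping ORFs give a negative length here and are skipped.
        if gap_end - gap_start + 1 > min_length:
            gaps.append((gap_start, gap_end))
        covered_up_to = max(covered_up_to, end)
    # Region between the last ORF and the end of the sequence.
    if genome_length - covered_up_to > min_length:
        gaps.append((covered_up_to + 1, genome_length))
    return gaps


def extract(genome: str, gap: Interval) -> str:
    """Cut the nucleotide sequence of a gap out of the genomic sequence."""
    start, end = gap
    return genome[start - 1:end]
```

With ORFs at (101, 400), (350, 600), (700, 900) and (950, 1200) on a 1500 nt sequence, the sketch reports the gaps (1, 100), (601, 699) and (1201, 1500); the 49 nt gap between positions 901 and 949 falls below the length cut-off and is dropped.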

These 2237 nucleotide sequences were used as queries for BLASTX v.2.0 (8) runs against the non-redundant protein sequence database. Hits with P values <10^-6 were extracted and manually examined. In addition, we have checked the clear cases with FRAME, a tool for sequencing error detection (9). All computations were performed on a cluster of Ultra 10 Sun workstations running Solaris 2.6.

Certain regions may be missed by employing a stringent threshold P value of 10^-6. However, given the abundance of complete genome sequences, and the availability of related species, this cut-off appears to be permissible.
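The threshold step can be illustrated with a minimal sketch. The Hit record and its field names are hypothetical and do not correspond to any particular BLASTX output format; only the cut-off of 10^-6 comes from the text. Because some search tools report expectation values (E) rather than P values, the sketch also includes the standard conversion P = 1 - exp(-E), which is not part of the original description.

```python
import math
from dataclasses import dataclass
from typing import Iterable, List

P_THRESHOLD = 1e-6  # cut-off stated in the text


@dataclass
class Hit:
    query_id: str    # hypothetical identifier of the inter-ORF region
    subject_id: str  # hypothetical identifier of the matched database protein
    p_value: float   # P value of the highest scoring segment pair


def significant_hits(hits: Iterable[Hit],
                     threshold: float = P_THRESHOLD) -> List[Hit]:
    """Keep only hits whose P value is strictly below the threshold."""
    return [hit for hit in hits if hit.p_value < threshold]


def p_from_e(e_value: float) -> float:
    """Convert an expectation value to a P value: P = 1 - exp(-E).

    expm1 keeps precision for very small E, where P is close to E.
    """
    return -math.expm1(-e_value)
```

For values as small as the cut-off used here, P and E are practically identical, so the same threshold can be applied to either statistic. Hits passing the filter would still need the manual examination described above.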

RESULTS

The list of 524 regions contains both complete ORFs (113) and sections of previously characterised ORFs with conflicting start/end sites (411). The former can only be identified by similarity to homologous protein sequences in the database, while the latter can be detected with reference to similarity, direction and annotation of the downstream ORFs.

Most of the 113 newly discovered coding regions appear to code for homologues of previously reported genome sequences (Table 1). Of these, 55 probably contain frameshift errors which may have impeded correct ORF prediction (Fig. 1). Most of these protein coding regions match other archaeal proteins within the same or in a related species, therefore making the predictions more reliable.


Figure 1. Representation of the single frame (light grey) and the multiple frame (dark grey) potential protein coding regions listed in Table 1 for the four complete archaeal genomes (see Materials and Methods for full species names). Single frame cases can be considered to be due to conservative gene prediction, while multiple frame cases can be considered to be due to frameshift errors. It is evident that there is great disparity between M.jannaschii and A.fulgidus, which appear to contain many frameshifts, versus M.thermoautotrophicum and P.horikoshii, which appear to contain more conservative gene predictions.

Table 1. The novel potential coding regions in the four complete archaeal genomes
Gap_ID is the identifier of the gap upstream of the reported ORF; Start is the left-most (right-most for the negative strand) site of the gap; End is the right-most (left-most for the negative strand) site of the gap; Frame is the direction and number of frames (where more than one frame is detected, a frameshift is likely); Best_Hit is the closest homologue in the database; Score, P value and Identity correspond to the BLASTX score, P value and identity of the query sequence to the highest scoring HSP; Comment contains information about homology and, when possible, function (underlined). The following convention is used: if the family is small and can be represented by all members, then the query ORF is described as 'homolog'; if the family is large, the query ORF is described as 'family homolog'. All results are available on the World Wide Web at http://www.ebi.ac.uk/research/cgg/annotation/neworfs/ ; comments/corrections are welcome at neworfs@ebi.ac.uk .
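The Frame convention in the legend above (more than one frame suggests a frameshift) could be applied along the following lines. This is a sketch under assumptions that go beyond the text: each region is represented by hypothetical (gap_id, frame) pairs, one per matching segment, frames are numbered -3 to -1 and +1 to +3, and segments falling on both strands are set aside as ambiguous rather than counted as frameshifts.

```python
from typing import Dict, Iterable, Set, Tuple

VALID_FRAMES = (-3, -2, -1, 1, 2, 3)


def classify_regions(segments: Iterable[Tuple[str, int]]) -> Dict[str, str]:
    """Label each gap by the set of reading frames in which matches occur."""
    frames_by_gap: Dict[str, Set[int]] = {}
    for gap_id, frame in segments:
        if frame not in VALID_FRAMES:
            raise ValueError(f"unexpected reading frame: {frame}")
        frames_by_gap.setdefault(gap_id, set()).add(frame)

    labels = {}
    for gap_id, frames in frames_by_gap.items():
        strands = {frame > 0 for frame in frames}
        if len(frames) == 1:
            labels[gap_id] = "single frame"
        elif len(strands) == 1:
            # Several frames on one strand: consistent with a frameshift.
            labels[gap_id] = "multiple frames, frameshift likely"
        else:
            # Assumption: matches on both strands are not treated as frameshifts.
            labels[gap_id] = "both strands, ambiguous"
    return labels
```

A label of this kind only flags candidates; as noted in the text, such cases still need to be checked against the primary sequence data.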

There are 21 cases which have similarity to a protein of known function. The ORF upstream of MJ0273 codes for the second protease synthase and sporulation-negative regulatory protein homologue (PAI) in M.jannaschii, the other one being MJ1207 (10). Two more cases are the ORF upstream of PH0582 encoding subunit 1 of molybdopterin converting factor moaD and the ORF upstream of PH0879 encoding the hydrogenase expression/formation protein HypC. Other notable discoveries are: a tungsten formylmethanofuran dehydrogenase subunit F homologue (upstream of AF0463) (reported in all three other species); a protein translocation complex sec61 γ subunit homologue (upstream of PH0003); three ribosomal proteins S17E/S24/S27E (upstream of PH1317, PH1910 and PH1940, respectively); DNA-dependent RNA polymerase subunit E″ (upstream of PH1909).

Most of the newly identified protein coding genes have archaeal proteins as closest homologues. Exceptions include the ORF upstream of MJ1569, which is similar to a unique protein reported in Escherichia coli, the ORF upstream of AF1072 that encodes a unique Synechocystis sp. homologue and the ORF upstream of PH0772 encoding a yeast gene homologue (Table 1).

Some of the detected regions which match hypothetical proteins cannot always be corroborated by consistent predictions in other genomes, and they may be isolated cases of false positive ORF assignments. However, in cases where all related species contain at least one such protein, the regions predicted herein should be included in the annotation. Additional evidence can also be provided by the presence of duplicated genes within the same genome (see Table 1 for examples).

It is not clear whether the genome sequencing annotation procedures and in particular ORF detection (11) or automatic ORF translation by TrEMBL (12) generate the correct start/stop sites. Another difficulty is that each run may correspond to different coding regions (e.g. MTH1892), and therefore all the uniquely matching regions should be considered and reported. Examples of conflicting start/stop sites are the β subunit of tungsten formylmethanofuran dehydrogenase from M.jannaschii (upstream of MJ1194) and a heterosulphide reductase (upstream of MJ1190) (not shown).

One interesting observation is that the four genomes differ significantly with respect to the reported false negative ORF calls, therefore pointing out the variability of approaches and lack of standards in genome sequence annotation. It is apparent from Table 1 that the M.jannaschii and P.horikoshii genomes contain most of the newly identified protein coding regions (>2% of the number of ORFs) (Fig. 1). Yet, in M.jannaschii, this seems to stem mostly from frameshift cases, while in P.horikoshii it seems to come from conservative ORF calls (Fig. 1). In contrast, M.thermoautotrophicum and A.fulgidus contain fewer frameshift errors.

DISCUSSION

Similar approaches have been used before for the discovery of bacterial genes (13) in the databases, in particular for E.coli (2). It has been noted before that this approach can be used not only for ORF prediction but for sequencing error detection (8,9). As mentioned above, in 55 out of 113 cases there are putative frameshifts that should be carefully checked. At present, we are not able to reconstruct the actual coding sequences without access to primary information, and therefore only the boundaries of the potential coding regions are reported (Table 1).

In some cases, however, the sequence may indeed be correct, and those genes should then be considered as being reminiscent of cryptic genes (14) or vestigial sequences. The genome of Rickettsia prowazekii (15) is known to contain non-coding regions that appear to be genes deactivated by several instances of mutation. Indeed, the two categories are not mutually exclusive, as genes that have been deactivated by evolutionary mechanisms may in principle revert to a functional state.

Matching non-coding regions of nucleotide sequences against protein databases should be carried out routinely during all genome sequencing projects, to eliminate inconsistencies and false positive (or negative) ORF assignments.

ACKNOWLEDGEMENTS

We thank Nikos Kyrpides (Department of Microbiology, University of Illinois at Urbana-Champaign, IL) for discussions, Gillian Adams for help and the EBI Visitors Programme for travel support. C.O. acknowledges support by the European Molecular Biology Laboratory, the European Commission DGXII and IBM Research.

REFERENCES

1. Andrade, M.A. and Sander, C. (1997) Curr. Opin. Biotechnol., 8, 675-683.

2. Borodovsky, M., Rudd, K.E. and Koonin, E.V. (1994) Nucleic Acids Res., 22, 4756-4767.

3. Bult, C.J., White, O., Olsen, G.J., Zhou, L., Fleischmann, R.D., Sutton, G.G., Blake, J.A., FitzGerald, L.M., Clayton, R.A., Gocayne, J.D. et al. (1996) Science, 273, 1058-1073.

4. Smith, D.R., Doucette-Stamm, L.A., Deloughery, C., Lee, H., Dubois, J., Aldredge, T., Bashirzadeh, R., Blakely, D., Cook, R., Gilbert, K. et al. (1997) J. Bacteriol., 179, 7135-7155.

5. Klenk, H.-P., Clayton, R.A., Tomb, J.-F., White, O., Nelson, K.E., Ketchum, K.A., Dodson, R.J., Gwinn, M., Hickey, E.K., Peterson, J.D. et al. (1997) Nature, 390, 364-370.

6. Kawarabayasi, Y., Sawada, M., Horikawa, H., Haikawa, Y., Hino, Y., Yamamoto, S., Sekine, M., Baba, S., Kosugi, H., Hosoyama, A. et al. (1998) DNA Res., 5, 147-155.

7. Altschul, S.F., Boguski, M.S., Gish, W. and Wootton, J.C. (1994) Nature Genet., 6, 119-129.

8. Gish, W. and States, D.J. (1993) Nature Genet., 3, 266-272.

9. Brown, N.P., Sander, C. and Bork, P. (1998) Bioinformatics, 14, 367-371.

10. Kyrpides, N.C. and Ouzounis, C.A. (1999) Proc. Natl Acad. Sci. USA, 96, 8545-8550.

11. McIninch, J.D., Hayes, W.S. and Borodovsky, M. (1996) Intell. Sys. Mol. Biol., 4, 165-175.

12. Apweiler, R., Gateau, A., Contrino, S., Martin, M.J., Junker, V., O'Donovan, C., Lang, F., Mitaritonna, N., Kappus, S. and Bairoch, A. (1997) Intell. Sys. Mol. Biol., 5, 33-43.

13. Robison, K., Gilbert, W. and Church, G.M. (1994) Nature Genet., 7, 205-214.

14. Hall, B.G., Yokoyama, S. and Calhoun, D.H. (1983) Mol. Biol. Evol., 1, 109-124.

15. Andersson, S.G.E., Zomorodipour, A., Andersson, J.O., Sicheritz-Ponten, T., Alsmark, U.C., Podowski, R.M., Naslund, A.K., Eriksson, A.S., Winkler, H.H. and Kurland, C.G. (1998) Nature, 396, 133-140.


*To whom correspondence should be addressed. Tel: +44 1223 494653; Fax: +44 1223 494471; Email: ouzounis@ebi.ac.uk

Present address: Sowmya Raghavan, Centre for Mathematical Modelling and Computer Simulation, Bangalore 560037, India


Copyright© Oxford University Press, 1999.
