Nucleic Acids Research
Novel coding regions in four complete archaeal genomes
Received August 2, 1999; Revised and Accepted September 28, 1999
ABSTRACT
In the process of analysing the four available complete archaeal genomes, we have noted that certain regions characterised as 'non-coding' exhibit significant sequence similarity to other protein sequences from Archaea and other species. Using established technology, we have identified a number of potential protein coding regions in these putative 'non-coding' regions. We have detected 524 such cases, of which 113 regions appear to code for proteins present in archaeal or other species, while the remaining 411 regions are mostly start/stop definition conflicts. Of the 113 protein coding regions, only 21 code for proteins with homologues of known function. The number of novel coding sequences identified herein amounts to 1.5% of the total genome entries, while the conflicting cases represent an additional 5%. The observed differences between the four complete archaeal genomes seem to reflect disparate approaches to genome annotation. Genome sequence collections should be regularly checked to improve gene prediction by sequence similarity and greater effort is required to make gene definitions consistent across related species.
INTRODUCTION
In the past few years, we have witnessed the successful completion of several whole-genome sequencing projects. However, the speed at which DNA sequences become available is not matched by functional information regarding the location of genes on the genome and their functional roles (1). Given the rate-limiting nature of experimental determination of new gene functions, it is important to develop fast yet reliable computational methods that 'predict' genes or potential coding regions (2). In addition, there is a need to constantly monitor the annotation of genome sequences to add new information either from experimental sources or from predictions based on characterised sequences from related organisms. The latter method is especially valuable in the case of archaeal genomes, which exhibit a large number of hitherto unknown functions. In this report, we have examined whether regions of archaeal genomes that were originally discarded as non-coding may in principle exhibit significant similarities to entries in the rapidly growing DNA and protein sequence databases.
MATERIALS AND METHODS
We have scanned genome annotations to extract all gaps between open reading frames (ORFs). Gaps longer than 70 nt were then extracted from the genomic sequence. We thus obtained 2237 inter-ORF regions from the complete genomes of Methanococcus jannaschii (323 regions) [MJ] (3), Methanobacterium thermoautotrophicum (590 regions) [RTH/MTH] (4), Archaeoglobus fulgidus (661 regions) [AF] (5) and Pyrococcus horikoshii (663 regions) [PH] (6). Our approach consists of matching all such extracted non-coding regions against a non-redundant protein sequence database for the detection of significant sequence similarities that imply the presence of protein coding segments (7).
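The gap extraction step described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes ORF annotations are already available as 1-based inclusive (start, end) coordinate pairs, ignores strand, and treats the genome as linear (a gap spanning the origin of a circular chromosome is not joined). The function name `inter_orf_gaps` and the example coordinates are hypothetical.

```python
from typing import List, Tuple

MIN_GAP_NT = 70  # threshold stated in the text: gaps longer than 70 nt


def inter_orf_gaps(
    orfs: List[Tuple[int, int]],
    genome_length: int,
    min_len: int = MIN_GAP_NT,
) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) gaps longer than min_len nt.

    Assumption: coordinates are strand-independent and the genome is
    treated as linear.
    """
    gaps = []
    prev_end = 0  # end of the right-most ORF seen so far
    for start, end in sorted(orfs):
        # Gap length between prev_end and this ORF is start - prev_end - 1.
        if start - prev_end - 1 > min_len:
            gaps.append((prev_end + 1, start - 1))
        # max() handles overlapping or nested ORFs.
        prev_end = max(prev_end, end)
    # Trailing region after the last ORF.
    if genome_length - prev_end > min_len:
        gaps.append((prev_end + 1, genome_length))
    return gaps


# Hypothetical annotation: three ORFs, two of them overlapping.
example_orfs = [(101, 400), (350, 600), (700, 900)]
example_gaps = inter_orf_gaps(example_orfs, genome_length=1000)
```

Each returned interval would then be cut out of the genomic sequence and used as a query, as described in the next paragraph.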
These 2237 nucleotide sequences were used as queries for BLASTX v.2.0 (8) runs against the non-redundant protein sequence database. Hits with P values <10⁻⁶ were extracted and manually examined. In addition, we have checked the clear cases with FRAME, a tool for sequencing error detection (9). All computations were performed on a cluster of Ultra 10 Sun workstations running Solaris 2.6.
Certain regions may be missed by employing a stringent threshold P value of 10⁻⁶. However, given the abundance of complete genome sequences, and the availability of related species, this cut-off appears to be permissible.
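The thresholding step can be illustrated with a short sketch. It assumes the BLASTX output has already been parsed into simple records; the parsing itself is not shown, and the `Hit` record, its field names and the example identifiers are hypothetical rather than part of any BLAST tool.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple

P_CUTOFF = 1e-6  # threshold stated in the text


class Hit(NamedTuple):
    """Hypothetical record for one parsed BLASTX hit."""
    region_id: str   # identifier of the inter-ORF query region
    subject_id: str  # identifier of the matched database protein
    frame: int       # signed reading frame of the query (+1..+3, -1..-3)
    p_value: float


def significant_hits(
    hits: Iterable[Hit], cutoff: float = P_CUTOFF
) -> Dict[str, List[Hit]]:
    """Group hits with P value strictly below the cutoff by query region."""
    by_region: Dict[str, List[Hit]] = defaultdict(list)
    for hit in hits:
        if hit.p_value < cutoff:
            by_region[hit.region_id].append(hit)
    return dict(by_region)


# Made-up example hits, for illustration only.
example_hits = [
    Hit("region_001", "protein_A", 1, 1e-8),
    Hit("region_001", "protein_B", 2, 1e-6),   # not strictly below cutoff
    Hit("region_002", "protein_C", -1, 0.01),
]
kept = significant_hits(example_hits)
```

Regions that survive this filter are only candidates; the text states that they were examined manually afterwards.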
RESULTS
The list of 524 regions contains both complete ORFs (113) and sections of previously characterised ORFs with conflicting start/end sites (411). The former can only be identified by similarity to homologous protein sequences in the database, while the latter can be detected with reference to similarity, direction and annotation of the downstream ORFs.
Most of the 113 newly discovered coding regions appear to code for homologues of previously reported genome sequences (Table 1). Of these, 55 probably contain frameshift errors which may have impeded correct ORF prediction (Fig. 1). Most of these protein coding regions match other archaeal proteins within the same or in a related species, therefore making the predictions more reliable.
Figure 1. Representation of the single frame (light grey) and the multiple frame (dark grey) potential protein coding regions listed in Table 1 for the four complete archaeal genomes (see Materials and Methods for full species names). Single frame cases can be considered to be due to conservative gene prediction, while multiple frame cases can be considered to be due to frameshift errors. It is evident that there is great disparity between M.jannaschii and A.fulgidus, which appear to contain many frameshifts, versus M.thermoautotrophicum and P.horikoshii, which appear to contain more conservative gene predictions.
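The single frame versus multiple frame distinction in Figure 1 can be expressed as a simple rule of thumb. The sketch below is an assumption about how such a classification could be automated, not the procedure used in the paper: it takes the signed reading frames of the significant hits for one region and reports whether they agree. The function name and labels are hypothetical.

```python
from typing import Iterable


def classify_region(frames: Iterable[int]) -> str:
    """Classify a region by the reading frames of its significant hits.

    frames: signed frame numbers (+1..+3, -1..-3), one per hit.
    Returns "single_frame", "multiple_frame" or "no_hit".
    """
    distinct = set(frames)
    if not distinct:
        return "no_hit"
    if len(distinct) == 1:
        # All hits in one frame: consistent with an ORF that was
        # simply not called (conservative gene prediction).
        return "single_frame"
    # Hits spread over several frames: a possible frameshift.
    return "multiple_frame"


labels = [classify_region(f) for f in ([1, 1, 1], [1, 2], [])]
```

Hits in different frames can also come from two distinct neighbouring genes, so a "multiple_frame" label would still need manual inspection before a frameshift is inferred.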
Table 1. The novel potential coding regions in the four complete archaeal genomes
There are 21 cases which have similarity to a protein of known function. The ORF upstream of MJ0273 codes for the second protease synthase and sporulation-negative regulatory protein homologue (PAI) in M.jannaschii, the other one being MJ1207 (10). Two more cases are the ORF upstream of PH0582, encoding subunit 1 of molybdopterin converting factor moaD, and the ORF upstream of PH0879, encoding the hydrogenase expression/formation protein HypC. Other notable discoveries are: a tungsten formylmethanofuran dehydrogenase subunit F homologue (upstream of AF0463) (reported in all three other species); a protein translocation complex sec61 γ subunit homologue (upstream of PH0003); three ribosomal proteins S17E/S24/S27E (upstream of PH1317, PH1910 and PH1940, respectively); DNA-dependent RNA polymerase subunit E″ (upstream of PH1909).
Most of the newly identified protein coding genes have archaeal proteins as closest homologues. Exceptions include the ORF upstream of MJ1569, which is similar to a unique protein reported in Escherichia coli, the ORF upstream of AF1072 that encodes a unique Synechocystis sp. homologue and the ORF upstream of PH0772 encoding a yeast gene homologue (Table 1).
Some of the detected regions which match hypothetical proteins cannot always be corroborated by consistent predictions in other genomes, and they may be isolated cases of false positive ORF assignments. However, in cases where all related species contain at least one such protein, the regions predicted herein should be included. Additional evidence can also be provided by the presence of duplicated genes within the same genome (see Table 1 for examples).
It is not clear whether the genome sequencing annotation procedures and in particular ORF detection (11) or automatic ORF translation by TrEMBL (12) generate the correct start/stop sites. Another difficulty is that each run may correspond to different coding regions (e.g. MTH1892), and therefore all the uniquely matching regions should be considered and reported. Examples of conflicting start/stop sites are the β subunit of tungsten formylmethanofuran dehydrogenase from M.jannaschii (upstream of MJ1194) and a heterosulphide reductase (upstream of MJ1190) (not shown).
One interesting observation is that the four genomes differ significantly with respect to the reported false negative ORF calls, which points to the variability of approaches and the lack of standards in genome sequence annotation. It is apparent from Table 1 that the M.jannaschii and P.horikoshii genomes contain most of the newly identified protein coding regions (>2% of the number of ORFs) (Fig. 1). Yet, in M.jannaschii, this seems to stem mostly from frameshift cases, while in P.horikoshii it seems to come from conservative ORF calls (Fig. 1). In contrast, M.thermoautotrophicum and A.fulgidus contain fewer frameshift errors.
DISCUSSION
Similar approaches have been used before for the discovery of bacterial genes (13) in the databases, in particular for E.coli (2). It has been noted before that this approach can be used not only for ORF prediction but also for sequencing error detection (8,9).
As mentioned above, in 55 out of 113 cases there are putative frameshifts that should be carefully checked. At present, we are not able to reconstruct the actual coding sequences without access to primary information, and therefore only the boundaries of the potential coding regions are reported (Table 1). In some cases, however, the sequence may indeed be correct, and those genes should then be considered as being reminiscent of cryptic genes (14) or vestigial sequences. The genome of Rickettsia prowazekii (15) is known to contain non-coding regions that appear to be genes deactivated by several instances of mutation. Indeed, the two categories are not mutually exclusive, as genes that have been deactivated by evolutionary mechanisms may in principle revert to a functional state.
Matching non-coding regions of nucleotide sequences against protein databases should be carried out routinely during all genome sequencing projects, to eliminate inconsistencies and false positive (or negative) ORF assignments.
ACKNOWLEDGEMENTS
We thank Nikos Kyrpides (Department of Microbiology, University of Illinois at Urbana-Champaign, IL) for discussions, Gillian Adams for help and the EBI Visitors Programme for travel support. C.O. acknowledges support by the European Molecular Biology Laboratory, the European Commission DGXII and IBM Research.
*To whom correspondence should be addressed. Tel: +44 1223 494653; Fax: +44 1223 494471; Email: ouzounis{at}ebi.ac.uk
Present address: Sowmya Raghavan, Centre for Mathematical Modelling and Computer Simulation, Bangalore 560037, India
Copyright© Oxford University Press, 1999.