
Nucleic Acids Research, 2003, Vol. 31, No. 1 109-113
© 2003 Oxford University Press

PACRAT: a database and analysis system for archaeal and bacterial intergenic sequence features

William C. Ray* and Charles J. Daniels1

Children's Research Institute and The Department of Pediatrics, The Ohio State University, 700 Children's Drive, Columbus, OH 43205, USA
1 The Department of Microbiology, The Ohio State University, 484 West 12th Avenue, Columbus, OH 43210, USA

*To whom correspondence should be addressed. Tel: +1 6147222557; Fax: +1 6147223273; Email: ray.29{at}osu.edu

Received August 14, 2002; Accepted August 22, 2002

ABSTRACT

Analysis of intergenic sequences for purposes such as the investigation of transcriptional signals or the identification of small RNA genes is frequently complicated by traditional biological database structures. Genome data is commonly treated as chromosome-length sequence records, detailed by gene calls demarcating subsequences of the chromosomes. Given this model, the determination of non-called subsequences between any gene and its nearest neighbors requires an exhaustive search of all gene calls associated with the chromosome. Further compounding the issue, the location of intergenic regions for many called genes cannot be resolved unambiguously due to uncertainties in gene boundaries, as well as the presence of other conflicting gene calls. To address these difficulties we have constructed the PACRAT (http://www.biosci.ohio-state.edu/~pacrat/) database system. PACRAT preprocesses GenBank genome submissions, evaluates for every gene the character of its relationship to those genes nearest to it, and produces a relationally linked model of the gene ordering for the genome. Using this information, the interface allows the researcher to query gene data as well as intergenic sequence data based on a number of criteria. These include the ability to filter searches based on the status of start and stop positions, or of upstream/downstream sequences, as conflicting with called genes, and the automated extension of upstream or downstream searches to find probable operon promoters or terminators. The database is also indexed by KEGG classification, allowing, for example, functionally-related groups of high-quality promoter-containing regions to be easily retrieved together.

INTRODUCTION

The PACRAT system (1) is an integrated database, data warehousing and data analysis system. It was designed to simplify the task of acquiring and analyzing functionally correlated sets of sequences for potential biologically relevant patterns.

Rationale
It is well understood that the analysis of functionally related sequences for conserved patterns is complicated by degeneracy in the patterns themselves. This understanding is explicitly codified in traditional genomic databases by the inclusion of uncertainty-quantifying statistics such as E-scores with respect to nearest-neighbor functional annotations, or by the use of qualifying adjectives such as ‘putative’ or ‘potential’. Uncertainties with respect to gene boundaries, however, have not traditionally been codified in databases that treat genomes as chromosome-length sequences with genes indicated as subregions.

This information is, however, particularly critical to researchers interested in investigating intergenic regions, or near-gene-start regions, for sequences that may be functionally active, such as transcription signals or genes encoding small RNAs. It is especially important in the identification of small RNA genes, where their numbers may equal or exceed those encoding tRNAs (2,3). It is also of relevance to the examination of protein gene sequences themselves, as the presence of miscalled starts in a database may erroneously include non-gene sequence in, or exclude valid sequence from, statistical analyses of gene coding regions.

Further complicating the issue, such a treatment of genome and gene data makes the acquisition of intergenic regions difficult. In this model, there is no indication of whether a gene boundary overlaps another called gene, and while gene entries are organised in order of increasing start coordinate, probable miscalls result in situations where non-neighboring gene entries must be examined to detect possible conflicts. Because of this, retrieval of the intergenic regions associated with any gene requires the examination of not only the immediately neighboring gene entries, but potentially exhaustive examination of the other gene entries for the chromosome.
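The search problem described above can be illustrated with a minimal sketch. It is not PACRAT code: the `Gene` record, the `gap_below` function and the example coordinates are hypothetical, and strand is ignored, so "below" means only the lower-coordinate side of a gene. The point shown is that the entry immediately preceding a gene in start order cannot be trusted to bound its intergenic region, so every other gene call has to be examined.

```python
from typing import List, NamedTuple, Optional, Tuple


class Gene(NamedTuple):
    name: str
    start: int  # 1-based inclusive lower coordinate of the gene call
    end: int    # 1-based inclusive upper coordinate of the gene call


def gap_below(genes: List[Gene], target: Gene) -> Optional[Tuple[int, int]]:
    """Return the uncalled region just below target.start, or None.

    None means the lower boundary of the target is covered by (or directly
    abuts) another gene call. Every other call is examined, because a long
    call that starts much earlier can still reach the target's boundary.
    """
    nearest_end = 0
    for g in genes:
        if g == target:
            continue
        if g.start <= target.start <= g.end:
            return None  # another call spans the target's lower boundary
        if g.end < target.start:
            nearest_end = max(nearest_end, g.end)
    if nearest_end + 1 > target.start - 1:
        return None  # neighbor abuts the target; no intergenic bases remain
    return (nearest_end + 1, target.start - 1)


# Hypothetical calls, listed in order of increasing start coordinate.
genes = [
    Gene("a", 100, 900),    # long call, possibly a miscall
    Gene("b", 150, 400),    # immediate predecessor of "c" in start order
    Gene("c", 500, 700),
    Gene("d", 1000, 1200),
]

covered = gap_below(genes, genes[2])  # "c" lies inside "a", so no clean gap
clean = gap_below(genes, genes[3])    # region between "a" and "d"
```

Looking only at "b", the entry preceding "c", would suggest an intergenic region of 401 to 499, yet that region lies entirely inside the call for "a". A precomputed linkage avoids repeating this full scan for every query.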

Since gene boundaries are unlikely to be exactly determined without experimentally mapping each translated gene sequence, we have developed the PACRAT system to catalog and classify the characteristics of the boundaries of genes from GenBank (4) submissions and to allow researchers interested in these sequences to retrieve genes, or the intergenic sequences related to them, based on the characteristics of their called-boundary relationships to all other nearby genes.

It is also important in such analyses to be able to conveniently retrieve groups of sequences based on proposed functional relationships. Therefore, providing multi-sequence retrieval and analysis functions through proposed functional mappings such as KEGG (5) classification has been an additional focus of the project.

Database design
The PACRAT system is a relational database built using the MySQL (DataKonsultAB) SQL database system. The system loads GenBank whole-genome submissions in .gbk format, and their related KEGG gene classification tables. At load time, a bi-directionally linked list is built from the individual gene regions indicated in the GenBank file. Linkage is based upon the immediately preceding and following called genes in the case of non-overlapping genes, or upon the gene with the largest exterior extent for overlapping genes. From the bi-directionally linked list, the relationship between each gene and its nearest neighbors, in terms of proximity or potential conflict, is determined. The bi-directional linkage is stored as relational data in an SQL table, along with codes indicating the gene call's positional relationship to its neighbors. Also classified is an impact range for the gene, detailing the genome-coordinate extent over which non-explicitly linked data must be examined to retrieve other potentially interesting or related features. This feature not only allows the system to retrieve useful information regarding surrounding features when queried for a gene, but also provides the ability to easily determine the set of genomic features that may be of interest for any particular genomic coordinate. Another table includes annotation data related by gene id. For retrieval efficiency, the sequences extracted for the gene, for the upstream (pre-gene) and downstream (post-gene) intergenic regions, and for the 100 bp regions preceding and following the called start and stop, respectively, are also stored and linked by gene id.
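The linkage step can be sketched as follows, under several stated assumptions. The table name `gene_link`, its columns, and the codes `CLEAR`, `OVERLAP` and `NONE` are hypothetical and are not PACRAT's actual schema or classification codes. "Largest exterior extent" is read here as the overlapping call that reaches furthest outward from the boundary in question, which is one plausible interpretation. Python's built-in sqlite3 module stands in for MySQL so that the example is self-contained.

```python
import sqlite3
from typing import List, NamedTuple, Optional, Tuple


class Gene(NamedTuple):
    gene_id: str
    start: int  # 1-based inclusive lower coordinate
    end: int    # 1-based inclusive upper coordinate


def link_below(genes: List[Gene], g: Gene) -> Tuple[Optional[str], str]:
    """Choose the linked neighbor on the low-coordinate side of g."""
    others = [o for o in genes if o.gene_id != g.gene_id]
    overlapping = [o for o in others if o.start <= g.start <= o.end]
    if overlapping:
        # Assumed reading of "largest exterior extent": reaches furthest out.
        return min(overlapping, key=lambda o: o.start).gene_id, "OVERLAP"
    below = [o for o in others if o.end < g.start]
    if below:
        return max(below, key=lambda o: o.end).gene_id, "CLEAR"
    return None, "NONE"


def link_above(genes: List[Gene], g: Gene) -> Tuple[Optional[str], str]:
    """Choose the linked neighbor on the high-coordinate side of g."""
    others = [o for o in genes if o.gene_id != g.gene_id]
    overlapping = [o for o in others if o.start <= g.end <= o.end]
    if overlapping:
        return max(overlapping, key=lambda o: o.end).gene_id, "OVERLAP"
    above = [o for o in others if o.start > g.end]
    if above:
        return min(above, key=lambda o: o.start).gene_id, "CLEAR"
    return None, "NONE"


def build_link_table(genes: List[Gene]) -> sqlite3.Connection:
    """Store both link directions and their codes as one relational row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene_link (gene_id TEXT PRIMARY KEY, start_pos INTEGER,"
        " end_pos INTEGER, prev_id TEXT, prev_code TEXT,"
        " next_id TEXT, next_code TEXT)"
    )
    for g in genes:
        prev_id, prev_code = link_below(genes, g)
        next_id, next_code = link_above(genes, g)
        conn.execute(
            "INSERT INTO gene_link VALUES (?, ?, ?, ?, ?, ?, ?)",
            (g.gene_id, g.start, g.end, prev_id, prev_code, next_id, next_code),
        )
    conn.commit()
    return conn


# Hypothetical gene calls: g1 and g2 overlap, g3 stands clear of both.
demo = [Gene("g1", 100, 400), Gene("g2", 350, 700), Gene("g3", 900, 1200)]
conn = build_link_table(demo)
rows = conn.execute(
    "SELECT gene_id, prev_id, prev_code, next_id, next_code"
    " FROM gene_link ORDER BY start_pos"
).fetchall()
```

Because the neighbor and its relationship code are computed once at load time, a later query for a gene's upstream or downstream region becomes a single-row lookup rather than a scan of the chromosome's gene calls.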

THE DATA

The data that are available from the PACRAT system include gene nucleotide sequences and protein sequence translations, as well as the intergenic regions associated with a gene, 100 bp regions immediately preceding or following a gene, or user-specified variable regions surrounding gene called starts or stops in the range of -99 to +99 around the respective boundary. Also available is a quick overview of the genome region local to a gene showing other sequence features in the immediate area. This display includes notation indicating the identity and positional extents of genes in the locale, as well as other marked sequence features (misc_feature, repeat_region, etc.) in the GenBank file. Through the use of the bi-directional linkage data, the system is capable of chaining backwards through apparent operon structures to determine the upstream boundary for the operon and to return the sequence relative to this point rather than the gene immediately queried, if the user desires. Likewise, it can chain forwards to return operon-related terminator regions.
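Backward chaining through an apparent operon might look like the sketch below. The criterion used to decide that two genes belong to the same apparent operon is not given in the text, so the rule here (same strand, and an intergenic gap no larger than a hypothetical `max_gap`) is purely an assumption for illustration, as are the `GeneRecord` fields and the example coordinates.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class GeneRecord:
    gene_id: str
    start: int               # 1-based inclusive lower coordinate
    end: int                 # 1-based inclusive upper coordinate
    strand: str              # "+" or "-"
    prev_id: Optional[str]   # linked neighbor at lower coordinates
    next_id: Optional[str]   # linked neighbor at higher coordinates


def operon_head(records: Dict[str, GeneRecord], gene_id: str,
                max_gap: int = 50) -> str:
    """Follow the stored links upstream and return the first gene of the run.

    Assumed rule: the walk continues while the upstream neighbor is on the
    same strand and is separated by at most max_gap bases.
    """
    current = records[gene_id]
    seen = {gene_id}  # guards against cyclic links in malformed data
    while True:
        up_id = current.prev_id if current.strand == "+" else current.next_id
        if up_id is None or up_id in seen:
            return current.gene_id
        up = records[up_id]
        if up.strand != current.strand:
            return current.gene_id
        if current.strand == "+":
            gap = current.start - up.end - 1
        else:
            gap = up.start - current.end - 1
        if gap > max_gap:
            return current.gene_id
        seen.add(up_id)
        current = up


# Hypothetical forward-strand genes: A, B and C are closely spaced, D is not.
records = {
    "A": GeneRecord("A", 100, 400, "+", None, "B"),
    "B": GeneRecord("B", 420, 900, "+", "A", "C"),
    "C": GeneRecord("C", 905, 1300, "+", "B", "D"),
    "D": GeneRecord("D", 1600, 2000, "+", "C", None),
}

head_for_c = operon_head(records, "C")
head_for_d = operon_head(records, "D")
```

A query for the promoter region of "C" would then return sequence upstream of "A", while "D" is treated as the start of its own transcription unit. Chaining forwards to a terminator region is the same walk in the opposite link direction.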

These data may be searched by gene identifier, description or product name as annotated, or by KEGG functional classification. The retrieval results may be filtered to only include results that do, or do not, display any particular PACRAT classification of start or stop relationship with their neighbors. Sequences are returned in multi-sequence FASTA format, and may, through other facilities of the PACRAT system, be analysed and stored for future use directly in PACRAT.
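As a rough illustration of filtered retrieval, the sketch below keeps only the records that do, or do not, carry a given classification and writes them as multi-sequence FASTA. The `Retrieval` record, the codes `CLEAR` and `CONFLICT`, and the example sequences are hypothetical and do not reproduce PACRAT's own classification labels or output layout.

```python
from typing import Iterable, NamedTuple, Optional


class Retrieval(NamedTuple):
    gene_id: str
    description: str
    upstream_code: str  # hypothetical boundary classification label
    upstream_seq: str


def to_fasta(records: Iterable[Retrieval], require: Optional[str] = None,
             exclude: Optional[str] = None, width: int = 60) -> str:
    """Format the records that pass the filter as multi-sequence FASTA."""
    lines = []
    for r in records:
        if require is not None and r.upstream_code != require:
            continue
        if exclude is not None and r.upstream_code == exclude:
            continue
        lines.append(f">{r.gene_id} {r.description}")
        # Wrap the sequence into fixed-width lines.
        for i in range(0, len(r.upstream_seq), width):
            lines.append(r.upstream_seq[i:i + width])
    return "\n".join(lines) + "\n" if lines else ""


# Made-up records, not taken from any genome.
hits = [
    Retrieval("geneA", "demo gene A", "CLEAR", "TTTATATAGGCCAA"),
    Retrieval("geneB", "demo gene B", "CONFLICT", "GGGCCC"),
]

clear_only = to_fasta(hits, require="CLEAR", width=10)
not_clear = to_fasta(hits, exclude="CLEAR", width=10)
```

The resulting text can be passed directly to tools that accept multi-sequence FASTA input.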

Due to the particular interest in archaeal transcription as a model of eucaryal Pol II transcription, with the corresponding requirement for intergenic sequence data, the PACRAT system currently focuses on complete archaeal genomes that have been submitted to GenBank. All archaeal genomes available as complete genomes from GenBank as of July 2002 are currently available from the system. Bacterial genomes can be loaded into the system on request, and several are currently available. Genomes are periodically reloaded to take advantage of updated information in the GenBank or KEGG databases.

Characteristics
The archaeal genomes in the database currently display PACRAT intergenic region (IGR) classifications as shown in Table 1. The assumption that significant overlaps between genes, or between genes and regulatory regions, do not occur is certainly not universally correct. However, the population of called genes with these apparent characteristics is generally small in the Archaea. This scarcity brings into question the boundary calls for those genes that do show significant overlaps. Of note, several of the genomes listed in Table 1 have been significantly reannotated by NCBI. In one, the data as originally submitted showed greater than half the genes in the genome as having called-boundary conflicts of significant length. These numbers have been significantly reduced in the NCBI reannotation.


Table 1. PACRAT classifications of archaeal genomes in GenBank
 
Searching
The PACRAT database is searched through a World Wide Web interface available at http://www.biosci.ohio-state.edu/~pacrat/. This interface provides complete access to the functionally-related sequence retrieval engine, and the filtering features of the PACRAT database. The PACRAT system also allows the storage and editing of retrieval results, and MEME (20), MAST (22) and CLUSTALW (23) analyses of the results directly within the online system. Details regarding the use of these facilities are available from the ‘About’ page linked from the above URL.

RESULTS

Comparison between retrievals filtered for only clearly non-conflicting upstream regions and those made with upstream regions that have been classified as in conflict with other genes demonstrates that the elimination of clearly conflicted upstream regions removes potentially damaging noise from promoter pattern analyses. Figure 1 shows the results of MEME pattern discovery analyses for the probable promoter regions of several different functional classes of Archaeoglobus fulgidus genes. These analyses were performed on sequences that showed no upstream region conflicts. The BRE and TATA elements displayed in the patterns are statistically significant components of archaeal promoters (24). Figure 2 shows the results of a similar MEME analysis performed on the naive upstream regions of A.fulgidus carbohydrate metabolism genes that significantly overlap other genes. While the information content of this pattern is comparable to the canonical promoter patterns shown in Figure 1, the positions at which this pattern is found are not recognisably correlated. Also of note, no recognisable promoter-like pattern was found in these conflicted upstream regions. It should be understood that discarding sequences with possible conflicts does risk the discard of some data that is valid. However, in pattern-discovery analyses, the presence of corrupting, invalid data, especially that which can contain strong but non-relevant patterns, can be a larger problem than a lack of some valid data items. It remains the responsibility of the user to determine the significance and eventual disposition of sequences that PACRAT marks as conflicted, and in fact, this very iterative process may be useful in refining the quality of the base data.



Figure 1. A comparison of the information content and multilevel consensus for promoter-like patterns found in the upstream sequence regions associated with A.fulgidus RNA genes, ABC-transporter genes, and tRNA biosynthesis genes. The consensus and information content diagrams are aligned on the leading AA pair followed by the pair of low-information-content positions. These patterns occur consistently near position -25 from the called gene starts.

 


Figure 2. A pattern with significant information content is found in the upstream regions of A.fulgidus carbohydrate metabolism genes that PACRAT classifies as having conflicted upstream regions. The position of this subpattern in the upstream regions ranges from 69 to 1508 bases before the called gene start position with an average position of -450, and a standard deviation of 479 positions.

 
FUTURE DEVELOPMENTS

The PACRAT system will continue to monitor and analyse GenBank archaeal submissions, and eubacterial submissions as requested. Because the impact range feature of the database has been productive in the analysis of the Pyrococcus archaeal trio for snoRNA localities (25), the analysis of assorted archaeal genomes for tRNA intron-endonuclease cut sites, and in the annotation of the Haemophilus ducreyi genome, we anticipate making queries based on these data available through the web interface as well.
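A coordinate-based query over stored impact ranges could, in principle, be as simple as the following sketch. How PACRAT computes or stores impact ranges is not described in detail here, so the `Feature` record and its `impact_lo` and `impact_hi` fields are hypothetical stand-ins for whatever extent the database records.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    feature_id: str
    impact_lo: int  # lowest genome coordinate in the feature's impact range
    impact_hi: int  # highest genome coordinate in the feature's impact range


def features_at(features: List[Feature], coordinate: int) -> List[str]:
    """Return ids of features whose impact range contains the coordinate."""
    return [
        f.feature_id
        for f in features
        if f.impact_lo <= coordinate <= f.impact_hi
    ]


# Made-up impact ranges for three features.
features = [
    Feature("f1", 50, 500),
    Feature("f2", 300, 900),
    Feature("f3", 1000, 1500),
]

near_400 = features_at(features, 400)
near_950 = features_at(features, 950)
```

In a relational setting the same test is a single range condition on two indexed columns, so no gene-by-gene chaining is needed to find what lies near a position of interest.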

REFERENCES

  1. Ray,W.C. and Daniels,C.J. (2001) The PACRAT system: an extensible WWW-based system for correlated sequence retrieval, storage and analysis. Bioinformatics, 17, 100–104.

  2. Benson,D.A., Boguski,M.S., Lipman,D.J., Ostell,J., Ouellette,B.F., Rapp,B.A. and Wheeler,D.I. (1999) GenBank. Nucleic Acids Res., 27, 12–17.

  3. Goto,S. and Kanehisa,M. (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res., 28, 27–30.

  4. Tang,T.H., Bachellerie,J.P., Rozhdestvensky,T., Bortolin,M.L., Huber,H., Drungowski,M., Elge,T., Brosius,J. and Huttenhofer,A. (2002) Identification of 86 candidates for small non-messenger RNAs from the archaeon Archaeoglobus fulgidus. Proc. Natl Acad. Sci. USA, 99, 7536–7541.

  5. Klein,R.J., Misulovin,Z. and Eddy,S.R. (2002) Noncoding RNA genes identified in AT-rich hyperthermophiles. Proc. Natl Acad. Sci. USA, 99, 7542–7547.

  6. Kawarabayasi,Y., Hino,Y., Horikawa,H., Yamazaki,S., Haikawa,Y., Jin-no,K., Takahashi,M., Sekine,M., Baba,S., Ankai,A. et al. (1999) Provisional Refseq: Revised by NCBI Microbial Genomes Annotation Project. (2001) GenBank accession no. NC_000854.1.

  7. Klenk,H.P., Clayton,R.A., Tomb,J.F., White,O., Nelson,K.E., Ketchum,K.A., Dodson,R.J., Gwinn,M., Hickey,E.K., Peterson,J.D. et al. (1997) Provisional Refseq: Revised by NCBI Microbial Genomes Annotation Project. GenBank accession no. NC_000917.1.

  8. Ng,W.V., Kennedy,S.P., Mahairas,G.G., Berquist,B., Pan,M., Shukla,H.D., Lasky,S.R., Baliga,N., Thorsson,V., Sbrogna,J. et al. (2000) Provisional Refseq: Revised by NCBI Microbial Genomes Annotation Project. (2001) GenBank accession no. NC_002607.1.

  9. Bult,C.J., White,O., Olsen,G.J., Zhou,L., Fleischmann,R.D., Sutton,G.G., Blake,J.A., FitzGerald,L.M., Clayton,R.A., Gocayne,J.D. et al. (1996) Provisional Refseq: Revised by NCBI Microbial Genomes Annotation Project. (2001) GenBank accession no. NC_000909.1.

  10. Smith,D.R., Doucette-Stamm,L.A., Deloughery,C., Lee,H.-M., Dubois,J., Aldredge,T., Bashirzadeh,R., Blakely,D., Cook,R., Gilbert,K. et al. (1997) Provisional Refseq: Revised by NCBI Microbial Genomes Annotation Project. (2001) GenBank accession no. NC_000916.1.

  11. Slesarev,A.I., Mezhevaya,K.V., Makarova,K.S., Polushin,N.N., Shcherbinina,O.V., Shakhova,V.V., Belova,G.I., Aravind,L., Natale,D.A., Rogozin,I.B. et al. (2002) Provisional Refseq. GenBank accession no. NC_003551.1.

  12. Birren,B. (2002) Provisional Refseq. GenBank accession no. NC_003552.1.

  13. Deppenmeier,U., Johann,A., Hartsch,T., Merkl,R., Schmitz,R.A., Martinez-Arias,R., Henne,A., Weizer,A., Baeumer,S., Jacobi,C. et al. (2001) Provisional Refseq. GenBank accession no. NC_003901.1.

  14. Fitz-Gibbon,S.T., Ladner,H., Kim,U.J., Stetter,K.O., Simon,M.I. and Miller,J.H. (2001) Provisional Refseq. GenBank accession no. NC_003364.1.

  15. Heilig,R. (1999) Provisional Refseq: Revised by NCBI Microbial Genomes Annotation Project. (2001) GenBank accession no. NC_000868.1.

  16. Weiss,R.B. (2002) Provisional Refseq. GenBank accession no. NC_003413.1.

  17. Kawarabayasi,Y., Sawada,M., Horikawa,H., Haikawa,Y., Hino,Y., Yamamoto,S., Sekine,M., Baba,S., Kosugi,H., Hosoyama,A. et al. (1998) Provisional Refseq: Revised by NCBI Microbial Genomes Annotation Project. (2001) GenBank accession no. NC_000961.1.

  18. NCBI Microbial Genomes Annotation Project (2001) Provisional Refseq. GenBank accession no. NC_002754.1.

  19. Kawarabayasi,Y., Hino,Y., Horikawa,H., Jin-no,K., Takahashi,M., Sekine,M., Baba,S., Ankai,A., Kosugi,H, Hosoyama,A. et al. (2001) Provisional Refseq: Revised by NCBI Microbial Genomes Annotation Project. (2001) GenBank accession no. NC_003106.1.

  20. Ruepp,A., Graml,W., Santos-Martinez,M.L., Koretke,K.K., Volker,C., Mewes,H.W., Frishman,D, Stocker,S., Lupas,A.N. and Baumeister,W. et al. (2000) Provisional Refseq: Revised by NCBI Microbial Genomes Annotation Project. (2001) GenBank accession no. NC_002578.1.

  21. Kawashima,T., Kawashima,T., Yamamoto,Y., Aramaki,H., Nunoshiba,T., Kawamoto,T., Watanabe,K., Yamazaki,M., Kanehori,K., Amano,N. et al. (2000) Reviewed Refseq: Revised by NCBI Microbial Genomes Annotation Project. (2001) GenBank accession no. NC_002689.2.

  22. Bailey,T.L. and Gribskov,M. (1998) Combining evidence using p-values: application to sequence homology searches. Bioinformatics, 14, 48–54.

  23. Higgins,D.G. and Sharp,P.M. (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73, 237–244.

  24. Soppa,J. (2001) Basal and regulated transcription in archaea. Adv. Appl. Microbiol., 50, 171–217.

  25. Ray,W.C., Munson,R.S. and Daniels,C.J. (2001) Tricross: using dot-plots in sequence-id space to detect uncataloged intergenic features. Bioinformatics, 17, 1105–1112.

