Nucleic Acids Research, 2000, Vol. 28, No. 1 193-196
© 2000 Oxford University Press
UTRdb and UTRsite: specialized databases of sequences and functional elements of 5' and 3' untranslated regions of eukaryotic mRNAs
1Dipartimento di Fisiologia e Biochimica Generali, Università di Milano, via Celoria 26, 20133 Milano, Italy, 2Area di Ricerca di Bari, Consiglio Nazionale delle Ricerche (CNR), via Amendola 166/5, 70126 Bari, Italy, 3Centro di Studio sui Mitocondri e Metabolismo Energetico del Consiglio Nazionale delle Ricerche (CNR), via Orabona 4, 70126 Bari, Italy, 4Dipartimento di Biochimica e Biologia Molecolare, Università di Bari, via Orabona 4, 70126 Bari, Italy and 5National Center for Biotechnology Information, NLM-NIH, Bethesda, MD, USA
Received September 30, 1999; Accepted October 4, 1999.
| ABSTRACT |
|---|
The 5' and 3' untranslated regions of eukaryotic mRNAs may play a crucial role in the regulation of gene expression by controlling mRNA localization, stability and translational efficiency. For this reason we developed UTRdb, a specialized, non-redundant database of 5' and 3' untranslated sequences of eukaryotic mRNAs. UTRdb entries are enriched with specialized information not present in the primary databases, including the presence of nucleotide sequence patterns already shown experimentally to have a functional role. All these patterns have been collected in the UTRsite database, so that any input sequence can be searched for annotated functional motifs. Furthermore, UTRdb entries have been annotated for the presence of repetitive elements. All internet resources implemented for retrieval and functional analysis of 5' and 3' untranslated regions of eukaryotic mRNAs are accessible at http://bigarea.area.ba.cnr.it:8000/EmbIT/UTRHome/
| INTRODUCTION |
|---|
Understanding the basic mechanisms of cell growth, differentiation and response to environmental stimuli, i.e., the program controlling the temporal and spatial order of molecular events, is becoming a real challenge in molecular biology. Although most regulatory elements are thought to be embedded in the non-coding part of genomes, nucleotide databases are biased towards expressed sequences, which mostly correspond to the protein-coding portion of genes. Among non-coding regions, the 5' and 3' untranslated regions (5'-UTR and 3'-UTR) of eukaryotic mRNAs have often been experimentally demonstrated to contain sequence elements crucial for many aspects of gene regulation and expression (1-7).
The main functional roles so far demonstrated for 5'- and 3'-UTR sequences are: (i) control of mRNA cellular and subcellular localization (4,7,8); (ii) control of mRNA stability (1,9); and (iii) control of mRNA translation efficiency (10,11).
Several regulatory signals have already been identified in 5'- and 3'-UTR sequences. They usually correspond to short oligonucleotide tracts, which may also fold into specific secondary structures, and which serve as binding sites for various regulatory proteins.
The analysis of large collections of functionally equivalent sequences (12,13), such as 5'- and 3'-UTR sequences, could be very useful for defining their structural and compositional features, as well as for searching for putative function-associated sequence patterns (14-16). For this reason we constructed UTRdb, a specialized, non-redundant collection of 5'- and 3'-UTR sequences from eukaryotic mRNAs.
UTRdb entries have been enriched with specialized information not present in the primary databases, including the presence of sequence patterns demonstrated by experimental evidence to play some functional role. Additionally, because ~10% of mammalian mRNAs contain repetitive elements in their UTRs (17) which are not usually annotated in the original records, we decided to include this information in our database.
We also created UTRsite, a collection of functional sequence patterns located in the 5'- or 3'-UTR sequences which could prove very useful for automatic annotation of anonymous sequences generated by sequencing projects, as well as for finding previously undetected signals in known gene sequences.
| ASSEMBLING UTRdb COLLECTIONS |
|---|
The specialized database of UTR sequences was generated by UTRdb_gen, a computer program we devised for this task. Eight sequence collections were generated for both 5'- and 3'-UTR sequences, one for each of the eukaryotic divisions of the EMBL/GenBank nucleotide database, namely: (i) Human; (ii) Rodent; (iii) Other mammal; (iv) Other vertebrate; (v) Invertebrate; (vi) Plant; (vii) Fungi; and (viii) Patent.
UTRdb_gen generates the various UTRdb collections automatically by parsing the Feature Table of the relevant EMBL entries. Although 5'UTR and 3'UTR are valid feature keys for EMBL/GenBank entries, only a small percentage of entries are adequately annotated. Indeed, of the 120 767 primary entries from which UTRdb_gen was able to extract 5'- or 3'-UTR sequences, only 15.8% contained the 5'UTR or 3'UTR feature key in the corresponding EMBL entry. UTRdb_gen can define UTR regions even when these keys are absent from the primary entry by applying predefined syntactic parsing rules to other relevant feature keys, such as mRNA, CDS, exon and intron.
UTRdb_gen automatically annotates generated UTR entries by adding some specialized information such as completeness (or not) of the UTR region, number of spanned exons and cross-referencing to the primary database entry. A cross reference between 5'- and 3'-UTR sequences from the same mRNA has also been established.
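The source of UTRdb_gen is not reproduced here, so the following is only a minimal sketch of how UTR segments might be derived from exon and CDS coordinates when no explicit 5'UTR or 3'UTR key is present. It assumes a forward-strand gene, 1-based inclusive coordinates, and a CDS location that includes the stop codon; the function names and the example gene are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (assumed coordinate convention)


def split_utrs(exons: List[Interval], cds_start: int, cds_end: int):
    """Return (five_prime, three_prime) UTR segments for a forward-strand mRNA.

    cds_start is the first base of the start codon; cds_end is the last base
    of the stop codon (assumption: the CDS location includes the stop codon).
    """
    five, three = [], []
    for start, end in sorted(exons):
        if start < cds_start:   # exon (or part of it) lies upstream of the CDS
            five.append((start, min(end, cds_start - 1)))
        if end > cds_end:       # exon (or part of it) lies downstream of the CDS
            three.append((max(start, cds_end + 1), end))
    return five, three


def describe(segments: List[Interval]) -> Dict[str, int]:
    """Summarise a UTR: total length and number of exons it spans."""
    return {
        "length": sum(end - start + 1 for start, end in segments),
        "spanned_exons": len(segments),
    }


# Hypothetical gene: three exons, CDS running from position 150 to 620.
exons = [(1, 100), (141, 400), (501, 700)]
five, three = split_utrs(exons, cds_start=150, cds_end=620)
utr5 = describe(five)    # 5'-UTR split across the first two exons
utr3 = describe(three)   # 3'-UTR contained in the last exon
```

A real implementation would also need to handle the reverse strand, partial features and entries lacking an mRNA or exon key, none of which are covered by this sketch.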
Redundancy was removed from the UTR entries with the CLEANUP program (18), which quickly and automatically generates cleaned collections by removing any entry whose similarity to, and overlap with, a longer entry in the database exceed user-defined thresholds. Here, the cut-off parameters used for CLEANUP were 95% for similarity and 90% for overlap.
The UTR entries were further enriched using the program UTRnote (kindly provided by G. Grillo, Area di Ricerca di Bari del Consiglio Nazionale delle Ricerche), which adds information about the location of experimentally defined patterns collected in UTRsite and of repetitive elements present in the Repbase database (19). The UTRsite entries describe the various regulatory elements present in UTR regions whose functional role has been established experimentally. Each UTRsite entry is constructed on the basis of information reported in the literature and revised by scientists working experimentally on the functional characterization of the relevant UTR regulatory element.
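The thresholding idea can be illustrated with a short sketch. This is not the CLEANUP algorithm: Python's `difflib.SequenceMatcher` is swapped in as a crude stand-in for a proper sequence alignment, and the definitions of similarity and overlap used below are assumptions made for illustration only.

```python
from difflib import SequenceMatcher
from typing import Dict, Tuple


def similarity_and_overlap(longer: str, shorter: str) -> Tuple[float, float]:
    """Return (similarity, overlap) of `shorter` against `longer`.

    Assumed definitions: overlap is the fraction of `shorter` covered by the
    matched stretch; similarity is the fraction of that stretch that matches.
    """
    matcher = SequenceMatcher(None, longer, shorter, autojunk=False)
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    if not blocks:
        return 0.0, 0.0
    matched = sum(b.size for b in blocks)
    span = (blocks[-1].b + blocks[-1].size) - blocks[0].b
    return matched / span, span / len(shorter)


def remove_redundant(entries: Dict[str, str],
                     min_similarity: float = 0.95,
                     min_overlap: float = 0.90) -> Dict[str, str]:
    """Visit entries from longest to shortest; drop those that duplicate a kept one."""
    kept: Dict[str, str] = {}
    for name, seq in sorted(entries.items(), key=lambda kv: -len(kv[1])):
        redundant = False
        for kept_seq in kept.values():
            sim, ovl = similarity_and_overlap(kept_seq, seq)
            if sim >= min_similarity and ovl >= min_overlap:
                redundant = True
                break
        if not redundant:
            kept[name] = seq
    return kept
```

The default thresholds mirror the 95% similarity and 90% overlap cut-offs quoted in the text. The pairwise comparison is quadratic in the number of entries, so this form is only suitable for small illustrative collections.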
| CONTENT OF UTRdb |
|---|
Table 1 reports a summary description of UTRdb (release 12.0) which in total contains 120 767 entries and 37 353 172 nucleotides. On average, >29.3% of entries proved to be redundant and were removed from the database.
5'-UTR sequences were defined as the mRNA region spanning from the cap site to the start codon (excluded), whereas 3'-UTR sequences were defined as the mRNA region spanning from the stop codon (excluded) to the start of the poly-A tail.
A sample UTRdb entry is shown in Figure 1. The UTRdb entries have been formatted according to the EMBL database format.
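As a rough illustration of reading such a flat file, the sketch below groups the lines of an entry by their two-letter line code and collects the sequence block. The record shown is invented for this example and is not a real UTRdb entry; it only follows the general EMBL layout (two-letter line codes, a sequence block introduced by `SQ`, and `//` as the entry terminator).

```python
from collections import defaultdict

# Hypothetical record: identifier, description and sequence are invented.
RECORD = """\
ID   EXAMPLE001; 3'UTR; 24 BP.
DE   Invented example of a 3'UTR entry
SQ   Sequence 24 BP;
     acgtacgtaa tttatttagc atgc        24
//
"""


def parse_embl(text):
    """Yield one dict per entry: line code -> list of contents, plus 'sequence'."""
    fields, seq, in_seq = defaultdict(list), [], False
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("//"):      # end of entry
            yield {**fields, "sequence": "".join(seq)}
            fields, seq, in_seq = defaultdict(list), [], False
        elif in_seq:                   # sequence block: keep letters only
            seq.append("".join(ch for ch in line if ch.isalpha()))
        else:
            code, content = line[:2], line[5:]
            fields[code].append(content)
            in_seq = code == "SQ"


entry = next(parse_embl(RECORD))
```

Feature-table (`FT`) lines would need further parsing to recover locations and qualifiers, which this sketch does not attempt.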
Table 2 reports the functional patterns and repetitive elements included in UTRsite (release 3.0). More entries will be included in future releases. A sample UTRsite entry is reported in Figure 2. Functional patterns, defined on the basis of information reported in the literature and/or advice from experts in the field, are described using the pattern description syntax of the PATSCAN program (20).
| AVAILABILITY OF UTRdb |
|---|
UTRdb and UTRsite are publicly available by anonymous FTP (ftp://area.ba.cnr.it/pub/embnet/database/utr/). All internet resources we implemented for retrieval and functional analysis of 5'- and 3'-UTR sequences are accessible at http://bigarea.area.ba.cnr.it:8000/EmbIT/UTRHome/ (21). These include SRS retrieval (22) of UTRdb and UTRsite, also available at the EBI WWW server (http://srs.ebi.ac.uk:80/), UTRscan and UTRfasta. The UTRscan utility allows users to search submitted sequences for any of the patterns collected in UTRsite. The UTRfasta utility allows database searches against fully annotated UTRdb entries.
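The kind of search UTRscan performs can be sketched as follows. The sketch uses Python regular expressions as a simplified stand-in for PatScan pattern descriptions, and the two motifs (the AUUUA pentamer and the AAUAAA/AUUAAA hexamers) are illustrative choices, not the patterns actually stored in UTRsite.

```python
import re
from typing import Dict, List, Tuple

# Illustrative motif table; names and patterns are assumptions for this sketch.
MOTIFS: Dict[str, str] = {
    "example_AUUUA": r"ATTTA",               # AUUUA written in the DNA alphabet
    "example_AAUAAA_like": r"A[AT]TAAA",     # AAUAAA or AUUAAA
}


def scan(sequence: str, motifs: Dict[str, str] = MOTIFS) -> List[Tuple[str, int, int, str]]:
    """Return (motif name, start, end, matched text); positions are 1-based inclusive."""
    seq = sequence.upper().replace("U", "T")
    hits = []
    for name, pattern in motifs.items():
        # The lookahead wrapper lets overlapping occurrences all be reported.
        for m in re.finditer(f"(?=({pattern}))", seq):
            hits.append((name, m.start(1) + 1, m.end(1), m.group(1)))
    return sorted(hits, key=lambda hit: hit[1])


hits = scan("GGAUUUAUUUAGCAAUAAAGG")
```

Plain regular expressions cannot express the structural constraints, such as base-paired stems, that a dedicated pattern language allows, so this approach only approximates searches for purely sequence-based motifs.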
| CONCLUSIONS AND PERSPECTIVES |
|---|
The important role that untranslated regions of eukaryotic mRNAs may play in gene regulation and expression is now widely recognized. Indeed, experimental studies have demonstrated that sequence motifs located in the untranslated regions are involved in crucial biological functions.
The large number of functionally equivalent sequences stored in UTRdb now makes it possible to study their structural and compositional features and to apply statistical methods for the identification of significant signals. Databases must first be cleaned, however, to avoid artefacts caused by redundant sequences. Although statistical significance does not necessarily imply biological significance, it may provide a useful indication for further experimental work, such as site-directed mutagenesis.
UTRdb will be updated with the new EMBL database releases and UTRsite will be continuously updated by adding new entries describing functional patterns whose biological role has been experimentally demonstrated.
| ACKNOWLEDGEMENTS |
|---|
For revision of UTRsite entries we would like to thank Jim Malter (APP 3'-UTR stability control element), Alain Krol (SECIS), Matthias Hentze (IRE and 15-LOX DICE), Bill Marzluff (histone stem-loop structure), Ann-Bin Shyu (ARE), Arturo Verrotti (CPE), Robin Wharton (nanos), Elizabeth Goodwin (TGE), Roger Kaspar (ribosomal protein mRNA TOP), Danuta Radzioch (TNF mRNA translation repression element), Ruben Boado (GLUT1 mRNA stabilising element) and Zendra E. Zehner (Vimentin 3'UTR mRNA element). This work was supported by EU grant ERB-BIO4-CT96-0030 and by Programma Biotecnologie legge 95/95 (MURST 5%).
| FOOTNOTES |
|---|
* To whom correspondence should be addressed at: Dipartimento di Fisiologia e Biochimica Generali, Università di Milano, via Celoria 26, 20133 Milano, Italy. Tel: +39 02 7064 4803; Fax: +39 02 7063 2811; Email: graziano.pesole@unimi.it
| REFERENCES |
|---|
1 Decker,C.J. and Parker,R. (1994) Trends Biochem. Sci., 19, 336-340.
2 Kaufman,R.J. (1994) Curr. Opin. Biotech., 5, 550-557.
3 Klausner,R.D., Rouault,T.A. and Harford,J.B. (1993) Cell, 72, 19-28.
4 Singer,R.H. (1992) Curr. Opin. Cell Biol., 4, 15-19.
5 Wilhelm,J.E. and Vale,R.D. (1993) J. Cell Biol., 123, 269-274.
6 McCarthy,J.E.G. and Kollmus,H. (1995) Trends Biochem. Sci., 20, 191-197.
7 Bashirullah,A., Cooperstock,R.L. and Lipshitz,H.D. (1998) Annu. Rev. Biochem., 67, 335-394.
8 Johnston,D. (1995) Cell, 81, 161-170.
9 Beelman,C.A. and Parker,R. (1995) Cell, 81, 179-183.
10 Curtis,D., Lehman,R. and Zamore,P.D. (1995) Cell, 81, 171-178.
11 Sonenberg,N. (1994) Curr. Opin. Genet. Dev., 4, 310-315.
12 Mengeritsky,G. and Smith,T.F. (1987) Comput. Appl. Biosci., 3, 223-227.
13 Konopka,A.K. (1994) In Smith,D.W. (ed.), Informatics and Genome Projects. Academic Press, San Diego, CA.
14 Pesole,G., Liuni,S., Grillo,G. and Saccone,C. (1997) Gene, 205, 95-102.
15 Pesole,G., Grillo,G. and Liuni,S. (1996) Comp. Chem., 20, 141-144.
16 Pesole,G., Fiormarino,G. and Saccone,C. (1994) Gene, 140, 219-225.
17 Makalowski,W., Zhang,J. and Boguski,M. (1996) Genome Res., 6, 846-857.
18 Grillo,G., Attimonelli,M., Liuni,S. and Pesole,G. (1996) Comput. Appl. Biosci., 12, 1-8.
19 Jurka,J. (1998) Curr. Opin. Struct. Biol., 8, 333-337.
20 Dsouza,M., Larsen,N. and Overbeek,R. (1997) Trends Genet., 13, 497-498.
21 Pesole,G. and Liuni,S. (1999) Trends Genet., 15, 379-380.
22 Etzold,T., Ulyanov,A. and Argos,P. (1996) Methods Enzymol., 266, 114-128.
23 Hentze,M.W. and Kuhn,L.C. (1996) Proc. Natl Acad. Sci. USA, 93, 8175-8182.
24 Williams,A.S. and Marzluff,W.F. (1995) Nucleic Acids Res., 23, 654-662.
25 Chen,C. and Shyu,A. (1995) Trends Biochem. Sci., 20, 465-470.
26 Goodwin,E.B., Okkema,P.G., Evans,T.C. and Kimble,J. (1993) Cell, 75, 329-339.
27 Hubert,N., Walczak,R., Sturchler,C., Schuster,C., Westhof,E., Carbon,P. and Krol,A. (1996) Biochimie, 78, 590-596.
28 Walczak,R., Westhof,E., Carbon,P. and Krol,A. (1996) RNA, 2, 367-379.
29 Zaidi,S.H.E. and Malter,J.S. (1994) J. Biol. Chem., 269, 24007-24013.
30 Verrotti,A., Thompson,S., Wreden,C., Strickland,S. and Wickens,M. (1996) Proc. Natl Acad. Sci. USA, 93, 9027-9032.
31 Dahanukar,A. and Wharton,R. (1996) Genes Dev., 10, 2610-2620.
32 Amaldi,F. and Pierandrei-Amaldi,P. (1997) Prog. Mol. Subcell. Biol., 18, 1-17.
33 Kaspar,R.L., Kakegawa,T., Cranston,H., Morris,D.R. and White,M.W. (1992) J. Biol. Chem., 267, 508-514.
34 Morris,D.R., Kakegawa,T., Kaspar,R.L. and White,M.W. (1993) Biochemistry, 32, 2931-2937.
35 Hel,Z., Di Marco,S. and Radzioch,D. (1998) Nucleic Acids Res., 26, 2803-2812.
36 Zehner,Z.E., Shepherd,R.K., Gabryszuk,J., Fu,T.F., Al-Ali,M. and Holmes,W.M. (1997) Nucleic Acids Res., 25, 3362-3370.
37 Boado,R.J. and Pardridge,W.M. (1998) Brain Res. Mol. Brain Res., 59, 109-113.
38 Ostareck-Lederer,A., Ostareck,D., Standart,N. and Thiele,B. (1994) EMBO J., 13, 1476-1481.