Nucleic Acids Research, 2005, Vol. 33, Database issue D141-D146
© 2005, the authors
Nucleic Acids Research, Vol. 33, Database issue © Oxford University Press 2005; all rights reserved
UTRdb and UTRsite: a collection of sequences and regulatory motifs of the untranslated regions of eukaryotic mRNAs
1 Dipartimento di Scienze Biomolecolari e Biotecnologie, Università di Milano, via Celoria 26, 20133 Milano, Italy, 2 Sezione di Bioinformatica e Genomica, Istituto Tecnologie Biomediche del Consiglio Nazionale delle Ricerche (CNR), via Amendola 165/A, 70126 Bari, Italy, 3 Dipartimento di Biochimica e Biologia Molecolare, Università di Bari, via Orabona 4, 70126 Bari, Italy and 4 EMBL Outstation, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
* To whom correspondence should be addressed. Tel: +39 02 50314915; Fax: +39 02 50314912; Email: graziano.pesole{at}unimi.it
Received September 15, 2004; Accepted September 18, 2004
| ABSTRACT |
|---|
The 5' and 3' untranslated regions of eukaryotic mRNAs play crucial roles in the post-transcriptional regulation of gene expression through the modulation of nucleo-cytoplasmic mRNA transport, translation efficiency, subcellular localization and message stability. UTRdb is a curated database of 5' and 3' untranslated sequences of eukaryotic mRNAs, derived from several sources of primary data. Experimentally validated functional motifs are annotated (and also collated as the UTRsite database) and cross-links to genomic and protein data are provided. The integration of UTRdb with genomic and protein data has allowed the implementation of a powerful retrieval resource for the selection and extraction of UTR subsets based on their genomic coordinates and/or features of the protein encoded by the relevant mRNA (e.g. GO term, PFAM domain, etc.). All internet resources implemented for retrieval and functional analysis of 5' and 3' untranslated regions of eukaryotic mRNAs are accessible at http://www.ba.itb.cnr.it/UTR/.
| INTRODUCTION |
|---|
One of the main challenges of the post-genomic era is the understanding of the mechanisms that control the spatio-temporal regulation of gene expression. The fate of newly synthesized mRNA with respect to its nucleo-cytoplasmic transport, stability, translation efficiency and subcellular localization is determined at the post-transcriptional level. Such regulation is mostly mediated by cis-acting elements located in the 5' and 3' untranslated regions of mRNAs (5' UTR and 3' UTR) (1).
In several cases, specific functional sequence elements have been identified and characterized. These usually correspond to short oligonucleotide tracts whose biological activity relies on a combination of their primary sequence and specific secondary structure. These motifs act either as target sites for RNA-binding factors or interact directly with the translation machinery.
The availability of a large collection of functionally related sequences, such as UTRs, is invaluable for the inference of structural and compositional features and for the identification of conserved candidate regulatory motifs. For this reason, we have developed UTRdb, a collection of 5' and 3' UTR sequences derived from eukaryotic mRNAs. Sequences collated in UTRdb were generated by custom software. UTRdb is a non-redundant database, and its annotation includes information not available in the primary databases, such as genome localization and structure and the presence of known regulatory elements.
We have also created UTRsite, a collection of regulatory elements located in 5' and 3' UTRs whose function and structure have been experimentally determined and published. The UTRsite collection may prove useful in automatic annotation projects of unknown sequences as well as for finding previously undetected signals in known sequences.
For the most recent release of the UTRdb and UTRsite databases, we have focused on the improvement of data quality, increasing the degree of integration with other resources and the incorporation of genome-related facilities. Besides a new graphical interface, we have introduced new specific UTR collections: (i) UTRef from RefSeq database (2); (ii) UTRait from TRAIT database of muscle-specific transcripts (3); and (iii) UTRexp, a collection of UTR sequences whose functional activity has been experimentally investigated (see below for details). The UTRsite collection of functional motifs has also been significantly expanded. Moreover, we have mapped human UTRs on genome assemblies, facilitating the direct comparison and integration of several annotated genomic features available through batch queries of Ensembl databases.
The integration of UTRs with protein and genomic resources is valuable because it allows the retrieval of specific UTR subsets based on their genomic coordinates and/or features associated with the encoded proteins (e.g. GO terms, PFAM domains, etc.).
| GENERATION OF UTRdb AND ITS INTEGRATION WITH OTHER DATABASES |
|---|
UTRdb entries are automatically generated through the accurate parsing of the Feature Table of entries in primary databases (e.g. EMBL). Entry curation includes the detection of contaminating vector sequences, the removal of sequence redundancy and the annotation of repetitive elements and known regulatory motifs collected in the UTRsite database. Details of this process can be found in (4).
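The core of this step can be illustrated with a minimal sketch. It assumes a single, complete mRNA entry whose CDS location is given as 1-based inclusive coordinates, in the style of a Feature Table location such as `7..15`. The function name `split_utrs` and the example sequence are hypothetical; the actual UTRdb software, described in (4), also deals with cases this sketch ignores (joined or partial locations, vector contamination, redundancy removal).

```python
from typing import NamedTuple


class UTRs(NamedTuple):
    five_prime: str
    three_prime: str


def split_utrs(mrna: str, cds_start: int, cds_end: int) -> UTRs:
    """Split an mRNA into its 5' and 3' UTRs.

    cds_start and cds_end are 1-based inclusive coordinates of the CDS
    (an assumption of this sketch, mirroring a simple 'start..end' location).
    """
    if not (1 <= cds_start <= cds_end <= len(mrna)):
        raise ValueError("CDS location lies outside the sequence")
    five_prime = mrna[: cds_start - 1]   # everything before the start codon
    three_prime = mrna[cds_end:]         # everything after the stop codon
    return UTRs(five_prime, three_prime)


# Hypothetical 21-nt transcript: 6 nt 5' UTR, 9 nt CDS, 6 nt 3' UTR.
example = split_utrs("GGCACCATGGCTTAAGCTTTT", 7, 15)
print(example.five_prime, example.three_prime)
```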
The current release of UTRdb contains three further specialized divisions: UTRef, UTRait and UTRexp. Sequences collected in UTRef and UTRait have been generated from the RefSeq (2) and TRAIT (3) databases, respectively. UTRexp contains UTRs that have been investigated experimentally and shown to contain functional motifs. Some of these sequences are not present in primary sequence databases and have been manually extracted from literature resources.
In the current release, we have also determined the genomic coordinates of human UTR sequences using the program BLAT (5) with the human genome assembly (Release NCBI 34). Only those UTRs that unambiguously mapped to a single genomic location were considered. Exonic structure of mapped UTRs was then refined by applying the program Spidey (6) to compare the UTR and its corresponding genomic location.
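The "single genomic location" filter can be sketched as follows. The text does not state the exact criterion for an unambiguous mapping, so the identity threshold, the `Hit` record and the function `uniquely_mapped` are all assumptions made for illustration; they are not the BLAT output format or the authors' actual rule.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    utr_id: str
    chrom: str
    start: int
    end: int
    identity: float  # fraction of aligned bases that match (assumed field)


def uniquely_mapped(hits: Iterable[Hit], min_identity: float = 0.99) -> Dict[str, Hit]:
    """Keep UTRs with exactly one hit at or above min_identity (assumed rule)."""
    by_utr: Dict[str, List[Hit]] = defaultdict(list)
    for hit in hits:
        if hit.identity >= min_identity:
            by_utr[hit.utr_id].append(hit)
    # A UTR is retained only if a single qualifying location remains.
    return {utr_id: kept[0] for utr_id, kept in by_utr.items() if len(kept) == 1}


hits = [
    Hit("UTR_A", "chr1", 1000, 1250, 1.00),
    Hit("UTR_B", "chr2", 500, 800, 0.995),
    Hit("UTR_B", "chr7", 9100, 9400, 0.995),  # second location: ambiguous
    Hit("UTR_C", "chr3", 40, 90, 0.80),       # below threshold: discarded
]
unique = uniquely_mapped(hits)
print(sorted(unique))
```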
We attempted to associate each mapped UTR with the specific protein encoded by the corresponding mRNA using the relevant Ensembl coordinates. A protein was defined as a neighbor of a 5' UTR if its start site corresponds to the end of the 5' UTR sequence (and the converse for 3' UTRs). Once the neighbor protein of a given UTR entry had been defined, we were also able to identify the Ensembl transcripts cross-referenced to the neighbor protein. If, for a given UTR entry, no annotated protein matched our criteria, we associated any Ensembl gene overlapping the same genomic region with the UTR.
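The neighbor rule can be expressed as a small coordinate check. The sketch below reads "corresponds to" as exact adjacency on the same chromosome and strand, which is an assumption; the `Region` record and the function names are hypothetical, and multi-exon features are reduced to a single interval for brevity.

```python
from typing import NamedTuple, Optional, Sequence


class Region(NamedTuple):
    name: str
    chrom: str
    start: int   # 1-based inclusive genomic coordinates, start <= end
    end: int
    strand: str  # "+" or "-"


def is_neighbor(utr: Region, cds: Region, utr_kind: str) -> bool:
    """True if the coding region directly abuts the UTR.

    utr_kind is "5" or "3". On the "+" strand a 5' UTR lies at lower genomic
    coordinates than the coding region; on the "-" strand the order flips.
    """
    if (utr.chrom, utr.strand) != (cds.chrom, cds.strand):
        return False
    utr_is_left = (utr_kind == "5") == (utr.strand == "+")
    if utr_is_left:
        return utr.end + 1 == cds.start
    return cds.end + 1 == utr.start


def find_neighbor(utr: Region, proteins: Sequence[Region], utr_kind: str) -> Optional[Region]:
    """Return the first coding region adjacent to the UTR, or None."""
    for cds in proteins:
        if is_neighbor(utr, cds, utr_kind):
            return cds
    return None


coding = [Region("P1", "chr1", 200, 500, "+"), Region("P2", "chr1", 2000, 2600, "-")]
print(find_neighbor(Region("utr5_a", "chr1", 100, 199, "+"), coding, "5"))
```

When `find_neighbor` returns `None`, the fallback described above (any overlapping Ensembl gene) would apply; that step is not shown here.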
The cross-referencing of UTRs and Ensembl features (protein, transcript, gene) provides a valuable resource as UTRs automatically inherit the large body of functional features annotated with the Ensembl project (7).
We have also endeavored to cross-link the UTRdb human division with IPI (International Protein Index) (8), which contains a complete non-redundant data set representing the human proteome, derived from different curated protein databases.
In future releases of UTRdb, we plan to extend the cross-referencing between UTRs and protein/genomic resources to all other organisms included in Ensembl.
UTRdb entries (see Figure 1 for an example) are annotated for the occurrence of regulatory motifs whose activity has been assessed by experimental investigation, located in the 5' or 3' UTR of eukaryotic mRNAs. All these motifs are collected in the UTRsite database. Each UTRsite entry (Figure 2) is prepared, reviewed and updated by expert scientists (in many cases, those who performed the experimental analysis). We have now developed a Submission Tool for the generation, management and update of UTRsite entries (Figure 3). This tool allows selected annotators to annotate and update all the information in an entry in a user-friendly manner via a personal login.
The UTRdb and UTRsite databases and the new specific UTR collections (UTRef, UTRait, UTRexp) have been organized in a MySQL relational database.
| UTRdb CONTENT |
|---|
The main section of UTRdb (Release 19) contains nine sequence collections, one for each of the eukaryotic divisions of the EMBL nucleotide database (Release 78), namely (i) human; (ii) mouse; (iii) rodent; (iv) other mammal; (v) other vertebrate; (vi) invertebrate; (vii) plant; (viii) fungi; and (ix) virus.
UTRef was generated from Reference Sequence collections (RefSeq Rel. 3). Table 1 reports a summary description of UTRdb which contains 298 036 entries and 128 286 081 nucleotides. UTRsite collects a total of 52 regulatory motifs, including upstream Open Reading Frames (uORFs) with known regulatory activity, whose occurrences have been annotated in 30 370 entries of UTRef collection.
| AVAILABILITY OF UTRdb |
|---|
UTRdb and UTRsite are accessible through an SRS retrieval system, which has been updated to include the new fields added. In particular, all the information derived from genome mapping of UTRs, the IPI cross-link, and cross-referencing to neighbor proteins and Ensembl genes/transcripts is available through a new field named Genomic Features as reported in Figure 1. It is also possible to browse these fields by querying the SRS Extended query form where relevant query fields have been added.
In addition, to access all of the information indirectly linked to UTRs, we have developed a custom browsing system, the UTR genome browser (Figure 4). Through this retrieval system, it is possible to select and extract specific UTR subsets defined by accession numbers derived from a variety of databases, including IPI (8), InterPro (11), GO (12), GENEW (13), MIM (14), etc., as well as by genomic coordinates.
The user can choose to download selected entries in Fasta format or to display a summary of their relevant genomic features, including genomic coordinates and cross-references to a variety of genomic resources. A graphical representation displays selected UTRs in terms of genomic coordinates and shows other human UTRs, cDNAs, proteins and ESTs in the same genome location.
With this new tool, it is now possible to obtain specific UTR subsets from mRNAs coding for proteins of a selected protein family, containing a specific domain or belonging to a specific GO class. Further investigations of such homogeneous sets of UTRs may allow the identification of common features or conserved regions whose potential functional activity may then be experimentally characterized.
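As an illustration of this kind of selection, the sketch below filters a small in-memory list of UTR records by GO term and/or genomic interval and writes the result in Fasta format. The `UTREntry` record, its field names, the identifiers and the GO accessions are all hypothetical placeholders; the actual browser queries the underlying database rather than Python objects.

```python
from typing import FrozenSet, Iterable, List, NamedTuple, Optional, Tuple


class UTREntry(NamedTuple):
    utr_id: str
    chrom: str
    start: int
    end: int
    go_terms: FrozenSet[str]  # GO accessions inherited from the encoded protein
    sequence: str


def select(
    entries: Iterable[UTREntry],
    go_term: Optional[str] = None,
    region: Optional[Tuple[str, int, int]] = None,
) -> List[UTREntry]:
    """Return entries matching every criterion that was supplied."""
    selected = []
    for entry in entries:
        if go_term is not None and go_term not in entry.go_terms:
            continue
        if region is not None:
            chrom, start, end = region
            overlaps = entry.chrom == chrom and entry.start <= end and entry.end >= start
            if not overlaps:
                continue
        selected.append(entry)
    return selected


def to_fasta(entries: Iterable[UTREntry]) -> str:
    """Serialize entries as Fasta records, one header line per entry."""
    return "".join(f">{e.utr_id}\n{e.sequence}\n" for e in entries)


entries = [
    UTREntry("UTR_1", "chr1", 100, 199, frozenset({"GO:0000001"}), "GGCACC"),
    UTREntry("UTR_2", "chr1", 5000, 5300, frozenset({"GO:0000002"}), "GCTTTT"),
    UTREntry("UTR_3", "chr2", 150, 400, frozenset({"GO:0000001"}), "AATAAA"),
]
subset = select(entries, go_term="GO:0000001", region=("chr1", 1, 1000))
print(to_fasta(subset), end="")
```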
Further on-line utilities are UTRscan and UTRblast. UTRscan searches user-submitted sequences for any of the motifs collected in UTRsite. UTRblast runs database searches against any of the UTRdb sections.
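The idea behind a motif scan can be shown with a minimal sketch based on regular expressions. The motif table below is a hypothetical stand-in that holds only the canonical polyadenylation hexamer; it is not the UTRsite pattern set, and plain regular expressions cannot express the secondary-structure constraints that many UTR motifs require.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical motif table: name -> regular expression over the DNA alphabet.
MOTIFS: Dict[str, str] = {
    "polyA_signal": "AATAAA",
}


def scan(sequence: str, motifs: Dict[str, str] = MOTIFS) -> List[Tuple[str, int, str]]:
    """Return (motif name, 1-based start, matched text) for every occurrence.

    A lookahead is used so that overlapping occurrences are also reported.
    """
    seq = sequence.upper().replace("U", "T")  # accept RNA or DNA input
    found = []
    for name, pattern in motifs.items():
        for match in re.finditer(f"(?=({pattern}))", seq):
            found.append((name, match.start() + 1, match.group(1)))
    return found


result = scan("GGAAUAAAGGAATAAA")
print(result)
```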
UTRdb, UTRsite and other related resources are publicly available at http://www.ba.itb.cnr.it/UTR/.
| ACKNOWLEDGEMENTS |
|---|
We thank David Horner for helpful comments on the manuscript. This work was supported by Ministero dell'Istruzione e Ricerca, Italy (projects: MIUR Cluster C03/2000-CEGB, PON 2000-2006 Progetto BIG, FIRB project Bioinformatica per la Genomica e la Proteomica) and Telethon. F.M. was the recipient of an EU Marie Curie fellowship at the European Bioinformatics Institute, Hinxton, UK.
| Notes |
|---|
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use permissions, please contact journals.permissions{at}oupjournals.org.
| REFERENCES |
|---|
1. Mignone,F., Gissi,C., Liuni,S. and Pesole,G. (2002) Untranslated regions of mRNAs. Genome Biol., 3, REVIEWS0004.
2. Pruitt,K.D., Tatusova,T. and Maglott,D.R. (2003) NCBI Reference Sequence project: update and current status. Nucleic Acids Res., 31, 34-37.
3. Toppo,S., Cannata,N., Fontana,P., Romualdi,C., Laveder,P., Bertocco,E., Lanfranchi,G. and Valle,G. (2003) TRAIT (TRAnscript Integrated Table): a knowledgebase of human skeletal muscle transcripts. Bioinformatics, 19, 661-662.
4. Pesole,G., Liuni,S., Grillo,G., Licciulli,F., Mignone,F., Gissi,C. and Saccone,C. (2002) UTRdb and UTRsite: specialized databases of sequences and functional elements of 5' and 3' untranslated regions of eukaryotic mRNAs. Update 2002. Nucleic Acids Res., 30, 335-340.
5. Kent,W.J. (2002) BLAT - the BLAST-like alignment tool. Genome Res., 12, 656-664.
6. Wheelan,S.J., Church,D.M. and Ostell,J.M. (2001) Spidey: a tool for mRNA-to-genomic alignments. Genome Res., 11, 1952-1957.
7. Birney,E., Andrews,T.D., Bevan,P., Caccamo,M., Chen,Y., Clarke,L., Coates,G., Cuff,J., Curwen,V., Cutts,T. et al. (2004) An overview of Ensembl. Genome Res., 14, 925-928.
8. Kersey,P.J., Duarte,J., Williams,A., Karavidopoulou,Y., Birney,E. and Apweiler,R. (2004) The International Protein Index: an integrated database for proteomics experiments. Proteomics, 4, 1985-1988.
9. Grillo,G., Licciulli,F., Liuni,S., Sbisa,E. and Pesole,G. (2003) PatSearch: a program for the detection of patterns and structural motifs in nucleotide sequences. Nucleic Acids Res., 31, 3608-3612.
10. Griffiths-Jones,S., Bateman,A., Marshall,M., Khanna,A. and Eddy,S.R. (2003) Rfam: an RNA family database. Nucleic Acids Res., 31, 439-441.
11. Mulder,N.J., Apweiler,R., Attwood,T.K., Bairoch,A., Barrell,D., Bateman,A., Binns,D., Biswas,M., Bradley,P., Bork,P. et al. (2003) The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res., 31, 315-318.
12. Harris,M.A., Clark,J., Ireland,A., Lomax,J., Ashburner,M., Foulger,R., Eilbeck,K., Lewis,S., Marshall,B., Mungall,C. et al. (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res., 32 (Database issue), D258-D261.
13. Wain,H.M., Lush,M.J., Ducluzeau,F., Khodiyar,V.K. and Povey,S. (2004) Genew: the Human Gene Nomenclature Database, 2004 updates. Nucleic Acids Res., 32 (Database issue), D255-D257.
14. Hamosh,A., Scott,A.F., Amberger,J., Bocchini,C., Valle,D. and McKusick,V.A. (2002) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res., 30, 52-55.