ABSTRACT
A substantial fraction of vertebrate mRNAs contain long conserved blocks in their untranslated regions as well as long blocks without silent changes in their protein coding regions. These conserved blocks consist largely of sequence that is unique within the genome, leaving us with an important puzzle regarding their function. A large body of experimental data shows that these regions are associated with regulation of mRNA stability. Combining this information with the rapidly accumulating data on endogenous antisense transcripts, we propose that the conserved sequences form long perfect duplexes with antisense transcripts. The formation of such duplexes may be essential for recognition by post-transcriptional regulatory systems. The conservation may then be explained by selection against the dominant negative effect of allelic divergence.
Since the early 1980s many studies on particular genes have noted sequence conservation in the 3' untranslated regions (UTRs) of vertebrate mRNAs (1-3). Duret et al. (4) estimated that >30% of vertebrate mRNAs had conserved regions in their 3' UTRs, defined as sharing at least 70% identity over >100 nucleotides between corresponding homologous genes (orthologs). They also noted less frequent but still significant conservation in 5' UTRs. We have recently observed long stretches of protein coding regions without silent changes in a substantial fraction of vertebrate mRNAs; most of these contain unusually conserved blocks both in the coding regions and in 5' or 3' UTRs (H.Sicotte and D.Lipman, unpublished data). A representative sample from a comparison of human and mouse orthologs is shown in Table 1. These conserved sequences are essentially unique in the genome and thus match only the corresponding regions of orthologous mRNAs in other species. The observed level of conservation is far greater than expected for non-coding regions or synonymous sites in coding regions on the basis of known evolutionary rates and divergence times (5).
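The conservation criterion quoted above (at least 70% identity over >100 nucleotides) can be illustrated with a short sketch. This is a simplified reading of that criterion, not the procedure used by Duret et al.: it assumes the two orthologous sequences have already been aligned to equal length (gaps written as "-"), and the function names `percent_identity` and `has_conserved_block` are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned columns in which both sequences carry the same base."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned to the same non-zero length")
    # A gap paired with a gap is not counted as a match.
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


def has_conserved_block(a: str, b: str, min_len: int = 101,
                        min_identity: float = 70.0) -> bool:
    """True if any window of min_len aligned columns reaches min_identity percent.

    The defaults encode ">100 nucleotides" as a 101-column window and
    "at least 70% identity" as a threshold of 70.0 (both are assumptions
    about how to read the criterion).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    for start in range(len(a) - min_len + 1):
        window_a = a[start:start + min_len]
        window_b = b[start:start + min_len]
        if percent_identity(window_a, window_b) >= min_identity:
            return True
    return False


# Toy sequences, not real UTRs: an identical pair and a pair matching at every other base.
reference = "ACGT" * 30
print(has_conserved_block(reference, reference))    # True
print(has_conserved_block(reference, "TCTT" * 30))  # False
```

The window scan is quadratic in the worst case, which is acceptable for UTR-length sequences; a running count of matches would make it linear if needed.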
What function constrains these regions? Sequence-specific recognition, e.g., by RNA binding proteins, is an unlikely explanation because of the length of the conserved sequences. Furthermore, because so many different mRNAs contain these conserved regions, which are unique for each set of orthologs, sequence-specific recognition would lead to an almost infinite regress. With >30% of the genes containing these unique conserved regions, another 30% of the genes would be needed to code for the binding proteins, not to mention the proteins regulating these binding proteins, and so on. One might posit that many of these different sequences share a common RNA secondary structure, thus reducing the number of different binding proteins, but the sequence conservation would remain a mystery. It has been shown that short AU-rich motifs promote mRNA degradation (6). Such motifs are often seen in the conserved portions of 3' UTRs, but these cannot explain the striking conservation between orthologs either. Another possibility would be that the conservation is due to the encoding of a protein on the complementary strand. Extensive database searches using translations of the strand complementary to these conserved regions did not reveal homologies to known proteins which could explain this conservation (results not shown).
A number of studies provide evidence that the conserved regions in 3' UTRs are required for the regulation of mRNA stability (7). Typically, deletion of these regions renders the mRNA unresponsive to regulatory signals which normally lead to destabilization (8-10). Conversely, introduction of these regions into reporter mRNAs makes them responsive to regulated destabilization (11-13). Conserved regions in 5' UTRs (14) and coding regions (15-17) have also been implicated in regulation of mRNA stability.
The large number of bases in conserved blocks suggests a base-pairing interaction between mRNA and another nucleic acid. Over the last several years there has been an increasing number of reports of antisense RNA transcripts encoded by the complementary strand of a gene (18-22). Although most reported examples do not show evidence of coding regions, in some cases these countertranscripts encode expressed proteins (23,24). These countertranscripts are sometimes found in different tissues or developmental stages than their corresponding sense mRNA, and thus a regulatory role for endogenous antisense has been proposed (25-28). Examples of regulation of gene expression by endogenous antisense have also been described for nematode (29), Dictyostelium (30) and prokaryotes (31).
Why would the antisense-based regulatory mechanism require sequence conservation? If cells have a destabilization/degradation system which specifically recognizes long, nearly perfect RNA duplexes, then mutations in a region corresponding to a duplex will be selected against because of their mismatch with the other allele (Fig. 1). Consider, for example, the developmental expression pattern for Hoxa 11 sense and antisense transcripts (27): where sense transcripts are at high levels, antisense transcripts are at low levels, and vice versa. When the Hoxa 11 antisense is abundant, most sense transcripts will be duplexed. Assuming the rate of transcription for the two alleles is roughly equal, a mutation in a region corresponding to a duplex would result in approximately half the sense transcripts forming mismatched duplexes. Let us further assume that the half life of a sense transcript is 12 h and the half life of a perfectly matching sense/antisense duplex is 12 min. When most of the sense transcripts are in perfect duplexes the drop in mRNA levels could therefore be an order of magnitude or more. However, a mutation leading to allelic divergence in a complementary region could lead to defective recognition of approximately half of the sense/antisense duplexes; thus, half the sense transcripts would have a half life of 12 min and half would have a half life approaching 12 h. The endogenous antisense mechanism would then only be able to reduce mRNA levels by a factor of two. Thus, the conserved regions in mRNAs will be maintained through selection against allelic divergence. In the three cases where the endogenous antisense has been sequenced and the corresponding orthologous mRNA sequences are also available, there is a strong correlation of complementary segments and sequence conservation. For example, in the BFGF gene, there is a single silent change between human and rat sequences in the 280 bases of the coding region which overlap the antisense transcript (unpublished observations).
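The half-life argument above can be checked with a few lines of arithmetic. The sketch below rests on assumptions not stated in the text: transcription proceeds at a constant rate and decay is first order, so the steady-state level of each transcript pool is proportional to its half life; all sense transcripts are taken to be duplexed; and an unrecognized (mismatched) duplex is given the full 12 h half life. The function name `fold_reduction` is hypothetical.

```python
def fold_reduction(free_half_life_min: float, duplex_half_life_min: float,
                   recognized_fraction: float) -> float:
    """Fold drop in steady-state mRNA level relative to no antisense regulation.

    Assumes constant synthesis and first-order decay, so the steady-state
    level of a mixed population scales with its fraction-weighted mean half life.
    """
    if not 0.0 <= recognized_fraction <= 1.0:
        raise ValueError("recognized_fraction must lie between 0 and 1")
    mean_half_life = (recognized_fraction * duplex_half_life_min
                      + (1.0 - recognized_fraction) * free_half_life_min)
    return free_half_life_min / mean_half_life


# Half lives from the text: 12 h (720 min) for free sense mRNA, 12 min for a perfect duplex.
print(fold_reduction(720, 12, 1.0))  # 60.0: every duplex is recognized
print(fold_reduction(720, 12, 0.5))  # about 1.97: half the duplexes escape recognition
```

Under these assumptions the matched case gives a 60-fold drop, in line with "an order of magnitude or more", while the divergent-allele case gives just under twofold, in line with "a factor of two".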
With this hypothesis, one would predict that a chromosomal translocation in a region corresponding to a duplex would lead to upregulation of the product of the normal allele. An interesting example of this is the bcl-2/IgH translocation seen in B-cell lymphomas, which is associated with increased levels of bcl-2 mRNA and bcl-2 protein as well as detectable levels of a bcl-2/IgH antisense transcript (32). Note that the translocation occurs within the 3' UTR, which contains a number of conserved blocks on either side of the breakpoint. Oligonucleotides specifically complementary to this chimeric antisense downregulate the bcl-2 gene product, leading to apoptosis, while oligonucleotides complementary to the bcl-2/IgH sense transcript have no effect (32,33). Presumably the chimeric antisense binds to the normal bcl-2 sense mRNA but is not efficiently recognized by the destabilization/degradation system, and thus it acts as a competitive inhibitor of the normal bcl-2 antisense transcript.
*Tel: +1 301 496 2475; Fax: +1 301 480 9241; Email: lipman@ncbi.nlm.nih.gov

Table 1
Conserved regions (in nt) are listed by location: 5' UTR, coding region, 3' UTR.

| mRNA | Accession no. (human) | 5' UTR: length (%) | Coding region(b): identical blocks | 3' UTR: length (%) |
|---|---|---|---|---|
| Immediate-early response protein NOT | X75918 | | 153, 172, 150 | |
| Human polyposis locus (DP2.5 gene) | M73548 | | 147, 199 | |
| Octamer binding transcription factor 1 (OTF1) | L20433 | | 116, 125 | |
| Homeobox protein hox-c4 (hox-3e) (cp19) | X07495 | | 124, 136 | |
| Acute phase response factor | L29277 | | 178 | 69 (96%) |
| RNA binding protein EWS | X79233 | | 133 | 156 (97%) |
| hnRNP-E2 | X78136 | | 209 | 167 (91%) |
| Eukaryotic initiation factor 4AII | D30655 | | 167, 116 | 345 (96%), 190 (96%) |
| Fibrillin | L13923 | | 151 | 484 (87%) |
| Glutamate receptor 2 (HBGR2) | L20814 | 158 (85%) | 258, 157 | 202 (98%) |
| p68 protein | X52104 | | 175 | 301 (97%) |
| Thyroid hormone receptor [alpha] (c-erbA-1) | X55005 | 173 (91%) | 183 | 80 (96%) |
| S-adenosylmethionine decarboxylase | M21154 | 122 (95%) | 92, 139, 134 | 119 (88%), 119 (97%) |
| Sodium- and chloride-dependent taurine transporter | Z18956 | | 160 | 69 (94%) |
| Transcription activator ZFX | X59739 | | 159 | 575 (86%) |
| Homeobox c8 protein | M16938 | 208 (98%) | 278 | 184 (85%), 163 (88%) |
| Leukemia virus receptor 1 (GLVR1) | L20859 | 152 (94%) | 145 | 145 (94%), 171 (88%) |
| Very low density lipoprotein receptor | L20470 | | 112 | 431 (93%) |
| Nervous system-specific octamer-binding transcription factor n-Oct 3 | Z11933 | 60 (97%) | 147 | |
| Glutamate (NMDA) receptor subunit [xi]1 | D13515 | | 139 | 78 (96%) |
| Voltage-dependent L-type Ca channel [alpha]1 subunit | Z34822 | | 137 | 84 (99%), 101 (95%) |
REFERENCES
