ABSTRACT
The genome sequences from increasing numbers of organisms allow for rapid and organized examination of gene expression. Yet current computational-based paradigms for gene recognition are limited and likely to miss genes expressing non-coding RNAs or mRNAs with small open reading frames (ORFs). We have utilized two strategies to determine if there are additional transcripts in the yeast Saccharomyces cerevisiae that were not identified in previous analyses of the genome. In one approach, we identified strong consensus polymerase III promoters based on sequence, and determined experimentally if these promoters drive the expression of an RNA polymerase III transcript. This approach led to the identification of a new, non-essential 170 nt non-coding RNA. An alternative strategy analyzed RNA expression from large sequence gaps >2 kb between predicted ORFs. Fifteen unique RNA transcripts ranging in size from 161 to 1200 nt were identified from a total of 59 sequence gaps. Several of these RNAs contain unusually small potential ORFs, while one is clearly non-coding and appears to be a small nucleolar RNA. These results suggest that there are likely to be additional previously unidentified non-coding RNAs in yeast, and that new paradigms for gene recognition will be required to identify all expressed genes from an organism.
INTRODUCTION
The genome sequencing projects provide an opportunity to identify the complete spectrum of genes in an organism and to utilize that information to understand gene expression and function. A key step in this process will be to determine the chromosomal locations of genes and the diversity of gene products. Current approaches primarily focus on identifying gene locations based on open reading frame criteria. For example, in the yeast Saccharomyces cerevisiae, the first eukaryote whose genome has been completely sequenced, 5885 genes containing open reading frames (ORFs) are predicted (1). However, genes that express transcripts that either contain short ORFs (<300 nt) or are non-coding RNAs would be missed in this type of sequence analysis and would need to be identified by other means.
Non-coding RNAs, or ncRNAs, have received considerable attention in recent years as it has become apparent that there is a striking diversity of these molecules in all cell types. Much work has focused on understanding the roles of RNAs in RNase P (2) and telomerase (3) action, of small nuclear RNAs (snRNAs) in mRNA splicing (4,5), and of small nucleolar RNAs (snoRNAs) in rRNA maturation (6-8). Yet ncRNAs have also been implicated in the processes of transcription (9), translation (10), transport (11), RNA editing (12), mRNA stability (13), differentiation (14) and protein degradation (15). For example, the Xist ncRNA of mammals is critical for X chromosome inactivation in females (16), while the Drosophila roX1 ncRNA localizes to the male X chromosome and is potentially involved in dosage compensation (17). In Caenorhabditis elegans, the lin-4 ncRNA negatively regulates translation of the Lin-14 protein by duplex formation at repeated sequence elements within the 3' untranslated region (UTR) of the lin-14 RNA (18,19). The 10Sa RNA of Escherichia coli contains both transfer and messenger domains that enable the RNA to associate with ribosomes and tag aberrant polypeptides with a degradation signal (15). In addition, the bic RNA of birds is preferentially activated in metastatic tumors in conjunction with c-myc activation, suggesting that bic RNA may collaborate with c-myc in late stages of tumor progression (20).
The wide range of roles played by these ncRNAs and the large number of organisms in which they have been detected suggest that eukaryotic cells are likely to contain several ncRNAs that have yet to be identified. Moreover, the roles of these new ncRNAs will need to be determined. The yeast S.cerevisiae is an ideal model system for this type of analysis due to the availability of the genome sequence and the ability to easily test gene function by genetic analysis. For these reasons, we have asked if there are additional unidentified and unpredicted RNAs in yeast that might have been missed in ORF-based methods of genome analysis. Specifically, we utilized two strategies. One strategy assayed RNA expression from consensus polymerase III promoters, while the second strategy analyzed RNA expression from genomic sequences lacking predicted ORFs. We report here the identification of 16 unique RNA transcripts. Sequence analysis of these transcripts reveals ncRNAs and potential mRNAs that contain small unpredicted ORFs. These results provide evidence that yeast and likely other organisms express many more RNAs than previously predicted.
MATERIALS AND METHODS
Oligonucleotides were designed to anneal to RNA just 3' of the B box sequence from each candidate promoter. These oligonucleotides were end labeled with [γ-32P]ATP. Total yeast RNA was isolated as previously described (21) from yeast strain yRP683 (MATa, leu2, lys2, his4, trp1, ura3) grown to mid-log phase in rich (YEPD) media at 14, 30 and 37°C or in YEP + 3% glycerol at 30°C. Northern blot analysis was performed by loading 40 µg of RNA in each lane of 1.25% agarose/6.7% formaldehyde gels or 6% polyacrylamide/7 M urea gels, blotting to Zeta-probe (BioRad), and hybridizing to the radiolabeled oligonucleotides for 15 h at 42°C. The resulting blots were washed at 50°C and imaged using a Molecular Dynamics PhosphorImager.
Chromosomal sequence locations of all known and predicted ORFs within the yeast genome were obtained from the MIPS (Munich Information Centre for Protein Sequences) computer database. Gap regions ≥2 kb between ORFs were amplified by PCR for 30 cycles with primers designed with 5' restriction sites to anneal ~150 nt away from each flanking 5' and 3' ORF. PCR products were radiolabeled by random priming and extending with [α-32P]dATP. Northern blot analysis of total yeast RNA was performed as described above using the radiolabeled PCR products as hybridization probes.
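The gap selection step can be illustrated with a short sketch. The code below is not the program used in this work, which is not described; it is a minimal example that assumes ORF coordinates for one chromosome are available as 1-based inclusive (start, end) pairs. The function name `find_gaps`, its parameters and the output keys are hypothetical. A gap is taken to run from the first nucleotide after one ORF to the last nucleotide before the next, and the amplified region is assumed to begin and end 150 nt inside the gap.

```python
def find_gaps(orfs, min_gap=2000, primer_offset=150):
    """Find inter-ORF gaps of at least `min_gap` nt on one chromosome.

    `orfs` is a list of (start, end) pairs in 1-based inclusive coordinates
    with start <= end regardless of strand (an assumed input format).
    Each returned dict holds the gap bounds and the bounds of the region
    that primers placed `primer_offset` nt into the gap would amplify.
    """
    gaps = []
    furthest_end = None
    for start, end in sorted(orfs):
        if furthest_end is not None and start - furthest_end - 1 >= min_gap:
            gap_start, gap_end = furthest_end + 1, start - 1
            gaps.append({
                "gap_start": gap_start,
                "gap_end": gap_end,
                "length": gap_end - gap_start + 1,
                "amplicon_start": gap_start + primer_offset,
                "amplicon_end": gap_end - primer_offset,
            })
        # Track the furthest ORF end so overlapping ORFs do not create false gaps.
        furthest_end = end if furthest_end is None else max(furthest_end, end)
    return gaps


if __name__ == "__main__":
    # Made-up coordinates for illustration only.
    example_orfs = [(100, 1000), (500, 1200), (3300, 4000), (4100, 5000)]
    for gap in find_gaps(example_orfs):
        print(gap)
```

Telomeric and centromeric regions would have to be excluded separately; the sketch does not attempt this.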
PCR products of gap regions that expressed RNAs on northern blots were cloned into pBluescript (Stratagene), then the gap sequences were digested into at least three fragments. Each fragment was random-prime radiolabeled with [α-32P]dATP and used to probe northern blots of total yeast RNA as described above. Once an expressed RNA was isolated to a distinct restriction fragment, oligonucleotides were designed complementary to both Watson and Crick strands within the restriction fragment. These oligonucleotides were end labeled with [γ-32P]ATP and used in northern blot analyses as described above. The 5'-3' orientation of an expressed RNA was determined as the complement of the oligonucleotide to which it hybridized. These complementary oligonucleotides were then used in primer extension reactions to determine the 5'-end of each RNA. The oligonucleotide that annealed to RNA170 in the polymerase III promoter analyses was also used for primer extension of RNA170. Primer extension was performed by annealing 10 µg of total yeast RNA with 1 × 10^6 c.p.m. of oligonucleotide and extending with avian myeloblastosis virus reverse transcriptase (AMV-RT) as described in Current Protocols (22). Extension products were purified by phenol:CHCl3 extraction and ethanol precipitation, then electrophoresed on a 6% polyacrylamide/7 M urea sequencing gel. Dideoxy sequencing reactions of the respective gap fragments, using the same oligonucleotides as in the primer extension analysis, were run beside extension products to determine the 5' nucleotide(s).
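The strand assignment logic can be written out as a small sketch. It assumes the restriction fragment is given as its Watson-strand sequence and that exactly one of the two oligonucleotides gives a northern signal; the function names are hypothetical and the sequences in the example are made up.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence."""
    return dna.translate(COMPLEMENT)[::-1]


def rna_sense_strand(hybridizing_oligo, watson_fragment):
    """Name the strand whose sequence the RNA shares.

    An oligonucleotide hybridizes to an RNA only if it is antisense to it,
    so the RNA has the sequence of the oligonucleotide's reverse complement.
    """
    oligo = hybridizing_oligo.upper()
    watson = watson_fragment.upper()
    if reverse_complement(oligo) in watson:
        return "Watson"  # oligo is antisense to Watson; RNA reads as Watson
    if oligo in watson:
        return "Crick"   # oligo is Watson-sense; RNA reads as Crick
    raise ValueError("oligonucleotide does not match the fragment")


if __name__ == "__main__":
    fragment = "ATGGCCATTGTAATGGGCCGC"  # made-up Watson-strand sequence
    probe = reverse_complement("GCCATTGTAATG")
    print(rna_sense_strand(probe, fragment))
```

A palindromic oligonucleotide would match both strands and is reported here as "Watson"; such probes would be uninformative in practice.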
RNase protection and RNase H cleavage methods were utilized to estimate the 3'-ends of the RNAs. RNase protections were performed with RNase One (Promega) as per the manufacturer's instructions by the use of radiolabeled antisense transcripts produced by T7 or T3 transcriptions with [α-32P]UTP from plasmids containing the gap sequences or the RNA170 gene. Protection products were purified by ethanol precipitation and analyzed on polyacrylamide gels as described above. RNase H cleavage reactions were performed as previously described (23) with 10 µg of total yeast RNA and 500 ng of oligo dT or an oligonucleotide complementary to the identified RNA. Cleavage products were purified by phenol:CHCl3 extraction and ethanol precipitation, then electrophoresed on 6% polyacrylamide/7 M urea gels that were blotted and probed with radiolabeled gap restriction fragments. The precise 3' nucleotides of RNA170 and RNA161 were determined by a ligase-mediated RT-PCR method. T4 RNA ligase was used to ligate a DNA oligonucleotide onto the 3'-end of a gel-purified RNA. The product was reverse transcribed using a primer complementary to the DNA oligonucleotide, PCR amplified using the reverse transcription primer and an oligonucleotide homologous to an internal RNA sequence, cloned into pBluescript and sequenced.
Deletions of RNA716, RNA515, RNA530 and RNA487 were made by PCR amplifying ~500 bp sequences flanking the RNA genes and cloning these on either side of a URA3 gene inserted into the BamHI site of pBluescript. A SacI-KpnI digest of this plasmid was used to transform haploid cells of yRP683 using the LiOAc method (24). Deletion of RNA161 was accomplished by transformation of a PCR product that contains the neomycin (neo) gene, which confers resistance to the drug G418. Specifically, PCR primers were designed at their 5'-ends with homology to 50 bp sequences flanking the RNA gene, and at their 3'-ends with ~20 bp homologous to sequences flanking the neo gene. The neo gene was amplified from plasmid pRP665, in which expression of neo is under the control of the GPD promoter (25) and the terminator sequence from the PGK1 gene (26). The PCR product was then transformed into yRP683 haploid cells using the LiOAc method. Deletion of RNA170 was done by amplifying ~500 bp sequences flanking the RNA gene and cloning these on each side of either the URA3 or neo gene inserted into the BamHI site of pBluescript. A SacI-KpnI digest of this plasmid was used to transform haploid cells of yRP683 using the LiOAc method.
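The primer layout used for the PCR-based RNA161 deletion can be sketched as follows. This is an illustration under stated assumptions, not the actual primer design: `design_deletion_primers` is a hypothetical helper, `cassette` stands for the top-strand sequence of the whole region to be amplified (the neo gene with its flanking sequences), coordinates are 0-based with an exclusive end, and all sequences in the example are made up.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence."""
    return dna.translate(COMPLEMENT)[::-1]


def design_deletion_primers(chromosome, gene_start, gene_end, cassette,
                            homology=50, priming=20):
    """Return (forward, reverse) primers, both written 5'->3'.

    Each primer carries `homology` nt matching the sequence flanking the
    gene at its 5'-end and `priming` nt matching one end of the cassette
    at its 3'-end.
    """
    if gene_start < homology or gene_end + homology > len(chromosome):
        raise ValueError("not enough flanking sequence for the homology arms")
    if len(cassette) < 2 * priming:
        raise ValueError("cassette is shorter than the two priming regions")
    upstream = chromosome[gene_start - homology:gene_start]
    downstream = chromosome[gene_end:gene_end + homology]
    forward = upstream + cassette[:priming]
    # The reverse primer anneals to the top strand, so both parts are
    # reverse complemented and the homology arm ends up at its 5'-end.
    reverse = reverse_complement(downstream) + reverse_complement(cassette[-priming:])
    return forward, reverse


if __name__ == "__main__":
    chrom = "ACGT" * 15 + "G" * 30 + "TTGCA" * 12  # made-up locus
    marker = "ATGC" * 10 + "GGATCC" * 5            # made-up cassette
    fwd, rev = design_deletion_primers(chrom, 60, 90, marker)
    print(fwd)
    print(rev)
```

With the default values each primer is 70 nt long, matching the 50 bp plus ~20 bp layout described above.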
Overexpression of the above RNAs was accomplished by cloning respective restriction fragments into the polylinker of p426 (27), a 2µ vector containing the URA3 gene in a pBluescript backbone. These plasmids were transformed into yRP683 using the LiOAc method. Levels of overexpression were examined by northern blot analysis as described in the above sections. A 2µ plasmid expressing RNA170 with mutations in its B box was made by PCR amplifying the gene using primers containing the mutated nucleotides (Fig. 1B) and cloning the products into p426. Plasmids expressing TDS4 were generously donated by Stephen Buratowski.
RESULTS
The first method we used to search for functional RNAs was designed to specifically identify new RNA polymerase III transcripts. Such RNA transcripts are typically not translated and function in various cellular processes such as translation and RNA processing. Promoters of polymerase III genes are characterized by two conserved domains (28). Given this organization, we used the conserved A and B box domains of the tRNA type 2 promoter to search for new polymerase III promoters in the yeast genome. Specifically, a computer search for the consensus B box sequence (GTTCRANYC) was performed allowing at most one mismatch. The resulting list of candidates was then searched for at least 50% conservation of the A box sequence (TRGCNNAGYNGG) within 300 nt upstream of the B box, and a poly-T termination signal prior to the next downstream gene. Sequences located within known or predicted ORF regions were eliminated, as well as sequences that were, or had homology to, known polymerase III transcripts. For the 10 candidates that met the above criteria, oligonucleotide hybridization probes located just 3' of the B box were used to probe northern blots to determine if any RNA transcripts were expressed. In this analysis, one probe identified an RNA doublet near 190 nt (Fig. 1A).
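The A and B box criteria can be sketched in code. The example below is a minimal illustration, not the search program used in this work, which is not described. It assumes that "conservation" is scored as the fraction of consensus positions matched, with N positions always counting as matches, and it scans only the strand supplied. The function names are hypothetical, and the poly-T terminator, ORF and homology filters are not implemented.

```python
# IUPAC codes that occur in the two consensus sequences.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "N": "ACGT",
}

B_BOX = "GTTCRANYC"     # consensus B box
A_BOX = "TRGCNNAGYNGG"  # consensus A box


def count_matches(site, consensus):
    """Count positions in `site` that are compatible with the consensus."""
    return sum(base in IUPAC[code] for base, code in zip(site, consensus))


def find_candidates(seq, max_b_mismatch=1, min_a_fraction=0.5, window=300):
    """Return (a_box_start, b_box_start) pairs, 0-based, on the given strand.

    A B box may carry at most `max_b_mismatch` mismatches. The best-scoring
    A box lying entirely within `window` nt upstream of the B box is kept
    if it matches at least `min_a_fraction` of the consensus positions.
    """
    seq = seq.upper()
    hits = []
    for b in range(len(seq) - len(B_BOX) + 1):
        b_site = seq[b:b + len(B_BOX)]
        if len(B_BOX) - count_matches(b_site, B_BOX) > max_b_mismatch:
            continue
        best = None
        for a in range(max(0, b - window), b - len(A_BOX) + 1):
            score = count_matches(seq[a:a + len(A_BOX)], A_BOX)
            if score / len(A_BOX) < min_a_fraction:
                continue
            if best is None or score > best[0]:
                best = (score, a)
        if best is not None:
            hits.append((best[1], b))
    return hits


if __name__ == "__main__":
    # Made-up sequence with a perfect A box and B box 30 nt apart.
    demo = "AAAA" + "TGGCGCAGTGGG" + "A" * 30 + "GTTCGAATC" + "AAAA"
    print(find_candidates(demo))
```

Because three of the twelve A box positions are N, the 50% threshold under this scoring is permissive; a stricter variant could score only the nine informative positions.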
The transcribed region corresponding to this RNA was determined by a number of techniques and is shown in Figure 1B. Primer extension reactions revealed two major transcriptional start sites for the RNA, located 7 and 15 bases upstream of the A box. These distances are typical for tRNA polymerase III promoters. Surprisingly, two 3'-ends were determined by sequencing cDNAs (see Materials and Methods). A minor end was identified at a poly-T tract that presumably corresponds to an RNA polymerase III transcriptional terminator. The major 3'-end was identified as being located 70 nt upstream of this large poly-T tract. These results raise the possibility that the mature transcripts are produced by an RNA processing event from the primary transcripts that arise by termination at the poly-T tract. Together, the mapping data reveal that the mature transcripts deriving from the two major start sites are 170 and 162 nt in length. The difference in size between the RNA doublet visualized on northern blots (190 and 183 nt) and the mapping data may reflect gel mobility anomalies of the RNA due to strong structural elements within the transcript.
Two observations provide experimental evidence that these transcripts, collectively termed RNA170, are transcribed by RNA polymerase III. First, when wild-type or mutant copies of the RNA170 gene are introduced into yeast on plasmids, point mutations within the consensus B box element (Fig. 1B) decrease expression to the levels seen in an untransformed wild-type strain (Fig. 2A). In addition, expression levels of the RNA increased >5-fold in yeast strains that overexpressed TDS4 (Fig. 2B), a limiting component of the polymerase III transcription machinery (29). Consistent with RNA170 being produced by RNA polymerase III, RNase H cleavage of the RNA with oligo dT showed that RNA170 is not polyadenylated (data not shown).
A second strategy we used to identify new RNA transcripts took advantage of the observation that the yeast genome has a very compact distribution of genes (1). In fact, the majority of predicted ORFs in yeast are located <1 kb apart, allowing adequate sequence space for promoters and 5' and 3' UTRs. This dense packing of genes suggests that the rare ≥2 kb sequence gaps that are located between some ORFs are not simply random nucleotides, but are functionally important, possibly encoding gene products. An example of this arrangement is a 2.1 kb gap between predicted ORFs on chromosome II that contains the gene for the untranslated TLC1 RNA, a component of telomerase (34). To test whether other gaps also express RNAs, we examined several large gaps in the yeast genome by northern analysis using probes specific to those regions. In contrast to the promoter-based strategy described above, this approach should identify transcripts expressed by any of the RNA polymerases and could also identify mRNAs that were not predicted due to the small size of their ORFs.
Our analysis proceeded in the following steps. First, computer searches of all 16 yeast chromosomes identified a total of 58 sequence gaps ≥2 kb located between known and hypothetical ORFs (Table 1). For the purposes of this work, we have defined gaps as the first nucleotide downstream of an ORF's start/stop codon to the last nucleotide upstream of the next ORF's start/stop codon. Telomeric and centromeric regions contain numerous non-ORF elements and so were avoided in our analysis. Polymerase chain reaction (PCR) amplification of each gap region was performed using primers that annealed ~150 nt away from flanking ORFs. This distance should avoid overlap with the vast majority of 5' and 3' UTRs of the flanking ORFs. PCR products were then radiolabeled and used as hybridization probes for northern blot analysis of total yeast RNA. RNA was prepared from cells growing under various conditions (see Materials and Methods). As a positive control for this method, a PCR probe made from the 2.1 kb gap of chromosome II containing the TLC1 RNA was used. Surprisingly, the TLC1 gap probe not only detected the expected 1300 nt RNA transcript, but also hybridized to a unique 161 nt doublet (Fig. 3A). Sequence analysis of this RNA indicated it is likely to be a new snoRNA (see below). From the 58 identified sequence gaps, 14 new RNA transcripts ranging in size from 450 to 1200 nt were found; together with the 161 nt transcript, this gives a total of 15 new RNA transcripts (Table 1). Examples of these RNAs are shown in Figure 3B-E. Together, >20% of the ≥2 kb gaps expressed RNAs, with some gaps expressing two or three unique transcripts.
A variety of techniques was employed to map the location of the 15 unique RNAs identified in the northern analysis described above. First, all RNAs were approximately mapped by hybridizing northern blots with probes derived from different restriction fragments from the respective gap regions. In all cases, the RNA was localized to a distinct region within the gap. This observation indicated that these RNAs are not derived from the neighboring mRNAs of predicted or known ORFs. Five transcripts were then analyzed further. These included the 161 nt doublet RNA expressed from near the TLC1 gene, termed RNA161, and four of the more abundant RNAs expressed from other gaps (RNA716, RNA515, RNA530 and RNA487, where each number represents the transcript size). These RNAs were precisely mapped using a combination of primer extension, RNase protection, RNase H cleavage and cloning of RT-PCR products. The results of these analyses are shown in Figure 4.
DISCUSSION
We have utilized two strategies to methodically search for new RNA transcripts in yeast. First, consensus RNA polymerase III promoters were identified and analyzed for RNA transcription. Second, sequences ≥2 kb lacking predicted ORFs were tested for RNA expression by northern blot analysis. These strategies resulted in the identification of 16 new RNAs ranging in size from 161 to 1200 nt. Two of the RNAs are clearly non-coding, while several contain potential small ORFs. The identification of such a large number of new transcripts from only 10 candidate polymerase III promoters and 59 chromosomal gaps (including the TLC1 gap region) provides evidence that there are many RNAs expressed in yeast that cannot be predicted by standard homology searches or current open reading frame criteria. Specifically, >20% of the ≥2 kb gaps located between predicted genes express RNAs. Moreover, these RNAs can be expressed from regions not expected to be transcribed. For example, RNA170 is expressed from sequences within the assumed promoter region of the neighboring gene, ERG2. Therefore, the density of genes on chromosomes, at least in some regions, may be even higher than currently predicted (1).
We hypothesize that careful examination of other regions of the genome is likely to reveal additional new RNAs for several reasons. First, because such a large percentage of gaps >2 kb expressed RNAs (>20%), it is possible that a similar percentage of smaller gaps might also express RNAs. In addition, we found that the size and coding potential of our 16 new RNAs correlated with the size of the ORF gap in which they were expressed. In particular, large gaps ≥2 kb between ORFs expressed primarily mRNAs of 487 to >1000 nt. In contrast, the small ncRNAs, RNA170 and RNA161, were expressed from smaller gaps of 980 and 631 nt, respectively. These results suggest that analysis of smaller gaps (<2 kb) will reveal transcripts that are more likely to be non-coding. Finally, in the polymerase III promoter search, we demanded a stringent match to the consensus B box sequence and then utilized other criteria to narrow our northern analysis to 10 candidates. Therefore, other uncharacterized polymerase III genes may exist that simply did not meet our criteria. As the identification of novel RNAs continues, the genetic analysis of their function in yeast will be important for an understanding of the multiple roles of RNA molecules in eukaryotic cells.
ACKNOWLEDGEMENTS
We wish to thank Stephen Buratowski for generously donating the TDS4 plasmids, Heli Roiha for providing us with the various related yeast species, and Peter Geiduschek and George Kassavetis for reagents. This work was funded by the Howard Hughes Medical Institute. W.O. is supported by a postdoctoral fellowship from HHMI.
REFERENCES



