Nucleic Acids Research Advance Access originally published online on April 22, 2007
Nucleic Acids Research 2007 35(10):3192-3202; doi:10.1093/nar/gkm187
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Computational Biology
Non-EST-based prediction of novel alternatively spliced cassette exons with cell signaling function in Caenorhabditis elegans and human
Department of Genetics and Center for Genome Sciences, Washington University in St Louis, 4444 Forest Park Parkway, Campus Box 8510, St Louis, MO 63108, USA
*To whom correspondence should be addressed. Tel: +1-314-362-2751; Fax: +1-314-362-2156; Email: rmitra{at}genetics.wustl.edu
Received January 3, 2007. Revised March 15, 2007. Accepted March 16, 2007.
ABSTRACT
To better understand the complex role that alternative splicing plays in intracellular signaling, it is important to catalog the numerous splice variants involved in signal transduction. Therefore, we developed PASE (Prediction of Alternative Signaling Exons), a computational tool to identify novel alternative cassette exons that code for kinase phosphorylation or signaling protein-binding sites. We first applied PASE to the Caenorhabditis elegans genome. In this organism, our algorithm had an overall specificity of 76.4%, including 33 novel cassette exons that we experimentally verified. We then used PASE to analyze the human genome and made 804 predictions, of which 308 were found as alternative exons in the transcript database. We experimentally tested 384 of the remaining unobserved predictions and discovered 26 novel human exons for a total specificity of 41.5% in human. By using a test set of known alternatively spliced signaling exons, we determined that the sensitivity of PASE is 70%. GO term analysis revealed that our exon predictions were found in the introns of known signal transduction genes more often than expected by chance, indicating PASE enriches for splice variants that function in signaling pathways. Overall, PASE was able to uncover 59 novel alternative cassette exons in C. elegans and humans through a genome-wide ab initio prediction method that enriches for exons involved in signaling.
INTRODUCTION
One intriguing aspect of signal transduction is that there seems to be a relatively small number of signaling pathways, yet cells are able to generate multiple responses to different signals from the environment. One explanation is that alternative splicing of pre-mRNA generates a much larger number of unique signaling proteins than is generally appreciated, thus enabling a diverse set of responses. Several pieces of evidence support this hypothesis. Bioinformatics analysis and DNA microarray experiments indicate that 59-74% of all human genes are alternatively spliced (1,2) and that roughly 75% of alternative splicing events alter the protein-coding region of the transcript (3-5). This suggests that alternative splicing has the potential to produce a large number of different proteins from the surprisingly limited number of genes in multicellular organisms. The connection between alternative splicing and intracellular signaling is further strengthened by studies of glucocorticoid signaling, where it has been shown that alternative splicing is the mechanism by which a single gene (the glucocorticoid receptor) is able to mediate a variety of different responses in different cell types (6). Similar results have been observed in other signaling pathways: both the γ-amino-butyric acid type A receptor gamma-subunit (GABAARγ2) gene and the myosin light chain kinase (MLCK) gene contain alternative cassette exons that encode phosphorylation sites that alter their signaling functions (7-9).
In order to accelerate our efforts to understand the role alternative splicing plays in intracellular signaling pathways, we need to identify and accurately catalog alternatively spliced (AS) isoforms involved in signal transduction. However, there are currently several difficulties in obtaining such a catalog. While most known alternative-splicing events have been discovered through large-scale EST sequencing, this approach has some drawbacks. First, in many model organisms, where intracellular signaling is most easily dissected, relatively few EST sequences have been collected (e.g. Caenorhabditis elegans, ~300 000 ESTs). Even in organisms with high EST coverage such as human, many AS isoforms go undetected because EST libraries are biased to highly expressed splice forms as well as to the 3' or 5' ends of genes. Also, many minor splice variants are not captured by this conventional approach because of their specific expression in particular tissues or developmental stages. Indeed, many human tissues have not been adequately sampled: there are ~210 human cell types (10), and many of these have little or no EST coverage. Finally, even when AS transcripts are found, their functions are often unknown. As a result, our ability to detect alternative splice variants and determine their function is limited.
Recognition of these shortcomings has sparked considerable interest in developing computational approaches to splice variant prediction (11). Attempts at ab initio prediction of alternative splicing that is based on intronic sequence alone have proven difficult because of the high number of pseudo-splice sites in intronic sequences (12). In addition, many of the known AS cassette exons that affect signaling are quite small, and may be difficult to find by conventional gene prediction methods alone. Recently, progress has been made using species conservation as well as protein domain information (13-15), yet the total number of experimentally validated novel isoforms remains modest compared to the numbers found by traditional EST sequencing. Also, no hypothesis can be made about the function of these observed alternative splice variants.
To complement these approaches and to address the issue of finding signaling exons, we developed prediction of alternative signaling exons (PASE), a computational tool to identify AS cassette exons that are likely to be involved in intracellular signaling. This algorithm uses first-order Markov models and a Bayesian classifier to identify likely donor and acceptor sites in an intron. Using these potential splice sites, it identifies in-frame cassette exons that code for phosphorylation sites or signaling protein-binding sites. As an additional filter, only exons that are conserved across species are kept. Here, we report the results of a genome-wide application of our algorithm and demonstrate that our genome-wide ab initio approach enriches for novel exons likely to be involved in cell signaling.
METHODS
Preparation of intron, EST/mRNA and species conservation data
All intron data sets were processed from REFSEQ genomic alignment annotations, which were downloaded from the UCSC Genome Browser (May 2004 release). Redundant transcripts and intron-less genes were removed from both human and C. elegans data sets. Spliced-ESTs and phastCons Most conserved species conservation blocks were downloaded as chromosome coordinate data from the UCSC Genome Browser (as of January 2, 2005). Two PERL programs (compare_to_expressedseq.pl and compare_to_conserved.pl) were written to compare the overlap of coordinates from these data sets to the predicted exon coordinate data.
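The coordinate comparison described above can be sketched in a few lines. This is only an illustration in Python, not the authors' PERL programs: the function names, the half-open (start, end) convention and the tuple layout are assumptions made for the example.

```python
from bisect import bisect_left

def build_index(blocks):
    """Index half-open (start, end) blocks from one chromosome for overlap queries."""
    ordered = sorted(blocks)
    starts = [start for start, _ in ordered]
    # running_max_end[i] = largest end among the first i + 1 blocks
    running_max_end = []
    largest = float("-inf")
    for _, end in ordered:
        largest = max(largest, end)
        running_max_end.append(largest)
    return starts, running_max_end

def overlaps_any(start, end, index):
    """True if the interval [start, end) overlaps at least one indexed block."""
    starts, running_max_end = index
    k = bisect_left(starts, end)  # number of blocks that start before the exon ends
    return k > 0 and running_max_end[k - 1] > start

def flag_exons(exons, blocks_by_chrom):
    """Annotate (chrom, start, end) exon predictions with an overlap flag."""
    indexes = {chrom: build_index(b) for chrom, b in blocks_by_chrom.items()}
    return [
        (chrom, start, end,
         chrom in indexes and overlaps_any(start, end, indexes[chrom]))
        for chrom, start, end in exons
    ]
```

The same routine would serve for both comparisons (spliced-EST blocks and conserved blocks), since each reduces to asking whether a predicted exon interval intersects any block on the same chromosome.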
Acceptor and donor splice site scoring
The C. elegans and human cassette exon models consist of the pairing of acceptor and donor splice sites with an exon size restriction of 30-330 bp. Both acceptor and donor splice site models use 12-mer first-order Markov chains to capture the over-represented dinucleotide frequencies around the splice sites (34). These models were trained on a randomly sampled set of 5000 REFSEQ internal exons from the human and C. elegans genomes, respectively.
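Calculation of the first-order Markov chain model of the splice site, where X = x_1 ... x_12 is the 12-nt window around the site:
$$P(X) = P_1(x_1) \prod_{i=2}^{12} P_i(x_i \mid x_{i-1})$$
Log-odds ratio of the splice site versus the background probability Pb:
$$S(X) = \log_2 \frac{P(X)}{P_b(X)}$$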
The background probabilities were calculated by counting the overlapping dinucleotide frequencies of all intronic sequences from the REFSEQ genes of each genome.
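A minimal sketch of this kind of scoring, assuming aligned training windows of equal length and simple add-one pseudocounts (the pseudocounts and all function names are assumptions for illustration, not details given in the text):

```python
import math

ALPHABET = "ACGT"

def train_site_model(sites, pseudocount=1.0):
    """Position-specific first-order Markov chain from aligned, equal-length sites."""
    length = len(sites[0])
    first = {b: pseudocount for b in ALPHABET}
    trans = [{a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
             for _ in range(length - 1)]
    for site in sites:
        first[site[0]] += 1
        for i in range(1, length):
            trans[i - 1][site[i - 1]][site[i]] += 1
    total = sum(first.values())
    first = {b: c / total for b, c in first.items()}
    for position in trans:
        for a in ALPHABET:
            row_total = sum(position[a].values())
            for b in ALPHABET:
                position[a][b] /= row_total
    return first, trans

def train_background(sequences, pseudocount=1.0):
    """Position-independent chain from overlapping dinucleotide counts."""
    mono = {b: pseudocount for b in ALPHABET}
    di = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for base in seq:
            mono[base] += 1
        for a, b in zip(seq, seq[1:]):
            di[a][b] += 1
    mono_total = sum(mono.values())
    mono = {b: c / mono_total for b, c in mono.items()}
    for a in ALPHABET:
        row_total = sum(di[a].values())
        for b in ALPHABET:
            di[a][b] /= row_total
    return mono, di

def log_odds_bits(window, site_model, background):
    """Score a window as log2 of site-model probability over background probability."""
    first, trans = site_model
    bg_first, bg_trans = background
    score = math.log2(first[window[0]] / bg_first[window[0]])
    for i in range(1, len(window)):
        a, b = window[i - 1], window[i]
        score += math.log2(trans[i - 1][a][b] / bg_trans[a][b])
    return score
```

A window scoring above zero is more probable under the splice-site chain than under the intronic background; in this sketch the acceptor and donor models would each be trained separately and yield one bit score per candidate site.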
Bayesian classification of an acceptor-donor pair exon model
A Bayesian classifier was trained and tested with sets of real and pseudo-exon acceptor-donor pair scores. Pseudo-exons in this case are any pair of AG-GT dinucleotides within range of the exon size limits taken from randomly generated intronic sequences based on species-specific dinucleotide distribution. For the human test set, a ratio of 1:10 real exons to pseudo-exons was used (1000 real exons versus 10 000 pseudo-exons), reflecting the order of magnitude number of pseudo-exons relative to real exons typically found in the human genome (12). The C. elegans test set had a 1:4 ratio (1000 real exons versus 4000 pseudo-exons) of real exons to pseudo-exons. The Bayesian classifier parameters for both the real and pseudo-exon acceptor-donor pair score distributions were estimated by fitting multivariate Gaussian distributions, with the prior probabilities based on the ratios of real exons to pseudo-exons.
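The function for the multivariate Gaussian distribution model for exon acceptor-donor pair scores is the following:
Let x_i = {acceptor bit score, donor bit score}, μ = mean vector of the acceptor and donor bit scores, Σ = covariance matrix of acceptor and donor bit score distributions and det(Σ) = determinant of the covariance matrix
$$P(x_i) = \frac{1}{2\pi \sqrt{\det(\Sigma)}} \exp\left(-\tfrac{1}{2}(x_i - \mu)^{T} \Sigma^{-1} (x_i - \mu)\right)$$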
The calculation of the probability that an exon is real, given its acceptor-donor pair scores, is the following:
Let P(R) and P(P) be the prior probability of Real Exons and Pseudo-Exons, respectively
$$P(R \mid x_i) = \frac{P(x_i \mid R)\,P(R)}{P(x_i \mid R)\,P(R) + P(x_i \mid P)\,P(P)}$$
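A two-class Gaussian classifier of this kind can be sketched as follows. The code assumes each exon is summarized by an (acceptor bit score, donor bit score) pair and that the covariance is estimated with the usual sample formula; the function names and the toy scores in the usage note are hypothetical.

```python
import math

def fit_gaussian(points):
    """Sample mean and covariance of 2-D (acceptor, donor) score pairs."""
    n = len(points)
    mean_a = sum(p[0] for p in points) / n
    mean_d = sum(p[1] for p in points) / n
    var_a = sum((p[0] - mean_a) ** 2 for p in points) / (n - 1)
    var_d = sum((p[1] - mean_d) ** 2 for p in points) / (n - 1)
    cov_ad = sum((p[0] - mean_a) * (p[1] - mean_d) for p in points) / (n - 1)
    return (mean_a, mean_d), (var_a, cov_ad, var_d)

def gaussian_density(x, mean, cov):
    """Bivariate Gaussian density, using the closed-form inverse of a 2x2 matrix."""
    var_a, cov_ad, var_d = cov
    det = var_a * var_d - cov_ad ** 2
    da, dd = x[0] - mean[0], x[1] - mean[1]
    quad = (var_d * da * da - 2 * cov_ad * da * dd + var_a * dd * dd) / det
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

def posterior_real(x, real_model, pseudo_model, prior_real):
    """Bayes' rule: probability that score pair x comes from a real exon."""
    weighted_real = gaussian_density(x, *real_model) * prior_real
    weighted_pseudo = gaussian_density(x, *pseudo_model) * (1.0 - prior_real)
    return weighted_real / (weighted_real + weighted_pseudo)
```

With the 1:10 ratio of real to pseudo-exons described for human, the prior would be `prior_real = 1 / 11`; for the 1:4 C. elegans ratio it would be `1 / 5`.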
Exon translations
A PERL program (intron_phase.pl) was written to determine the reading frames of exons and the phase of each intron as 0, 1 or 2 depending on where the codon was spliced. When predicting exons, we also translated the five flanking amino acids from both of the adjacent constitutive exons (See Supplementary Data).
PASE-sensitivity test
A splicing events data file AltSplice-rel3.events.txt and gene sequence file AltSplice-rel3.genes.txt.gz were downloaded from the Alternative Splicing Database (http://www.ebi.ac.uk/asd/) (24). Single cassette exon splicing events and their sequences were extracted using the get_single_cassette.pl PERL program. WU-BLAST was used for BLASTX comparison with the Phospho.ELM database (25) and only exons with 100% identity were selected. PASE was then applied to corresponding REFSEQ gene structures that lack the cassette exons.
Creation of Scansite log-odds matrices
The current release of Scansite (version 2.0) includes 63 motifs characterizing the binding and/or substrate specificities of many families of Ser/Thr- or Tyr-kinases, SH2, SH3, PDZ, 14-3-3 and PTB domains, together with signature motifs for PtdIns(3,4,5)P3-specific PH domains (35). PDZ-binding motifs were excluded due to the COOH-terminal sequence requirement. We modified the Scansite matrices to be log-odds scoring matrices to take into account the background distribution of the proteomes for human and C. elegans. This modification allowed us to compare the information content of the matrices (Supplementary Table A) and to rank the scores of predicted functional exons independent from percentile ranking (See Supplementary Data).
Let Sij be the Scansite selectivity value of amino acid i in position j:
|
|
The calculation of the log-odds scoring matrix is the following:
|
|
Using these matrices, a PERL script was written (find_scansite.pl) to search the putative exons for signaling motifs using bit score thresholds of 10 and 6 bits.
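The motif search can be illustrated with a generic log-odds position-specific scoring matrix scan. This sketch is not find_scansite.pl: the conversion shown here (normalizing each column of selectivity values to sum to one, then taking log2 against background frequencies) is an assumption made for the example, as are the function names and data layout.

```python
import math

def to_log_odds(selectivity, background):
    """Convert per-position selectivity values to log2-odds against background.

    selectivity: list of dicts, one per motif position, amino acid -> positive value
    background:  dict, amino acid -> background frequency in the proteome
    """
    matrix = []
    for column in selectivity:
        total = sum(column.values())
        matrix.append({aa: math.log2((value / total) / background[aa])
                       for aa, value in column.items()})
    return matrix

def scan(peptide, matrix, threshold_bits):
    """Return (start, window, score) for every window scoring at or above threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(peptide) - width + 1):
        window = peptide[start:start + width]
        score = sum(column[aa] for column, aa in zip(matrix, window))
        if score >= threshold_bits:
            hits.append((start, window, score))
    return hits
```

In use, each translated candidate exon (plus the flanking residues from the adjacent exons) would be passed to `scan` once per motif matrix, with `threshold_bits` set to 6 or 10 as in the text.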
Primer design of selected candidates
Batch processing of the primer design for all candidate exons was done with a PERL program (prediction_to_primer3.pl) in combination with the PRIMER3 software (36). Primers were designed using the following PRIMER3 settings: primer length: minimum, 19 nt; desired, 25 nt; and maximum, 32 nt; melting temperature: minimum, 64°C; desired, 70°C; and maximum, 73°C; GC content: minimum, 45%; and maximum, 80%; product length, 150-700 nt; and pre-filtering of potentially mispriming sequences with the provided library of human repeats. Figure 3 illustrates the primer design and the expected PCR products. Primer sequences were ordered from Integrated DNA Technologies, Coralville, IN, USA (list of primers available in Supplementary Data).
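For readers who want to reproduce these settings, the sketch below writes one Primer3 input record in Boulder-IO format. It is not prediction_to_primer3.pl: the tag names follow current Primer3 releases (the release available in 2007 used slightly different names for the sequence tags), and the function name, sequence identifier and library path are placeholders.

```python
def primer3_record(seq_id, template, mispriming_library=None):
    """Build one Boulder-IO record using the primer settings listed in the text."""
    tags = [
        ("SEQUENCE_ID", seq_id),
        ("SEQUENCE_TEMPLATE", template),
        ("PRIMER_MIN_SIZE", 19),
        ("PRIMER_OPT_SIZE", 25),
        ("PRIMER_MAX_SIZE", 32),
        ("PRIMER_MIN_TM", 64),
        ("PRIMER_OPT_TM", 70),
        ("PRIMER_MAX_TM", 73),
        ("PRIMER_MIN_GC", 45),
        ("PRIMER_MAX_GC", 80),
        ("PRIMER_PRODUCT_SIZE_RANGE", "150-700"),
    ]
    if mispriming_library is not None:
        # Path to a repeat library file; the value passed in is a placeholder.
        tags.append(("PRIMER_MISPRIMING_LIBRARY", mispriming_library))
    # A Boulder-IO record is a list of TAG=VALUE lines terminated by a lone "=".
    return "\n".join(f"{tag}={value}" for tag, value in tags) + "\n=\n"
```

Concatenating one record per candidate exon gives a single input file that can be fed to the Primer3 command-line program in batch.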
Semi-nested RT-PCR experiments
Pooled total RNA samples from 18 different tissue types were used for semi-nested PCR validation in human (Supplementary Table D). Total RNA from whole worms was used for PCR validation in C. elegans. Superscript II Reverse Transcriptase (Invitrogen, Carlsbad, CA, USA) was used to create cDNA with candidate gene-specific reverse primers. All cDNA samples were tested for the presence of RNA Polymerase II transcript as a control. The cDNA from all tissues was then pooled together and a 1:10 dilution was made to be used as a template for the first round of semi-nested PCR. PCR was carried out with the Sigma Jumpstart Taq DNA polymerase kit on an MJ Research PTC-200 (Bio-Rad Laboratories, Mississauga, ON, Canada), with a first round of 25 cycles of denaturation (45 s at 94°C), annealing (30 s at 56°C) and extension (1 min at 72°C), and a second round of 35 cycles using the same program with a 1:100 dilution of the first-round reaction as template. PCR products were separated in 2% agarose gels supplemented with ethidium bromide and visualized under UV light. High-throughput analysis using Phoretix 1D (Nonlinear Dynamics, Newcastle upon Tyne, UK) facilitated multiple lane band size determinations for all PCR experiments.
Cloning and sequencing
Second round PCR products of the expected predicted size were then ligated into pGEM-T Easy TA cloning vectors (Promega, Madison, WI, USA) and transformed into GeneChoice High Efficiency GC10 chemically competent cells (Cat No. D-1). Bacterial clones were plated on LB/X-gal/IPTG agar plates and grown overnight at 37°C. A maximum of 24 colonies were picked from each plated transformation and used for colony PCR with standard M13 primers. Two microliters of this colony PCR product was then used as the template for cycle sequencing using Applied Biotech, Inc. BigDye® Terminator v3.1 Cycle Sequencing Kit (cat no. 4336917) and then run on an ABI 3700.
GO term over-representation
The gene ontology (GO) terms were taken from the non-redundant ermineDB GO database (37). In this database, 3080 genes are annotated with the signal transduction GO term ID 0007165 out of a total 18 506 annotated human gene products. Sampling of the genes was done without replacement; therefore, we calculated the probability of sampling r genes annotated to a given GO term by using the hypergeometric distribution, where N is the total number of annotated genes, K the number of genes annotated to the GO term and n the number of genes sampled:
$$P(X = r) = \frac{\binom{K}{r}\binom{N-K}{n-r}}{\binom{N}{n}}$$
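As a worked illustration, the hypergeometric probability and a one-sided enrichment P-value can be computed with exact integer arithmetic. Summing the upper tail (r or more annotated genes) is the conventional way to turn the distribution into an enrichment test and is an assumption here; the function and parameter names are illustrative.

```python
from math import comb

def hypergeom_pmf(r, N, K, n):
    """Probability of drawing exactly r annotated genes in n draws without replacement."""
    return comb(K, r) * comb(N - K, n - r) / comb(N, n)

def enrichment_p_value(r, N, K, n):
    """Upper-tail probability of drawing r or more annotated genes."""
    upper = min(K, n)
    favourable = sum(comb(K, k) * comb(N - K, n - k) for k in range(r, upper + 1))
    return favourable / comb(N, n)
```

For the signal transduction term described above, the call would take N = 18506 and K = 3080, with n the number of GO-annotated genes hit by predictions and r the number of those carrying the term, for example `enrichment_p_value(8, 18506, 3080, 20)`.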
Accession numbers
The sequences of the RT-PCR-verified exons were deposited in GenBank under accession numbers EF491733-EF491822.
RESULTS
Overview of PASE
Our goal was to develop an algorithm that would identify novel alternative cassette exons involved in intracellular signaling. We focused on single cassette exons because they make up a significant portion (53-61%) of alternative splicing events in most species (16,17). Our algorithm can be summarized in the following steps (Figure 1):
- Identify cassette exon splicing events using splice-site Markov models and a Bayesian classifier.
- Translate candidate exons and remove those that have premature stops or altered reading frames.
- Select exons that are in conserved sequence regions between species.
- Select exons with predicted phosphorylation or protein-binding motifs known to be involved in intracellular signaling.
We discuss each of these steps below.
Identification of cassette exons
PASE first scans intronic sequences of REFSEQ gene structures for candidate cassette exons. Candidate cassette exons must have pairs of acceptor and donor splice sites within 30-330 bp of each other. Splice site models were generated by training 12-position first-order Markov chains of over-represented dinucleotides for both acceptor and donor splice sites from a randomly sampled training set of real exons from human REFSEQ and C. elegans WORMBASE annotations. We then trained a Bayesian classifier to discriminate acceptor-donor pair scores of real exons from pseudo-exons of similar length. Pseudo-exons are any pair of acceptors and donors generated from a random background intronic sequence distribution. This classifier achieved an average of 96% accuracy on multiple training set runs for both C. elegans and human test sets (See Methods section).
Exon amino acid translations
Alternative exons that introduce frameshifts or premature stop codons are under strong negative selection and are not likely to be functional because they disrupt protein structure (18). Therefore, PASE filters out predicted exons with in-frame stop codons or whose length in base pairs was not a multiple of three, as these cause frameshifts.
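The reading-frame filter can be sketched as below. The handling of split codons is an assumption for illustration: the caller supplies the bases of any codon interrupted by the intron (`upstream_tail`, whose length equals the intron phase, and `downstream_head`, which completes the final codon). The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def passes_frame_filter(exon, upstream_tail="", downstream_head=""):
    """Keep an exon only if it preserves the frame and adds no in-frame stop codon.

    upstream_tail:   last 0, 1 or 2 bases of the upstream exon (intron phase)
    downstream_head: first bases of the downstream exon completing the last codon
    """
    if len(exon) % 3 != 0:
        return False  # inclusion would shift the downstream reading frame
    reading = (upstream_tail + exon + downstream_head).upper()
    if len(reading) % 3 != 0:
        raise ValueError("flanking bases do not complete the split codons")
    codons = (reading[i:i + 3] for i in range(0, len(reading), 3))
    return not any(codon in STOP_CODONS for codon in codons)
```

For a phase-0 intron both flanking arguments are empty; for a phase-1 intron `upstream_tail` has one base and `downstream_head` two, so that every codon touching the exon is checked.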
Species conservation of exons
Sequences encoding alternative exons are significantly more conserved than neutral sequences (19,20). Furthermore, orthologous exons that are alternative in other species are often found to be more conserved than orthologous constitutive exons and they often have conserved intronic sequences flanking them (14,21). PASE requires predicted exons to overlap sequences that are identified as conserved by the PhastCons phylo-HMM program. This algorithm identifies blocks of highly conserved genomic sequence elements using results from multiple sequence alignments of up to 17 different vertebrate species when compared to human (22). For C. elegans, only C. briggsae was used for comparative genomics.
Signaling interaction motifs
To find cassette exons that encode for signaling motifs, we modified the Scansite 2.0 algorithm (http://scansite.mit.edu) (23) and used it to identify cassette exons with intracellular signaling motifs. Scansite uses a database of experimentally generated position-specific scoring matrices (PSSMs) to identify signaling motifs such as kinase phosphorylation sites, SH2 and SH3 domains (Figure 2). We modified this algorithm to use an information-content-based scoring system, and to account for the species-specific background frequencies of the different amino acids. These modifications were important for improving motif score thresholds in our application (see Supplementary Data). We then scored all putative exons against PSSMs for 59 cell signaling motifs (Supplementary Table A).
Prediction of alternatively spliced signaling exons in C. elegans
Caenorhabditis elegans is a model organism that is often used to study signal transduction because it shares many pathways with human and mouse but is more amenable to rapid genetic manipulation. Furthermore, relatively few ESTs have been sequenced from C. elegans (~300 000 ESTs sequenced from worm versus ~7 x 10^6 sequenced from human), so this represents an ideal organism to try an ab initio approach to find novel alternative exons involved in signaling. Using the C. elegans genome sequence as input, PASE predicted 140 putative alternative exons involved in signaling (Table 1). Seventy-four of these could be identified in the C. elegans spliced-ESTs database (Supplementary Table B). The remaining 66 predictions represent either novel alternative exons or false positives and were selected for experimental validation.
Experimental validation of novel C. elegans predictions
We pooled C. elegans total RNA from a mixture of developmental stages. Next we performed semi-nested RT-PCR specific to the predicted alternative exons followed by agarose gel electrophoresis (Figure 3). Bands of the predicted size were cloned and sequenced to determine if the predicted exon was present in the product. We evaluated 66 predictions and found 33 novel exons with correctly predicted 5' splice junctions (50%, see Table 2). These results, together with the 74 predictions already supported by the EST data, indicate the specificity [TP/(TP + FP)] of our algorithm is 76.4% (107/140) in C. elegans.
Analysis of the criteria used to predict exons in C. elegans
In order to understand the relative importance of the three filters used by PASE (donor and acceptor site pairs, sequence conservation and Scansite score), we analyzed the enrichment of our validated alternative exons (either experimentally or from EST data) from the set of all candidate exons at different steps in our algorithm. Because our algorithm does not use EST information to make its prediction, this is a reasonable estimate of the specificity of the algorithm. PASE scanned 118 457 introns in 21 584 genes. After applying the acceptor-donor exon model, Bayesian classifier and reading frame filter, PASE found 6008 translatable exon predictions for worm (Table 1). Here, ~13% of these exon predictions are observed as splice variants in the EST data (Table 1). After selecting exons that are conserved between C. elegans and C. briggsae, 815 predictions remained, of which 38.1% were validated splice forms (Table 1). Thus, a 3- to 4-fold enrichment in documented splice variants is observed when predicted translatable exons are limited to regions of highly conserved sequence. Next, we analyzed the contribution of the intracellular signaling motifs without using a conservation filter. When a 6-bit (or 10-bit) scoring threshold was used, 823 (or 113) of the 6008 exons met this criterion, 30.8% (or 37.2%) of which were in the EST database or experimentally validated. When the conservation filter and the 6-bit threshold were combined, 140 exons passed the filter, 74 (52.8%) of which were validated by spliced-ESTs. Together with the 33 novel validated exons, a total of 107 (76.4%) exons were validated. From this analysis, we conclude that conservation and Scansite score are largely independent filters, and while both contribute significantly to the specificity of the algorithm, conservation plays a slightly larger role.
Prediction of alternatively spliced signaling exons in human
We next sought to use PASE to find novel signaling exons in humans. We used PASE to scan 197 684 human introns from 22 615 genes (Table 1) and predicted 804 AS exons involved in signaling. Of these, 308 (38.5%) could be found in the human spliced-ESTs library (Supplementary Table C). The remaining 496 predictions are either novel exons or false positives of our algorithm. We ranked these predictions by Scansite bit score and selected the top 384 predictions for experimental validation.
Experimental validation of novel human predictions
Since testing each of the 384 predictions individually across each tissue would require a large number of experiments, we decided to pool total RNA from 18 human tissues (Supplementary Table D). In order to compensate for the increased dilution of already low expressed splice forms, we performed a sensitive, semi-nested RT-PCR approach specific to the predicted alternative exons followed by agarose gel electrophoresis (Figure 3). Bands of the predicted size were cloned and sequenced. Of the top 384 predictions that were not in the EST database, we found 26 (6.7%) with correctly predicted 5' splice junctions that were validated using this approach (Table 3). Overall, our algorithm achieved a specificity of 41.5% (334/804) in human (Table 1).
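The specificity figures reported for the two organisms follow directly from the counts given in the text, as this short check shows. The function name is illustrative; note that the quantity TP/(TP + FP) used here is what is often called precision, and that every prediction without EST or RT-PCR support, including those never tested, is counted as a false positive.

```python
def specificity(est_supported, newly_validated, total_predictions):
    """TP / (TP + FP), with TP = EST-supported plus experimentally validated exons."""
    return (est_supported + newly_validated) / total_predictions

worm = specificity(74, 33, 140)     # 107 of 140 predictions
human = specificity(308, 26, 804)   # 334 of 804 predictions

print(f"C. elegans: {worm:.1%}")    # prints "C. elegans: 76.4%"
print(f"human: {human:.1%}")        # prints "human: 41.5%"
```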
Analysis of the criteria used to predict exons in human
We wanted to understand the role each filter played in distinguishing novel exons from pseudo-exons, so we calculated how each step of our algorithm enriched for bona fide exons. PASE scanned 197 684 human introns (~1075 MB) and applied the acceptor-donor exon model, Bayesian classifier and reading frame filter to produce 207 176 predictions (Table 1). Six percent of these predictions are supported by experimental evidence (either present in the EST database or validated by our RT-PCR experiments). We next applied the conservation filter. Of the 207 176 exons, only 5160 (2.4%) were conserved across vertebrate species. Twenty-two percent of these had experimental support, somewhat lower than that observed in C. elegans. We also separately searched the 207 176 translatable exons for signaling motifs with log-odds scores greater than 6 bits (or 10 bits) and found that 35 190 (or 5489) exon predictions had one or more motifs that met or exceeded this score threshold (Table 1). In this subset of predictions, the enrichment for matches with expressed sequences was slightly lower than in C. elegans, with 13.7% (19.4% for 10 bits) of these high-scoring exons matching already observed spliced-EST patterns in human. When both species conservation and high-scoring Scansite motif criteria are combined, there is a significant increase in the enrichment for expressed sequences, with 308 of 804 (38.5%) predictions matching spliced-ESTs, reflecting a ~2-fold increase in specificity. Thus, including the 26 experimentally validated exons, the total specificity was 41.5%. The results are similar to those observed in C. elegans: both the conservation and Scansite filters contribute significantly to our specificity and these filters are largely independent.
Determining the sensitivity of the PASE algorithm
A test set was created using splicing event data from the Alternative Splicing Database (ASD) (24) and a list of experimentally verified phosphorylation sites in eukaryotic proteins from PhosphoELM (25). A set of 3797 single cassette exon splicing events was extracted from the ASD database and used in a BLASTX comparison with the PhosphoELM database. A total of 20 PhosphoELM sites matched a single cassette exon with 100% sequence identity (Supplementary Table E). We then applied PASE to the corresponding REFSEQ gene structures that lack the signaling cassette exons. For fourteen of the exons, PASE correctly predicted both the acceptor and donor splice sites and correctly identified the phosphorylation site [total sensitivity, TP/(TP + FN) = 70%]. Of the false negatives, two cassette exons were correctly predicted, but PASE missed the phosphorylation site. Three exons had incorrect donor sites predicted, of which one had the phosphorylation site predicted correctly. For the remaining exon in the test set, neither the exon nor the phosphorylation site was predicted.
Predicted exons are found in signaling proteins
If our validated alternative exons are functional (i.e. play a role in signal transduction), then we expect to find them predominantly in genes involved in intracellular signaling. On the other hand, if PASE is not finding functional exons, we would expect our predictions to be randomly distributed across all genes (after correcting for differences in the amount of intronic sequence). Therefore, to determine if PASE is finding functional exons, we used a hypergeometric test to calculate P-values of the enrichment of our predicted alternative exons to occur in signal transduction genes labeled with term GO:0007165 (See Methods section). The 26 novel human exons discovered in this study mapped to 20 GO annotated genes, and our set of 334 validated exons (covered by ESTs or experimentally validated here) mapped to 218 GO annotated genes. In both cases, a significant enrichment was observed (8 of 20 GO annotated genes, P = 0.002, and 62 of 218 GO annotated genes, P = 3e-6). These results support the conclusion that we are enriching for functional exons. A similar analysis was performed on all the 804 predictions, which mapped to 504 GO annotated genes, 151 of which were signal transduction genes. Interestingly, this complete set of predictions also showed significant enrichment (P < 4e-12).
Differential tissue-specific splicing of the novel exons in LRP1 and ESR1
To determine if the inclusion and/or exclusion of the exons occurred in a tissue-specific manner, we performed RT-PCR in each of the 18 human tissues. We focused on two novel exons from two genes, estrogen receptor alpha (ESR1/ERα) and low-density lipoprotein receptor 1 (LRP1), because both genes are involved in important signal transduction pathways with clinical significance (26,27). We used flanking exon primers for the canonical splice junction and semi-nested, exon-specific primers for the novel exon variant. In the case of ERα, we found only breast and liver expressed the novel minor splice variant, but all tissues tested expressed the constitutive splice junction (Figure 4A). This result shows that the novel ERα exon is typically excluded from most tissues, and that its inclusion may be due to tissue-specific splicing mechanisms. In the case of LRP1, we saw a broader distribution of this minor splice variant, with the exception of uterus tissue, which did not show any expression of this novel exon (Figure 4B). The observation that these isoforms are expressed in a tissue-specific fashion lends further support to the idea that these are not stochastic events, but instead regulated to perform a tissue-specific function. Also, these results indicate that the flanking PCR is less sensitive for the detection of these minor splice variants than the exon-specific semi-nested PCR approach.
DISCUSSION
Our algorithm is the first attempt to predict signaling cassette exons by combining sequence-based exon prediction with additional information from short peptide motifs that are bound or phosphorylated by signaling proteins. This approach, when applied to C. elegans, made 140 predictions, 74 of which were present in the C. elegans EST database. We experimentally tested the remaining 66 predictions, finding an additional 33 novel isoforms. Thus, the overall specificity of our algorithm is 76.4% (107/140) in C. elegans. We also used the algorithm to find human cassette exons, making 804 predictions, of which 308 were found as alternative exons in sequenced ESTs. We experimentally tested 384 of the remaining 496 predictions and discovered an additional 26 novel human exons (total specificity 334/804, 41.5%). Overall, we discovered 59 novel cassette exons. The human exons that we uncovered are likely to be involved in signal transduction because they were found in the introns of known signal transduction genes more often than expected by chance (P < 0.003). Using a test set of known AS phosphorylation sites, we determined that the sensitivity of our algorithm is 70%.
In both organisms analyzed, a large fraction of predictions were found in the EST database: 52.8% in C. elegans and 38.5% in humans. Surprisingly, when we experimentally tested the remaining predictions, we saw different discovery rates in the two organisms. In C. elegans, 50% of the predicted exons not present in spliced-ESTs were validated; in human, only 6.7% were validated. Why was the discovery rate lower in humans than C. elegans? One might hypothesize that because C. elegans has shorter introns than humans (~15-fold) and their splice sites have a higher information content (28), PASE makes more accurate predictions in C. elegans. However, this seems unlikely to be the main reason because the fraction of predictions covered by ESTs was similar in C. elegans and humans. Another possibility is that our experimental approach is not able to detect novel exons in humans as well as it can in C. elegans. To validate our C. elegans predictions, we used total RNA obtained from whole worms at various stages of their life cycles. This means that RNA from every cell type was present in the sample, which may explain the higher discovery rate. On the other hand, because we analyzed RNA from 18 human tissues, we sampled only a small fraction of human cell types (out of a total possible 210 cell types). Therefore, it is possible that some of our predictions were bona fide exons but were not present in our RNA pool, and that our total specificity is at least 41.5%. In fact, the GO term analysis for all 804 predictions (P < 4e-12) suggests there may be more signaling exons that we have yet to find.
The novel exons found here all encode phosphorylation or binding sites; this allows one to predict interactions with specific signaling proteins. For example, the novel exon we found in estrogen receptor alpha (ERα) is predicted to have an SH3 class II binding site (K-P-X-X-Q/K-X) targeted by the p85α SH3 domain. Thus, one would predict that this isoform of ERα interacts with a protein containing a p85α SH3 domain. Indeed, previous work suggests that ERα directly interacts with the p85α SH3 domain of PI3K, but since the canonical ERα protein does not contain a binding site, the mechanism of this interaction is currently unknown (29). The cassette exon found here may explain this interaction. The involvement of the estrogen-signaling pathway in cancer (27), obesity (30,31) and cardiovascular disease (32) makes this a particularly interesting direction for future work. In addition, several other interesting exon predictions coincided with published literature and could be candidates for further investigation (see Supplementary Table C).
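To make the consensus K-P-X-X-Q/K-X concrete, the sketch below scans a peptide for it with a regular expression. This is a deliberate simplification for illustration: a strict pattern match is cruder than scored motif prediction and is not presented as the method PASE uses. The function name and the example peptide are invented here, and the peptide is not the ERα exon sequence.

```python
import re

# Consensus from the text: K-P-X-X-Q/K-X, where X is any residue.
# The lookahead lets overlapping occurrences be reported as well.
SH3_CLASS_II = re.compile(r"(?=(KP[A-Z]{2}[QK][A-Z]))")


def find_motif(protein: str) -> list[tuple[int, str]]:
    """Return (1-based start position, matched peptide) for each occurrence."""
    return [(m.start() + 1, m.group(1))
            for m in SH3_CLASS_II.finditer(protein.upper())]


# Hypothetical peptide, made up purely to exercise the pattern.
example = "MSTKPLAQGSAAKPEEKTW"
for position, peptide in find_motif(example):
    print(position, peptide)
```

A hit from such a scan only flags a candidate site; whether the site is bound in vivo still has to be established experimentally.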
Our exon validation approach was designed to facilitate the detection of rare alternative exons in pooled RNA samples, and we found it to be robust and sensitive. One drawback to our validation pipeline was that only the 5' splice junctions were sequenced; the 3' splice junctions were not. The correct 5' junction guarantees the preservation of the reading frame through the novel exon and the presence of the putative interaction site, but the protein might be modified or truncated downstream of the exon. Therefore, for a subset of our predictions, we also confirmed the 3' splice junctions and found that 31 out of 31 of them had the correct predicted 3' splice junction (see Supplementary Data C and D).
The results presented here demonstrate that PASE is able to find alternative signaling exons with high selectivity. We used PASE to predict exons in humans and C. elegans and discovered 59 novel exons, several of which may play important biological roles. Because PASE does not use EST data to predict exons, it may be particularly useful when applied to organisms with low EST coverage (e.g. as was the case with C. elegans). Currently, PASE is able to predict alternative single cassette exons; we plan to extend the algorithm to encompass exon extensions, intron retentions and alternative 3' or 5' exons and possibly include the prediction of the disruption of putative signaling motifs (33). In addition, we anticipate an improvement in performance when more signaling-related protein features such as sites for acetylation, proteolysis and even protein domains are included in these splicing event predictions.
| SUPPLEMENTARY DATA |
|---|
Supplementary Data are available at NAR online.
| ACKNOWLEDGEMENTS |
|---|
We would like to thank Gary Stormo, Justin Sonnenburg, Katherine Varley and Jason Gertz for reviewing our manuscript, and Tim Schedl and Elaine Mardis for their excellent suggestions concerning our experimental approach. Total RNA samples of C. elegans were graciously donated by Jennifer Davila-Aponte from the Sean Eddy lab. We would also like to thank Yun Yue, Lee Tessler and Michael Brooks for helpful discussions. This work and the Open Access publication charges were funded by NIH grant no. 5P50HG003170-03 and GATP training grant no. T32-HG000045, Whitaker Foundation (St. Louis, MO).
Conflict of interest statement. None declared.
| REFERENCES |
|---|
1. Johnson JM, et al. Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science (2003) 302:2141-2144.
2. Kan Z, Rouchka EC, Gish WR, States DJ. Gene structure prediction and alternative splicing analysis using genomically aligned ESTs. Genome. Res. (2001) 11:889-900.
3. Okazaki Y, et al. Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature (2002) 420:563-573.
4. Zavolan M, Kondo S, Schonbach C, Adachi J, Hume DA, Hayashizaki Y, Gaasterland T. Genome Res (2003) 13:1290-1300.
5. Lareau LF, Green RE, Bhatnagar RS, Brenner SE. Curr Opin Struct Biol (2004) 14:273-282.
6. Zhou J, Cidlowski JA. The human glucocorticoid receptor: one gene, multiple proteins and diverse responses. Steroids (2005) 70:407-417.
7. Meier J, Grantyn R. Preferential accumulation of GABAA receptor gamma 2L, not gamma 2S, cytoplasmic loops at rat spinal cord inhibitory synapses. J. Physiol. (2004) 559(Pt. 2):355-365.
8. Stamm S, Ben-Ari S, Rafalska I, Tang Y, Zhang Z, Toiber D, Thanaraj TA, Soreq H. Gene (2005) 344:1-20.
9. Birukov KG, et al. Differential regulation of alternatively spliced endothelial cell myosin light chain kinase isoforms by p60(Src). J. Biol. Chem. (2001) 276:8567-8573.
10. Freitas RA. Nanomedicine. In: Appendix C: Catalog of Distinct Cell Types in the Adult Human Body. (1999) Georgetown, TX: Landes Bioscience.
11. Zavolan M, van Nimwegen E. The types and prevalence of alternative splice forms. Curr. Opin. Struct. Biol. (2006) 16:362-367.
12. Sun H, Chasin LA. Multiple splicing defects in an intronic false exon. Mol. Cell. Biol. (2000) 20:6414-6425.
13. Ohler U, Shomron N, Burge CB. Recognition of unknown conserved alternatively spliced exons. PLoS Comput. Biol. (2005) 1:113-122.
14. Yeo GW, Van Nostrand E, Holste D, Poggio T, Burge CB. Proc Natl Acad Sci U S A (2005) 102:2850-2855.
15. Hiller M, Huse K, Platzer M, Backofen R. Nucleic Acids Res (2005) 33:5611-5621.
16. Stamm S, Zhu J, Nakai K, Stoilov P, Stoss O, Zhang MQ. DNA Cell Biol (2000) 19:739-756.
17. Mironov AA, Fickett JW, Gelfand MS. Frequent alternative splicing of human genes. Genome. Res. (1999) 9(12):1288-1293.
18. Magen A, Ast G. The importance of being divisible by three in alternative splicing. Nucleic Acids Res. (2005) 33:5574-5582.
19. Modrek B, Lee CJ. Alternative splicing in the human, mouse and rat genomes is associated with an increased frequency of exon creation and/or loss. Nat. Genet. (2003) 34:177-180.
20. Sugnet CW, Kent WJ, Ares M Jr, Haussler D. Pac Symp Biocomput (2004) 66-77.
21. Sorek R, Ast G. Intronic sequences flanking alternatively spliced exons are conserved between human and mouse. Genome. Res. (2003) 13:1631-1637.
22. Siepel A, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome. Res. (2005) 15:1034-1050.
23. Yaffe MB, Leparc GG, Lai J, Obata T, Volinia S, Cantley LC. Nat Biotechnol (2001) 19:348-353.
24. Stamm S, Riethoven JJ, Le Texier V, Gopalakrishnan C, Kumanduri V, Tang Y, Barbosa-Morais NL, Thanaraj TA. Nucleic Acids Res (2006) 34:D46-55.
25. Diella F, et al. Phospho.ELM: a database of experimentally verified phosphorylation sites in eukaryotic proteins. BMC Bioinformatics (2004) 5:79.
26. Boucher P, Gotthardt M, Li WP, Anderson RG, Herz J. Science (2003) 300:329-332.
27. Ariazi EA, Ariazi JL, Cordera F, Jordan VC. Curr Top Med Chem (2006) 6:181-202.
28. Lim LP, Burge CB. A computational analysis of sequence features involved in recognition of short introns. Proc. Natl Acad. Sci. USA (2001) 98:11193-11198.
29. Simoncini T, Rabkin E, Liao JK. Molecular basis of cell membrane estrogen receptor interaction with phosphatidylinositol 3-kinase in endothelial cells. Arterioscler. Thromb. Vasc. Biol. (2003) 23:198-203.
30. Mueller SO, Korach KS. Estrogen receptors and endocrine diseases: lessons from estrogen receptor knockout mice. Curr. Opin. Pharmacol. (2001) 1:613-619.
31. Heine PA, Taylor JA, Iwamoto GA, Lubahn DB, Cooke PS. Proc Natl Acad Sci U S A (2000) 97:12729-12734.
32. Wang M, Crisostomo P, Wairiuko GM, Meldrum DR. Am J Physiol Heart Circ Physiol (2006) 290:H2204-2209.
33. Hiller M, Huse K, Platzer M, Backofen R. Genome Biol (2005) 6:R58.
34. Burge CB. Modeling dependencies in pre-mRNA splicing signals. In: Computational Methods in Molecular Biology. Salzberg SL, Searls DB, Kasif S, eds. (1998) Amsterdam: Elsevier Science.
35. Obenauer JC, Cantley LC, Yaffe MB. Scansite 2.0: proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic Acids Res. (2003) 31:3635-3641.
36. Rozen S, Skaletsky H. Primer3 on the WWW for general users and for biologist programmers. Methods Mol. Biol. (2000) 132:365-386.
37. Lee HK, Braynen W, Keshav K, Pavlidis P. BMC Bioinformatics (2005) 6:269.