Nucleic Acids Research, 2003, Vol. 31, No. 24 7271-7279
© 2003 Oxford University Press
Gene structure prediction in syntenic DNA segments
1 Molecular Biology Institute, 2 Molecular, Cell and Developmental Biology and 3 Human Genetics, University of California Los Angeles, Los Angeles, CA 90095, USA
*To whom correspondence should be addressed at 242 Boyer Hall, UCLA, 611 East CE Young Drive, Box 951570, Los Angeles, CA 90095, USA. Tel: +1 310 825 2546; Fax: +1 310 206 7286; Email: lake@mbi.ucla.edu
Present address:
Jonathan E. Moore, Biology Department, Pomona College, Claremont, CA 91711, USA
Received July 14, 2003; Revised and Accepted October 14, 2003
| ABSTRACT |
|---|
The accurate prediction of higher eukaryotic gene structures and regulatory elements directly from genomic sequences is an important early step in the understanding of newly assembled contigs and finished genomes. As more new genomes are sequenced, comparative approaches are becoming increasingly practical and valuable for predicting genes and regulatory elements. We demonstrate the effectiveness of a comparative method called pattern filtering; it utilizes synteny between two or more genomic segments for the annotation of genomic sequences. Pattern filtering optimally detects the signatures of conserved functional elements despite the stochastic noise inherent in evolutionary processes, allowing more accurate annotation of gene models. We anticipate that pattern filtering will facilitate sequence annotation and the discovery of new functional elements by the genetics and genomics communities.
| INTRODUCTION |
|---|
The increasing diversity of metazoan and other eukaryotic genomes is a major opportunity for the comparative genomics community. Two principal approaches are used to predict protein-coding regions in genomic sequences (1-5). Ab initio methods analyze codon usage, potential splice site sequences, exon length and other features to distinguish coding regions from non-coding regions and thereby construct gene models (6-8). Extrinsic methods compare genomic sequences with those of known proteins at either the amino acid or nucleotide level (9-11). Ab initio methods can detect proteins for which there is no known homolog, while extrinsic methods cannot. However, ab initio methods are trained on limited data sets, making them apt to predict genes structurally resembling those in their training sets while missing others (12).
As more genomes of closely related organisms are sequenced (13-15), another approach is becoming increasingly valuable (16,17). In this approach, long homologous sequences, also called syntenic sequences, are compared, and less diverged regions are assumed to be functional elements since these elements are generally subject to significant selection. This approach identifies not only potential coding regions, but also non-coding regions which can regulate the expression of genes or which serve as templates for non-coding RNAs. In addition to the manual use of such an approach (18-24), in the last few years several gene prediction programs have been created to exploit these comparative approaches (25-28).
Here, we describe the implementation of a method called pattern filtering for comparative gene finding and demonstrate its capacity to identify gene structures and putative regulatory elements (29). It is based on a fundamental evolutionary model which has three parts and has been used previously for gene finding by others (30). First, coding exons are generally more conserved than neighboring sequences. Secondly, the first and second codon positions are more conserved than the third, or wobble, position. Thirdly, regulatory elements are also more conserved than neighboring sequences but, unlike coding exons, they do not show the distinctive triplet pattern found in coding sequences.
The core of pattern filtering is a Wiener filter, or optimal linear filter (31). This technique optimally separates the desired signals from the noise which obscures them. In our case, the signals correspond to the spatial distributions of sequence variation, while the noise comes primarily from the stochastic nature of mutation, which appears as a discrete change at an alignment position. By eliminating this noise, we generate estimates of the evolutionary distance at each site, thereby making coding regions readily apparent.
| MATERIALS AND METHODS |
|---|
Our process for annotating genetic structures and identifying putative regulatory elements has two steps. First, from an alignment of syntenic sequences, we compute quantities we call the filtered distances and the filtered coding bias. Next, gene models are constructed from the interpretation of the filtered distances, filtered coding bias, possible splice sites and possible peptide sequences. Here we primarily illustrate these methods for two sequences, though they are extended to more than two.
Distance maps
In order to compute the filtered distances, which are measures of sequence divergence, one first needs to convert an alignment of symbols (A, C, G, T, -) to a series of numbers suitable for filtering. The two alternative maps that we use to create this numeric function are called the 1-D (one-dimensional) and 5-D (five-dimensional) maps.
The 1-D map is a simple distance function. Each alignment position is given the value 0 or 1, depending on whether homologous sites share the same or different nucleotide states, respectively (Fig. 1a) (29). All alignment positions gapped in the sequence we wish to annotate, which we call the reference sequence, are omitted from the function, making the function the same length as the reference sequence. If more than two sequences are analyzed, then the function is typically the sum of all pairwise distances between sequences. This function is then floated and padded with zeroes to increase the number of points to twice the next greater power of 2 for the subsequent fast Fourier transforms (32).
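The construction of the 1-D map for two aligned sequences can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `one_d_map` is hypothetical, "floating" is assumed to mean subtracting the mean, a gap in the non-reference sequence is assumed to count as a difference, and "next greater power of 2" is read as the smallest power of 2 that is at least the signal length.

```python
import numpy as np

GAP = "-"

def one_d_map(reference: str, other: str) -> np.ndarray:
    """Return a floated, zero-padded 0/1 distance function for two aligned rows."""
    if len(reference) != len(other):
        raise ValueError("aligned rows must have equal length")
    # Omit every column that is gapped in the reference sequence, so the
    # function has the same length as the ungapped reference.
    values = [0.0 if r == o else 1.0
              for r, o in zip(reference.upper(), other.upper())
              if r != GAP]
    if not values:
        raise ValueError("reference row contains no nucleotides")
    signal = np.asarray(values)
    # "Float" the function (assumption: subtract the mean).
    signal = signal - signal.mean()
    n = len(signal)
    # Smallest power of 2 >= n (assumed reading of "next greater power of 2").
    next_pow2 = 1 << (n - 1).bit_length()
    # Pad with zeroes to twice that length for the subsequent FFTs.
    padded = np.zeros(2 * next_pow2)
    padded[:n] = signal
    return padded
```

For example, `one_d_map("ACG-TA", "ACTTTC")` keeps five reference positions and returns an array of length 16.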
To construct the 5-D map, one first produces a sequence of joint probability matrices, one for each pair of homologous sites, resulting in a three-dimensional array (Fig. 1b). Each of the dimensions of length four is then rearranged into two dimensions of length two, one corresponding to purines versus pyrimidines and the other corresponding to G/C versus A/T; this produces a 2 x 2 x 2 x 2 hypercube in place of the joint probability matrix (Fig. 1c). These rearranged joint probability matrices maintain information which we will later use to construct distances at each site based on general evolutionary models. Each of the 16-long sections of the function is floated individually, and the whole is padded as in the 1-D map.
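The rearrangement into a hypercube can be sketched as below. This is an illustrative reading under several assumptions: the nucleotide index order in `INDEX` is chosen here so that a plain reshape yields the purine/pyrimidine and G/C versus A/T axes, columns containing a gap are simply skipped, "floating" is taken to mean subtracting the mean of each of the 16 series, and the zero padding is omitted. The names `site_hypercube` and `five_d_map` are hypothetical.

```python
import numpy as np

# Assumed ordering: after reshaping an axis of length 4 into (2, 2), the first
# new axis is purine (0) vs pyrimidine (1), the second is G/C (0) vs A/T (1).
INDEX = {"G": 0, "A": 1, "C": 2, "T": 3}

def site_hypercube(a: str, b: str) -> np.ndarray:
    """Joint probability matrix for one observed site pair, as a 2x2x2x2 array."""
    joint = np.zeros((4, 4))
    joint[INDEX[a], INDEX[b]] = 1.0   # a single observation at this site
    return joint.reshape(2, 2, 2, 2)

def five_d_map(reference: str, other: str) -> np.ndarray:
    """Stack per-site hypercubes into an array of shape (L, 2, 2, 2, 2)."""
    cubes = [site_hypercube(r, o)
             for r, o in zip(reference.upper(), other.upper())
             if r in INDEX and o in INDEX]   # assumption: skip gapped columns
    stacked = np.stack(cubes)
    # Float each of the 16 position series individually.
    return stacked - stacked.mean(axis=0)
```

The position axis plus the four binary axes give the five dimensions that the map is named for.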
The 5-D map is superior for gene finding in gene-dense regions, while the 1-D map is better in regions that are gene poor. This is because the 1-D map concentrates all the signal from the codons into one peak, making the signal easy to identify and describe even when there is very little signal; however, the 5-D map spreads this signal among 16 potential peaks, making this identification and description difficult in gene-poor regions.
For three sequences, the 1-D map is logically extended by taking the average of the three pair-wise distances, yielding values of 0, 2/3 or 1. Likewise, the 5-D function is extended to seven dimensions (7-D). Generalizations to four or more sequences are also implemented though not demonstrated.
The Wiener filter and pattern filtered distances
The resulting numeric function is filtered by an optimal linear filter, also known as a Wiener filter (31). In brief, spatially varying signals will have well-defined frequency bands, while stochastic noise will not; the Wiener filter optimally eliminates most of the noise in frequency space where it is easily recognizable, and then the result is transformed back to real space to see the filtered signals. Details of the filter are given below; see Lake (29) and Press et al. (32) for additional descriptions of the Wiener filter.
Fourier transforms of the numeric function are performed using a real fast Fourier transform (32,33). Power spectra are calculated by windowing the sequence alignment positions using a Parzen window in order to improve the estimate of the variance and minimize leakage (32). The resulting one-sided power spectra are between 32 and 512 long, increasing as the length of the reference sequence increases; e.g. the 200 kb sequence of our test set created a power spectrum 128 long.
The noise component, $|N|^2$, of the power spectrum is approximated by a constant plus a sum of cosine curves fitted through the points away from the (possible) signal peaks. The possible signal peaks are at or near frequencies of 1/(3 bp), 1/(2 bp) and 0/(bp). To estimate the signal component, $|S|^2$, an extrapolation of the noise is subtracted from the power spectrum. For each peak which is present, the estimated $|S|^2$ is fitted around 0/(bp) and 1/(3 bp) with a sum of either one or two Gaussians. The sometimes present signal peak at 1/(2 bp) is not included because it does not correlate with coding and non-coding regions.
The formula for the Wiener filter is
$$\frac{|S|^2}{|S|^2 + |N|^2}.$$
The Fourier transform of the numeric function created above is multiplied by the Wiener filter to separate the signal from the noise; this product is inverse Fourier transformed, yielding the optimal estimate of the signal. In the case of the 1-D map, the end result is a filtered distance at each site. Yet in the case of the 5-D map, this estimate is a sequence of joint probability matrices. To more easily interpret these, paralinear, also called LogDet, distances are then calculated from the joint probability matrices, yielding generally additive distances which more accurately reflect the extent of evolution (34,35).
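The filtering step can be illustrated with a deliberately simplified sketch. It departs from the procedure described above in ways that should be kept in mind: the noise floor is assumed flat and is estimated by the median of the raw periodogram (rather than a constant plus cosine curves fitted away from the peaks), the signal power is taken as whatever exceeds that floor (rather than fitted Gaussians), and no Parzen windowing is applied. The function name `wiener_filter_signal` is hypothetical.

```python
import numpy as np

def wiener_filter_signal(x: np.ndarray) -> np.ndarray:
    """Apply a crude Wiener filter |S|^2 / (|S|^2 + |N|^2) in frequency space."""
    spectrum = np.fft.rfft(x)
    power = np.abs(spectrum) ** 2
    # Assumption: flat noise floor, estimated as the median periodogram value.
    noise = np.full_like(power, np.median(power))
    # Estimated signal power: the part of the spectrum above the noise floor.
    signal = np.clip(power - noise, 0.0, None)
    # Guard against 0/0 when the input is identically zero.
    phi = signal / np.maximum(signal + noise, np.finfo(float).tiny)
    # Multiply the transform by the filter and transform back to real space.
    return np.fft.irfft(spectrum * phi, n=len(x))
```

Applied to a sine wave with added Gaussian noise, the filtered output is typically much closer to the clean wave than the noisy input is, because bins dominated by noise are strongly attenuated while the strong signal bin passes almost unchanged.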
The 7-D map for three sequences is also a sequence of three-dimensional joint probability matrices. Columns of these joint probability matrices are summed three times to create three two-dimensional joint probability matrices at each position. These are then used to calculate paralinear/LogDet distances, and the three distances at each position are averaged. As before, this is easily generalized to four or more sequences.
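The marginalization and distance calculation for one alignment position can be sketched as follows. The paralinear distance is written here in the commonly cited form $d = -\ln\left[\det J / \sqrt{\det D_x \det D_y}\right]$, where $J$ is the 4 x 4 joint probability matrix and $D_x$, $D_y$ are diagonal matrices of its row and column sums; normalization conventions differ between paralinear and LogDet formulations, so treat the exact scaling as an assumption. The function names are hypothetical, and positions where the determinant is not positive are returned as NaN.

```python
import numpy as np

def paralinear_distance(joint: np.ndarray) -> float:
    """Paralinear distance from a 4x4 joint probability matrix (assumed form)."""
    row = joint.sum(axis=1)   # nucleotide frequencies in the first sequence
    col = joint.sum(axis=0)   # nucleotide frequencies in the second sequence
    det_j = np.linalg.det(joint)
    if det_j <= 0 or np.any(row <= 0) or np.any(col <= 0):
        return float("nan")   # distance undefined for this matrix
    return float(-np.log(det_j / np.sqrt(np.prod(row) * np.prod(col))))

def pairwise_marginals(joint3: np.ndarray) -> list:
    """Sum a 4x4x4 joint array over each axis in turn, giving three 4x4 matrices."""
    return [joint3.sum(axis=k) for k in (2, 1, 0)]

def mean_paralinear(joint3: np.ndarray) -> float:
    """Average of the three pairwise paralinear distances at one position."""
    return float(np.mean([paralinear_distance(m)
                          for m in pairwise_marginals(joint3)]))
```

With identical sequences and uniform base frequencies, the joint matrix is diagonal with entries 0.25 and the distance is zero, as expected.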
Coding bias
In addition to filtered distance, we also use a filtered single-sequence content measure, which we call the coding bias, to aid in the identification of coding regions. In order to calculate the coding bias, we first classify hexamers from the May 19, 2000 Sanger Center human chromosome 22 sequence into coding and non-coding according to the annotation at the time [(36) http://www.sanger.ac.uk/HGP/Chr22a]; from the analysis, we omitted hexamers overlying coding/non-coding boundaries, lying in partial genes or ambiguously annotated segments, or containing ambiguous nucleotides.
Next we constructed a set of 2 x 2080 = 4160 hexamer bins. The 2 comes from the two possible states, coding and non-coding, and 2080 is the number of possible unique hexamers when the reverse complement is considered [$2080 = (4^6 - 4^3\ \text{palindromes})/2 + 4^3\ \text{palindromes}$]. We next sort each of the 406 041 coding hexamers and 32 877 917 non-coding hexamers into these bins according to their sequences and coding status. We calculate the coding bias for each bin $n$ by the formula:
$$\frac{\text{NonCodingBin}_n}{80 \times \text{CodingBin}_n + \text{NonCodingBin}_n}$$
Values appreciably less than 0.5 represent hexanucleotides that are systematically favored for coding, while those above are unfavored.
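The bin count and the bias formula can be checked with a short sketch. Identifying a hexamer with its reverse complement is implemented here by keeping the lexicographically smaller of the two, which is one convenient convention and not necessarily the one used by the authors; the function names are hypothetical. The weight of 80 is close to the ratio of non-coding to coding hexamers quoted above (about 81), which would place an uninformative hexamer near 0.5; that interpretation is an inference, not a statement from the text.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(hexamer: str) -> str:
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return hexamer.translate(COMPLEMENT)[::-1]

def canonical(hexamer: str) -> str:
    """Bin label shared by a hexamer and its reverse complement (assumed convention)."""
    return min(hexamer, reverse_complement(hexamer))

ALL_HEXAMERS = ["".join(p) for p in product("ACGT", repeat=6)]
# Hexamers equal to their own reverse complement ("palindromes").
PALINDROMES = [h for h in ALL_HEXAMERS if h == reverse_complement(h)]
# Unique bins once reverse complements are merged.
BINS = sorted({canonical(h) for h in ALL_HEXAMERS})

def coding_bias(coding_count: int, noncoding_count: int, weight: float = 80.0) -> float:
    """NonCodingBin / (weight * CodingBin + NonCodingBin) for one hexamer bin."""
    return noncoding_count / (weight * coding_count + noncoding_count)
```

Here `len(PALINDROMES)` is 64 and `len(BINS)` is (4096 - 64)/2 + 64 = 2080, matching the count given above.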
The distribution of values for the hexanucleotide bias is wide, robust and statistically meaningful. The minimum and maximum values are 0.0585 and 0.9711. No bin has too few counts, since the minimum number of counts in any bin is 8. The standard deviation of the coding bias of the bins is 0.1895. To test whether this is statistically different from random, we performed 1000 simulations with the same hexanucleotide distribution, but now randomly distributed into exon and intron bins. The standard deviations of these simulations were approximately normally distributed, with a mean of 0.0291 and standard deviation of σ = 0.0011. Since the standard deviation of the coding bias of the bins is 141σ from the mean standard deviation, it is clearly statistically different from random.
To calculate the coding bias function of an alignment, we first take the coding bias of each hexanucleotide of both sequences and average the results; if gaps are present, only one hexanucleotide bias is used. The resulting function is subjected to a Wiener filter analysis as in the 1-D map above but with any peak at the frequency 1/(3 bp) ignored.
Annotation
The annotation is done by the user with the GeneGrabber program, a viewer which displays the filtered distances, filtered coding bias, possible splice sites and potential peptide sequences. New exons are included by a combination of seven criteria: (i) the overall conservation of a segment; (ii) the presence of a pattern where every third position is appreciably less well conserved than the others; (iii) a filtered coding bias favoring coding; (iv) no stop codons in the favored reading frames; (v) the presence of strong flanking splice sites or in-frame start or stop codons; (vi) the agreement of the frames of adjacent exons; and (vii) previously described length distributions of introns, exons and genes. By clicking near putative exon ends and choosing a gene model, one inserts an exon into that gene model. The sequences and structures of gene models as well as the sequence of interesting conserved regions can be saved to files.
Regulatory regions are identified as highly conserved regions where all three possible codon positions are approximately equally conserved; for our purposes, highly conserved is defined as a segment of 30 or more nucleotides with filtered distances below 0.20. Their positions and sequences can also be saved to files.
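The length and threshold part of this definition can be sketched as a simple scan over the filtered distances. The sketch covers only the "highly conserved" criterion (30 or more consecutive positions below 0.20); the additional requirement that the three codon positions be approximately equally conserved is not checked here. The function name and the half-open interval convention are illustrative choices.

```python
def conserved_segments(distances, threshold=0.20, min_length=30):
    """Return (start, end) half-open intervals of runs below the threshold."""
    segments = []
    start = None
    for i, d in enumerate(distances):
        if d < threshold:
            if start is None:
                start = i                 # a new run of conserved positions begins
        else:
            if start is not None and i - start >= min_length:
                segments.append((start, i))
            start = None                  # the run, if any, has ended
    # Close a run that extends to the end of the sequence.
    if start is not None and len(distances) - start >= min_length:
        segments.append((start, len(distances)))
    return segments
```

A run of 29 positions below the threshold is discarded, while a run of 30 is reported.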
Potential 3' splice sites are scored by a weight matrix based on the results of Senapathy et al. (37). Potential 5' splice sites are predicted either by a similar metric or by the maximal dependence decomposition method (6).
The black traces in GeneGrabber plots are the filtered distances averaged over a 3 bp window, which are calculated by the formula $D_n = (d_{n-1} + d_n + d_{n+1})/3$, where $d_i$ is the filtered distance at position $i$. The multicolored trace shows the relative differences between the filtered distances and these averages, which are calculated by the formula $(d_n - D_n)/D_n$. Positions 0 modulo 3 are red, positions 1 modulo 3 are green, and positions 2 modulo 3 are colored blue to show frameshifts.
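The two traces can be computed as in the following sketch. The handling of the first and last positions (left undefined as NaN) and of windows whose average is zero is an assumption, since the text does not describe the edge cases; the function names are hypothetical.

```python
import numpy as np

def window_average(d) -> np.ndarray:
    """Three-position window average; the two end positions are left as NaN."""
    d = np.asarray(d, dtype=float)
    averaged = np.full_like(d, np.nan)
    averaged[1:-1] = (d[:-2] + d[1:-1] + d[2:]) / 3.0
    return averaged

def relative_difference(d) -> np.ndarray:
    """Relative deviation of each filtered distance from its window average."""
    d = np.asarray(d, dtype=float)
    averaged = window_average(d)
    # Suppress warnings where the window average is zero or undefined.
    with np.errstate(divide="ignore", invalid="ignore"):
        return (d - averaged) / averaged

def frame_color(position: int) -> str:
    """Color assigned to a position according to its value modulo 3."""
    return ("red", "green", "blue")[position % 3]
```

In a coding region where every third position is less conserved, the relative differences of one color sit consistently above the other two; a frameshift shows up as a change in which color is on top.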
| RESULTS |
|---|
We aligned the syntenic sequences of the CD4 region of mouse and human (NCBI accession numbers AC002397 and U46924) (19,38) with the set of programs PickAl and COMGAP (unpublished); other genomic length alignment algorithms could also be used (39,40). We then calculated the filtered coding bias, and the filtered evolutionary distances from both the 1-D and 5-D maps. The intermediate power spectra are shown in Figure 2.
Though not utilized in subsequent analyses because of the superior quality of the 5-D results, the power spectra of the 1-D map best illustrate the stochastic noise and the two signal peaks (Fig. 2A). The relatively flat portion extending across most of the plot comes from stochastic noise. The peak at very low frequency corresponds to long alternating conserved and non-conserved structures, such as entire exons and introns as well as genic and intergenic regions. The peak at 1/(3 bp), henceforth called the triplet peak, comes from the codons of the coding regions. The first and second positions of codons tend to be conserved, since changing them will usually change the amino acids which are coded, and the third positions tend to be more divergent, since changing them usually will not change the amino acids coded or will minimally impact function. This alternating conserved-conserved-divergent pattern creates waves of period three nucleotides, resulting in the triplet peak. The small size of this peak is due to the small fraction of overall sequence which is coding; very gene-sparse data sets, such as the piebald region (24), have very small triplet peaks, while those of very gene-dense regions, such as mitochondrial genomes, rival the low frequency peak in size. Most of the 16 one-dimensional segments of the 5-D map's power spectrum show the same structure (Fig. 2B).
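The origin of the triplet peak can be reproduced on synthetic data. In the sketch below, a purely coding stretch is simulated as a 0/1 mismatch series in which the first two codon positions differ rarely and the third differs often; the mismatch probabilities (0.05 and 0.40), the sequence length and the random seed are made-up values chosen only for illustration, and the function name is hypothetical.

```python
import numpy as np

def triplet_peak_frequency(n_codons=1000, p_conserved=0.05,
                           p_wobble=0.40, seed=0) -> float:
    """Frequency (cycles per bp) of the largest peak in a simulated 1-D map."""
    rng = np.random.default_rng(seed)
    # Per-site mismatch probability: conserved, conserved, divergent, repeated.
    probs = np.tile([p_conserved, p_conserved, p_wobble], n_codons)
    mismatches = (rng.random(probs.size) < probs).astype(float)
    # Subtract the mean so the zero-frequency bin does not dominate.
    power = np.abs(np.fft.rfft(mismatches - mismatches.mean())) ** 2
    freqs = np.fft.rfftfreq(probs.size, d=1.0)
    return float(freqs[np.argmax(power)])
```

With these settings the largest peak falls at 1/3 cycles per bp, i.e. a period of three nucleotides, on top of an otherwise flat noise background.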
Using only GeneGrabber with the 5-D filtered distances, we annotated the mouse sequence. Mouse was chosen because Ansari-Lari et al. (19) originally annotated it with the assistance of the human sequence and cDNAs, while the human sequence's original annotation did not have the benefit of the mouse. To serve as a comparison, both sequences were separately analyzed using the GENSCAN web server at http://genes.mit.edu/GENSCAN.html (6). [We attempted comparisons with other programs, but technical errors or limits on the length of sequence that could be analyzed prevented these attempts; note that our methods analyze the sequences of Peterson et al. (24), which exceed 4 Mb.]
As one can see from Figure 3 and Table 1, our methods perform well in all measures of prediction accuracy, and appreciably surpass the predictions of GENSCAN. Two statistics are particularly noteworthy. First, the sensitivities and specificities for pattern filtering and GeneGrabber are close to 1.0. Secondly, there are no wrong exons, i.e. predicted exons that do not overlap part of an existing exon.
One could imagine attributing the success of pattern filtering simply to the inclusion of two sequences in the analysis. In order to test this possibility, we performed two combination analyses using GENSCAN. In the first, called GENSCAN or'd, a mouse nucleotide is considered coding if either it or the human nucleotide to which it is aligned is predicted to be coding. In GENSCAN and'd, a mouse nucleotide is considered coding if both it and the human nucleotide to which it is aligned are predicted to be coding; if a mouse nucleotide is aligned with a gap, the decision is based only on the designation of the mouse nucleotide.
GENSCAN or'd should have a greater sensitivity relative to the single-sequence analyses, since it now has two opportunities to predict a nucleotide or exon as coding. Likewise, GENSCAN and'd should have a greater specificity, since it requires a predicted coding sequence in both cases. Both of these hold true, but pattern filtering still outperforms each method in both statistics (Table 1). Therefore, the success of pattern filtering comes not just from the use of multiple sequences, but also from the noise-filtered comparison of these sequences.
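The two combination rules can be written down directly. In this sketch, per-nucleotide predictions are represented as booleans along the mouse sequence, and a mouse nucleotide aligned with a gap in human is represented by `None`; this encoding and the function name are illustrative choices, not part of the original analysis.

```python
from typing import List, Optional, Sequence

def combine_predictions(mouse: Sequence[bool],
                        human: Sequence[Optional[bool]],
                        mode: str) -> List[bool]:
    """Combine per-nucleotide coding calls for mouse with the aligned human calls.

    human[i] is None where mouse position i is aligned with a gap.
    """
    if len(mouse) != len(human):
        raise ValueError("prediction tracks must have equal length")
    if mode not in ("or", "and"):
        raise ValueError("mode must be 'or' or 'and'")
    combined = []
    for m, h in zip(mouse, human):
        if h is None:
            combined.append(m)          # gap: use the mouse call alone
        elif mode == "or":
            combined.append(m or h)     # coding if either prediction is coding
        else:
            combined.append(m and h)    # coding only if both predictions agree
    return combined
```

Because the "or" rule can only add coding calls and the "and" rule can only remove them, the first cannot lower nucleotide-level sensitivity and the second cannot add false positives relative to the mouse-only prediction.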
These statistics also surpass the published accuracies of ROSETTA, TWINSCAN and DOUBLESCAN, other programs that also exploit homology between two genomes. Direct comparisons between these are not possible since the sequences analyzed are different (25,27,28); however, note that only Korf et al. (27) also analyzed sequences containing more than one gene.
We also documented 41 conserved regions that did not fit the coding region model. We learned from the cDNA annotations that 11 of these largely overlap either the 3'- or 5'-untranslated regions, leaving 30 putative regulatory elements. Eighteen of these lie between 2.5 kb upstream and 0.5 kb downstream of a gene's transcription start site, seven others lie within introns, and five lie outside of these regions altogether. (Four of these five lie clustered within a single 2 kb region.) The distribution of these 30 putative regulatory elements suggests that most of them are indeed transcriptional or splicing regulatory elements.
The filtered distances can also reveal some sequencing errors, and the discovery of a particular error allowed us to substantially correct the previously published biological annotation. In Figure 4a, the long region between positions 39 200 and 39 660 of the mouse CD4 region strongly resembles a coding region, except that around 39 590, two codon positions cross in the conservation plot. Observing that this apparent frameshift could result from a sequencing error and that 45 out of the 50 nearest positions are G or C, making such an error very possible, we submitted the region to various NCBI databases and the Celera human database looking for near matches. We found five matches spanning the whole region without gaps, three of which are from the original papers supplying the genomic sequence and its annotation (19,38). Yet, we also found 15 mammalian matches spanning the whole region with only one gap from a G inserted at position 39 593 (Table 2). The P-value for this distribution is 0.021 calculated by the binomial distribution, implying that it is significantly more likely that the sequence with the inserted G is correct.
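The reported P-value is consistent with a one-sided binomial tail for observing 5 or fewer of 20 matches without the extra G when both variants are equally likely. The text does not state which test variant was used, so the one-sided form and the null probability of 0.5 are assumptions made here to show the arithmetic.

```python
from math import comb

def binomial_lower_tail(k: int, n: int, p: float = 0.5) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# 5 matches without the inserted G versus 15 with it, 20 matches in total.
p_value = binomial_lower_tail(5, 20)
print(round(p_value, 3))  # 0.021
```

The tail sums 21 700 of the 2^20 equally likely outcomes, which gives approximately 0.0207.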
Assuming that the missing G was a sequencing error, we inserted the G into Ansari-Lari et al.'s human and mouse sequences (19,38) and input this new alignment into pattern filtering, which leads to Figure 4b. The troublesome event around 39 590 has now gone away, and we annotate the gene as indicated by the green row of bars at the top of Figure 4c. This is the same annotation that GENSCAN produces from the human sequence (the red row of bars in the figure).
The original annotation of the sequence by Ansari-Lari et al. (19) is indicated by the purple row of bars. Their original cDNA data showed a spliced intron between nucleotides 39 671 and 40 771, the same as we predicted. We presume that the sequencing error at position 39 593 led them to conclude that this segment was part of a 5'-untranslated region, since predicting the translation start site at 39 181 would have resulted in a stop codon shortly after the 39 593 sequencing error. This same crossing of codon traces should also be observed when comparing sequences that undergo translational frameshifting (41).
Additional power in such analyses can be gained by using more than two syntenic sequences. In order to demonstrate this, we aligned the human, rat (contig NW_043769.1) and mouse CD4 segments. From this alignment, we calculated the filtered coding bias, and the filtered evolutionary distances from the 7-D map. Since mouse and rat are such evolutionarily close relatives, we did not expect an enormous change in the resulting GeneGrabber plots. However, somewhat difficult segments of the plots became significantly easier to interpret (Figure 5). We anticipate that the inclusion of a sequence from a different mammalian order would greatly enhance this approach.
| DISCUSSION |
|---|
Our analysis demonstrates that pattern filtering is an effective and accurate comparative method for the annotation and prediction of coding genes in syntenic DNA segments. In addition, pattern filtering identifies conserved non-coding sequence elements. The results are straightforward for a user to interpret, and our approach allows valuable flexibility when faced with more challenging aspects of annotation such as the detection of sequencing errors like the example above, alternative splice models, overlapping genes and difficult to detect exons which can precipitate a cascade of errors as a gene finder attempts to construct a full gene model (8).
The mathematics behind pattern filtering is well grounded in its proof, and decades of empirical experience have demonstrated the power of that mathematics; to our knowledge, it is only the second gene finding approach to use spectral analysis, and the first of these to utilize filtering or to use comparative information (42). Pattern filtering's utilization of the three-nucleotide conserved pattern within codons is nearly unique among gene finders (30). Additionally, pattern filtering effectively uses the comparative information from two or more sequences. These are the three greatest strengths of pattern filtering.
Many closely related eukaryotic genomes are presently being sequenced, or their sequencing is being planned. These include human, mouse, cow, rat, dog, cat and other upcoming vertebrate genomes, as well as multiple angiosperm and insect genomes (43). Not only will this provide a greater quantity of sequence for comparative analyses, but it should also lead to a higher quality of comparisons, since the optimal evolutionary distance differs depending on the task at hand (44). Because of this, methods such as the one described here for the analysis of syntenic segments will become increasingly important and more powerful in the annotation of genomes and the discovery of new genes and regulatory elements.
Availability
To aid in its distribution and widespread use, we are making applications, manuals and examples available at http://genomics.ucla.edu/patfilt/.
| ACKNOWLEDGEMENTS |
|---|
We would like to thank Maria Rivera, Anne Simonson and Theresa Lynn for thoughtful reading of the manuscript, and Genevieve Erwin for helpful advice on the user interface. This research was funded by grants DE-FG03-99ER62759 from the Department of Energy and DEB-9726480 from the National Science Foundation. In addition, J.M. was supported by the UCLA IGERT Bioinformatics program funded by NSF DGE9987641, USPHS National Research Service Award GM07185, and the UCLA dissertation year fellowship.
| REFERENCES |
|---|
1. Stormo,G. (2000) Gene-finding approaches for eukaryotes. Genome Res., 10, 394-397.
2. Claverie,J.-M. (1997) Computational methods for the identification of genes in vertebrate genomic sequences. Hum. Mol. Genet., 6, 1735-1744.
3. Fickett,J.W. (1996) Finding genes by computer: the state of the art. Trends Genet., 12, 316-320.
4. Haussler,D. (1998) Computational genefinding. Trends Guide Bioinformatics (Suppl.), 12-15.
5. Burge,C.B. and Karlin,S. (1998) Finding the genes in genomic DNA. Curr. Opin. Struct. Biol., 8, 346-354.
6. Burge,C.B. and Karlin,S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268, 78-94.
7. Lukashin,A.V. and Borodovsky,M. (1998) GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res., 26, 1107-1115.
8. Uberbacher,E.C. and Mural,R.J. (1991) Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc. Natl Acad. Sci. USA, 88, 11261-11265.
9. Snyder,E.E. and Stormo,G.D. (1995) Identification of protein coding regions in genomic DNA. J. Mol. Biol., 248, 1-18.
10. Gelfand,M.S., Mironov,A.A. and Pevzner,P.A. (1996) Gene recognition via spliced sequence alignment. Proc. Natl Acad. Sci. USA, 93, 9061-9066.
11. Birney,E. and Durbin,R. (2000) Using GeneWise in the Drosophila annotation experiment. Genome Res., 10, 547-548.
12. Burset,M. and Guigó,R. (1996) Evaluation of gene structure prediction programs. Genomics, 34, 353-367.
13. International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860-921.
14. Venter,C.J., Adams,M.D., Myers,E.W., Li,P.W., Mural,R.J., Sutton,G.G., Smith,H.O., Yandell,M., Evans,C.A. and Holt,R.A. (2001) The sequence of the human genome. Science, 291, 1304-1351.
15. Mouse Genome Sequencing Consortium (2002) Initial sequencing and comparative analysis of the mouse genome. Nature, 420, 520-562.
16. Miller,W. (2001) Comparison of genomic DNA sequences: solved and unsolved problems. Bioinformatics, 17, 948-949.
17. O'Brien,S.J., Menotti-Raymond,M., Murphy,W.J., Nash,W.G., Wienberg,J., Stanyon,R., Copeland,N.G., Jenkins,N.A., Womack,J.A. and Marshall Graves,J.A. (1999) The promise of comparative genomics in mammals. Science, 286, 458-481.
18. Lane,R.P., Roach,J.C., Lee,I.Y., Boysen,C., Smit,A., Trask,B.J. and Hood,L. (2002) Genomic analysis of the olfactory receptor region of the mouse and human T-cell receptor alpha/delta loci. Genome Res., 12, 81-87.
19. Ansari-Lari,M.A., Oeltjen,J.C., Schwartz,S., Zhang,Z., Muzny,D.M., Lu,J., Gorrell,J.H., Chinault,C.A., Belmont,J.W., Miller,W. and Gibbs,R.A. (1998) Comparative sequence analysis of a gene-rich cluster at human chromosome 12p13 and its syntenic region in mouse chromosome 6. Genome Res., 8, 29-40.
20. Dehal,P., Predki,P., Olsen,A.S., Kobayashi,A., Folta,P., Lucas,S., Land,M., Terry,A., Zhou,C.L.E., Rash,S., Zhang,Q., Gordon,L., Kim,J., Elkin,C., Pollard,M.J., Richardson,P., Rokhsar,D., Uberbacher,E., Hawkins,T., Branscomb,E. and Stubbs,L. (2001) Human chromosome 19 and related regions in mouse: conservative and lineage-specific evolution. Science, 293, 104-111.
21. Lane,R.P., Cutforth,T., Young,J., Athanasiou,M., Friedman,C., Rowen,L., Evans,G., Axel,R., Hood,L. and Trask,B.J. (2001) Genomic analysis of orthologous mouse and human olfactory receptor loci. Proc. Natl Acad. Sci. USA, 98, 7390-7395.
22. Jang,W., Hua,A., Spilson,S.V., Miller,W., Roe,B.A. and Meisler,M.H. (1999) Comparative sequence of human and mouse BAC clones from the mnd region of chromosome 2p13. Genome Res., 9, 815-824.
23. Mural,R.J., Adams,M.D., Meyers,E.W., Smith,H.O., Miklos,G.L.G., Wides,R., Halpern,A., Li,P.W., Sutton,G.G., Nadeau,J. et al. (2002) A comparison of whole-genome shotgun-derived mouse chromosome 16 and the human genome. Science, 296, 1661-1671.
24. Peterson,K.A., King,B.L., Hagge-Greenberg,A., Roix,J.J., Bult,C.J. and O'Brien,T.P. (2002) Functional and comparative genomic analysis of the piebald deletion region of mouse chromosome 14. Genomics, 80, 172-184.
25. Batzoglou,S., Pachter,L., Mesirov,J.P., Berger,B. and Lander,E.S. (2000) Human and mouse gene structure: comparative analysis and application to exon prediction. Genome Res., 10, 950-958.
26. Bafna,V. and Huson,D.H. (2002) Proceedings from the Eighth International Conference on Intelligent Systems for Molecular Biology. AAAI Press, pp. 3-12.
27. Korf,I., Flicek,P., Duan,D. and Brent,M.R. (2001) Integrating genomic homology into gene structure prediction. Bioinformatics, 17, 140S-148S.
28. Meyer,I.M. and Durbin,R. (2002) Comparative ab initio prediction of gene structures using pair HMMs. Bioinformatics, 18, 1309-1318.
29. Lake,J.A. (1998) Optimally recovering rate variation information from genomes and sequences: pattern filtering. Mol. Biol. Evol., 15, 1224-1231.
30. Rogozin,I.B., D'Angelo,D. and Luciano,M. (1999) Protein-coding regions prediction combining similarity searches and conservative evolutionary properties of protein-coding sequences. Gene, 226, 129-137.
31. Wiener,N. (1948) Cybernetics. John Wiley and Sons, New York.
32. Press,W.H., Flannery,B.E., Teukolsky,S.A. and Vetterling,W.T. (1986) Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, New York, NY.
33. Elliot,D.F. and Rao,K.R. (1982) Fourier Transforms and Their Physical Applications. Academic Press, New York, NY.
34. Lake,J.A. (1994) Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proc. Natl Acad. Sci. USA, 91, 1455-1459.
35. Lockhart,P.J., Steel,M.A., Hendy,M.D. and Penny,D. (1994) Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol., 11, 605-612.
36. Dunham,I., Hunt,A.R., Collins,J.E., Bruskiewich,R., Ruskiewich,D.M., Bear,D.M., Clamp,M., Smink,L.J., Ainscough,R., Almeida,J.P. et al. (1999) The DNA sequence of human chromosome 22. Nature, 402, 489-495.
37. Senapathy,P., Shapiro,M.B. and Harris,N.L. (1990) Splice junctions, branch point sites and exons: sequence statistics, identification and applications to genome project. Methods Enzymol., 183, 252-278.
38. Ansari-Lari,M.A., Shen,Y., Muzny,D.M., Lee,W. and Gibbs,R.A. (1997) Large-scale sequencing in human chromosome 12p13: experimental and computational gene structure determination. Genome Res., 7, 268-280.
39. Morgenstern,B., Rinner,O., Abdeddaïm,S., Haase,D., Mayer,K.F.X., Dress,A.W.M. and Mewes,H.-W. (2002) Exon discovery by genomic sequence alignment. Bioinformatics, 18, 777-787.
40. Jareborg,N., Birney,E. and Durbin,R. (1999) Comparative analysis of noncoding regions of 77 orthologous mouse and human gene pairs. Genome Res., 9, 815-824.
41. Baranov,P.V., Gurvich,O.L., Fayet,O., Prere,M.F., Miller,W.A., Gesteland,R.F., Atkins,J.F. and Giddings,M.C. (2001) RECODE: a database of frameshifting, bypassing and codon redefinition utilized for gene expression. Nucleic Acids Res., 29, 264-267.
42. Tiwari,S., Ramachandran,S., Bhattacharya,A., Bhattacharya,S. and Ramaswamy,R. (1997) Prediction of probable genes by Fourier analysis of genomic sequences. CABIOS, 18, 263-270.
43. Powell,K. (2002) Second round of gene sequencing goes down to the farm. Nature, 419, 237.
44. Miller,W. (2000) So many genomes, so little time. Genome Res., 18, 148-149.