Nucleic Acids Research, 2003, Vol. 31, No. 3 1067-1074
© 2003 Oxford University Press
A novel algorithm for computational identification of contaminated EST libraries
1 Compugen Ltd, 72 Pinchas Rosen Street, Tel Aviv 69512, Israel and 2 Department of Human Genetics and Molecular Medicine, Sackler School of Medicine, Tel Aviv University, Ramat Aviv 69978, Israel
*To whom correspondence should be addressed at Compugen Ltd, 72 Pinchas Rosen Street, Tel Aviv 69512, Israel. Tel: +972 3 765 8536; Fax: +972 3 765 8555; Email: rotem{at}compugen.co.il
Present address: Hershel Safer, Zetiq Technologies Ltd, PO Box 2047, Ness Ziona 70400, Israel
A key goal of the Human Genome Project was to understand the complete set of human proteins, the proteome. Since the genome sequence by itself is not sufficient for predicting new genes and alternative splicing events that lead to new proteins, expressed sequence tags (ESTs) are used as the primary tool for these purposes. The high prevalence of artifacts in dbEST, however, often leads to invalid predictions. Here we describe a novel method for recognizing genomic DNA contamination and other artifacts that cannot be identified using current EST cleaning techniques. Our method uses the alignment of the entire set of ESTs to the human genome to identify highly contaminated EST libraries. We discovered 53 highly contaminated libraries and a subset of 24 766 ESTs from these libraries that probably represent contamination with genomic DNA, pre-mRNA, and ESTs that span non-canonical introns. Although this is only a small fraction of the entire EST dataset, each contaminating sequence could create a spurious transcript prediction. Indeed, in the clustering and assembly tool that we used, these sequences would have caused incorrect inference of 9575 new splice variants and 6370 new genes. Conclusions based on EST analysis, including prediction of alternative splicing, should be re-evaluated in light of these results. Our method, along with the identified set of contaminated sequences, will be essential for applications that depend on large EST datasets.
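The library-level idea summarized above can be illustrated with a minimal sketch. It is not the published algorithm: the abstract does not give the scoring rule, so the criterion used here (an EST is "suspect" if its genomic alignment is unspliced or contains a non-canonical intron), the record fields, and both thresholds are assumptions chosen only for illustration. Unspliced ESTs can also be legitimate, for example those lying within a single exon, which is why the sketch scores whole libraries rather than individual sequences.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedEst:
    """Hypothetical summary of one EST-to-genome alignment."""
    est_id: str
    library_id: str
    n_introns: int       # introns found in the alignment
    n_noncanonical: int  # introns lacking canonical splice sites


def is_suspect(est):
    # Assumed criterion, not taken from the paper: the EST aligns
    # without any intron, or spans at least one non-canonical intron.
    return est.n_introns == 0 or est.n_noncanonical > 0


def flag_libraries(ests, min_ests=3, max_suspect_fraction=0.5):
    """Return {library_id: suspect_fraction} for libraries whose
    fraction of suspect ESTs exceeds the (assumed) threshold."""
    totals = defaultdict(int)
    suspects = defaultdict(int)
    for est in ests:
        totals[est.library_id] += 1
        if is_suspect(est):
            suspects[est.library_id] += 1

    flagged = {}
    for library_id, total in totals.items():
        fraction = suspects[library_id] / total
        # Skip very small libraries, where the fraction is unreliable.
        if total >= min_ests and fraction > max_suspect_fraction:
            flagged[library_id] = fraction
    return flagged
```

Scoring per library means that an unspliced EST from an otherwise well-spliced library is not penalized, while the same kind of EST from a library dominated by suspect alignments becomes a candidate contaminant.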