Nucleic Acids Research, 2003, Vol. 31, No. 3 1067-1074
© 2003 Oxford University Press

A novel algorithm for computational identification of contaminated EST libraries

Rotem Sorek*,1,2 and Hershel M. Safer1

1 Compugen Ltd, 72 Pinchas Rosen Street, Tel Aviv 69512, Israel and 2 Department of Human Genetics and Molecular Medicine, Sackler School of Medicine, Tel Aviv University, Ramat Aviv 69978, Israel

*To whom correspondence should be addressed at Compugen Ltd, 72 Pinchas Rosen Street, Tel Aviv 69512, Israel. Tel: +972 3 765 8536; Fax: +972 3 765 8555; Email: rotem{at}compugen.co.il
Present address: Hershel Safer, Zetiq Technologies Ltd, PO Box 2047, Ness Ziona 70400, Israel

A key goal of the Human Genome Project was to understand the complete set of human proteins, the proteome. Since the genome sequence by itself is not sufficient for predicting new genes and alternative splicing events that lead to new proteins, expressed sequence tags (ESTs) are used as the primary tool for these purposes. The high prevalence of artifacts in dbEST, however, often leads to invalid predictions. Here we describe a novel method for recognizing genomic DNA contamination and other artifacts that cannot be identified using current EST cleaning techniques. Our method uses the alignment of the entire set of ESTs to the human genome to identify highly contaminated EST libraries. We discovered 53 highly contaminated libraries and a subset of 24 766 ESTs from these libraries that probably represent contamination with genomic DNA, pre-mRNA, and ESTs that span non-canonical introns. Although this is only a small fraction of the entire EST dataset, each contaminating sequence could create a spurious transcript prediction. Indeed, in the clustering and assembly tool that we used, these sequences would have caused incorrect inference of 9575 new splice variants and 6370 new genes. Conclusions based on EST analysis, including prediction of alternative splicing, should be re-evaluated in light of these results. Our method, along with the identified set of contaminated sequences, will be essential for applications that depend on large EST datasets.
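
The abstract does not give the scoring rule used to call a library "highly contaminated", so the following is only an illustrative sketch of the general idea: once every EST is aligned to the genome, libraries can be ranked by the share of their ESTs that look like genomic DNA. Everything in it is an assumption rather than the authors' method: the record fields (`spliced`, `supported`), the definition of a "suspicious" EST as one that is unspliced and lies outside any exon supported by other transcripts, and the `min_fraction` and `min_ests` thresholds.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class AlignedEST(NamedTuple):
    """One EST after alignment to the genome (all fields are hypothetical)."""
    est_id: str
    library_id: str
    spliced: bool    # the alignment contains at least one intron
    supported: bool  # overlaps an exon supported by other transcripts


def count_by_library(ests: Iterable[AlignedEST]) -> Dict[str, Tuple[int, int]]:
    """Return {library_id: (total ESTs, suspicious ESTs)}."""
    totals: Dict[str, int] = defaultdict(int)
    suspicious: Dict[str, int] = defaultdict(int)
    for est in ests:
        totals[est.library_id] += 1
        # Assumed heuristic: unspliced and unsupported looks like genomic DNA.
        if not est.spliced and not est.supported:
            suspicious[est.library_id] += 1
    return {lib: (totals[lib], suspicious[lib]) for lib in totals}


def suspicious_fraction_by_library(ests: Iterable[AlignedEST]) -> Dict[str, float]:
    """Fraction of suspicious ESTs in each library."""
    return {lib: bad / total for lib, (total, bad) in count_by_library(ests).items()}


def flag_libraries(ests: Iterable[AlignedEST],
                   min_fraction: float = 0.5,
                   min_ests: int = 10) -> List[str]:
    """List libraries whose suspicious fraction reaches min_fraction.

    Libraries with fewer than min_ests ESTs are skipped, because a
    fraction computed from a handful of sequences is unreliable.
    """
    return sorted(
        lib
        for lib, (total, bad) in count_by_library(ests).items()
        if total >= min_ests and bad / total >= min_fraction
    )
```

In practice the two thresholds would have to be calibrated against libraries of known quality; the values above are placeholders.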