
Nucleic Acids Research, Vol 24, Issue 2 316-320, Copyright © 1996 by Oxford University Press


ARTICLES

Cleaning the GenBank Arabidopsis thaliana data set

PG Korning, SM Hebsgaard, P Rouze and S Brunak
Center for Biological Sequence Analysis, Technical University of Denmark, Lyngby, Denmark.

Data-driven computational biology relies on the large quantities of genomic data stored in international sequence data banks. However, its possibilities are drastically impaired if the stored data are unreliable. During a project aiming to predict splice sites in the dicot Arabidopsis thaliana, we extracted a data set from the A.thaliana entries in GenBank. A number of simple 'sanity' checks, based on the nature of the data, revealed an alarmingly high error rate. More than 15% of the most important entries extracted contained erroneous information. In addition, a number of entries had directly conflicting assignments of exons and introns that did not stem from alternative splicing. In a few cases the errors are mere typographical misprints, which may be corrected by comparison with the original papers, but errors caused by wrong assignments of splice sites from experimental data are the most common. It is proposed that the level of error correction should be increased and that gene structure sanity checks should be incorporated, also at the submitter level, to avoid or reduce the problem in the future. A non-redundant and error-corrected subset of the data for A.thaliana is made available through anonymous FTP.
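The abstract does not list the individual sanity checks, so the following is only an illustrative sketch of what a gene structure check of this kind might look like, not the authors' procedure. It assumes forward-strand coding exons given as 1-based inclusive coordinates, and it tests for the canonical GT...AG intron boundaries and a plausible open reading frame. The function name `check_gene_structure` and the specific set of checks are hypothetical choices made for the example.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_gene_structure(seq: str, exons: List[Tuple[int, int]]) -> List[str]:
    """Return a list of problems found; an empty list means none were found.

    Assumption: `exons` holds 1-based inclusive (start, end) pairs for the
    coding exons of one gene on the forward strand, in transcript order.
    """
    seq = seq.upper()
    problems: List[str] = []

    # Each exon must lie inside the sequence and have start <= end.
    for start, end in exons:
        if not (1 <= start <= end <= len(seq)):
            problems.append(f"exon {start}..{end} lies outside the sequence")

    # Consecutive exons must be ordered and leave room for an intron.
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        if next_start <= prev_end + 1:
            problems.append(f"no intron between {prev_end} and {next_start}")

    if problems:
        return problems  # coordinates are unusable, skip sequence checks

    # Introns are expected to start with GT and end with AG (canonical sites).
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        intron = seq[prev_end:next_start - 1]
        if not intron.startswith("GT"):
            problems.append(f"intron at {prev_end + 1} lacks GT donor")
        if not intron.endswith("AG"):
            problems.append(f"intron ending at {next_start - 1} lacks AG acceptor")

    # The spliced coding sequence should form a complete reading frame.
    cds = "".join(seq[start - 1:end] for start, end in exons)
    if not cds.startswith("ATG"):
        problems.append("CDS does not start with ATG")
    if cds[-3:] not in STOP_CODONS:
        problems.append("CDS does not end with a stop codon")
    if len(cds) % 3 != 0:
        problems.append("CDS length is not a multiple of 3")
    else:
        for i in range(0, len(cds) - 3, 3):
            if cds[i:i + 3] in STOP_CODONS:
                problems.append(f"in-frame stop codon at CDS position {i + 1}")

    return problems


# Toy sequence: exon ATGGCC, intron GTAAAG, exon GGATAA.
toy = "ATGGCCGTAAAGGGATAA"
print(check_gene_structure(toy, [(1, 6), (13, 18)]))  # prints []
```

Shifting the second exon start by one base, as in `[(1, 6), (12, 18)]`, makes the sketch report both a missing AG acceptor and a broken reading frame, which is the kind of splice site misassignment the abstract describes as most common. Real entries would also need reverse-strand handling and allowance for rare non-canonical splice sites.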




