Nucleic Acids Research, Vol 24, Issue 2 316-320, Copyright © 1996 by Oxford University Press
PG Korning, SM Hebsgaard, P Rouze and S Brunak
Data-driven computational biology relies on the large quantities of genomic
data stored in international sequence data banks. However, the
possibilities are drastically impaired if the stored data is unreliable.
During a project aiming to predict splice sites in the dicot Arabidopsis
thaliana, we extracted a data set from the A.thaliana entries in GenBank. A
number of simple 'sanity' checks, based on the nature of the data, revealed
an alarmingly high error rate. More than 15% of the most important entries
extracted contained erroneous information. In addition, a number of
entries had directly conflicting assignments of exons and introns, not
stemming from alternative splicing. In a few cases the errors are due to
mere typographical misprints, which may be corrected by comparison to the
original papers, but errors caused by wrong assignments of splice sites
from experimental data are the most common. It is proposed that the level
of error correction should be increased and that gene structure sanity
checks should be incorporated, including at the submitter level, to avoid or
reduce the problem in the future. A non-redundant and error-corrected
subset of the data for A.thaliana is made available through anonymous FTP.
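
The kind of gene structure sanity check proposed above can be sketched in Python. The abstract does not list the specific checks that were applied, so the rules below (ordered, non-overlapping exons; GT...AG intron boundaries; an in-frame coding sequence with an ATG start and a terminal stop codon) are illustrative assumptions. The function name `check_gene_structure` and the 1-based inclusive coordinate convention are likewise hypothetical.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_gene_structure(seq: str, exons: List[Tuple[int, int]]) -> List[str]:
    """Return a list of problems found in an annotated gene structure.

    Assumptions (not taken from the paper): `exons` holds the coding
    segments as 1-based inclusive (start, end) pairs on the forward strand.
    """
    seq = seq.upper()
    problems: List[str] = []

    # Exons must lie inside the sequence, in order, without overlapping.
    prev_end = 0
    for start, end in exons:
        if not (1 <= start <= end <= len(seq)):
            problems.append(f"exon {start}..{end} lies outside the sequence")
        elif start <= prev_end:
            problems.append(f"exon {start}..{end} overlaps or precedes the previous exon")
        prev_end = max(prev_end, end)
    if problems or not exons:
        return problems or ["no exons annotated"]

    # Each intron (the gap between consecutive exons) is expected to
    # start with GT (donor site) and end with AG (acceptor site).
    for (_, e1), (s2, _) in zip(exons, exons[1:]):
        intron = seq[e1:s2 - 1]
        if len(intron) < 4:
            problems.append(f"intron {e1 + 1}..{s2 - 1} is too short")
            continue
        if not intron.startswith("GT"):
            problems.append(f"intron {e1 + 1}..{s2 - 1} does not start with GT")
        if not intron.endswith("AG"):
            problems.append(f"intron {e1 + 1}..{s2 - 1} does not end with AG")

    # The spliced coding sequence should form a clean open reading frame.
    cds = "".join(seq[s - 1:e] for s, e in exons)
    if len(cds) % 3 != 0:
        problems.append("coding length is not a multiple of 3")
    else:
        if not cds.startswith("ATG"):
            problems.append("coding sequence does not start with ATG")
        if cds[-3:] not in STOP_CODONS:
            problems.append("coding sequence does not end with a stop codon")
        # Every codon except the last one must not be a stop codon.
        for i in range(0, len(cds) - 3, 3):
            if cds[i:i + 3] in STOP_CODONS:
                problems.append(f"in-frame stop codon at coding position {i + 1}")
                break
    return problems
```

Because a minority of genuine introns use non-canonical boundaries, entries flagged by a check like this are candidates for manual review against the original publication rather than for automatic rejection.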
Cleaning the GenBank Arabidopsis thaliana data set
Center for Biological Sequence Analysis, Technical University of Denmark, Lyngby, Denmark.