Nucleic Acids Research Advance Access originally published online on October 14, 2008
Nucleic Acids Research 2009 37(Database issue):D951-D953; doi:10.1093/nar/gkn650
Nucleic Acids Research, 2009, Vol. 37, Database issue D951-D953
© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

AutoSNPdb: an annotated single nucleotide polymorphism database for crop plants

Chris Duran1,2, Nikki Appleby1,2, Terry Clark1,2, David Wood1,3, Michael Imelfort1,2, Jacqueline Batley1,4 and David Edwards1,2,*

1Australian Centre for Plant Functional Genomics, Brisbane, School of Land, Crop and Food Sciences, 2The Institute for Molecular Bioscience, 3Queensland Facility for Advanced Bioinformatics, The Institute for Molecular Bioscience, ARC Centre of Excellence in Bioinformatics and 4ARC Centre of Excellence for Integrative Legume Research, University of Queensland, Brisbane, QLD 4072, Australia

*To whom correspondence should be addressed. Tel: +61 (0)7 3346 2615; Fax: +61 (0)7 3346 2101; Email: Dave.Edwards@uq.edu.au

Received August 12, 2008. Revised September 15, 2008. Accepted September 18, 2008.


    ABSTRACT
Single nucleotide polymorphisms (SNPs) may be considered the ultimate genetic marker as they represent the finest resolution of a DNA sequence (a single nucleotide), are generally abundant in populations and have a low mutation rate. Analysis of assembled EST sequence data provides a cost-effective means to identify large numbers of SNPs associated with functional genes. We have developed an integrated SNP discovery pipeline, which identifies SNPs from assembled EST sequences. The results are maintained in a custom relational database along with EST source and annotation information. The current database hosts data for the important crops rice, barley and Brassica. Users may rapidly identify polymorphic sequences of interest through BLAST sequence comparison, keyword searches of annotations derived from UniRef90 and GenBank comparisons, GO annotations or in genes corresponding to syntenic regions of reference genomes. In addition, SNPs between specific varieties may be identified for targeted mapping and association studies. SNPs are viewed using a user-friendly graphical interface. The database is freely accessible at http://autosnpdb.qfab.org.au/.


    INTRODUCTION
Molecular genetic markers describe genetic variations and provide a link between observed phenotypes and the underlying genotype. The development of high-throughput methods for the detection of single nucleotide polymorphisms (SNPs) and small insertion/deletions (indels) has led to a revolution in their use as molecular markers. SNPs may be considered the ultimate genetic marker as they represent the finest resolution of a DNA sequence, are generally abundant in populations and have a low mutation rate (1). However, SNP markers can be costly to develop, especially where re-sequencing from multiple individuals is required. The mining of readily available sequence data significantly reduces the costs associated with SNP discovery (2). The principal challenge in SNP discovery remains the discrimination between true genetic polymorphisms and the often more abundant sequence errors. Where sequence trace files are available to filter polymorphisms in traces of dubious quality, these can be used to differentiate between true SNPs and sequence error (3). Where trace files are unavailable, the identification of sequence errors can be based on two further methods to determine SNP confidence: redundancy of the polymorphism in a sequence alignment, and co-segregation of putative SNPs with haplotype. The frequency of occurrence of a polymorphism at a particular locus provides a measure of confidence that the SNP represents a true polymorphism; this is referred to as the SNP redundancy score. In addition, true SNPs that represent divergence between homologous genes co-segregate to define a conserved haplotype. A co-segregation score based on whether a SNP position contributes to defining a haplotype provides a second independent measure of SNP confidence. The SNP redundancy score and co-segregation score together provide an effective means for estimating confidence in the validity of SNPs independently of sequence trace files (4–6).
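
To make the two confidence measures concrete, the following minimal Python sketch scores columns of an assembly alignment. It is illustrative only and is not the autoSNP implementation. It assumes that each column is a list of bases with None for reads that do not cover the position, that the redundancy score is the number of reads carrying the second most common allele, and that two positions co-segregate when their alleles partition the shared reads identically. All function names and the example columns are hypothetical, and the weighting for missing data is not modelled.

```python
from collections import Counter


def redundancy_score(column):
    """Count the reads carrying the second most common allele.

    `column` holds one base per read; None marks reads that do not
    cover the position. Returns 0 when only one allele is observed.
    """
    counts = Counter(base for base in column if base is not None).most_common()
    return counts[1][1] if len(counts) > 1 else 0


def cosegregates(col_a, col_b):
    """Return True if two polymorphic positions split the reads identically.

    Only reads covering both positions are compared. The positions
    co-segregate when every allele at one position pairs with exactly
    one allele at the other, so the alleles define the same haplotypes.
    """
    pairs = {(a, b) for a, b in zip(col_a, col_b)
             if a is not None and b is not None}
    alleles_a = {a for a, _ in pairs}
    alleles_b = {b for _, b in pairs}
    return len(pairs) > 1 and len(pairs) == len(alleles_a) == len(alleles_b)


def cosegregation_score(index, columns):
    """Count the other candidate SNP columns that co-segregate with columns[index]."""
    return sum(
        cosegregates(columns[index], other)
        for i, other in enumerate(columns)
        if i != index
    )


# Hypothetical alignment columns for six reads at three positions.
snp_columns = [
    list("AAAGGG"),
    list("CCCTTT"),
    list("GTGTGT"),
]
```

In this toy input the first two positions define the same two haplotypes and so support each other, whereas the third position splits the reads differently and gains no co-segregation support.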

We have combined SNP discovery software and sequence annotation within the relational database schema of autoSNPdb to enable the efficient identification of SNP and indel polymorphisms related to specific genes or traits. Here, we present the application of autoSNPdb to barley, rice and Brassica species. AutoSNPdb has a flexible interface facilitating a variety of queries. Users may search for SNPs within genes of predicted function, and through sequence identity with known genes. In addition, it is possible to add further levels of annotation and novel queries specific to areas of interest. In the current version, we include plant cultivar information to allow the identification of SNPs that discriminate between plant cultivars.


    METHODS
Data processing
Brassica, rice and barley expressed sequences were downloaded from GenBank release 159. RepeatMasker (www.repeatmasker.org) was used to identify and mask repeats prior to assembly using CAP3 (7) with the parameters -p 90 and -o 50. The resulting assemblies and singleton sequences were parsed into a MySQL database. SNP discovery used the autoSNP method (4) implemented with custom Perl scripts, and the results were parsed to the database. Assemblies containing four sequences or more were examined for polymorphisms, with gaps created during the assembly process classified as indels. The minimum redundancy score defining a polymorphism was varied in proportion to the number of sequences in the assembly at the SNP position. A minimum redundancy score of 2 was required for up to seven sequences; 3 for between 8 and 11 sequences; 4 for between 12 and 19 sequences; and 5 for SNPs represented by 20 or more sequence reads. Each SNP was compared to all SNPs in an assembly to calculate the SNP co-segregation score, with the weighted co-segregation score calculated according to the proportion of missing data at that position in the assembly. Input sequences were annotated with cultivar type, tissue source and developmental stage where available. Consensus and singleton sequences were annotated based on sequence alignment using BLAST (8) against the GenBank and UniRef90 databases. Gene Ontology (GO) annotations were derived from UniRef90 annotations. Comparative rice and Arabidopsis genome positions were derived by WU-BLAST comparison with TIGR rice pseudo-chromosomes (version 5) and TAIR Arabidopsis pseudo-chromosomes (v01222004), respectively.
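
The depth-dependent redundancy threshold described above can be expressed as a small lookup. The sketch below is in Python rather than the Perl used by the pipeline, and it is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, a column is assumed to be a list of bases with None where a read does not cover the position, and the redundancy score is assumed to be the count of the second most common allele. A gap character, if present, is simply counted as another allele here.

```python
from collections import Counter


def min_redundancy(depth):
    """Minimum redundancy score required at a position covered by `depth` reads.

    Thresholds follow the text: 2 for up to 7 sequences, 3 for 8 to 11,
    4 for 12 to 19, and 5 for 20 or more.
    """
    if depth <= 7:
        return 2
    if depth <= 11:
        return 3
    if depth <= 19:
        return 4
    return 5


def is_candidate_snp(column):
    """Return True if the minor allele count meets the depth-dependent threshold.

    Assumption: the redundancy score is the count of the second most
    common allele among reads that cover the position.
    """
    bases = [base for base in column if base is not None]
    counts = Counter(bases).most_common()
    if len(counts) < 2:
        return False  # monomorphic position
    minor_count = counts[1][1]
    return minor_count >= min_redundancy(len(bases))
```

For example, a minor allele seen twice passes at a depth of seven reads but fails at a depth of eight, where three supporting reads are required.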

Database content, access and interface
Barley, rice and Brassica sequences were downloaded from GenBank and processed through the autoSNPdb pipeline. A custom web interface allows users to query and visualize the SNP and annotation data (Figure 1). The maintenance of these data within a relational database enables numerous query options. Sequence annotations may be searched by gene keyword, sequence ID, GO term or through similarity to defined regions of the rice or Arabidopsis genome. A BLAST interface enables identification by sequence similarity. SNPs that differentiate between cultivars may be retrieved, providing a valuable resource for genetic mapping and association studies. To aid interpretation of the predicted SNP data, SNPs are viewed graphically as vertical bars, where the position of the bar along the x-axis reflects the relative position of the SNP in the consensus sequence; the height of the bar represents the SNP redundancy score; and the bar colour reflects the weighted SNP co-segregation score. Information about each SNP is displayed by moving the cursor over the bar, while selecting a bar centres the sequence assembly at that position. The sequence assembly may be moved using the scroll bar and can be toggled between the full sequence assembly and a SNP summary. Labels to the left of the sequence may also be toggled between cultivar, GenBank accession number, tissue type and developmental stage for the respective sequences. The interface is documented with help pages and database build information.
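
One way to think about retrieving SNPs that differentiate between cultivars is sketched below in Python. This is a hedged illustration rather than the query the database actually runs: it assumes each read at a SNP position is available as a (cultivar, base) pair, treats a SNP as discriminating when both cultivars are represented and share no allele, and uses hypothetical function names and placeholder cultivar labels.

```python
def alleles_by_cultivar(reads):
    """Group the alleles observed at one SNP position by cultivar.

    `reads` is an iterable of (cultivar, base) pairs; entries with a
    missing cultivar annotation or missing base are skipped.
    """
    grouped = {}
    for cultivar, base in reads:
        if cultivar is None or base is None:
            continue
        grouped.setdefault(cultivar, set()).add(base)
    return grouped


def discriminates(reads, cultivar_a, cultivar_b):
    """Return True if the two cultivars are both observed and share no allele."""
    grouped = alleles_by_cultivar(reads)
    alleles_a = grouped.get(cultivar_a, set())
    alleles_b = grouped.get(cultivar_b, set())
    return bool(alleles_a) and bool(alleles_b) and alleles_a.isdisjoint(alleles_b)


# Placeholder reads at one SNP position; cultivar labels are invented.
example_reads = [
    ("cv_A", "A"),
    ("cv_A", "A"),
    ("cv_B", "G"),
    (None, "G"),  # read without cultivar annotation is ignored
]
```

Requiring disjoint allele sets is a deliberately strict choice: a cultivar that carries both alleles, for instance through heterogeneity or paralogous reads, is not reported as distinguishable from either.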


Figure 1. The autoSNPdb web interface displaying the sequence assembly, predicted SNPs as vertical bars and details presented in a mouse over box.

 

    FUTURE DIRECTIONS
The autoSNPdb system was developed for flexible use and can be extended to a broad range of annotations and species. We plan to extend the system to other crops, including wheat, and to next-generation Roche 454 sequence data.


    FUNDING
Funding for open access charge: the Australian Research Council.

Conflict of interest statement. None declared.


    ACKNOWLEDGEMENTS
 
Support from The National Computing Infrastructure (NCI) and the Queensland Facility for Advanced Bioinformatics (QFAB) is gratefully acknowledged. This research was supported under the Australian Research Council's Linkage Projects funding scheme.


    REFERENCES

  1. Syvanen AC. Accessing genetic variation: genotyping single nucleotide polymorphisms. Nat. Rev. Genet. (2001) 2:930–942.

  2. Taillon-Miller P, Gu ZJ, Li Q, Hillier L, Kwok PY. Overlapping genomic sequences: a treasure trove of single-nucleotide polymorphisms. Genome Res. (1998) 8:748–754.

  3. Marth GT, Korf I, Yandell MD, Yeh RT, Gu ZJ, Zakeri H, Stitziel NO, Hillier L, Kwok PY, Gish WR. A general approach to single-nucleotide polymorphism discovery. Nat. Genet. (1999) 23:452–456.

  4. Barker G, Batley J, O'Sullivan H, Edwards KJ, Edwards D. Redundancy based detection of sequence polymorphisms in expressed sequence tag data using autoSNP. Bioinformatics (2003) 19:421–422.

  5. Batley J, Barker G, O'Sullivan H, Edwards KJ, Edwards D. Mining for single nucleotide polymorphisms and insertions/deletions in maize expressed sequence tag data. Plant Physiol. (2003) 132:84–91.

  6. Savage D, Batley J, Erwin T, Logan E, Love CG, Lim GAC, Mongin E, Barker G, Spangenberg GC, Edwards D. SNPServer: a real-time SNP discovery tool. Nucleic Acids Res. (2005) 33:W493–W495.

  7. Huang XQ, Madan A. CAP3: a DNA sequence assembly program. Genome Res. (1999) 9:868–877.

  8. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol. (1990) 215:403–410.

