Nucleic Acids Research Advance Access published online on October 9, 2009
Nucleic Acids Research, doi:10.1093/nar/gkp807
© The Author(s) 2009. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
BeetleBase in 2010: revisions to provide comprehensive genomic information for Tribolium castaneum
Hee Shin Kim1,
Terence Murphy2,
Jing Xia3,
Doina Caragea1,3,
Yoonseong Park4,
Richard W. Beeman5,
Marcé D. Lorenzen5,
Stephen Butcher6,
J. Robert Manak6,7 and
Susan J. Brown1,*
1KSU Bioinformatics Center, Division of Biology, Kansas State University, Manhattan, KS 66506, 2National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, 3Department of Computer and Information Science, KSU, Manhattan, KS 66506, 4Department of Entomology, KSU, Manhattan, KS 66506, 5USDA-ARS Grain Marketing and Production Research Center, 1515 College Avenue, Manhattan KS 66502, 6Department of Biology, University of Iowa, Iowa City, IA 52242 and 7Roy J. Carver Center for Genomics, UI, Iowa City, IA 52242
*To whom correspondence should be addressed. Tel: +1 785 532 3935; Fax: +1 785 532 6653; Email: sjbrown{at}ksu.edu
Received August 14, 2009. Revised September 11, 2009. Accepted September 14, 2009.
ABSTRACT
BeetleBase (http://www.beetlebase.org) has been updated to provide more comprehensive genomic information for the red flour beetle Tribolium castaneum. The database contains genomic sequence scaffolds mapped to 10 linkage groups (genome assembly release Tcas_3.0), genetic linkage maps, the official gene set, Reference Sequences from NCBI (RefSeq), predicted gene models, ESTs and whole-genome tiling array data representing several developmental stages. The database was reconstructed using the upgraded Generic Model Organism Database (GMOD) modules. The genomic data are stored in a PostgreSQL relational database using the Chado schema and visualized as tracks in GBrowse. The updated genetic map is visualized using the comparative genetic map viewer CMAP. To enhance the database search capabilities, the BLAST and BLAT search tools have been integrated with the GMOD tools. BeetleBase serves as a long-term repository for Tribolium genomic data, and is compatible with other model organism databases.
INTRODUCTION
The red flour beetle, Tribolium castaneum, is a sophisticated genetic model organism for studies of insect development and pest biology as well as comparative genomics (1,2). A burgeoning wealth of genomic information, including a whole genome shotgun (WGS) assembly of the genome sequence, has become available in recent years (2).
BeetleBase was developed to provide a centralized database to serve the growing needs of the Tribolium research community. The first version of BeetleBase (3) was based on genome release Tcas_1.0. The database provided access to unmapped scaffolds, Fgenesh predicted genes, genetic markers, BAC-end sequences and ESTs. The database was implemented using an early version of the Chado schema, GBrowse, and CMAP developed by the Generic Model Organism Database (GMOD) project (http://www.gmod.org/).
Additional data has become available since the initial release of BeetleBase, including updates to the genome assembly and annotation, necessitating updates to the BeetleBase schema and data content. Here we present an updated version of BeetleBase (http://www.beetlebase.org), implemented using recent versions of Generic Model Organism Database (GMOD) tools, and integrated with BLAST (4) and BLAT (5) alignment tools. Data content has been expanded to include the latest version of the genome assembly (Tcas_3.0), a comprehensive collection of gene sets including the first unified release of the Tribolium Official Gene Set (OGS), EST alignments to the genome, and whole-genome transcriptional information from DNA tiling array experiments. These updates will provide a valuable resource for the Tribolium community, and serve as a foundation for integration of future datasets.
DATA ACQUISITION
Genome sequence
The initial assembly of the Tribolium genome (Tcas_1.0) was composed of 1107 unmapped genomic sequence scaffolds. For the second version of the assembly (Tcas_2.0), 70% of the genome sequence was mapped to 10 linkage groups corresponding to nine autosomes and the X chromosome (2,6). Subsequently, an additional 42 of the largest unmapped scaffolds have been integrated into the genetic linkage map, and used by the Baylor College of Medicine to create the third version of the assembly (Tcas_3.0).
The Tribolium genome sequence is assembled into 9686 contigs of 156 Mb in combined length. When these are assembled into scaffolds containing captured gaps of estimated length, the genome assembly is 160 Mb. Scaffolds and contigs containing more than 90% of the sequenced genome have been assembled into 10 chromosome builds. The chromosome build statistics for Tcas_3.0 are summarized in Table 1. The GenBank accession numbers are AAJJ01000001–AAJJ01009708 for the 9686 contigs (22 of which have been suppressed since their original submission), CM000276–CM000285 for the ten chromosome builds, DS497665–DS497969 for the 305 unmapped scaffolds and GG694051–GG695897 for the 1848 unmapped single contigs.
The genetic map has also been updated to include the additional
42 markers used to anchor scaffolds in the Tcas_3.0 assembly.
Gene models
Several gene prediction programs were previously used to annotate the Tcas_2.0 assembly, and were combined into a consensus GLEAN gene set (2,7). More than 2000 genes were manually curated by members of the Tribolium Genome Sequencing Consortium (2); however, new and updated gene models were not combined into a unified gene set. We generated the first Tribolium Official Gene Set (OGS) by merging the GLEAN gene set with the manually curated gene annotations. First, each manually curated gene was mapped to the Tcas_3.0 assembly, and the validity of the mRNA, CDS, and/or peptide sequence for each of the manually curated genes was checked to ensure that the peptide sequence represented the same exons/splice sites as the mRNA and CDS sequences. Incorrectly annotated exons were replaced with exon coordinates determined by Splign (8). In some cases, genes required manual curation to determine the correct or most likely gene structure. A non-redundant Official Gene Set was constructed by merging the GLEAN and manually curated gene sets, automatically replacing GLEAN models with overlapping, manually curated models. Finally, unique identifiers were assigned to each gene to facilitate communication of Tribolium genes in research publications. A total of 16 561 official genes were generated, assigned identifiers such as TC######, and submitted to NCBI (Table 2).
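The merging rule described above (a manually curated model replaces any GLEAN model it overlaps) can be illustrated with a minimal Python sketch. It is not the pipeline used to build the OGS: the `GeneModel` fields, the identifiers and the requirement that overlapping models lie on the same strand are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneModel:
    gene_id: str  # hypothetical identifier, not a real BeetleBase ID
    seqid: str    # chromosome build or scaffold name
    start: int    # 1-based inclusive start
    end: int      # 1-based inclusive end
    strand: str   # "+" or "-"


def overlaps(a: GeneModel, b: GeneModel) -> bool:
    """True when two models share at least one base on the same strand.

    Requiring the same strand is an assumption of this sketch.
    """
    return (a.seqid == b.seqid and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def merge_gene_sets(glean, curated):
    """Keep every curated model; keep a GLEAN model only if no curated model overlaps it."""
    kept = [g for g in glean if not any(overlaps(g, c) for c in curated)]
    return sorted(list(curated) + kept, key=lambda m: (m.seqid, m.start, m.end))
```

For example, a curated model spanning 450–800 would displace a GLEAN model spanning 100–500 on the same sequence and strand, while a GLEAN model at 900–1500 would be retained unchanged.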
Since some researchers may benefit from access to other gene sets, we migrated the AUGUSTUS, Ensembl, Fgenesh, NCBI supported and ab initio gene models, and the combined GLEAN genes from the Genboree database hosted at Baylor College of Medicine (http://www.genboree.org), and converted their coordinates into the genome coordinates of Tcas_3.0 (Table 2). Only a few of the predicted genes were lost when the scaffolds were reorganized. In addition, the latest RefSeq annotation from NCBI (build 2.1, based on the same Tcas_3.0 assembly), which includes 3613 protein accessions that are new or changed from the original annotation run used for the GLEAN set, was imported into BeetleBase. Each gene set is available for viewing in a separate track in GBrowse (Figure 1).
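The text does not describe how coordinates were converted, so the following Python sketch shows only one plausible approach: shifting a scaffold-based interval by the scaffold's placement in a chromosome build, and flipping it when the scaffold is placed in reverse orientation. The `Placement` record and all names in it are hypothetical.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    chrom: str        # chromosome build name (hypothetical)
    chrom_start: int  # 1-based chromosome position of the scaffold's first base
    length: int       # scaffold length in bases
    orientation: str  # "+" or "-" relative to the chromosome build


def lift(placement, start, end, strand):
    """Map a 1-based inclusive scaffold interval onto the chromosome build."""
    if not (1 <= start <= end <= placement.length):
        raise ValueError("interval falls outside the scaffold")
    if placement.orientation == "+":
        new_start = placement.chrom_start + start - 1
        new_end = placement.chrom_start + end - 1
        new_strand = strand
    else:
        # Reverse-complemented scaffold: positions count back from its far end.
        chrom_end = placement.chrom_start + placement.length - 1
        new_start = chrom_end - end + 1
        new_end = chrom_end - start + 1
        new_strand = "-" if strand == "+" else "+"
    return placement.chrom, new_start, new_end, new_strand
```

In this sketch, a model that no longer fits within a single scaffold raises an error rather than being silently truncated, which is one way a small number of models could drop out during a reorganization.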

Figure 1. Tribolium gene models. Various gene models are shown in different tracks for easy comparison. By clicking one of the gene models, detailed information can be retrieved. The RefSeq gene models are linked to NCBI Entrez Gene report pages. The Tribolium official gene models and manually-curated gene models link to the BeetleBase gene report pages. Other gene models link to detailed pages provided by GBrowse.
BeetleBase provides a detailed gene report page for each gene in the OGS, which can be accessed and managed through the web interface (Figure 2). The new BeetleBase database provides more comprehensive gene information than was previously available, including detailed information on the gene structure, nomenclature, and additional annotation data provided during the manual curation process. Links are also provided to overlapping RefSeq Gene and transcript records at NCBI (9,10) to facilitate use of data from both databases.

Figure 2. Official gene report page. The report page provides detailed information for an official gene. The information can be modified by clicking Edit after obtaining a password from the BeetleBase webmaster. The report page links to GBrowse by clicking GBrowse. The corresponding region will be highlighted in GBrowse.
EST and BAC-end sequences
EST and BAC-end sequences were downloaded from the dbEST and GSS databases, respectively, at NCBI. A total of 55 616 ESTs from five different tissue- and stage-specific cDNA libraries (11) were cleaned, trimmed of polyA sequences using in-house software tools (http://bioinformatics.ksu.edu/ArthropodEST), and aligned to the genome using the Exonerate algorithm (12). A total of 50 277 EST-genome alignments were generated. BAC-end sequences from the Tribolium BAC library (constructed by Exelixis, Inc, South San Francisco, CA and archived for distribution by the Clemson University Genomics Institute, https://www.genome.clemson.edu/) were mapped to the genome and added to the database. Out of 28 788 BAC-end sequences, 27 810 were aligned to the genome using BLAST.
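The in-house cleaning tools are not described here, so the Python sketch below is only a generic illustration of polyA removal: it strips a trailing run of A bases from an EST before alignment. The function name and the minimum run length of 10 are arbitrary assumptions, not parameters taken from the BeetleBase pipeline.

```python
import re


def trim_polya(seq: str, min_run: int = 10) -> str:
    """Remove a trailing run of at least `min_run` A bases (case-insensitive).

    The threshold is an illustrative assumption; shorter A runs are left
    in place because they may be genuine transcript sequence.
    """
    pattern = re.compile(rf"A{{{min_run},}}$", re.IGNORECASE)
    return pattern.sub("", seq)
```

A real cleaning step would usually also handle polyT heads from reverse-strand reads and vector contamination, which this sketch deliberately omits.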
Tiling arrays
Using Roche NimbleGen HD2 whole genome DNA tiling arrays, whole genome expression data have been collected for several developmental stages including three embryonic, the last larval, three pupal and two adult stages. Briefly, fluorescently labeled cDNAs were hybridized to the custom-designed microarrays; GFF files were constructed using the signal intensities from each feature on the array that were quantified directly from the scanned array images without any data management such as background subtraction. This was done to provide immediate access to the data to help verify computed gene models and assist manual annotation efforts while the data are processed and further analyzed. Figure 3 shows empirically derived transcriptome tiling array data generated from several developmental stages for a portion of the Tribolium genome.
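As a rough illustration of how unprocessed probe intensities might be written as GFF3 records for display as a GBrowse track, consider the Python sketch below. The nine tab-separated columns follow the GFF3 format, but the source label, feature type, attribute and all example values are placeholders rather than the ones used for the BeetleBase tracks.

```python
def probe_to_gff3(seqid, start, length, intensity, stage):
    """Format one array probe as a GFF3 line, with raw intensity in the score column.

    `start` is the 1-based position of the probe's first base. The intensity
    is written as measured, with no background subtraction applied.
    """
    end = start + length - 1
    columns = [
        seqid,
        "tiling_array",      # source: placeholder label
        "microarray_oligo",  # type: assumed feature type for a probe
        str(start),
        str(end),
        f"{intensity:.2f}",  # score: raw signal intensity
        ".",                 # strand: not applicable in this sketch
        ".",                 # phase: only meaningful for CDS features
        f"Name={stage}",     # attributes: developmental stage label
    ]
    return "\t".join(columns)
```

Writing one such file per developmental stage would let each stage be loaded and displayed as its own track.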

Figure 3. Tribolium genome tiling arrays. Tracks for 11 developmental time points are shown. Blue vertical bars of each track represent fluorescence intensity values for oligonucleotide probes tiled across a 54 kb region on the array. The official gene mRNA track (above time point tracks) contains purple gene structures. Note the different expression patterns of the annotated genes across development.
FTP site
Datasets that can be downloaded from BeetleBase (ftp://bioinformatics.ksu.edu/pub/BeetleBase/3.0/) include the contig sequences, GFF (.gff3) and assembly files (.agp) for Tcas_3.0. Sequences of GLEAN genes, GLEAN cDNAs and GLEAN peptides as well as the corresponding files for the OGS are also available.
IMPLEMENTATION
All the genomic information was compiled into Generic Feature Format Version 3 (GFF3), which is the most common extension of GFF. The database was then implemented from the compiled information using GMOD tools (http://www.gmod.org) such as PostgreSQL-based Chado 1.0, GBrowse 1.68, and CMAP 1.0. To query sequences against the Tribolium genome, we also installed standalone BLAST and BLAT servers. In this release of BeetleBase, we improved the usability by integrating these components.
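For readers working with the downloadable .gff3 files, a GFF3 feature line has nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes). The Python sketch below parses one such line into a dictionary; it is a minimal reader for illustration, not part of the GMOD loaders, and the example identifier used with it is hypothetical.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dictionary of typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, found {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields

    # Column 9 holds semicolon-separated key=value pairs with URL-style escapes.
    attributes = {}
    if attrs != ".":
        for pair in attrs.split(";"):
            if pair:
                key, _, value = pair.partition("=")
                attributes[unquote(key)] = unquote(value)

    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),  # 1-based inclusive
        "end": int(end),      # 1-based inclusive
        "score": None if score == "." else float(score),
        "strand": strand,
        "phase": None if phase == "." else int(phase),
        "attributes": attributes,
    }
```

Comment lines beginning with `#` and any embedded FASTA section would need to be skipped before calling this function.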
FUNDING
Grant number P20 RR016475 from the National Center for Research
Resources (NCRR), a component of the National Institutes of
Health. Work at NCBI was supported by the Intramural Research
Program of the NIH, National Library of Medicine. Funding for
open access charge: National Center for Research Resources National
Institutes of Health (P20 RR016475).
Conflict of interest statement. None declared.
Footnotes
Present address: Marcé D. Lorenzen, Department of Entomology,
North Carolina State University, Raleigh, NC 27695, USA
REFERENCES
1. Roth S, Hartenstein V. Development of Tribolium castaneum. Dev. Genes Evol. (2008) 218:115–118.
2. Tribolium Genome Sequencing Consortium. The genome of the model beetle and pest Tribolium castaneum. Nature (2008) 452:949–955.
3. Wang LJ, Wang S, Li Y, Paradesi MS, Brown SJ. BeetleBase: the model organism database for Tribolium castaneum. Nucleic Acids Res. (2007) 35:D476–D479.
4. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997) 25:3389–3402.
5. Kent WJ. BLAT-The BLAST-like alignment tool. Genome Res. (2002) 12:656–664.
6. Lorenzen MD, Doyungan Z, Savard J, Snow K, Crumly LR, Shippy TD, Stuart JJ, Brown SJ, Beeman RW. Genetic linkage maps of the red flour beetle, Tribolium castaneum, based on bacterial artificial chromosomes and expressed sequence tags. Genetics (2005) 170:741–747.
7. Elsik CG, Worley KC, Zhang L, Milshina NV, Jiang H, Reese JT, Childs KL, Venkatraman A, Dickens CM, Weinstock GM, et al. Community annotation: Procedures, protocols, and supporting tools. Genome Res. (2006) 16:1329–1333.
8. Kapustin Y, Souvorov A, Tatusova T, Lipman D. Splign: algorithms for computing spliced alignments with identification of paralogs. Biol. Direct (2008) 3:20.
9. Pruitt KD, Tatusova T, Maglott DR. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. (2007) 35:D61–D65.
10. Maglott DR, Ostell J, Pruitt KD, Tatusova T. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. (2007) 35:D26–D31.
11. Park Y, Aikins J, Wang LJ, Beeman RW, Oppert B, Lord JC, Brown SJ, Lorenzen MD, Richards S, et al. Analysis of transcriptome data in the red flour beetle, Tribolium castaneum. Insect Biochem. Mol. Biol. (2008) 38:380–386.
12. Slater GS, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics (2005) 6:31.
