Nucleic Acids Research Advance Access published online on November 1, 2009
Nucleic Acids Research, doi:10.1093/nar/gkp910
© The Author(s) 2009. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
FlyTF: improved annotation and enhanced functionality of the Drosophila transcription factor database
Ulrike Pfreundt1,2,
Daniel P. James1,
Susan Tweedie2,
Derek Wilson3,
Sarah A. Teichmann3 and
Boris Adryan1,2,*
1Cambridge Systems Biology Centre, University of Cambridge, Tennis Court Road, Cambridge CB2 1QR, 2Department of Genetics, University of Cambridge CB2 3EH and 3Laboratory of Molecular Biology, Hills Road, Cambridge CB2 0QH, UK
*To whom correspondence should be addressed. Tel: +44 1223 760209; Fax: +44 1223 760241; Email: b.adryan{at}gen.cam.ac.uk
Received August 14, 2009. Revised September 21, 2009. Accepted October 7, 2009.
ABSTRACT
FlyTF (http://www.flytf.org) is a database of computationally predicted and/or experimentally verified site-specific transcription factors (TFs) in the fruit fly Drosophila melanogaster. The manual classification of TFs in the initial version of FlyTF, which concentrated primarily on the DNA-binding characteristics of the proteins, has now been extended to a more fine-grained annotation of both DNA-binding and regulatory properties in the new release. Furthermore, experimental evidence from the literature was classified into a defined vocabulary and, in collaboration with FlyBase, translated into Gene Ontology (GO) annotation. While our GO annotations will also be available through FlyBase, as they will be incorporated into the genes' official GO annotation in the future, the entire evidence used for classification, including computational predictions and quotes from the literature, can be accessed through FlyTF. The FlyTF website now builds upon the InterMine framework, which provides experimental and computational biologists with powerful search and filter functionality, list management tools and access to genomic information associated with the TFs.
INTRODUCTION
Site-specific transcription factors (TFs) are proteins that bind to specific DNA sequences or DNA conformations, and that confer regulatory information to the basal transcription machinery. While they play a key role in gene regulation in general, TFs are of special interest to developmental biologists, as their presence at cis-regulatory elements in the genome determines important developmental decisions in processes such as axis formation and morphogenesis (1). It may therefore seem surprising that, almost a decade after the availability of the Drosophila melanogaster genome (2), there is still no definitive answer as to the number of site-specific TFs, let alone a comprehensive list of TFs from an authoritative community resource like FlyBase (3).
FlyTF (http://www.flytf.org) has stepped in to fill this gap by integrating both computationally predicted and experimentally verified TFs. The first version of FlyTF (4) provided information about the curation of 1052 candidate TFs [selected for the presence of a canonical DNA-binding domain using the pipeline of the DBD transcription factor database (5) or a set of suitable Gene Ontology terms (6)], and yielded a repertoire of 753 site-specific fly TFs, about two-thirds of which were called with a high degree of confidence. The website has had 4000 visitors since publication, with the majority of users bulk-downloading our annotations.
IMPROVED ANNOTATIONS
The initial release of FlyTF was based on D. melanogaster release 3.1 gene annotations, and manual curation was based on GO annotations and literature published by December 2005. The candidate proteins were primarily assessed for their capability to bind to DNA (yes/maybe/no) and confer a regulatory function. While a strict set of rules was applied for the DNA-binding property, all regulatory proteins, ranging from canonical site-specific TFs to insulators and those involved in chromatin-mediated maintenance of transcription, were treated alike. This was identified as a major limitation in computational studies that focussed on classical TFs. Furthermore, gene annotations in D. melanogaster are currently in their 5.19 release, meaning that many novel or modified gene models were not present in the initial dataset.
We have addressed these shortcomings in the current release of FlyTF. First, we generated a novel candidate gene list by incorporating the initial FlyTF gene set, DBD searches on the FlyBase release 5.8 gene annotations (all translations), and GO searches with a set of TF-related GO terms. This yielded a non-redundant set of 1162 candidate TFs. Two human curators (one general curator at FlyTF, one GO curator at FlyBase) assessed this list, taking all experimental evidence published by December 2008 into account.
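The merging step described above amounts to taking the union of three identifier lists and discarding duplicates. The following is a minimal sketch of that idea only; the gene identifiers and the `merge_candidates` helper are hypothetical placeholders and do not reflect the actual FlyTF pipeline or data.

```python
# Toy illustration: merge three candidate sources into one non-redundant set.
# All identifiers below are made-up placeholders, not real FlyTF content.

def merge_candidates(*sources):
    """Return a sorted, non-redundant list of candidate gene identifiers."""
    merged = set()
    for source in sources:
        # Strip stray whitespace so trivially different spellings collapse.
        merged.update(identifier.strip() for identifier in source)
    return sorted(merged)


initial_flytf = ["gene_A", "gene_B", "gene_C"]  # initial FlyTF gene set
dbd_hits = ["gene_B", "gene_D"]                 # hits from a DBD-style domain search
go_hits = ["gene_C ", "gene_E"]                 # hits from TF-related GO terms

candidates = merge_candidates(initial_flytf, dbd_hits, go_hits)
print(candidates)
```

In practice, identifiers from different annotation releases would first need to be mapped to a common release before such a union is meaningful.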
Each candidate TF was characterised for both its DNA-binding and its regulatory characteristics. A verdict for DNA-binding can now be
- yes (clear evidence for sequence-specific binding),
- yes (homolog) (property experimentally shown for a homolog),
- yes (DNA binding, no sequence-specificity determined),
- yes (heterodimer) (if the factor alone is not capable of binding DNA),
- maybe (none or no convincing evidence found) and
- no (experimental evidence against DNA-binding).
As in the previous version, where available, quotations from the literature were extracted along with an associated PubMed ID. To allow users a more fine-grained selection of evidence, experiments regarding the DNA-binding characteristics of the proteins were categorised into eight different groups of varying quality, each of which can now be queried or filtered for at FlyTF (Table 1). While a DNA-binding protein in the original version automatically became a bona fide TF if the DBD pipeline identified a domain frequently found in TFs, we now provide a more detailed categorisation of the regulatory property of the candidate protein. A verdict for this can be
- yes (a true site-specific TF),
- yes (heterodimer) (as before, but only as a heterodimer),
- maybe (if a canonical DNA-binding domain was found, but no experimental evidence) or
- no (not a site-specific TF).
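The two controlled vocabularies above can be pictured as a pair of enumerations attached to each candidate gene. The sketch below is one possible representation, not the FlyTF data model: the class names, member names, the `select_by_regulatory` helper and the example genes are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class DnaBinding(Enum):
    # Descriptions follow the verdict list in the text; member names are hypothetical.
    YES = "clear evidence for sequence-specific binding"
    YES_HOMOLOG = "property experimentally shown for a homolog"
    YES_UNSPECIFIC = "DNA binding, no sequence-specificity determined"
    YES_HETERODIMER = "factor alone is not capable of binding DNA"
    MAYBE = "none or no convincing evidence found"
    NO = "experimental evidence against DNA-binding"


class Regulatory(Enum):
    YES = "a true site-specific TF"
    YES_HETERODIMER = "site-specific TF, but only as a heterodimer"
    MAYBE = "canonical DNA-binding domain found, but no experimental evidence"
    NO = "not a site-specific TF"


@dataclass(frozen=True)
class Annotation:
    gene: str  # placeholder identifier, not a real gene
    dna_binding: DnaBinding
    regulatory: Regulatory


def select_by_regulatory(annotations, accepted):
    """Return gene names whose regulatory verdict is in `accepted`."""
    return [a.gene for a in annotations if a.regulatory in accepted]


annotations = [
    Annotation("gene_A", DnaBinding.YES, Regulatory.YES),
    Annotation("gene_B", DnaBinding.YES_HETERODIMER, Regulatory.YES_HETERODIMER),
    Annotation("gene_C", DnaBinding.MAYBE, Regulatory.MAYBE),
    Annotation("gene_D", DnaBinding.NO, Regulatory.NO),
]

confirmed = select_by_regulatory(
    annotations, {Regulatory.YES, Regulatory.YES_HETERODIMER}
)
print(confirmed)
```

Keeping the two verdicts as separate fields mirrors the distinction drawn in the text: a protein can bind DNA without being classified as a site-specific TF.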
Table 1. Experimental procedures accepted to confirm DNA-binding property of candidate proteins in FlyTF, and GO terms assigned on their basis (as IDA)
The maybe and no categories are frequently associated with free text, describing further characteristics where the information was easily accessible. Useful information in this context could be, for example, chromatin-remodelling, TBP-associated factor (TAF), inhibitor or insulator. This verdict is supported by quotations from the literature as well as a discrete categorisation of the experimental evidence, which can be used for user-defined queries (Table 2).
Table 2. Experimental procedures accepted to confirm transcriptional regulatory property of candidate proteins, and evidence codes in support of GO terms dealing with regulatory function
Ultimately, in collaboration with FlyBase, any supporting experimental data was translated into GO annotation, combining the expertise of the FlyTF and FlyBase curators (the rules for the translation of experimental evidence into GO terms can be found in Tables 1 and 2). At the same time, each candidate TF received a final score based on its DNA-binding domain, and the experimental evidence found for both DNA-binding and transcriptional regulatory function (Table 3).
Table 3. FlyTF score based on computational predictions (DBD) and novel GO annotation (based on experimental data)
ENHANCED FUNCTIONALITY AND ACCESSIBILITY
The initial FlyTF website was a collection of static HTML pages and a few dynamically generated lists. A search tool to find individual genes or all TFs with a certain DNA-binding domain was the only means of user interaction. However, most visitors chose to download our annotations in bulk. We suspect this is because traditional Drosophila geneticists often prefer to retrieve information about their favourite gene directly from FlyBase, the authoritative community resource. Also, researchers utilising genomics or computational methods are likely to query large batches of identifiers, and their analysis is often based on further list operations, neither of which was catered for by FlyTF.
We assessed a variety of options to allow non-specialist users easy access to our annotations and at the same time provide computational biologists with some basic analysis tools. The FlyTF database is now based on the InterMine framework (http://www.intermine.org), the backbone of biological data warehouses such as FlyMine (7) or modMine (8). This now enables different usage scenarios, which we will illustrate below.
The simplest scenario is the search for a single gene of interest. The query form accepts any identifier for a given TF (gene name, symbol, unique ID or even rarely used synonyms) and displays general gene information as well as our transcription factor annotations (Figure 1A).
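Accepting "any identifier" for a gene is typically handled by indexing every known name against a primary ID. The following sketch shows that idea in isolation; it is not how InterMine or FlyTF implement the lookup, and the record layout, the IDs (`ID0001`, `ID0002`) and the synonym strings are hypothetical.

```python
def build_index(records):
    """Map every known identifier string to the set of primary IDs it names."""
    index = {}
    for record in records:
        keys = [record["id"], record["symbol"], record["name"], *record["synonyms"]]
        for key in keys:
            index.setdefault(key, set()).add(record["id"])
    return index


def resolve(index, query):
    """Return the sorted primary IDs matching a query (empty list if unknown)."""
    return sorted(index.get(query.strip(), set()))


# Placeholder records; IDs and synonyms are invented for this example.
records = [
    {"id": "ID0001", "symbol": "hb", "name": "hunchback", "synonyms": ["old-name-1"]},
    {"id": "ID0002", "symbol": "gX", "name": "gene X", "synonyms": ["old-name-1"]},
]
index = build_index(records)

print(resolve(index, "hunchback"))
print(resolve(index, "old-name-1"))
```

Matching is kept case-sensitive in this sketch because Drosophila gene symbols can differ only in case (for example h and H), and each key maps to a set so that an ambiguous synonym returns all candidates rather than silently picking one.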

Figure 1. Screenshots from the FlyTF web site. (A) Transcription factor summary information for gene hunchback. The left panel provides basic gene information and serves as a starting point for the retrieval of DNA or protein sequences. The right panel focuses on transcription factor annotation and is divided into three main sections: our general verdict, and two sections providing details on the DNA-binding and regulatory capabilities (and the associated experimental evidence thereof). Further, there is a direct link to the appropriate REDfly page, detailing transcriptional regulatory relationships for TFs where they are known. (B) An exemplary widget for a list of transcription factors. Here, the enrichment of PFAM domain assignments for proteins of the genes in the list is shown in comparison to the rest of the genome. In the example there is a clear over-representation of the Homeobox domain. (C) The entire data model behind FlyTF is accessible through the QueryBuilder, allowing the definition of complicated filters for the retrieval of TF subsets. The displayed example was chosen for its relative complexity and may not be trivial for new users to set up. However, building a query in QueryBuilder is without doubt easier than issuing the respective SQL command in a database.
A novel feature is of special interest for users with a genome-wide perspective: it is possible to upload extensive gene lists, from which the genes encoding TFs will be recognised and marked, and can be saved for further analysis on the website. This enables, for example, the one-step identification and characterisation of TFs contained in candidate gene lists from genomics experiments. Analysis tools available at the FlyTF website comprise widgets to report GO term enrichment or over-representation of certain structural domains (Figure 1B). It is noteworthy that some of these statistics are calculated in a transcription factor background, which may be helpful in the determination of differences between sets of TFs (rather than comparing TFs against the entire genome). Users can also choose to register at the FlyTF website, and store and compare their TF lists at a later stage.
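To illustrate why the choice of background matters, the sketch below computes an over-representation P-value with a one-sided hypergeometric test, once against a genome-sized background and once against a TF-only background. The text does not state which statistical test the widgets use, so the hypergeometric test is an assumption here, and all counts are invented toy numbers.

```python
from math import comb


def hypergeom_enrichment(k, n, K, N):
    """One-sided hypergeometric P-value, P(X >= k).

    k: genes in the list carrying the feature (e.g. a given domain)
    n: size of the list
    K: genes in the background carrying the feature
    N: size of the background
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Toy numbers only: a list of 20 TFs, 5 of which carry the domain of interest.
k, n = 5, 20

# Genome background: the domain is rare among all genes.
p_genome = hypergeom_enrichment(k, n, K=50, N=1000)

# TF background: the same domain is common among TFs.
p_tf = hypergeom_enrichment(k, n, K=25, N=100)

print(f"genome background: {p_genome:.3g}")
print(f"TF background:     {p_tf:.3g}")
```

With these toy counts, the domain looks enriched against the genome but not against other TFs, which is the kind of difference a TF-specific background is meant to expose. A real analysis would also correct for multiple testing across all domains or GO terms examined.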
A third usage scenario addresses the needs of computational biologists. Lists of TFs fulfilling specific criteria can easily be created using the FlyTF QueryBuilder (Figure 1C), and customisable output formats allow the swift integration of FlyTF in many bioinformatics workflows. For example, it is possible to search for all TFs that (i) contain zinc finger domains, (ii) for which a position weight matrix is known and (iii) whose transcriptional regulatory function was shown in a reporter assay in the fly. In this case, only one gene (hunchback) fulfils these criteria. The gene's genomic coordinates can be exported in GFF3 format and the translations are available in FASTA format. It should be mentioned that through the customisable output generator, it is possible to export the entire FlyTF dataset as one tab-delimited file.
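The three-part filter in this example can be mimicked offline on a tab-delimited export. The sketch below applies the same three conditions to a small in-memory table and writes the hits back out as tab-delimited text. It does not use the QueryBuilder or any InterMine API; the field names (`domains`, `has_pwm`, `regulatory_evidence`) and the two placeholder genes are assumptions, and only the hunchback result is taken from the text.

```python
import csv
import io


def select_tfs(records):
    """Keep records meeting all three criteria from the example query."""
    return [
        r for r in records
        if "zinc finger" in r["domains"]                  # (i) zinc finger domain
        and r["has_pwm"]                                  # (ii) position weight matrix known
        and "reporter assay" in r["regulatory_evidence"]  # (iii) reporter assay evidence
    ]


def to_tsv(records, columns):
    """Serialise the chosen columns as tab-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for r in records:
        writer.writerow([r[c] for c in columns])
    return buffer.getvalue()


# Hypothetical records; gene_X and gene_Y are placeholders, not real genes.
records = [
    {"gene": "hunchback", "domains": ["zinc finger"], "has_pwm": True,
     "regulatory_evidence": ["reporter assay"]},
    {"gene": "gene_X", "domains": ["zinc finger"], "has_pwm": False,
     "regulatory_evidence": ["reporter assay"]},
    {"gene": "gene_Y", "domains": ["homeobox"], "has_pwm": True,
     "regulatory_evidence": []},
]

hits = select_tfs(records)
print(to_tsv(hits, ["gene", "has_pwm"]))
```

Each condition is a separate clause, so relaxing one criterion (for instance, dropping the reporter assay requirement) is a one-line change.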
FUTURE DIRECTIONS
The comparative sequencing and genome annotation of closely related Drosophila species (9) has provided the community with the gene repertoires of a dozen flies. Experimental data for individual genes of these non-D. melanogaster flies is still sparse, yet researchers interested in their TFs can use FlyTF as a starting point to identify homologous proteins using the built-in orthology mapping.
The next generation of InterMine-based databases will enable researchers to share gene lists and analysis tools across species and data mines, and we look forward to assisting TF researchers in other model organisms with our dataset.
COLLABORATION BETWEEN TWO COMMUNITY RESOURCES
FlyTF and FlyBase both deal with the functional annotation of fly genes, and have pooled resources for this work. While FlyTF focuses on manual curation and only on TF genes, FlyBase is the community resource for all things Drosophila. Although the information content of each database is distinct, both use GO terms for functional annotation and a key aim of this project was to improve GO annotation consistency between these databases, based on both computational predictions and experimental evidence using the combined expertise of the TF specialists at FlyTF and the FlyBase GO curator. We believe our collaboration can be a model for many niche databases that are maintained on a sporadic basis, which can benefit from both the experience and the resources of an established community portal.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
FUNDING
FlyBase grant (National Human Genome Research Institute at the US NIH P41 HG000739 to FlyBase); EPSRC MPhil studentship (to D.P.J.); Medical Research Council (to S.T. and D.W.); Royal Society University Research Fellowship (to B.A.). Funding for open access charge: The Royal Society.
Conflict of interest statement. None declared.
ACKNOWLEDGEMENTS
The authors thank Nick Brown and Steven Marygold at FlyBase for enabling U.P. to participate in this collaboration and for comments on the manuscript. They also wish to thank Goran Nenadic and Casey Bergman for the provision of computationally marked-up TF literature (10,11), and Richard Smith, Julie Sullivan and Gos Micklem for helpful comments and technical assistance in the customization of the InterMine system.
Footnotes
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
REFERENCES
1. Levine M, Davidson E. Gene regulatory networks for development. Proc. Natl Acad. Sci. USA (2005) 102:4936–4942.
2. Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, Scherer SE, Li PW, Hoskins RA, Galle RF, et al. The genome sequence of Drosophila melanogaster. Science (2000) 287:2185–2195.
3. Tweedie S, Ashburner M, Falls K, Leyland P, McQuilton P, Marygold S, Millburn G, Osumi-Sutherland D, Schroeder A, Seal R, et al. FlyBase: enhancing Drosophila Gene Ontology annotations. Nucleic Acids Res. (2009) 37:D555–D559.
4. Adryan B, Teichmann SA. FlyTF: a systematic review of site-specific transcription factors in the fruit fly Drosophila melanogaster. Bioinformatics (2006) 22:1532–1533.
5. Wilson D, Charoensawan V, Kummerfeld SK, Teichmann SA. DBD–taxonomically broad transcription factor predictions: new content and functionality. Nucleic Acids Res. (2008) 36:D88–D92.
6. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. (2000) 25:25–29.
7. Lyne R, Smith R, Rutherford K, Wakeling M, Varley A, Guillier F, Janssens H, Ji W, Mclaren P, North P, et al. FlyMine: an integrated database for Drosophila and Anopheles genomics. Genome Biol. (2007) 8:R129.
8. Celniker S, Dillon L, Gerstein M, Gunsalus K, Henikoff S, Karpen G, Kellis M, Lai E, Lieb J, MacAlpine D, et al. Unlocking the secrets of the genome. Nature (2009) 459:927–930.
9. Drosophila 12 Genomes Consortium. Evolution of genes and genomes on the Drosophila phylogeny. Nature (2007) 450:203–218.
10. Yang H, Nenadic G, Keane JA. Identification of transcription factor contexts in literature using machine learning approaches. BMC Bioinformatics (2008) 9(Suppl. 3):S11.
11. Yang H, Keane J, Bergman CM, Nenadic G. Assigning roles to protein mentions: the case of transcription factors. J. Biomed. Inform. (2009) 42:887–894.
