TRANSFAC: a database on transcription factors and their DNA binding sites
E. Wingender*, P. Dietze, H. Karas and R. Knüppel
Gesellschaft für Biotechnologische Forschung mbH, Department of Genome Analysis,
Mascheroder Weg 1, D-38124 Braunschweig, Germany
Received September 5, 1995; Accepted October 2, 1995
ABSTRACT
TRANSFAC is a database on eukaryotic transcription-regulating DNA sequence
elements and the transcription factors that bind to and act through them. This
report summarizes the present status of the database and its accompanying
retrieval tools.
INTRODUCTION
Rendering raw genomic sequence data into usable biological information requires
a great deal of experimental work and thus depends on additional data. However,
efforts are under way to circumvent this bottleneck of functional sequence
analysis by developing sophisticated computational tools that allow biological
function to be deduced from the DNA sequence alone. The function of genes is to
code for specific products, but knowing these products and predicting their
putative biological roles is only half of the task. The other half is to
decipher the regulatory code, i.e. to disclose under which conditions this
genomic information is expressed. Thus, studying gene expression mechanisms is
one of the major tasks of today's molecular biology, and it yields an enormous
and still increasing output of data. This amount of information can only be
handled by storing it in an appropriate database system.
TRANSFAC is a database that collects data relevant to gene expression at the
transcriptional level. Several early collections describing transcription
factors and the sequences they interact with have been published (1-4). Our own
attempts started from a simple compilation, in two tables, of cis-acting DNA
elements and the proteins (transcription factors) binding to them (5). From
this it was obvious that this kind of data can be optimally managed by a
relational database, as was first realized for the Transcription Factor
Database (TFD) by D. Ghosh (6,7). For TRANSFAC, this was implemented after an
electronically readable ASCII flat file version had been established (8,9). We
now present a survey of the progress of the TRANSFAC database, its content and
the database management systems (DBMSs) that are presently available.
STRUCTURE OF THE DATABASE
The basic mechanism of transcriptional control operates through
sequence-specific interactions of a special class of proteins, the
transcription factors, with relatively short DNA elements of ~5-25 bp (10).
When this knowledge is transferred into an appropriate relational data model,
information about the two basic constituents appears in two tables, SITES and
FACTORS (Fig. 1). They are connected by a many-to-many relation since many
sites can interact with several factors, and all known factors bind to more
than just one site. The SITES table gives the position of a particular
regulatory site, the gene this site belongs to and the biological species this
gene has been derived from; a free text field holds additional, unstructured
information such as dissociation constants or inducibility by certain agents.
The methods by which each regulatory site has been identified are linked to the
site, since they give a hint on the reliability of this characterization and on
the stringency of the sequence displayed. This is also reflected by an assigned
`Quality' parameter (see below). Moreover, published regulatory sites have been
identified as being functional in a specific cellular context; therefore,
information about the cell lines used is given as well. These data are stored
in two separate tables, METHODS and CELLS, both of which are linked to SITES
(Fig. 1). Similarly, the sequences of the sites are stored in a separate
connected table, since some sites may comprise more than one binding sequence,
depending on the methods applied. Wherever possible, each sequence is
individually linked to the EMBL data library; this link has been inserted for
users who are interested in the sequence context of a single site. These
properties and links are only partially applicable to synthetic sequences and
to consensus sequences in IUPAC nomenclature, which are also part of the SITES
table.
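The table layout described in this section can be illustrated with a minimal sketch using Python's built-in sqlite3 module. This is not the actual TRANSFAC schema: all column names, the junction table site_factor, the table name site_sequences and the demo rows (placeholder genes, factors and cell lines) are assumptions made purely for illustration. Only the table names SITES, FACTORS, METHODS and CELLS, the many-to-many relation and the kinds of information stored are taken from the text.

```python
import sqlite3

# Hypothetical schema; column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE sites (
    site_id  INTEGER PRIMARY KEY,
    gene     TEXT,
    species  TEXT,
    position TEXT,     -- position of the regulatory site
    quality  INTEGER,  -- assigned 'Quality' parameter
    comment  TEXT      -- free text, e.g. dissociation constants
);
CREATE TABLE factors (
    factor_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
-- Junction table realizing the many-to-many relation SITES <-> FACTORS.
CREATE TABLE site_factor (
    site_id   INTEGER NOT NULL REFERENCES sites(site_id),
    factor_id INTEGER NOT NULL REFERENCES factors(factor_id),
    PRIMARY KEY (site_id, factor_id)
);
CREATE TABLE methods (
    site_id INTEGER NOT NULL REFERENCES sites(site_id),
    method  TEXT NOT NULL
);
CREATE TABLE cells (
    site_id   INTEGER NOT NULL REFERENCES sites(site_id),
    cell_line TEXT NOT NULL
);
-- One site may carry more than one binding sequence.
CREATE TABLE site_sequences (
    site_id        INTEGER NOT NULL REFERENCES sites(site_id),
    sequence       TEXT NOT NULL,
    embl_accession TEXT  -- link to the EMBL data library, where available
);
"""


def build_demo_db():
    """Create an in-memory database filled with made-up placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO sites (site_id, gene, species) VALUES (?, ?, ?)",
        [(1, "gene X", "species 1"), (2, "gene Y", "species 1")],
    )
    conn.executemany(
        "INSERT INTO factors (factor_id, name) VALUES (?, ?)",
        [(1, "factor A"), (2, "factor B")],
    )
    # Site 1 binds two factors; factor A binds two sites.
    conn.executemany(
        "INSERT INTO site_factor (site_id, factor_id) VALUES (?, ?)",
        [(1, 1), (1, 2), (2, 1)],
    )
    conn.execute("INSERT INTO cells VALUES (1, 'cell line P')")
    conn.execute("INSERT INTO methods VALUES (1, 'method Q')")
    return conn


def factors_for_site(conn, site_id):
    """Return the names of all factors linked to one site, sorted by name."""
    rows = conn.execute(
        "SELECT f.name FROM factors AS f "
        "JOIN site_factor AS sf ON sf.factor_id = f.factor_id "
        "WHERE sf.site_id = ? ORDER BY f.name",
        (site_id,),
    )
    return [row[0] for row in rows]


def sites_for_factor(conn, factor_id):
    """Return the ids of all sites linked to one factor, in ascending order."""
    rows = conn.execute(
        "SELECT site_id FROM site_factor WHERE factor_id = ? ORDER BY site_id",
        (factor_id,),
    )
    return [row[0] for row in rows]
```

Calling factors_for_site(conn, 1) and sites_for_factor(conn, 1) on the demo database traverses the relation in both directions, which is what the junction table is for: neither SITES nor FACTORS needs a repeated column to hold several partners.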
FUTURE PROSPECTS
Several extensions of the database remain to be made. First of all, enhanced
cross-referencing between databases is required to construct a well-functioning
network of information resources for molecular biologists. For instance,
pointers to EPD have to be included in TRANSFAC.
Other changes, however, will require changes to the database structure. To
improve access to transcription factor data, we shall abstract the relevant
information from the redundancy between biological species in a `factors
summary table', in which data about obviously identical (or strictly
orthologous) factors are accumulated. This will be the first step towards the
systematic factor classification we are preparing.
Finally, we intend to link the relational and the WWW versions of the database
to a more comprehensive sequence analysis program package which includes
pattern matching routines (involving genomic sites as well as IUPAC consensus
strings), matrix search routines (11) and the ConsInspector program (11).
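As an illustration of what pattern matching with IUPAC consensus strings involves, the following minimal sketch translates a consensus string into a regular expression and scans a DNA sequence for it. It is not the planned program package: the function names iupac_to_regex and find_matches and the example strings are hypothetical, only the forward strand is searched, and only the standard IUPAC nucleotide codes are assumed.

```python
import re

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular expression."""
    parts = []
    for code in consensus.upper():
        if code not in IUPAC:
            raise ValueError(f"not an IUPAC nucleotide code: {code!r}")
        bases = IUPAC[code]
        # Unambiguous codes stay literal; ambiguous ones become a class.
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


def find_matches(consensus, sequence):
    """Return (start, matched_text) for every forward-strand match.

    Start positions are 0-based. A lookahead is used so that
    overlapping matches are reported as well.
    """
    pattern = re.compile(f"(?=({iupac_to_regex(consensus)}))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]
```

For example, find_matches("TGASTCA", "ccTGACTCAggTGAGTCAtt") reports one match for each of the two bases allowed by the ambiguity code S. Searching the reverse strand would additionally require matching against the reverse complement of the sequence, which this sketch leaves out.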
ACCESS TO THE TRANSFAC DATABASE
The TRANSFAC database is accessible as an ASCII flat file. It is distributed in
this format on the CD-ROM of the EBI along with the EMBL data library, or can be
downloaded by anonymous ftp from ftp.gbf-braunschweig.de (IP 193.175.244.2) or
from ftp.ebi.ac.uk. Moreover, there is access through the WWW, either directly
(http://transfac.gbf-braunschweig.de) or through database access networks such
as SRS (http://www.embl-heidelberg.de/srs/srsc) (13,14) or WebDBGET
(http://www.genome.ad.jp/htbin/bfind_transfac). The browsing tools can also be
obtained at the above-mentioned ftp site.
ACKNOWLEDGEMENTS
The authors wish to thank J. Collins for his continuous support, as well as T.
Werner, K. Frech, K. Quandt, N. Kolchanov, A. Kel and O. Kel for critical
discussions and many helpful comments. Moreover, we are indebted to our
colleagues at the EBI, in particular B. Shomer, R. Apweiler and M. Ashburner,
for their support in mutually cross-referencing the databases. We also thank
Borland International, Inc. for making available to us the runtime module of
Paradox for Windows 4.5. This work was funded by the Bundesministerium für
Bildung, Wissenschaft, Forschung und Technologie (Project No. 01 IB 306 A).
REFERENCES
1 Johnson, P. F., and McKnight, S. L. (1989) Annu. Rev. Biochem. 58, 799-839.
8 Wingender, E., Heinemeyer, T., and Lincoln, D. (1991) Genome Analysis-From Sequence to Function in BioTechForum-Advances in Molecular Genetics (J. Collins and A.J. Driesel, eds) Vol. 4, 95-108.
9 Wingender, E. (1994) J. Biotechnol. 35, 273-280.
10 Wingender, E. (1993) Gene Regulation in Eukaryotes. VCH Weinheim.
11 Frech, K., Herrmann, G., and Werner, T. (1993) Nucleic Acids Res. 21, 1655-1664.
12 Knüppel, R., Dietze, P., Lehnberg, W., Frech, K., and Wingender, E. (1994) J. Comput. Biol. 1, 191-198.
13 Etzold, T., and Argos, P. (1993) Comput. Appl. Biosci. 9, 49-57.
14 Etzold, T., and Argos, P. (1993) Comput. Appl. Biosci. 9, 59-64.