Nucleic Acids Research
The Eukaryotic Promoter Database EPD
Background
Leading Concepts
Promoter definition
Entry concept
Organisation as a functional position set
Physiological viewpoint
Relationship to primary information sources
Maintenance policy
Priorities
Contents
Admission criteria
Information content of an entry
Format
Access
Acknowledgement
References
ABSTRACT
BACKGROUND
EPD originated as a by-product of a comparative sequence analysis project aimed at characterising transcriptional control signals near transcription start sites. Efforts to systematically compile data for such a project began in 1981 starting with a published list of 60 promoters (1). In 1986, an early version of this collection containing 173 entries appeared in this Journal (2). The first machine-readable version was released two years later in a format jointly designed with the EMBL Data Library (3) staff.
EPD was conceptually defined as a database of gene function, not as a sequence database. An important consideration was that promoter sequences, like any other type of sequence, would become available in the nucleotide sequence databases anyway and therefore need not be duplicated in a specialised database. However, it was also assumed that promoters would not be automatically annotated, as promoter-defining evidence is often not published together with the sequences. Hence the need for a database that keeps track of promoter-defining evidence and links this information to sequences. These considerations led to the design of EPD as an annotated list of machine-readable pointers to transcription initiation sites in the EMBL Data Library. Such a design required co-ordinated updating procedures by the two databases involved, as the position-sensitive sequence pointers in EPD had to be adjusted each time a corresponding EMBL sequence was modified. In this respect, EPD was an early experiment in biomolecular database interconnection that went beyond mere cross-referencing of documents from different collections (4).
EPD was designed as a resource for comparative sequence analysis and has so far mainly served this purpose. For instance, it has played an instrumental role in the development of eukaryotic promoter prediction algorithms (5). Very recently, the scope of EPD has been extended in the framework of a European collaboration (the TRADAT project funded by the EU Biotechnology programme) aimed at developing an integrated sequence analysis system for transcription regulatory regions. This development also led to the design of a new format, which has been in production use since June 1997. The extensions include new links to other databases which can be exploited by novel data access procedures and user interfaces. With these improvements, EPD may serve a more diverse user community in the future.
LEADING CONCEPTS
Promoter definition
The underlying promoter definition of EPD is that of a transcription start site. Possible alternatives would have been to define promoters genetically as cis-acting elements determining the site and rate of transcription initiation, or biochemically as target sites of transcription factors. There are other databases providing information about promoters in the latter sense, for instance TRANSFAC and COMPEL (6). The experimental evidence required for a transcription start site consists of data that characterise the structure of the 5′ end of an RNA in enough detail that it can be mapped unambiguously onto the genomic sequence. The underlying and broadly accepted assumption is that 5′ ends of eukaryotic transcripts are generated by transcription initiation rather than endonucleolytic cleavage.
Entry concept
An entry in EPD corresponds to a single biological object, not to an individual data report. The redundancy policy applied permits only one entry per genetic map position and organism. The taxonomic resolution is generally at the species level. All information pertaining to the same transcription initiation site in a genome is thus combined in a single entry. Note that there is no one-to-one relationship between EPD promoter entries and genes. In eukaryotic organisms, the same initiation site may be used for transcription of multiple genes and vice versa.
Organisation as a functional position set
A functional position set (FPS) is a machine-readable list of pointers to positions in DNA sequences stored elsewhere, which can be used for automatic retrieval of fixed-length sequence segments around physiological sites (7). There were several motivations for such an approach. For one thing, it avoids data redundancy and thereby helps to maintain global data coherence. Verifying and adjusting position pointers to a new sequence database release is a relatively straightforward way to keep track of corrections and extensions of the promoter sequence data. More importantly, incorporation of sequence data into a promoter database would have implied an arbitrary choice of the 5′ and 3′ borders of a promoter region not based on experimental criteria. Access to sequences through an FPS enables the user to customise the length and location of the extracted sequence segments with regard to a particular biological question or data analysis procedure.
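The retrieval step described above can be sketched as follows. This is a minimal illustration and not EPD's own software: the `SitePointer` class, the `extract_window` function, the in-memory `sequences` dictionary and the toy sequence are all hypothetical, and the coordinate convention (relative position +1 is the initiation site, position 0 the base immediately upstream) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SitePointer:
    """Hypothetical FPS element: a sequence identifier plus a site position."""
    seq_id: str
    position: int  # 1-based position of the initiation site in that sequence


def extract_window(sequences, pointer, start, end):
    """Return the bases at relative positions start..end (inclusive).

    Assumed convention: relative position +1 is the initiation site itself,
    so relative position r maps to absolute position pointer.position + r - 1.
    """
    seq = sequences[pointer.seq_id]
    first = pointer.position + start - 1  # 1-based absolute coordinates
    last = pointer.position + end - 1
    if first < 1 or last > len(seq):
        raise ValueError("window extends beyond the stored sequence")
    return seq[first - 1:last]


# Toy data, invented for illustration only.
sequences = {"TOY1": "ACGTACGTAC"}
fps = [SitePointer("TOY1", 5)]

# The same pointer set serves different questions with different windows.
narrow = [extract_window(sequences, p, -2, 3) for p in fps]
wide = [extract_window(sequences, p, -3, 6) for p in fps]
```

Because only the pointer is stored, changing the window bounds requires no change to the pointer set itself.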
Physiological viewpoint
EPD is a database on gene function. Promoters are viewed as physiological sequence objects dependent on the correct interpretation by a trans-acting environment. The content of an entry is therefore restricted to information connected to the transcription initiation process. In accordance with this policy, the literature references given in EPD refer to transcript mapping data or reports on regulatory properties of the promoter, but never to mere sequence data. Furthermore, promoters are classified according to their cognate trans-acting environments (which make them promoters) rather than by the organisms which replicate them. Therefore, a promoter on a viral genome may be classified as a vertebrate promoter if it is interpreted by the host transcription machinery rather than virus-encoded factors.
Relationship to primary information sources
All information in EPD derives from independent examination and interpretation of experimental data presented in the cited research publications. The same standards are applied to data from different sources. As a consequence, the interpretations presented in EPD may differ from the conclusions reached by the authors of the data. Note that many transcription initiation sites described in the literature have not been included in EPD because the underlying experimental evidence did not meet the minimal requirement for inclusion, as stated in the user manual.
Maintenance policy
EPD entries are dynamic by design, not only because the pointers to transcription initiation sites in EMBL need to be verified and adjusted on a regular basis, but also because biological information on particular promoters accumulates continually. Changes in existing EPD entries are therefore not confined to error corrections. With the recent format extensions, maintenance of complete lists of cross-references to other databases has become a major objective.
Priorities
Because EPD is maintained with very limited resources, completeness was never considered a realistic objective. It is estimated that EPD presently contains only about half of the reported promoters formally qualifying for inclusion. By contrast, great effort has been made to ensure the accuracy of all information in the data collection. This remains a promise of EPD.
CONTENTS
In order to maintain a high quality standard, the scope of EPD has been limited in two ways: (i) by excluding certain classes of eukaryotic promoters, and (ii) by covering only certain aspects of promoters.
Admission criteria
To be included in EPD, a promoter has to fulfil all of the following five conditions:
The reasons for some of the restrictions may not be immediately obvious. The decision to exclude promoters from lower eukaryotes was taken because transcript mapping experiments were rarely carried out in these organisms, especially in yeast. The exclusion of POL I and POL III promoters seemed reasonable 10 years ago but is no longer justified, since we now know that the same type of cis-acting element may interact with more than one polymerase system. The experimental data criterion is applied strictly. For instance, single nuclease protection or primer extension data are generally not accepted unless they are supported by additional experiments, e.g. demonstration of promoter activity in transiently transfected cells. On the other hand, data pertaining to the orthologous promoter of a closely related species are accepted if the two promoters meet a stringent sequence similarity criterion. The requirement of function helps to deal with a number of pathological situations such as transcribed pseudogenes and promoters accidentally created by insertion of retroviral DNA.
Information content of an entry
An EPD entry contains the following types of information:
Each entry has an ID and an accession number. The description typically includes a gene or a gene product name. The exceptions are promoters used for transcription of several genes, such as the Adenovirus major late promoter.
Table 1
| Promoter class | Total # of entries | # of independent entries |
|---|---|---|
| All entries | 1308 | 861 |
| 1. Plant promoters | 179 | 128 |
| 1.1. Chromosomal genes | 167 | 117 |
| 1.2. Prokaryotic plasmid DNA | 8 | 7 |
| 1.3. Viral genes | 4 | 4 |
| 2. Nematode promoters | 11 | 10 |
| 2.1. Chromosomal genes | 11 | 10 |
| 3. Arthropod promoters | 163 | 108 |
| 3.1. Chromosomal genes | 156 | 101 |
| 3.2. Transposable elements and retroviruses | 2 | 2 |
| 3.3. Viral genes | 5 | 5 |
| 4. Mollusc promoters | 3 | 3 |
| 4.1. Chromosomal genes | 3 | 3 |
| 5. Echinoderm promoters | 42 | 24 |
| 5.1. Chromosomal genes | 42 | 24 |
| 6. Vertebrate promoters | 910 | 588 |
| 6.1. Chromosomal genes | 750 | 472 |
| 6.2. Transposable elements and retroviruses | 31 | 13 |
| 6.3. Viral genes | 129 | 103 |
The machine-readable pointers to sequence data have four parts: a reference to an EMBL entry, a sign indicating plus or minus strand, a symbol indicating the topology of the sequence (linear or circular), and a sequence position number (the sequence topology is not redundant because the notion of a circular sequence is not exactly the same as in EMBL).
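To make the role of the strand and topology parts concrete, the sketch below reads a segment in the direction of transcription from a four-part pointer. The field names, the `"+"`/`"-"` and `"linear"`/`"circular"` encodings and the function `read_segment` are hypothetical. Circularity is modelled here as plain wrap-around at the sequence origin, which is only a stand-in, since EPD's notion of a circular sequence is stated to differ from EMBL's. As an assumed convention, relative position +1 is the initiation site.

```python
from dataclasses import dataclass

# Complement table for reading the minus strand.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


@dataclass(frozen=True)
class Pointer:
    """Four-part pointer; field names and value encodings are hypothetical."""
    entry: str     # reference to a sequence entry
    strand: str    # "+" or "-"
    topology: str  # "linear" or "circular"
    position: int  # 1-based site position, counted on the plus strand


def read_segment(pointer, seq, start, end):
    """Read relative positions start..end in the direction of transcription.

    Assumed convention: relative position +1 is the initiation site.
    """
    n = len(seq)
    step = 1 if pointer.strand == "+" else -1
    bases = []
    for rel in range(start, end + 1):
        absolute = pointer.position + step * (rel - 1)
        if pointer.topology == "circular":
            absolute = (absolute - 1) % n + 1  # wrap around the origin
        elif not 1 <= absolute <= n:
            raise ValueError("segment runs off the end of a linear sequence")
        base = seq[absolute - 1]
        bases.append(base.translate(_COMPLEMENT) if step == -1 else base)
    return "".join(bases)


toy = "AACCGGTT"  # invented sequence
plus = read_segment(Pointer("TOY", "+", "linear", 3), toy, 0, 2)
minus = read_segment(Pointer("TOY", "-", "linear", 3), toy, 0, 2)
wrapped = read_segment(Pointer("TOY", "+", "circular", 8), toy, 1, 3)
```

On the minus strand the same absolute positions are visited in descending order and complemented, so the result is the reverse complement of the corresponding plus-strand segment.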
Based on the initiation patterns, three types of promoters are distinguished: (i) single initiation sites, (ii) clustered multiple initiation sites, and (iii) transcription initiation regions. Over 90% of all entries belong to the single initiation site class. Groups of promoters sharing over 50% sequence identity among each other are identified by a so-called homology group number. A subset of not closely related promoters (less than 50% identity) recommended for statistical analyses is marked by a special flag. All entries are embedded in a hierarchical classification system which helps to locate promoters of interest. The top level distinguishes between phyla (vertebrates, echinoderms, etc.); the second level between replicon types (chromosomes, transposable elements, viruses, see Table 1).
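The purpose of the homology group number can be illustrated with a small sketch that derives a non-redundant subset from it. This is not the procedure by which EPD assigns its flag. The function `pick_representatives`, the record layout (identifier, group number or `None`) and the rule of keeping the first entry seen in each group are assumptions made for illustration, and the identifiers are invented.

```python
def pick_representatives(entries):
    """Keep entries without a group and the first entry seen in each group."""
    seen_groups = set()
    selected = []
    for entry_id, group in entries:
        if group is None:
            # No homology group: the entry has no close relative in the set.
            selected.append(entry_id)
        elif group not in seen_groups:
            seen_groups.add(group)
            selected.append(entry_id)
    return selected


# Invented records: (entry identifier, homology group number or None).
entries = [("P1", 7), ("P2", None), ("P3", 7), ("P4", 12), ("P5", 12)]
subset = pick_representatives(entries)
```

A subset built this way contains at most one member of each homology group, which is the property that matters when closely related promoters would otherwise be counted several times in a statistical analysis.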
The information on regulatory properties was never collected in a systematic way and thus remained fragmentary. With the old format, there was space only for one database cross-reference, the machine-readable pointer to the corresponding EMBL sequence entry. A major objective of the current reorganisation is the introduction of many new cross-references to other data collections. So far, we have incorporated links to TRANSFAC (6), SWISS-PROT (8), FlyBase (9) and MIM (10). In addition, MEDLINE identifiers have been added to the bibliographic references. In the future, we also plan to provide exhaustive cross-referencing to sequences in the EMBL Library, as already exemplified by the entry shown in Figure 1.
Figure 1. Example of an EPD entry in the new format.

FORMAT
EPD is distributed and maintained as a single ASCII flatfile. The format resembles those of the EMBL and SWISS-PROT sequence files. Each line starts with a line code identifying the type of information presented. The original EPD format, in which the database was distributed for almost 10 years, was very concise, relying on many abbreviations and providing certain types of information by alphanumeric codes rather than free text. As mentioned before, this format has recently been extended to allow representation of new types of information, but also to make it more human readable. An example of an entry in the new format is shown in Figure 1.
The old representation of an entry, which comprises the lines starting with FP, DO and RF, is included at the bottom of the new format, ensuring that all software written for the old format will continue to work. The line types of the new format are briefly explained below. (For a description of the old format, see the EPD user manual.)
The ID line type contains a unique entry identifier, specification of the initiation site type (single, multiple or region), and a taxonomic division code (e.g. VRT for vertebrates). The AC, DE and OS lines, as well as the line types containing the bibliographic references, carry the same type of information as in EMBL or SWISS-PROT sequence entries. Note that an EPD accession number consists of the character string 'EP' followed by 5 digits. The HG line is optional. It contains a homology group number that allows identification of all sequence-wise similar promoters in EPD. The AP line provides information on alternative promoters of the same gene.
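A line-code based flatfile of this kind can be read with very little code. The sketch below groups the lines of each entry by line code and checks the accession number pattern stated above ('EP' followed by 5 digits). Everything else is an assumption rather than a statement about the EPD manual: the two-character width of the line code, the `//` entry terminator (borrowed from EMBL), the punctuation of the toy entry, and the helper names `parse_entries` and `is_valid_accession`. The sample entry is invented.

```python
import re
from collections import defaultdict

ACCESSION = re.compile(r"EP\d{5}")  # 'EP' followed by 5 digits, per the text


def parse_entries(text):
    """Group the lines of each entry by their two-letter line code.

    Assumptions (not taken from the EPD manual): the code occupies the first
    two characters and entries are terminated by a '//' line, as in EMBL.
    """
    entries, current = [], defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):
            if current:
                entries.append(dict(current))
            current = defaultdict(list)
        elif line.strip():
            current[line[:2]].append(line[2:].strip())
    if current:  # tolerate a missing terminator on the last entry
        entries.append(dict(current))
    return entries


def is_valid_accession(value):
    """True if value is exactly 'EP' followed by five digits."""
    return ACCESSION.fullmatch(value) is not None


# Invented entry for illustration; not copied from EPD.
sample = """ID   TOY_EXAMPLE   single; VRT.
AC   EP00001;
DE   Hypothetical toy promoter.
//
"""
entry = parse_entries(sample)[0]
accession = entry["AC"][0].rstrip(";")
```

Storing every line code as a list keeps repeated line types, such as multiple cross-reference or reference lines, without special handling.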
The DR lines contain cross-references to other databases. The precise format of these lines depends on the target database. Note that some cross-references include numbers indicating the relative position of a linked sequence object, or keywords characterising the nature of the relationship between the entries. For instance, the ranges associated with cross-references to EMBL entries define the extensions of the EMBL sequences relative to the initiation site described by the EPD entry. The multiplicity of EMBL cross-references in the example shown mirrors the redundancy of the sequence database. The positional information given on DR lines can be used for graphic display of the various sequence objects along the chromosome axis (Fig. 2).
Figure 2. Graphic display of the sequence objects cross-referenced on DR lines along the chromosome axis.

The lines starting with ME describe experiments defining the transcription initiation site. In the new format, the experiments are individually linked to bibliographic references. The SE line shows a short sequence segment corresponding to the -49 to +10 region of the promoter. Transcribed and untranscribed nucleotides are represented by upper and lower case characters, respectively. This newly introduced line type is not meant to provide sequence data but serves as a control string for sequence extraction. The TX lines define a promoter's location within EPD's hierarchical classification system.

In the old format, all information that could be used for selection and retrieval of biologically meaningful promoter sequence subsets was given on a single line starting with the FP code. The current software behind our web pages still uses this part of the entry for FPS-dependent sequence retrieval. Future versions will use the new EMBL cross-references directly. The line on regulation and expression is the only part of the old format not yet replaced by a new representation.

During the last 5 years, EPD has not only been distributed in its original format but also in various alternative views. A 'view' is defined as a file that can be automatically derived from EPD and other public databases, and thus does not contain any genuinely new information. The distributed views included the widely used promoter sequence data files containing the -499 to +100 regions of all promoters in EPD.
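The SE line's role as a control string for sequence extraction, described above, could be exploited along the following lines. The text does not say how the check is performed, so this is only one plausible reading: a segment extracted through the pointer is compared with the control string, and if the underlying sequence has changed, the control string is searched for to propose a new site position. The function names and the toy strings are hypothetical, and `relocate` assumes that the lower-case part of the control string is exactly its untranscribed prefix.

```python
def format_control(segment, n_upstream):
    """Lower-case the untranscribed part and upper-case the transcribed part."""
    return segment[:n_upstream].lower() + segment[n_upstream:].upper()


def matches_control(extracted, control):
    """Compare an extracted segment with a control string, ignoring case."""
    return extracted.upper() == control.upper()


def relocate(sequence, control):
    """Propose a new 1-based site position from a control string.

    Returns None unless the control string occurs exactly once, since an
    absent or repeated match cannot be resolved automatically.
    """
    target = control.upper()
    seq = sequence.upper()
    first = seq.find(target)
    if first == -1 or seq.find(target, first + 1) != -1:
        return None
    n_upstream = sum(1 for c in control if c.islower())
    return first + n_upstream + 1  # position of the first transcribed base


control = format_control("ACGTT", 3)           # invented control string
old_position = relocate("GGACGTTCC", control)  # invented sequence
new_position = relocate("AAGGACGTTCC", control)  # two bases inserted upstream
```

In this toy case the proposed position shifts by the two inserted bases, which is the kind of adjustment needed when a referenced sequence entry is revised.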

ACCESS
EPD can be obtained via anonymous ftp from ftp.epd.unil.ch (directory: /pub/databases/epd). The following files are available:
Online access to EPD is available at http://cmpteam4.unil.ch/epd/. The site offers the following services:
ACKNOWLEDGEMENT
EPD is funded in part by grant 95.0236-1 of the Swiss Federal Office for Education and Science.
REFERENCES
Last modification: 17 Dec 1997
Copyright© Oxford University Press, 1998.