| Nucleic Acids Research | Pages |
Histone Sequence Database: new histone fold family members
Introduction
Database Content
The Histone Fold Motif
Database Availability
Acknowledgements
References
ABSTRACT
INTRODUCTION
The histone proteins play a critical role in the compaction of DNA into nucleosomes as well as in the overall organization of eukaryotic chromosomes (1,2). The four core histones (H2A, H2B, H3 and H4) form a tripartite, octameric assembly (3). The basic nature of these proteins, which have a high proportion of lysine and arginine residues, facilitates the wrapping of 146 bp of DNA around the octamer to form the elementary unit of compaction in eukaryotic DNA, the nucleosomal core particle (4). In turn, the linker histones (H1 and H5) can bind to the DNA between nucleosomes (5), thereby stabilizing the nucleosomes and promoting the formation of higher-order chromatin structures. Due to the central role of the histones within the cell, these proteins have been very highly conserved throughout evolution. This conservation is best illustrated by the >95% identity across all known H4 sequences. The core histones exhibit this degree of conservation throughout their entire length, while the linker histones are only conserved as such within their central, globular domain.
The folding and association of each of the core histones within the octamer complex is driven by a common structural motif called the histone fold (6). This motif, which has been defined as an extended helix-loop-helix domain, is also found in a number of non-histone proteins that, like the histones, are involved in protein-protein and protein-DNA interactions (7). Both the existence and the conservation of an extensive protein domain such as the histone fold across a wide span of taxonomic groups argue for an evolutionary relationship between the histone fold proteins, as well as for a critical role of this domain in cellular metabolism.
In this paper, we describe the Histone Sequence Database, a compilation of all of the histone and histone fold protein sequences and structures available as of October 1997. This database is intended to be a source of sequence information for these chromosomal proteins, with particular reference to conflicts between similar sequence entries in different source databases. The database, which will be updated as new histone sequence entries are processed, represents the most comprehensive collection and annotation of histone primary sequences and histone fold-containing sequences assembled to date.
DATABASE CONTENT
The histone protein sequences were compiled by searching the non-redundant (nr) protein sequence database at NCBI using both the BLASTP (8) and PSI-BLAST (9) algorithms. The nr database is a compilation of entries from SWISS-PROT (10), the Protein Identification Resource (PIR) (11), the Protein Data Bank (PDB) and CDS translations from GenBank (12). In each case, the sequences from chicken and human sources were used as the basis for comparison. In the case of H5, no such histone exists in human, so only the chicken sequence was used. The sequence of Hho1p, discovered by TBLASTN searches against the complete yeast genome sequence in the Saccharomyces Genome Database, was added manually to the histone H1 sequence set. This sequence resides in an open reading frame on yeast chromosome XVI (13) and its ability to adopt the H1 structure has been confirmed by homology model building (unpublished results).
For each of the five histone classes, there are two protein sequence files (see Table 1 for statistics). The first file type contains all of the sequences found for that histone and is, therefore, redundant. The sequence data is presented in FASTA format, with the definition line for each entry containing, in order, the name of the source database, the accession number, the locus name or SWISS-PROT ID (as appropriate to the source database), a word description and a histone code. Each element on the definition line is separated by a vertical bar. The second file type contains only one entry for each unique sequence from a particular organism or variant thereof, making these files non-redundant. These sequences are also in FASTA format, with only a histone code appearing in the definition line. The histone codes can be used to cross-reference entries in the complete, redundant set of protein sequences.
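The definition-line layout described above can be read with a few lines of code. The following is a minimal sketch in Python, assuming exactly five bar-separated fields in the order given in the text; the field names, the function name and the example line are hypothetical and are not taken from the database itself.

```python
from typing import NamedTuple


class DefLine(NamedTuple):
    """Fields of a definition line in the complete (redundant) set."""
    source_db: str
    accession: str
    locus: str          # locus name or SWISS-PROT ID
    description: str
    histone_code: str


def parse_defline(line: str) -> DefLine:
    """Split a bar-separated definition line into its five fields."""
    text = line.strip()
    if text.startswith(">"):
        text = text[1:]
    fields = [field.strip() for field in text.split("|")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 bar-separated fields, got {len(fields)}")
    return DefLine(*fields)


# Hypothetical example line; every identifier here is invented.
example = ">EXAMPLEDB|X00000|LOCUS1|example histone H4|CODE1"
record = parse_defline(example)
print(record.histone_code)
```

The histone code recovered in this way is the key that links an entry in the redundant set to its counterpart in the non-redundant set.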
In the course of the database searches, cases were noted where there were conflicts between the individual sequence entries for a given histone. In citing sequence conflicts, a majority-rule approach was used. In cases where there was no clear majority among the sequences, the differences are noted with respect to the entry in SWISS-PROT, where available. A pairwise sequence alignment, generated using CLUSTAL W (14), along with the sequence conflict information is presented for all entries where discrepancies were noted. Cases where the protein sequences are in fact correct but have been incorrectly identified are also noted.
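The majority-rule approach can be illustrated with a short sketch. It is not the procedure actually used to build the database: it assumes the entries have already been aligned to equal length, takes "clear majority" to mean a residue present in more than half of the entries, and falls back to a designated reference entry (standing in for the SWISS-PROT entry) otherwise. The function name and example sequences are hypothetical.

```python
from collections import Counter


def find_conflicts(aligned, reference_index=None):
    """Report alignment columns where the entries disagree.

    Returns a list of (position, baseline_residue, differing_entry_indices),
    with 0-based positions. The baseline is the majority residue when one
    residue occurs in more than half of the entries; otherwise it is the
    residue of the reference entry, or None if no reference is given.
    """
    lengths = {len(seq) for seq in aligned}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    conflicts = []
    for pos in range(lengths.pop()):
        column = [seq[pos] for seq in aligned]
        counts = Counter(column)
        if len(counts) == 1:
            continue  # all entries agree at this position
        residue, n = counts.most_common(1)[0]
        if 2 * n > len(column):
            baseline = residue
        elif reference_index is not None:
            baseline = column[reference_index]
        else:
            baseline = None
        differing = [i for i, r in enumerate(column) if r != baseline]
        conflicts.append((pos, baseline, differing))
    return conflicts


# Invented toy alignment: the third entry differs at position 3.
entries = ["MSGRGK", "MSGRGK", "MSGKGK"]
print(find_conflicts(entries))
```

In the toy alignment, two of three entries carry R at position 3, so the third entry is reported as the one in conflict.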
Multiple sequence alignments for each histone protein are available in PostScript format for downloading or viewing with an appropriate PostScript translator. In each alignment, the major human histone sequence (chicken in the case of H5) is shown at the top of the alignment. The region of the histone fold motif, as described above, is boxed; alignment within the histone fold region was done manually. Regions outside the histone fold motif were aligned using CLUSTAL W (14). In the case of H1 and H5, the central, globular domain is boxed instead.
A search engine has been added with the current release of the database. The search engine allows users to amass entries present within different files by histone type, organism, or data set. A free-text search is also available for the complete (redundant) data set, allowing users to search for any text present on the definition line of the individual entries. Search results are returned in FASTA format.
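A free-text search of this kind can be approximated offline. The sketch below assumes case-insensitive substring matching against the definition line and returns the hits as FASTA text; the matching rule used by the actual search engine is not stated in the text, and the function names and example records are hypothetical.

```python
def free_text_search(records, query):
    """Return the (definition_line, sequence) pairs whose definition line
    contains the query, ignoring case."""
    needle = query.lower()
    return [(d, s) for d, s in records if needle in d.lower()]


def to_fasta(records, width=60):
    """Format (definition_line, sequence) pairs as FASTA text."""
    lines = []
    for defline, seq in records:
        lines.append(">" + defline)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in lines)


# Invented records; identifiers and sequences are placeholders.
records = [
    ("EXAMPLEDB|X00001|LOC1|example chicken histone|CODE1", "MSGRGKGG"),
    ("EXAMPLEDB|X00002|LOC2|example human histone|CODE2", "MSGRGKQG"),
]
hits = free_text_search(records, "Chicken")
print(to_fasta(hits), end="")
```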
Table 1

                             Total          Non-Redundant   Structure
                             sequence set   sequence set
Histone H1                   208            77              1
Histone H2A                  224            78              0
Histone H2B                  207            72              0
Histone H3                   226            82              0
Histone H4                   179            59              0
Histone H5                   13             5               1
Histone Octamer                                             1
Nucleosomal Core Particle                                   1
Total Histone Entries        1057           373             4
Histone Fold Proteins        47                             2
THE HISTONE FOLD MOTIF
The original sequence-based identification of non-histone proteins containing the histone fold motif was based on the Motif Search Tool (MoST) (15). Recently, a new method called PROBE (16) has been developed which, through a different algorithm, also detects subtly-conserved sequence patterns. Both of these methods were used to re-examine the protein databases to detect new members of the histone fold family. As a result of these new searches, several new and biologically interesting proteins have been added to the histone fold family (Fig. 1).
Figure 1
Among the new family members is DRAP1, which associates with the TATA-binding protein-associated phosphoprotein DR1, itself a member of the histone fold family. DRAP1 and DR1 are capable of forming heterodimers, potentially through their histone fold motifs, and the association of DRAP1 with DR1 stabilizes the entire DRAP1-DR1-TATA complex, blocking the entry of TFIIA and/or TFIIB into preinitiation complexes (17). The chromatin-associated protein CSE4 from Saccharomyces cerevisiae has also been added to the alignment. CSE4 is essential for cell division, and mutations in CSE4 have been shown to increase non-disjunction of chromosomes bearing mutant centromeric DNA sequences (18). High-copy CSE4 has also been found to suppress the temperature sensitivity of lethal H4 mutants defective in mitotic chromosome transmission (19). The sequence for transcription factor IIB (TFIIB) was manually deleted from the sequence set returned by the motif search methods. Recently, the crystal structure of a preinitiation complex from the archaeon Pyrococcus woesei was determined at a resolution of 2.1 Å (PDB:1AIS) (20). Based on comparisons with previously-solved structures of other histone fold proteins (21-23), it was determined that, despite sequence information to the contrary, TFIIB does not form the histone fold structure.
A pair of Web pages containing sequence data from non-histone proteins identified as containing the histone fold motif can be found within the Histone Fold section of the Database. One page contains the complete sequence of each protein, while the second contains only the histone fold motif portion of the sequence. These files are both in FASTA format. In addition, multiple sequence alignments of the histone fold motifs are available in PostScript format.
With respect to structures, information is provided for all histone and histone fold proteins for which three-dimensional coordinate data has been deposited and is available through PDB. If the coordinate data has been released, users can link to both MMDB and PDB to retrieve the files, as described below.

DATABASE AVAILABILITY
The Histone Sequence Database is available through the World Wide Web at either:
http://www.nhgri.nih.gov/DIR/GTB/HISTONES or
http://www.ncbi.nlm.nih.gov/Baxevani/HISTONES
A menu bar appears to the left of each page, allowing users to easily navigate the Web site without having to return to the home page to examine different parts of the site. In order to increase the utility of the database, hyperlinks have been integrated into all of the FASTA-formatted sequence files comprising the complete (redundant) data set. Clicking on the accession number for a particular entry allows the user to view the NCBI Entrez document report for that entry. In most cases, these document reports include links to other relevant data, such as to literature citations in MEDLINE (PubMed), related sequence entries in GenBank, and the Molecular Modeling Database (MMDB). Hyperlinks from the table of structures connect to the MMDB and PDB structure entries for that protein. From the MMDB entry, users can view the structure itself using Cn3D (24), a molecular viewing application that is bundled with Network Entrez and can be downloaded by following hyperlinks on any structure entry page. In this fashion, users can take advantage of the integrated nature of the Entrez retrieval system to gather large amounts of information on a particular sequence or set of sequences (25).
Database flatfiles can also be downloaded directly from the public FTP site at NCBI (ncbi.nlm.nih.gov, directory /pub/baxevanis/histones). There are two FASTA-format protein files for each major histone type, corresponding to the complete or redundant sequence set (*.raw) and to the non-redundant sequence set (*.nr). The format of the definition lines for each entry is as described for the Web site, above. The histone codes used in the definition lines of these entries are in a text file (codes.txt) in the same directory. Two FASTA-format files of the histone fold protein sequences are also in this directory: hf_seqs.txt contains the complete sequence of each protein, while hf_motif.txt contains only the histone fold motif portion of each sequence.
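Once downloaded, the two file types for a given histone can be cross-referenced through the histone code. The following sketch assumes that the code is the last bar-separated field of each definition line in a *.raw file and the only text on each definition line in a *.nr file, as described for the Web site. The function names are hypothetical, and in-memory text with invented entries stands in for the real files.

```python
import io
from collections import defaultdict


def read_fasta(handle):
    """Yield (definition_line, sequence) pairs from an open FASTA file."""
    defline, chunks = None, []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if defline is not None:
                yield defline, "".join(chunks)
            defline, chunks = line[1:].strip(), []
        elif defline is not None:
            chunks.append(line)
    if defline is not None:
        yield defline, "".join(chunks)


def entries_by_code(raw_records):
    """Group redundant-set entries by histone code (assumed last field)."""
    groups = defaultdict(list)
    for defline, seq in raw_records:
        code = defline.split("|")[-1].strip()
        groups[code].append((defline, seq))
    return dict(groups)


# Invented stand-ins for a *.raw file and a *.nr file.
raw_text = io.StringIO(
    ">DB1|X00001|LOC1|example entry|CODE1\nMSGRGK\nGGKG\n"
    ">DB2|P00001|ID1|example entry|CODE1\nMSGRGKGGKG\n"
    ">DB1|X00002|LOC2|another entry|CODE2\nMSGRGKQGKG\n"
)
nr_text = io.StringIO(">CODE1\nMSGRGKGGKG\n>CODE2\nMSGRGKQGKG\n")

groups = entries_by_code(read_fasta(raw_text))
for code, seq in read_fasta(nr_text):
    print(code, len(groups.get(code, [])), "redundant entries")
```

With real files, the io.StringIO objects would be replaced by files opened in text mode.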
Studies utilizing the data within this database, obtained either through the World Wide Web site or the anonymous FTP site, should cite this paper as the primary reference.
ACKNOWLEDGEMENTS
We would like to thank Erik Ferlanti for his assistance in designing the new graphical front-end for the Database and developing the newly-added sequence search engine.
REFERENCES
Last modification: 17 Dec 1997
Copyright© Oxford University Press, 1998.