Abstract

Pfam, available via servers in the UK (http://pfam.sanger.ac.uk/) and the USA (http://pfam.janelia.org/), is a widely used database of protein families, containing 14 831 manually curated entries in the current release, version 27.0. Since the last update article 2 years ago, we have generated 1182 new families and maintained sequence coverage of the UniProt Knowledgebase (UniProtKB) at nearly 80%, despite a 50% increase in the size of the underlying sequence database. Since our 2012 article describing Pfam, we have also undertaken a comprehensive review of the features that are provided by Pfam over and above the basic family data. For each feature, we determined the relevance, computational burden, usage statistics and the functionality of the feature in a website context. As a consequence of this review, we have removed some features, enhanced others and developed new ones to meet the changing demands of computational biology. Here, we describe the changes to Pfam content. Notably, we now provide family alignments based on four different representative proteome sequence data sets and a new interactive DNA search interface. We also discuss the mapping between Pfam and known 3D structures.

INTRODUCTION

Pfam is a database of curated protein families, each of which is defined by two alignments and a profile hidden Markov model (HMM). Profile HMMs are probabilistic models used for the statistical inference of homology (1,2) built from an aligned set of curator-defined family-representative sequences. A high-quality seed alignment is essential, as it provides the basis for the position-specific amino-acid frequencies, gap and length parameters in the profile HMM. In Pfam, the profile HMM is searched against a large sequence collection, based on UniProt Knowledgebase (UniProtKB) (3), to find all instances of the family. Sequence regions that score above the curated threshold that is set for each family to eliminate false positives (the so-called gathering threshold) are aligned to the profile HMM to produce the full alignment. Curated entries are referred to as Pfam-A entries. The profile HMMs are built and searched using the HMMER software suite (http://hmmer.janelia.org) (4,5).

Sometimes, a single profile HMM cannot detect all homologues of a diverse superfamily, so multiple entries may be built to represent different sequence families in the superfamily. Such related Pfam-A entries are grouped into clans (6). In an effort to be comprehensive, automatically generated entries, called Pfam-B, are built from sequence clusters not currently covered by Pfam-A entries.

Pfam data are available in a variety of formats, which include flatfiles (derived from the MySQL database) and relational table dumps, both of which can be downloaded from the FTP site (ftp://ftp.sanger.ac.uk/pub/databases/Pfam). The Pfam website (available at http://pfam.sanger.ac.uk/andhttp://pfam.janelia.org/) provides different ways to access the database content, providing both graphical representations of and interactive access to the data.

In the 2012 article (7), much of the content was focused on curation details. In this article, we focus on describing the new and updated data features provided in the database and by the website. Besides adding new features, it is also important to indicate those that are no longer available, many of which have been removed due to our drive to scale with the growing influx of new sequences.

PFAM STATISTICS

The current release of Pfam, version 27.0, contains 14 831 Pfam-A families. Of these families, 4563 have been classified into 515 clans. Compared with Pfam 26.0, there has been an increase of 1159 families (1182 new entries have been added and 22 entries have been removed) and 16 new clans, with an additional 320 families having been classified into clans. The Pfam-A families in release 27.0 match 79.9% of the 23.2 million sequences and 58% of the 7.6 billion residues in the underlying sequence database. This corresponds to a negligible percentage increase in sequence and residue coverage (<0.5%), but reflects a significant amount of curation effort. These statistics mask the fact that the underlying sequence database has increased by 7.3 million sequences, a number greater than the entire sequence database of Pfam 23.0, which contained 5.3 million sequences.

Two of the main sources for generating the new families added to release 27.0 were Protein Data Bank (PDB) structures (8) and human sequences. We have made a concerted effort to build families from CATH domains (http://www.cathdb.info/) (9) that did not match a Pfam family in Pfam 26.0. To do so, we used jackhmmer, a program within the HMMER3 software that allows a sequence to be iteratively searched against a sequence database. One hundred new Pfam-A families were built using the sequence of a CATH domain to initiate a jackhmmer search against our underlying sequence database (three iterations were run using an E-value threshold of 0.001). Our curators then used the output from the last iteration of the jackhmmer program as the basis for generating the seed alignment of a new Pfam-A entry. We have also built families for Homo sapiens sequences that did not have a match in Pfam 26.0. By taking the Swiss-Prot collection of human sequences (∼20 000 sequences) and excluding those sequences matched by a Pfam-A entry, each remaining sequence was used to initiate a jackhmmer search. Again, Pfam-A entries were built from the jackhmmer output. By building families in this way, we have increased the sequence coverage of the Swiss-Prot set of human sequences by almost 5% and the residue coverage by 2.2%. The Pfam 27.0 sequence coverage of Swiss-Prot human sequences is now 90.5% and the residue coverage is 45.1%. We will continue to work on incorporating more human regions into Pfam-A, as there is still much to be gained at the residue level. However, attaining high residue coverage in human is complicated by the large fraction of intrinsic disorder found in the regions that are not currently covered by Pfam-A families [discussed further in (10)]. In addition to using CATH domains and human sequences as starting points for new Pfam families, we continue to add families built from Pfam-B entries, as well as from community submissions received via our helpdesk. We have received 135 direct submissions from our seven registered external contributors, who have our database curation tools installed locally to facilitate automated deposition.

In 2012, we described the introduction of Wikipedia as a platform for community-based functional annotation (7). Since release 26.0, the first to include links to Wikipedia articles, we have tried to link as many Pfam-A families as possible to those articles that best describe their biology. The number of families linking to a Wikipedia article increased from 4942 in 26.0 to 5663 families in release 27.0, an increase of 721. Of these 721 new links, 391 were added to old families and 330 were added to new families in Pfam 27.0. Some articles may be linked to many Pfam-A families, but the number of unique Wikipedia articles also rose by 311, from 1016 in 26.0 to 1327 in 27.0. As described previously, we operate a manual approval system that allows us to view all changes to our linked articles. Although the number of newly linked articles has increased, we have also observed a steady stream of edits to many of the linked articles. Most edits are simple format or typographic improvements, but many have also provided valuable scientific content, including significant improvements to and expansion of important articles. For example the Wikipedia article on EGF-like domains was significantly expanded in October 2012.

RECENT CHANGES TO THE DATABASE CONTENT

Removing dubious sequences from the underlying database

Each Pfam release is calculated against a fixed sequence database, called pfamseq, which is derived from UniProtKB (3). At the beginning of a release cycle, we take a copy of the current version of UniProtKB and process it in two ways, the second of which is a novel addition for release 27.0. First, we remove sequences that contain non-consecutive regions. The linear sequence-information in these proteins will be inaccurate, as adjacent residues in the sequence can flank an intervening number of unsequenced residues. There are currently <1000 UniProt entries that contain non-consecutive sequence regions. The second, new processing step is the removal of sequences derived from spurious open reading frames, which are identified by searching AntiFam (11) models against the sequence database. In release 27.0, the models from AntiFam version 2.0 identified 2829 sequences for removal.

Family full alignments and trees

When building a Pfam release, we aim to ensure that the same set of post-processing operations are performed on all families regardless of size, thereby providing consistency both to the database and to the website. One of the distinguishing features of Pfam compared with most other protein family databases is our provision of full alignments. Unsurprisingly, however, with the exponential growth of the underlying sequence database, we have observed a similar dramatic increase in the size of our full alignments. Although generation of these alignments does not currently present a scalability problem, aiding human interpretation through visualization has become increasingly difficult. Most approaches for facilitating alignment visualization natively in the browser do not scale well. Applets, such as the Jalview alignment viewer (12), partly solve the problem, but require Java to be installed and coupled to the browser.

For example, the largest Pfam-A family (version 27.0) with >363 000 matches to the profile HMM is the ABC transporters family (ABC_tran, accession PF00005)—its full alignment is thus too large to be useful for most purposes. The seed alignment, by contrast, contains just 55 representative sequences, which may be an insufficient number to represent the sequence diversity within the family. To provide more useable samples of the sequence diversity within a family, we now calculate model-matches for four additional sequence sets, based on ‘Representative Proteomes’ (RPs) (13). For the ABC_tran family, the RP alignments range in size from approximately a quarter of the size of the full alignment to less than one tenth.

In an RP set, each member proteome is selected from a grouping of similar proteomes. The selected proteome is chosen to best represent the set of grouped proteomes in terms of both sequence and annotation information. The grouping of proteomes is based on a clustering of UniProt, UniRef50, and includes all complete proteome sequences. In each cluster, sequences have ≥50% identity and have at least an 80% overlap with the longest sequence. The similarity of two proteomes is determined by considering just the clusters containing sequences from either of the two proteomes. The two proteomes are grouped when the fraction of clusters that contain sequences from both proteomes out of the subset of proteome-specific clusters exceeds a given threshold. This threshold is termed the co-membership threshold. The percentage threshold of co-membership (or common clusters) can be adjusted down to produce larger groupings, and hence less redundant sequence sets.

We use the RP sequence sets constructed using co-membership thresholds of 75, 55, 35 and 15%, giving a range of sequence redundancy for each family. Using representative proteomes has the advantage that it still allows for organism-specific copy numbers to be assessed, a feature that can be lost when using global non-redundancy thresholds on an entire sequence database. However, the major advantage for Pfam is the dramatic reduction in the size of the family full alignments, as shown in Table 1, which illustrates the reductions with increasingly redundant RPs for the 10 biggest families in Pfam. The RP sets do not currently include viruses, and so for some families such as GP120, there may not be a match to the RP sets.

Table 1.

The reduction in size of RP versus full alignments

Family identifier (accession)SeedFullRP75RP55RP35RP15
ABC_tran (PF00005)55363 40926% (93 265)21% (77 150)16% (57 358)8% (28 903)
COX1 (PF00115)94254 3511% (2006)0.7% (1661)0.4% (1218)0.2% (538)
zf-H2C2_2 (PF13465)163227 89861% (138 033)27% (60 664)15% (34 039)9% (21 562)
WD40 (PF00400)1804193 25265% (125 805)52% (100 531)36% (69 386)23% (21 562)
MFS_1 (PF07690)195181 66830% (55 719)25% (55 719)17% (55 719)8% (55 719)
RVT_1 (PF00078)152172 3605% (8257)4% (6662)3% (5373)2% (3604)
BPD_transp_1 (PF00528)81156 33923% (36 523)19% (29 422)14% (22 134)7% (10 630)
Response_reg (PF00072)57151 33729% (44 329)25% (37 848)20% (29 453)10% (15 208)
GP120 (PF00516)24146 453N/AN/AN/AN/A
HATPase_c (PF02518)659129 38628% (36 085)24% (30 935)19% (24 121)10% (12 473)
Family identifier (accession)SeedFullRP75RP55RP35RP15
ABC_tran (PF00005)55363 40926% (93 265)21% (77 150)16% (57 358)8% (28 903)
COX1 (PF00115)94254 3511% (2006)0.7% (1661)0.4% (1218)0.2% (538)
zf-H2C2_2 (PF13465)163227 89861% (138 033)27% (60 664)15% (34 039)9% (21 562)
WD40 (PF00400)1804193 25265% (125 805)52% (100 531)36% (69 386)23% (21 562)
MFS_1 (PF07690)195181 66830% (55 719)25% (55 719)17% (55 719)8% (55 719)
RVT_1 (PF00078)152172 3605% (8257)4% (6662)3% (5373)2% (3604)
BPD_transp_1 (PF00528)81156 33923% (36 523)19% (29 422)14% (22 134)7% (10 630)
Response_reg (PF00072)57151 33729% (44 329)25% (37 848)20% (29 453)10% (15 208)
GP120 (PF00516)24146 453N/AN/AN/AN/A
HATPase_c (PF02518)659129 38628% (36 085)24% (30 935)19% (24 121)10% (12 473)

The seed alignment is used to construct the profile HMM and contains a representative set of sequences of the family. The full alignment contains all hits in pfamseq scoring above the gathering threshold. In Pfam 27.0, we have introduced four additional alignments based on RPs, which contain decreasing amounts of sequence redundancy from RP75 to RP15. For each RP data set, the percentage reduction in the size of the full alignment is shown, with the number of sequences given in brackets.

Table 1.

The reduction in size of RP versus full alignments

Family identifier (accession)SeedFullRP75RP55RP35RP15
ABC_tran (PF00005)55363 40926% (93 265)21% (77 150)16% (57 358)8% (28 903)
COX1 (PF00115)94254 3511% (2006)0.7% (1661)0.4% (1218)0.2% (538)
zf-H2C2_2 (PF13465)163227 89861% (138 033)27% (60 664)15% (34 039)9% (21 562)
WD40 (PF00400)1804193 25265% (125 805)52% (100 531)36% (69 386)23% (21 562)
MFS_1 (PF07690)195181 66830% (55 719)25% (55 719)17% (55 719)8% (55 719)
RVT_1 (PF00078)152172 3605% (8257)4% (6662)3% (5373)2% (3604)
BPD_transp_1 (PF00528)81156 33923% (36 523)19% (29 422)14% (22 134)7% (10 630)
Response_reg (PF00072)57151 33729% (44 329)25% (37 848)20% (29 453)10% (15 208)
GP120 (PF00516)24146 453N/AN/AN/AN/A
HATPase_c (PF02518)659129 38628% (36 085)24% (30 935)19% (24 121)10% (12 473)
Family identifier (accession)SeedFullRP75RP55RP35RP15
ABC_tran (PF00005)55363 40926% (93 265)21% (77 150)16% (57 358)8% (28 903)
COX1 (PF00115)94254 3511% (2006)0.7% (1661)0.4% (1218)0.2% (538)
zf-H2C2_2 (PF13465)163227 89861% (138 033)27% (60 664)15% (34 039)9% (21 562)
WD40 (PF00400)1804193 25265% (125 805)52% (100 531)36% (69 386)23% (21 562)
MFS_1 (PF07690)195181 66830% (55 719)25% (55 719)17% (55 719)8% (55 719)
RVT_1 (PF00078)152172 3605% (8257)4% (6662)3% (5373)2% (3604)
BPD_transp_1 (PF00528)81156 33923% (36 523)19% (29 422)14% (22 134)7% (10 630)
Response_reg (PF00072)57151 33729% (44 329)25% (37 848)20% (29 453)10% (15 208)
GP120 (PF00516)24146 453N/AN/AN/AN/A
HATPase_c (PF02518)659129 38628% (36 085)24% (30 935)19% (24 121)10% (12 473)

The seed alignment is used to construct the profile HMM and contains a representative set of sequences of the family. The full alignment contains all hits in pfamseq scoring above the gathering threshold. In Pfam 27.0, we have introduced four additional alignments based on RPs, which contain decreasing amounts of sequence redundancy from RP75 to RP15. For each RP data set, the percentage reduction in the size of the full alignment is shown, with the number of sequences given in brackets.

The reduction in the size of the full alignments varies from family to family, reflecting in part the bias in the sequence database. Overall, across the whole of the database, using RP at 75, 55, 35 and 15% co-membership thresholds results in average alignment sizes that are, respectively, 38.8, 29.7, 20.4 and 11.6% of the full alignment size. As the number of sequences in the sequence database increases, we anticipate that the alignments based on RPs will grow at a more linear rate and provide a more convenient way of sampling the full alignment sequence diversity.

As illustrated in Table 1, the full alignment size for the top 10 families ranges from 129 000 to 363 000 sequences. With alignments of this size, it is no longer practical to calculate the neighbour-joining trees provided in previous Pfam releases. Before release 27.0, these approximate neighbour-joining phylogenetic trees (with bootstrapping values based on 100 replicas) were used to order the alignments, such that phylogenetically related sequences would be grouped together. From release 27.0 onwards, the full alignments are ordered according to the HMMER bit score of the match, with the highest scoring sequence found at the top of the alignment. The same phylogenetic trees are still provided for the seed alignments, but are merely a guide as they are calculated with the FastTree approximation algorithm (14). The seed alignment sequences remain ordered according to the calculated tree.

In the Pfam website, we use two different colouring schemes when displaying our alignments in a web browser: the Clustal scheme (15), based on the chemical properties of the amino acids found in the column, and a heat-map scheme that reflects the posterior probability of alignment confidence (16). However, the complexity of the large multiple sequence alignments, in terms of gaps and variation, can result in vast numbers of HTML elements being generated to mark up an entire alignment. The maximum number of elements that can be displayed depends on the user’s browser and hardware, but, in an effort to protect users from attempting to view alignments that are unlikely ever to be rendered, we only make HTML versions of alignments that contain 5000 sequences or fewer. In an effort to convey which options for viewing an alignment are available for a given family via the website, we present a table indicating the availability of the alignment view option (Figure 1).

Figure 1.

Table from the ‘Alignments’ tab of the family page for COX1 (PF00115), showing the availability of different views and different alignments for COX1. The posterior probability-based alignment is only available for the full alignments as it is derived from the alignment of a sequence to the HMM, as indicated by the subscript 1 in the corresponding seed alignment cell.

SEARCH-INTERFACE DEVELOPMENTS

As the volume of data in Pfam increases, it is important to make that data even more discoverable. Before Pfam 27.0, keyword searches were performed via the backend MySQL database, using the ‘fulltext’ indexing method offered by the database engine. However, the performance of this search was deteriorating as the database grew with each release, particularly when queried with common words. To ensure future scalability, keyword searches are performed outside of the database, using Apache Lucy (http://lucy.apache.org), a tool specifically designed for full-text indexing. This has allowed us to tailor the searches to improve specificity (any query term of ≥2 characters will be used as a query), such that all query-matching strings, including substrings, are found for text associated with a Pfam-A family, structures and ontology; the sequence-annotations are also indexed, but, due to the quantity of text, this index is built only to match complete words. Results from the different text indexes are amalgamated and ordered, based on the index—prioritized in the following order: Pfam, sequence annotation, structure, Gene Ontology and InterPro—and the query term score. Keyword searches are now interactive, typically returning in <100 ms.

Faster interactive DNA searches

Pfam has provided an asynchronous DNA search tool since 2000 (17). The function of this tool is to try to identify the presence of Pfam-A families on an input DNA sequence, with results emailed to the user. Currently, it is not possible to compare directly a protein profile HMM against a DNA sequence using HMMER3. The previously described search was constructed around the GeneWise software (18), and would compare the DNA sequence to the protein profile HMMs via a gene model. The GeneWise software was originally written for profile HMMs built using the HMMER2 software suite, and although it is possible to back-convert HMMER3 models to HMMER2 format, we found that there was a significant loss in sensitivity for these searches. HMMER3 models tend to have lower relative entropy per position due to the altered prior weighting, compared with HMMER2. This, coupled with the tuning of GeneWise specificity, could account for the loss of sensitivity. However, the increased speed of HMMER3 presented an alternative approach for the detection of Pfam matches on DNA sequences. As opposed to the more sophisticated gene structure-aware approach used previously, we now can perform a standard six-frame translation on the DNA, and search each of the resulting ‘protein’ sequences against the Pfam-A library. This brute-force approach with HMMER3 is sufficiently quick to allow the use of the same interface as we use for the interactive protein sequence searches, thus unifying the sequence search interface for both protein and DNA. In the DNA search results page (Figure 2), each open reading frame is represented graphically, with the positions of the stop codons in the reading frame highlighted by red square lollipops and the positions of any domains represented using the standard Pfam domain representations. The DNA search functionality has also been incorporated into pfam_scan.pl, our downloadable tool for performing sequence searches against Pfam.

Figure 2.

Results from searching Pfam with the Hepatitis B virus isolate G376-7, complete genome (GenBank accession AF384371.1), providing a striking example of overlapping genes. The six reading frames are displayed graphically in the top box of the results page. All three reading frames from the positive strand contain matches to Pfam-A, which are tabulated below. The positions of stop codons are indicated by the square lollipops. The results are shown with the ‘protein’ coordinates of the open reading frame, but it is also possible to toggle this to DNA sequence coordinates. This search tool accepts sequences up to 80 000 nucleotides in length, and searches the Pfam-A HMM library using the gathering threshold.

Changes associated with alternative target sequence databases

To streamline the production of the database, we no longer store the matches to the NCBI NR (non-redundant) protein sequence database (19) or our metagenomics sequence collection. We still provide Stockholm formatted alignments of all matches to each family found in these two sequence databases as well as retrieval of sequences via accession (e.g. 22125853 or EBH56784.1). However, the data for non-UniProt sequence pages come from an on-the-fly search of the sequence against the Pfam-A HMM library. Generating the data in this manner not only reduces the time required to populate the database, but also provides a more coherent view of the Pfam match data: overlapping matches arising from other clan families can be removed (previously all matches were reported for the NR and metagenomics sets) using the same rules that are used for UniProtKB sequences. As a result, the view is identical to the UniProt sequence page, where the data are retrieved from the database.

DEPRECATED FEATURES

In our 2004 article (20), we described the introduction of contextual domain-hits, which used language-modeling techniques to identify weak domain hits that fell just below the gathering threshold but had support from surrounding domains (or contextual information) (21). Unfortunately, the third-party software used to generate such matches in Pfam is no longer supported and the existing implementation fails to scale, both in terms of time and memory, when presented with the tens of millions of matches now reported by Pfam. Although there is merit in providing additional functional annotations via contextual domain-hits, the improved sensitivity offered by HMMER3, the introduction of clans (which allows us to build multiple models for ubiquitous domains that cannot readily be matched by a single model) and/or simply improved models, means that many of these contextual domains are now reported by standard Pfam-A matches (Table 2). Since the last time it was calculated, in 2007, 37% of the previously identified contextual hits (10 559) are now covered by Pfam entries. The majority of contextual hits were for Pfam-A entries of type ‘Repeat’ and the highest proportion of unidentified hits belong to this entry type. This reflects the difficulty we have in generating profile HMMs that are able to detect all instances of a short degenerate, repeating sequence motif. Table 2 summarizes the breakdown of context hits that are now matched in Pfam 27.0.

Table 2.

Breakdown of contextual hits that are reported by Pfam entries in Pfam 27.0, according to the protein family type

Entry type% Context regions reported in Pfam 27.0% Context regions not reported in Pfam 27.0
Family47
Domain1313
Motif<12
Repeat2041
All3763
Entry type% Context regions reported in Pfam 27.0% Context regions not reported in Pfam 27.0
Family47
Domain1313
Motif<12
Repeat2041
All3763

The percentage reported for each entry type is the fraction out of all of the 10 559 contextual domains, with the total for all domains shown at the bottom of the table.

Table 2.

Breakdown of contextual hits that are reported by Pfam entries in Pfam 27.0, according to the protein family type

Entry type% Context regions reported in Pfam 27.0% Context regions not reported in Pfam 27.0
Family47
Domain1313
Motif<12
Repeat2041
All3763
Entry type% Context regions reported in Pfam 27.0% Context regions not reported in Pfam 27.0
Family47
Domain1313
Motif<12
Repeat2041
All3763

The percentage reported for each entry type is the fraction out of all of the 10 559 contextual domains, with the total for all domains shown at the bottom of the table.

In addition to removing features based on scalability issues, we also routinely analyze the web server access logs, to assess how the site is used. From such analyses, we have identified that the functional similarity search, which used a similarity tool (22) to identify sets of related Pfam-A families based on functional annotation (Gene Ontology terms), was not being used. We have removed this search facility from the site.

IMPROVING ACCESS TO PROTEOME DATA

Before release 27.0, Pfam proteome data came from Integr8, a project that has now closed and whose data have been distributed to other EBI resources. We now obtain our complete proteome data directly from UniProt, at the beginning of the release cycle when the sequence database is retrieved. This has resulted in better consistency between the sequence sets, with 40% (9 423 167 sequences) of the 23 193 494 sequences in pfamseq belonging to a complete proteome. Over the past few years, we have received an increasing demand for proteome-centric Pfam data. The data-interface to the proteome data is an area of future development but, to satisfy one of our most common user queries, we now provide a list of all Pfam-A matches per proteome on our FTP site (ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/proteomes). Each list can also be accessed from the corresponding proteome's ‘domain composition’ tab on the proteome-pages in the website.

REPRESENTING INTRINSIC SEQUENCE DISORDER

Pfam often quotes ‘sequence coverage’ and ‘residue coverage’ as statistics for tracking the extent of annotation provided by the database. We have previously noted that achieving 100% residue coverage is an unrealistic goal, as every residue in a sequence does not form part of a conserved globular domain (23), such as signal peptides and domain linker regions (short regions are essential for interdomain interactions, folding and stability) (24–27). To aid in the identification of non-globular domain regions, we have displayed the predictions of signal peptides (28), low complexity (29) and coiled-coils (http://www.russelllab.org/cgi-bin/coils/coils-svr.pl) for many years. As part of recent, focused curation efforts aimed at increasing the Pfam-A coverage of the human proteome (10), it became apparent that many regions not covered by Pfam-A are predicted to be intrinsically disordered. Disorder is not an indicator of a lack of function; on the contrary, it has been shown to be involved in cell signaling, protein interactions and regulation (30–33). Some disordered regions are conserved and are found within existing domains, e.g. in PF03250 (Tropomodulin), but they generally appear to be less conserved and/or shorter than globular domains (10), making them more elusive to modeling in a conventional Pfam-A entry. Therefore, to provide a means of identifying more disordered regions in Pfam, we have incorporated IUPred predictions (34,35) (using the long disorder prediction option) for all pfamseq sequences. These data are stored in the MySQL database, and displayed graphically as grey boxes on the website graphical representation of a sequence, as in Figure 3. The IUPred disorder predictions supplement those already produced by SEG (29), which predict a single class of disorder. Although more common to eukaryotes, disordered regions are widespread in UniProtKB. In Pfam 27.0, there were 5.5 million IUPred disorder regions of 50 amino acids or more in length, corresponding to 5.6% of the 7.6 billion sequence residues in the database.

Figure 3.

Graphical representation of the Pfam sequence annotations for human tyrosine-protein kinase ABL1 sequence (UniProtKB accession P00519). This sequence matches four different Pfam-A entries, SH3_1 (PF00018), SH2 (PF00017), Pkinase_Tyr (PF007714) and F_actin-bind (PF08919). Between the Pkinase_Tyr and F_actin_bind families is a long region of disorder, indicated by the presence of the grey boxes on the sequence. A disorder prediction does not necessarily mean that the sequence is not conserved, highlighted by the presence of an overlapping Pfam-B region (striped box).

MAPPING PFAM-A ENTRIES TO PROTEIN STRUCTURES

A recurring issue, and one which is often raised in the literature (36) and by Pfam users, is the mapping of Pfam-A entries to PDB entries, a process that can provide 3D structural information for a protein family. This may seem like a trivial task, whereby one simply extracts all of the protein chains in all of the PDB entries and searches them against Pfam-A. However, although this approach works in principle, in practice it results in many omissions from the mapping. PDB entries frequently include only part of a sequence and the visible fragments are often simply too short to have matches to Pfam profile HMMs that are significant. For example, the crystal structure of the murine class I major histocompatibility antigen H-2D(B) has been determined in complex with a nine amino acid peptide derived from the LCMV gp33 protein (PDB identifier 1S7W) (37). Searching just the gp33 fragment against the Pfam-A models finds no hits. However, by using the residue mapping between PDB structures and UniProtKB entries provided by the SIFTS resource (38), we find that the fragment comes from a larger sequence, UniProtKB accession P07399, in a region that matches the Arena_glycoprot family (Pfam accession PF00798). This demonstrates the importance of using a comprehensive and accurate structure-to-sequence mapping, such as SIFTS, to unify structural and sequence information.

The caveat to the approach described earlier in the text is that structure, mapping and sequence data, from PDB, SIFTS and Pfam, respectively, must be time-synchronized. All resource providers are aware of the issues generated by multiple release cycles and our pipeline has been modified to ensure that, at the point of data acquisition, PDB, SIFTS and UniProt are as tightly synchronized as possible. However, as there is a steady flow of structures into the PDB every week and, since our data are often downloaded and frozen months before a release, it will almost always appear out of date. During the lifetime of a Pfam release, the disparity will become increasingly wide. One solution would be to pull this data in dynamically during a Pfam release, but we are opposed to this approach because we believe that the data in a given Pfam release should be fixed, to provide a stable data source for the community to cite. Should obtaining the latest Pfam-PDB annotation-mapping be paramount, both PDBe (39) and RCSB (40) offer tab-delimited files with the latest mappings (ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/flatfiles/csv/pdb_chain_pfam.csv.gz or http://www.rcsb.org/pdb/rest/hmmer?file=hmmer_pdb_all.txt). A better solution might be to make more frequent Pfam releases, thereby minimizing the data synchronization lags. Continued improvements in our release pipeline are designed to facilitate shorter release cycles in the future.

CONCLUSIONS

The core aim of Pfam is to produce protein families that reliably classify as much of sequence space as possible. The database continues to grow and evolve during 2013, with efforts concentrated on adding new families and improving existing ones, while also trying to make the core family data as accessible as possible. The growing sequence database is competing with this effort. We continue to focus attention on meeting the needs of our users, which are often highlighted by recurring user requests. Part of this effort is to identify and remove features that have not been useful to users. It is always tempting to add progressively more features to the database, but this would make it impossible to keep Pfam maintainable in the long term. However, we still encourage the Pfam user community to ask for data sets that are either not provided or not easily accessible. We are committed to producing more frequent releases, a process which may result in further changes to the database and website.

FUNDING

Howard Hughes Medical Institute Janelia Farm Research Campus (to R.D.F., J.C. and S.R.E); the European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI) (to A.B. and J.M.); Wellcome Trust [WT077044/Z/05/Z to P.C., R.Y.E., K.H., J.T. and M.P.]. Funding for open access charge: HHMI Janelia Farm Research Campus.

Conflict of interest statement. None declared.

REFERENCES

1
Krogh
A
Brown
M
Mian
IS
Sjölander
K
Haussler
D
Hidden Markov models in computational biology. Applications to protein modeling
J. Mol. Biol.
1994
, vol. 
235
 (pg. 
1501
-
1531
)
2
Eddy
SR
Profile hidden Markov models
Bioinformatics
1998
, vol. 
14
 (pg. 
755
-
763
)
3
UniProt Consortium
Reorganizing the protein space at the Universal Protein Resource (UniProt)
Nucleic Acids Res.
2012
, vol. 
40
 (pg. 
D71
-
D75
)
4
Eddy
SR
A new generation of homology search tools based on probabilistic inference
Genome Inform.
2009
, vol. 
23
 (pg. 
205
-
211
)
5
Eddy
SR
Accelerated profile HMM searches
PLoS Comput. Biol.
2011
, vol. 
7
 pg. 
e1002195
 
6
Finn
RD
Mistry
J
Schuster-Böckler
B
Griffiths-Jones
S
Hollich
V
Lassmann
T
Moxon
S
Marshall
M
Khanna
A
Durbin
R
, et al. 
Pfam: clans, web tools and services
Nucleic Acids Res.
2006
, vol. 
34
 (pg. 
D247
-
D251
)
7
Punta
M
Coggill
PC
Eberhardt
RY
Mistry
J
Tate
J
Boursnell
C
Pang
N
Forslund
K
Ceric
G
Clements
J
, et al. 
The Pfam protein families database
Nucleic Acids Res.
2012
, vol. 
40
 (pg. 
D290
-
D301
)
8
Bernstein
FC
Koetzle
TF
Williams
GJ
Meyer
EF
Brice
MD
Rodgers
JR
Kennard
O
Shimanouchi
T
Tasumi
M
The Protein Data Bank: a computer-based archival file for macromolecular structures
J. Mol. Biol.
1977
, vol. 
112
 (pg. 
535
-
542
)
9
Sillitoe
I
Cuff
AL
Dessailly
BH
Dawson
NL
Furnham
N
Lee
D
Lees
JG
Lewis
TE
Studer
RA
Rentzsch
R
, et al. 
New functional families (FunFams) in CATH to improve the mapping of conserved functional sites to 3D structures
Nucleic Acids Res.
2013
, vol. 
41
 (pg. 
D490
-
D498
)
10
Mistry
J
Coggill
P
Eberhardt
RY
Deiana
A
Giansanti
A
Finn
RD
Bateman
A
Punta
M
The challenge of increasing Pfam coverage of the human proteome
Database
2013
, vol. 
2013
 pg. 
bat023
 
11
Eberhardt
RY
Haft
DH
Punta
M
Martin
M
O'Donovan
C
Bateman
A
AntiFam: a tool to help identify spurious ORFs in protein annotation
Database
2012
, vol. 
2012
 pg. 
bas003
 
12
Waterhouse
AM
Procter
JB
Martin
DMA
Clamp
M
Barton
GJ
Jalview Version 2—a multiple sequence alignment editor and analysis workbench
Bioinformatics
2009
, vol. 
25
 (pg. 
1189
-
1191
)
13
Chen
C
Natale
DA
Finn
RD
Huang
H
Zhang
J
Wu
CH
Mazumder
R
Representative proteomes: a stable, scalable and unbiased proteome set for sequence analysis and functional annotation
PLoS One
2011
, vol. 
6
 pg. 
e18910
 
14
Price
MN
Dehal
PS
Arkin
AP
FastTree 2—approximately maximum-likelihood trees for large alignments
PLoS One
2010
, vol. 
5
 pg. 
e9490
 
15
Thompson
JD
Gibson
TJ
Higgins
DG
Multiple sequence alignment using ClustalW and ClustalX
Curr. Protoc. Bioinformatics
2002
 
Chapter 2, Unit 2.3
16
Finn
RD
Mistry
J
Tate
J
Coggill
P
Heger
A
Pollington
JE
Gavin
OL
Gunasekaran
P
Ceric
G
Forslund
K
, et al. 
The Pfam protein families database
Nucleic Acids Res.
2010
, vol. 
38
 (pg. 
D211
-
D222
)
17
Bateman
A
Birney
E
Durbin
R
Eddy
SR
Howe
KL
Sonnhammer
EL
The Pfam protein families database
Nucleic Acids Res.
2000
, vol. 
28
 (pg. 
263
-
266
)
18
Birney
E
Clamp
M
Durbin
R
GeneWise and genomewise
Genome Res.
2004
, vol. 
14
 (pg. 
988
-
995
)
19
NCBI Resource Coordinators
Database resources of the National Center for Biotechnology Information
Nucleic Acids Res.
2013
, vol. 
41
 (pg. 
D8
-
D20
)
20
Bateman
A
Coin
L
Durbin
R
Finn
RD
Hollich
V
Griffiths-Jones
S
Khanna
A
Marshall
M
Moxon
S
Sonnhammer
ELL
, et al. 
The Pfam protein families database
Nucleic Acids Res.
2004
, vol. 
32
 (pg. 
D138
-
D141
)
21
Coin
L
Bateman
A
Durbin
R
Enhanced protein domain discovery using taxonomy
BMC Bioinformatics
2004
, vol. 
5
 pg. 
56
 
22
Schlicker
A
Huthmacher
C
Ramírez
F
Lengauer
T
Albrecht
M
Functional evaluation of domain-domain interactions and human protein interaction networks
Bioinformatics
2007
, vol. 
23
 (pg. 
859
-
865
)
23
Sammut
SJ
Finn
RD
Bateman
A
Pfam 10 years on: 10,000 families and still growing
Brief. Bioinformatics
2008
, vol. 
9
 (pg. 
210
-
219
)
24
Gokhale
RS
Khosla
C
Role of linkers in communication between protein modules
Curr. Opin. Chem. Biol.
2000
, vol. 
4
 (pg. 
22
-
27
)
25
George
RA
Heringa
J
An analysis of protein domain linkers: their classification and role in protein folding
Protein Eng.
2002
, vol. 
15
 (pg. 
871
-
879
)
26
Wriggers
W
Chakravarty
S
Jennings
PA
Control of protein functional dynamics by peptide linkers
Biopolymers.
2005
, vol. 
80
 (pg. 
736
-
746
)
27
Reddy Chichili
VP
Kumar
V
Sivaraman
J
Linkers in the structural biology of protein-protein interactions
Protein Sci.
2013
, vol. 
22
 (pg. 
153
-
167
)
28
Käll
L
Krogh
A
Sonnhammer
ELL
A combined transmembrane topology and signal peptide prediction method
J. Mol. Biol.
2004
, vol. 
338
 (pg. 
1027
-
1036
)
29
Wootton
JC
Non-globular domains in protein sequences: automated segmentation using complexity measures
Comput. Chem.
1994
, vol. 
18
 (pg. 
269
-
285
)
30
Midic
U
Oldfield
CJ
Dunker
AK
Obradovic
Z
Uversky
VN
Unfoldomics of human genetic diseases: illustrative examples of ordered and intrinsically disordered members of the human diseasome
Protein Pept. Lett.
2009
, vol. 
16
 (pg. 
1533
-
1547
)
31
Babu
MM
van der Lee
R
de Groot
NS
Gsponer
J
Intrinsically disordered proteins: regulation and disease
Curr. Opin. Struct. Biol.
2011
, vol. 
21
 (pg. 
432
-
440
)
32
Tantos
A
Han
KH
Tompa
P
Intrinsic disorder in cell signaling and gene transcription
Mol. Cell. Endocrinol.
2012
, vol. 
348
 (pg. 
457
-
465
)
33
Buljan
M
Chalancon
G
Dunker
AK
Bateman
A
Balaji
S
Fuxreiter
M
Babu
MM
Alternative splicing of intrinsically disordered regions and rewiring of protein interactions
Curr. Opin. Struct. Biol.
2013
, vol. 
23
 (pg. 
443
-
450
)
34
Dosztányi
Z
Csizmok
V
Tompa
P
Simon
I
The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins
J. Mol. Biol.
2005
, vol. 
347
 (pg. 
827
-
839
)
35
Dosztányi
Z
Csizmok
V
Tompa
P
Simon
I
IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content
Bioinformatics
2005
, vol. 
21
 (pg. 
3433
-
3434
)
36
Xu
Q
Dunbrack
RL
Assignment of protein sequences to existing domain and family classification systems: Pfam and the PDB
Bioinformatics
2012
, vol. 
28
 (pg. 
2763
-
2772
)
37
Velloso
LM
Michaëlsson
J
Ljunggren
HG
Schneider
G
Achour
A
Determination of structural principles underlying three different modes of lymphocytic choriomeningitis virus escape from CTL recognition
J. Immunol.
2004
, vol. 
172
 (pg. 
5504
-
5511
)
38
Velankar
S
Dana
JM
Jacobsen
J
van Ginkel
G
Gane
PJ
Luo
J
Oldfield
TJ
O'Donovan
C
Martin
MJ
Kleywegt
GJ
SIFTS: structure integration with function, taxonomy and sequences resource
Nucleic Acids Res.
2013
, vol. 
41
 (pg. 
D483
-
D489
)
39
Velankar
S
Alhroub
Y
Best
C
Caboche
S
Conroy
MJ
Dana
JM
Fernandez Montecelo
MA
van Ginkel
G
Golovin
A
Gore
SP
, et al. 
PDBe: Protein Data Bank in Europe
Nucleic Acids Res.
2012
, vol. 
40
 (pg. 
D445
-
D452
)
40
Rose
PW
Bi
C
Bluhm
WF
Christie
CH
Dimitropoulos
D
Dutta
S
Green
RK
Goodsell
DS
Prlić
A
Quesada
M
, et al. 
The RCSB Protein Data Bank: new resources for research and education
Nucleic Acids Res.
2013
, vol. 
41
 (pg. 
D475
-
D482
)
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Comments

0 Comments
Submit a comment
You have entered an invalid code
Thank you for submitting a comment on this article. Your comment will be reviewed and published at the journal's discretion. Please check for further notifications by email.