DOUTfinder: identification of distant domain outliers using subsignificant sequence similarity
1 Research Institute of Molecular Pathology (IMP), Dr Bohr-Gasse 7, A-1030 Vienna, Austria
2 Institute of Virology, Medical University of Vienna, Kinderspitalgasse 15, A-1095 Vienna, Austria
*To whom correspondence should be addressed. Tel: +43 1 7973 0556; Fax: +43 1 7987 153; Email: novatchkova{at}imp.univie.ac.at
Received February 15, 2006. Revised March 4, 2006. Accepted April 14, 2006.
| ABSTRACT |
|---|
DOUTfinder is a web-based tool facilitating protein domain detection among related protein sequences in the twilight zone of sequence similarity. The sequence set required for this analysis can be provided by the user or, if a single sequence is given as input, is collected using PSI-BLAST. The obtained sequence family is analyzed for known Pfam and SMART domains, and the subsignificant domain similarities identified in this way are evaluated further. Domains with several subthreshold hits in the query set are ranked based on a sum-score function, and likely homologous domains are suggested according to established cut-offs. By providing a post-filtering procedure for subsignificant domain hits, DOUTfinder allows the detection of non-trivial domain relationships and can thereby lead to new insights into the function and evolution of distantly related sequence families. DOUTfinder is available at http://mendel.imp.ac.at/dout/.
| INTRODUCTION |
|---|
Domains are evolutionarily conserved building blocks within protein sequences, which typically represent discrete structural and functional units. As domains have been repeatedly duplicated and reused during evolution, two-thirds of all known proteins can be reliably assigned to at least one of several thousand already characterized domain families, which provides an initial indication of molecular or cellular function (1).
Nonetheless, the theoretically possible coverage of domain-based annotation is likely not yet fully exploited. It has been suggested that around 90% of whole-proteome residues participate in globular domains (2). In contrast, only 50% of residues from all proteins can be assigned to a known domain to date (1). The globular regions that remain unannotated by domain-based searches are in part distantly related to known domains and are therefore distant outliers of characterized domain families. Such domain outliers (DOUTs) represent true homologs of those families but have diverged too far from the described consensus to be hit significantly in profile-based searches of a sequence against the Pfam (3) or SMART (4) domain databases. DOUTs are often found as false-negative similarities in the twilight zone of homology searches. It is therefore common practice to analyze subsignificant domain hits individually and evaluate them on the basis of additional knowledge. DOUTfinder was developed to facilitate this latter analysis step by providing a homology-backed procedure for post-filtering of relevant subthreshold hits. In the following, we demonstrate the ability of this tool to efficiently separate a fraction of potential true domain similarities from the noise in the twilight zone of similarity searches. For this purpose, we introduce a scoring scheme for the evaluation of subsignificant domain hits in a group of homologs and calibrate it using a widely applied distant homology control set: the ASTRAL SCOP 1.69 database (5).
| THE METHOD: SPOTTING DISTANT DOMAIN HOMOLOGS |
|---|
Commonly used domain database search facilities, such as Pfam, SMART and CDD (6), provide extremely reliable domain annotations when run with default threshold settings. To increase search sensitivity, it is recommended to use relaxed thresholds and to evaluate the obtained results individually in a subsequent step. This post-filtering process is typically performed using additional information, such as functional, contextual and taxonomic data (7). It is also customary practice to support subsignificant domain hits by likewise subthreshold matches among clear sequence homologs of the initial query. We have challenged the latter approach in a test case based on the SCOP protein classification and defined conditions under which the co-occurrence of subsignificant domain hits within a protein family is a reliable measure of similarity to that domain.
SCOP is a database commonly used in the evaluation of distant sequence comparison tools, as it provides a hierarchy of proteins beyond obvious sequence similarities. SCOP classifies structural domains within proteins into four hierarchical levels: families, superfamilies, folds and classes. Protein families consist of closely related sequences and are further grouped into superfamilies of presumed monophyletic origin. Folds subsume superfamilies with common topology and unclear evolutionary relationship. We used the ASTRAL SCOP 1.69 dataset and supplemented sparsely populated protein families using the 30 nearest sequence neighbors identified by BLAST against an 80% non-redundant UniRef variant (8,9). Protein families were analyzed for significant (E ≤ 0.005) as well as subsignificant (E > 0.005) domain hits using RPS-BLAST against the Pfam database (10). Co-occurrence of two or more subsignificant hits within a family could be either supported, by a significant domain hit in any other family of the same superfamily, or disproved, if a significant domain hit appeared in another fold and never in the fold the family belongs to. Ninety-four percent of all disproved domain similarities with an expect value between 0.005 and 20 showed a domain coverage of 0.4 or less (84% showed 0.3 or less), where the domain coverage is the length of the aligned domain segment divided by the domain's consensus length. To avoid overpopulation of the analysis by false-positive hits, the default domain-coverage threshold for considering subsignificant domain hits is set to 0.4 at the DOUTfinder web server.
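The E-value and coverage criteria described above can be illustrated with a minimal Python sketch. It is not the server's code: the `DomainHit` record, the function names and the choice of a strict "greater than 0.4" coverage test are assumptions made for illustration, while the numeric thresholds (0.005, 20 and 0.4) are the ones quoted in the text.

```python
from dataclasses import dataclass

SIGNIFICANT_E = 0.005  # hits with E <= 0.005 count as significant (from the text)
MAX_E = 20.0           # upper end of the subsignificant E-value range quoted in the text
MIN_COVERAGE = 0.4     # default domain-coverage threshold quoted for the web server


@dataclass
class DomainHit:
    """Hypothetical record for one domain hit; the field names are assumptions."""
    domain: str
    evalue: float
    aligned_length: int    # length of the aligned domain segment
    consensus_length: int  # length of the domain consensus


def domain_coverage(hit: DomainHit) -> float:
    """Aligned domain segment length divided by the domain consensus length."""
    return hit.aligned_length / hit.consensus_length


def classify_hit(hit: DomainHit, min_coverage: float = MIN_COVERAGE) -> str:
    """Label a hit as 'significant', 'subsignificant' or 'discarded'.

    Assumption: a subsignificant hit is kept only if its coverage is
    strictly greater than the threshold.
    """
    if hit.evalue <= SIGNIFICANT_E:
        return "significant"
    if hit.evalue <= MAX_E and domain_coverage(hit) > min_coverage:
        return "subsignificant"
    return "discarded"
```

With these definitions, a hit with E = 0.5 that aligns to 70 of 140 consensus positions (coverage 0.5) is kept as subsignificant, whereas a hit with E = 3 and coverage 0.3 is discarded.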
The D-sum score was introduced as a quantitative estimate rating the quality of a domain outlier prediction for a protein family F with multiple subsignificant hits of a domain D. The D-score interprets the sum-scores, Sr, for all occurrences, r, of a domain D within a family F, normalized by λ and K (11), and penalizes for the size of the search space, mn, where n is the database length. The query length, m, and the constants λ and K are calculated for a concatenation product of all residues within family F. A reward proportional to the sum of the average domain coverage, Cr, and the ratio of domain instances to the number of all proteins, N, within the family is applied.
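To make the verbal description of the score more concrete, the following Python sketch shows one possible reading of it. It is explicitly not the published D-score formula: the additive combination of the terms, the per-hit normalization as lam * S - ln(K), and the `reward_weight` parameter are all assumptions, as are the function and argument names.

```python
import math


def d_score_sketch(raw_scores, coverages, n_family_proteins,
                   lam, K, m, n, reward_weight=1.0):
    """Illustrative sum-score for one domain D within one family F.

    raw_scores        : raw alignment scores of the subsignificant hits of D
    coverages         : domain coverage of each of those hits (same order)
    n_family_proteins : number of proteins N in the family
    lam, K            : statistical constants used to normalize raw scores
    m, n              : query length and database length (search space m * n)
    reward_weight     : hypothetical proportionality constant for the reward
    """
    if not raw_scores or len(raw_scores) != len(coverages):
        raise ValueError("need one coverage value per hit and at least one hit")
    # Normalize each raw score and sum over all occurrences of the domain.
    normalized_sum = sum(lam * s - math.log(K) for s in raw_scores)
    # Penalize for the size of the search space.
    penalty = math.log(m * n)
    # Reward: average coverage plus the fraction of family members carrying the domain.
    mean_coverage = sum(coverages) / len(coverages)
    instance_ratio = len(raw_scores) / n_family_proteins
    reward = reward_weight * (mean_coverage + instance_ratio)
    return normalized_sum - penalty + reward
```

In this reading, additional subsignificant hits of the same domain raise the score, a larger search space lowers it, and families in which the domain is found with good coverage in many members are favored.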
The validity of the approach and of the defined cut-offs was further evaluated by a DOUTfinder analysis of 1462 domains of unknown function (DUF) derived from the Pfam18 dataset, of which 1434 retained more than one sequence after redundancy removal at an 80% identity threshold. Analysis of the subsignificant domain hits of these DUF families resulted in the suggestion of around 80 probable and around 20 potential domain outliers. In approximately 20% of these cases the prediction could be confirmed by a PSI-BLAST link between the domains. Approximately 35% of the DUF similarities to other Pfam domains were also detected by the profile-profile-based Clans assignment provided since Pfam19 (3). A complete listing of the suggested domain similarities can be accessed on the DOUTfinder website. The agreement of DOUTfinder predictions with Clans relationships, and the even higher sensitivity in more than half of the established cases of DUF domain relationships, indicate that DOUTfinder is a useful complement to other available methods. It should be noted that, as pointed out in the original Clans report, Pfam PRC profile-profile comparison (http://supfam.mrc-lmb.cam.ac.uk/PRC/) with its current settings has not yet reached its maximal sensitivity.
Example of usage: DOUTfinder single sequence analysis
The DOUTfinder web server implements the analysis of subsignificant domain hits using a D-score as described above. Two types of input are accepted: either a family of homologous sequences collected by the user, or a single sequence, which is used to collect homologous segments from a non-redundant protein database. For the following illustrations the full-length human IL17/SEF receptor protein (AAM74077) is used as a single-sequence input to DOUTfinder. In this example the analysis of subsignificant domain hits of SEF family members identifies an intracellular region of similarity to the Toll/interleukin-1 receptor (TIR) homology domain, in agreement with previous observations (12).
When a single input is provided, DOUTfinder automatically runs a series of successive steps that require no further user intervention: retrieval of a homologous sequence set, its domain analysis, and the identification of domain outliers. With the default parameters, the set-collecting tool of DOUTfinder applies two rounds of PSI-BLAST search against a non-redundant database to obtain segments with IL17/SEF homology (13). This non-redundant database is generated from NCBI nr (at various levels of non-redundancy) as well as Pfam and SMART domain sequences (14). The initial PSI-BLAST step can therefore also be used to link the submitted sequence to known domains via a logically inverted profile-based search, in which the query protein provides the profile against which the individual domain sequences are matched. Upon completion of the PSI-BLAST search, this initial protein set is filtered to 80% non-redundancy, a setting which is user-adjustable (8). The obtained representative sequences are filtered using the optionally applied COILS and HMMTOP algorithms (15,16) and supplied to domain analysis using RPS-BLAST in a search against the SMART (4) and Pfam (3) databases. Domains are evaluated based on their score, and graphical and textual reports are prepared.
Example output
The output of DOUTfinder domain analysis consists of a tabular and a graphical part. In the graphical part proteins are represented as bars and domains are color-coded according to the similarity category they belong to: (i) significant RPS-BLAST similarity: red boxes; (ii) subthreshold hits supported by a significant hit somewhere else in the homologous set: orange boxes; (iii) probable domain outliers with a D-score above the 5% error limit: blue boxes; (iv) potential domain outliers with a D-score above the 10% error limit: cyan boxes; (v) other domains found more than once: gray boxes; (vi) single-occurrence domains: white boxes. Mouse-over functions provide additional information on the domains and link to the original domain databases. In the example of SEF homologs, twilight zone similarities (orange) to the fibronectin type 3 domain can be supported by a significant hit in one of the proteins in the set (red) (Figure 2B). The TIR domain is identified as a probable domain outlier with five subsignificant hits in five of the analyzed 20 sequences.
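The six display categories can be read as an ordered decision rule, sketched below in Python. This is an illustration only: the function name, its arguments and the order in which the tests are applied are assumptions, and the numeric D-score cut-offs for the 5% and 10% error limits are not given in the text, so they are passed in as parameters.

```python
def similarity_category(hit_evalue, significant_elsewhere, occurrences,
                        d_score, cutoff_5pct, cutoff_10pct,
                        significant_e=0.005):
    """Return (category, color) for one domain hit in one protein.

    hit_evalue            : E-value of this hit
    significant_elsewhere : True if the same domain has a significant hit
                            in another protein of the homologous set
    occurrences           : number of times the domain was found in the set
    d_score               : D-score of the domain in the set, or None
    cutoff_5pct/10pct     : hypothetical D-score cut-offs for the 5% and 10%
                            error limits (assumed: cutoff_5pct >= cutoff_10pct)
    """
    if hit_evalue <= significant_e:
        return ("significant similarity", "red")
    if significant_elsewhere:
        return ("supported subthreshold hit", "orange")
    if d_score is not None and d_score > cutoff_5pct:
        return ("probable domain outlier", "blue")
    if d_score is not None and d_score > cutoff_10pct:
        return ("potential domain outlier", "cyan")
    if occurrences > 1:
        return ("other recurrent domain", "gray")
    return ("single occurrence", "white")
```

For instance, with placeholder cut-offs of 10 and 5, a domain seen five times with a D-score of 12 and no significant hit anywhere in the set would be drawn in blue, as described for the TIR domain in the SEF example.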
In addition to the graphical output, the identified domain similarities are also presented in two types of tabular output, which are structured and colored analogously to the graphical one. The short tabular output provides comprehensive functional annotation of the domains, which supports the fast evaluation of the obtained hits (Figure 2A). Further expert evaluation is assisted by the PSI-BLAST keyword assessment, a PSI-BLAST domain hit evaluation and a listing of those domains within the set that belong to the same Pfam clan. This information is provided below the short domain summary where applicable. An extensive listing of the obtained domain hits is provided in the second tabular output. Various links allow fast switching between the result sections.
| CONCLUSIONS |
|---|
Sensitive domain detection typically relies on the use of curated consensus representations of known protein families, such as PSSMs and profile HMMs (17). The indisputable advantage of these approaches compared to pairwise sequence comparison lies in the integrated description of multiple sequence information in one statistical representation. However, a single domain model will likely be less sensitive in uncovering atypical homologs (18), which can arise in families with differing evolutionary speed and high diversification into a non-homogeneous sequence space. The sensitivity of a profile-based search will also be hampered by domain definitions that are based on a small domain family with few members, that are represented in one taxon only, or that describe features too short to be aligned correctly. In such cases biologically relevant twilight zone similarities can remain below recommended significance thresholds. By analyzing such subsignificant relationships, DOUTfinder can identify distant sequence similarities and potentially lead to true remote homologs that could otherwise have been missed.
| ACKNOWLEDGEMENTS |
|---|
The authors are grateful for generous support from Boehringer Ingelheim. This project has been partly funded by the Austrian Gen-AU bioinformatics integration network. Funding to pay the Open Access publication charges for this article was provided by the Research Institute of Molecular Pathology.
Conflict of interest statement. None declared.
| REFERENCES |
|---|
1. Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L., et al. (2004) The Pfam protein families database. Nucleic Acids Res., 32, D138-D141.
2. Copley, R.R., Doerks, T., Letunic, I., Bork, P. (2002) Protein domain analysis in the era of complete genomes. FEBS Lett., 513, 129-134.
3. Finn, R.D., Mistry, J., Schuster-Bockler, B., Griffiths-Jones, S., Hollich, V., Lassmann, T., Moxon, S., Marshall, M., Khanna, A., Durbin, R., et al. (2006) Pfam: clans, web tools and services. Nucleic Acids Res., 34, D247-D251.
4. Letunic, I., Copley, R.R., Pils, B., Pinkert, S., Schultz, J., Bork, P. (2006) SMART 5: domains in the context of genomes and networks. Nucleic Acids Res., 34, D257-D260.
5. Chandonia, J.M., Hon, G., Walker, N.S., Lo, C.L., Koehl, P., Levitt, M., Brenner, S.E. (2004) The ASTRAL Compendium in 2004. Nucleic Acids Res., 32, D189-D192.
6. Marchler-Bauer, A. and Bryant, S.H. (2004) CD-Search: protein domain annotations on the fly. Nucleic Acids Res., 32, W327-W331.
7. Coin, L., Bateman, A., Durbin, R. (2004) Enhanced protein domain discovery using taxonomy. BMC Bioinformatics, 5, 56.
8. Li, W., Jaroszewski, L., Godzik, A. (2001) Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17, 282-283.
9. Wu, C.H., Apweiler, R., Bairoch, A., Natale, D.A., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., et al. (2006) The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res., 34, D187-D191.
10. Marchler-Bauer, A., Anderson, J.B., Cherukuri, P.F., DeWeese-Scott, C., Geer, L.Y., Gwadz, M., He, S., Hurwitz, D.I., Jackson, J.D., Ke, Z., et al. (2005) CDD: a Conserved Domain Database for protein classification. Nucleic Acids Res., 33, D192-D196.
11. Altschul, S.F. (1997) Evaluating the statistical significance of multiple distinct local alignments. In Suhai, S. (Ed.), Theoretical and Computational Methods in Genome Research, pp. 1-14.
12. Novatchkova, M., Leibbrandt, A., Werzowa, J., Neubuser, A., Eisenhaber, F. (2003) The STIR-domain superfamily in signal transduction, development and immunity. Trends Biochem. Sci., 28, 226-229.
13. Schaffer, A.A., Aravind, L., Madden, T.L., Shavirin, S., Spouge, J.L., Wolf, Y.I., Koonin, E.V., Altschul, S.F. (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res., 29, 2994-3005.
14. Wheeler, D.L., Barrett, T., Benson, D.A., Bryant, S.H., Canese, K., Chetvernin, V., Church, D.M., DiCuccio, M., Edgar, R., Federhen, S., et al. (2006) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 34, D173-D180.
15. Tusnady, G.E. and Simon, I. (2001) The HMMTOP transmembrane topology prediction server. Bioinformatics, 17, 849-850.
16. Lupas, A. (1996) Prediction and analysis of coiled-coil structures. Meth. Enzymol., 266, 513-525.
17. Park, J., Karplus, K., Barrett, C., Hughey, R., Haussler, D., Hubbard, T., Chothia, C. (1998) Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol., 284, 1201-1210.
18. Murvai, J., Vlahovicek, K., Barta, E., Szepesvari, C., Acatrinei, C., Pongor, S. (1999) The SBASE protein domain library, release 6.0: a collection of annotated protein sequence segments. Nucleic Acids Res., 27, 257-259.