3DCoffee@igs: a web server for combining sequences and structures into a multiple sequence alignment
1 Information Génomique et Structurale UPR2589-CNRS, CNRS, 31, Chemin Joseph Aiguier, 13 402 Marseille Cedex 20, France, 2 Swiss Institute of Bioinformatics, Lausanne University, Chemin des Boveresses, 1066 Epalinges, Switzerland and 3 hp High Performance Technical Computing Division, Hewlett Packard, BallyBrit, Galway, Ireland
* To whom correspondence should be addressed. Tel: +33 491 164 606; Fax: +33 491 164 549; Email: cedric.notredame{at}europe.com
Received February 14, 2004; Accepted March 16, 2004
| ABSTRACT |
|---|
This paper presents 3DCoffee@igs, a web-based tool dedicated to the computation of high-quality multiple sequence alignments (MSAs). 3D-Coffee makes it possible to mix protein sequences and structures in order to increase the accuracy of the alignments. Structures can be either provided as PDB identifiers or directly uploaded into the server. Given a set of sequences and structures, pairs of structures are aligned with SAP while sequence–structure pairs are aligned with Fugue. The resulting collection of pairwise alignments is then combined into an MSA with the T-Coffee algorithm. The server and its documentation are available from http://igs-server.cnrs-mrs.fr/Tcoffee/.
| INTRODUCTION |
|---|
The assembly of an accurate multiple sequence alignment (MSA) is a key step in many sequence analysis procedures. Examples include the identification of a protein signature such as a Prosite pattern (1), the building of a domain profile (or HMM) needed for identifying the most remote members of a protein family (2), structure prediction and homology modeling (3), and phylogenetic analysis (4). More recently, MSAs have also proven useful for characterizing nsSNPs (non-synonymous single nucleotide polymorphisms) (5,6).
The success of such applications depends very much on the MSA quality, hence the importance of accuracy when computing an alignment. In practice, structurally correct alignments are considered to be a good starting point for most MSA applications (with maybe the exception of phylogenetic reconstruction), and established collections of reference structural alignments are widely used to benchmark and train existing MSA packages (7,8). However, when state-of-the-art packages are applied to sets of distantly related sequences, they deliver alignments that are only partly correct from a structural point of view (8), thus suggesting that sequence-based alignment procedures can still be greatly improved. In the current situation, the best way to produce a high-quality MSA remains the assembly of a multiple structural alignment. Unfortunately, few examples exist where enough related structures are available to carry out such a task.
An elegant alternative to the use of many structures is to mix sequences and structures, in the hope that the 3D information contained within the structures will help deliver a better alignment of the other sequences. Such a mix also constitutes a realistic solution, considering the increasing proportion of sequences without a known structure and the decreasing proportion of protein families not associated with at least one structure. However, the problem of combining sequences and structures has not yet been extensively addressed, and only a handful of methods are available that allow the seamless combining of sequences and structures (9) while appropriately using 3D information.
Here we present 3DCoffee@igs, a web server especially designed to combine sequences and structures by seamlessly integrating in T-Coffee (10) the three types of alignment methods needed for this purpose: sequence–sequence, sequence–structure and structure–structure alignment methods. When using one or more structures, the alignments thus produced are more accurate than similar alignments based on sequence information alone, as judged by the comparison with reference structure-based alignments (O. O'Sullivan, K. Suhre, D. Higgins and C. Notredame, submitted for publication). The inclusion of a threading method (sequence–structure alignment) makes it possible to use as few as one structure.
| METHODS |
|---|
Standard T-Coffee sequence alignment assembly
We use T-Coffee to mix sequences and structures. Given a set of sequences, the regular T-Coffee procedure involves the computation of a collection of pairwise alignments where for each possible pair of sequences in the dataset, the program computes the best global alignment and the 10 best local alignments [using the Sim algorithm from the Lalign package (11)]. This collection of pairwise alignments is named a library. The second step of the procedure involves the assembly of an MSA with a high level of consistency with the alignments contained in the library. Since T-Coffee uses a heuristic, the optimality of this process is not guaranteed, although the results are generally satisfactory as judged by comparison with alternative optimization methods (12). The assembly procedure is very similar to that described for ClustalW (13); extensive details are available in the original publication (10).
3D-Coffee protocol
The 3D-Coffee protocol takes advantage of the method-independent manner in which T-Coffee uses its libraries. Rather than filling the library with sequence-based pairwise alignments only, 3D-Coffee compiles it using three types of pairwise methods: sequence–sequence, structure–structure and structure–sequence (threading) alignment procedures. From among the vast variety of structure comparison algorithms, we selected SAP (14) for the structure–structure alignments and Fugue (15) for the structure–sequence comparisons. A full validation of these choices is detailed elsewhere (O. O'Sullivan, K. Suhre, D. Higgins and C. Notredame, submitted for publication). Our main criteria were the relatively high accuracy of these two methods and their ease of integration within the T-Coffee framework. It is nonetheless worth pointing out that any method with similar characteristics (i.e. able to deliver a sequence alignment) could easily be added to the procedure we describe here.
In practice, given a sequence dataset, the program starts by identifying the sequences associated with a structure and those that are not. It then considers all the possible pairs and applies the appropriate methods to these pairs. For instance, given a pair of structures, the program will successively make a global pairwise alignment, a local pairwise sequence alignment and a structure-based sequence alignment with SAP. If only one of the two sequences has a known structure, Fugue will be used instead of SAP.
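The pair-by-pair choice of methods described above can be sketched as follows. This is only an illustration: the method labels (`global_pairwise`, `local_pairwise`, `sap`, `fugue`) and the function names are hypothetical and are not T-Coffee identifiers, and the sketch assumes that pairs without any structure receive the sequence-based methods only.

```python
from itertools import combinations


def methods_for_pair(has_structure_a, has_structure_b):
    """Return the (hypothetical) method labels applied to one pair."""
    # Sequence-based alignments are computed for every pair.
    methods = ["global_pairwise", "local_pairwise"]
    if has_structure_a and has_structure_b:
        # Both members have a structure: structure-structure alignment.
        methods.append("sap")
    elif has_structure_a or has_structure_b:
        # Exactly one member has a structure: threading.
        methods.append("fugue")
    return methods


def plan_library(has_structure):
    """Map every pair of names to the methods that would be run on it.

    has_structure: dict mapping a sequence name to True/False.
    """
    names = list(has_structure)
    return {
        (a, b): methods_for_pair(has_structure[a], has_structure[b])
        for a, b in combinations(names, 2)
    }


if __name__ == "__main__":
    plan = plan_library({"structA": True, "structB": True, "seqC": False})
    for pair, methods in plan.items():
        print(pair, methods)
```

With two structures and one plain sequence, the structure pair is assigned SAP, the two mixed pairs are assigned Fugue, and all three pairs keep the sequence-based methods.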
The resulting pairwise alignments are compiled into a list of weighted pairs of aligned residues found in the individual alignments. Each pair receives a weight equal to the average level of identity within the pairwise alignment where it occurred. When two or more alignments contribute the same pair, their respective weights are added to yield the final weight. The collection of weighted residue-pairs constitutes the T-Coffee library.
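A minimal sketch of this weighting scheme is given below. It assumes that pairwise alignments are supplied as gapped strings of equal length and that identity is measured over the columns in which both sequences hold a residue; the exact identity definition used by T-Coffee is not specified here, so that choice and all function names are assumptions.

```python
from collections import defaultdict


def percent_identity(row_a, row_b):
    """Percent identity over columns where both rows hold a residue."""
    aligned = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not aligned:
        return 0.0
    matches = sum(1 for x, y in aligned if x == y)
    return 100.0 * matches / len(aligned)


def residue_pairs(row_a, row_b):
    """Yield (i, j) residue indices that share an alignment column."""
    i = j = 0
    for x, y in zip(row_a, row_b):
        if x != "-" and y != "-":
            yield i, j
        if x != "-":
            i += 1
        if y != "-":
            j += 1


def build_library(pairwise_alignments):
    """Compile weighted residue pairs from (name_a, row_a, name_b, row_b)."""
    library = defaultdict(float)
    for name_a, row_a, name_b, row_b in pairwise_alignments:
        weight = percent_identity(row_a, row_b)
        for i, j in residue_pairs(row_a, row_b):
            # Pairs contributed by several alignments accumulate weight.
            library[(name_a, i, name_b, j)] += weight
    return dict(library)


if __name__ == "__main__":
    alignments = [
        ("s1", "AC-G", "s2", "ACTG"),  # e.g. from a structure-based method
        ("s1", "ACG-", "s2", "ACTG"),  # e.g. from a sequence-based method
    ]
    for pair, weight in sorted(build_library(alignments).items()):
        print(pair, round(weight, 2))
```

Residue pairs supported by both alignments end up with the sum of the two identity values, while pairs supported by a single alignment keep that alignment's weight only.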
T-Coffee uses the library to assemble a standard progressive alignment in a ClustalW-like manner. The program starts by computing the distance matrix of the sequences and uses it to estimate a guide tree. The guide tree controls the order in which the sequences are included one by one into the MSA. Each sequence is incorporated using the library in place of a substitution matrix. A recent modification of the T-Coffee algorithm (to be described elsewhere) has made it possible to significantly reduce the time complexity of the algorithm, down to O(N²L²) from the previously reported O(N³L²), N being the number of sequences and L their average length. However, in 3D-Coffee, SAP is the limiting factor, with a time complexity in the order of O(L³).
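To illustrate what using the library in place of a substitution matrix can mean, the sketch below aligns two sequences by dynamic programming, scoring each residue pair with its library weight (zero if absent). It is a simplified pairwise illustration rather than the progressive T-Coffee procedure: the absence of gap penalties, the non-negative weights and the function name are assumptions made for the example.

```python
def align_with_library(len_a, len_b, weights):
    """Maximize the summed library weight of matched residue pairs.

    weights: dict mapping (i, j) residue indices to a non-negative weight.
    Returns (best_score, list of matched (i, j) pairs in order).
    """
    # score[i][j]: best total for the first i residues of A and j of B.
    score = [[0.0] * (len_b + 1) for _ in range(len_a + 1)]
    for i in range(1, len_a + 1):
        for j in range(1, len_b + 1):
            match = score[i - 1][j - 1] + weights.get((i - 1, j - 1), 0.0)
            score[i][j] = max(match, score[i - 1][j], score[i][j - 1])

    # Trace back from the bottom-right corner to recover matched pairs.
    pairs = []
    i, j = len_a, len_b
    while i > 0 and j > 0:
        w = weights.get((i - 1, j - 1), 0.0)
        if w > 0 and score[i][j] == score[i - 1][j - 1] + w:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[len_a][len_b], pairs


if __name__ == "__main__":
    library = {(0, 0): 100.0, (1, 1): 100.0, (2, 3): 100.0, (2, 2): 66.0}
    print(align_with_library(3, 4, library))
```

Because the score comes only from the library, any pairwise method that contributed residue pairs, whether sequence-based or structure-based, influences the result in the same way.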
| USING THE TCOFFEE@IGS SERVER |
|---|
3D-Coffee is a new service that is available through the previously presented Tcoffee@igs server (16). It is maintained by IGS (Information Génomique et Structurale) and runs on a dedicated Alpha ES45 quadriprocessor server. It supports the analysis of up to 100 sequences of at most 2000 residues each.
The 3D-Coffee service is provided in two versions, a regular and an advanced version. The regular version requires limited input from the user while the advanced version offers more possibilities such as uploading personal PDB structures and controlling the methods used to compute the library.
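Before submitting a dataset, the size limits stated above (100 sequences, 2000 residues each) can be checked locally. The following is a small sketch under the assumption that sequences are held in a dictionary of ungapped strings; the function and constant names are illustrative and are not part of the server.

```python
MAX_SEQUENCES = 100   # limit stated for the server
MAX_RESIDUES = 2000   # limit stated for each sequence


def check_limits(sequences):
    """Return a list of human-readable problems (empty if none).

    sequences: dict mapping a sequence name to its residue string.
    """
    problems = []
    if len(sequences) > MAX_SEQUENCES:
        problems.append(
            f"{len(sequences)} sequences submitted, limit is {MAX_SEQUENCES}"
        )
    for name, residues in sequences.items():
        if len(residues) > MAX_RESIDUES:
            problems.append(
                f"{name} has {len(residues)} residues, limit is {MAX_RESIDUES}"
            )
    return problems


if __name__ == "__main__":
    dataset = {"short": "ACDEFGHIK", "long": "A" * 2001}
    for problem in check_limits(dataset):
        print(problem)
```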
Tcoffee@igs server
The homepage of the server (http://igs-server.cnrs-mrs.fr/Tcoffee/) contains pointers to the four types of computation performed:
- The Make a Multiple Alignment section leads to the standard computation of a T-Coffee MSA, using the default parameters of the program, as described in (10).
- The Evaluate a Multiple Alignment section provides an alignment evaluation using the CORE method as described in (17).
- The Combine Multiple Alignments section makes it possible to combine several alignments into one. The advanced section of each server offers extra control on the library computation (choice of the methods) as well as a larger number of output options. These servers have all been previously described in (16).
- The last section, Align Structures (3D-Coffee), is new and is described in the following sections.
Align structures and sequences with 3DCoffee::regular
The 3DCoffee::regular server takes as input a set of sequences in FASTA format. Among the sequences, those with a 3D structure must be named according to their PDB identifier. If the PDB file contains several chains, the chain index (letter or number) must be added to the name (e.g. 1pptA). If the sequence provided in the FASTA file is a subsequence of the indicated chain, T-Coffee aligns the provided sequence with its full PDB counterpart and makes sure that only the appropriate 3D information is used for alignment computation. This comparison also handles slight sequence discrepancies between the PDB and the user-provided sequence. In the regular mode of 3D-Coffee, the handling of the structures is entirely under T-Coffee control, which uses the FASTA information to gather the structures and trim them to the relevant portion. For users familiar with the stand-alone version of T-Coffee, we give the corresponding command line:
[Command line shown as an image in the original article; not reproduced here.]
The returned alignment is a sequence alignment, albeit generally improved by the use of structural information. Systematic benchmarking, carried out on a subset of HOMSTRAD (O. O'Sullivan, K. Suhre, D. Higgins and C. Notredame, submitted for publication), indicates that the accuracy of mixed sequences/structure alignments increases proportionally to the amount of structural information provided.
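The naming convention described in this section can be sketched as a small parser that separates FASTA names into a PDB identifier and an optional chain index. The sketch assumes the classic four-character PDB identifier (a digit followed by three alphanumeric characters) immediately followed by at most one chain character; whether the server accepts other separators or formats is not stated here, so the pattern and function names are assumptions.

```python
import re

# Assumed pattern: 4-character PDB identifier plus an optional chain index.
PDB_NAME = re.compile(r"^([0-9][A-Za-z0-9]{3})([A-Za-z0-9])?$")


def split_pdb_name(name):
    """Return (pdb_id, chain) if the name looks like a PDB entry, else None.

    chain is None when no chain index follows the identifier.
    """
    match = PDB_NAME.match(name)
    if match is None:
        return None
    pdb_id, chain = match.groups()
    return pdb_id.lower(), chain


def fasta_names(text):
    """Extract the first word of every FASTA header line."""
    names = []
    for line in text.splitlines():
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                names.append(words[0])
    return names


if __name__ == "__main__":
    fasta = ">1pptA\nGPSQPTYPGDDAPVEDLIRFYDNLQQYLNVVTRHRY\n>my_seq\nAPLEPVYPGDNATPEQ\n"
    for name in fasta_names(fasta):
        print(name, split_pdb_name(name))
```

Names that do not match the pattern are treated as sequences without a structure in this sketch; a name that merely resembles a PDB identifier would still need to be checked against the PDB itself.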
The 3DCoffee::advanced server
The advanced server makes it possible to upload user-defined PDB structures (up to three). The sequences of the uploaded structures should not be included within the FASTA sequences. The limitation to three private structures is arbitrary and will be increased upon request. In case the file contains more than one chain, the program extracts only the first one. It is the user's responsibility to provide the correct chain.
The advanced server also makes it possible to control the computation of the T-Coffee library by selecting the methods one wishes to include. For instance, if all the sequences have a known 3D structure, it is advisable to use only sap_pair, the structure–structure alignment method, to generate a structure-based MSA.
| CONCLUSION |
|---|
In this paper, we present 3D-Coffee, a major extension of the Tcoffee@igs server. This new feature of the server makes it possible to combine sequences and structures within an MSA, thus producing high-quality MSAs.
The method we present here is versatile and easy to use. It makes it possible to combine structure and sequence information, private and public data, seamlessly and without the need to install additional programs such as SAP and Fugue locally. It constitutes an adequate means to use available structural data efficiently. Future plans involve the addition of new modules that will make it easier to map structural information onto sequence data.
We strongly encourage users to send us their feedback.
| Notes |
|---|
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated.
| REFERENCES |
|---|
- Bairoch,A., Bucher,P. and Hofmann,K. (1997) The PROSITE database, its status in 1997. Nucleic Acids Res., 25, 217–221.
- Mulder,N.J., Apweiler,R., Attwood,T.K., Bairoch,A., Barrell,D., Bateman,A., Binns,D., Biswas,M., Bradley,P., Bork,P. et al. (2003) The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res., 31, 315–318.
- Jones,D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195–202.
- Phillips,A., Janies,D. and Wheeler,W. (2000) Multiple sequence alignment in phylogenetic analysis. Mol. Phylogenet. Evol., 16, 317–330.
- Ng,P.C. and Henikoff,S. (2002) Accounting for human polymorphisms predicted to affect protein function. Genome Res., 12, 436–446.
- Ramensky,V., Bork,P. and Sunyaev,S. (2002) Human non-synonymous SNPs: server and survey. Nucleic Acids Res., 30, 3894–3900.
- Thompson,J.D., Plewniak,F. and Poch,O. (1999) BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics, 15, 87–88.
- O'Sullivan,O., Zehnder,M., Higgins,D., Bucher,P., Grosdidier,A. and Notredame,C. (2003) APDB: a novel measure for benchmarking sequence alignment methods without reference alignments. Bioinformatics, 19, I215–I221.
- de Bakker,P.I., Bateman,A., Burke,D.F., Miguel,R.N., Mizuguchi,K., Shi,J., Shirai,H. and Blundell,T.L. (2001) HOMSTRAD: adding sequence information to structure-based alignments of homologous protein families. Bioinformatics, 17, 748–749.
- Notredame,C., Higgins,D.G. and Heringa,J. (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205–217.
- Huang,X. and Miller,W. (1991) A time-efficient, linear-space local similarity algorithm. Adv. Appl. Math., 12, 337–357.
- Notredame,C., Holm,L. and Higgins,D.G. (1998) COFFEE: an objective function for multiple sequence alignments. Bioinformatics, 14, 407–422.
- Thompson,J., Higgins,D. and Gibson,T. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4690.
- Taylor,W.R. and Orengo,C.A. (1989) Protein structure alignment. J. Mol. Biol., 208, 1–22.
- Shi,J., Blundell,T.L. and Mizuguchi,K. (2001) FUGUE: sequence–structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol., 310, 243–257.
- Poirot,O., O'Toole,E. and Notredame,C. (2003) Tcoffee@igs: a web server for computing, evaluating and combining multiple sequence alignments. Nucleic Acids Res., 31, 3503–3506.
- Notredame,C. and Abergel,C. (2003) Using multiple sequence alignments to assess the quality of genomic data. In Andrade,M. (ed.), Bioinformatics and Genomes: Current Perspectives. Horizon Scientific Press, Norfolk, UK, pp. 30–50.
- Gouet,P., Robert,X. and Courcelle,E. (2003) ESPript/ENDscript: extracting and rendering sequence and 3D information from atomic structures of proteins. Nucleic Acids Res., 31, 3320–3323.
- Kabsch,W. and Sander,C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen bonded and geometrical features. Biopolymers, 22, 2577–2637.