
Nucleic Acids Research, 2003, Vol. 31, No. 13 3829-3832
© 2003 Oxford University Press

PipeAlign: a new toolkit for protein family analysis

Frédéric Plewniak, Laurent Bianchetti, Yann Brelivet, Annaick Carles, Frédéric Chalmel, Odile Lecompte, Thiebaut Mochel, Luc Moulinier, Arnaud Muller, Jean Muller, Veronique Prigent, Raymond Ripp, Jean-Claude Thierry, Julie D. Thompson, Nicolas Wicker and Olivier Poch*

Laboratoire de Biologie et Génomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire, (CNRS/INSERM/ULP), BP 10142, 67404 Illkirch Cedex, France

*To whom correspondence should be addressed. Tel: +33 388653294; Fax: +33 388653276; Email: poch{at}titus.u-strasbg.fr

Received February 4, 2003; Revised and Accepted March 17, 2003


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 SOFTWARE MODULES
 USER INTERFACE
 QUALITY CONTROL
 PERSPECTIVES
 REFERENCES
 
PipeAlign is a protein family analysis tool integrating a five-step process ranging from the search for sequence homologues in protein and 3D structure databases to the definition of the hierarchical relationships within and between subfamilies. The complete, automatic pipeline takes a single sequence or a set of sequences as input and constructs a high-quality, validated MACS (multiple alignment of complete sequences) in which sequences are clustered into potential functional subgroups. For the more experienced user, the PipeAlign server also provides numerous options to run only part of the analysis and to modify the default parameters of each software module. For example, the user can choose to enter an existing multiple sequence alignment for refinement, validation and subsequent clustering of the sequences. The aim is to provide an interactive workbench for the validation, integration and presentation of a protein family, not only at the sequence level, but also at the structural and functional levels. PipeAlign is available at http://igbmc.u-strasbg.fr/PipeAlign/.


    INTRODUCTION
 
Protein sequence analysis is a key issue in post-genomic biology. High-throughput genome sequencing and assembly techniques, structural proteomics and gene expression analysis have led to a rapid increase in the amount of sequence, structure and functional data available in the public databases. In order to fully understand the biological role of a particular protein, such diverse information as cellular location, 2D/3D structures, mutations and their associated illnesses, the evolutionary context and literature references must be retrieved, validated, classified and made available to the biologist. The integration of the protein in the context of the complete family is an essential first step in the analysis process. As a consequence, a new generation of protein family analysis tools is now required to organise this heterogeneous, often predicted data into a structured, hierarchical network of connected information.

Here, we present the PipeAlign web server, which offers an integrated, interactive approach to protein family analysis. The rationale of the PipeAlign design is the automation of the initial stages of the analysis process, i.e. the retrieval of homologous sequences and other related information and the hierarchical organisation of this information in the context of a multiple alignment of complete sequences (MACS) (1). As the MACS presents a synthetic view of the variability along a sequence and among homologous sequences, it represents an ideal workbench for the integration of the heterogeneous, often predicted data associated with each of the different members of a protein family. In the context of the MACS, the information can be statistically validated, classified and reliably propagated at either the family or the subfamily level, as appropriate.


    SOFTWARE MODULES
 
PipeAlign takes either (i) a single protein sequence, (ii) a set of unaligned sequences or (iii) a set of aligned sequences as input and automatically runs a cascade of five sequence analysis steps, based on programs recently developed in-house (Fig. 1). The first task is to perform initial Ballast processing (2), including BLAST database searches (3) and subsequent delineation of the local maximum conserved segments (LMSs). A high-quality MACS of potential homologues is then constructed using the DbClustal multiple alignment program (4) and the RASCAL alignment analysis and correction program (5). Quality validation of the MACS and removal of any sequences that do not belong to the protein family are performed using the NorMD objective function (6). Finally, the sequences are clustered into potential functional subfamilies using one of two different, complementary programs: the Secator program (7) is used by default, and the DPC program (8) is offered as an optional alternative. Each of the five core programs in PipeAlign is an independent software module and the web interface includes separate entry and exit points at each step in the pipeline.
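
The idea of a cascade of independent modules with separate entry and exit points can be illustrated with a minimal Python sketch. This is not the PipeAlign implementation: the stage names simply follow the order given above, while `make_placeholder`, `run_pipeline` and the dictionary-based data passing are hypothetical stand-ins chosen for illustration, and no real program is invoked.

```python
from typing import Callable, Dict, List, Optional

# Stage order as described in the text. The callables attached to these names
# below are hypothetical placeholders, not the real programs.
STAGES: List[str] = ["ballast", "dbclustal", "rascal", "normd", "clustering"]

Stage = Callable[[dict], dict]


def make_placeholder(name: str) -> Stage:
    """Return a stand-in stage that only records that it was run."""
    def stage(data: dict) -> dict:
        result = dict(data)
        result["history"] = data.get("history", []) + [name]
        return result
    return stage


def run_pipeline(data: dict, stages: Dict[str, Stage],
                 entry: str = "ballast",
                 exit_after: Optional[str] = None) -> dict:
    """Run the stages from `entry` through `exit_after`, both inclusive."""
    first = STAGES.index(entry)
    last = STAGES.index(exit_after) if exit_after else len(STAGES) - 1
    if last < first:
        raise ValueError("exit stage precedes entry stage")
    for name in STAGES[first:last + 1]:
        data = stages[name](data)
    return data


stages = {name: make_placeholder(name) for name in STAGES}

# Fully automatic mode: a single sequence goes through all five steps.
full = run_pipeline({"input": "single sequence"}, stages)

# Partial run: an existing alignment enters at the refinement step.
partial = run_pipeline({"input": "existing alignment"}, stages, entry="rascal")

print(full["history"])
print(partial["history"])
```

Keeping each stage behind the same call signature is what makes it possible to start or stop at any step; the real modules presumably exchange sequence and alignment files rather than dictionaries.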



 
Figure 1. The PipeAlign protein family analysis process.

 

    USER INTERFACE
 
The PipeAlign web server is designed to work in two basic modes (Fig. 1). In the fully automatic mode, a user enters a single sequence or set of unaligned sequences and the complete pipeline of five software modules is launched. When more than one sequence is input, options are provided to align only the user's sequences or to include any additional homologues detected by a database search. In either case, the final result is a high-quality MACS of the protein family, which can be viewed with the interactive, graphical browser (Fig. 2). The sequences in the MACS are colour-coded according to distinct properties in order to highlight conserved residues, and individual family members are clustered into subgroups. Links are provided to relevant information mined during the PipeAlign process, e.g. the local pairwise alignments produced by BLAST and the LMSs deduced by Ballast, the full sequence information in the SWISS-PROT/SpTrEMBL (9) databases, as well as 3D structural information in the PDB database (10) when available. Figure 2 shows some of the results available for a typical automatic protein sequence analysis with PipeAlign. In this case, a yeast mRNA guanylyltransferase protein (SWISS-PROT identifier MCE1_YEAST) was used as the query sequence. Thirty-four sequences were automatically selected from the BLAST database searches for alignment with DbClustal, of which three DNA ligase sequences were considered to be unrelated to the query and were subsequently removed from the alignment. The remaining 31 sequences were then clustered into three subgroups: the first mainly composed of kinetoplastidae, bacteria and viruses; the second consisting of metazoa and fungi; and a third subgroup of plants. This automatic process is useful for initial analyses of proteins of unknown function and is particularly suitable for high-throughput, automatic systems, such as genome annotation.
However, the PipeAlign web server also provides a more flexible approach, in which the user can choose to enter the pipeline at any one of the five different stages in the PipeAlign process. For example, by starting the PipeAlign process at the DbClustal entry point, it is possible to review the results of the database search and to manually select the set of homologues to be included in the MACS. In addition, PipeAlign provides a number of options for the refinement, validation and clustering of existing multiple sequence alignments, either from other automatic methods or manually edited.



 
Figure 2. Screenshots of a typical PipeAlign automa.

 
The complexity of the PipeAlign analysis process means that the default parameters used at each stage are not necessarily the most suitable for the particular protein family studied. Both the choice of query sequence for the initial database search and the threshold used to select proteins for inclusion in the final multiple alignment are crucial to the success of the PipeAlign analysis. The web server therefore allows the biologist to review the results of each step in the process, to modify certain key parameters and, if necessary, to launch the subsequent software modules. This allows the PipeAlign results to be evaluated and, where needed, corrected, and facilitates the analysis and integration of the consequent functional, structural and evolutionary inferences. Such detailed analyses based on MACS can yield important new structural and/or functional insights and form the basis for new hypotheses that can then be tested experimentally. The accuracy and reliability of the PipeAlign analysis have recently been exploited in a number of different projects, from the comparison of three complete genomes of hyperthermophilic archaea (11) and the semi-automatic annotation of the Pyrococcus abyssi genome (12) to the in-depth study of ribosomal genes in 66 different complete genomes (13).


    QUALITY CONTROL
 
An important aspect of the PipeAlign design is the incorporation of quality control procedures at each stage in the analysis process. The initial processing of the BLAST database search results by Ballast uses only those sequences with high significance scores (E<0.1) to construct the conservation profile from which the LMSs are detected. The LMSs represent locally conserved segments which can be used as reliable anchor points for the DbClustal multiple alignment program. The result is a high-quality global multiple alignment of the full-length sequences, even for highly divergent sequences with large N/C-terminal extensions or internal insertions. Nevertheless, local misalignments can still occur and, for this reason, the RASCAL (rapid scanning and correction of alignments) program is used to detect potentially misaligned regions and refine them. The final validation of the MACS is performed using the NorMD objective function. Any sequences in the MACS which are considered to be unrelated to the initial query sequence are removed at this stage.
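
The first quality control step, restricting the conservation profile to database hits with E<0.1, amounts to a simple threshold filter. The Python sketch below illustrates only that selection; the `Hit` record, the function name `select_profile_hits` and the example identifiers and E-values are invented for illustration and do not reflect the Ballast code or its input format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    """A database search hit (hypothetical minimal record)."""
    subject_id: str
    evalue: float


def select_profile_hits(hits: List[Hit], max_evalue: float = 0.1) -> List[Hit]:
    """Keep hits whose E-value is strictly below the threshold (E<0.1 in the text)."""
    return [h for h in hits if h.evalue < max_evalue]


# Invented example values, not real search results.
hits = [Hit("seqA", 1e-40), Hit("seqB", 0.05), Hit("seqC", 0.1), Hit("seqD", 3.0)]
kept = select_profile_hits(hits)
print([h.subject_id for h in kept])  # ['seqA', 'seqB']
```

Hits above the threshold are only excluded from profile construction in this sketch; whether they are later aligned is a separate decision made further down the pipeline.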


    PERSPECTIVES
 
While the parameters used in the PipeAlign modules have been selected to be suitable for the majority of protein families, further studies are required to determine the optimal parameters for special cases, such as proteins with a biased residue composition. Future developments will also include a hierarchical conservation analysis package and integration with a 3D display. A data mining module is also being developed to provide access to external databases such as the InterPro database (14). This will allow the user to visualise a selection of properties, including known protein domains, motifs and secondary structures.


    ACKNOWLEDGEMENTS
 
This work was supported by funds from the Institut National de la Santé et de la Recherche Médicale, the Centre National de la Recherche Scientifique, the Hôpital Universitaire de Strasbourg and the Fonds National de la Science (GENOPOLE).


    REFERENCES
 

  1. Lecompte,O., Thompson,J.D., Plewniak,F., Thierry,J. and Poch,O. (2001) Multiple alignment of complete sequences (MACS) in the post-genomic era. Gene, 270, 17–30.

  2. Plewniak,F., Thompson,J.D. and Poch,O. (2000) Ballast: blast post-processing based on locally conserved segments. Bioinformatics, 16, 750–759.

  3. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

  4. Thompson,J.D., Plewniak,F., Thierry,J. and Poch,O. (2000) DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Res., 28, 2919–2926.

  5. Thompson,J.D., Plewniak,F., Thierry,J. and Poch,O. (2003) RASCAL: Rapid scanning and correction of multiple sequence alignment programs. Bioinformatics, in press.

  6. Thompson,J.D., Plewniak,F., Ripp,R., Thierry,J.C. and Poch,O. (2001) Towards a reliable objective function for multiple sequence alignments. J. Mol. Biol., 314, 937–951.

  7. Wicker,N., Perrin,G.R., Thierry,J.C. and Poch,O. (2001) Secator: a program for inferring protein subfamilies from phylogenetic trees. Mol. Biol. Evol., 18, 1435–1441.

  8. Wicker,N., Dembele,D., Raffelsberger,W. and Poch,O. (2002) Density of points clustering, application to transcriptomic data analysis. Nucleic Acids Res., 30, 3992–4000.

  9. Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45–48.

  10. Sussman,J.L., Lin,D., Jiang,J., Manning,N.O., Prilusky,J., Ritter,O. and Abola,E.E. (1998) Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules. Acta Crystallogr. D Biol. Crystallogr., 54, 1078–1084.

  11. Lecompte,O., Ripp,R., Puzos-Barbe,V., Duprat,S., Heilig,R., Dietrich,J., Thierry,J.C. and Poch,O. (2001) Genome evolution at the genus level: comparison of three complete genomes of hyperthermophilic archaea. Genome Res., 11, 981–993.

  12. Cohen,G., Barbe,V., Flament,D., Galperin,M., Heilig,R., Lecompte,O., Poch,O., Prieur,D., Quérellou,J., Ripp,R., et al. (2003) An integrated analysis of the genome of the hyperthermophilic archaeon Pyrococcus abyssi. Mol. Microbiol., 47, 1495–1512.

  13. Lecompte,O., Ripp,R., Thierry,J.C., Moras,D. and Poch,O. (2002) Comparative analysis of ribosomal proteins in complete genomes: an example of reductive evolution at the domain scale. Nucleic Acids Res., 30, 5382–5390.

  14. Mulder,N.J., Apweiler,R., Attwood,T.K., Bairoch,A., Bateman,A., Binns,D., Biswas,M., Bradley,P., Bork,P., Bucher,P., et al. (2002) InterPro: an integrated documentation resource for protein families, domains and functional sites. Brief Bioinform., 3, 225–235.




