Nucleic Acids Research Advance Access published online on May 8, 2007
Nucleic Acids Research, doi:10.1093/nar/gkm325
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Berkeley Phylogenomics Group web servers: resources for structural phylogenomic analysis
Jake Gunn Glanville,
Dan Kirshner,
Nandini Krishnamurthy and
Kimmen Sjölander*
Berkeley Phylogenomics Group, University of California, Berkeley
*To whom correspondence should be addressed. Tel: 510 642 9932; Fax: 510 666 3327; Email: kimmen{at}berkeley.edu
Received January 31, 2007. Revised April 10, 2007. Accepted April 17, 2007.
ABSTRACT
Phylogenomic analysis addresses the limitations of function
prediction based on annotation transfer, and has been shown
to enable the highest accuracy in prediction of protein molecular
function. The Berkeley Phylogenomics Group provides a series
of web servers for phylogenomic analysis: classification of
sequences to pre-computed families and subfamilies using the
PhyloFacts Phylogenomic Encyclopedia, FlowerPower clustering
of proteins sharing the same domain architecture, MUSCLE multiple
sequence alignment, SATCHMO simultaneous alignment and tree
construction and SCI-PHY subfamily identification. The PhyloBuilder
web server provides an integrated phylogenomic pipeline starting
with a user-supplied protein sequence, proceeding to homolog
identification, multiple alignment, phylogenetic tree construction,
subfamily identification and structure prediction. The Berkeley
Phylogenomics Group resources are available at
http://phylogenomics.berkeley.edu.
INTRODUCTION
The standard protocol for gene function prediction involves
homology-based annotation transfer (e.g. using the top BLAST
hit); this approach is now known to be fraught with systematic
errors (1–3). Biological processes such as gene duplication,
mutation at critical residues, speciation and domain shuffling
contribute to modifications of the original function that significantly
complicate the process of functional annotation (1,4–6).
Existing annotation errors can also be propagated by homology-based
annotation transfer (7).
Phylogenomic inference of gene function is known to be the most robust and accurate method for functional annotation. This approach enables the function of a protein to be inferred in an evolutionary context, avoiding the pitfalls of approaches based on simple pairwise sequence comparison, and vastly improving the accuracy of functional annotation (8–10). Phylogenomic analysis proceeds in stages, starting with homolog identification and multiple sequence alignment (MSA). The (masked) alignment is then used as input to phylogenetic tree construction. Examination of the tree topology enables biologists to discriminate between orthologs (with presumably conserved function) and paralogs (related by gene duplication, and potentially divergent in function), providing improved discrimination of specific function in instances when a protein family has evolved multiple related but distinct functions (11,12). To increase the confidence in function prediction, the source of the annotations can be examined; the Gene Ontology resource includes evidence codes for annotations for this purpose (13). The Berkeley Phylogenomics Group has developed a series of web servers for individual steps in a phylogenomic pipeline, and a single web server, PhyloBuilder, that performs all the steps, as shown in Figure 1. Each web server can be used individually or in combination for phylogenomic inference.
Each server includes Java applets for viewing the associated data; data can also be downloaded in standard formats. Users can bookmark a results page, or choose to receive results by email.

Figure 1. Berkeley Phylogenomics Group web servers for the different steps of a phylogenomic pipeline. Top: Users can submit sequences for classification against the PhyloFacts Phylogenomic Encyclopedia of pre-computed families and subfamilies. Middle: The phylogenomic pipeline. Bottom: Web servers for specific tasks in the pipeline. Many of these servers cover more than one step in the process, e.g. the PhyloBuilder web server, which performs all the steps of the pipeline and outputs a MSA, subfamilies, domain/3D structure predictions and phylogenetic trees overlaid with annotations.
PHYLOFACTS PHYLOGENOMIC ENCYCLOPEDIA
PhyloFacts enables functional classification of user-submitted
sequences to pre-computed families and subfamilies from across
the Tree of Life (14). Hidden Markov models are provided for
functional classification of novel sequences to families and
subfamilies. PhyloFacts protein family books include
an MSA, phylogenetic trees, predicted structures and critical
residues, experimental and annotation data, hidden Markov models,
and links to other resources. Since the initial publication
(14), the PhyloFacts resource has significantly increased in
size, from ~9000 families in May 2006 to >27 000 families
in April 2007. Most of this increase has gone toward expanding
our coverage of microbial gene families and gene families found
in the human genome, including homologs in other species. New
functionality added to PhyloFacts over the past year
includes super-fast classification of user-submitted sequences
to global homology groups (proteins sharing the same domain
architecture) and a new protocol for functional sub-classification.
Usage: Users can submit DNA or protein sequences in FASTA format for classification to PhyloFacts families and subfamilies. PhyloFacts family books are selected for HMM scoring by a pre-processing step: a BLAST search of the query sequences against the consensus sequences for each of the families in the resource. HMMs from families with BLAST E-values of 10 or better are then scored against the query. An example output from PhyloFacts is shown in Figure 2. Clicking on "View Alignment" displays the pairwise alignment between the submitted query and the consensus sequence, along with statistics about the alignment. To initiate subfamily HMM-based classification, users click in the "Search subfamilies" box for families of interest and then click on the box at the bottom labeled "Search selected books for top-scoring subfamily HMMs against query". Logistic regression analysis is used to differentiate sequences that can be assigned to the top-scoring subfamily from those that represent novel subtypes. Users can examine PhyloFacts protein family books by following links in the "PhyloFacts book" column in the table of results. Super-fast classification of query sequences to families with global similarity is also provided (results would otherwise include local matches). Users can bookmark a results page, or choose to receive results by email. PhyloFacts is available at http://phylogenomics.berkeley.edu/phylofacts.
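The BLAST pre-processing step described above can be illustrated with a short sketch. This is not the PhyloFacts implementation: the function name, the dictionary of per-family E-values and the family identifiers are hypothetical, and only the E-value cutoff of 10 comes from the text.

```python
from typing import Dict, List

# Cutoff stated in the text: families whose consensus sequence matches the
# query with a BLAST E-value of 10 or better go on to HMM scoring.
EVALUE_CUTOFF = 10.0


def select_families_for_hmm_scoring(
    blast_evalues: Dict[str, float], cutoff: float = EVALUE_CUTOFF
) -> List[str]:
    """Return family IDs that pass the prefilter, best (lowest) E-value first.

    `blast_evalues` maps a hypothetical family ID to the best E-value of the
    query against that family's consensus sequence.
    """
    passing = [(evalue, family) for family, evalue in blast_evalues.items()
               if evalue <= cutoff]
    return [family for _, family in sorted(passing)]
```

Only the families returned by such a filter would need to be scored with the more expensive family HMMs, which is presumably why a permissive cutoff is acceptable at this stage.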

Figure 2. Result of functional classification against PhyloFacts. The figure shows HMM scoring results for the UniProt sequence Q6BH13 from Debaryomyces hansenii. The search retrieves protein family books constructed using three different protocols: global homology, conserved region and domain. Subfamily classification is enabled by selecting books (clicking in boxes at left side of table, under Search subfamilies) followed by clicking the Go button at bottom. See text for details.
FLOWERPOWER HOMOLOGY DETECTION
FlowerPower is an iterative homology-detection server akin to
PSI-BLAST (15), but designed specifically for phylogenomic inference
of function (16). FlowerPower is optimized for the retrieval
of sequences sharing the same domain architecture; this prevents
transfer of database annotation based on partial homology (i.e.
local instead of global similarity). FlowerPower uses iterative
subfamily hidden Markov model (HMM) searches against PSI-BLAST-identified
homologs and alignment analysis to discriminate between partial
and global homologies; this approach outperforms existing methods
in gathering global homologs.
Usage: The input to FlowerPower
is a protein sequence in FASTA format; default parameters search
the UniProt (17) database for proteins sharing the same domain
architecture. The Advanced Settings page enables
users to modify the PSI-BLAST parameters for the database searched,
the number of iterations and the maximum number of hits returned. Parameters
for the iterated search with subfamily HMMs can also be modified.
Finally, users can choose between two homolog-selection modes:
global (to both query and hit) and glocal (global-local
homology: retrieved sequences must align over a specific region,
but can have additional structure). Results include the selected
sequences, the raw FlowerPower alignment, a MUSCLE (18) re-alignment,
and the results of the initial PSI-BLAST search. FlowerPower
is available through
http://phylogenomics.berkeley.edu/flowerpower/.
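The difference between the two homolog-selection modes can be made concrete with a coverage-based sketch. This is an illustration only and not FlowerPower's actual criteria: the `Hit` record, the `passes_mode` function and the 0.7 coverage threshold are all assumptions, and the "specific region" of glocal mode is taken here to be the full query.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Aligned region of a query/hit pair (1-based, inclusive coordinates)."""
    query_len: int
    hit_len: int
    query_start: int
    query_end: int
    hit_start: int
    hit_end: int


def coverage(start: int, end: int, length: int) -> float:
    """Fraction of a sequence covered by the aligned region."""
    return (end - start + 1) / length


def passes_mode(hit: Hit, mode: str, min_cov: float = 0.7) -> bool:
    """Apply a hypothetical coverage test for 'global' or 'glocal' selection."""
    q_cov = coverage(hit.query_start, hit.query_end, hit.query_len)
    h_cov = coverage(hit.hit_start, hit.hit_end, hit.hit_len)
    if mode == "global":
        # Alignment must span most of both the query and the hit.
        return q_cov >= min_cov and h_cov >= min_cov
    if mode == "glocal":
        # Alignment must span the query region; the hit may be longer.
        return q_cov >= min_cov
    raise ValueError(f"unknown mode: {mode!r}")
```

Under these assumptions, a 400-residue multi-domain protein that aligns to a 100-residue query over one domain would be rejected in global mode but retained in glocal mode.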
MULTIPLE SEQUENCE ALIGNMENT USING MUSCLE
The MUSCLE software produces high-accuracy multiple sequence
alignments, with outstanding scores on benchmark dataset tests;
it is also very fast, making it suitable for large-scale application
(18). We employ MUSCLE in our internal pipeline for the PhyloFacts
Phylogenomic Encyclopedia construction (14).
Usage: The input
to MUSCLE is a set of protein sequences in FASTA format. Alignments
can be viewed online or downloaded in Aligned FASTA format.
MUSCLE is available at
http://phylogenomics.berkeley.edu/muscle.
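For readers who want to process a downloaded alignment, the following is a minimal sketch of reading Aligned FASTA text, in which every record is padded with gap characters to the same length. The function name is hypothetical, and the sketch assumes that the first whitespace-delimited token of each header line is a unique sequence identifier.

```python
from typing import Dict, List


def read_aligned_fasta(text: str) -> Dict[str, str]:
    """Parse Aligned FASTA text into {sequence_id: aligned_row}.

    Raises ValueError if the input is malformed or if the rows do not all
    have the same length (i.e. the file is not a valid alignment).
    """
    parts: Dict[str, List[str]] = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            if not tokens:
                raise ValueError("header line without an identifier")
            name = tokens[0]
            if name in parts:
                raise ValueError(f"duplicate identifier: {name}")
            parts[name] = []
        elif name is None:
            raise ValueError("sequence data before the first header")
        else:
            parts[name].append(line)
    alignment = {seq_id: "".join(chunks) for seq_id, chunks in parts.items()}
    if len({len(row) for row in alignment.values()}) > 1:
        raise ValueError("aligned rows differ in length")
    return alignment
```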
SATCHMO
SATCHMO (Simultaneous Alignment and Tree Construction using
Hidden Markov mOdels) is a progressive method of multiple sequence
alignment that uses agglomerative clustering to estimate a phylogenetic
tree simultaneously with the alignment. SATCHMO uses Dirichlet
mixture densities (19) to construct profiles, and profile–profile
scoring and alignment (20–23) to determine the phylogenetic
tree topology. Each node in the tree contains an MSA and a corresponding
profile. As sequences diverge in evolution, small insertions,
deletions and mutations result in changes in structure and function;
SATCHMO is intended to model these changes in different lineages
in a family. Profiles and alignments at internal nodes in the
tree represent the sequences descending from that node and may
be of different lengths. The alignment at the root of the tree
is an estimate of the conserved core structure defining all
family members; when highly divergent sequences are input to
SATCHMO this root alignment may be a small fraction of the average
sequence length. Tree topologies produced using SATCHMO are
consistent with expert-defined subtypes; alignment accuracy
is also high (20).
Usage: The input to SATCHMO is a set of unaligned
protein sequences, in FASTA format. The SATCHMO root alignment
can be viewed online using a Java applet or downloaded from
the website. Special SATCHMO tree-alignment viewing software
is available online (currently for PCs only), enabling the different
alignments descending from each internal node of the tree to
be examined separately. SATCHMO is available at
http://phylogenomics.berkeley.edu/satchmo.
SCI-PHY AND SUBFAMILY HMM CONSTRUCTION
SCI-PHY (Subfamily Classification in PHYlogenomics) uses Bayesian
and information-theoretic approaches to construct a hierarchical
tree and cut the tree into subtrees to identify functional subfamilies
(24). Subfamily hidden Markov models are constructed using Dirichlet
mixture densities to derive a position- and subfamily-specific
weighting scheme to share information across subfamilies; this
has been shown to increase the separation between homologous
and unrelated sequences and to provide high specificity of classification
(25).
Usage: The input to SCI-PHY is an MSA in either Aligned
FASTA or the UCSC A2M format. Outputs include the MSA divided
into subfamilies, the SCI-PHY tree, and subfamily and family
HMMs in both HMMER and UCSC SAM formats. The SCI-PHY tree can
be downloaded or viewed online using the Java ATV applet (26).
SCI-PHY is available at:
http://phylogenomics.berkeley.edu/SCI-PHY.
PHYLOBUILDER
PhyloBuilder is an automated computational pipeline for phylogenomic
analysis, starting from an input protein sequence. PhyloBuilder
is a modified version of the pipeline we use to populate the
PhyloFacts Phylogenomic Encyclopedia with protein family books
(14). The PhyloBuilder pipeline has multiple stages, as shown
in Figure 1. In stage 1, FlowerPower is used to retrieve global
homologs for the user-supplied sequence. Program parameters
for this stage are set by default to maximize the retrieval
of proteins sharing a common domain architecture; alternative
settings are provided to enable users to request the selection
of glocal homologs (sequences sharing a common domain but which
may have different overall folds). If fewer than three sequences
matching user criteria are identified, the program skips stages
2 and 3 and jumps directly to stage 4. In stage 2, the FlowerPower
cluster is aligned using MUSCLE, followed by alignment masking
in preparation for phylogenetic analysis (removing columns containing
>70% gap characters). In stage 3, the masked alignment is
used as the basis for neighbor-joining tree construction using
the PHYLIP software (27), and submitted to the SCI-PHY software
for subfamily identification. In stage 4, Gene Ontology annotations
and evidence codes (13), Enzyme Classification data, and other
data are retrieved for sequences in the cluster, and put into
a spreadsheet, separated into SCI-PHY subfamilies. The species
of origin, accession and definition lines are overlaid on the
neighbor-joining tree, and can be viewed using the ATV tree-viewer/editor
Java applet. In stage 5, domain and 3D-structure predictions
for the family as a whole are performed based on analysis of
the consensus sequence for the family: PFAM domains (28) are
predicted (using the PFAM gathering threshold), transmembrane
domains and signal peptides are predicted using the Phobius
server (29), and homologous 3D structures are identified using
BLAST analysis against the Protein Data Bank (PDB) (30). The
phylogenetic trees produced by the PhyloBuilder web server can
be used to identify orthologs manually; users can also download
these trees and alignments for input to automated ortholog identification
programs such as Orthostrapper (11) and RIO (12).
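Two of the rules stated above, the alignment masking in stage 2 and the skipping of stages 2 and 3 for small clusters, are simple enough to sketch. This is an illustration under stated assumptions rather than the PhyloBuilder code: the function names are hypothetical, both "-" and "." are treated as gap characters, and stage 5 is assumed to run after stage 4 in either case.

```python
from typing import List


def mask_gappy_columns(
    rows: List[str], max_gap_fraction: float = 0.70, gap_chars: str = "-."
) -> List[str]:
    """Remove alignment columns in which more than 70% of rows are gaps."""
    if not rows:
        return []
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all aligned rows must have the same length")
    kept = [
        col for col in range(width)
        if sum(row[col] in gap_chars for row in rows) / len(rows)
        <= max_gap_fraction
    ]
    return ["".join(row[col] for col in kept) for row in rows]


def stages_to_run(n_homologs: int) -> List[int]:
    """Stages executed for a cluster of the given size.

    With fewer than three sequences, alignment/masking (stage 2) and tree
    construction/subfamily identification (stage 3) are skipped.
    """
    if n_homologs < 3:
        return [1, 4, 5]
    return [1, 2, 3, 4, 5]
```

Masking of this kind removes columns that are mostly insertions in a few sequences, which carry little phylogenetic signal for the family as a whole.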
Usage: Users
paste in (or upload) a protein sequence for analysis. Results
are stored for ten days; users can request long-term storage
of these results. PhyloBuilder program outputs include a multiple
sequence alignment, phylogenetic tree, subfamily identification,
predicted domain/3D structure, and experimental and annotation
data (see Figure 3). PhyloBuilder is available at
http://phylogenomics.berkeley.edu/phylobuilder.

Figure 3. Result of PhyloBuilder run for human Caspase-1. PhyloBuilder takes an input protein sequence and outputs a web page containing a cluster of homologous proteins, multiple sequence alignment, neighbor-joining tree, predicted subfamilies, PFAM domains, transmembrane domains and signal peptides, and retrieval of Gene Ontology (GO) and Enzyme Classification (EC) data. Top: Summary data include the number of homologs retrieved, taxonomic distribution, EC numbers and GO annotations and evidence codes. SCI-PHY subfamilies can be viewed by clicking on the link labeled View details (see inset at top right). The multiple sequence alignment can be viewed using JalView or hypertext. A spreadsheet with annotations for all sequences is available under Experimental and annotation data for sequences in family. Middle: PFAM and transmembrane domain/signal peptide predictions are displayed. Neighbor-joining and SCI-PHY trees can be viewed using ATV. Homologous 3D structures can be viewed using JMOL; residues predicted to be critical using evolutionary conservation analysis are displayed on the structure. Catalytic Site Atlas data are included. Bottom: Various downloads are available, including a multiple sequence alignment for the family and individual subfamilies, a FASTA file for all PSI-BLAST hits, NJ tree and HMMs for the family and SCI-PHY subfamilies in HMMER and SAM formats. An Edit book info button enables users to add descriptive labels to families and subfamilies (as shown in the inset at top right). See text for details.
FUTURE WORK
The PhyloFacts Phylogenomic Encyclopedia is under continuous
expansion; we plan to continue our development of this resource
to cover all protein families across the Tree of Life. The conservative
parameterization of the homology clustering component of the PhyloBuilder
server occasionally results in a somewhat restrictive set of
homologs when global homology is enforced. We plan to explore
PhyloBuilder parameter settings that retain selectivity while
optimizing sensitivity, and to allow users to input a multiple
sequence alignment constructed independently instead of being
dependent on the FlowerPower clustering used in PhyloBuilder.
Computational efficiency remains a significant challenge in
phylogenomic inference. Many of the steps in a phylogenomic
pipeline are computationally intensive; this causes us to limit
the size of inputs and the number of jobs submitted per day
(see individual web server pages for guidelines). We plan to
improve the computational efficiency of these servers and also
increase the size of our compute cluster in order to overcome
this limitation.
ACKNOWLEDGEMENTS
The authors wish to thank the numerous developers of bioinformatics
web servers and databases providing methods or data included
in the PhyloFacts resource and web servers. The PhyloFacts resource
is supported by a Presidential Early Career Award for Scientists
and Engineers (PECASE) from the National Science Foundation
(DBI-0238311), and by a grant from the National Human Genome
Research Institute of the NIH (R01HG02769). We thank Ian Tegebo
and Graham Melcher for assistance in quality control tests of
the web servers described here. Funding to pay the Open Access
publication charges for this article was provided by NIH (R01HG02769).
Conflict of interest statement. None declared.
REFERENCES
1. Bork P, Koonin EV. Predicting functions from protein sequences: where are the bottlenecks? Nat. Genet. (1998) 18: 313–318.
2. Galperin MY, Koonin EV. Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption. In Silico Biol. (1998) 1: 55–67.
3. Gerlt JA, Babbitt PC. Can sequence determine function? Genome Biol. (2000) 1: REVIEWS0005.
4. Fitch WM. Distinguishing homologous from analogous proteins. Syst. Zool. (1970) 19: 99–113.
5. Kaessmann H, Zollner S, Nekrutenko A, Li WH. Signatures of domain shuffling in the human genome. Genome Res. (2002) 12: 1642–1650.
6. Rajalingam R, Parham P, Abi-Rached L. Domain shuffling has been the main mechanism forming new hominoid killer cell Ig-like receptors. J. Immunol. (2004) 172: 356–369.
7. Brenner SE. Errors in genome annotation. Trends Genet. (1999) 15: 132–133.
8. Brown D, Sjölander K. Functional classification using phylogenomic inference. PLoS Comput. Biol. (2006) 2: e77.
9. Eisen JA. Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res. (1998) 8: 163–167.
10. Sjölander K. Phylogenomic inference of protein molecular function: advances and challenges. Bioinformatics (2004) 20: 170–179.
11. Storm CE, Sonnhammer EL. Automated ortholog inference from phylogenetic trees and calculation of orthology reliability. Bioinformatics (2002) 18: 92–99.
12. Zmasek CM, Eddy SR. RIO: analyzing proteomes by automated phylogenomics using resampled inference of orthologs. BMC Bioinformatics (2002) 3: 14.
13. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. (2000) 25: 25–29.
14. Krishnamurthy N, Brown DP, Kirshner D, Sjölander K. PhyloFacts: an online structural phylogenomic encyclopedia for protein functional and structural classification. Genome Biol. (2006) 7: R83.
15. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997) 25: 3389–3402.
16. Krishnamurthy N, Brown D, Sjölander K. FlowerPower: clustering proteins into domain architecture classes for phylogenomic inference of protein function. BMC Evol. Biol. (2007) 7 (Suppl. 1): S12.
17. Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, et al. The Universal Protein Resource (UniProt). Nucleic Acids Res. (2005) 33: D154–D159.
18. Edgar RC. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics (2004) 5: 113.
19. Sjölander K, Karplus K, Brown M, Hughey R, Krogh A, Mian IS, Haussler D. Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Comput. Appl. Biosci. (1996) 12: 327–345.
20. Edgar RC, Sjölander K. SATCHMO: sequence alignment and tree construction using hidden Markov models. Bioinformatics (2003) 19: 1404–1411.
21. Edgar RC, Sjölander K. A comparison of scoring functions for protein sequence profile alignment. Bioinformatics (2004) 20: 1301–1308.
22. Edgar RC, Sjölander K. COACH: profile-profile alignment of protein families using hidden Markov models. Bioinformatics (2004) 20: 1309–1318.
23. Sadreyev R, Grishin N. COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J. Mol. Biol. (2003) 326: 317–336.
24. Sjölander K. Phylogenetic inference in protein superfamilies: analysis of SH2 domains. Proc. Int. Conf. Intell. Syst. Mol. Biol. (1998) 6: 165–174.
25. Brown D, Krishnamurthy N, Dale JM, Christopher W, Sjölander K. Subfamily HMMs in functional genomics. Pac. Symp. Biocomput. (2005) 10: 322–333.
26. Zmasek CM, Eddy SR. ATV: display and manipulation of annotated phylogenetic trees. Bioinformatics (2001) 17: 383–384.
27. Felsenstein J. (2005) 3.6a2.1 ed. Distributed by the author. Department of Genome Science, University of Washington, Seattle.
28. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, et al. The Pfam protein families database. Nucleic Acids Res. (2004) 32: D138–D141.
29. Kall L, Krogh A, Sonnhammer EL. A combined transmembrane topology and signal peptide prediction method. J. Mol. Biol. (2004) 338: 1027–1036.
30. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res. (2000) 28: 235–242.
