Nucleic Acids Research 2005 33(Web Server Issue):W724-W727; doi:10.1093/nar/gki434
© The Author 2005. Published by Oxford University Press. All rights reserved
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions{at}oupjournals.org
Éclair: a web service for unravelling species origin of sequences sampled from mixed host interfaces
Stephen Rudd* and
Igor V. Tetko1,2
Centre for Biotechnology, Tykistökatu 6 FIN-20521 Turku, Finland
1Institute for Bioinformatics (MIPS), GSF Research Centre for Environment and Health D-85764 Neuherberg, Germany
2IBPC, Biomedical Department, Ukrainian Academy of Sciences, Murmanskaya 1 UA-02094, KYIV, Ukraine
*To whom correspondence should be addressed. Tel: +358 2 333 8611; Fax: +358 2 333 8000 Email: stephen.rudd{at}btk.utu.fi
Received February 9, 2005. Revised March 25, 2005. Accepted March 25, 2005.

ABSTRACT
The identification of the genes that participate at the biological
interface of two species remains critical to our understanding
of the mechanisms of disease resistance, disease susceptibility
and symbiosis. The sequencing of complementary DNA (cDNA) libraries
prepared from the biological interface between two organisms
provides an inexpensive way to identify the novel genes that
may be expressed as a cause or consequence of compatible or
incompatible interactions. Sequence classification and annotation
of species origin typically use an orthology-based approach
and require access to large portions of either genome, or a
close relative. Novel species- or clade-specific sequences may
have no counterpart within existing databases and remain ambiguous
features. Here we present a web-service, Éclair, which
utilizes support vector machines for the classification of the
origin of expressed sequence tags stemming from mixed host cDNA
libraries. In addition to providing an interface for the classification
of sequences, users are presented with the opportunity to train
a model to suit their preferred species pair. Éclair
is freely available at
http://eclair.btk.fi.

INTRODUCTION
The identification of the genes, their corresponding proteins
and the concomitant protein networks that mediate the interaction
between two species, resulting in either disease or symbiosis,
remains critical to contemporary research (1–4). While
genome-wide expression-profiling approaches have been applied
successfully to identify the genes that are differentially regulated
within either organism (5–7), not all species are endowed
with the luxury of whole genome arrays or even completely sequenced
genome scaffolds. In the many more exotic interacting
species pairs where neither genome is sequenced, there remains
a need to sample the pools of genes expressed at the biological
interface. Expressed sequence tag (EST) sequencing has come
to the forefront as a robust but relatively inexpensive method
for sampling the protein encoding genes that are expressed within
a tissue (reviewed in 8). Dissection techniques may be used
to prepare tissue homogeneous for each of the test organisms,
e.g. from within a plant–nematode interaction. Complementary
DNA (cDNA) that is homogeneous for species origin may be prepared
from such tissue. This scenario becomes much more complicated
when we wish to consider finer biological interfaces such as
those within plant–bacterial interactions where the bacteria
may exist as an intracellular parasite, or where a fungal genome
may co-exist with a host plant as an endophyte. In such cases
it is easiest to prepare and sequence cDNA libraries that contain
mixed content from both genomes. Dozens of such cDNA libraries
already appear within the large publicly available EST sequence
databases for plant–pathogen pairs including soybean and
Phytophthora sojae (9), Biomphalaria glabrata and Schistosoma
mansoni (10), Medicago truncatula and Glomus versiforme, Populus
tremula × P. tremuloides and Amanita muscaria, Gerbera hybrid and
Botrytis, and Oryza sativa and Magnaporthe grisea.
The spectre of such mixed libraries has already been raised, and bioinformatics solutions that can assign sequences to one of the defined parental species with varying degrees of success have been described (11–14). These solutions, however, remain firmly within the realm of the bioinformatics laboratory. They require large BLAST databases, or robust training and test datasets. This demands the pre-processing of sequences to strip redundancy from the collection and to identify the probable protein coding sequence (CDS) that can be used to build classifiers based on, e.g., the underlying codon and amino acid usages.
Here we present an integrated bioinformatics solution, Éclair, that can be automatically trained and tested for the classification of the species origin for ESTs sequenced from mixed cDNA libraries. In addition to providing a framework upon which a classification model may be produced and used, the Éclair web server also provides pre-computed models for a series of the more common host:pathogen and host:host pairs that have been encountered within our research.

ECLAT
The Eclat solution (14) to the problem of differentiating species
origin for ESTs sequenced from mixed libraries uses a support
vector machine (SVM) method for classification. The Eclat SVM
is trained to discriminate between species on the basis of codon
frequencies; this requires robust training and test sequence
data from both species. These data should stem from either the
genomes that have to be classified or their close taxonomic
relatives, and should be homogeneous for species origin. Eclat
provides internal methodology to predict the CDS, but can also
use CDS predictions generated by other methods. The SVM model
is then used by the classifier methods to assign a CDS to one
of the parental species. Eclat has been empirically shown to
offer superior classification rates when compared with other
approaches (14).
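
As an illustration of the kind of feature on which such a classifier can be trained, the following minimal Python sketch converts an in-frame CDS into a vector of relative codon frequencies. The function name `codon_frequencies`, the fixed alphabetical codon order and the handling of ambiguous bases are assumptions made for this example; the exact encoding used by Eclat is described in (14) and is not reproduced here.

```python
from collections import Counter
from itertools import product

# All 64 codons in a fixed order, so every sequence maps to the same
# feature layout (the ordering itself is an arbitrary choice).
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
CODON_SET = set(CODONS)


def codon_frequencies(cds):
    """Return relative codon frequencies (a list of 64 floats) for an in-frame CDS.

    Hypothetical helper for illustration; not part of Eclat or Eclair.
    """
    seq = cds.upper().replace("U", "T")
    # Walk the sequence in frame, ignoring any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # Assumption: codons containing ambiguous bases (e.g. N) are skipped.
    counts = Counter(c for c in codons if c in CODON_SET)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(CODONS)
    return [counts[c] / total for c in CODONS]


if __name__ == "__main__":
    vector = codon_frequencies("ATGGCCGCCTAA")
    print(vector[CODONS.index("GCC")])
```

Vectors of this form, computed for CDSs of known species origin, could then be passed to any SVM implementation for training.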

IMPLEMENTATION
Éclair is a web-service that builds upon the functionality
provided by Eclat to allow a user to estimate the probable origin
of ESTs from within a mixed sequence collection. Éclair
utilizes the core analytical pipeline from the openSputnik software
(15). The logic flow for the Éclair web application is
summarized in Figure 1. The core openSputnik software and the
Éclair adapters are implemented using the Java programming
language. The web display and interfaces are written in Python
and are implemented as a Zope product. The upload of sequences
to the Éclair service will create a series of case scripts
that are run within a distributed Linux environment using the
Sun GridEngine software for job scheduling. There are two approaches
for the use of Éclair.

Figure 1. A schematic showing the workflow as applied by the Éclair web-server to classify EST sequences for species origin. There are two routes by which Éclair may be used. In Route 1, a user applies an already existing model to classify sequences. In the second scenario, Route 2, the application is trained. A user uploads homogeneous sequence collections and these are used to prepare the required models for ESTScan and the Eclat SVM. Both methods produce extensive WWW reporting to indicate sequence origins and to indicate the sensitivity and selectivity of the underlying models.

In the first instance (Route 1), a user has a collection of
EST sequences that have been sequenced from the biological interface
of two organisms. Each of the species has already been trained
within the Éclair system so no new model needs to be
produced. The user uploads the sequences to the Éclair
server along with information on the species these sequences
should be classified to. An openSputnik project is created and
the sequences are imported. From each EST sequence, a CDS is
predicted through the ESTScan application (16) using each of
the species-specific ESTScan models. The resultant CDSs are
classified using the Eclat SVM and the results are placed back
in the openSputnik database. A result page is produced that
identifies the sequences predicted to stem from each of the
genomes, along with basic statistics as to how successful the
SVM model was at the time of preparation.
In the second instance (Route 2), a user has a sequence collection associated with a species pair that is novel to Éclair. To produce and test an Eclat SVM, training data must be supplied; the user therefore uploads homogeneous data stemming from each of the species, together with sequence data from the mixed library. openSputnik projects are created for each of the datasets and the sequences are inserted into the underlying database. To remove any codon bias arising from the more abundant transcripts, the training data are clustered and assembled using the sequence assembly pipeline within the openSputnik application. The CDS is identified from the unigene sequences by performing a BLASTX (17) search against the Swiss-Prot database, filtering the results at an arbitrary E-value threshold of 1 × 10^-8 and selecting, where applicable, the best result. These CDSs are used to train an ESTScan model, which in turn is used with ESTScan to predict the CDS from the remaining unigenes. Repeated random sampling of the CDS sets is used to split the dataset into training and test sets. The training data are used to produce an Eclat SVM, while the test data are used to evaluate the efficacy of the resulting model. The results from the repeated random samples are retained and displayed with the classification results to indicate any probable error.
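
Two steps of this route, the filtering of BLASTX hits and the repeated random splitting into training and test sets, can be sketched in Python as follows. This is an illustrative sketch, not the openSputnik or Éclair code: the function names, the `(query, subject, evalue)` tuple layout, the choice of the lowest E-value as the "best result", the inclusive threshold, the 20% test fraction and the number of repeats are all assumptions made for the example.

```python
import random


def best_hits(hits, max_evalue=1e-8):
    """Keep, per query, the hit with the lowest E-value at or below max_evalue.

    `hits` is an iterable of (query_id, subject_id, evalue) tuples, a
    hypothetical simplification of tabular BLASTX output.
    """
    best = {}
    for query, subject, evalue in hits:
        if evalue > max_evalue:
            continue  # hit is weaker than the threshold and is discarded
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


def repeated_splits(items, test_fraction=0.2, repeats=5, seed=0):
    """Yield (train, test) lists from repeated random sampling of `items`.

    The test fraction and repeat count are illustrative defaults only.
    """
    rng = random.Random(seed)
    items = list(items)
    n_test = max(1, int(len(items) * test_fraction))
    for _ in range(repeats):
        shuffled = items[:]
        rng.shuffle(shuffled)
        # The first n_test shuffled items form the test set, the rest train.
        yield shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    hits = [("u1", "P1", 1e-20), ("u1", "P2", 1e-30), ("u2", "P3", 1e-3)]
    print(best_hits(hits))
    for train, test in repeated_splits(range(10), repeats=2):
        print(len(train), len(test))
```

Evaluating a model on each of the repeated splits yields a spread of scores rather than a single figure, which is what allows a probable error to be reported alongside the classification results.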

APPLICATION OF ÉCLAIR
We have tested the Éclair web service using EST data
from the host pairs shown in Table 1. The number of sequences
used for testing and training is shown along with some statistics
that demonstrate the efficiency with which Éclair has
classified sequences during the repeated training and testing
cycles. Using the species pair with the most underlying EST sequences
(Lycopersicon esculentum and Phytophthora infestans) we have
additionally tested the effect of the number of sequences used
to train the model on the sensitivity of the final model (data
not shown). This reveals that at least 1000 unigene training
sequences per species are needed to obtain an optimal SVM model,
but that as few as 100 training sequences per species suffice
to establish a model
that has >80% sensitivity and selectivity. The efficiency
of any model is of course largely dependent upon the underlying
differences in both codon usage and amino acid usage.
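
For readers who wish to compute comparable statistics on their own test sets, the sketch below derives sensitivity and selectivity for one species from lists of true and predicted labels. The definitions used here are an assumption, since the text does not define the terms: sensitivity is taken as TP / (TP + FN) and selectivity as TP / (TP + FP). The function name and the label strings are hypothetical.

```python
def sensitivity_selectivity(true_labels, predicted_labels, positive):
    """Return (sensitivity, selectivity) for the class named `positive`.

    Assumed definitions: sensitivity = TP / (TP + FN),
    selectivity = TP / (TP + FP). Other definitions exist.
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # Guard against empty denominators when a class is absent.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    selectivity = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, selectivity


if __name__ == "__main__":
    truth = ["plant", "plant", "plant", "fungus", "fungus"]
    predicted = ["plant", "plant", "fungus", "plant", "fungus"]
    print(sensitivity_selectivity(truth, predicted, positive="plant"))
```

Because the two species play symmetric roles, the same function can be called once per species to obtain both pairs of values.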

Table 1. A list of the host:pathogen pairs that are available through the Éclair web-server and basic statistics that illustrate the effectiveness of the underlying Eclat SVM models

FUTURE DIRECTIONS
The Éclair pipeline is to be fully integrated with the
openSputnik database to provide an additional annotative resource
for EST collections sequenced from mixed cDNA libraries. This
will provide detailed classification for the large numbers of
already existing sequences of unclear origin.
The openSputnik association will also provide information on
and context with complete and draft genome assemblies. EST sequences
that can be anchored at high confidence to annotated genes will
be automatically excluded from the processing by the Éclair
pipeline thereby increasing the quality of classification.
The Éclair web-server will be further developed by the inclusion of additional Eclat SVM models for new host:pathogen pairs as they are encountered within our research, when they are requested by the community or when models are created by users. This will hopefully shift the current bias for plant genomes towards a more comprehensive platform for host interactions.

AVAILABILITY
The Éclair system is a freely available web resource
and may be used anonymously by all scientific users. An email
address can be supplied and the system will alert the user to
visit a URL to retrieve the results of the completed analysis;
this URL is also supplied at submission time. The only caveats
that are imposed are with the creation of new Eclat SVM models.
The process of clustering, assembly and training is computationally
expensive and before a job is executed it is subject to checks
by an annotator. We would welcome the opportunity to further
develop Éclair and Eclat within the context of collaborative
projects; please contact the authors for details.

ACKNOWLEDGEMENTS
This work was supported by Academy of Finland grant 107333 to
S.R. and BFAM 031U212C (BMBF) and TE 380/1-1 (DFG) grants to
I.V.T. Funding to pay the Open Access publication charges for
this article was provided by Academy of Finland grant 107333.
Conflict of interest statement. None declared.

REFERENCES
1. Thomas, S.R. and Elkinton, J.S. (2004) Pathogenicity and virulence. J. Invertebr. Pathol., 85, 146–151.
2. Matsumura, H., Reich, S., Ito, A., Saitoh, H., Kamoun, S., Winter, P., Kahl, G., Reuter, M., Kruger, D.H., Terauchi, R. (2003) Gene expression analysis of plant host–pathogen interactions by SuperSAGE. Proc. Natl Acad. Sci. USA, 100, 15718–15723.
3. Stokes, T. (2001) Transcriptional responses to plant pathogen interactions. Trends Plant Sci., 6, 50–51.
4. Staskawicz, B.J. (2001) Genetics of plant–pathogen interactions specifying plant disease resistance. Plant Physiol., 125, 73–76.
5. Wan, J., Dunning, F.M., Bent, A.F. (2002) Probing plant–pathogen interactions and downstream defense signaling using DNA microarrays. Funct. Integr. Genomics, 2, 259–273.
6. Marathe, R., Guan, Z., Anandalakshmi, R., Zhao, H., Dinesh-Kumar, S.P. (2004) Study of Arabidopsis thaliana resistome in response to cucumber mosaic virus infection using whole genome microarray. Plant Mol. Biol., 55, 501–520.
7. Moran, P.J., Cheng, Y., Cassell, J.L., Thompson, G.A. (2002) Gene expression profiling of Arabidopsis thaliana in compatible plant–aphid interactions. Arch. Insect Biochem. Physiol., 51, 182–203.
8. Rudd, S. (2003) Expressed sequence tags: alternative or complement to whole genome sequences? Trends Plant Sci., 8, 321–329.
9. Qutob, D., Hraber, P.T., Sobral, B.W., Gijzen, M. (2000) Comparative analysis of expressed sequences in Phytophthora sojae. Plant Physiol., 123, 243–254.
10. Nowak, T.S., Woodards, A.C., Jung, Y., Adema, C.M., Loker, E.S. (2004) Identification of transcripts generated during the response of resistant Biomphalaria glabrata to Schistosoma mansoni infection using suppression subtractive hybridization. J. Parasitol., 90, 1034–1040.
11. Hraber, P.T. and Weller, J.W. (2001) On the species of origin: diagnosing the source of symbiotic transcripts. Genome Biol., 2, RESEARCH0037.
12. Hsiang, T. and Goodwin, P.H. (2003) Distinguishing plant and fungal sequences in ESTs from infected plant tissues. J. Microbiol. Methods, 54, 339–351.
13. Maor, R., Kosman, E., Golobinski, R., Goodwin, P., Sharon, A. (2003) PF-IND: probability algorithm and software for separation of plant and fungal sequences. Curr. Genet., 43, 296–302.
14. Friedel, C.C., Jahn, K.H., Sommer, S., Rudd, S., Mewes, H.W., Tetko, I.V. (2005) Support vector machines for separation of mixed plant-pathogen EST collections based on codon usage. Bioinformatics, 21, 1383–1388.
15. Rudd, S. (2005) openSputnik: a database to ESTablish comparative plant genomics using unsaturated sequence collections. Nucleic Acids Res., 33, D622–D627.
16. Iseli, C., Jongeneel, C.V., Bucher, P. (1999) ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions in EST sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol., 138–148.
17. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
