Nucleic Acids Research, 2000, Vol. 28, No. 18 3442-3444
© 2000 Oxford University Press
STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene
1European Molecular Biology Laboratory, Meyerhofstrasse 1, D-69117 Heidelberg, Germany and 2Max-Delbruck-Centrum for Molecular Medicine, D-13122 Berlin-Buch, Germany
Received July 14, 2000; Accepted August 2, 2000.
| ABSTRACT |
|---|
The repeated occurrence of genes in each other's neighbourhood on genomes has been shown to indicate a functional association between the proteins they encode. Here we introduce STRING (search tool for recurring instances of neighbouring genes), a tool to retrieve and display the genes with which a query gene repeatedly occurs in clusters on the genome. The tool performs iterative searches and visualises the results in their genomic context. By finding the genomically associated genes for a query, it delineates a set of potentially functionally associated genes. The usefulness of STRING is illustrated with an example that suggests a functional context for an RNA methylase with unknown specificity. STRING is available at http://www.bork.embl-heidelberg.de/STRING
| INTRODUCTION |
|---|
The availability of complete genome sequences has stimulated the development of new methods for protein function prediction (1–6). In contrast to classical, homology-based function assignment, these methods do not predict the function of proteins, but rather the functional association between proteins, based on the genomic association of their genes. One approach is based on the observation that genes that repeatedly occur in each other's proximity on genomes (in potential operons) tend to encode functionally interacting proteins, e.g. the proteins are part of the same protein complex or metabolic pathway (1–3,7–9). Here we introduce a web-server that retrieves, for a given query gene, all the genes that repeatedly occur with it within potential operons. The server is named STRING (search tool for recurring instances of neighbouring genes). It also retrieves, by an iterative approach, the genes that are indirectly (via other genes) associated with the query gene. The web-interface (http://www.bork.embl-heidelberg.de/STRING) visualises the results in their genomic context (Fig. 1).
| METHODOLOGY |
|---|
The tool starts with a single seed gene. In the zeroth iteration it retrieves and displays the genes that repeatedly occur with this gene in clusters on the genome in multiple, phylogenetically distant species (for a definition see 10). We define gene clusters here as introduced by Overbeek et al. with the concept of a run: a stretch of genes on the same strand not interrupted by >300 bp (2). In addition, we count two genes that are actually fused into one gene as being in the same run. In subsequent iterations the tool repeats this process using as seeds all the new genes retrieved in the previous iteration, thereby uncovering the set of genes that are indirectly linked to the seed gene. The iterations continue until the number of iterations set by the user is reached, or until no new genes are found (convergence).
Normally the query gene is used as seed. If the query gene is not itself part of a conserved gene cluster, the tool uses as seeds those orthologues of the query gene that are in conserved gene clusters. When a protein sequence is submitted as query, the tool performs a BLAST search against the proteins from the published genomes (NCBI basic protein BLAST 2.0 with a cut-off E-value of 10⁻⁵; 11). If a perfect match is found, that gene is used as seed. Otherwise the user can select a seed from the list of BLAST hits.
With the results of the last iteration, the tool also displays the genes that are not retrieved via conserved gene order but that are still present in the species from which other genes have already been retrieved. The presence or absence of these genes that are not in a conserved cluster complements the cluster information. The explicit focus on (iteratively) searching and displaying the integral conserved genomic organisation for a given gene is one of the defining features of this server, and sets it apart from what is currently available at servers like KEGG (12).
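The run definition and the iterative retrieval described above can be illustrated with a minimal Python sketch. It is not the server's implementation: the `Gene` record, the 1-based inclusive coordinates, the function names and the `neighbours` lookup (a mapping from each gene to the genes it repeatedly shares a run with) are all assumptions made for illustration, and gene fusions are not handled.

```python
from collections import namedtuple

# Hypothetical record: 1-based inclusive coordinates, strand is "+" or "-".
Gene = namedtuple("Gene", ["name", "start", "end", "strand"])


def find_runs(genes, max_gap=300):
    """Split the genes of one replicon into runs.

    A run is a stretch of consecutive genes on the same strand whose
    intergenic gaps never exceed max_gap bp (300 bp in the text).
    """
    runs, current = [], []
    for gene in sorted(genes, key=lambda g: g.start):
        if current:
            prev = current[-1]
            gap = gene.start - prev.end - 1  # negative if genes overlap
            if gene.strand == prev.strand and gap <= max_gap:
                current.append(gene)
                continue
            runs.append(current)  # strand switch or large gap ends the run
        current = [gene]
    if current:
        runs.append(current)
    return runs


def iterative_search(seed, neighbours, max_rounds):
    """Collect genes linked to the seed directly or via other genes.

    neighbours is an assumed, precomputed dict: gene -> set of genes that
    repeatedly occur in a run with it. max_rounds counts all rounds,
    including the zeroth. The loop stops early on convergence.
    """
    found = {seed}
    frontier = {seed}
    for _ in range(max_rounds):
        new = set()
        for gene in frontier:
            new |= neighbours.get(gene, set()) - found
        if not new:  # convergence: no new genes retrieved
            break
        found |= new
        frontier = new  # only newly found genes seed the next round
    return found - {seed}
```

Using only the newly retrieved genes as the next frontier mirrors the description that each iteration is seeded with the genes found in the previous one.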
A conceptually similar approach is being developed independently at WIT (http://wit.integratedgenomics.com/IGwit/), which in principle allows one to obtain similar results. Apart from many small differences in the implementation and visualisation, the major difference seems to be that STRING is a specialised and dedicated server for this type of search.
Orthology is operationally defined as a bidirectional best significant hit (E < 0.01), based on Smith–Waterman (13) comparisons of the complete genomes with one another, and including the possibility of gene fusion/fission (14). The iterative usage of these orthology relations can give rise to inconsistencies, due to unrecognised paralogy, unrecognised homology, and/or gene fusion. However, the quality of orthology prediction here is relatively high because of the additional requirement in STRING of conserved gene order (14).
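The operational orthology criterion can be sketched as follows. This is an illustrative reading of "bidirectional best significant hit" only: the input layout (per-query lists of `(subject, E-value)` pairs, as might be parsed from pairwise genome comparisons) and the function name are assumptions, and the gene fusion/fission handling mentioned in the text is deliberately left out.

```python
def bidirectional_best_hits(hits_ab, hits_ba, max_evalue=0.01):
    """Return the set of (gene_a, gene_b) pairs that are each other's best hit.

    hits_ab: dict mapping each gene of genome A to a list of
             (gene_in_B, evalue) pairs; hits_ba is the reverse search.
    Only hits with evalue < max_evalue count as significant.
    """
    def best_hit(hits):
        best = {}
        for query, candidates in hits.items():
            significant = [(e, s) for s, e in candidates if e < max_evalue]
            if significant:
                # Lowest E-value wins; ties are broken by subject name.
                best[query] = min(significant)[1]
        return best

    best_ab = best_hit(hits_ab)
    best_ba = best_hit(hits_ba)
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}
```

A gene whose best hit points elsewhere in the reverse direction is simply dropped here, which is where unrecognised paralogy would surface in practice.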
| DISPLAY |
|---|
All the retrieved information is displayed in one graphic that features extra information about the genes and their context (Fig. 1A). The extra information includes additional non-conserved neighbouring genes, the gene order, the relative location of the gene clusters in the genome, and the relative direction of transcription of the genes. Also featured is a table that lists how often the seed gene occurs in the same run with each other gene, both across all genomes and across phylogenetically distant genomes only (Fig. 1B). This indicates the degree of genomic association between the two genes, and thereby the strength of functional association between their respective products. The number of co-occurrences of two genes in the same cluster is linked to a page that displays only the clusters in those species containing that specific organisation and highlights the two specified genes (Fig. 1C). To assist in assessing the substructure of genomic associations between all the retrieved genes, the number of co-occurrences in the same cluster is shown for every pair of genes in a separate matrix, which can be accessed by clicking on its link.
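The pairwise co-occurrence matrix described above amounts to counting, for every pair of genes, the number of species in which both fall in the same run. A minimal sketch follows. It assumes genes have already been mapped to shared orthologous-group identifiers and that runs are given per species; the function name and data layout are hypothetical, and the separate count restricted to phylogenetically distant genomes is omitted.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(runs_by_species):
    """Count in how many species each pair of genes shares a run.

    runs_by_species: dict mapping species -> list of runs, where each run
    is a list of orthologous-group identifiers (assumed precomputed).
    Returns a Counter keyed by alphabetically sorted (gene, gene) tuples.
    """
    counts = Counter()
    for runs in runs_by_species.values():
        pairs_in_species = set()
        for run in runs:
            pairs_in_species.update(combinations(sorted(set(run)), 2))
        # Each species contributes at most once per pair.
        counts.update(pairs_in_species)
    return counts
```

The row of this matrix for the seed gene corresponds to the per-gene table, under the assumption that a pair is counted once per species however many runs it shares there.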
| COVERAGE |
|---|
STRING finds results for 24 768 of the 59 416 genes (about 42%) in the presently included set of completely sequenced genomes. Although there is little operon structure in eukaryotes, functional associations can still be predicted for eukaryotic genes to the extent that their orthologues are present in prokaryotes. In this way we found results for 637 of the 1681 yeast genes (about 38%) that have orthologues in the prokaryotes.
| SELECTIVITY |
|---|
We tested the probability that two genes repeatedly occur in one cluster by chance. In randomly shuffled genomes the probability that a given gene occurs with the same other gene in one cluster in two species is 0.02. For three species this probability is <0.002, and for four species or more it is <0.0005. The accuracy in terms of predicted functional relations is difficult to determine because of the broad definition of functional association, which includes a spectrum of possible protein relations ranging from direct ones, such as physical interactions, to vaguer ones, such as the proteins being active in the same cellular process. Note, however, that the functional link tends to be stronger when the conservation is stronger (6). Furthermore, the interpretation of the type of association is facilitated by what is known about the putative molecular functions of the proteins, which can be inferred from conventional homology (see the example of cys-tRNA in Fig. 1). In general, only the user can interpret the nature of the association, using knowledge of the genes and organisms involved.
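The kind of shuffling test described above can be sketched as a small Monte Carlo simulation. The text does not specify the exact procedure, so everything here is an assumption: gene order is shuffled uniformly within each genome, the original run lengths are kept, and a "hit" means that at least one other gene shares a run with the given gene in every genome considered. The function names are hypothetical, and the sketch is not expected to reproduce the reported probabilities.

```python
import random


def shuffled_runs(genes, run_lengths, rng):
    """Shuffle gene order and cut it into runs of the given lengths.

    Assumes sum(run_lengths) == len(genes).
    """
    order = list(genes)
    rng.shuffle(order)
    runs, i = [], 0
    for n in run_lengths:
        runs.append(order[i:i + n])
        i += n
    return runs


def partners(gene, runs):
    """Genes that share a run with `gene` (empty set if it is alone)."""
    for run in runs:
        if gene in run:
            return set(run) - {gene}
    return set()


def chance_cooccurrence(gene, genomes, trials=1000, seed=0):
    """Estimate how often `gene` keeps a common run partner by chance.

    genomes: list of (genes, run_lengths) tuples, one per species, with
    genes given as shared orthologous-group identifiers.
    Returns the fraction of trials in which at least one other gene shares
    a run with `gene` in every shuffled genome.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    hits = 0
    for _ in range(trials):
        shared = None
        for genes, run_lengths in genomes:
            p = partners(gene, shuffled_runs(genes, run_lengths, rng))
            shared = p if shared is None else shared & p
        if shared:
            hits += 1
    return hits / trials
```

Passing two, three or four genomes in `genomes` gives the analogue of the two-, three- and four-species cases discussed in the text.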
| CONCLUDING REMARKS |
|---|
STRING provides a platform for searching and interpreting conserved patterns in genome organisation with the aim of finding functional associations for a given gene. The iterative search and the visualisation of the retrieved genes allow the analysis and delineation of the set of potential interaction partners.
| ACKNOWLEDGEMENTS |
|---|
The authors wish to thank the members of the Bork group for helpful discussion and feedback. This work was supported by the DFG and the BMBF.
| FOOTNOTES |
|---|
* To whom correspondence should be addressed. Tel: +49 6221 387 372; Fax: +49 6221 387 517; Email: snel@embl-heidelberg.de
| REFERENCES |
|---|
1 Dandekar,T., Snel,B., Huynen,M. and Bork,P. (1998) Trends Biochem. Sci., 23, 324–328.
2 Overbeek,R., Fonstein,M., D'Souza,M., Pusch,G.D. and Maltsev,N. (1998) In Silico Biol., 1, 0009.
3 Overbeek,R., Fonstein,M., D'Souza,M., Pusch,G.D. and Maltsev,N. (1999) Proc. Natl Acad. Sci. USA, 96, 2896–2901.
4 Marcotte,E.M., Pellegrini,M., Ng,H.L., Rice,D.W., Yeates,T.O. and Eisenberg,D. (1999) Science, 285, 751–753.
5 Enright,A.J., Iliopoulos,I., Kyrpides,N.C. and Ouzounis,C.A. (1999) Nature, 402, 86–90.
6 Pellegrini,M., Marcotte,E.M., Thompson,M.J., Eisenberg,D. and Yeates,T.O. (1999) Proc. Natl Acad. Sci. USA, 96, 4285–4288.
7 Mushegian,A.R. and Koonin,E.V. (1996) Trends Genet., 12, 289–290.
8 Tamames,J., Casari,G., Ouzounis,C. and Valencia,A. (1997) J. Mol. Evol., 44, 66–73.
9 Watanabe,H., Mori,H., Itoh,T. and Gojobori,T. (1997) J. Mol. Evol., 44, S57–S64.
10 Huynen,M.A. and Snel,B. (2000) Adv. Protein Chem., 54, 345–379.
11 Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 3389–3402.
12 Kanehisa,M. and Goto,S. (2000) Nucleic Acids Res., 28, 27–30.
13 Smith,T.F. and Waterman,M.S. (1981) J. Mol. Biol., 25, 195–197.
14 Huynen,M.A. and Bork,P. (1998) Proc. Natl Acad. Sci. USA, 95, 5849–5856.
15 Hamann,C.S., Sowers,K.R., Lipman,R.S. and Hou,Y.M. (1999) J. Bacteriol., 181, 5880–5884.
16 Lipman,S.A. and Hou,Y.M. (1998) Proc. Natl Acad. Sci. USA, 95, 13495–13500.