Published online 7 December 2004
Nucleic Acids Research, Vol. 32 No. 21 © Oxford University Press 2004; all rights reserved
Global protein function annotation through mining genome-scale data in yeast Saccharomyces cerevisiae
1 UT-ORNL Graduate School of Genome Science and Technology, Oak Ridge, TN, USA and 2 Digital Biology Laboratory, Computer Science Department, 201 Engineering Building West, University of Missouri, Columbia, MO 65211, USA
* To whom correspondence should be addressed. Tel: +1 573 882 7064; Fax: +1 573 882 8318; Email: xudong{at}missouri.edu
Present address: Yu Chen, BioMarker Development, Novartis Pharmaceuticals Corp., One Health Plaza, East Hanover, NJ 07936, USA
Received October 11, 2004; Revised and Accepted November 15, 2004
| ABSTRACT |
|---|
As we move into the post-genome-sequencing era, various high-throughput experimental techniques have been developed to characterize biological systems on the genomic scale. Discovering new biological knowledge from high-throughput biological data is a major challenge for bioinformatics today. To address this challenge, we developed a Bayesian statistical method, together with a Boltzmann machine and simulated annealing, for protein functional annotation in the yeast Saccharomyces cerevisiae by integrating various high-throughput biological data, including yeast two-hybrid data, protein complexes and microarray gene expression profiles. In our approach, we quantified the relationship between functional similarity and high-throughput data, and coded the relationship into a functional linkage graph, where each node represents one protein and the weight of each edge is characterized by the Bayesian probability of function similarity between two proteins. We also integrated evolutionary information and protein subcellular localization information into the prediction. Based on our method, 1802 out of 2280 unannotated proteins in yeast were assigned functions systematically.
| INTRODUCTION |
|---|
An immediate challenge of the post-genomic era is to assign biological functions to all the proteins encoded by the genome. Despite all the efforts, only 50–60% of genes have been annotated in most organisms (1). This leaves bioinformatics with the opportunity and challenge of predicting functions for unannotated proteins by developing effective and automated methods. Several approaches have been developed for predicting protein function based on sequence similarity, such as FASTA (2) and PSI-BLAST (3). Another method to predict function is based on sequence fusion information, e.g. the Rosetta Stone approach (4). Function can also be inferred through the phylogenetic profiling of proteins in multiple genomes (5). With the ever-increasing flow of biological data generated by high-throughput methods, such as yeast two-hybrid systems (6), protein complex identification by mass spectrometry (7,8) and microarray gene expression profiles (9,10), some computational approaches have been developed to use these data for gene function prediction. Cluster analysis of gene expression profiles is a common approach for predicting functions, based on the assumption that genes with similar functions are likely to be co-expressed (9,10). Using protein–protein interaction data to assign functions to novel proteins is another approach. Proteins often interact with one another in an interaction network to achieve a common objective. It is therefore possible to infer the functions of proteins based on the functions of their interaction partners, an idea also known as 'guilt by association' (11). Schwikowski et al. (11) applied a neighbor-counting method for predicting function: they assigned function to an unknown protein based on the frequencies of its neighbors having certain functions. The method was improved by Hishigaki et al. (12), who used χ²-statistics. Both these approaches give equal significance to all the functions contributed by the protein neighbors in the interaction network. Other function prediction methods using high-throughput data include machine-learning and data-mining approaches (13) and Markov random fields (14,15). MAGIC (Multisource Association of Genes by Integration of Clusters) also combined heterogeneous data for function assignment (16).
One major challenge for protein function prediction is that the errors in the high-throughput data have not been handled well and the rich information contained in various high-throughput data has not been fully utilized, given the complexity and the quality of high-throughput data (17). A possible solution to this problem is a Bayesian probabilistic model (18), which could lead to a coherent function prediction and reduce the effect of noise by combining information from diverse data sources within a common probabilistic framework, naturally weighing each information source according to the conditional probability relationship among information sources. Another major limitation of current function prediction methods based on majority rule assignment (11) is that the global properties of the interaction network are underutilized, since current methods often do not take into account the links among proteins of unknown function. Recently, to address this challenge, Vazquez et al. (19) proposed a global method to assign protein functions based on the protein interaction network, by minimizing the number of protein interactions among different functional categories. Karaoz et al. (20) mapped gene expression and protein interaction data into a Hopfield network to make function predictions for >200 proteins with unknown functions.
To further overcome these limitations, we developed a computational framework for systematic protein function annotation on the genomic scale. Our current study focuses on the yeast Saccharomyces cerevisiae (baker's yeast), for which rich high-throughput data are available. Compared with current methods, our method is distinctive in the following aspects: (i) Unannotated proteins can be assigned to various function categories of Gene Ontology (GO) biological processes (21) with Reliability scores. This is in contrast to most other prediction methods, where proteins are assigned, as a yes-or-no call without confidence assessment, to a limited number of function categories [e.g. MIPS (22), which are less detailed than GO]. (ii) We quantitatively measured the functional relationship between genes underlying each type of high-throughput data (protein binary interactions, protein complexes and microarray gene expression profiles) and coded the relationship into a functional linkage graph (interaction network), where each node represents one protein and the weight of each edge is characterized by the Bayesian probability of function similarity between two proteins. (iii) We also integrated evolutionary information and protein subcellular localization information into function annotation. (iv) We developed a novel global function prediction method based on the Boltzmann machine, for protein function annotation with integration of functional linkage evidence from different types of high-throughput data. We may predict the function of an unannotated gene even if none of its neighbors in the network has a known function. Our method is robust in combining and propagating information systematically across the entire network, based on the global optimization of the network configuration.
| DATA SOURCES |
|---|
The high-throughput data, including microarray data, protein binary interaction data and protein complex data, were coded into an interaction network, which can be viewed as a weighted undirected graph $G_p(D) = (V_p, E_p)$ with the vertex set $V_p = \{d_i \mid d_i \in D\}$ and the edge set $E_p = \{(d_i, d_j) \mid d_i, d_j \in D \text{ and } i \neq j\}$. Each vertex represents one protein and each edge represents one measured connection between the two linked proteins from one of the types of high-throughput data: correlation in gene expression profiles with Pearson correlation coefficient $r$, protein binary interaction or protein complex interaction.
Protein–protein binary interaction data
The protein–protein interaction data from high-throughput yeast two-hybrid interaction experiments were taken from Uetz et al. (23) and Ito et al. (24), which together comprise 5075 unique interactions among 3567 proteins. We combined the yeast two-hybrid data with the known protein–protein interaction data in the MIPS database (http://mips.gsf.de/proj/yeast/CYGD/db/). In total, 6516 unique binary interactions among 3989 proteins were used in this study.
Protein complexes
The protein complex data were obtained from Gavin et al. (7) and Ho et al. (8). Although it is unclear which proteins in a complex are in physical contact, the protein complex data contain rich information about the functional relationship among the proteins involved. For simplicity, we assigned a binary interaction between any two proteins participating in a complex. Thus in general, if there are n proteins in a protein complex, we add n(n − 1)/2 binary interactions. This adds 49 313 edges to the interaction network.
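The expansion of one complex into pairwise edges can be sketched as follows. This is a minimal illustration rather than the authors' code; the function name `complex_to_edges` and the protein labels are hypothetical.

```python
from itertools import combinations

def complex_to_edges(members):
    """Return every unordered protein pair within one complex."""
    unique = sorted(set(members))  # drop duplicate entries, fix an order
    return list(combinations(unique, 2))

# Hypothetical four-protein complex: 4 * 3 / 2 = 6 edges.
edges = complex_to_edges(["P1", "P2", "P3", "P4"])
print(len(edges))
```

Edges contributed by different complexes would still need to be merged into a set before they are added to the network, since the same pair can occur in more than one complex.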
Microarray gene expression data
The gene expression profiles of the microarray data were taken from Gasch et al. (25), which included 174 experimental conditions for all the genes in yeast. For each experiment, if a data point was missing, we substituted the gene's expression ratio to the reference state with the average ratio of all the genes under that specific experimental condition. A Pearson correlation coefficient was calculated for each possible gene pair to quantify the correlation between the two genes.
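The two preprocessing steps described above, mean substitution per condition and pairwise Pearson correlation, can be sketched in plain Python. The data layout (one list per gene, `None` for a missing point) and the function names are assumptions made for illustration.

```python
import math

def impute_column_means(ratios):
    """Replace None entries by the mean of their column (one condition)."""
    n_cond = len(ratios[0])
    means = []
    for c in range(n_cond):
        observed = [row[c] for row in ratios if row[c] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[c] if v is None else v for c, v in enumerate(row)]
            for row in ratios]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den

# Hypothetical 3 genes x 3 conditions, with one missing point.
profiles = impute_column_means([[1.0, 2.0, 3.0],
                                [2.0, None, 6.0],
                                [3.0, 2.0, 1.0]])
r_02 = pearson(profiles[0], profiles[2])
```

The sketch assumes every condition has at least one observed value and that no profile is constant; both cases would need explicit handling on real data.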
Subcellular localization data
We used the genome-scale protein subcellular localization data obtained from green fluorescent protein (GFP)-tagged yeast strains (26). In these data, 4156 proteins were assigned to 22 distinct subcellular localization categories. The data are available at http://yeastgfp.ucsf.edu/.
Genomic sequence data
We downloaded the genomic sequence and the protein annotation data of four species from public databases: budding yeast S.cerevisiae (http://genome-www.stanford.edu/Saccharomyces/), Arabidopsis thaliana (http://www.arabidopsis.org/), Drosophila melanogaster (http://flybase.bio.indiana.edu/) and Caenorhabditis elegans (http://www.wormbase.org/).
| METHODS |
|---|
Measurement of protein function similarity
A particular gene product can be characterized with different types of functions, including molecular function at the biochemical level (e.g. cyclase or kinase, whose annotation is often more related to sequence similarity and protein structure) and biological process at the cellular level (e.g. pyrimidine metabolism or signal transduction, which is often revealed in the high-throughput data of protein interaction and gene expression profiles). In our study, the function annotation of a protein is defined by the GO biological process (21). The GO biological process ontology is available at http://www.geneontology.org. It has a hierarchical structure with multiple inheritance. We used the GO biological process classification, as of November 2003, to assign functions to unannotated proteins in our study. After acquiring the biological process functional annotation for the known proteins along with their GO Identification (ID), we generated a numerical GO INDEX, which represents the hierarchical structure of the classification. The deeper the level of the GO INDEX, the more specific the function assigned to a protein. The maximum level of the GO INDEX is 12. The following shows an example of the GO INDEX hierarchy, with the numbers on the left giving the GO INDICES and the numbers in brackets indicating the GO IDs:
- 2 cellular process (GO:0009987)
  - 2-1 cell communication (GO:0007154)
    - 2-1-8 signal transduction (GO:0007165)
      - 2-1-8-1 cell surface receptor linked signal transduction (GO:0007166)
        - 2-1-8-1-4 G-protein coupled receptor protein signaling pathway (GO:0030454)
          - 2-1-8-4-4-12 signal transduction during conjugation with cellular fusion (GO:0000750)
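Because a GO INDEX encodes the path from the root, the level of an index and the deepest level shared by two indices can be read directly off the string. The sketch below assumes that two proteins share a function at level k when their GO INDICES agree on the first k components; the function names are illustrative only.

```python
def go_index_level(index):
    """Number of components in a GO INDEX such as '2-1-8-1'."""
    return len(index.split("-"))

def shared_level(index_a, index_b):
    """Deepest level at which two GO INDICES share a common prefix."""
    level = 0
    for a, b in zip(index_a.split("-"), index_b.split("-")):
        if a != b:
            break
        level += 1
    return level

# 'signal transduction' and one of its descendants agree down to level 3.
depth = shared_level("2-1-8", "2-1-8-1")
```

A protein annotated with several GO INDICES would be compared by taking the maximum shared level over all pairs of its indices, which is again an assumption of this sketch.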
Calculation of Bayesian probabilities
We calculated probabilities for two genes to share the same function based on different types of high-throughput data, i.e. microarray data, protein binary interaction data and protein complex data. Given two genes are correlated in gene expression with Pearson correlation coefficient r in microarray data (Mr), the posterior probability that two genes have the same function, p(S|Mr), is computed using the Bayes' formula:
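$$p(S \mid M_r) = \frac{p(M_r \mid S)\,p(S)}{p(M_r)} \tag{1}$$
where $p(M_r \mid S)$ is the probability that two genes sharing the same function are correlated in expression with correlation coefficient $r$, $p(S)$ is the prior probability that two genes share the same function, and $p(M_r)$ is the frequency of gene pairs with correlation coefficient $r$.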
To quantify the gene function relationship among the correlated gene expression pairs, we calculated the probabilities of such pairs sharing the same function at each GO INDEX level. The probability of sharing the same function is higher for broad functional categories (the high-order GO INDEX levels) and for genes that are highly correlated in their expression profiles (Figure 1A). Figure 1B shows that highly correlated gene expression pairs carry information about functional relationship in comparison with random pairs. Based on Figure 1, we decided to consider only pairs with gene expression profile correlation coefficient ≥0.7 for function prediction, as other pairs carry little information for function prediction. The estimated probabilities of sharing the same function for gene pairs with r ≥ 0.7 were smoothed using a monotone regression function [the pool-adjacent-violators algorithm (27)] for protein function prediction. We also integrated protein subcellular localization information into the probability calculations for microarray data. As shown in Figure 2, two genes with correlated gene expression profiles are more likely to have the same function if they share the same cellular compartment.
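The pool-adjacent-violators step can be illustrated with a short, generic implementation. It assumes the raw probabilities are listed in order of increasing correlation coefficient and that the smoothed curve should be non-decreasing; the binning and any weighting actually used in the study are not specified here.

```python
def pav_non_decreasing(values, weights=None):
    """Pool-adjacent-violators fit of a non-decreasing sequence."""
    if weights is None:
        weights = [1.0] * len(values)
    blocks = []  # each block: [mean, total weight, number of points]
    for v, w in zip(values, weights):
        blocks.append([float(v), float(w), 1])
        # Merge backwards while the monotone order is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, n1 + n2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted

# Hypothetical raw estimates for four correlation bins (r increasing).
smoothed = pav_non_decreasing([0.2, 0.5, 0.4, 0.7])
```

In this example the two middle bins violate monotonicity and are pooled to their average, while the outer bins are left unchanged.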
For protein binary interaction (B), the probability that two proteins have the same function, p(S|B), is computed as:
$$p(S \mid B) = \frac{p(B \mid S)\,p(S)}{p(B)} \tag{2}$$
Similarly, given that two proteins are in the same complex, i.e. have a complex interaction (C), we can estimate the probability of the two proteins having the same function, p(S|C), as:
$$p(S \mid C) = \frac{p(C \mid S)\,p(S)}{p(C)} \tag{3}$$
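As a hedged illustration of how such a posterior could be estimated from annotated protein pairs, the sketch below applies Bayes' formula to simple relative frequencies. The input format, a list of (has_evidence, same_function) flags, is an assumption, and `posterior_same_function` is not a function from the study.

```python
def posterior_same_function(pairs):
    """pairs: iterable of (has_evidence, same_function) booleans.

    Returns p(S|E) computed as p(E|S) * p(S) / p(E).
    """
    pairs = list(pairs)
    total = len(pairs)
    n_s = sum(1 for _, s in pairs if s)
    n_e = sum(1 for e, _ in pairs if e)
    n_es = sum(1 for e, s in pairs if e and s)
    p_s = n_s / total          # prior: pairs sharing a function
    p_e = n_e / total          # frequency of the evidence
    p_e_given_s = n_es / n_s   # likelihood of evidence given same function
    return p_e_given_s * p_s / p_e

# Hypothetical counts: 4 interacting pairs, 3 of which share a function.
toy = ([(True, True)] * 3 + [(True, False)] +
       [(False, True)] * 2 + [(False, False)] * 4)
p = posterior_same_function(toy)
```

With frequencies estimated from the same sample, the result reduces to the fraction of evidence-bearing pairs that share a function, here 3 out of 4.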
The analysis result of the protein–protein interaction data is shown in Figure 3, which gives the normalized ratios of protein–protein interaction pairs against random pairs for sharing the same GO INDEX level. Since the ratio is well above 1, particularly for more specific function categories, there clearly exists a relationship between the protein–protein interaction data and similarity in function. Such relationships can be utilized to make function predictions. It is assumed that if the protein interaction pairs are evolutionarily conserved, they are more likely to share the same function, since protein interaction might put constraints on sequence divergence (28). Based on sequence comparison, we added the evolutionary information into the calculation of the probability that interacting proteins share the same function. For each protein in S.cerevisiae, its putative orthologs in three other distantly related species (A.thaliana, D.melanogaster and C.elegans) were identified using the reciprocal search method (29). Thus, the protein interaction data can be classified into two subsets: (i) interacting pairs {Pi, Pj} for which both proteins i and j have orthologs in at least one of the three species; and (ii) the remaining data. For each subset we calculated its Bayesian probability (Figure 4). The interaction pairs in subset (i) can be considered as co-evolved, and they indeed have higher probabilities of sharing the same function, as shown in Figure 4.
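The split into the two subsets can be sketched as below. The sketch reads the criterion as "each protein of the pair has a putative ortholog in at least one of the three species" (not necessarily the same species), which is one possible interpretation; the ortholog table and names are hypothetical.

```python
def split_by_conservation(pairs, orthologs):
    """orthologs: dict protein -> set of species with a putative ortholog."""
    conserved, remaining = [], []
    for a, b in pairs:
        # A missing or empty entry means no ortholog was found.
        if orthologs.get(a) and orthologs.get(b):
            conserved.append((a, b))
        else:
            remaining.append((a, b))
    return conserved, remaining

# Hypothetical ortholog assignments for three proteins.
table = {"P1": {"fly"}, "P2": {"worm"}, "P3": set()}
subset_i, subset_ii = split_by_conservation(
    [("P1", "P2"), ("P1", "P3"), ("P2", "P4")], table)
```

Separate Bayesian probabilities would then be estimated for `subset_i` and `subset_ii`.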
Protein function prediction
Local prediction
In the local prediction of an unannotated protein using its immediate neighbors in the network graph, we follow the idea of 'guilt by association', i.e. if an interaction partner of the unannotated protein x under study has a known function, x may share that function with a probability determined by the high-throughput data linking x and its partner. We identify the possible interactors for protein x in each high-throughput data type (protein binary interaction, protein complex interaction and microarray gene expression with correlation coefficient r ≥ 0.7). We assign functions to the unannotated proteins on the basis of common functions identified among the annotated interaction partners, using the probabilities described in the previous section on Calculation of Bayesian probabilities. Furthermore, we assume that the information contents for protein function prediction from different sources of high-throughput data or different interaction partners are independent, based on the earlier suggestion that the information from different high-throughput data is conditionally uncorrelated (30,31). A protein can belong to one or more GO INDICES, depending upon its interaction partners and their functions. For example, in Figure 5, protein x is an unannotated protein. Proteins a, b and c, which interact with x, have known functions. Let Fl, l = 1, 2,..., n, represent the collection of all the functions that proteins a, b and c have. A likelihood score function for protein x to have function Fl, G(Fl, x), is then defined as:
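$$G(F_l, x) = 1 - \left[1 - P'(S_l \mid M)\right]\left[1 - P'(S_l \mid B)\right]\left[1 - P'(S_l \mid C)\right] \tag{4}$$
where $S_l$ denotes the event that two proteins share function $F_l$, and $P'(S_l \mid M)$, $P'(S_l \mid B)$ and $P'(S_l \mid C)$ are the probabilities of that event given the evidence from gene expression correlation coefficient ≥0.7 (M), protein binary interaction (B) and protein complex interaction (C), respectively. In each type of high-throughput data, one unannotated protein might have multiple interaction partners with function $F_l$. Suppose that there are $n_M$, $n_B$ and $n_C$ interaction partners with function $F_l$ in the three types of high-throughput data, respectively. $P'(S_l \mid M)$, $P'(S_l \mid B)$ and $P'(S_l \mid C)$ in Equation 4 are calculated as:
$$P'(S_l \mid M) = 1 - \prod_{j=1}^{n_M} \left[1 - P_j(S_l \mid M)\right] \tag{5}$$
$$P'(S_l \mid B) = 1 - \prod_{j=1}^{n_B} \left[1 - P_j(S_l \mid B)\right] \tag{6}$$
$$P'(S_l \mid C) = 1 - \prod_{j=1}^{n_C} \left[1 - P_j(S_l \mid C)\right] \tag{7}$$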
Pj(Sl|M), Pj(Sl|B) and Pj(Sl|C) are the estimated probabilities retrieved from the probability curves calculated in the previous section. We defined the likelihood score G(Fl, x) as the Reliability score for each function Fl. The final predictions are sorted by the Reliability score of each predicted GO INDEX. The Reliability score represents the probability for the unannotated protein to have function Fl, assuming that all the evidence from the high-throughput data is independent and only applicable to immediate neighbors in the network.
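Under the stated independence assumption, one natural way to combine the evidence is the "at least one holds" form 1 − ∏(1 − p), applied first across partners within a data type and then across the three data types. The sketch below implements that form as an assumption; the function names and the probabilities in the example are hypothetical.

```python
def combine_independent(probabilities):
    """Probability that at least one independent piece of evidence holds."""
    miss = 1.0
    for p in probabilities:
        miss *= 1.0 - p
    return 1.0 - miss

def reliability_score(p_microarray, p_binary, p_complex):
    """Each argument lists the per-partner probabilities for one data type."""
    per_type = [combine_independent(ps)
                for ps in (p_microarray, p_binary, p_complex)]
    return combine_independent(per_type)

# One co-expressed partner and one binary interactor, each with p = 0.5:
score = reliability_score([0.5], [0.5], [])
```

A data type with no partner carrying the function contributes a probability of zero and therefore leaves the score unchanged.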
Global prediction
The major limitation of the local prediction method is that it only uses the information of immediate neighbors in a graph to predict a protein's function. In some cases, an uncharacterized protein may not have any interacting partner with known function annotation, and its function cannot be predicted using the local prediction method. In addition, the global properties of the graph are underutilized, since this analysis does not include the links among proteins of unknown function. In Figure 6, proteins 1, 2, 3 and 4 are unannotated proteins and proteins 5, 6, 7 and 8 are annotated proteins with known functions. If we only use the local prediction method, the functions of proteins 3 and 4 can be predicted but the functions of proteins 1 and 2 cannot, since all the neighbors of proteins 1 and 2 are unannotated proteins. Moreover, the contributions to the function assignment for protein 4 come not only from the neighbor proteins 7 and 8, whose functions are already known, but also from protein 1 once its functions are predicted through the following information propagation: proteins 5 and 6 → protein 3 → protein 2 → protein 1. Hence, the functional annotation of uncharacterized proteins should not only be decided by their direct neighbors, but also be controlled by the global configuration of the interaction network. Based on such a global optimization strategy, we developed a new approach for predicting protein function. We used the Boltzmann machine to characterize the global stochastic behavior of the network. A protein can be assigned to multiple functional classes, each with a certain probability.
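Which unannotated proteins can receive information at all, directly or through a chain of unannotated neighbors, is a plain graph reachability question. The breadth-first sketch below uses a toy edge list modeled loosely on the Figure 6 description; the exact edges and the extra pair (9, 10) are assumptions for illustration.

```python
from collections import deque

def reachable_from_annotated(edges, annotated):
    """Unannotated nodes connected, directly or indirectly, to an annotated one."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = set(n for n in annotated if n in adjacency)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen - set(annotated)

toy_edges = [(5, 3), (6, 3), (3, 2), (2, 1), (1, 4), (4, 7), (4, 8), (9, 10)]
reachable = reachable_from_annotated(toy_edges, {5, 6, 7, 8})
```

Proteins 9 and 10 form an isolated component without any annotated member, so no method based on the network alone could assign them a function.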
In the Boltzmann machine, we consider a physical system with a set of states $\alpha$, each of which has energy $H_\alpha$. In thermal equilibrium, given a temperature $T$, each of the possible states $\alpha$ occurs with probability:
$$P_\alpha = \frac{e^{-H_\alpha / K_B T}}{\sum_{\alpha'} e^{-H_{\alpha'} / K_B T}} \tag{8}$$
where the denominator is the normalizing factor summed over all states and $K_B$ is Boltzmann's constant. This is called the Boltzmann–Gibbs distribution (32). It is usually derived from general assumptions about microscopic dynamics, and it can also be applied to a stochastic network. In an undirected graphical model with binary-valued nodes, each node (protein) $i$ in the network has only one state value $Z$ (1 or 0). In our case, $Z = 1$ means that the corresponding node (protein/gene) has either known functions or predicted functions assigned to it. Now, we consider the system going through a dynamic process from non-equilibrium to equilibrium, which corresponds to the optimization process for the function prediction. For the state at time $t$ (optimization iteration step $t$), node $i$ has the probability $P(Z_{t,i} = 1 \mid Z_{t-1, j \neq i})$ for $Z_{t,i}$ to be 1, and the probability is given as a sigmoid function of the inputs from all the other nodes at time $t - 1$:
$$P(Z_{t,i} = 1 \mid Z_{t-1, j \neq i}) = \frac{1}{1 + e^{-\beta \sum_{j \neq i} W_{ij} Z_{t-1,j}}} \tag{9}$$
where $\beta$ is a parameter inversely proportional to the annealing temperature and $W_{ij}$ is the weight of the edge connecting proteins $i$ and $j$ in the interaction graph. $W_{ij}$ combines the evidence from gene expression correlation coefficient ≥0.7 (M), protein binary interaction (B) and protein complex interaction (C):
![]() | (10) |
j is the modifying weight:
![]() | (11) |
To achieve the global optimization, we applied the simulated annealing technique as follows (Figure 7). First, we set the initial state of every unannotated protein (node) to 0 or 1 at random. The state of any annotated protein is always 1. If an unannotated protein is assigned the state 1, its function is predicted from its immediate neighbors with known functions, using the local prediction method. Next, starting with a high temperature, we pick a node i, compute Pi according to Equation 9, and update its state to 1 if the probability Pi is above a certain threshold. Each update of the function prediction is based on the node's immediate neighbors with state 1 (i.e. with known functions or with functions predicted in the previous steps), using the local prediction method. The iterations continue until all the nodes in the network reach equilibrium. Figure 8 shows the flow chart of this process. With gradual cooling, the system is likely to settle in a globally optimal state of the network configuration (Figure 7D).
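The state-update loop can be sketched as follows, under several loudly stated assumptions: edge weights are taken as given rather than computed from the evidence, the node probability is a sigmoid of the weighted input divided by the temperature, a node is set back to 0 when its probability does not exceed the threshold, and the function-prediction step performed at each update is omitted. The names `anneal_states`, `threshold` and `max_sweeps` are hypothetical.

```python
import math
import random

def anneal_states(weights, annotated, temperatures,
                  threshold=0.6, max_sweeps=100, seed=0):
    """weights: dict node -> {neighbor: weight}; annotated: set of nodes."""
    rng = random.Random(seed)
    # Annotated nodes are fixed at 1; the others start at random.
    state = {n: 1 if n in annotated else rng.randint(0, 1) for n in weights}
    for temp in temperatures:  # a decreasing schedule models the cooling
        for _ in range(max_sweeps):  # sweep until nothing changes
            changed = False
            for node in weights:
                if node in annotated:
                    continue
                total = sum(w * state[j] for j, w in weights[node].items())
                p_on = 1.0 / (1.0 + math.exp(-total / temp))
                new = 1 if p_on > threshold else 0
                if new != state[node]:
                    state[node] = new
                    changed = True
            if not changed:
                break
    return state
```

In a fuller implementation, each node switched to state 1 would also have its functions re-predicted from its state-1 neighbors with the local prediction method, and the weights would reflect how reliable each neighbor's annotation is.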
| RESULTS |
|---|
We have implemented three methods for predicting the protein functions as described above, i.e. (i) local prediction without integrating evolution and localization information; (ii) local prediction with integrating evolution and localization information; and (iii) global prediction with integrating evolution and localization information. We evaluated the performance of the three methods using all annotated proteins in yeast. The performance of our prediction methods was evaluated using two different methods: function prediction accuracy at the level of protein, and sensitivity and specificity of prediction at the level of function.
We first measured the performance of our methods at the level of proteins, i.e. a correct prediction for a protein means that at least one predicted function is the same as a known function for the protein. For validation, we divided the 4044 annotated proteins with known GO INDICES randomly into two sets: 75% for the training set and 25% for the test set. All a priori probabilities were calculated from the training set and used for function prediction in the test set. Figure 9 shows the percentage of proteins whose functions can be predicted accurately. We found that the localization and evolution information improved the prediction. The global method has the best performance, since it utilizes the maximal available information. Moreover, functions can be predicted for 84% of the proteins in the test set using the local prediction method and for 87% using the global method, since the global method can assign functions to proteins that only have unannotated interaction partners. The functions of the remaining 13% of proteins cannot be predicted, since they are not connected to any other protein with known function, either directly or indirectly, in the currently available high-throughput data.
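The protein-level criterion, at least one predicted function matching a known function, can be written down in a few lines. The sketch below divides by the number of test proteins; whether the reported percentages use this denominator or only the proteins that received a prediction is not stated, so that choice is an assumption.

```python
def protein_level_accuracy(known, predicted):
    """known, predicted: dict protein -> set of function labels."""
    hits = sum(1 for p in known if known[p] & predicted.get(p, set()))
    return hits / len(known)

# Hypothetical test set of four proteins; protein 'd' gets no prediction.
known = {"a": {"F1"}, "b": {"F2"}, "c": {"F3"}, "d": {"F4"}}
predicted = {"a": {"F1", "F9"}, "b": {"F8"}, "c": {"F3"}}
accuracy = protein_level_accuracy(known, predicted)
```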
We further used sensitivity (SN) and specificity (SP) to measure the performance of our methods at the level of functions (one protein can have multiple functions) using 10-fold cross-validation. We assigned all 4044 annotated proteins with known GO INDICES to folds 1–10. Each time, we picked one fold as the test data set and used the other nine folds as training data to calculate prior probabilities. We use SN to determine the success rate of the method and SP to assess the confidence in the predictions (14). For a given set of proteins K, let ni be the number of known functions for protein Pi. Let mi be the number of functions predicted for protein Pi by the method. Let ki be the number of predicted functions that are correct (the same as a known function). Thus, SN and SP are defined as:
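$$\mathrm{SN} = \frac{\sum_{P_i \in K} k_i}{\sum_{P_i \in K} n_i} \tag{12}$$
$$\mathrm{SP} = \frac{\sum_{P_i \in K} k_i}{\sum_{P_i \in K} m_i} \tag{13}$$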
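Taking SN as the pooled ratio of correct predictions to known functions and SP as the pooled ratio of correct predictions to all predictions (an assumption consistent with the definitions of ni, mi and ki above), the two measures can be computed as follows. The dictionaries in the example are hypothetical.

```python
def sensitivity_specificity(known, predicted):
    """known, predicted: dict protein -> set of function labels."""
    n_total = sum(len(funcs) for funcs in known.values())
    m_total = sum(len(predicted.get(p, set())) for p in known)
    k_total = sum(len(known[p] & predicted.get(p, set())) for p in known)
    sn = k_total / n_total if n_total else 0.0
    sp = k_total / m_total if m_total else 0.0
    return sn, sp

sn, sp = sensitivity_specificity(
    {"a": {"F1", "F2"}, "b": {"F3"}},
    {"a": {"F1", "F4"}, "b": {"F3"}})
```

In 10-fold cross-validation the counts would be accumulated over the ten test folds before the ratios are taken, or the per-fold values averaged; which variant was used is not specified here.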
Using all the 4044 annotated proteins with known GO INDICES as the training set, we were able to assign functions to 1802 out of the 2280 unannotated proteins in yeast at different levels of function (GO INDICES). The detailed prediction results are available at http://digbio.missouri.edu/~ychen/ProFunPred. The number of unannotated genes with function predictions with respect to the specificity and GO INDEX levels can be found in Table 1. Using our method, we assign not only general functional categories but also specific functions to unannotated proteins. For example, Table 2 shows 14 genes whose predicted functions have Reliability score ≥0.9 and GO INDEX level ≥8. A total of 104 unannotated proteins were assigned functions with Reliability score ≥0.9 and GO INDEX level ≥7. The MS Excel file of these 104 proteins can be downloaded at http://digbio.missouri.edu/~ychen/ProFunPred.
| DISCUSSION |
|---|
Systematic and automated prediction of gene function using high-throughput data represents a major challenge in the post-genomic era. To address this challenge, we developed a systematic method to assign function in an automated fashion, using integrated computational analysis of yeast high-throughput data, including binary interaction, protein complexes and gene expression microarray data, together with the GO biological process functional annotation. The main contribution of our work is to provide a framework for integrating heterogeneous biological information for genome-scale protein function prediction. In addition, we incorporate protein subcellular localization and evolution information into function prediction. It is worth mentioning that some predictions can be used as input data for our framework, although predicted results are not as reliable as experimental data. We used predicted protein–protein interactions, together with microarray data, for gene function prediction in A.thaliana (33). In addition, subcellular localization can be predicted with good confidence (34,35), and this information may help gene function prediction as well. Our method is robust in reaching a global optimization with simulated annealing: starting from six different sets of randomly selected starting points, we obtained exactly the same result, as shown in Table 1.
Our methods assign functions to unannotated proteins on the genomic scale. To our knowledge, our method covers more unannotated proteins for functional prediction than any other method published previously (see Table 3). From the 29 proteins listed in Table 1 of the paper of Schwikowski et al. (11) that have two or more interacting proteins, we randomly chose five proteins that remain unannotated to compare the prediction results between our method and the methods of Schwikowski et al. (11), Deng et al. (14), Letovsky and Kasif (15), Vazquez et al. (19) and Karaoz et al. (20) (see Table 4). One improvement of our method is that we can assign unannotated proteins to deeper levels of biological processes, while most other methods make protein function predictions using the less detailed functional categories defined in the YPD (http://proteome.incyte.com) or MIPS (http://mips.gsf.de) databases. Some of the increased performance of our method might be due to the different sizes of the data sets used in different studies, but we believe this does not account for the major improvement of our method. The major contribution is that our method integrates multiple sources of data, by combining and propagating information systematically across the entire network, based on the global optimization. Moreover, using our global prediction method, we can assign functions to proteins whose interacting partners do not have any known function, as shown in Figure 11 and Table 5. Our predictions can provide biologists with hypotheses to study and help them design specific experiments to validate the predicted functions, using tools such as mutagenesis. Such a combination of computational methods and experiments may discover biological functions much more efficiently than traditional approaches.
Future work includes exploring better optimization methods and statistical models. To solve the optimization problem in Boltzmann machine, in contrast to the simulated annealing technique, a Bayesian learning of posterior distributions over parameters (36) provides a more elaborate and systematic estimation of maximum likelihood. In addition, supervised learning methods such as Conditional Random Fields (37) can also be alternative schemes to model this stochastic learning process. Furthermore, we will develop more elaborate model-based integrations to address the dependences among different high-throughput data for protein function prediction.
| ACKNOWLEDGEMENTS |
|---|
We would like to thank Drs Jeff Becker, Ying Xu, Loren Hauser and Qiang Zhao for helpful discussions. We would also like to thank the two anonymous reviewers for their helpful suggestions and comments. This research was supported in part by the US Department of Energy's Genomes to Life program (http://www.doegenomestolife.org) under the project 'Carbon Sequestration in Synechococcus Sp.: From Molecular Machines to Hierarchical Modeling' (www.genomes-to-life.org). It was also partially funded by the National Science Foundation (EIA-0325386).
| REFERENCES |
|---|
1. Goffeau,A., Barrell,B.G., Bussey,H., Davis,R.W., Dujon,B., Feldmann,H., Galibert,F., Hoheisel,J.D., Jacq,C., Johnston,M., Louis,E.J., Mewes,H.W., Murakami,Y., Philippsen,P., Tettelin,H. and Oliver,S.G. (1996) Life with 6000 genes. Science, 546, 346–352.
2. Pearson,W. and Lipman,D. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448.
3. Altschul,S., Madden,T., Schaffer,A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
4. Marcotte,E., Pellegrini,M., Thompson,M., Yeates,T. and Eisenberg,D. (1999) A combined algorithm for genome-wide prediction of protein function. Nature, 402, 83–86.
5. Pellegrini,M., Marcotte,E., Thompson,M., Eisenberg,D. and Yeates,T. (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl Acad. Sci. USA, 96, 4285–4288.
6. Chien,C., Bartel,P., Sternglanz,R. and Fields,S. (1991) The two-hybrid system: a method to identify and clone genes for proteins that interact with a protein of interest. Proc. Natl Acad. Sci. USA, 88, 9578–9582.
7. Gavin,A., Bosche,M., Krause,R., Grandi,P., Marzioch,M., Bauer,A., Schultz,J., Rick,J., Michon,A. and Cruciat,C. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415, 141–147.
8. Ho,Y., Gruhler,A., Heilbut,A., Bader,G.D., Moore,L., Adams,S., Millar,A., Taylor,P., Bennett,K., Boutilier,K. et al. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415, 180–183.
9. Eisen,M., Spellman,P., Brown,P. and Botstein,D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 14863–14868.
10. Brown,M., Grundy,W., Lin,D., Cristianini,N., Sugnet,C., Furey,T., Ares,M. and Haussler,D. (2000) Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl Acad. Sci. USA, 97, 262–267.
11. Schwikowski,B., Uetz,P. and Fields,S. (2000) A network of protein–protein interactions in yeast. Nat. Biotechnol., 18, 1257–1261.
12. Hishigaki,H., Nakai,K., Ono,T., Tanigami,A. and Takagi,T. (2001) Assessment of prediction accuracy of protein function from protein–protein interaction data. Yeast, 18, 523–531.
13. Clare,A. and King,R.D. (2003) Predicting gene function in Saccharomyces cerevisiae. Bioinformatics, 19, II42–II49.
14. Deng,M.H., Zhang,K., Mehta,S., Chen,T. and Sun,F.Z. (2002) Prediction of protein function using protein–protein interaction data. In Proceedings of the first IEEE Computer Society bioinformatics conference (CSB2002), Stanford University, Palo Alto, CA, August 14–16, pp. 117–126.
15. Letovsky,S. and Kasif,S. (2003) Predicting protein function from protein/protein interaction data: a probabilistic approach. Bioinformatics, 19, I197–I204.
16. Troyanskaya,O., Dolinski,K., Owen,A., Altman,R. and Botstein,D. (2003) A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc. Natl Acad. Sci. USA, 100, 8348–8353.
17. Chen,Y. and Xu,D. (2003) Computation analysis of high-throughput protein–protein interaction data. Curr. Protein Pept. Sci., 4, 159–181.
18. Winkler,R.L. (1972) An Introduction to Bayesian Inference and Decision. Holt, Rinehart and Winston Inc., Austin, TX.
- Vazquez,A., Flammini,A., Maritan,A. and Vespignani,A. ( (2003) ) Global protein function prediction from proteinprotein interaction networks. Nat. Biotechnol., , 21, , 697700.[CrossRef][Web of Science][Medline]
- Karaoz,U., Murali,T.M., Letovsky,S., Zheng,Y., Ding,C., Cantor,C.R. and Kasif,S. ( (2004) ) Whole-genome annotation by using evidence integration in functional-linkage networks. Proc. Natl Acad. Sci. USA, , 101, , 28882893.
[Abstract/Free Full Text] - The Gene Ontology Consortium ( (2000) ) Gene Ontology: tool for the unification of biology. Nature Genet., , 25, , 2529.[CrossRef][Web of Science][Medline]
- Mewes,H., Frishman,D., Guldener,U., Mannhaupt,G., Mayer,K., Mokrejs,M., Morgenstern,B., Munsterkotter,M., Rudd,S. and Weil,B. ( (2002) ) MIPS: a database for genomes and protein sequences. Nucleic Acids Res., , 30, , 3134.
[Abstract/Free Full Text] - Uetz,P., Giot,L., Cagney,G., Mansfield,T.A., Judson,R.S., Knight,J.R., Lockshon,D., Narayan,V., Srinivasan,M., Pochart,P. et al. ( (2000) ) A comprehensive analysis of proteinprotein interactions in Saccharomyces cerevisiae. Nature, , 403, , 623627.[CrossRef][Medline]
- Ito,T., Tashiro,K., Muta,S., Ozawa,R., Chiba,T., Nishizawa,M., Yamamoto,K., Kuhara,S. and Sakaki,Y. ( (2001) ) Toward a proteinprotein interaction map of the budding yeast: a comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proc. Natl Acad. Sci. USA, , 98, , 45694574.
[Abstract/Free Full Text] - Gasch,A.P., Spellman,P.T., Kao,C.M., Carmel-Harel,O., Eisen,M.B., Storz,G., Botstein,D. and Brown,P.O. ( (2000) ) Genomic expression programs in the response of yeast cells to environmental changes. Mol. Cell. Biol., , 11, , 42414257.
- Huh,W.K., Falvo,J.V., Gerke,L.C., Carroll,A.S., Howson,R.W., Weissman,J.S., O'Shea,E.K. ( (2003) ) Global analysis of protein localization in budding yeast. Nature, , 425, , 686691.[CrossRef][Medline]
- Haerdle,W. ( (1992) ) Applied Nonparametric Regression. Cambridge University Press, Cambridge, UK.
- Teichmann,S.A. ( (2002) ) The constraints of proteinprotein interactions place on sequence divergence. J. Mol. Biol., , 324, , 399407.[CrossRef][Web of Science][Medline]
- Chen,Y. and Xu,D. ( (2004) ) Genome-scale understanding protein dispensability through machine-learning analysis of high-throughput data. Bioinformatics, , in press.
- Jansen,R., Yu,H., Greenbaum,D., Kluger,Y., Krogan,N.J., Chung,S., Emili,A., Snyder,M., Greenblatt,J.F. and Gerstein,M. ( (2003) ) A Bayesian networks approach for predicting proteinprotein interactions from genomic data. Science, , 302, , 449453.
[Abstract/Free Full Text] - Asthana,S., King,O.D., Gibbons,F.D. and Roth,F.P. ( (2004) ) Predicting protein complex membership using probabilistic network. Genome Res., , 14, , 11701175.
[Abstract/Free Full Text] - Parisi,G. ( (1988) ) Statistical Field Theory. Addison-Wesley, Reading, MA.
- Joshi,T., Chen,Y., Alexandrov,N. and Xu,D. ( (2004) ) Cellular function prediction and biological pathway discovery in Arabidopsis thaliana using microarray data. In Proceedings of the 26th Annual International Conference of the IEEE EMBS, San Francisco, CA, pp. 28812884.
- Gardy,J.L., Spencer,C., Wang,K., Ester,M., Tusnady,G.E., Simon,I., Hua,S., deFays,K., Lambert,C., Nakai,K. and Brinkman,F.S. ( (2003) ) PSORT-B: improving protein subcellular localization prediction for gram-negative bacteria. Nucleic Acids Res., , 31, , 36133617.
[Abstract/Free Full Text] - Scott,M.S., Thomas,D.Y. and Hallett,M.T. ( (2004) ) Predicting subcellular localization via protein motif co-occurrence. Genome Res., , 14, , 19571966.
[Abstract/Free Full Text] - Ackley,D.H., Hinton,G.E. and Sejnowski,T.J. ( (1985) ) A learning algorithms for Boltzmann machines. Cognit. Sci., , 9, , 147169.[CrossRef]
- Lafferty,J., McCallum,A. and Pereira,F. ( (2001) ) Conditional random fields: probabilistic models for segmenting and labeling sequence data, In International Conference on Machine Learning (ICML), Williams College, MA, pp. 282289.