Scoring functions for computational algorithms applicable to the design of spiked oligonucleotides
Scoring functions for computational algorithms applicable to the design of spiked oligonucleotidesLars Juhl Jensen, Kim Vilbour Andersen, Allan Svendsen and Titus Kretzschmar*
Department of Enzyme Design, Novo Nordisk A/S, DK-2880 Bagsværd, Denmark
Received December 3, 1997;Accepted December 4, 1997
ABSTRACT
Protein engineering by inserting stretches of random DNA sequences into target genes in combination with adequate screening or selection methods is a versatile technique to elucidate and improve protein functions. Established compounds for generating semi-random DNA sequences are spiked oligonucleotides which are synthesised by interspersing wild type (wt) nucleotides of the target sequence with certain amounts of other nucleotides. Directed spiking strategies reduce the complexity of a library to a manageable format compared with completely random libraries. Computational algorithms render feasible the calculation of appropriate nucleotide mixtures to encode specified amino acid subpopulations. The crucial element in the ranking of spiked codons generated during an iterative algorithm is the scoring function. In this report three scoring functions are analysed: the sum-of-square-differences function s, a modified cubic function c, and a scoring function m derived from maximum likelihood considerations. The impact of these scoring functions on calculated amino acid distributions is demonstrated by an example of mutagenising a domain surrounding the active site serine of subtilisin-like proteases. At default weight settings of one for each amino acid, the new scoring function m is superior to functions s and c in finding matches to a given amino acid population.
With the advent of efficient site-directed mutagenesis methods, protein engineering became a widely-used method for analysing structure-function relationships (1,2). It has produced a wealth of information on protein functions and is a precious tool for the closer understanding of binding events, enzymatic activity, or stability of proteins on the molecular level. Site-directed mutagenesis is typically guided by some pre-knowledge of structural features of the protein and putative binding sites or conserved motifs.
In order to obtain higher degrees of diversity at a certain region of the protein and to accelerate the protein engineering process, random saturation mutagenesis (or combinatorial cassette mutagenesis) has been introduced (3-6). One or more codons consisting of equimolar mixtures of up to all four nucleotides at each position (NNN) are inserted into a gene. By this technique libraries of variants encoding all 20 natural amino acids plus the translation stop signals are created at a defined region of the gene which are then screened or selected by appropriate assays. This type of library encodes a biased pool of amino acids according to the degeneracy of the codons. A more parsimonious approach is using NN(G/C) or NN(G/T) codons which reduce the complexity at the DNA level by a factor of two per codon and result in a more even distribution of the amino acids with a smaller percentage of stop signals (7,8). NN(G/C) or NN(G/T) mutagenesis confines the window of the mutagenic area to about five codons, i.e., 3.4 * 107 variants at the DNA level with ~15% harbouring stop codons, because of limits in transforming host cells. A major disadvantage with these kinds of libraries, however, is the dilution of functionally active proteins by, e.g., misfolded, catalytically inactive, or prematurely truncated variants due to the very high density of mutant positions (9,10).
In order to overcome the constraints in window size and to drive the mutagenesis towards a larger fraction of functional molecules, three different strategies have evolved which aim at creating specified amino acid pools at certain positions based either on phylogenetic considerations, e.g., sequence alignments, on structurally derived properties of the protein, or simply on guesses.
Firstly, split-and-mix methods have been devised where during oligonucleotide synthesis the resin is split, differently processed, and mixed before proceeding with synthesis (11,12). The result is oligonucleotide pools exactly encoding a reference amino acid distribution. This approach becomes experimentally impractical when several complex split-and-mix steps are required.
Another elegant and efficient way of mutagenesis is the application of pre-formed trinucleotide phosporamidites for mixed oligodeoxyribonucleotide synthesis which also allows precise adjustment of a specified amino acid subset (13-17). The trinucleotide phosporamidites technique is not yet widespread and reagents are not commercially available.
Today's most applied method is exploiting spiked oligonucleotides. During oligonucleotide synthesis wild type (wt) bases are deliberately mixed with certain amounts of the other bases (18-25). Due to constraints imposed by the natural genetic code, spiking of oligonucleotides frequently does not result in perfect matches to a given amino acid distribution. Thus, the spiking approach has to consider the generation of an amino acid population which is as close to the reference amino acid set as possible while keeping the amount of stop codons low (12,26-28). In order to enable the calculation of appropriate nucleotide mixtures, algorithms eligible to computer application have been developed. The essential element of these algorithms is the scoring function which allows for ranking of the spiked codons according to the best fit to a given amino acid population (12,26, this report).
We designed three scoring functions and discuss here their influence on semi-random mutagenesis with spiked oligonucleotides. We have chosen an example of mutagenesis of subtilisin-like proteases. Subtilases are of significant importance in detergent industry. Protein engineering, modelling and investigations on the mechanism of action as well as on the considerable stability of subtilisins have been performed for some time (29-33).
The algorithm consists of 100 or more cycles and each cycle of 1000 iterations. Each iteration is based on a Monte-Carlo simulation: starting from any nucleotide composition of a codon a new random nucleotide composition is chosen for each position in the codon. The absolute difference between the starting and the new composition is controlled by a parameter d for each of the four nucleotides in all three positions of the codon. The parameter d is only kept constant within one iteration and decreases linearly from one to zero within one cycle. The encoded amino acid distribution is then scored by the respective scoring functions s, c, or m (see below) and compared with the starting score. If the new score is better or equal to the score of the current base composition, the new base composition will be the next starting point. If the new score is worse, the probability of keeping the new composition as new starting point equals e(-1000 * [brvbar]new_score-current_score[brvbar]). The term `better score' means, in the case of function s and function c, a lower score, and in the case of function m a higher score obtained by the algorithm. The entire algorithm was written in C++. It is compiled with the GNU C compiler and runs under Linux on a standard 100 MHz Pentium personal computer with 16 Mb of memory capacity. Performing 1000 cycles consisting each of 1000 iterations takes ~4 min of computing time.
(i) The function s is derived from the least square sum method. A basically equivalent function is also used by Tomandl et al. (12):
with xi defined as the fraction of an amino acid species i calculated by the algorithm for the current base composition, ai is the reference fraction of amino acid species i as given by the user, and wi is a user-specified weighting factor allowing for amino acids of special interest to be favoured. Default weight setting for all herein presented scoring functions is one.
(ii) The function c is based on a cubic sum calculation:
with the same definitions of xi, ai and wi as described above.
(iii) The function m is derived from maximum-likelihood considerations:
with the same definitions of xi, ai and wi as described above. The term 00 is defined as being one.
Average number of mutations. The average number of mutations per molecule is estimated from randomly generated variants by taking into account the probability of getting a non-wt amino acid at the individual positions. The number of mutations and of stop codons in each variant is counted and the average number of mutations is calculated on the basis of all species harbouring no stop codons. One thousand or more variants are generated until the estimated standard deviation of the average is <0.05.Calculation of library sizes. For the calculation of the library size that has to be screened to cover at a 95% confidence level all demanded variants having n mutant sites, the probability of the least likely specified variant with n mutant sites has to be determined. The following algorithm is applied:
(i) The probability p-wt of getting wt amino acids at all positions is calculated.
(ii) For every position i the ratio ri of the probability of getting the specified amino acid with the lowest amount divided by the probability of getting wt is calculated.
(iii) To find the probability p1 of the least likely single mutant, the probability p-wt is multiplied with the smallest of the ratios ri.
(iv) The probability p2 of the least likely double mutant is found by multiplying p1 with the second smallest ratio ri.
(v) The probabilities p3-pn are analogously calculated.
(vi) The library size which has to be screened to cover all variants with n mutant sites at a 95% confidence level equals log(1 - 0.95)/log(1 - pn).
The influence of the scoring functions s, c and m on the encoded amino acid population of mixed oligonucleotides and on the library size is shown here by the mutagenesis of six amino acids within a 10 amino acid stretch encompassing the active site serine of subtilisin proteases. For the analysis, 35 subtilisin sequences of class I type are used (30). We extracted the following distribution of amino acids at the positions 218-227 (Table 1). The wt amino acid is defined as the most abundant amino acid in the compilation of the position in question (e.g., serine in position 218, see Table 1).
The calculated base composition in a codon encoding a given amino acid distribution is deliberately randomised imitating a 1%, 5% or 10% deviation range of the oligonucleotide synthesisers. For example, when assuming a maximum of a 10% error of the synthesiser, instead of 5% of A at a certain position in the codon, random numbers ranging from 0% to 15% of A were generated. The corresponding encoded amino acid distribution is then calculated. By repeating this algorithm 105 times a standard deviation for the amount of each amino acid as a function of the maximum error possible of the oligonucleotide synthesiser is obtained (Table 8).
In this study we present a new scoring function m which meets the requirements of both getting all requested amino acids and giving reasonable initial solutions even at default weight settings. This will keep the possible subsequent optimisation of spiked oligonucleotides by weight changes in the function m at a minimum.
Before designing function m we looked at the function s, the sum-of-square-differences. In simple cases like aiming at 50% of cysteine and 50% of methionine residues problems emerged with function s. The anticipated solution is the codon (A/T)(G/T)(G/T) encoding 12.5% of each cysteine, methionine, arginine, isoleucine, leucine, phenylalanine, serine and tryptophan. With all weights at their default settings, this codon results in a score of 0.0179. But the triplet (T)(G)(25%G/25%A/50%T) yielding 50% cysteine, 25% tryptophan and 25% stops also gives a score of 0.0179 even though no methionine is obtained. In contrast, by exploiting the sum-of-cubic-differences function c for scoring, the expected (A/T)(G/T)(G/T) codon is always found. Another problem inherent to the function s is the loss of amino acids of which only small amounts were requested. We observed in several instances that one or more amino acids were lacking in the calculated solutions. An illustrative example is depicted in Table 6: neither glycine, leucine, nor lysine are obtained by the scoring function s at default settings (see Table 6, column s-def.). It should be noted that the problem could be overcome by optimising the weights, although in a time-consuming trial and error approach (see Table 6, column s-opt.). The underlying problems with scoring function s are its symmetry and the use of absolute instead of relative differences.
In order to circumvent these pitfalls the asymmetric, cubic scoring function c was designed. We noticed that with function c small amounts of specified, non-wt amino acids are easily preserved, however, in some instances unsatisfactory low amounts of the wt amino acid are retained (see for example Table 3, column c-def.) By increasing the weight for wt, it usually is possible but again rather time-consuming to come to a reasonable compromise between getting all the minor amino acids and an acceptable amount of wt. As functions c and s are sum functions, specified amino acids might be lacking in the calculated solutions, although such a case has not yet been observed with function c.
Distributions of amino acids in percentage at position 218 of subtilisin class I (30), and the resulting amino acid distribution when either scoring function s, c, or m with default (def.) or optimised (opt.) weights are applied for the design of the respective spiked codons
Distributions of amino acids in percentage at position 222 of subtilisin class I (30), and the resulting amino acid distribution when either scoring function s, c, or m with default (def.) or optimised (opt.) weights are applied for the design of the respective spiked codons.
Finally, a completely different approach was chosen for the design of the scoring function m. It was inspired by the likelihood function for the polynomial distribution. Function m is attaining values in the range [0;1]. The maximum value of one is only obtained for a perfect match with the target amino acid subset. This is also valid when aiming at non-identical amounts for the amino acid species in the given target pool. Because of the latter feature and the possibility for putting weight factors on each single amino acid, function m is much more generally applicable than the earlier described function PG (26). Compared to the functions s and c, function m has two major advantages. Firstly, if any reference amino acid is missing, the score will be zero rejecting any incomplete solution. Secondly, the function m at default weight settings gives solutions where all specified minor components and the wt fraction are kept at a balanced level (e.g., compare columns s-def., c-def. and m-def. in Tables 0 and 0). In combination with the fact that function m is much more responsive to weight changes than functions s or c because the weights are entered as powers, the optimisation of function m for a user-specified purpose is significantly less time consuming (by a factor of 5-10).
The performance of these three scoring functions is examined by mutagenising a domain of subtilisin-like proteases. The domain comprises 10 amino acid sites (positions 218-227 in subtilisin BPN') of which four are strictly conserved and the residual six sites shall be mutagenised to match as precisely as possible the amino acid distribution derived from compilation of 35 subtilases.
In positions 223 (Table 4), 224 (Table 5), and 227 (Table 7) the reference amino acid distributions were closely matched by all three scoring functions without having to change weights. The corresponding codon composition is stable against errors which might occur during oligonucleotide synthesis (e.g., Table 8, position 224).
Distributions of amino acids in percentage at position 223 of subtilisin class I (30), and the resulting amino acid distribution when either scoring function s, c, or m with default (def.) or optimised (opt.) weights are applied for the design of the respective spiked codons
Distributions of amino acids in percentage at position 224 of subtilisin class I (30), and the resulting amino acid distribution when either scoring function s, c, or m with default (def.) or optimised (opt.) weights are applied for the design of the respective spiked codons
Distributions of amino acids in percentage at position 226 of subtilisin class I (30), and the resulting amino acid distribution when either scoring function s, c, or m with default (def.) or optimised (opt.) weights are applied for the design of the respective spiked codons
Distributions of amino acids in percentage at position 227 of subtilisin class I (30), and the resulting amino acid distribution when either scoring function s, c, or m with default (def.) or optimised (opt.) weights are applied for the design of the respective spiked codons
Robustness of the amino acid distributions towards deviations in the base composition of codons caused by the imprecision of the oligonucleotide synthesising machines
Assuming perfectly synthesised spiked oligonucleotides, the encoded amino acid distribution of the codons obtained with function m at default weight settings for subtilisins class I positions 222 and 224 are given (pos.222 m-def., pos.224 m-def., see also Tables 3 and 5). Assuming deviations of 1%, 5% or 10% during the oligonucleotide synthesising process, the respective changes in the amino acid distribution are calculated as described in Methods (mean ± standard deviation).
As for position 218 (Table 2) the scoring functions were providing reasonable results. With function s at default weight settings the requested phenylalanine was not present in the solution. The function m with default weights also yielded very small amounts of phenylalanine. By modifying the weights, the percentages of phenylalanine could be increased to acceptable values. As expected, the function c yielded relatively high amounts of the specified minor components alanine, aspartate and phenylalanine, at the expense of the major components serine and asparagine.
Regarding position 222 (Table 3), a matching combination for the five specified amino acids is difficult to obtain as can easily be deduced from codon tables. All found solutions generated at least 10 different amino acids. It is mutually excluding to get ~17% alanine and >70% methionine. The function s had no difficulty in identifying spiked codons encoding ~70% methionine but at the cost of alanine, while the function c had tremendous problems in preserving methionine. The function m at default weight settings resulted in a compromise between both extremes. Again, the solution calculated with function m is robust towards potential, non-deliberately introduced deviations by oligonucleotide synthesisers (Table 8, position 222). We noticed that it was impossible within a reasonable time-frame to optimise function s by trial and error weight adjustments without losing reference amino acids.
In position 226 (Table 6) it is impossible to obtain a high quality solution due to constraints imposed by the genetic code. With function s at default weight settings, glycine, leucine and lysine were missed. It was not trivial to find a set of weights driving the amino acid pool to contain all specified amino acids. In the course of optimising the weights for all three scoring functions, function m proved to be superior with regard to time considerations.
If one had exploited one of the techniques for generating perfect matches to the given amino acid distributions in the above subtilisin example, the average number of mutations per molecule would have been 2.1. In contrast, applying the scoring function m at optimised weight settings gives ~2.5 mutant sites per molecule. The distributions of mutant sites per molecule are shown to be rather similar (Fig. 1). When aiming at covering with a 95% confidence level all transformants harbouring specified double or triple mutants, and when the function m is used for ranking the spiked codons, the corresponding libraries should consist of 8.0 * 105 and 4.4 * 107 members. These libraries are clearly larger than those obtained with an optimal technique where only 2.9 * 104 and 4.9 * 105 transformants have to be screened to cover all double and triple mutants, respectively. As many as 63% of all the members of the library created by the exploitation of the function m harbour at least one non-wanted mutation.
A. Hidalgo, A. Schliessmann, R. Molina, J. Hermoso, and U. T. Bornscheuer A one-pot, simple methodology for cassette randomisation and recombination for focused directed evolution
Protein Eng. Des. Sel.,
September 1, 2008;
21(9):
567 - 576.
[Abstract][Full Text][PDF]
M. J. Volles and P. T. Lansbury Jr A computer program for the estimation of protein and nucleic acid sequence diversity in random point mutagenesis libraries
Nucleic Acids Res.,
June 29, 2005;
33(11):
3667 - 3677.
[Abstract][Full Text][PDF]
W. Wang and J. G. Saven Designing gene libraries from protein profiles for combinatorial protein experiments
Nucleic Acids Res.,
November 1, 2002;
30(21):
e120 - e120.
[Abstract][Full Text][PDF]