Statistical Methods for Fermentation Optimization
Edwin O. Geiger
Fermentation and Biochemical Engineering Handbook

1.0 INTRODUCTION

A common problem for a biochemical engineer is to be handed a microorganism and be told he has six months to design a plant to produce the new fermentation product. Although this seems to be a formidable task, with the proper approach it can be reduced to a manageable level. There are many ways to approach the problem of optimization and design of a fermentation process. One could determine the nutritional requirements of the organism and design a medium based upon the optimum combination of each nutrient, i.e., glucose, amino acids, vitamins, minerals, etc. This approach has two drawbacks. First, it is very time-consuming to study each nutrient and determine its optimum level, let alone its interaction with other nutrients. Secondly, although knowledge of the optimal nutritional requirements is useful in designing a medium, this knowledge is difficult to apply when economics dictate the use of commercial substrates such as corn steep liquor, soy bean meal, etc., which are complex mixtures of many nutrients.

2.0 TRADITIONAL ONE-VARIABLE-AT-A-TIME METHOD

The traditional approach to the optimization problem is the one-variable-at-a-time method. In this process, all variables but one are held constant and the optimum level for this variable is determined. Using this optimum, the second variable's optimum is found, etc. This process works if, and only if, there is no interaction between variables. In the case shown in Fig. 1, the optimum found using the one-variable-at-a-time approach was 85%, far from the real optimum of 90%. Because of the interaction between the two nutrients, the one-variable-at-a-time approach failed to find the true optimum. In order to find the optimum conditions, it would have been necessary to repeat the one-variable-at-a-time process at each step to verify that the true optimum was reached.
This requires numerous sequential experimental runs, a time-consuming and ineffective strategy, especially when many variables need to be optimized. Because of the complexity of microbial metabolism, interaction between the variables is inevitable, especially when using commercial substrates, which are complex mixtures of many nutrients. Therefore, since it is both time-consuming and inefficient, the one-variable-at-a-time approach is not satisfactory for fermentation development. Fortunately, there are a number of statistical methods which will find the optimum quickly and efficiently.

3.0 EVOLUTIONARY OPTIMIZATION

An alternative to the one-variable-at-a-time approach is the technique of evolutionary optimization. Evolutionary optimization (EVOP), also known as the method of steepest ascent, is based upon the techniques developed by Spendley et al.[1] The method is an iterative process in which a simplex figure is generated by running one more experiment than the number of variables to be optimized. It gets its name from the fact that the process slowly evolves toward the optimum. A simplex process is designed to find the optimum by ascending the reaction surface along the lines of the steepest slope, i.e., the path with the greatest increase in yield.

The procedure starts by the generation of a simplex figure. The simplex figure is a triangle when two variables are optimized, a tetrahedron when three variables are optimized, and in general a figure with n+1 vertices, where n is the number of variables to be optimized. The experimental point with the poorest response is eliminated and a new point is generated by reflection of the eliminated point through the centroid of the remaining points of the simplex figure. This process is continued until an optimum is reached. In Fig. 2, experimental points 1, 2, and 3 form the vertices of the original simplex figure. Point 1 was found to have the poorest yield, and therefore was eliminated from the simplex figure and a new point (B) generated.
Point 3 was then eliminated and the new point (C) generated. The process was continued until the optimum was reached. The EVOP process is a systematic method of adjusting the variables until an optimum is reached.

Numerous modifications have been made to the original simplex method. One of the more important modifications was made by Nelder and Mead,[2] who modified the method to allow expansions in directions which are favorable and contractions in directions which are unfavorable. This modification increased the rate at which the optimum is found. Other important modifications were made by Brissey,[3] who describes a high speed algorithm, and Keefer,[4] who describes a high speed algorithm and methods dealing with bounds on the independent variables. Additional modifications were reported by Bruley,[6] Deming,[7] and Ryan.[8] For reviews on the simplex methods see papers by Deming et al.[9]-[11]

EVOP does have its limitations. First, because of its iterative nature, it is a slow process which can require many steps. Secondly, it provides only limited information about the effects of the variables. Upon completion of the EVOP process only a limited region of the reaction surface will have been explored and, therefore, minimal information will be available about the effects of the variables and their interactions. This information is necessary to determine the ranges within which the variables must be controlled to ensure optimal operation. Further, EVOP approaches the nearest optimum. It is unknown whether this optimum is a local optimum or the optimum for the entire process.

Despite the limitations, EVOP is an extremely useful optimization technique. EVOP is robust, can handle many variables at the same time, and will always lead to an optimum.
Also, because of its iterative nature, little needs to be known about the system before beginning the process. Most important, however, is the fact that it can be useful in plant optimization where the cost of running experiments using conditions that result in low yields or unusable product cannot be tolerated. In theory, the process improves at each step of the optimization scheme, making it ideal for a production situation. For application of EVOP to plant scale operations, see Refs. 12-14.

The main difficulty with using EVOP in a plant environment is performing the initial experimental runs. Plant managers are reluctant to run at less than optimal conditions. Attempts to use process data as the initial experiments in the simplex are, in general, not successful because of confounding. Confounding occurs because critical variables are closely controlled and, therefore, the error in measuring the conditions and results tends to be greater than the effect of the variables. Because of this, operating data usually give a false perspective as to which variables are important and which changes should be made for the next step.

The successful use of EVOP depends heavily upon the choice of the initial experimental runs. If the initial points are far from the optimum and relatively close to one another, many iterations will be required. Reasonable step sizes must be chosen to ensure that a significant effect of the variable is observed between the points; however, the step size should not be so great as to encompass the optimum.

A second factor to consider is magnitude effects. If one variable is measured over a range of 0.1 to 1.0 while another is measured over a range of 1 to 100, the magnitude difference between the variables can affect the simplex. Scaling factors should be used to keep all variables within the same order of magnitude.
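The reflection step described in this section can be sketched in a few lines of Python. This is a minimal illustration of the basic fixed-size simplex only, not the Nelder and Mead variant and not any particular published program. The yield surface `hypothetical_yield`, the starting points, and the number of steps are all made-up assumptions, and the variables are assumed to be already scaled to comparable units. The rule of rejecting the second poorest point when the poorest is the one just created is a common safeguard against bouncing back, added here as an assumption.

```python
def reflect(simplex, drop):
    """Reflect vertex `drop` through the centroid of the remaining vertices."""
    others = [p for i, p in enumerate(simplex) if i != drop]
    dims = len(simplex[0])
    centroid = [sum(p[k] for p in others) / len(others) for k in range(dims)]
    return [2.0 * centroid[k] - simplex[drop][k] for k in range(dims)]


def run_simplex(yield_fn, simplex, steps):
    """Basic fixed-size simplex search for a maximum (illustrative sketch)."""
    simplex = [list(p) for p in simplex]
    responses = [yield_fn(p) for p in simplex]
    last_new = None
    for _ in range(steps):
        # Vertex indices ordered from poorest to best response.
        order = sorted(range(len(simplex)), key=lambda i: responses[i])
        # If the poorest vertex is the one just created, reject the
        # second poorest instead so the search does not bounce back.
        drop = order[1] if order[0] == last_new else order[0]
        simplex[drop] = reflect(simplex, drop)
        responses[drop] = yield_fn(simplex[drop])
        last_new = drop
    best = max(range(len(simplex)), key=lambda i: responses[i])
    return simplex[best], responses[best]


def hypothetical_yield(point):
    # Made-up surface with an interaction term; its optimum of 90 is at (3, 2).
    u, v = point[0] - 3.0, point[1] - 2.0
    return 90.0 - u * u - v * v - u * v


# Three starting experiments for two variables (n + 1 = 3 vertices).
start = [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5]]
best_point, best_yield = run_simplex(hypothetical_yield, start, steps=40)
print(best_point, round(best_yield, 2))
```

Because the simplex keeps a fixed size, the search can only approach the optimum to within roughly one step size, which is one reason the choice of initial step matters.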
4.0 RESPONSE SURFACE METHODOLOGY

The best method for process optimization is response surface methodology (RSM). This process will not only determine optimum conditions, but also give the information necessary to design a process.

Response surface methodology (RSM) is a method of optimization using statistical techniques based upon the special factorial designs of Box and Behnken and Box and Wilson.[15] It is a scientific approach to determining optimum conditions which combines special experimental designs with Taylor first and second order equations. The RSM process determines the surface of the Taylor expansion curve which describes the response (yield, impurity level, etc.). The Taylor equation, which is the heart of the RSM method, has the form:

$$\text{Response} = A + B X_1 + C X_2 + \dots + H X_1^2 + I X_2^2 + \dots + M X_1 X_2 + N X_1 X_3 + \dots$$

where A, B, C, ... are the coefficients of the terms of the equation, and

$X_1$ = linear term for variable 1
$X_2$ = linear term for variable 2
$X_1^2$ = nonlinear squared term for variable 1
$X_2^2$ = nonlinear squared term for variable 2
$X_1 X_2$ = interaction term for variable 1 and variable 2
$X_1 X_3$ = interaction term for variable 1 and variable 3

The Taylor equation is named after the English mathematician Brook Taylor, who proposed that any continuous function can be approximated by a power series. It is used in mathematics for approximating a wide variety of continuous functions. The RSM protocol, therefore, uses the Taylor equation to approximate the function which describes the response in nature, coupled with the special experimental designs for determining the coefficients of the Taylor equation.

The use of RSM requires that certain criteria be met. These are:

1. The factors which are critical for the process are known. RSM programs are limited in the number of variables that they are designed to handle.
As the number of variables increases, the number of experiments required by the designs increases exponentially. Therefore, most RSM programs are limited to 4 to 5 variables. Fortunately, for the scale up of most fermentations, the number of variables to be optimized is limited. Some of the more important variables are listed in Table 1.

Table 1. Typical Variables in a Fermentation

  Aeration rate
  Agitation rate
  Temperature
  Carbon/Nitrogen ratio
  Phosphate level
  Magnesium level
  Back pressure
  Sulfur level
  Carbon source
  Nitrogen source
  pH
  Dissolved oxygen level
  Power input

2. The factors must vary continuously over the experimental range tested. For example, the variables of pH, aeration rate, and agitation rate are continuous and can be used in an RSM model. Variables such as carbon source (potato starch vs corn syrup) or nitrogen source (cotton seed meal vs soy bean meal) are noncontinuous and cannot be optimized by RSM. However, level of corn syrup or level of soy bean meal are continuous and can be optimized.

3. There exists a mathematical function which relates the response to the factors.

For reviews on the RSM process see Henika[17] or Giovanni.[18] For details on the calculation methods see Cochran and Cox[19] or Box, Hunter, and Hunter.[20] The difficult and time-consuming nature of these calculations has inhibited the widespread use of RSM. Fortunately, numerous computer programs are available to perform this chore. They range from the expensive and sophisticated, such as SAS™, to inexpensive, PC-based programs such as SPSS-X™, E-Chip™, and X STAT™.[21] The availability of these programs, however, has led to a "black box" approach to RSM. This approach can lead to many problems if the user does not have a thorough understanding of the process or the meaning of the results.

5.0 ADVANTAGES OF RSM

The response surface methodology approach has many advantages over other optimization procedures. These are listed in Table 2.

Table 2.
Advantages and Disadvantages of RSM

Advantages of RSM
1. Greatest amount of information from experiments.
2. Forces you to plan.
3. Know how long project will take.
4. Gives information about the interaction between variables.
5. Multiple responses at the same time.
6. Gives information necessary for design and optimization of a process.

Disadvantages of RSM
1. Tells what happens, not why.
2. Notoriously poor for predicting outside the range of study.

5.1 Maximum Information from Experiments

RSM yields the maximum amount of information from the minimum amount of work. For example, in the one-variable-at-a-time approach shown in Fig. 1, ten experiments were run only to find suboptimum conditions. However, using RSM and thirteen properly designed experiments, not only would the true optimum have been found, but the information necessary to design the process would also have been made available. Secondly, since all of the experiments can be run simultaneously, the results could be obtained quickly. This is the power of response surface methodology.

RSM is a very efficient procedure. It utilizes partial factorial designs, such as central composite or star designs, and therefore the number of experimental points required is a minimum (Table 3). A full factorial three level design would require $3^n$ experiments, while a full factorial five level design would require $5^n$ experiments, where n is the number of variables to be optimized. Response surface protocols, being a partial factorial design, require fewer experiments. For example, if one were to examine five variables at five different levels, a full factorial design approach would require $5^5 = 3125$ experiments. Response surface methodology, on the other hand, requires only 48 experiments, clearly a large savings in time, effort, and expense.

Table 3.
Experimental Efficiency of RSM

Number of     Number of       Number of
Variables     Combinations    Actual Experiments

NARROW THREE LEVEL DESIGN
    2              9               13
    3             27               15
    4             81               27
    5            243               46

BROAD FIVE LEVEL EXPLORATORY DESIGN
    2             25               13
    3            125               20
    4            625               31
    5           3125               48

5.2 Forces One To Plan

The successful use of an RSM protocol requires careful planning on the part of the experimenter before beginning the protocol. The ranges over which the variables are to be tested must be chosen with care. Choosing a range which is too narrow can result in a variable being discarded as not significant, not because the variable did not have an effect, but rather because the effect of the variable over the range evaluated was small in comparison to the experimental error. The range must be large enough so that the variable has a significant effect over the range evaluated.

On the other hand, choosing a range which is too large can also result in a variable being discarded as not significant, not because the variable did not have an effect, but rather because the Taylor equation could not adequately explain the effect of the variable. It must be remembered that RSM does not determine the function which describes the results, but rather determines the Taylor expansion equation which best fits the data. Over a limited range, the Taylor equation will approximate the function which describes the results. The wider the range chosen, the less likely it is that a Taylor expansion equation which meaningfully explains the data will be obtained. Therefore, ranges which include extreme minimums and maximums for a variable should be avoided.

Further, the experimenter needs to have an approximation as to where the optimum exists. It is a sad state of affairs to have completed the RSM protocol only to find that the optimum conditions were outside of the range evaluated. RSM is notorious for its inability to predict outside the range evaluated.
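The effect of the chosen range on the quality of the Taylor approximation can be illustrated with a small sketch. Everything here is an assumption made for illustration: the "true" response `true_response` is an invented bell-shaped curve in temperature, a single variable is studied, and the second order model is fitted exactly through three levels (low, midpoint, high) rather than by regression on replicated data.

```python
import math


def true_response(temp_c):
    # Invented "true" yield curve, used only for illustration.
    return 100.0 * math.exp(-((temp_c - 30.0) / 10.0) ** 2)


def fit_three_level(low, high, fn):
    """Second order model through the low, midpoint, and high levels."""
    mid = (low + high) / 2.0
    step = (high - low) / 2.0
    y_lo, y_mid, y_hi = fn(low), fn(mid), fn(high)
    a = y_mid                        # constant term
    b = (y_hi - y_lo) / 2.0          # linear coefficient
    c = (y_hi + y_lo) / 2.0 - y_mid  # squared coefficient

    def model(value):
        x = (value - mid) / step     # level expressed in units of the step
        return a + b * x + c * x * x

    return model


def max_abs_error(low, high, fn, points=21):
    """Largest gap between the model and the true curve inside the range."""
    model = fit_three_level(low, high, fn)
    grid = [low + (high - low) * i / (points - 1) for i in range(points)]
    return max(abs(model(v) - fn(v)) for v in grid)


narrow_err = max_abs_error(25.0, 35.0, true_response)
wide_err = max_abs_error(0.0, 60.0, true_response)
print("narrow range error:", round(narrow_err, 2))
print("wide range error:", round(wide_err, 2))
```

In this invented case the model fitted over 25 to 35 degrees stays within about one yield unit of the curve, while the model fitted over 0 to 60 degrees misses it by tens of units between the design points, even though both pass exactly through their own three experimental levels.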
It is strongly advised that preliminary experiments be done to determine the ranges over which the variables are to be evaluated.

5.3 Know How Long Project Will Take

A distinct advantage of the RSM procedure is that one knows how many experiments and how much time are needed to complete the process. This is especially helpful for budgetary purposes and the allocation of scarce scientific resources. Using RSM, the experimenter has the information necessary to determine whether a project is worth undertaking.

5.4 Interaction Between Variables

With the one-variable-at-a-time approach, it is difficult to determine the amount of interaction between variables. Response surface methodology, since it looks at all the variables at the same time, can calculate the interaction between them. This information is essential for optimizing conditions and determining what control limits are needed for the variables.

5.5 Multiple Responses

RSM has the ability to model as many responses as one wishes to measure. For example, one may be interested not only in optimum yield, but also in the level of a difficult-to-remove impurity. Both the yield and impurity levels could be modeled using data from the same set of experiments. Decisions could then be made between the cost to remove an impurity and changes in yield.

5.6 Design Data

Last, but most important, RSM gives the information necessary to design the process. For example, Fig. 3 shows the effect of temperature and degree of saccharification on alcohol yield. This plot not only shows the conditions necessary for optimum yield, it also indicates the sensitivity of the process to changes in temperature and degree of saccharification. It shows the range over which these variables must be controlled for optimum yield. Temperature needs to be controlled within a 5 degree range and the degree of saccharification within a 10% range.
This information can now be used in designing control loops for these variables.

In any industrial process, the cost-effective conditions are influenced by factors other than optimum reaction conditions. There exists a compromise between optimum reaction conditions and economic factors such as capital and purification costs. In addition to determining optimum conditions and the ranges within which the variables need to be controlled, the regression equations generated by the RSM procedure allow the process to be modeled for a wide variety of operating parameters. The regression equations, therefore, are an ideal tool for evaluating various economic trade-offs. For example, in Fig. 4, 98% yields are obtained at low carbohydrate levels and long fermentation times. Although this is a high yield, both capital costs for the fermentation capacity and distillation costs for the resulting low alcohol beer make this an uneconomical operating condition. Using the model developed by the RSM process, the trade-off between capital and purification costs can be weighed against lower yields to determine the best process.

6.0 DISADVANTAGES OF RSM

There are two major disadvantages of RSM. First, it tells what happened, not why it happened. Aesthetically, this is not appealing to many scientists. This perhaps explains why, with the exception of analytical method development, few papers appear in the literature using RSM. This is an unfortunate circumstance since RSM is such a powerful and timesaving tool. In many cases, knowing what happens can lead to an explanation of the why or point to alternative directions for future research. For example, in Fig. 5 there is a definite optimum for the degree of saccharification.
Hypotheses to explain this phenomenon are slow substrate production at low saccharification levels and substrate inhibition at high saccharification levels. Having seen the effect of saccharification, one can readily design experiments to determine the cause. Second, as noted in Table 2 and Section 5.2, RSM is notoriously poor at predicting outside the range of study.

7.0 POTENTIAL DIFFICULTIES WITH RSM

It must be remembered that RSM uses multiple regression techniques to determine the coefficients for the Taylor expansion equation which best fits the data. RSM does not determine the function which describes the data. The Taylor equation only approximates the true function. The RSM process fits one of a series of curves to the data. Most RSM programs use only the first and second order terms of the Taylor equation to fit the data, which limits the number of curves available. The first order Taylor equation is a linear model. Therefore, the only curves available are a series of straight lines. The second order Taylor equation is a nonlinear model where two types of curves are available: a peak or a saddle surface. Over a narrow range, these curves will approximate the true function that exists in nature, but they are not necessarily the function that describes the response.

Although RSM is a rapid method for determining optimum conditions for a process, caution must be used when interpreting the results. Always remember the quote by Mark Twain, "There are liars, damn liars, and statisticians." Unless the RSM output is used properly, it is easy to make this quote true. RSM will always give the user a number. The question remains as to how good that number is and what it means. Some of the important statistical values which should be considered in evaluating the RSM output are listed below.

7.1 Correlation Coefficient

The correlation coefficient is a measure of the relationship between the Taylor expansion term and the response obtained.
The correlation coefficient can vary from 0 (absolutely no correlation) to 1 or -1 (perfect correlation). A correlation coefficient of 0.5 shows a weak but useful correlation. A positive sign for the correlation coefficient indicates that the response increases as the variable increases, while a negative sign indicates that the response decreases as the variable increases.

7.2 Regression Coefficients

The regression coefficients are the coefficients for the terms of the Taylor expansion equation. These coefficients can be determined either by using the actual values for the independent variables or by using coded values. Using the actual values makes it easy to calculate the response from the coefficients since it is not necessary to go through the coding process. However, there is a loss of important information. The reason for coding the variables is to eliminate the effect that the magnitude of the variable has upon the regression coefficient. When coded values are used in determining the regression coefficients, the importance of the variable in predicting the results can be determined from the absolute value of the coefficient. Using coded values for the independent variables, those variables which are important and must be closely controlled can readily be determined. The formula for coding values is:

$$\text{Coded Value} = \frac{\text{Value} - \text{Midpoint Value}}{\text{Step Value}}$$

where:

Value = the level of the variable used
Midpoint Value = the level of the variable at the midpoint of the range
Step Value = the midpoint value minus the next lowest value

7.3 Standard Error of the Regression Coefficient

RSM determines the best estimate of the coefficients for the Taylor equation which explains the response. The estimated regression coefficient is not necessarily the exact value but rather an estimate for the coefficient.
The advantage of statistical techniques is that from the standard error one has information about how valid the estimate for the coefficient is (the range within which the exact value for the coefficient may be found). The greater the standard error, the larger the range within which the exact value for the coefficient may be, i.e., the larger the possible error in the value for the coefficient. The standard error of the regression coefficient should be as small as possible. A standard error which is 50% of the coefficient indicates a coefficient which is useful in predicting the response. Designing a process using coefficients with a large standard error can lead to serious difficulties.

7.4 Computed T Value

The T test value is a measure of the regression coefficient's significance, i.e., does the coefficient have a real meaning or should it be zero. The larger the absolute value of T, the greater the probability that the coefficient is real and should be used for predictions. A T test value of 1.7 or higher indicates that there is a high probability that the coefficient is real and the variable has an important effect upon the response.

7.5 Standard Error of the Estimate

The standard error of the estimate yields information concerning the reliability of the values predicted by the regression equation. The greater the standard error of the estimate, the less reliable the predicted values.

7.6 Analysis of Variance

Three other statistical numbers which should be closely examined relate to the source of variation in the data. The variation attributable to the regression reflects the amount of variation in the data explained by the regression equation. The deviation from regression is a measure of the scatter in the data which is not explained, i.e., the experimental error. Ideally the deviation from the regression should be very small in comparison to the amount of variation explained by the regression.
If this is not the case, it means that the Taylor equation does not explain the data and the regression equation should not be used as a design basis.

The third important factor is the relationship between the explained and unexplained variation. The greater the amount of variation explained by the regression equation, the greater the probability that the equation meaningfully explains the results. The F value is a measure of this relationship. The larger the F value, the greater the significance the regression equation has in explaining the data. The F value is also helpful in comparing different models. Models with the larger F value are better in explaining the response data.

8.0 METHODS TO IMPROVE THE RSM MODEL

The output from an RSM program is only as good as the data entered. The cliche GIGO (garbage in, garbage out) applies especially to the RSM process. Since the minimum number of experiments is being used, any inaccuracies in the data can have a large effect upon the results. One acceptable method to increase the accuracy of the results is to perform replicate experiments and use the averages as the input data. Care must be taken, however, to avoid confounding the results by performing replicates of only a portion of the experimental design. This will result in the experimental error being understated in some areas of the response surface and overstated in others. All experimental points must be treated in a similar manner in order to ensure that a meaningful response surface is obtained.

A common error, especially when using multiple regression programs, is to use all the data available. Performing the regression analysis with missing data points or with data points added to the design leads to misleading results unless special care is taken. The design used must be symmetrical to prevent the uneven weighting of specific areas of the response surface from distorting the final model.
Although adding the extra data points may improve the statistics of the model, it can also reduce its reliability. RSM users are strongly cautioned to resist the temptation to add extra data points to the model simply because they are available.

Another method to improve the reliability of the RSM model is the use of backward elimination, i.e., the removal of those variables whose T test value is below the 95% confidence limit. This process, however, must be used with care. There are two types of statistical errors. A Type I error is saying a variable is significant when it is not. A Type II error is saying a variable is not significant when it is. Statistical procedures are designed to minimize the chances of committing a Type I error. The statistical process determines the probability that a variable is indeed important. Elimination of those variables not significant at the 95% confidence limit reduces the chances of making a Type I error. This does not mean that the variables eliminated were not important. Lack of statistical significance means the variable was not proven to be important. There is a large difference between unimportant and not proven important. While elimination of the variables not significant at the 95% confidence limit decreases the probability of making a Type I error, it increases the chances of making a Type II error: disregarding a variable which was important.

Some mathematical considerations also need to be taken into account when eliminating variables from the equation. An equation where the linear term was eliminated while the nonlinear term was retained can mathematically produce only a curve with the maximum, or minimum, centered in the region evaluated. It is necessary to retain the linear term in order to move the maximum or minimum to the appropriate area on the plot.
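The role of the linear term can be checked with a short derivation. The sketch below assumes a single variable, ignores interaction terms, and reuses the coefficient letters of the Taylor equation in Section 4.0 (A for the constant, B for the linear term, H for the squared term); $X_1^{*}$ is introduced here only to denote the location of the maximum or minimum.

```latex
\begin{align*}
\text{Response} &= A + B X_1 + H X_1^2 \\
\frac{d(\text{Response})}{dX_1} &= B + 2 H X_1 = 0
  \quad\Longrightarrow\quad X_1^{*} = -\frac{B}{2H} \\
B = 0 &\quad\Longrightarrow\quad X_1^{*} = 0
\end{align*}
```

With coded values, $X_1 = 0$ is the midpoint of the range studied, so removing B pins the maximum or minimum to the center of the region evaluated wherever the data would otherwise place it.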
Similarly, an equation containing only an interaction term can mathematically produce only a saddle surface centered on the region evaluated. The other terms for the variables are necessary to move the optimum to the appropriate area of the response surface. When eliminating terms, it is best to eliminate the entire variable and not just selected terms for the variable. Failure to heed these warnings will result in a process being designed for conditions which are not optimum.

9.0 SUMMARY

The problem of designing and optimizing fermentation processes can be handled quickly using a number of statistical techniques. It has been our experience that the best technique is response surface methodology. Although not reported widely in the literature, this process is used by most pharmaceutical companies for the optimization of their antibiotic fermentations. RSM is a highly efficient procedure for determining not only the optimum conditions, but also the data necessary to design the entire process. In cases where RSM cannot be applied, evolutionary optimization (EVOP) is an alternative method for optimization of a process. These methods are systematic procedures which guarantee optimum conditions will be found.

REFERENCES

1. Spendley, W., Hext, G. R., and Himsworth, F. R., Technometrics, 4:411 (1962)
2. Nelder, J. A. and Mead, R., Comput. J., 7:308 (1965)
3. Brissey, G. W., Spencer, R. B., and Wilkins, C. L., Anal. Chem., 51:2295 (1979)
4. Keefer, D., Ind. Eng. Chem. Process Des. Develop., 12(1):92 (1973)
5. Nelson, L., Annual Conference Transactions of the American Society for Quality Control, pp. 107-117 (May 1973)
6. Glass, R. W. and Bruley, D. F., Ind. Eng. Chem. Process Des. Develop., 12(1):6 (1973)
7. King, P. G. and Deming, S. N., Anal. Chem., 46:1476 (1974)
8. Ryan, P. B., Barr, R. L., and Todd, H. D., Anal. Chem., 52:1460 (1980)
9. Deming, S. N. and Parker, Crit. Rev. Anal.
Chem., 7:187 (1978)
10. Deming, S. N., Morgan, S. L., and Willcott, M. R., Amer. Lab., 8(10):13 (1976)
11. Shavers, C. L., Parsons, M. L., and Deming, S. N., J. Chem. Educ., 56:307 (1976)
12. Carpenter, B. H. and Sweeny, H. C., Chem. Eng., 72:117 (1965)
13. Umeda, T. and Ichikawa, A., Ind. Eng. Chem. Process Des., 10:229 (1971)
14. Basel, W. D., Chem. Eng., 72:147 (1965)
15. Box, G. E. P. and Wilson, K. B., J. R. Stat. Soc. B, 13:1 (1951)
16. Hill, W. J. and Hunter, W. G., Technometrics, 8:571 (1966)
17. Henika, R. G., Cereal Science Today, 17:309 (1972)
18. Giovanni, M., Food Technology, 41 (November 1983)
19. Cochran, W. G. and Cox, G. M., in Experimental Designs, pp. 335, John Wiley & Sons, New York City (1957)
20. Box, G. E. P., Hunter, W. G., and Hunter, J. S., Statistics for Experimenters, pp. 510, John Wiley & Sons, New York City (1978)
21. SAS is a trademark of SAS Institute, Cary, NC; SPSS-X is a trademark of SPSS, Chicago, IL; E Chip is a trademark of E-CHIP, Inc., Hockessin, DE; X STAT is a trademark of Wiley and Sons, New York, NY