########################################################################### ########################################################################### NOTE! NOTE! This file, written for "old" S, has not been adapted for R. Also, it uses graphics applications that may not be available for Mac OS X: "brush" - SGI Irix application "xgobi" or successor "ggobi" Nevertheless, the discussion of the methods remains accurate and, IMHO, worth a read, even if long. ########################################################################### ########################################################################### ########################################################################### #################### TUTORIAL ON S AND FITTING OF DATA ################## ########################################################################### ########################################################################### #################### INTRODUCTION ###################################### ########################################################################### This tutorial uses the S language and environment to fit and display data. S is an integrated suite of software for statistical analysis, computation, and the display and manipulation of data. S is built around operations on vectors and arrays. S has a powerful programming language and a development environment. There are various versions of S. The one installed on this computer is from the 1992 AT&T sources. A commercial version of S, called S-Plus, for workstations and PC's, is available from Statistical Sciences and is used by several University of Arizona Departments. The user should be familiar with the basic operations of S. For an introductory description of S, see: "Notes on S-PLUS", by Bill Venables and Dave Smith, 1992. Hard copies and computer readable copies of "Notes on S-PLUS" have been made available. A new user of S should work through the sample S session on pages 61-63. A copy of these commands is in the file "venables.s". For a full description of S and its use, see: The S "Blue Book": R. A. Becker, J. M. Chambers, and A. R. Wilks, "The New S Language", Wadsworth, 1988. The S "White Book": J. M. Chambers and T. J. Hastie, "Statistical Models in S", Wadsworth, 1988. In the first chapter of each book the authors work through examples that illustrate the use of S. Both books have been made available for short-term reference (checkout for two hours during the day or overnight if after 4pm). Steady-state initial-rate data measured for the enzyme lactate dehydrogenase (LDH) with NADH and pyruvate as substrates are used as a test set for this tutorial. The model for the LDH reaction is a compulsory-order mechanism, with binary LDH-NADH and LDH-NAD complexes and ternary substrate complexes, LDH-NADH-pyruvate and LDH-NAD-lactate. There are also ternary abortive complexes, LDH-NADH-lactate and LDH-NAD-pyruvate, formed from reduced cofactor and reduced substrate and oxidized cofactor and oxidized substrate, respectively. These are not on the reaction path and when present inhibit the reaction. The LDH reaction is described by the mechanism: LDH + NADH <--> LDH-NADH LDH-NADH + pyruvate <--> LDH-NADH-pyruvate (LDH-NADH-pyruvate == LDH-NAD-lactate) LDH-NAD-lactate <--> LDH-NAD + lactate LDH-NAD <--> LDH + NAD LDH-NADH + lactate <--> LDH-NADH-lactate LDH-NAD + pyruvate <--> LDH-NAD-pyruvate which gives the rate equation: v0 = Vmax / (1 + C1 * p + KmA/a + KmB/b * (1 + C1 * p) * (1 + C2 * p) + KmAB/(a * b) * (1 + C1 * p) + C3 * b) eq. 1 where: v0 is the initial rate of reaction. a, b, and p are the concentrations of NADH, pyruvate, and lactate, respectively. Vmax, KmA, KmB, and KmAB have the usual meaning as kinetic constants for a two-substrate reaction. C1 = KmQ/KmPQ is the kinetic constant that describes inhibition by the first product. C2 = 1/KiP is the equilibrium constant that describes formation of the first abortive complex, LDH-NADH-lactate. C3 is the kinetic constant that describes formation of the LDH-NAD-pyruvate abortive complex. C3 can be shown to be unimportant for these data, and so can be fixed at a very small value or eliminated from the rate equation. We use here two methods of fitting data to a model that is nonlinear in the parameters. Each method minimizes a test function, the commonly-used sum of residuals squared, so obtaining a least-squares best estimate of the parameter values. The simplex method does this by trying systematically various values for each parameter, i.e. by a guided search over a region of parameter space. This is accomplished by constructing a polygon, the simplex, that initially is made so large as to be sure to include the best estimate of the parameters. During the search the polygon is shrunk about the (unknown) best estimate until it is arbitrarily small. The average of the parameter values at the vertices then is taken as the best estimate of the parameter values for the model. The simplex method is particularly effective in handling ill-conditioned functions, such as those with singularities or discontinuities. The method easily handles bounds on the parameter values, etc. Gradient methods locate the best estimate by adjusting the current estimate of the parameter values according to the gradient of the least-squares function. Iteration (repetition of the gradient calculation and adjustment of the parameters) moves the current estimate arbitrarily close to the best estimate of the parameters, where the gradient is zero. Gradient methods are expected to be faster than a trial and error search like the simplex. Gradient methods are less able to handle ill-conditioned functions. For a description of the simplex method of function optimization, see the paper by Nelder and Mead, Computer J. 7, 308, 1965. For general discussion of the simplex and gradient methods of fitting data and function optimization, see Press et al, "Numerical Recipes in C", 2nd edition, Cambridge, 1992. ########################################################################### #################### NONLINEAR FIT -- SIMPLEX SEARCH #################### ########################################################################### The simplex fitting program simpfit() must be given the following: a set of data to be fit to the model, a function describing the model, and initial estimates for the parameters of the model. The matrix ldhfn.dat has initial rate measurements (v0) for lactate dehydrogenase (LDH). The v0 values are for varied concentrations of first substrate, NADH (A), second substrate, pyruvate (B), and first product released, lactate (P). The model described in eq. (1) is coded in the function ldhfn(). ldhfn() is written for use with the data matrix ldhfn.dat. The matrix ldhfn.p has parameter estimates that describe the vertices of a starting simplex for the function and data of ldhfn() and ldhfn.dat. Please look at function, data and initial parameter estimates. ldhfn ldhfn.dat ldhfn.p You may want to look at the online documentation for simpfit(). ? simpfit Run the simplex fitting, storing the results in ldh.simpfit.out. ldh.simpfit.out<-simpfit(ldhfn,ldhfn.p,,,ldhfn.dat) simpfit() returns a list of values that describe the fit, including best estimates of the parameters (Vmax, etc.), their standard deviations, the unscaled covariance matrix, the variance of the sample, etc. To view these results. ldh.simpfit.out Further discussion of the results is deferred until after introduction of another method of fitting the data. ########################################################################### #################### NONLINEAR FIT -- GRADIENT METHOD ################### ########################################################################### The function simpfit() does not return a list compatible with the summary and plotting functions of the 1992 version of S. Also, although functions like ldhfn() for some particular model may not be difficult to code, still it is necessary to have written such a separate model function in order to use simpfit(). Generally it should be more convenient, faster, and easier to fit the data by use of a standard 1992 S function, nls(). This function carries out a gradient search for the best values of the parameters. nls() may not handle ill-conditioned functions or parameter bounds and other frills. If these are a problem, then perhaps resort to simpfit(). nls() and related functions use data frames, a structure introduced with the 1992 S. Their usefulness will become apparent. Construct a data frame from the LDH data in ldhfn.dat. Change the names of column headers so that they do not contain the characters "[]". The use of attach() allows addressing the columns of a matrix or data frame by name. Create a new data frame without the unneeded columns (weights and v0calc). xxxdf<-as.data.frame(ldhfn.dat) names(xxxdf)<-c("v0","v0calc","weight","a","b","p") attach(xxxdf) objects(2) ldh.df<-data.frame(v0,a,b,p) A data frame can be used like a matrix. xxxdf ldhfn.dat xxxdf[1,5] ldhfn.dat[1,5] But a data frame is _not_ a matrix. (It is really a list.) as.list(xxxdf) as.list(ldhfn.dat) Cleanup. search() detach(2) rm(xxxdf) View the ldhfn.p parameter estimates used for the simplex fit. ldhfn.p Introduce the same values into ldh.df. C3 for the LDH-NAD-pyruvate abortive complex does not contribute. It was fixed at a value 1e-10 that removed it from the simplex fit. We remove it entirely from this fit. parameters(ldh.df)<-list(Vmax=1e+00,KmA=1e-03,KmB=1e-02,KmAB=1e-03,C1=1e-03,C2=1e-06) View the results. ldh.df Read the documentation on nls(). ?nls Carry out a fit using nls() with the rate law described in eq. (1). ldh.out1<-nls(v0 ~ Vmax/(1 + C1*p + KmA/a + (KmB/b)*(1 + C1*p)*(1 + C2*p) + (KmAB/(a*b))*(1 + C1*p)), ldh.df) Note that no separate function needed to be written (the model is expressed on the command line and nls() estimates gradients numerically). Also, the fit is faster than for the simplex method (a gradient search is expected to be faster than a search like the simplex that samples all or much of a region of parameter space). Look at the results. as.list(ldh.out1) summary(ldh.out1) residuals(ldh.out1) fitted.values(ldh.out1) ldh.df$v0-fitted.values(ldh.out1) - residuals(ldh.out1) coef(ldh.out1) summary(ldh.out1)$parameters sigma = standard error of the fit = standard deviation = sqrt(variance) = sqrt ( (sum residuals^2)/degrees_of_freedom ) = rmssq = rmsd = rmse summary(ldh.out1)$sigma sqrt(sum(residuals(ldh.out1)^2)/summary(ldh.out1)$df[2]) Compare the parameter estimates from nls() with the parameter estimates from simpfit(). The difference in the estimates is 1 part in 10^5 or less. (ldh.simpfit.out$coef[1:6] - coef(ldh.out1))/coef(ldh.out1) Compare the standard errors for the parameters obtained from nls() with those from simpfit(). The standard errors for each parameter are close, but as is reasonable, different fitting methods lead to slightly different descriptions of the least-squares surface about the minimum. The fitting methods define the minimum more accurately than the shape of the LSQ surface about it. ldh.simpfit.out$std.err summary(ldh.out1)$parameters[,"Std. Error"] (ldh.simpfit.out$std.err - summary(ldh.out1)$parameters[,2])/ldh.simpfit.out$std.err The specification of the starting estimates of the parameters can be handled differently. They can be given as an argument of the function. They can be given as an array, i.e., as done below, an array C of length 6. The array indices in the model function can be data values (not done below). nls(v0 ~ C[1]/(1 + C[5]*p + C[2]/a + (C[3]/b)*(1 + C[5]*p)*(1 + C[6]*p) + (C[4]/(a*b))*(1 + C[5]*p)), data=ldh.df,start=list(C=c(1e+00,1e-03,1e-02,1e-03,1e-03,1e-06))) ########################################################################### #################### QUALITY OF THE NONLINEAR FIT ####################### ########################################################################### We now ask about the quality of the fit and of the data. ########################################################################### 1. Are the residuals (deviations between measured and calculated values = v0 - v0calc) normally distributed? Least-squares fitting assumes this, and failure of the assumption may be cause for concern. The function qqnorm() sorts the data (here the residuals) and plots them against the corresponding theoretical values of the quantiles for the normal distribution. If the points in the plot approximate a straight line, the the assumption of normality is supported. (For a sample x, the quantile corresponding to a probability P is the value X such that the fraction of the sample x having values less than X is P, e.g., for P=.5, X=mean of sample). The function qqline() draws a line through the 1st and 3rd quartiles of the data, which include the central (most probable) part of the distribution and of course the majority of the sample points. Display the results of qqnorm() and qqline() operating on the residuals of the fit done with nls(). x11() qqnorm(resid(ldh.out1)) qqline(resid(ldh.out1)) The data in this case satisfy the Gaussian assumption. There is some deviation for the tails of the distribution. To identify the tail points, construct a sorted list of the residuals with the ordinal in the data set attached. attach(ldh.df) x<-cbind(sort.list(resid(ldh.out1)),resid(ldh.out1),a,b,p) x[x[,1],] No pattern is apparent. Can identify points by point-and-click. xx<-residuals(ldh.out1) plot(qnorm(ppoints(length(xx)))[order(order(xx))],xx) identify(qnorm(ppoints(length(xx)))[order(order(xx))],xx) Left mouse to label. Right mouse to exit. Cleanup. rm(x, xx) detach(2) ########################################################################### 2. Is each parameter significant in influencing the fit? What is the confidence level for each of the parameters being different from 0? First display the values of the parameters, their std. errors, and their values of t = parameter value/std. error. Then evaluate the probability corresponding to this value of t and the number of degrees of freedom of the fit. This probability is 1 - confidence level. Note that the parameter estimates obtained in a particular experiment are a single sample drawn from a population of possible results. The associated standard errors can be used to estimate the variance of this population of parameter values. Use of Student's t distribution takes into account the lack of knowledge of the true variance and the fact that the standard errors themselves have uncertainty. For an experiment with a large number of data points (degrees of freedom), the t and normal distributions are the same. For the lowest (least significant) t value: summary(ldh.out1)$parameters summary(ldh.out1)$df 1-pt(2.554559,20) 1-pt(min(summary(ldh.out1)$parameters[,3]),summary(ldh.out1)$df[2]) The result, 0.009445276, shows that the least well-defined parameter differs from zero at the 0.99 confidence level. ########################################################################### 3. Is the std. error of the fit reasonable from an experimental view? Compare the std. error of the fit to the min, max, and mean of the quantity fit. attach(ldh.df) range(v0) mean(v0) summary(ldh.out1)$sigma x<-c(min(v0), mean(v0), max(v0)) cbind(x,summary(ldh.out1)$sigma/x*100) detach(2) rm(x) The standard error of the fit is 1.5 to 6 percent of the v0 values. This percentage error is reasonable, if anything small for a steady-state kinetic experiment. ########################################################################### 4. Are the same parameter estimates (the same fit) obtained with different starting values of the parameters? If not, then the fit is suspect. For convenience, construct a data frame from ldh.df without the starting parameter estimates. xxx.df<-ldh.df parameters(xxx.df)<-NULL Carry out the fit with the previously-used starting parameter estimates and with estimates uniformly increased and decreased by powers of 2 or 10. Previous estimates: nls(v0 ~ Vmax/(1 + C1*p + KmA/a + (KmB/b)*(1 + C1*p)*(1 + C2*p) + (KmAB/(a*b))*(1 + C1*p)), xxx.df, list("Vmax"=1e+00,"KmA"=1e-03,"KmB"=1e-02,"KmAB"=1e-03,"C1"=1e-03,"C2"=1e-06)) 2x smaller -- result = as for previous: nls(v0 ~ Vmax/(1 + C1*p + KmA/a + (KmB/b)*(1 + C1*p)*(1 + C2*p) + (KmAB/(a*b))*(1 + C1*p)), xxx.df, list("Vmax"=.5e+00,"KmA"=.5e-03,"KmB"=.5e-02,"KmAB"=.5e-03,"C1"=.5e-03,"C2"=.5e-06)) 4x smaller -- result = as for previous: nls(v0 ~ Vmax/(1 + C1*p + KmA/a + (KmB/b)*(1 + C1*p)*(1 + C2*p) + (KmAB/(a*b))*(1 + C1*p)), xxx.df, list("Vmax"=.25e+00,"KmA"=.25e-03,"KmB"=.25e-02,"KmAB"=.25e-03,"C1"=.25e-03,"C2"=.25e-06)) 10x smaller -- result = no fit, singular gradient matrix: nls(v0 ~ Vmax/(1 + C1*p + KmA/a + (KmB/b)*(1 + C1*p)*(1 + C2*p) + (KmAB/(a*b))*(1 + C1*p)), xxx.df, list("Vmax"=1e-01,"KmA"=1e-04,"KmB"=1e-03,"KmAB"=1e-04,"C1"=1e-04,"C2"=1e-07)) 10x larger -- result = as for previous: nls(v0 ~ Vmax/(1 + C1*p + KmA/a + (KmB/b)*(1 + C1*p)*(1 + C2*p) + (KmAB/(a*b))*(1 + C1*p)), xxx.df, list("Vmax"=1e+01,"KmA"=1e-02,"KmB"=1e-01,"KmAB"=1e-02,"C1"=1e-02,"C2"=1e-05)) 100x larger -- result = no fit, singular gradient matrix: nls(v0 ~ Vmax/(1 + C1*p + KmA/a + (KmB/b)*(1 + C1*p)*(1 + C2*p) + (KmAB/(a*b))*(1 + C1*p)), xxx.df, list("Vmax"=1e+02,"KmA"=1e-01,"KmB"=1e-00,"KmAB"=1e-01,"C1"=1e-01,"C2"=1e-04)) Clearly, a range of starting estimates are acceptable, and the fit is independent of the starting point within this range. The gradient algorithm of nls() cannot handle a rough LSQ surface, and starting estimates far from the fit values cannot be accommodated (10x smaller or 100x larger). The simplex algorithm of simpfit() is less sensitive. For a range of 100x above to 100x below the previous estimates: x.porig<-ldhfn.p[1,] x.p<-x.porig*c(rep(1.e2,6),1) x.pfactor<-c(rep(1.e-4,6),1) x.out<-simpfit(ldhfn,x.p,x.pfactor,,ldhfn.dat) x.out (x.out$coef - ldh.simpfit.out$coef)/x.out$coef (x.out$std.err - ldh.simpfit.out$std.err)/x.out$std.err (x.out$cov.unscaled - ldh.simpfit.out$cov.unscaled)/x.out$cov.unscaled rm(x.out, x.pfactor, x.p, x.porig) Note that a rather large number of iterations were needed and the fit is slow and memory growth is large. Iter = 1763 vs 447 for the previous fit. The results are equal within a small tolerance to the previous fit. Common sense should allow guessing initial parameter estimates closer than 100x above or below the best fit values. ########################################################################### 5. What is the effect of leaving parameters out of the fit? (xxx.df<-ldh.df) x.out<-nls(v0 ~ (Vmax/(1 + C1*p + KmA/a + (KmB/b)*(1 + C1*p)*(1 + C2*p) + (KmAB/(a*b))*(1 + C1*p))), xxx.df,list("Vmax"=1e+00,"KmA"=1e-03,"KmB"=1e-02,"KmAB"=1e-03,"C1"=1e-3,"C2"=1e-6)) x.c2.out<-nls(v0 ~ (Vmax/(1 + C1*p + KmA/a + (KmB/b)*(1 + C1*p) + (KmAB/(a*b))*(1 + C1*p))), xxx.df,list("Vmax"=1e+00,"KmA"=1e-03,"KmB"=1e-02,"KmAB"=1e-03,"C1"=1e-3)) x.c1.out<-nls(v0 ~ (Vmax/(1 + KmA/a + (KmB/b)*(1 + C2*p) + (KmAB/(a*b)))), xxx.df,list("Vmax"=1e+00,"KmA"=1e-03,"KmB"=1e-02,"KmAB"=1e-03,"C2"=1e-6)) x.kmab.out<-nls(v0 ~ (Vmax/(1 + C1*p + KmA/a + (KmB/b)*(1 + C1*p)*(1 + C2*p))), xxx.df,list("Vmax"=1e+00,"KmA"=1e-03,"KmB"=1e-02,"C1"=1e-3,"C2"=1e-6)) x.kmb.out<-nls(v0 ~ (Vmax/(1 + C1*p + KmA/a + (KmAB/(a*b))*(1 + C1*p))), xxx.df,list("Vmax"=1e+00,"KmA"=1e-03,"KmAB"=1e-03,"C1"=1e-3,"C2"=1e-6)) x.kma.out<-nls(v0 ~ (Vmax/(1 + C1*p + (KmB/b)*(1 + C1*p)*(1 + C2*p) + (KmAB/(a*b))*(1 + C1*p))), xxx.df,list("Vmax"=1e+00,"KmB"=1e-02,"KmAB"=1e-03,"C1"=1e-3,"C2"=1e-6)) x.vmax.out<-nls(v0 ~ (1/(1 + C1*p + KmA/a + (KmB/b)*(1 + C1*p)*(1 + C2*p) + (KmAB/(a*b))*(1 + C1*p))), xxx.df,list("KmA"=1e-03,"KmB"=1e-02,"KmAB"=1e-03,"C1"=1e-3,"C2"=1e-6)) y<-list(x.out,x.c2.out,x.c1.out,x.kmab.out) names(y)<-list("x.out","x.c2.out","x.c1.out","x.kmab.out") for (i in seq(along=y)){print(names(y)[[i]]);print(summary(y[[i]])$parameters)} for (i in seq(along=y)){print(names(y)[[i]]);print(summary(y[[i]])$sigma)} rm(x.c1.out,x.c2.out,x.kma.out,x.kmab.out,x.out,xxx.df) Removal of Vmax or KmB resulted in singular gradients and thus no fit. Although removal of any one of the other parameters was less drastic, The values of fit$sigma, to be compared to the full fit value 0.001616097, are for x.c2.out=0.002042659, x.c1.out=0.0021374, x.kmab.out=0.002025575, x.kma.out=0.006302654. Looking at the smallest increase in sigma, for KmAB: F(1,20) = increase in SSQ/mean residual SSQ = (0.002025575^2*21 - 0.001616097^2*20)/(0.001616097^2) = 12.98991 Although it is somewhat improper to use this F statistic to draw a conclusion about a nonlinear model from a linear approximation of it, the high probability corresponding to the above F value, pf(12.98991, 1, 20) = 0.9982288, suggests that KmAB and thus also the other more "well-defined" parameters are all significant. To examine the quality of the linear approximation, try profile(). ########################################################################### 6. What is the effect of introducing the C3 parameter into the fit? yyy.df<-ldh.df parameters(yyy.df)<-list(Vmax=0.2541638,KmA=0.03251107,KmB=0.3056165,KmAB=0.002343344,C1=0.003951771,C2=0.006755385,C3=1.e-10) y.c3.out<-nls(v0 ~ (Vmax/(1 + C1*p + KmA/a + (KmB/b)*(1 + C1*p)*(1 + C2*p) + (KmAB/(a*b))*(1 + C1*p) + C3*p)), yyy.df, list(C3=1.e-3)) summary(y.c3.out) sum(y.c3.out$res^2) - sum(ldh.out1$res^2) rm(yyy.df,y.c3.out) The t value for C3 is -0.000186992 (effectively zero). The reduction in SSQ = 7.173252e-14 on introduction of the C3 parameter. The reduction ranges from 3.392665e-05 to 7.819570e-04 for the four parameters described above. Apparently, C3 has an insignificant effect upon the fit, in agreement with the earlier decision to omit C3 or fix it at a very small value. ########################################################################### #################### GRAPHICS DISPLAYS ################################## ########################################################################### The quality of the data and of the results of the fits can be examined by suitable graphics displays. Examination of multidimensional data, i.e., the dependence of a response on more than one variable, requires special strategies. ########################################################################### 1. Xgobi is a UNIX program for interactive dynamic graphics and data visualization. Hard copy and computer-readable versions of the xgobi manual are available. the S function xgobi() starts up the UNIX program with data from an S object (data frame, matrix, or set of vectors). Look at online help. !man xgobi ?xgobi Construct a data frame from the LDH data and the calculated values of the initial rates. ldh.plot.df<-as.data.frame(cbind(ldh.df,v0calc=(ldh.df$v0 - ldh.out1$res))) Start xgobi(). xgobi(ldh.plot.df) With focus in the xgobi window, examine the data. a. select xyplot, select y=v0 and x=p, select brush, color p=0 data red. b. select xyplot, select y=v0, fix y, cycle. v0-v0calc plot shows residuals small. c. transform v0, v0calc, a, b to inverse, then cycle again. 1/v0-1/a and 1/v0-1/b are Lineweaver-Burke plots d. select rotate, select pause, select 1/v0, 1/a, 1/b. learn to rotate the data cloud with the mouse. the p=0 data define a surface with a slight twist. replace v0 by v0calc, and see better the twist in the plane. the twist reflects the contribution of the KmAB/(a*b) term. e. cycle quickly between v0 and v0calc to get a feeling for deviations between measured and calculated values. f. select rock, deselect pause, use mouse to change view. g. this is a 5-dimensional data set. select tour and all 5 variables. although not useful for these data, tour can help discover patterns in multidimensional data and can isolate the important variables by principal component analysis or projection pursuit. there are sample data supplied with xgobi that illustrate the use of tour and its dependents. h. other options: dotplots, rescaling, erasing data, adding lines, postscript plots, etc. Quit. ########################################################################### 2. S on an sgi machine has a mouse- and menu-based facility for two- and three-dimensional display of data. Read documentation. ? brush ? iris4d Start the iris device driver and brush. brush(ldh.plot.df) With focus in the iris4d window, right-mouse brings up menus. left mouse highlights. middle mouse removes highlights. etc. Brush comes up in scatterplot mode. For these data there are 25 cells, each a plot of one variable against another. Labels for the variable are along the diagonal. Try the following: a. focus in a cell with something vs p. brush p=0 data red by clicking left with brush box over portion of data. b. focus in some cell of interest. blowup the plot with right-mouse menu selection. then reset. c. identify points by number, by selecting highlight and placing the cursor on the number list on the right of the display, or by selecting identify and putting the cursor on a point in a cell. d. change the size of the datum symbol, by selecting appropriate option and clicking middle mouse for larger, left for smaller. Prepare to go into rotation mode. Rotation mode is similar to rotation mode of xgobi, but perhaps a bit snazzier. With right-mouse menu and submenus, delete variables v0calc and p. Select rotation from menu. a. select velocity rotation mode, and play with data display. position of cursor about center functions like a joy stick in changing _rate_ of movement of data cube. left and middle mouse give left and right rotation of cube. b. select mouse rotation mode. as in a., except cursor changes position of cube instead of velocity of movement of cube. c. deselect variable v0 and select v0calc, etc., etc.. Quit. To get display of (1/v0, 1/z, 1/b, 1/v0calc, p), one must restart brush with appropriately modified data. attach(ldh.plot.df) brush(cbind("1/v0"=1/v0, "1/v0calc"=1/v0calc, "1/a"=1/a, "1/b"=1/b, "p"=p)) In scatterplot mode: a. brush p=0 data red b. go to cell v0 vs 1/b, select blowup. note that both the slope and intercept for p=150 (white data) differ from those for p=0 (lowest line of red data). this is diagnostic of a compulsory ordered ternary complex mechanism. Select rotation mode. a. select v0calc (or v0), 1/a, 1/b as variables b. select mouse or velocity rotation c. the effect of p=150 on slope and intercept is easily seen. d. the twist in the plane of the data is easily seen Quit. Clean up. detach(2) ########################################################################### 3. Plots made with S functions. S has a rich plotting language. The following are relatively unsophisticated examples. Start the X11 device driver (or the iris4d driver) and for convenience attach a data frame. iris4d() or x11() attach(ldh.plot.df) For scatterplots like that obtained with brush(): pairs( ~ v0 + a + b + p) pairs( ~ I(1/v0calc) + I(1/a) + I(1/b) + p) For plots of a single variable, first eliminate data with p>0. nop<-ldh.plot.df[ldh.plot.df$p==0,] Make a reciprocal plot of v0 for varied [NADH], and add points for calculated values. plot(1/nop$a,1/nop$v0) points(1/nop$a,1/nop$v0calc,pch="+") Plot the cumulative distribution function for each variable, which for a numeric variable is: sort(x) vs ppoints(x), where ppoints(x) is a vector of probabilities. plot() with a data frame object is the same as plot.data.frame(). plot(ldh.plot.df) Plot the cumulative distribution function for a single variable. plot(ppoints(v0), sort(v0)) plot(ppoints(v0-v0calc), sort(v0-v0calc)) Make a quantile plot of the residuals. qqnorm(v0-v0calc) qqline(v0-v0calc) Box plots also display the distribution of a variable. boxplot(v0,v0calc) boxplot(v0-v0calc) S can produce "3-d" plots == perspective plots. All duplicate values of the independent variables must be removed. Also, include data only for p=0, to simplify the surface. nop<-ldh.plot.df[ldh.plot.df$p==0,] nop.uniq <- nop[!duplicated(paste(nop[, "a"], nop[, "b"])),] Values of z (1/v0) are calculated over a grid of x (a) and y (b) points. i<-interp(1/nop.uniq$a, 1/nop.uniq$b, 1/nop.uniq$v0) j<-interp(1/nop.uniq$a, 1/nop.uniq$b, 1/nop.uniq$v0calc) k<-interp(1/nop.uniq$a, 1/nop.uniq$b, 1/nop.uniq$v0 - 1/nop.uniq$v0calc) l<-interp(nop.uniq$a, nop.uniq$b, nop.uniq$v0) m<-interp(nop.uniq$a, nop.uniq$b, nop.uniq$v0calc) n<-interp(nop.uniq$a, nop.uniq$b, nop.uniq$v0 - nop.uniq$v0calc) Draw perspective plots. persp(i) persp(j) persp(k) persp(l) persp(m) persp(n) Draw contour plots. contour(i,nint=30) contour(j,nint=30) contour(k,nint=30) contour(l,nint=30) contour(m,nint=30) contour(n,nint=30) Clean up. detach(2) rm(??????) ########################################################################### #################### LINEAR FITS ######################################## ########################################################################### S has powerful operators to handle linear models. The LDH rate equation can be thrown into linear form if the denominator terms with product are deleted, i.e., if measurements are only for p=0. Make the new data frame. attach(ldh.df) objects(2) ldh.nop.df<-data.frame(v0,a,b)[1:20,] For reference carry out a nonlinear fit for ldh.nop.df. ldh.nop.out1<-nls(v0 ~ (Vmax/(1 + KmA/a + KmB/b + KmAB/(a*b))), ldh.nop.df,list("Vmax"=1e+00,"KmA"=1e-03,"KmB"=1e-02,"KmAB"=1e-03)) And now a linear fit of the reciprocal rate equation with lm(). lm.out1<-lm(1/v0 ~ I(1/a)*I(1/b),data=ldh.nop.df) And weight the linear fit by the theoretical v0^4. lm.wt.v04.out1<-lm(1/v0 ~ I(1/a)*I(1/b),data=ldh.nop.df, weights=ldh.nop.df$v0^4) Compare the weighted linear fit and the nonlinear fit. x<-coef(lm.wt.v04.out1) (coef(ldh.nop.out1) - x/c(x[1]^2,x[1],x[1],x[1]))/coef(ldh.nop.out1) The differences in parameter estimates range from 0.25 to 1.3 percent. This is plausible, given the weighting introduced, which is itself subject to the uncertainty of the sample. Compare the unweighted linear fit and the nonlinear fit. x<-coef(lm.out1) coef(ldh.nop.out1) - x/c(x[1]^2,x[1],x[1],x[1]))/coef(ldh.nop.out1) The differences range from 7.6 to 28 percent, as expected because of the incorrect weighting. It is not nice to do an unweighted reciprocal plot. A difficulty with the linear fit to a reciprocal model is that the estimates of the standard error in the parameters are for (1/Vmax) and (KmX/Vmax), X={A, B, AB}. Typically one wants uncertainties for Vmax and KmX, as are obtained with the nonlinear model of the nls() fit. The uncertainties for Vmax and KmX are calculated as: var(z) = sum_over_i sum_over_j d(z)/d(P_i)*d(z)/d(P_j)*covar_i_j Where P_i are the linear model parameters, (1/Vmax) and (KmX/Vmax). Covar_i_j is the covariance matrix for the linear model. Var(z) is the variance to be calculated for the parameter Vmax or KmX. z is a function expressing the parameter Vmax or KmX in terms of the linear model parameters, (1/Vmax) and (KmX/Vmax). The calculation is appropriately carried out within a function. lm.to.nls <- function(lmout) { #for linear fit to LDH data #compute values and std.error of Vmax and KmA, etc. #from values and std.errors of 1/Vmax and KmA/Vmax, etc. # #value: data frame with estimate, std.error and t value for each parameter #argument: data frame returned by lm() # std.error<-numeric(4) parameters<-numeric(4) sigmasq<-sum(summary(lmout)$residuals^2)/16 cij<-summary(lmout)$cov*sigmasq x<-coef(lmout) parameters[1]<-1/x[1] std.error[1]<-sqrt(1/x[1]^4*cij[1,1]) for (i in 2:4) { parameters[i]<-x[i]/x[1] std.error[i]<-sqrt((1/x[1])^2*cij[i,i] + (-x[i]/(x[1]^2))^2*cij[1,1] + 2*(1/x[1])*(-x[i]/(x[1]^2))*cij[1,i]) } data.frame(parameters = parameters, std.error = std.error, t = parameters/std.error) } Run this function, lm.to.nls(), on the results of the weighted linear fit and compare with the results of the nonlinear fit. x<-lm.to.nls(lm.wt.v04.out1) x summary(ldh.nop.out1)$parameters (summary(ldh.nop.out1)$parameters - x)/x The ldh.nop.out1 values for Vmax and KmA are close to the values calculated from lm.wt.v04.out1. The largest differences are in KmAB: 1.3 percent in the parameter estimate, 3 percent in the std. error, and 4.5 percent in t. Analysis of variance, with the function aov(), can be used to assess the linear model. summary(aov(1/v0 ~ I(1/a) + I(1/b) + I(1/a):I(1/b),data=ldh.nop.df, weights=ldh.nop.df$v0^4)) The term with the coefficient (KmAB/Vmax), which is the least significant factor, enters into the fit at the .005 confidence level. The confidence level for the (KmAB/Vmax) parameter estimated from the t value of the weighted linear fit is .0025. When viewing the surface, 1/v0 = f(1/a, 1/b), a slight twist was seen. We can examine the origin of this twist. Carry out a linear fit without the term (KmAB/Vmax)*(1/a)*(1/b). lm.wt.nokmab.out1<-lm(1/v0 ~ I(1/a) + I(1/b),data=ldh.nop.df, weights=ldh.nop.df$v0^4) Construct data frames for use in graphics displays. ldh.nop.plot.df<-as.data.frame(cbind(ldh.nop.df,v0calc=(ldh.nop.df$v0 - ldh.nop.out1$res))) lm.nokmab.plot.df<-as.data.frame(cbind(ldh.nop.df, v0calc=lm.wt.nokmab.out1$fitted.values)) Display of the results with xgobi shows that the plane of the calculated values is not twisted for lm.nokmab.plot.df and is twisted for ldh.nop.plot.df, showing that the twist comes from a contribution of the cross product term 1/a * 1/b.