Tech Note 413 Detail

DataStudio Curve Fitting

Problem/Symptom:
What curve fitting algorithms does DataStudio use?

PASCO Solution:

DataStudio Modeling

DataStudio uses parametric analysis to determine the "best fit" for a data set. Parametric analysis is valid only for interval or ratio measurements that meet three conditions: the errors of a set of repeated measurements around the mean value follow a normal (Gaussian) distribution, the standard deviation of that distribution is the same at every value of x, and the value of x is assumed to be known precisely. If your data are nominal, ordinal, or not normally distributed, use nonparametric statistics instead. Fortunately, measurement errors that arise from many small, independent effects tend to be approximately normally distributed, as the central limit theorem suggests.

Linear Fit

DataStudio uses standard methods to obtain an exact solution for a best linear fit.
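The note does not say which formulas DataStudio uses internally. As a minimal sketch, the textbook closed-form least-squares solution for a line y = m*x + b looks like this (the function name linear_fit is hypothetical, chosen for illustration):

```python
def linear_fit(xs, ys):
    """Closed-form ordinary least-squares fit of y = slope * x + intercept.

    Textbook formulas, not DataStudio's actual code.
    """
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared x deviations and sum of x-y cross deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = linear_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```

Because the solution is a direct formula, no starting guess or iteration is needed, which is why a linear fit can always be computed.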

Because a linear fit can always be computed for any set of data (whether the relationship is significant or not), some have been tempted to always linearize the data with a transformation and then fit the transformed data to a line. Unfortunately, linearization usually results in a poor fit of the untransformed data, because transforming the original data changes the weighting of the experimental error of the data values. What appears to be a good fit of the linearized data frequently represents a poor fit of the raw data.
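The change in weighting can be seen with first-order error propagation. The exponential model below is an assumption chosen for illustration, not one the note singles out. Suppose every y value has the same standard deviation \sigma_y and the data are linearized by taking the logarithm:

```latex
y = A e^{bx}
\quad\Longrightarrow\quad
\ln y = \ln A + b x ,
\qquad
\sigma_{\ln y} \approx \left| \frac{d(\ln y)}{dy} \right| \sigma_y = \frac{\sigma_y}{y}
```

The uncertainty of ln y is therefore not constant: it grows as y gets smaller. An unweighted line fit of ln y against x treats all points as equally reliable, so the small-y points, which are the least certain after the transformation, have too much influence on the result.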

The (Pearson's) correlation coefficient r that is reported by DataStudio will be a good measure of the correlation of the data if and only if all of the assumptions of parametric analysis are correct and if there is a sufficient amount of data for the correlation to be significant.
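For reference, the standard definition of Pearson's r can be sketched as follows. This is the textbook formula, and the function name pearson_r is hypothetical. The note does not describe how DataStudio computes the value internally.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient from its textbook definition."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when x or y has no variation")
    return sxy / math.sqrt(sxx * syy)


print(pearson_r([1, 2, 3, 4], [1, 3, 2, 4]))
```

Note that r by itself says nothing about significance: two points always give r = +1 or -1, which is why a sufficient amount of data is required.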

Nonlinear Curve Fits

All other curve fits in DataStudio 1.9+ use an iterative model/trust-region technique along with an adaptive choice of the model Hessian.

The algorithm is essentially a combination of the Gauss-Newton and Levenberg-Marquardt iterative methods that converges more reliably and quickly than either of these methods alone. Previous versions of DataStudio used a pure Levenberg-Marquardt method that appeared in Numerical Recipes in C, which frequently would fail to converge to a solution.

DataStudio 1.9+ computes the sum of the squared residuals for one set of parameter values and then slightly alters each parameter value and recomputes the sum of squared residuals to see how the parameter value change affects the sum of the squared residuals. By dividing the difference between the original and new sum of squared residual values by the amount the parameter was altered, DataStudio 1.9+ is able to determine the approximate partial derivative with respect to the parameter. This partial derivative is used by DataStudio 1.9+ to decide how to alter the value of the parameter for the next iteration.
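The perturbation step described above can be sketched as a forward finite difference. The function names, the model(x, params) calling convention, and the step-size rule are assumptions for illustration; the note does not give DataStudio's actual step size.

```python
def sum_squared_residuals(model, params, xs, ys):
    """Sum of squared differences between the data and the model."""
    return sum((y - model(x, params)) ** 2 for x, y in zip(xs, ys))


def approx_partials(model, params, xs, ys, rel_step=1e-6):
    """Approximate the partial derivative of the sum of squared residuals
    with respect to each parameter by slightly altering that parameter."""
    base = sum_squared_residuals(model, params, xs, ys)
    partials = []
    for i, p in enumerate(params):
        # Assumed step rule: a small fraction of the parameter's magnitude.
        step = rel_step * max(abs(p), 1.0)
        shifted = list(params)
        shifted[i] = p + step
        changed = sum_squared_residuals(model, shifted, xs, ys)
        partials.append((changed - base) / step)
    return partials


def line(x, p):
    return p[0] * x + p[1]


# Data lie exactly on y = 2x + 1; the trial parameters are slope 1, intercept 0.
print(approx_partials(line, [1.0, 0.0], [0, 1, 2], [1, 3, 5]))
```

Both partial derivatives come out negative here (close to -16 and -12), indicating that increasing either parameter would reduce the sum of squared residuals.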

If the function being modeled is well-behaved, and the starting value for the parameter is not too far from the optimum value, the procedure will eventually converge to the best estimate for the parameter. This procedure is carried out simultaneously for all parameters and is, in fact, a minimization problem in n-dimensional space, where n is the number of parameters. For a much more detailed explanation of the regression algorithm used by DataStudio 1.9+, see Dennis, J.E., Gay, D.M., and Welsch, R.E., "An adaptive nonlinear least-squares algorithm," ACM Transactions on Mathematical Software 7, 3 (Sept. 1981).
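To make the idea of iterating on all parameters at once concrete, here is a minimal damped Gauss-Newton (Levenberg-Marquardt-style) sketch. It is not the adaptive algorithm of Dennis, Gay, and Welsch that DataStudio 1.9+ uses, only a simpler relative of it. All names, the damping schedule, the stopping rule, and the exponential test model are assumptions for illustration. For simplicity it differentiates the model rather than the sum of squared residuals.

```python
import math


def half_ssr(model, params, xs, ys):
    """Half the sum of squared residuals for the given parameters."""
    return 0.5 * sum((y - model(x, params)) ** 2 for x, y in zip(xs, ys))


def model_partials(model, params, xs, rel_step=1e-7):
    """Forward-difference partials of the model; one list per parameter."""
    base = [model(x, params) for x in xs]
    partials = []
    for i, p in enumerate(params):
        step = rel_step * max(abs(p), 1.0)
        shifted = list(params)
        shifted[i] = p + step
        partials.append([(model(x, shifted) - b) / step for x, b in zip(xs, base)])
    return partials


def solve_linear(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_damped(model, start, xs, ys, max_iter=200, rel_tol=1e-10, abs_tol=1e-20):
    """Iteratively adjust all parameters together to reduce half_ssr."""
    params = list(start)
    n = len(params)
    damping = 1e-3  # assumed initial damping factor
    f = half_ssr(model, params, xs, ys)
    for _ in range(max_iter):
        if f < abs_tol:
            break
        resid = [y - model(x, params) for x, y in zip(xs, ys)]
        cols = model_partials(model, params, xs)
        jtj = [[sum(a * b for a, b in zip(cols[i], cols[j])) for j in range(n)]
               for i in range(n)]
        jtr = [sum(a * b for a, b in zip(cols[i], resid)) for i in range(n)]
        for i in range(n):
            jtj[i][i] += damping * max(jtj[i][i], 1e-12)
        delta = solve_linear(jtj, jtr)
        trial = [p + d for p, d in zip(params, delta)]
        f_trial = half_ssr(model, trial, xs, ys)
        if f_trial < f:
            # Accept the step and trust the local model more next time.
            small_gain = (f - f_trial) <= rel_tol * f
            params, f = trial, f_trial
            damping = max(damping / 10.0, 1e-12)
            if small_gain:
                break
        else:
            # Reject the step and take a more cautious one next time.
            damping *= 10.0
    return params


def exponential(x, p):
    return p[0] * math.exp(p[1] * x)


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
print(fit_damped(exponential, [1.5, 0.4], xs, ys))
```

A starting guess far from the optimum can still leave a procedure like this stuck or slow, which is the practical meaning of "not too far from the optimum value."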

Convergence Criteria

DataStudio 1.9+ has several convergence criteria that stop the iterative minimization procedure. The TOLERANCE statement can be used to alter the convergence tolerance value. Two internal variables are used to determine when convergence has occurred. Relative Function Tolerance (RFCTOL) has a default value of 1E-10 and can be altered with the TOLERANCE statement. Absolute Function Tolerance (AFCTOL) has a default value of 1E-20 and is altered by the TOLERANCE statement only if the value specified is less than the default value. In the discussion that follows, the "function value" is half the sum of the squared residuals computed using the current parameter estimates.

"Relative function convergence" is reported if the predicted maximum possible function reduction is at most RFCTOL*ABS(F0), where F0 is the function value at the start of the current iteration, and if the last step attempted achieved no more than twice the predicted function decrease.

"Absolute function convergence" is reported if the function value is less than AFCTOL.
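The two tests can be written out as a small sketch. The function and argument names are hypothetical, and the order in which the tests are checked is an assumption. The quantities passed in (predicted and actual reductions) would come from the minimizer itself, which is not shown here.

```python
def convergence_status(f0, f_now, max_predicted_reduction,
                       predicted_decrease, actual_decrease,
                       rfctol=1e-10, afctol=1e-20):
    """Return the name of the convergence test that passed, or None.

    f0: function value at the start of the current iteration
    f_now: current function value (half the sum of squared residuals)
    max_predicted_reduction: predicted maximum possible function reduction
    predicted_decrease: decrease predicted for the last step attempted
    actual_decrease: decrease the last step attempted actually achieved
    """
    if f_now < afctol:
        return "absolute function convergence"
    if (max_predicted_reduction <= rfctol * abs(f0)
            and actual_decrease <= 2.0 * predicted_decrease):
        return "relative function convergence"
    return None


print(convergence_status(4.0, 4.0, 1e-11, 1e-12, 1e-12))
```

With the default tolerances, the example reports relative function convergence because the predicted maximum reduction (1e-11) is below RFCTOL*ABS(F0) = 4e-10 and the last step did not beat its prediction by more than a factor of two.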

Usually it will be obvious by simply glancing at the data whether there is a good fit or not. If the curve fit passes within a standard deviation of all of the data values, the curve fit is good. If the curve fit systematically deviates from several data points, the curve fit is poor.
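As a rough numerical counterpart to this visual check, the sketch below lists the data points that lie more than one standard deviation from the fitted curve. The function name, the model(x, params) convention, and the use of a single known standard deviation for every point are assumptions for illustration.

```python
def points_outside_one_sigma(model, params, xs, ys, sigma):
    """Indices of data points farther than sigma from the fitted curve."""
    return [i for i, (x, y) in enumerate(zip(xs, ys))
            if abs(y - model(x, params)) > sigma]


def line(x, p):
    return p[0] * x + p[1]


# Fitted line y = 2x + 1 with an assumed measurement sigma of 0.5.
print(points_outside_one_sigma(line, [2.0, 1.0], [0, 1, 2], [1.1, 2.9, 6.0], 0.5))
```

An empty list corresponds to the "good fit" case described above. A run of neighboring indices, especially with residuals of the same sign, suggests a systematic deviation.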



Creation Date: 05/30/2003
Last Modified: 03/28/2014
Mod Summary: