I would like to estimate the uncertainty or the reliability of a fitted curve. I intentionally do not name an exact mathematical quantity that I am looking for, because I do not know what it is.
Here, $E$ (energy) is the dependent variable (response) and $V$ (volume) is the independent variable. I would like to find the energy-volume curve, $E(V)$, of some material. So I did calculations with a quantum-chemistry computer program to obtain the energy for some sample volumes (the green circles in the plot).
Then, I fitted these data samples with the Birch-Murnaghan function

$$E(E_0,V_0,B_0,B_0';\,V) = E_0 + \frac{9 V_0 B_0}{16}\left\{\left[\left(\frac{V_0}{V}\right)^{2/3}-1\right]^3 B_0' + \left[\left(\frac{V_0}{V}\right)^{2/3}-1\right]^2\left[6-4\left(\frac{V_0}{V}\right)^{2/3}\right]\right\},$$

which depends on four parameters: $E_0, V_0, B_0, B_0'$. I also assume that this is the correct fitting function, so that all errors come purely from the noise of the samples. In what follows, the fitted function will be written as $\hat{E}(V)$.
Here you can see the result (a fit with a least-squares algorithm). The y-axis variable is $E$ and the x-axis variable is $V$. The blue line is the fit and the green circles are the sample points.
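For reference, a minimal R sketch of this kind of least-squares fit (the sample values, noise level, and starting guesses below are made up for illustration; my real samples come from the quantum-chemistry code):

```r
## Sketch of the nonlinear least-squares fit (made-up placeholder data)
birch <- function(V, E0, V0, B0, Bp) {
  u <- (V0 / V)^(2/3)
  E0 + 9 * V0 * B0 / 16 * ((u - 1)^3 * Bp + (u - 1)^2 * (6 - 4 * u))
}

set.seed(1)
V <- seq(14, 28, length.out = 12)                               # sample volumes
E <- birch(V, -10, 20, 0.5, 4.5) + rnorm(length(V), sd = 0.01)  # noisy energies

fit <- nls(E ~ birch(V, E0, V0, B0, Bp),
           start = list(E0 = -9, V0 = 19, B0 = 0.4, Bp = 4))
summary(fit)   # reports standard errors for E0, V0, B0, B0'
```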
I now need some measure of the reliability (preferably dependent on the volume) of this fitted curve $\hat{E}(V)$, because I need it to calculate further quantities such as transition pressures or enthalpies.
My intuition tells me that the fitted curve is most reliable in the middle, so I would guess that the uncertainty (say, an uncertainty band) should increase towards the ends of the sample data, as in this sketch:
What kind of measure am I looking for, and how can I calculate it?
To be precise, there is actually only one error source here: the calculated samples are noisy due to computational limits. So if I computed a dense set of data samples, they would form a bumpy curve.
My idea for finding the desired uncertainty estimate is to calculate the following "error" from the parameters, as learned in school (propagation of uncertainty):

$$\Delta \hat{E}(V) = \sqrt{\left(\frac{\partial \hat{E}(V)}{\partial E_0}\,\Delta E_0\right)^2 + \left(\frac{\partial \hat{E}(V)}{\partial V_0}\,\Delta V_0\right)^2 + \left(\frac{\partial \hat{E}(V)}{\partial B_0}\,\Delta B_0\right)^2 + \left(\frac{\partial \hat{E}(V)}{\partial B_0'}\,\Delta B_0'\right)^2}$$
Is that an acceptable approach or am I doing it wrong?
PS: I know that I could also just sum up the squares of the residuals between my data samples and the fitted curve to get some kind of "standard error", but that is not volume-dependent.
Answers:
This is an ordinary least squares problem!
Defining

$$x = V^{-2/3},$$

the model can be rewritten as

$$E = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3,$$

where the coefficients $\beta = (\beta_i)'$ are algebraically related to the original coefficients via

$$\beta_0 = E_0 + \frac{9 B_0 V_0}{16}(6 - B_0'),\qquad
\beta_1 = \frac{9 B_0 V_0}{16}(3B_0' - 16)\,w,\qquad
\beta_2 = \frac{9 B_0 V_0}{16}(14 - 3B_0')\,w^2,\qquad
\beta_3 = \frac{9 B_0 V_0}{16}(B_0' - 4)\,w^3,$$

with $w = V_0^{2/3}$.
This is straightforward to solve algebraically or numerically: pick the solution for which $B_0$, $B_0'$, and $w$ are positive. The only reason to do this is to obtain estimates of $B_0$, $B_0'$, $w$, and $E_0$ and to verify they are physically meaningful. All analyses of the fit can be carried out in terms of $\beta$.
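For concreteness, and assuming the coefficient relation written above, the back-substitution can be carried out as follows. Eliminating $B_0'$ and the common factor $\tfrac{9 B_0 V_0}{16}$ from the expressions for $\beta_1, \beta_2, \beta_3$ leaves a quadratic in $w$,

$$\beta_1 w^2 + 2\beta_2 w + 3\beta_3 = 0
\quad\Longrightarrow\quad
w = \frac{-\beta_2 \pm \sqrt{\beta_2^2 - 3\beta_1\beta_3}}{\beta_1},$$

and once the positive root is chosen,

$$\frac{9 B_0 V_0}{16} = \frac{\beta_2 w + 3\beta_3}{2 w^3},\qquad
B_0' = 4 + \frac{2\beta_3}{\beta_2 w + 3\beta_3},\qquad
V_0 = w^{3/2},\qquad
B_0 = \frac{16}{9 V_0}\cdot\frac{\beta_2 w + 3\beta_3}{2 w^3},\qquad
E_0 = \beta_0 - \frac{9 B_0 V_0}{16}\,(6 - B_0').$$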
This approach is not only much simpler than nonlinear fitting, it is also more accurate: the variance-covariance matrix for $(E_0, B_0, B_0', V_0)$ returned by a nonlinear fit is only a local quadratic approximation to the sampling distribution of these parameters, whereas (for normally distributed errors in measuring $E$, anyway) the OLS results are not approximations.
Confidence intervals, prediction intervals, etc. can be obtained in the usual way without needing to find these values: compute them in terms of the estimates $\hat\beta$ and their variance-covariance matrix. (Even Excel could do this!) Here is an example, followed by the (simple) R code that produced it.

If you are interested in the joint distribution of the original parameter estimates, then it is easy to simulate from the OLS solution: simply generate multivariate Normal realizations of $\beta$ and convert those into the parameters. Here is a scatterplot matrix of 2000 such realizations. The strong curvilinearity shows why the Delta method is likely to give poor results.
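A minimal R sketch of this workflow (not the original code; the sample data are made up, and the parameter conversion assumes the coefficient relation given earlier):

```r
## Minimal sketch: OLS fit, confidence band, and simulated parameter distribution
library(MASS)   # for mvrnorm

## made-up noisy energy-volume samples (replace with the real ones)
birch <- function(V, E0, V0, B0, Bp) {
  u <- (V0 / V)^(2/3)
  E0 + 9 * V0 * B0 / 16 * ((u - 1)^3 * Bp + (u - 1)^2 * (6 - 4 * u))
}
set.seed(1)
V <- seq(14, 28, length.out = 12)
E <- birch(V, -10, 20, 0.5, 4.5) + rnorm(length(V), sd = 0.01)

## ordinary least squares on the linearized model
x   <- V^(-2/3)
fit <- lm(E ~ x + I(x^2) + I(x^3))

## volume-dependent confidence band for the fitted curve
V.grid <- seq(min(V), max(V), length.out = 200)
nd     <- data.frame(x = V.grid^(-2/3))
band   <- predict(fit, nd, interval = "confidence")   # columns: fit, lwr, upr
plot(V, E, col = "darkgreen")
matlines(V.grid, band, lty = c(1, 2, 2), col = "blue")

## convert beta to (E0, V0, B0, B0'), assuming the relation given above
beta.to.theta <- function(beta) {
  beta <- unname(beta)                              # beta[1..4] = beta_0..beta_3
  w <- (-beta[3] + c(-1, 1) * sqrt(beta[3]^2 - 3 * beta[2] * beta[4])) / beta[2]
  w <- w[w > 0][1]                                  # root with w = V0^(2/3) > 0
  C  <- (beta[3] * w + 3 * beta[4]) / (2 * w^3)     # C = 9 B0 V0 / 16
  Bp <- 4 + beta[4] / (C * w^3)
  V0 <- w^(3/2)
  c(E0 = beta[1] - C * (6 - Bp), V0 = V0, B0 = 16 * C / (9 * V0), Bp = Bp)
}
beta.to.theta(coef(fit))

## joint distribution of the original parameters via simulation
sim   <- mvrnorm(2000, coef(fit), vcov(fit))   # multivariate Normal draws of beta
theta <- t(apply(sim, 1, beta.to.theta))       # draws with no positive root give NA
pairs(theta)                                   # scatterplot matrix of realizations
```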
There is a standard approach for this called the delta method. You form the inverse of the Hessian of the negative log-likelihood with respect to your four parameters. (There is an extra parameter for the variance of the residuals, but it does not play a role in these calculations.) Then you calculate the predicted response for the desired values of the independent variable and calculate its gradient (the derivatives with respect to these four parameters). Call the inverse of the Hessian $I$ and the gradient vector $g$. You form the vector-matrix product

$$g'\,I\,g,$$

which gives the (approximate) variance of the predicted response; its square root is the standard deviation.
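As an illustration only (not the AD Model Builder workflow described next), the same delta-method calculation can be sketched in R, using a nonlinear fit, its estimated parameter covariance matrix in place of $I$, and a numerical gradient; the sample data and starting values are made up:

```r
## Delta-method sketch in R (illustrative data; not AD Model Builder)
birch <- function(V, E0, V0, B0, Bp) {
  u <- (V0 / V)^(2/3)
  E0 + 9 * V0 * B0 / 16 * ((u - 1)^3 * Bp + (u - 1)^2 * (6 - 4 * u))
}
set.seed(1)
V <- seq(14, 28, length.out = 12)
E <- birch(V, -10, 20, 0.5, 4.5) + rnorm(length(V), sd = 0.01)

## nonlinear least-squares fit of the four parameters
fit   <- nls(E ~ birch(V, E0, V0, B0, Bp),
             start = list(E0 = -9, V0 = 19, B0 = 0.4, Bp = 4))
theta <- coef(fit)
Sigma <- vcov(fit)      # plays the role of the inverse Hessian "I"

## predicted response at a new volume, e.g. 2 times the largest observed one
V.new <- 2 * max(V)
pred  <- function(p) birch(V.new, p["E0"], p["V0"], p["B0"], p["Bp"])

## numerical gradient g of the prediction with respect to the parameters
g <- sapply(names(theta), function(nm) {
  h  <- 1e-6 * max(abs(theta[nm]), 1)
  up <- theta; up[nm] <- up[nm] + h
  dn <- theta; dn[nm] <- dn[nm] - h
  (pred(up) - pred(dn)) / (2 * h)
})

## delta-method standard deviation of the predicted response: sqrt(g' I g)
sqrt(drop(t(g) %*% Sigma %*% g))
```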
This can be done for any dependent variable in AD Model Builder: one declares a variable for the quantity of interest in the appropriate spot in the code and writes the code to evaluate that dependent variable. Note that here it is evaluated for a value of the independent variable equal to 2 times the largest one observed in the model fitting. Fit the model and one obtains the standard deviation for this dependent variable.
I have modified the program to include code in the (TPL) file for calculating the confidence limits for the enthalpy-volume function.
Then I refitted the model to get the standard deviations for the estimates of $H$. These are calculated for your observed $V$ values, but could easily be calculated for any value of $V$.
It has been pointed out that this is actually a linear model for which there is simple R code to perform the parameter estimation via OLS. This is very appealing, especially to naive users. However, since the work of Huber over thirty years ago we know, or should know, that one should probably almost always replace OLS with a moderately robust alternative. The reason this is not routinely done, I believe, is that robust methods are inherently nonlinear. From this point of view the simple, appealing OLS methods in R are more of a trap than a feature. An advantage of the AD Model Builder approach is its built-in support for nonlinear modelling. To change the least-squares code to a robust normal mixture, only one line of the code needs to be changed.
The amount of overdispersion in the model is measured by the parameter $a$. If $a$ equals 1.0, the variance is the same as for the normal model. If the variance is inflated by outliers, we expect $a$ to be smaller than 1.0. For these data the estimate of $a$ is about 0.23, so the variance is about 1/4 of the variance for the normal model; the interpretation is that outliers have increased the variance estimate by a factor of about 4. The effect is to increase the size of the confidence bounds for the parameters of the OLS model, which represents a loss of efficiency. For the normal mixture model there are only small changes in the point estimates of the enthalpy-volume function, while the confidence limits are reduced to about 60% of the ones produced by OLS.
The main point I want to make is that all the modified calculations occur automatically once one changes the one line of code in the TPL file.
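For readers staying in R with the linearized OLS model, the general robustness idea can be sketched with a Huber-type M-estimator; this is an illustration only, not the normal-mixture likelihood used above, and the data are made up:

```r
## Robust alternative to OLS on the linearized model (illustrative sketch)
library(MASS)   # for rlm

birch <- function(V, E0, V0, B0, Bp) {
  u <- (V0 / V)^(2/3)
  E0 + 9 * V0 * B0 / 16 * ((u - 1)^3 * Bp + (u - 1)^2 * (6 - 4 * u))
}

## made-up noisy samples with one artificial outlier
set.seed(1)
V <- seq(14, 28, length.out = 12)
E <- birch(V, -10, 20, 0.5, 4.5) + rnorm(length(V), sd = 0.01)
E[3] <- E[3] + 0.05

x <- V^(-2/3)
fit.ols    <- lm(E ~ x + I(x^2) + I(x^3))
fit.robust <- rlm(E ~ x + I(x^2) + I(x^3), psi = psi.huber)   # Huber M-estimation

## the robust fit downweights the outlier; compare the two curves
V.grid <- seq(min(V), max(V), length.out = 200)
nd <- data.frame(x = V.grid^(-2/3))
plot(V, E, col = "darkgreen")
lines(V.grid, predict(fit.ols, nd), col = "blue")
lines(V.grid, predict(fit.robust, nd), col = "red", lty = 2)
```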
Cross-validation is a simple way to estimate the reliability of your curve: https://en.wikipedia.org/wiki/Cross-validation_(statistics)
Propagation of uncertainty with partial differentials is great if you really know $\Delta E_0$, $\Delta V_0$, $\Delta B_0$, and $\Delta B_0'$. However, the program you are using gives only fitting errors(?). Those will be too optimistic (unrealistically small).
You can calculate the leave-one-out validation error by leaving one of your points out of the fit and using the fitted curve to predict the value of the point that was left out. Repeat this for all points so that each is left out once. Then, calculate the validation error of your final curve (the curve fitted with all points) as the average of these prediction errors.
This will only tell you how sensitive your model is to any new data point. For example, it will not tell you how inaccurate your energy model itself is. However, it will be a much more realistic error estimate than the mere fitting error.
Also, you can plot prediction errors as a function of volume if you want.
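A short R sketch of this leave-one-out scheme (the data and the nonlinear fit below are made-up placeholders for the real setup):

```r
## Leave-one-out cross-validation sketch (illustrative data and fit)
birch <- function(V, E0, V0, B0, Bp) {
  u <- (V0 / V)^(2/3)
  E0 + 9 * V0 * B0 / 16 * ((u - 1)^3 * Bp + (u - 1)^2 * (6 - 4 * u))
}
set.seed(1)
V <- seq(14, 28, length.out = 12)
E <- birch(V, -10, 20, 0.5, 4.5) + rnorm(length(V), sd = 0.01)

start <- list(E0 = -9, V0 = 19, B0 = 0.4, Bp = 4)

## refit without point i, predict the held-out point, record the error
pred.err <- sapply(seq_along(V), function(i) {
  d.train <- data.frame(V = V[-i], E = E[-i])
  fit.i   <- nls(E ~ birch(V, E0, V0, B0, Bp), data = d.train, start = start)
  E[i] - predict(fit.i, newdata = data.frame(V = V[i]))
})

## overall validation error and its volume dependence
sqrt(mean(pred.err^2))                    # root-mean-square prediction error
plot(V, abs(pred.err), type = "h",
     xlab = "V", ylab = "|prediction error|")
```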