The Tweedie distribution can model skewed data with a point mass at zero when the parameter p (the exponent in the mean-variance relationship) is between 1 and 2.
Similarly, a zero-inflated model (whether otherwise continuous or discrete) may have a large number of zeros.
I am having trouble understanding why, when I predict or compute fitted values with these kinds of models, all of the predicted values are non-zero.
Can these models actually predict exact zeros?
For example:
library(tweedie)
library(statmod)
# generate data
y <- rtweedie( 100, xi=1.3, mu=1, phi=1) # xi=p
x <- y+rnorm( length(y), 0, 0.2)
# estimate p
out <- tweedie.profile( y~1, p.vec=seq(1.1, 1.9, length=9))
# fit glm
fit <- glm( y ~ x, family=tweedie(var.power=out$p.max, link.power=0))
# predict
pred <- predict.glm(fit, newdata=data.frame(x=x), type="response")
pred now does not contain any zeros. I thought the usefulness of models such as the Tweedie distribution comes from their ability to predict exact zeros as well as the continuous part.
I know that in my example the variable x is not very predictive.
Answers:
Note that the predicted value in a GLM is a mean.
For any distribution on non-negative values, predicting a mean of 0 would require the distribution to be entirely a spike at 0.
However, with a log link you are never going to fit a mean of exactly zero (since that would require η to go to −∞).
So your problem isn't a problem with the Tweedie, but far more general; you'd have exactly the same issue with the Poisson (zero-inflated or ordinary Poisson GLM) for example.
Since predicting exact zeros isn't going to occur for any distribution over non-negative values with a log-link, your thinking on this must be mistaken.
One of the attractions of the Tweedie is that it can model exact zeros in the data, not that the mean predictions will be 0. [Of course a fitted distribution with nonzero mean can still have a positive probability of being exactly zero, even though the mean must exceed 0. A suitable prediction interval could well include 0, for example.]
It matters not at all that the fitted distribution includes any substantial proportion of zeros - that doesn't make the fitted mean zero.
Note that if you change your link function to say an identity link, it doesn't really solve your problem -- the mean of a non-negative random variable that's not all-zeros will be positive.
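The point can be checked numerically with an ordinary Poisson GLM. The following is a minimal sketch on simulated data (the seed, sample size, and coefficients are arbitrary assumptions, not taken from the question): every fitted mean is strictly positive, yet the fitted distribution still assigns positive probability to an exact zero.

```r
# Illustrative sketch: simulated data with arbitrary coefficients (assumptions).
set.seed(1)
n <- 200
x <- runif(n, -2, 2)
y <- rpois(n, lambda = exp(-0.5 + 0.8 * x))   # data containing exact zeros

# Poisson GLM with a log link: fitted mean is exp(eta), so it cannot be 0
fit <- glm(y ~ x, family = poisson(link = "log"))
mu_hat <- predict(fit, type = "response")

sum(y == 0)                            # number of observed zeros
min(mu_hat)                            # smallest fitted mean, still positive
p0_hat <- dpois(0, lambda = mu_hat)    # fitted P(Y = 0 | x) = exp(-mu_hat)
range(p0_hat)                          # fitted zero probabilities across x
```

The quantity that reflects the zeros is `p0_hat`, the fitted probability of a zero, not the fitted mean `mu_hat`.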
Predicting the proportion of zeros
I am the author of the statmod package and joint author of the tweedie package. Everything in your example is working correctly. The code is accounting correctly for any zeros that might be in the data.
As Glen_b and Tim have explained, the predicted mean value will never be exactly zero, unless the probability of a zero is 100%. What might be of interest though is the predicted proportion of zeros, and this can easily be extracted from the model fit as I show below.
Here is a more sensible working example. First simulate some data:
The data contains 12 zeros.
Now fit a Tweedie glm:
Of course the regression on x is highly significant. The estimated value of the dispersion ϕ is 0.85786.
The predicted proportion of zeros for each value of x can be computed from the following formula:
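For a Tweedie distribution with power parameter 1 < p < 2, the probability of an exact zero is exp(−μ^(2−p) / (ϕ(2−p))). A minimal sketch of that calculation in base R follows. The helper name `tweedie_prob_zero` and the example inputs are illustrative assumptions; in practice `mu` would be the fitted means (for example `fitted(fit)`), `phi` the estimated dispersion, and `power` the variance power used in the fit.

```r
# Probability of an exact zero for a Tweedie distribution with 1 < power < 2.
# tweedie_prob_zero is a hypothetical helper name, not part of any package.
tweedie_prob_zero <- function(mu, phi, power) {
  stopifnot(power > 1, power < 2, all(mu > 0), phi > 0)
  exp(-mu^(2 - power) / (phi * (2 - power)))
}

# Example inputs are assumptions chosen for illustration only.
mu_example <- c(0.5, 1, 5, 20)
p0 <- tweedie_prob_zero(mu_example, phi = 1, power = 1.3)
p0   # decreases as the mean increases

# Monte Carlo check via the compound Poisson-gamma representation:
# Y is a sum of N gamma variables with N ~ Poisson(lambda), so Y = 0 iff N = 0.
set.seed(1)
lambda <- 1^(2 - 1.3) / (1 * (2 - 1.3))   # case mu = 1, phi = 1, power = 1.3
n_events <- rpois(1e5, lambda)
mean(n_events == 0)                        # compare with tweedie_prob_zero(1, 1, 1.3)
```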
So the predicted proportion of zeros varies from 38.1% at the smallest mean values down to 4.5e-6 at the largest mean values.
The formula for the probability of an exact zero can be found in Dunn & Smyth (2001) Tweedie Family Densities: Methods of Evaluation or Dunn & Smyth (2005) Series evaluation of Tweedie exponential dispersion model densities.
This answer was merged from another thread asking about predictions from a zero-inflated regression model, but it also applies to the Tweedie GLM.
Regression-like models predict the mean of some distribution (normal for linear regression, Bernoulli for logistic regression, Poisson for Poisson regression, etc.). In zero-inflated regression you predict the mean of a zero-inflated distribution (e.g. zero-inflated Poisson or binomial). When the probability density function of the non-inflated distribution is $f$, the probability density function of the zero-inflated distribution is a mixture of a point mass at zero and $f$:

$$ f_{\text{zeroinfl}}(y) = \pi \, I_{\{0\}}(y) + (1 - \pi) \, f(y) $$

where $I$ is an indicator function and $\pi$ is the probability of an excess zero. The zero-inflated regression model predicts the mean of $f_{\text{zeroinfl}}(y)$, i.e.

$$ \mu = \pi \cdot 0 + (1 - \pi) \, g^{-1}(x\beta) $$

where $g^{-1}$ is the inverse of the link function. Since you are predicting the mean of this distribution, you won't see the excess zeros in your predictions: the zeros are not the mean of the distribution (although they pull the mean towards zero), just as linear regression does not predict the residuals.
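As a quick numerical check of this argument, the sketch below simulates a zero-inflated Poisson variable in base R. The mixing probability `pi0` and the Poisson mean `lambda` are arbitrary assumed values. The sample mean should sit near (1 − pi0) · lambda, which is positive even though a large share of the draws are exact zeros.

```r
# Illustrative sketch: pi0 and lambda are arbitrary assumed values.
set.seed(1)
n <- 1e5
pi0 <- 0.4       # probability of a structural (excess) zero
lambda <- 3      # mean of the non-inflated Poisson component

# Draw from the mixture: zero with probability pi0, otherwise Poisson(lambda)
is_structural_zero <- rbinom(n, size = 1, prob = pi0) == 1
y <- ifelse(is_structural_zero, 0L, rpois(n, lambda))

mean(y == 0)                          # empirical share of exact zeros
pi0 + (1 - pi0) * dpois(0, lambda)    # theoretical share of zeros
mean(y)                               # empirical mean
(1 - pi0) * lambda                    # theoretical mean of the mixture
```

A regression model for such data targets the last quantity as a function of the covariates, so its predictions are positive even where zeros are common.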
This is illustrated on the plot below, where values of the random variable Y are plotted against X, and Y follows a zero-inflated Poisson distribution with mean conditional on X. The black points are the actual data that were used to fit the zero-inflated Poisson regression model, the red points are the predictions, and the blue points are means of Y within six arbitrary groups of X values. As you can see, the zero-inflated Poisson regression model clearly estimates E(Y|X).