Can a model for non-negative data with clumping at zero (Tweedie GLM, zero-inflated GLM, etc.) predict exact zeros?

The Tweedie distribution can model skewed data with a point mass at zero when the parameter p (the exponent in the mean-variance relationship) is between 1 and 2.

Similarly, a zero-inflated model (whether otherwise continuous or discrete) can accommodate a large number of zeros.

I have trouble understanding why, when I predict or compute fitted values with these kinds of models, all of the predicted values are non-zero.

Can these models actually predict exact zeros?

For example

library(tweedie)
library(statmod)
# generate data
y <- rtweedie( 100, xi=1.3, mu=1, phi=1)  # xi=p
x <- y+rnorm( length(y), 0, 0.2)
# estimate p
out <- tweedie.profile( y~1, p.vec=seq(1.1, 1.9, length=9))
# fit glm
fit <- glm( y ~ x, family=tweedie(var.power=out$p.max, link.power=0))
# predict
pred <- predict.glm(fit, newdata=data.frame(x=x), type="response")

pred now contains no zeros. I thought the usefulness of models such as the Tweedie distribution comes from its ability to predict exact zeros and the continuous part.

I know that in my example the variable x is not very predictive.

spore234
Also consider semiparametric ordinal response models, which allow for arbitrary distributions for Y.
Frank Harrell

Answers:

Note that the predicted value in a GLM is a mean.

For any distribution on non-negative values, to predict a mean of 0, its distribution would have to be entirely a spike at 0.

However, with a log-link, you're never going to fit a mean of exactly zero (since that would require $\eta$ to go to $-\infty$).

So your problem isn't a problem with the Tweedie, but far more general; you'd have exactly the same issue with the Poisson (zero-inflated or ordinary Poisson GLM) for example.
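As a small illustration of this point, here is a sketch in base R only. The simulated data and the coefficients are assumptions made up for the example, not anything from the question. It fits an ordinary Poisson GLM with a log link to data that contain many zeros and checks that no fitted mean is exactly zero.

```r
# Base R only: a Poisson GLM with a log link fitted to data that contain zeros.
set.seed(1)
x <- seq(-2, 2, length.out = 200)
y <- rpois(length(x), lambda = exp(-1 + 0.8 * x))  # small means, so many zeros

fit_pois <- glm(y ~ x, family = poisson(link = "log"))
mu_hat <- fitted(fit_pois)       # fitted means, i.e. exp(eta_hat)

n_zero_obs <- sum(y == 0)        # exact zeros in the data
n_zero_fit <- sum(mu_hat == 0)   # exact zeros among the fitted means
min_mu_hat <- min(mu_hat)        # smallest fitted mean

c(zeros_in_data = n_zero_obs, zeros_in_fit = n_zero_fit, min_fitted = min_mu_hat)
```

The data contain zeros, but every fitted mean is exp() of a finite linear predictor and so is strictly positive.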

I thought the usefulness of the Tweedie distribution comes from its ability to predict exact zeros and the continuous part.

Since predicting exact zeros isn't going to occur for any distribution over non-negative values with a log-link, your thinking on this must be mistaken.

One of its attractions is that it can model exact zeros in the data, not that the mean predictions will be 0. [Of course a fitted distribution with nonzero mean can still have a probability of being exactly zero, even though the mean must exceed 0. A suitable prediction interval could well include 0, for example.]
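To make the bracketed remark concrete, here is a minimal sketch using the Poisson case mentioned earlier, since it needs only base R. The values in mu are hypothetical fitted means chosen for illustration. For each one it computes the probability of an exact zero and the ends of a central 95% prediction interval.

```r
# Hypothetical fitted means, all strictly positive
mu <- c(0.5, 1, 3, 10)

p_zero <- dpois(0, lambda = mu)      # P(Y = 0) = exp(-mu)
lower  <- qpois(0.025, lambda = mu)  # lower end of a central 95% interval
upper  <- qpois(0.975, lambda = mu)  # upper end of a central 95% interval

data.frame(mu, p_zero, lower, upper)
```

For the smaller means the interval starts at 0 even though the mean itself is positive. Only for the largest mean does the lower end move away from zero.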

It matters not at all that the fitted distribution includes any substantial proportion of zeros - that doesn't make the fitted mean zero.

Note that if you change your link function to say an identity link, it doesn't really solve your problem -- the mean of a non-negative random variable that's not all-zeros will be positive.

Glen_b -Reinstate Monica
thanks for your explanation. I compared a Tweedie GLM to a gamma GLM and the betas are almost exactly the same, no matter how many zeros the data contains (I change the zeros to a very small value for the gamma GLM). And what is the proposed way to predict zeros and the continuous part simultaneously?
spore234
@spore234 You could roll your own gamma-hurdle model, which would have a binomial hurdle to predict 0/1 and a gamma model fitted to the non-zero data. Here's a link to a blog post that discusses this model and how to fit one by hand in R. As an aside, if something is continuous, how do you know that it is exactly zero? Is your measurement apparatus capable of such fine-grained measurements?
Reinstate Monica - G. Simpson
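The gamma-hurdle idea from the comment above could be sketched by hand roughly as follows. This is an illustration in base R only, not the code from the blog post it mentions. The simulated data, the coefficients, and the names fit_hurdle and fit_gamma are all assumptions made for the example.

```r
set.seed(42)
n <- 500
x <- runif(n)
# Simulated data (illustrative assumption): a Bernoulli "hurdle" decides whether
# y > 0, and positive values are gamma distributed with a log-linear mean.
is_pos <- rbinom(n, 1, plogis(-0.5 + 2 * x))
y <- is_pos * rgamma(n, shape = 2, rate = 2 / exp(0.5 + x))

dat <- data.frame(x = x, y = y, nonzero = as.integer(y > 0))

# Part 1: logistic regression for P(y > 0 | x)
fit_hurdle <- glm(nonzero ~ x, family = binomial, data = dat)
# Part 2: gamma GLM fitted to the strictly positive observations only
fit_gamma <- glm(y ~ x, family = Gamma(link = "log"), data = subset(dat, y > 0))

p_pos  <- predict(fit_hurdle, newdata = dat, type = "response")  # P(y > 0 | x)
mu_pos <- predict(fit_gamma,  newdata = dat, type = "response")  # E(y | y > 0, x)

p_zero_hat <- 1 - p_pos       # predicted probability of an exact zero
mean_hat   <- p_pos * mu_pos  # overall mean E(y | x), still strictly positive

head(data.frame(x = dat$x, p_zero_hat, mean_hat))
```

The two parts answer different questions: p_zero_hat estimates how likely an exact zero is, while mean_hat is the overall expected value, which is never exactly zero.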
@spore, You're going to have to be more explicit about what you really mean by "predict the zeros"; my answer already establishes why no other distributional model used to replace the Tweedie will give a mean prediction of zero (NB zero-inflated and hurdle models have the same issue with their mean predictions as well). Given a mean prediction is what you meant by "predict" when you used a GLM, what do you mean by it now? If you change it to mean something where a 0-inflated or hurdle model makes sense, a Tweedie may well satisfy the same condition.
Glen_b -Reinstate Monica
It really depends on what you mean by "predict" (since you don't mean "forecast the mean" you need to say what it is you do seek -- do you want to forecast the probability of a zero? Do you want a median forecast? Something else?), and what kinds of things you regard as "better" so some comparison might be made.
Glen_b -Reinstate Monica
@spore234 The problem, yet again, is you use the word "predict" but fail to define what you mean by "predict" (I keep asking!). You appear to have ruled out both of the most obvious interpretations of the term in this situation, so you need to say what you do mean. When you say "predict how much this person's cost will be" what do you actually mean? Note that you can't get the exact cost for each person ... so what properties should this "prediction" have?
Glen_b -Reinstate Monica

Predicting the proportion of zeros

I am the author of the statmod package and joint author of the tweedie package. Everything in your example is working correctly. The code is accounting correctly for any zeros that might be in the data.

As Glen_b and Tim have explained, the predicted mean value will never be exactly zero, unless the probability of a zero is 100%. What might be of interest though is the predicted proportion of zeros, and this can easily be extracted from the model fit as I show below.

Here is a more sensible working example. First simulate some data:

> library(statmod)
> library(tweedie)
> x <- 1:100
> mutrue <- exp(-1+x/25)
> summary(mutrue)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.3829  1.0306  2.7737  5.0287  7.4644 20.0855 
> y <- rtweedie(100, mu=mutrue, phi=1, power=1.3)
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.8482  2.9249  4.7164  6.1522 24.3897 
> sum(y==0)
[1] 12

The data contains 12 zeros.

Now fit a Tweedie glm:

> fit <- glm(y ~ x, family=tweedie(var.power=1.3, link.power=0))
> summary(fit)

Call:
glm(formula = y ~ x, family = tweedie(var.power = 1.3, link.power = 0))

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.71253  -0.94685  -0.07556   0.69089   1.84013  

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.816784   0.168764   -4.84 4.84e-06 ***
x            0.036748   0.002275   16.15  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Tweedie family taken to be 0.8578628)

    Null deviance: 363.26  on 99  degrees of freedom
Residual deviance: 103.70  on 98  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 4

Of course the regression on x is highly significant. The estimated value of the dispersion ϕ is 0.85786.

The predicted proportion of zeros for each value of x can be computed from the following formula:

> Phi <- 0.85786
> Mu <- fitted(fit)
> Power <- 1.3
> Prob.Zero <- exp(-Mu^(2-Power) / Phi / (2-Power))
> Prob.Zero[1:5]
        1         2         3         4         5 
0.3811336 0.3716732 0.3622103 0.3527512 0.3433024 
> Prob.Zero[96:100]
          96           97           98           99          100 
1.498569e-05 1.121936e-05 8.336499e-06 6.146648e-06 4.496188e-06 

So the predicted proportion of zeros varies from 38.1% at the smallest mean values down to 4.5e-6 at the largest mean values.
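As a rough sanity check of the zero-probability formula, the sketch below wraps it in a helper and compares it with a simulation. It uses only base R and relies on the compound Poisson-gamma representation of the Tweedie distribution for 1 < p < 2, where an exact zero corresponds to a Poisson count of zero gamma terms. The helper name tweedie_prob_zero and the parameter values are assumptions chosen for illustration.

```r
# Probability of an exact zero for a Tweedie distribution with 1 < power < 2
tweedie_prob_zero <- function(mu, phi, power) {
  stopifnot(power > 1, power < 2)
  exp(-mu^(2 - power) / (phi * (2 - power)))
}

# Simulation check via the compound Poisson-gamma representation
set.seed(123)
mu <- 1; phi <- 1; power <- 1.3
lambda <- mu^(2 - power) / (phi * (2 - power))  # Poisson rate for the number of gamma terms
alpha  <- (2 - power) / (power - 1)             # gamma shape of each term
gam    <- phi * (power - 1) * mu^(power - 1)    # gamma scale of each term

n_terms <- rpois(1e5, lambda)
y_sim <- numeric(length(n_terms))               # observations with no terms stay exactly 0
pos <- n_terms > 0
# A sum of k independent gamma(alpha, scale) terms is gamma(k * alpha, scale)
y_sim[pos] <- rgamma(sum(pos), shape = n_terms[pos] * alpha, scale = gam)

p_formula <- tweedie_prob_zero(mu, phi, power)
p_sim     <- mean(y_sim == 0)
c(formula = p_formula, simulated = p_sim)
```

With a fitted model, the same helper could be applied to fitted(fit) together with the estimated dispersion and the power used in the fit.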

The formula for the probability of an exact zero can be found in Dunn & Smyth (2001) Tweedie Family Densities: Methods of Evaluation or Dunn & Smyth (2005) Series evaluation of Tweedie exponential dispersion model densities.

Gordon Smyth
thanks, useful! Any suggestions on how to compute the confidence interval for these probabilities of exact zero? Would it make sense at all? I am also puzzled by how to define the "95% likelihood region" from your 2005 paper, probably something known I cannot find. I would greatly appreciate a reference
irintch3

This answer was merged from another thread asking about predictions from a zero-inflated regression model, but it also applies to the Tweedie GLM.

Regression-like models predict the mean of some distribution (normal for linear regression, Bernoulli for logistic regression, Poisson for Poisson regression, etc.). In the case of zero-inflated regression you predict the mean of a zero-inflated distribution (e.g. zero-inflated Poisson or binomial). When the probability density function of the non-inflated distribution is f, the probability density function of the zero-inflated distribution is a mixture of a point mass at zero and f:

$$f_{\text{zeroinfl}}(y) = \pi \, I_{\{0\}}(y) + (1 - \pi) \, f(y)$$

where $I$ is an indicator function. A zero-inflated regression model predicts the mean of $f_{\text{zeroinfl}}(y)$, i.e.

$$\mu_i = \pi \cdot 0 + (1 - \pi) \, g^{-1}(x_i \beta)$$

where $g^{-1}$ is the inverse of the link function. So since you are predicting the mean of this distribution, you won't see the excess zeros in your predictions: the zeros are not the mean of the distribution (although they shrink the mean towards zero), in the same way that linear regression does not predict the residuals.
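A quick numerical check of the zero-inflated mean can be sketched in base R. The zero-inflation probability pi_zero, the coefficients, and the log link are assumptions chosen for the example. The code simulates zero-inflated Poisson data and compares the model mean, (1 - pi) times the inverse link of the linear predictor, with group means of the simulated response.

```r
set.seed(7)
n <- 20000
x <- runif(n, 0, 3)
pi_zero <- 0.3                 # assumed constant zero-inflation probability
lambda  <- exp(0.2 + 0.5 * x)  # Poisson mean; inverse of the log link is exp()

# Zero-inflated Poisson draw: structural zero with probability pi_zero,
# otherwise an ordinary Poisson draw (which may itself be zero)
structural_zero <- rbinom(n, 1, pi_zero)
y <- ifelse(structural_zero == 1, 0L, rpois(n, lambda))

mu <- (1 - pi_zero) * lambda   # mu_i = pi * 0 + (1 - pi) * exp(x_i * beta)

# Compare group means of y with the model mean within six bins of x
bins <- cut(x, breaks = 6)
obs_means   <- tapply(y, bins, mean)
model_means <- tapply(mu, bins, mean)
cbind(observed = obs_means, model = model_means)
```

The group means track the model mean closely, and none of the model means is zero even though a large share of the simulated responses are exact zeros.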

This is illustrated on the plot below, where values of random variable Y are plotted against X, where Y follows a zero-inflated Poisson distribution with mean conditional on X. The black points are the actual data that were used to fit the zero-inflated Poisson regression model, the red points are the predictions, and the blue points are means of Y within the six arbitrary groups of X values. As you can see, clearly the zero inflated Poisson regression model estimates E(Y|X).

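[Figure: Example. Y plotted against X, showing the data (black), the model predictions (red), and the group means of Y (blue).]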

Tim
Tim, this is really a great answer and I am sorry for the timing of the close-and-merge. If you'd like anything about the question further modified to make it more canonical or to fit better (incorporate some of the one you answered perhaps), please go ahead, or I'll be happy to do it for you.
Glen_b -Reinstate Monica