How to rigorously define the likelihood?

30

The likelihood can be defined in several ways, for example:

  • a function L from $\Theta \times \mathcal{X}$ which maps $(\theta, x)$ to $L(\theta \mid x)$, that is, $L : \Theta \times \mathcal{X} \to \mathbb{R}$

  • the random function $L(\cdot \mid X)$

  • we could also consider that the likelihood is only the "observed" likelihood $L(\cdot \mid x^{\text{obs}})$

  • in practice the likelihood carries information about θ only up to a multiplicative constant, hence we could regard the likelihood as an equivalence class of functions rather than as a function

Another question arises when considering a change of parametrization: if $\phi = \theta^2$ is the new parametrization, we usually denote by $L(\phi \mid x)$ the likelihood on ϕ, and this is not the previous function $L(\cdot \mid x)$ evaluated at $\theta^2$ but a new function of ϕ. This is an abuse of notation, useful but liable to cause difficulties for beginners if it is not emphasized.
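For concreteness, here is one way to write the point out, under the simplifying assumption that θ > 0 so that the map $\theta \mapsto \theta^2$ is invertible: the likelihood in the new parametrization is
$$
L_{\text{new}}(\phi \mid x) \;=\; L(\sqrt{\phi} \mid x), \qquad \phi = \theta^2,
$$
which is a different function from $L(\cdot \mid x)$ evaluated at the number ϕ; writing $L(\phi \mid x)$ for both is precisely the abuse of notation.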

What is your favorite rigorous definition of the likelihood?

In addition, what do you call $L(\theta \mid x)$? I usually say something like "the likelihood on θ when x is observed".

EDIT: In view of some comments below, I realize I should have made the context precise. I consider a statistical model given by a parametric family $\{f(\cdot \mid \theta),\ \theta \in \Theta\}$ of densities with respect to some dominating measure, with each $f(\cdot \mid \theta)$ defined on the observation space $\mathcal{X}$. Hence we define $L(\theta \mid x) = f(x \mid \theta)$, and the question is "what is L?" (the question is not about a general definition of the likelihood).
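To make these candidate objects concrete, here is a minimal Python sketch; the Gaussian family $f(x \mid \theta) = N(x; \theta, 1)$, the function names, and the value of the observation are all hypothetical choices made only for illustration:

```python
import math

def f(x, theta):
    # Hypothetical density family f(x | theta): here N(theta, 1), for illustration only.
    return math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2.0 * math.pi)

def L(theta, x):
    # First reading: L as a function on Theta x X, mapping (theta, x) to f(x | theta).
    return f(x, theta)

x_obs = 1.7  # hypothetical observed value

def L_obs(theta):
    # Third reading: the "observed" likelihood L(. | x_obs), a function of theta alone.
    return L(theta, x_obs)

# Fourth reading: rescaling by a constant c > 0 leaves likelihood ratios (and the argmax)
# unchanged, which is the sense in which the likelihood is an equivalence class of functions.
c = 10.0
print(L_obs(1.0) / L_obs(1.7))              # a likelihood ratio
print((c * L_obs(1.0)) / (c * L_obs(1.7)))  # identical ratio after rescaling
```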

Stéphane Laurent
source
2
(1) Because $\int L(\theta \mid x)\,dx = 1$ for all θ, I believe even the constant in L is defined. (2) If you think of parameters like ϕ and θ as merely being coordinates for a manifold of distributions, then a change of parameterization has no intrinsic mathematical meaning; it's merely a change of description. (3) Native English speakers would more naturally say "likelihood of θ" rather than "on." (4) The clause "when x is observed" has philosophical difficulties, because most x will never be observed. Why not just say "likelihood of θ given x"?
whuber
1
@whuber: For (1), I don't think the constant is well-defined. See E. T. Jaynes's book, where he writes that "a likelihood is not a probability because its normalization is arbitrary."
Neil G
3
You appear to be confusing two kinds of normalization, Neil: Jaynes was referring to normalization by integration over θ, not x.
whuber
1
@whuber: I don't think a scaling factor will matter for the Cramér-Rao bound, because a constant factor k adds a constant amount to the log-likelihood, which then disappears when the partial derivative is taken.
Neil G
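As a one-line check of this point (a sketch, writing c > 0 for the arbitrary constant): since
$$
\frac{\partial}{\partial\theta}\log\bigl(c\,L(\theta \mid x)\bigr)
= \frac{\partial}{\partial\theta}\bigl(\log c + \log L(\theta \mid x)\bigr)
= \frac{\partial}{\partial\theta}\log L(\theta \mid x),
$$
the score, and hence the Fisher information entering the Cramér-Rao bound, is unchanged by the rescaling.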
1
I agree with Neil; I do not see any application where the constant plays a role.
Stéphane Laurent

Answers:

13

Your third item is the one I have most often seen used as a rigorous definition.

The others are interesting too (+1). In particular the first is appealing, with the difficulty that, since the sample size is not (yet) defined, it is harder to define the "from" set (the domain).

To me, the fundamental intuition of the likelihood is that it is a function of the model + its parameters, not a function of the random variables (also an important point for teaching purposes). So I would stick to the third definition.

The source of the abuse of notation is that the "from" set of the likelihood is implicit, which is usually not the case for well-defined functions. Here, the most rigorous approach is to realize that after the transformation, the likelihood relates to another model. It is equivalent to the first, but it is still another model. So the likelihood notation should show which model it refers to (by a subscript or otherwise). I never do it of course, but for teaching, I might.

Finally, to be consistent with my previous answers, I say the "likelihood of θ" in your last formula.

gui11aume
source
Thanks. And what is your advice about the equality up to a multiplicative constant?
Stéphane Laurent
Personally I prefer to invoke it when needed rather than hard-code it into the definition. And I think that for model selection/comparison this "up-to-a-multiplicative-constant" equality does not hold.
gui11aume
Ok. Concerning the name, imagine you are discussing the likelihoods $L(\theta \mid x_1)$ and $L(\theta \mid x_2)$ for two possible observations. In such a case, would you say "the likelihood of θ when $x_1$ is observed", or "the likelihood of θ for the observation $x_1$", or something else?
Stéphane Laurent
1
If you re-parametrize your model with $\phi = \theta^2$, you actually compute the likelihood as a composition of functions $L(\cdot \mid x) \circ g(\cdot)$ where $g(y) = y^2$. In this case, g goes from $\mathbb{R}$ to $\mathbb{R}^+$, so the set of definition (referred to as the "from" set above) of the likelihood is no longer the same. You could call the first function $L_1(\cdot \mid x)$ and the second $L_2(\cdot \mid x)$ because they are not the same functions.
gui11aume
1
How is the third definition rigorous? And what is the problem with the sample size not being defined? Since we say $P(x_1, x_2, \ldots, x_n \mid \theta)$, which naturally brings into existence a corresponding sigma algebra for the sample space $\Omega^n$, why can't we have the parallel definition for likelihoods?
Neil G
8

I think I would call it something different. The likelihood is the probability density for the observed x given the value of the parameter θ, expressed as a function of θ for the given x. I don't share the view about the proportionality constant. I think that only comes into play because maximizing any monotonic function of the likelihood gives the same solution for θ. So you can maximize $cL(\theta \mid x)$ for $c > 0$, or other monotonic functions such as $\log(L(\theta \mid x))$, which is commonly done.
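As a small worked instance of this invariance, assuming for illustration the model $x \sim N(\theta, 1)$ with a single observation:
$$
\log\bigl(c\,L(\theta \mid x)\bigr) = \log c - \tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}(x - \theta)^2,
$$
which is maximized at $\hat\theta = x$ for every $c > 0$; the constant shifts the curve vertically but not the maximizer.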

Michael R. Chernick
source
4
Not only in the maximization: the up-to-proportionality also comes into play in the notion of the likelihood ratio, and in Bayes' formula in Bayesian statistics.
Stéphane Laurent
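To sketch why the constant drops out in both cases (with c > 0 the arbitrary factor and π a prior density):
$$
\frac{c\,L(\theta_1 \mid x)}{c\,L(\theta_2 \mid x)} = \frac{L(\theta_1 \mid x)}{L(\theta_2 \mid x)},
\qquad
\frac{c\,L(\theta \mid x)\,\pi(\theta)}{\int c\,L(t \mid x)\,\pi(t)\,dt}
= \frac{L(\theta \mid x)\,\pi(\theta)}{\int L(t \mid x)\,\pi(t)\,dt},
$$
so both the likelihood ratio and the posterior are invariant under rescaling of the likelihood.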
I thought someone might downvote my answer. But I think it is quite reasonable to define likelihood this way, as a definitive probability, without calling anything proportional to it a likelihood. @StéphaneLaurent, regarding your comment about priors: if the function is integrable, it can be normalized to a density. The posterior is proportional to the likelihood times the prior. Since the posterior must be normalized by dividing by an integral, we might as well specify the prior to be the distribution. It is only in an extended sense that this gets applied to improper priors.
Michael R. Chernick
1
I'm not quite sure why someone would downvote this answer. It seems you are trying to respond more to the OP's second and third questions than the first. Perhaps that was not entirely clear to other readers. Cheers. :)
cardinal
@Michael I don't see any need to downvote this answer either. Concerning noninformative priors, this is another discussion and I intend to open a new discussion about this subject. I will not do it soon, because I am not at ease with English, and it is more difficult for me to write "philosophy" than mathematics.
Stéphane Laurent
1
@Stephane: If you'd like, please consider posting your other question directly in French. We have several native French speakers on this site who would likely help translate any passages you're unsure about. This includes a moderator and also an editor of one of the very top English-language statistics journals. I look forward to the question.
cardinal
6

Here's an attempt at a rigorous mathematical definition:

Let $X : \Omega \to \mathbb{R}^n$ be a random vector which admits a density $f(x \mid \theta_0)$ with respect to some measure ν on $\mathbb{R}^n$, where $\{f(x \mid \theta) : \theta \in \Theta\}$ is a family of densities on $\mathbb{R}^n$ with respect to ν. Then, for any $x \in \mathbb{R}^n$, we define the likelihood function $L(\theta \mid x)$ to be $f(x \mid \theta)$; for clarity, for each x we have $L_x : \Theta \to \mathbb{R}$. One can think of x as a particular potential $x^{\text{obs}}$ and $\theta_0$ as the "true" value of θ.

A couple of observations about this definition:

  1. The definition is robust enough to handle discrete, continuous, and other sorts of families of distributions for X.
  2. We are defining the likelihood at the level of density functions instead of at the level of probability distributions/measures. The reason for this is that densities are not unique, and it turns out that this isn't a situation where one can pass to equivalence classes of densities and still be safe: different choices of densities lead to different MLEs in the continuous case. However, in most cases there is a natural choice of family of densities that is desirable theoretically.
  3. I like this definition because it incorporates the random variables we are working with and, by design, since we have to assign them a distribution, it also rigorously builds in the notion of the "true but unknown" value of θ, here denoted $\theta_0$. For me, as a student, the challenge of being rigorous about likelihood was always how to reconcile the real-world concepts of a "true" θ and an "observed" $x^{\text{obs}}$ with the mathematics; this was often not helped by instructors claiming that these concepts weren't formal but then turning around and using them formally when proving things! So we deal with them formally in this definition.
  4. EDIT: Of course, we are free to consider the usual random elements $L(\theta \mid X)$, $S(\theta \mid X)$ and $I(\theta \mid X)$ under this definition, with no real problems with rigor as long as you are careful (or even if you aren't, if that level of rigor is not important to you).
guy
source
4
@Xi'an Let $X_1, \ldots, X_n$ be uniform on (0, θ). Consider two densities $f_1(x) = \theta^{-1} I[0 < x < \theta]$ versus $f_2(x) = \theta^{-1} I[0 \le x \le \theta]$. Both $f_1$ and $f_2$ are valid densities for $U(0, \theta)$, but under $f_2$ the MLE exists and is equal to $\max X_i$, whereas under $f_1$ we have $\prod_j f_1(x_j \mid \max x_i) = 0$, so that if you set $\hat\theta = \max X_i$ you end up with a likelihood of 0, and in fact the MLE doesn't exist because $\sup_\theta \prod_j f_1(x_j \mid \theta)$ is not attained for any θ.
guy
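A quick numerical illustration of this counterexample, as a Python sketch (the sample, seed, and true θ are arbitrary choices):

```python
import random

random.seed(0)
theta_true = 2.0
x = [random.uniform(0.0, theta_true) for _ in range(5)]  # hypothetical U(0, theta) sample
M = max(x)

def f1(xi, theta):
    # Density with open support (0, theta): vanishes at the endpoints.
    return 1.0 / theta if 0.0 < xi < theta else 0.0

def f2(xi, theta):
    # Density with closed support [0, theta].
    return 1.0 / theta if 0.0 <= xi <= theta else 0.0

def likelihood(density, theta):
    prod = 1.0
    for xi in x:
        prod *= density(xi, theta)
    return prod

# Under f2 the candidate MLE theta_hat = max(x) attains the supremum M^(-n),
# while under f1 the likelihood at max(x) is exactly zero, so the sup is never attained.
print(likelihood(f2, M))  # positive
print(likelihood(f1, M))  # 0.0
```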
1
@guy: thanks, I did not know about this interesting counter-example.
Xi'an
1
@guy You said that $\sup_\theta \prod_j f_1(x_j \mid \theta)$ is not attained for any θ. However, this supremum is attained at some point, as I show below:
$$L_1(\theta; x) = \prod_{j=1}^{n} f_1(x_j \mid \theta) = \theta^{-n} \prod_{j=1}^{n} I(0 < x_j < \theta) = \theta^{-n} I(0 < M < \theta),$$
where $M = \max\{x_1, \ldots, x_n\}$. I am assuming that $x_j > 0$ for all $j = 1, \ldots, n$. It is simple to see that 1. $L_1(\theta; x) = 0$ if $0 < \theta \le M$; 2. $L_1(\theta; x) = \theta^{-n}$ if $M < \theta < \infty$. Continuing...
Alexandre Patriota
1
@guy: continuing... That is,
$$L_1(\theta; x) \in [0, M^{-n}),$$
for all $\theta \in (0, \infty)$. We do not have a maximum value, but the supremum does exist and it is given by
$$\sup_{\theta \in (0, \infty)} L_1(\theta, x) = M^{-n}$$
and the argument is
$$M = \arg\sup_{\theta \in (0, \infty)} L_1(\theta; x).$$
Perhaps the usual asymptotics do not apply here and some other tools should be employed. But the supremum of $L_1(\theta; x)$ does exist, or I have missed some very basic concepts.
Alexandre Patriota
1
@AlexandrePatriota The supremum exists, obviously, but it is not attained by the function. I'm not sure what the notation $\arg\sup$ is supposed to mean: there is no argument of $L_1(\theta; x)$ which yields the sup, because $L_1(M; x) = 0$. The MLE is defined as any $\hat\theta$ which attains the sup (typically), and no $\hat\theta$ attains the sup here. Obviously there are ways around it: the asymptotics we appeal to require that there exists a likelihood with such-and-such properties, and there does. It's just $L_2$ rather than $L_1$.
guy