An adaptation of the Kullback-Leibler distance?

28

Look at this picture: [image: two density curves, one red and one blue]

If we draw a sample from the red density, some values are expected to be smaller than 0.25, whereas it is impossible to generate such a sample from the blue distribution. As a consequence, the Kullback-Leibler distance from the red density to the blue density is infinite. Yet the two curves are not that distinct, in some "natural sense".

So here is my question: does there exist an adaptation of the Kullback-Leibler distance that yields a finite distance between these two curves?

ocram
1
In what "natural sense" are these curves "not so distinct"? How is this intuitive closeness linked to any statistical property? (I can think of several answers, but I wonder what you have in mind.)
whuber
1
Well... they are quite close to each other in the sense that both are defined on the positive values; both increase and then decrease; both actually have the same expectation; and the Kullback-Leibler distance is "small" if we restrict ourselves to a part of the x-axis... But to link these intuitive notions to any statistical property, I would need a rigorous definition of these features...
ocram

Answers:

18

You might look at Chapter 3 of Devroye, Gyorfi, and Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, 1996. See, in particular, the section on f-divergences.

f-divergences can be viewed as a generalization of the Kullback–Leibler divergence (or, alternatively, KL can be viewed as a special case of an f-divergence).

The general form is

$$D_f(p,q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) \lambda(dx),$$

where $\lambda$ is a measure that dominates the measures associated with $p$ and $q$, and $f(\cdot)$ is a convex function satisfying $f(1) = 0$. (If $p(x)$ and $q(x)$ are densities with respect to Lebesgue measure, just substitute the notation $dx$ for $\lambda(dx)$ and you're good to go.)

We recover KL by taking $f(x) = x \log x$. We can get the Hellinger distance via $f(x) = (1 - \sqrt{x})^2$, and we get the total-variation or $L_1$ distance by taking $f(x) = \tfrac{1}{2}|x - 1|$. The latter gives

$$D_{TV}(p,q) = \tfrac{1}{2} \int |p(x) - q(x)|\, dx.$$

Note that this last one at least gives you a finite answer.
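To make these formulas concrete, here is a minimal numerical sketch (my own addition, in Python with NumPy/SciPy) that evaluates KL, squared Hellinger, and total variation on a grid. The two gamma densities are hypothetical stand-ins for the red and blue curves of the question: the "blue" one has no mass below 0.25, so KL blows up while the other two stay finite.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-ins for the question's curves (my assumption, not from the post):
# the "red" density has mass below 0.25, the "blue" one does not.
x = np.linspace(1e-6, 20.0, 200_001)
dx = x[1] - x[0]
p = stats.gamma(a=2.0).pdf(x)              # "red"
q = stats.gamma(a=2.0, loc=0.25).pdf(x)    # "blue": zero on (0, 0.25)

def kl(p, q, dx):
    """KL(p||q) = ∫ p log(p/q) dx; +inf as soon as p > 0 somewhere q = 0."""
    if np.any((q == 0) & (p > 0)):
        return np.inf
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / q[m])) * dx

def hellinger_sq(p, q, dx):
    """Squared Hellinger distance, i.e. D_f with f(x) = (1 - sqrt(x))^2."""
    return np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx

def total_variation(p, q, dx):
    """D_TV(p, q) = (1/2) ∫ |p - q| dx, i.e. D_f with f(x) = |x - 1|/2."""
    return 0.5 * np.sum(np.abs(p - q)) * dx

print(kl(p, q, dx))               # inf: blue has no mass below 0.25
print(hellinger_sq(p, q, dx))     # finite, in [0, 2]
print(total_variation(p, q, dx))  # finite, in [0, 1]
```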

In another little book entitled Density Estimation: The L1 View, Devroye argues strongly for the use of this latter distance due to its many nice invariance properties (among others). This latter book is probably a little harder to get a hold of than the former and, as the title suggests, a bit more specialized.


Addendum: Via this question, I became aware that it appears that the measure that @Didier proposes is (up to a constant) known as the Jensen-Shannon Divergence. If you follow the link to the answer provided in that question, you'll see that it turns out that the square-root of this quantity is actually a metric and was previously recognized in the literature to be a special case of an f-divergence. I found it interesting that we seem to have collectively "reinvented" the wheel (rather quickly) via the discussion of this question. The interpretation I gave to it in the comment below @Didier's response was also previously recognized. All around, kind of neat, actually.

cardinal
1
Very nice! I am going to try to find "A Probabilistic Theory of Pattern Recognition" and to understand its chapter 3!
ocram
1
Good answer. Note that most often $D_{TV}$ is defined another way, which makes it half the $L_1$ distance.
robin girard
1
@robin, thanks for your comment. Yes, I realize this. I was just trying to avoid a messy extraneous constant in the exposition. But, strictly speaking, you're correct. I've updated it accordingly.
cardinal
3
Your addendum is the most useful piece of information I ran into on stats.SE, so far. All my warmest thanks for this. I simply reproduce here the reference you gave: research-repository.st-andrews.ac.uk/bitstream/10023/1591/1/… Endres and Schindelin, A new metric for probability distributions, IEEE Trans. on Info. Thy., vol. 49, no. 3, Jul. 2003, pp. 1858-1860.
Did
1
@Didier, well, it was more a happy accident than anything else. No one was responding to the other question, so I decided to try to figure out what the Jensen-Shannon Divergence was in the first place. Once I found the definition, it seemed reasonable to connect the two questions via my addendum. I'm glad you found it useful. Regards.
cardinal
19

The Kullback-Leibler divergence $\kappa(P \mid Q)$ of $P$ with respect to $Q$ is infinite when $P$ is not absolutely continuous with respect to $Q$, that is, when there exists a measurable set $A$ such that $Q(A) = 0$ and $P(A) \ne 0$. Furthermore the KL divergence is not symmetric, in the sense that in general $\kappa(P \mid Q) \ne \kappa(Q \mid P)$. Recall that
$$\kappa(P \mid Q) = \int P \log\!\left(\frac{P}{Q}\right).$$
A way out of both these drawbacks, still based on KL divergence, is to introduce the midpoint
$$R = \tfrac{1}{2}(P + Q).$$
Thus $R$ is a probability measure, and $P$ and $Q$ are always absolutely continuous with respect to $R$. Hence one can consider a "distance" between $P$ and $Q$, still based on KL divergence but using $R$, defined as
$$\eta(P,Q) = \kappa(P \mid R) + \kappa(Q \mid R).$$
Then $\eta(P,Q)$ is nonnegative and finite for every $P$ and $Q$, $\eta$ is symmetric in the sense that $\eta(P,Q) = \eta(Q,P)$ for every $P$ and $Q$, and $\eta(P,Q) = 0$ iff $P = Q$.

An equivalent formulation is

$$\eta(P,Q) = 2\log(2) + \int \big(P \log(P) + Q \log(Q) - (P+Q)\log(P+Q)\big).$$
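As a sanity check, here is a minimal numerical sketch (my own addition, reusing the hypothetical red/blue gamma densities from the earlier f-divergence example) showing that $\eta$ stays finite, and below $2\log(2)$, even though the plain KL divergence is infinite.

```python
import numpy as np
from scipy import stats

# Hypothetical red/blue densities, as in the earlier sketch (my assumption).
x = np.linspace(1e-6, 20.0, 200_001)
dx = x[1] - x[0]
p = stats.gamma(a=2.0).pdf(x)
q = stats.gamma(a=2.0, loc=0.25).pdf(x)

def kl(p, q, dx):
    """KL(p||q) on a grid; terms with p = 0 contribute nothing."""
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / q[m])) * dx

def eta(p, q, dx):
    """eta(P, Q) = kappa(P|R) + kappa(Q|R) with the midpoint R = (P + Q)/2.

    Both terms are finite because P and Q are absolutely continuous with
    respect to R; eta is twice the Jensen-Shannon divergence and is
    bounded above by 2*log(2).
    """
    r = 0.5 * (p + q)
    return kl(p, r, dx) + kl(q, r, dx)

print(eta(p, q, dx))   # finite, somewhere in (0, 2*log(2))
```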

Addendum 1 The introduction of the midpoint of P and Q is not arbitrary in the sense that

$$\eta(P,Q) = \min_{R}\,\big[\kappa(P \mid R) + \kappa(Q \mid R)\big],$$
where the minimum is over the set of probability measures $R$.

Addendum 2 @cardinal remarks that η is also an f-divergence, for the convex function

$$f(x) = x\log(x) - (1+x)\log(1+x) + (1+x)\log(2).$$
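A quick numerical confirmation of Addendum 2 (again my own sketch, with the hypothetical densities used above): evaluating the f-divergence with this particular $f$ reproduces $\eta$; on the set where $q = 0$ the integrand is interpreted as $p \cdot \lim_{t\to\infty} f(t)/t = p\log 2$, which is what keeps the divergence finite.

```python
import numpy as np
from scipy import stats
from scipy.special import xlogy   # xlogy(a, b) = a*log(b), with 0*log(0) = 0

def f_js(t):
    """The convex f with f(1) = 0 from Addendum 2."""
    return xlogy(t, t) - xlogy(1.0 + t, 1.0 + t) + (1.0 + t) * np.log(2.0)

def f_divergence(p, q, f, dx, slope_at_inf):
    """D_f(p, q) ≈ ∫ q f(p/q) dx; the region {q = 0} contributes p * lim f(t)/t."""
    m = q > 0
    val = np.sum(q[m] * f(p[m] / q[m])) * dx
    val += slope_at_inf * np.sum(p[~m]) * dx
    return val

# Hypothetical red/blue densities, as in the earlier sketches (my assumption).
x = np.linspace(1e-6, 20.0, 200_001)
dx = x[1] - x[0]
p = stats.gamma(a=2.0).pdf(x)
q = stats.gamma(a=2.0, loc=0.25).pdf(x)

# For this f, lim_{t -> inf} f(t)/t = log(2), so the divergence stays finite
# and should agree with eta(P, Q) = kappa(P|R) + kappa(Q|R) up to grid error.
print(f_divergence(p, q, f_js, dx, slope_at_inf=np.log(2.0)))
```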
Did
2
@Marco, @Didier Piau, it might be noted that @Didier's suggestion is another special case of an f-divergence, where $f(x) = x\log x - (1+x)\log\!\left(\frac{1+x}{2}\right)$.
cardinal
1
@Marco, @Didier Piau, an alternative formulation which has some evocative nature is $\eta(P,Q) = \int P\log P + \int Q\log Q - 2\int R\log R = 2H(R) - (H(P) + H(Q))$, and so $\eta(P,Q) = 2\big(H(\mu(P,Q)) - \mu(H(P),H(Q))\big)$, where $\mu(x,y) = \frac{x+y}{2}$. In other words, $\tfrac{1}{2}\eta(P,Q)$ is the "difference between the entropy of the average measure and the average entropy of the measures".
cardinal
3
Isn't this just the Jensen-Shannon divergence?
Memming
"where the minimum is over the set of probability measures." I like this characterization of the Jensen–Shannon divergence. Is there a proof of it somewhere?
user76284
10

The Kolmogorov distance between two distributions P and Q is the sup norm of the difference between their CDFs. (This is the largest vertical discrepancy between the two graphs of the CDFs.) It is used in distributional testing, where P is a hypothesized distribution and Q is the empirical distribution function of a dataset.

It is hard to characterize this as an "adaptation" of the KL distance, but it does meet the other requirements of being "natural" and finite.

Incidentally, because the KL divergence is not a true "distance," we don't have to worry about preserving all the axiomatic properties of a distance. We can maintain the non-negativity property while making the values finite by applying any monotonic transformation $\mathbb{R}^{+} \to [0, C]$ for some finite value $C$. The inverse tangent will do fine, for instance.
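Both ideas are easy to illustrate numerically. Below is a short sketch (my own addition, with hypothetical gamma distributions standing in for the red and blue curves) that computes the Kolmogorov distance on a grid and applies the arctan transform to an infinite KL value.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-ins for the red and blue distributions (my assumption).
red = stats.gamma(a=2.0)
blue = stats.gamma(a=2.0, loc=0.25)

# Kolmogorov distance: sup-norm of the difference between the two CDFs,
# approximated by the maximum over a fine grid.
x = np.linspace(0.0, 20.0, 200_001)
kolmogorov = np.max(np.abs(red.cdf(x) - blue.cdf(x)))
print(kolmogorov)               # finite, in [0, 1]

# Bounded monotone transform of KL: arctan maps [0, +inf] onto [0, pi/2],
# so the infinite KL of the question's example simply becomes pi/2.
kl_red_blue = np.inf            # KL(red || blue) is infinite here
print(np.arctan(kl_red_blue))   # pi/2
```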

whuber
1
Thank you for your suggestion about the Kolmogorov distance. Can you make your comment about the monotonic transformation a little bit more explicit? Thx
ocram
1
@Marco I don't understand how one could be any more explicit. Do you mean restating what I wrote in terms of a formula such as $\arctan(KL(P,Q))$, or $f(KL(P,Q))$ for $f:\mathbb{R}^{+} \to [0,C]$ with $x \le y$ implying $f(x) \le f(y)$ for all $x, y \ge 0$?
whuber
1
Yes, that's what I meant :-) I was not sure on what to apply the transformation. Now, it is clear, thx
ocram
1
@Marco: I am lost. Do you settle for the Kolmogorov distance (which is always finite but has nothing in common with KL divergence)? Or for a bounded monotone transform of the KL divergence (such as $\arctan$)? In the example of your post (and in any other non-absolutely-continuous example), the latter produces the supremum of the transform ($\pi/2$ if you settle for $\arctan$). In effect, this abandons any idea of estimating a distance between such probability measures more precisely than saying they are far, far away (whether you encode this by $\pi/2$ or by $+\infty$ is irrelevant).
Did
@Didier Yes, the transformed KL divergence (when symmetrized, as you describe) might not satisfy the triangle inequality and therefore would not be a distance, but it would still define a topology (which would likely be metrizable). You would thereby give up little or nothing. I remain agnostic about the merits of doing any of this: it seems to me this is just a way of papering over the difficulties associated with infinite values of the KL divergence in the first place.
whuber
2

Yes there does. Bernardo and Rueda defined something called the "intrinsic discrepancy", which for all practical purposes is a "symmetrised" version of the KL divergence. Taking the KL divergence from $P$ to $Q$ to be $\kappa(P \mid Q)$, the intrinsic discrepancy is given by:

$$\delta(P,Q) := \min\big[\kappa(P \mid Q),\ \kappa(Q \mid P)\big].$$

Searching for "intrinsic discrepancy" (or "Bayesian reference criterion") will give you some articles on this measure.

In your case, you would just take the KL-divergence which is finite.

Another alternative measure to KL is the Hellinger distance.

EDIT: Clarification. Some comments suggested that the intrinsic discrepancy will not be finite when one density is 0 where the other is not. This is not true if the operation of evaluating the zero density is carried out as a limit $Q \to 0$ or $P \to 0$. The limit is well defined, and it equals 0 for one of the KL divergences, while the other one diverges. To see this, note:

$$\delta(P,Q) := \min\left[\int P \log\!\left(\frac{P}{Q}\right),\ \int Q \log\!\left(\frac{Q}{P}\right)\right].$$

Taking the limit as $P \to 0$ over a region of the integral, the second integral diverges, while the first integral converges to 0 over this region (assuming the conditions are such that one can interchange limits and integration). This is because $\lim_{z \to 0} z \log(z) = 0$. By the symmetry in $P$ and $Q$, the result also holds for $Q$.
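Here is a minimal numerical sketch of the intrinsic discrepancy (my own addition, with the same hypothetical red/blue gamma densities as in the sketches above): one direction of the KL is infinite but the other is finite, so $\delta$ is finite, as claimed for the question's setup.

```python
import numpy as np
from scipy import stats

# Hypothetical red/blue densities (my assumption): blue has no mass below 0.25.
x = np.linspace(1e-6, 20.0, 200_001)
dx = x[1] - x[0]
p = stats.gamma(a=2.0).pdf(x)              # "red"
q = stats.gamma(a=2.0, loc=0.25).pdf(x)    # "blue"

def kl(p, q, dx):
    """KL(p||q) on a grid; +inf if p puts mass where q = 0."""
    if np.any((q == 0) & (p > 0)):
        return np.inf
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / q[m])) * dx

# Intrinsic discrepancy: the smaller of the two KL directions.
# KL(red || blue) = inf, but KL(blue || red) is finite, so delta is finite.
delta = min(kl(p, q, dx), kl(q, p, dx))
print(delta)
```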

probabilityislogic
1
Even the "intrinsic discrepancy" will be infinite when P is zero with positive probability for Q and vice versa, even if P and Q are otherwise identical.
whuber
1
Yes... I am afraid that the intrinsic discrepancy does not fulfil the requirement. But thank you for the suggestion. Any other suggestion would be appreciated.
ocram
1
It does fulfil the requirement if you restrict the blue density to the region where it is strictly positive, just as you have for the red one ($> 0$).
probabilityislogic
3
@probabilityislogic: I do not understand your last remarks. First, let us give their proper names to the notions involved and say that $P$ is absolutely continuous with respect to $Q$ (denoted $P \ll Q$) if, for every measurable $A$, $Q(A) = 0$ implies $P(A) = 0$. Now, notwithstanding your somewhat mysterious (to me) limit considerations, your $\delta(P,Q)$ is finite iff $P \ll Q$ or $Q \ll P$. .../...
Did
2
.../... A way out of the conundrum you seem to have dug yourself into might be to introduce the mid-point measure $R = \tfrac{1}{2}(P+Q)$. Since $P \ll R$ and $Q \ll R$, the quantity $\eta(P,Q) := \kappa(P \mid R) + \kappa(Q \mid R)$ is always finite. Furthermore $\eta(P,Q) = 0$ iff $P = Q$, and $\eta$ is symmetric. Hence $\eta(P,Q)$ indeed measures a kind of "distance" between $P$ and $Q$.
Did