An adaptation of the Kullback-Leibler distance?

28

Look at this picture: [image: two density curves, one red and one blue]

If we draw a sample from the red density, some values are expected to be smaller than 0.25, whereas it is impossible to generate such a sample from the blue distribution. As a consequence, the Kullback-Leibler distance from the red density to the blue density is infinite. Yet the two curves are not that distinct, in some "natural sense".

So here is my question: does there exist an adaptation of the Kullback-Leibler distance that yields a finite distance between these two curves?

ocram
1
In what "natural sense" are these curves "not so distinct"? How is this intuitive closeness linked to any statistical property? (I can think of several answers, but I wonder what you have in mind.)
whuber
1
Well... they are quite close to each other in the sense that both are defined on the positive values; both increase and then decrease; both actually have the same expectation; and the Kullback-Leibler distance is "small" if we restrict ourselves to a part of the x-axis... But to link these intuitive notions to any statistical property, I would need a rigorous definition of these features...
ocram

Answers:

18

You might look at Chapter 3 of Devroye, Gyorfi, and Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, 1996. See, in particular, the section on f-divergences.

f-divergences can be viewed as a generalization of the Kullback–Leibler divergence (or, alternatively, KL can be viewed as a special case of an f-divergence).

The general form is

$$D_f(p,q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) \lambda(dx),$$

where $\lambda$ is a measure that dominates the measures associated with $p$ and $q$, and $f(\cdot)$ is a convex function satisfying $f(1) = 0$. (If $p(x)$ and $q(x)$ are densities with respect to Lebesgue measure, just substitute the notation $dx$ for $\lambda(dx)$ and you're good to go.)

We recover KL by taking $f(x) = x \log x$. We can get the Hellinger distance via $f(x) = (1 - \sqrt{x})^2$, and we get the total-variation or $L_1$ distance by taking $f(x) = \tfrac{1}{2}|x - 1|$. The latter gives

$$D_{TV}(p,q) = \tfrac{1}{2} \int |p(x) - q(x)|\, dx.$$

Note that this last one at least gives you a finite answer.
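To make these formulas concrete, here is a minimal numerical sketch (my own addition, in Python with NumPy/SciPy) that evaluates KL, squared Hellinger, and total variation on a grid. The two gamma densities are hypothetical stand-ins for the red and blue curves of the question: the "blue" one has no mass below 0.25, so KL blows up while the other two stay finite.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-ins for the question's curves (my assumption, not from the post):
# the "red" density has mass below 0.25, the "blue" one does not.
x = np.linspace(1e-6, 20.0, 200_001)
dx = x[1] - x[0]
p = stats.gamma(a=2.0).pdf(x)              # "red"
q = stats.gamma(a=2.0, loc=0.25).pdf(x)    # "blue": zero on (0, 0.25)

def kl(p, q, dx):
    """KL(p||q) = ∫ p log(p/q) dx; +inf as soon as p > 0 somewhere q = 0."""
    if np.any((q == 0) & (p > 0)):
        return np.inf
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / q[m])) * dx

def hellinger_sq(p, q, dx):
    """Squared Hellinger distance, i.e. D_f with f(x) = (1 - sqrt(x))^2."""
    return np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx

def total_variation(p, q, dx):
    """D_TV(p, q) = (1/2) ∫ |p - q| dx, i.e. D_f with f(x) = |x - 1|/2."""
    return 0.5 * np.sum(np.abs(p - q)) * dx

print(kl(p, q, dx))               # inf: blue has no mass below 0.25
print(hellinger_sq(p, q, dx))     # finite, in [0, 2]
print(total_variation(p, q, dx))  # finite, in [0, 1]
```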

In another little book entitled Density Estimation: The L1 View, Devroye argues strongly for the use of this latter distance due to its many nice invariance properties (among others). This latter book is probably a little harder to get a hold of than the former and, as the title suggests, a bit more specialized.


Addendum: Via this question, I became aware that it appears that the measure that @Didier proposes is (up to a constant) known as the Jensen-Shannon Divergence. If you follow the link to the answer provided in that question, you'll see that it turns out that the square-root of this quantity is actually a metric and was previously recognized in the literature to be a special case of an f-divergence. I found it interesting that we seem to have collectively "reinvented" the wheel (rather quickly) via the discussion of this question. The interpretation I gave to it in the comment below @Didier's response was also previously recognized. All around, kind of neat, actually.

cardinal
1
Very nice! I am going to try to find "A Probabilistic Theory of Pattern Recognition" and to understand its chapter 3!
ocram
1
Good answer. Note that most often $D_{TV}$ is defined another way, which makes it half the $L_1$ distance.
robin girard
1
@robin, thanks for your comment. Yes, I realize this. I was just trying to avoid a messy extraneous constant in the exposition. But, strictly speaking, you're correct. I've updated it accordingly.
cardinal
3
Your addendum is the most useful piece of information I ran into on stats.SE, so far. All my warmest thanks for this. I simply reproduce here the reference you gave: research-repository.st-andrews.ac.uk/bitstream/10023/1591/1/… Endres and Schindelin, A new metric for probability distributions, IEEE Trans. on Info. Thy., vol. 49, no. 3, Jul. 2003, pp. 1858-1860.
Did
1
@Didier, well, it was more a happy accident than anything else. No one was responding to the other question, so I decided to try to figure out what the Jensen-Shannon Divergence was in the first place. Once I found the definition, it seemed reasonable to connect the two questions via my addendum. I'm glad you found it useful. Regards.
cardinal
19

The Kullback-Leibler divergence $\kappa(P \mid Q)$ of $P$ with respect to $Q$ is infinite when $P$ is not absolutely continuous with respect to $Q$, that is, when there exists a measurable set $A$ such that $Q(A) = 0$ and $P(A) \ne 0$. Furthermore the KL divergence is not symmetric, in the sense that in general $\kappa(P \mid Q) \ne \kappa(Q \mid P)$. Recall that
$$\kappa(P \mid Q) = \int P \log\!\left(\frac{P}{Q}\right).$$
A way out of both these drawbacks, still based on KL divergence, is to introduce the midpoint
$$R = \tfrac{1}{2}(P + Q).$$
Thus $R$ is a probability measure, and $P$ and $Q$ are always absolutely continuous with respect to $R$. Hence one can consider a "distance" between $P$ and $Q$, still based on KL divergence but using $R$, defined as
$$\eta(P,Q) = \kappa(P \mid R) + \kappa(Q \mid R).$$
Then $\eta(P,Q)$ is nonnegative and finite for every $P$ and $Q$, $\eta$ is symmetric in the sense that $\eta(P,Q) = \eta(Q,P)$ for every $P$ and $Q$, and $\eta(P,Q) = 0$ iff $P = Q$.

An equivalent formulation is

$$\eta(P,Q) = 2\log(2) + \int \big(P \log(P) + Q \log(Q) - (P+Q)\log(P+Q)\big).$$
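As a sanity check, here is a minimal numerical sketch (my own addition, reusing the hypothetical red/blue gamma densities from the earlier f-divergence example) showing that $\eta$ stays finite, and below $2\log(2)$, even though the plain KL divergence is infinite.

```python
import numpy as np
from scipy import stats

# Hypothetical red/blue densities, as in the earlier sketch (my assumption).
x = np.linspace(1e-6, 20.0, 200_001)
dx = x[1] - x[0]
p = stats.gamma(a=2.0).pdf(x)
q = stats.gamma(a=2.0, loc=0.25).pdf(x)

def kl(p, q, dx):
    """KL(p||q) on a grid; terms with p = 0 contribute nothing."""
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / q[m])) * dx

def eta(p, q, dx):
    """eta(P, Q) = kappa(P|R) + kappa(Q|R) with the midpoint R = (P + Q)/2.

    Both terms are finite because P and Q are absolutely continuous with
    respect to R; eta is twice the Jensen-Shannon divergence and is
    bounded above by 2*log(2).
    """
    r = 0.5 * (p + q)
    return kl(p, r, dx) + kl(q, r, dx)

print(eta(p, q, dx))   # finite, somewhere in (0, 2*log(2))
```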

Addendum 1 The introduction of the midpoint of P and Q is not arbitrary in the sense that

$$\eta(P,Q) = \min_{R}\,\big[\kappa(P \mid R) + \kappa(Q \mid R)\big],$$
where the minimum is over the set of probability measures $R$.

Addendum 2 @cardinal remarks that η is also an f-divergence, for the convex function

$$f(x) = x\log(x) - (1+x)\log(1+x) + (1+x)\log(2).$$
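A quick numerical confirmation of Addendum 2 (again my own sketch, with the hypothetical densities used above): evaluating the f-divergence with this particular $f$ reproduces $\eta$; on the set where $q = 0$ the integrand is interpreted as $p \cdot \lim_{t\to\infty} f(t)/t = p\log 2$, which is what keeps the divergence finite.

```python
import numpy as np
from scipy import stats
from scipy.special import xlogy   # xlogy(a, b) = a*log(b), with 0*log(0) = 0

def f_js(t):
    """The convex f with f(1) = 0 from Addendum 2."""
    return xlogy(t, t) - xlogy(1.0 + t, 1.0 + t) + (1.0 + t) * np.log(2.0)

def f_divergence(p, q, f, dx, slope_at_inf):
    """D_f(p, q) ≈ ∫ q f(p/q) dx; the region {q = 0} contributes p * lim f(t)/t."""
    m = q > 0
    val = np.sum(q[m] * f(p[m] / q[m])) * dx
    val += slope_at_inf * np.sum(p[~m]) * dx
    return val

# Hypothetical red/blue densities, as in the earlier sketches (my assumption).
x = np.linspace(1e-6, 20.0, 200_001)
dx = x[1] - x[0]
p = stats.gamma(a=2.0).pdf(x)
q = stats.gamma(a=2.0, loc=0.25).pdf(x)

# For this f, lim_{t -> inf} f(t)/t = log(2), so the divergence stays finite
# and should agree with eta(P, Q) = kappa(P|R) + kappa(Q|R) up to grid error.
print(f_divergence(p, q, f_js, dx, slope_at_inf=np.log(2.0)))
```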
Did
2
@Marco, @Didier Piau, it might be noted that @Didier's suggestion is another special case of an f-divergence, where $f(x) = x\log x - (1+x)\log\!\left(\frac{1+x}{2}\right)$.
cardinal
1
@Marco, @Didier Piau, an alternative formulation which has some evocative nature is $\eta(P,Q) = \int P\log P + \int Q\log Q - 2\int R\log R = 2H(R) - (H(P) + H(Q))$, and so $\eta(P,Q) = 2\big(H(\mu(P,Q)) - \mu(H(P),H(Q))\big)$, where $\mu(x,y) = \frac{x+y}{2}$. In other words, $\tfrac{1}{2}\eta(P,Q)$ is the "difference between the entropy of the average measure and the average entropy of the measures".
cardinal
3
Isn't this just the Jensen-Shannon divergence?
Memming
"where the minimum is over the set of probability measures." I like this characterization of the Jensen–Shannon divergence. Is there a proof of it somewhere?
user76284
10

The Kolmogorov distance between two distributions P and Q is the sup norm of the difference between their CDFs. (This is the largest vertical discrepancy between the two graphs of the CDFs.) It is used in distributional testing, where P is a hypothesized distribution and Q is the empirical distribution function of a dataset.

It is hard to characterize this as an "adaptation" of the KL distance, but it does meet the other requirements of being "natural" and finite.

Incidentally, because the KL divergence is not a true "distance," we don't have to worry about preserving all the axiomatic properties of a distance. We can maintain the non-negativity property while making the values finite by applying any monotonic transformation $\mathbb{R}^{+} \to [0, C]$ for some finite value $C$. The inverse tangent will do fine, for instance.
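Both ideas are easy to illustrate numerically. Below is a short sketch (my own addition, with hypothetical gamma distributions standing in for the red and blue curves) that computes the Kolmogorov distance on a grid and applies the arctan transform to an infinite KL value.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-ins for the red and blue distributions (my assumption).
red = stats.gamma(a=2.0)
blue = stats.gamma(a=2.0, loc=0.25)

# Kolmogorov distance: sup-norm of the difference between the two CDFs,
# approximated by the maximum over a fine grid.
x = np.linspace(0.0, 20.0, 200_001)
kolmogorov = np.max(np.abs(red.cdf(x) - blue.cdf(x)))
print(kolmogorov)               # finite, in [0, 1]

# Bounded monotone transform of KL: arctan maps [0, +inf] onto [0, pi/2],
# so the infinite KL of the question's example simply becomes pi/2.
kl_red_blue = np.inf            # KL(red || blue) is infinite here
print(np.arctan(kl_red_blue))   # pi/2
```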

whuber
1
Thank you for your suggestion about the Kolmogorov distance. Can you make your comment about the monotonic transformation a little bit more explicit? Thx
ocram
1
@Marco I don't understand how one could be any more explicit. Do you mean restating what I wrote in terms of a formula such as $\arctan(KL(P,Q))$, or $f(KL(P,Q))$ for $f:\mathbb{R}^{+} \to [0,C]$ with $x \le y$ implying $f(x) \le f(y)$ for all $x, y \ge 0$?
whuber
1
Yes, that's what I meant :-) I was not sure on what to apply the transformation. Now, it is clear, thx
ocram
1
@Marco: I am lost. Do you settle for the Kolmogorov distance (which is always finite but has nothing in common with KL divergence)? Or for a bounded monotone transform of the KL divergence (such as $\arctan$)? In the example of your post (and in any other non-absolutely-continuous example), the latter produces the supremum of the transform ($\pi/2$ if you settle for $\arctan$). In effect, this abandons any idea of estimating a distance between such probability measures more precisely than saying they are far, far away (whether you encode this by $\pi/2$ or by $+\infty$ is irrelevant).
Did
@Didier Yes, the transformed KL divergence (when symmetrized, as you describe) might not satisfy the triangle inequality and therefore would not be a distance, but it would still define a topology (which would likely be metrizable). You would thereby give up little or nothing. I remain agnostic about the merits of doing any of this: it seems to me this is just a way of papering over the difficulties associated with infinite values of the KL divergence in the first place.
whuber
2

Yes there does. Bernardo and Rueda defined something called the "intrinsic discrepancy", which for all practical purposes is a "symmetrised" version of the KL divergence. Taking the KL divergence from $P$ to $Q$ to be $\kappa(P \mid Q)$, the intrinsic discrepancy is given by:

$$\delta(P,Q) := \min\big[\kappa(P \mid Q),\ \kappa(Q \mid P)\big].$$

Searching for "intrinsic discrepancy" (or "Bayesian reference criterion") will give you some articles on this measure.

In your case, you would just take the KL-divergence which is finite.

Another alternative measure to KL is the Hellinger distance.

EDIT: Clarification. Some comments suggested that the intrinsic discrepancy will not be finite when one density is 0 where the other is not. This is not true if the operation of evaluating the zero density is carried out as a limit $Q \to 0$ or $P \to 0$. The limit is well defined, and it equals 0 for one of the KL divergences, while the other one diverges. To see this, note:

$$\delta(P,Q) := \min\left[\int P \log\!\left(\frac{P}{Q}\right),\ \int Q \log\!\left(\frac{Q}{P}\right)\right].$$

Taking the limit as $P \to 0$ over a region of the integral, the second integral diverges, while the first integral converges to 0 over this region (assuming the conditions are such that one can interchange limits and integration). This is because $\lim_{z \to 0} z \log(z) = 0$. By the symmetry in $P$ and $Q$, the result also holds for $Q$.
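Here is a minimal numerical sketch of the intrinsic discrepancy (my own addition, with the same hypothetical red/blue gamma densities as in the sketches above): one direction of the KL is infinite but the other is finite, so $\delta$ is finite, as claimed for the question's setup.

```python
import numpy as np
from scipy import stats

# Hypothetical red/blue densities (my assumption): blue has no mass below 0.25.
x = np.linspace(1e-6, 20.0, 200_001)
dx = x[1] - x[0]
p = stats.gamma(a=2.0).pdf(x)              # "red"
q = stats.gamma(a=2.0, loc=0.25).pdf(x)    # "blue"

def kl(p, q, dx):
    """KL(p||q) on a grid; +inf if p puts mass where q = 0."""
    if np.any((q == 0) & (p > 0)):
        return np.inf
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / q[m])) * dx

# Intrinsic discrepancy: the smaller of the two KL directions.
# KL(red || blue) = inf, but KL(blue || red) is finite, so delta is finite.
delta = min(kl(p, q, dx), kl(q, p, dx))
print(delta)
```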

probabilityislogic
1
Even the "intrinsic discrepancy" will be infinite when P is zero with positive probability for Q and vice versa, even if P and Q are otherwise identical.
whuber
1
Yes... I am afraid that the intrinsic discrepancy does not fulfil the requirement. But thank you for the suggestion. Any other suggestion would be appreciated.
ocram
1
It does fulfil the requirement if you restrict the blue density to the region where it is strictly positive, just as you have for the red one ($> 0$).
probabilityislogic
3
@probabilityislogic: I do not understand your last remarks. First, let us give their proper names to the notions involved and say that $P$ is absolutely continuous with respect to $Q$ (denoted $P \ll Q$) if, for every measurable $A$, $Q(A) = 0$ implies $P(A) = 0$. Now, notwithstanding your somewhat mysterious (to me) limit considerations, your $\delta(P,Q)$ is finite iff $P \ll Q$ or $Q \ll P$. .../...
Did
2
.../... A way out of the conundrum you seem to have dug yourself into might be to introduce the mid-point measure $R = \tfrac{1}{2}(P+Q)$. Since $P \ll R$ and $Q \ll R$, the quantity $\eta(P,Q) := \kappa(P \mid R) + \kappa(Q \mid R)$ is always finite. Furthermore $\eta(P,Q) = 0$ iff $P = Q$, and $\eta$ is symmetric. Hence $\eta(P,Q)$ indeed measures a kind of "distance" between $P$ and $Q$.
Did