How is Naive Bayes a linear classifier?

I've seen the other thread here, but I don't think the answer addressed the actual question. I keep reading that Naive Bayes is a linear classifier (e.g., here), meaning that it draws a linear decision boundary, and that this is shown by the log-odds argument.

However, I simulated two Gaussian clouds, fitted a decision boundary, and got the following result (the e1071 library in R, using naiveBayes()); class 1 is green, class 0 is red.

As we can see, the decision boundary is non-linear. Is the claim really that the parameters (the conditional probabilities) combine linearly in log space, rather than that the classifier itself separates the data linearly?

classification naive-bayes Kevin Pei

How did you create the decision boundary? I suspect it has to do with your fitting routine rather than with the true decision boundary of the classifier. Normally one would generate a decision boundary by computing the decision at every point in the quadrant.

seanv507

That is what I did: I took the two ranges X = [Min(x), Max(x)] and Y = [Min(y), Max(y)] with a spacing of 0.1. I then fed all of those grid points to the trained classifier and kept the points whose log-odds were between -0.05 and 0.05.

Kevin Pei
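The grid procedure described in this comment can be sketched in a few lines. This is a minimal sketch, not the e1071 implementation: the function names (`log_odds`, `boundary_points`) and the class parameters are hypothetical, and the per-feature Gaussians are assumed known rather than fitted from data.

```python
import math

def gaussian_log_pdf(x, mean, std):
    """Log-density of a univariate normal distribution."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - (x - mean) ** 2 / (2 * std ** 2)

def log_odds(point, params1, params0, prior1=0.5):
    """Naive Bayes log-odds log p(c=1|x) - log p(c=0|x) for independent Gaussian features.

    params1 / params0 are lists of (mean, std) pairs, one pair per feature.
    """
    total = math.log(prior1) - math.log(1 - prior1)
    for x, (m1, s1), (m0, s0) in zip(point, params1, params0):
        total += gaussian_log_pdf(x, m1, s1) - gaussian_log_pdf(x, m0, s0)
    return total

def boundary_points(params1, params0, x_range, y_range, step=0.1, tol=0.05):
    """Return the grid points whose absolute log-odds is below tol."""
    nx = int(round((x_range[1] - x_range[0]) / step)) + 1
    ny = int(round((y_range[1] - y_range[0]) / step)) + 1
    points = []
    for i in range(nx):
        for j in range(ny):
            p = (x_range[0] + i * step, y_range[0] + j * step)
            if abs(log_odds(p, params1, params0)) < tol:
                points.append(p)
    return points

# Hypothetical parameters: class 1 is wider than class 0 in both features.
params1 = [(1.0, 2.0), (1.0, 2.0)]
params0 = [(-1.0, 1.0), (-1.0, 1.0)]
pts = boundary_points(params1, params0, (-5.0, 5.0), (-5.0, 5.0))

# Three collinear points on the diagonal: the sign of the log-odds flips twice,
# which a straight-line boundary could not produce.
signs = [log_odds(p, params1, params0) > 0 for p in [(-5.0, -5.0), (-1.0, -1.0), (1.0, 1.0)]]
```

With unequal variances the selected points trace a closed curve around the narrower class, so a curved boundary from this procedure is expected behaviour rather than a plotting artifact.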

Answers:

In general the naive Bayes classifier is not linear, but if the likelihood factors $p(x_i \mid c)$ are from exponential families, the naive Bayes classifier corresponds to a linear classifier in a particular feature space. Here is how to see this.

You can write any naive Bayes classifier as*

$p(c = 1 \mid \mathbf{x}) = \sigma\left( \sum_i \log \frac{p(x_i \mid c = 1)}{p(x_i \mid c = 0)} + \log \frac{p(c = 1)}{p(c = 0)} \right),$

where $\sigma$ is the logistic function. If $p(x_i \mid c)$ is from an exponential family, we can write it as

$p(x_i \mid c) = h_i(x_i)\exp\left(\mathbf{u}_{ic}^\top \phi_i(x_i) - A_i(\mathbf{u}_{ic})\right),$

and hence

$p(c = 1 \mid \mathbf{x}) = \sigma\left( \sum_i \mathbf{w}_i^\top \phi_i(x_i) + b \right),$

where

$\begin{align} \mathbf{w}_i &= \mathbf{u}_{i1} - \mathbf{u}_{i0}, \\ b &= \log \frac{p(c = 1)}{p(c = 0)} - \sum_i \left( A_i(\mathbf{u}_{i1}) - A_i(\mathbf{u}_{i0}) \right). \end{align}$

Note that this is similar to logistic regression – a linear classifier – in the feature space defined by the $\phi_i$ . For more than two classes, we analogously get multinomial logistic (or softmax) regression.
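As a concrete check of this correspondence, here is a minimal sketch for Bernoulli features, where $\phi_i(x_i) = x_i$ and the classifier is therefore linear in the original features. The parameter values and the helper names (`nb_posterior`, `linear_form`) are hypothetical; the weights follow from $u = \operatorname{logit}\theta$ and $A(u) = -\log(1 - \theta)$.

```python
import math
from itertools import product

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p) - math.log(1.0 - p)

def nb_posterior(x, theta1, theta0, prior1):
    """p(c=1|x) computed directly from Bayes' rule with independent Bernoulli features."""
    def likelihood(theta):
        return math.prod(t if xi == 1 else 1.0 - t for xi, t in zip(x, theta))
    numerator = likelihood(theta1) * prior1
    return numerator / (numerator + likelihood(theta0) * (1.0 - prior1))

def linear_form(theta1, theta0, prior1):
    """Weights w_i = u_i1 - u_i0 and bias b of the equivalent logistic model."""
    w = [logit(t1) - logit(t0) for t1, t0 in zip(theta1, theta0)]
    b = logit(prior1) + sum(math.log(1 - t1) - math.log(1 - t0)
                            for t1, t0 in zip(theta1, theta0))
    return w, b

# Hypothetical per-feature success probabilities for each class.
theta1 = [0.8, 0.3, 0.6]
theta0 = [0.2, 0.5, 0.1]
prior1 = 0.4

w, b = linear_form(theta1, theta0, prior1)

# Largest disagreement between the two computations over all 8 binary inputs.
max_gap = max(
    abs(nb_posterior(x, theta1, theta0, prior1)
        - sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b))
    for x in product([0, 1], repeat=3)
)
```

The two computations agree up to floating-point rounding, which is the sense in which discrete naive Bayes is a linear classifier.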

If $p(x_i \mid c)$ is Gaussian, then $\phi_i(x_i) = (x_i, x_i^2)$ and we should have

$\begin{align} w_{i1} &= \sigma_1^{-2}\mu_1 - \sigma_0^{-2}\mu_0, \\ w_{i2} &= \tfrac{1}{2}\sigma_0^{-2} - \tfrac{1}{2}\sigma_1^{-2}, \\ b_i &= \log \sigma_0 - \log \sigma_1 + \tfrac{1}{2}\sigma_0^{-2}\mu_0^2 - \tfrac{1}{2}\sigma_1^{-2}\mu_1^2, \end{align}$

assuming $p(c = 1) = p(c = 0) = \frac{1}{2}$ .
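The Gaussian case can be checked numerically for a single feature. In this sketch the coefficients are obtained by expanding $\log \mathcal{N}(x \mid \mu_1, \sigma_1) - \log \mathcal{N}(x \mid \mu_0, \sigma_0)$ directly and collecting the terms in $x$, $x^2$ and the constant; the parameter values and function names are hypothetical, and equal priors are assumed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def gaussian_nb_weights(mu1, sigma1, mu0, sigma0):
    """Coefficients of the log-odds written as w1 * x + w2 * x**2 + b (equal priors)."""
    w1 = mu1 / sigma1 ** 2 - mu0 / sigma0 ** 2
    w2 = 0.5 / sigma0 ** 2 - 0.5 / sigma1 ** 2
    b = (math.log(sigma0) - math.log(sigma1)
         + mu0 ** 2 / (2 * sigma0 ** 2) - mu1 ** 2 / (2 * sigma1 ** 2))
    return w1, w2, b

# Hypothetical class-conditional parameters for one feature.
mu1, sigma1 = 1.0, 2.0
mu0, sigma0 = -1.0, 1.0
w1, w2, b = gaussian_nb_weights(mu1, sigma1, mu0, sigma0)

def posterior_direct(x):
    """p(c=1|x) from Bayes' rule with equal priors."""
    p1, p0 = normal_pdf(x, mu1, sigma1), normal_pdf(x, mu0, sigma0)
    return p1 / (p1 + p0)

def posterior_linear(x):
    """The same posterior as a logistic model on the features phi(x) = (x, x^2)."""
    return sigmoid(w1 * x + w2 * x ** 2 + b)

max_gap = max(abs(posterior_direct(x) - posterior_linear(x))
              for x in [-3.0, -1.5, 0.0, 0.7, 2.5])
```

The model is linear in $(x, x^2)$ but quadratic in $x$ itself whenever `w2` is non-zero, that is, whenever the two class variances differ.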

*Here is how to derive this result:

$\begin{align} p(c = 1 \mid \mathbf{x}) &= \frac{p(\mathbf{x} \mid c = 1) p(c = 1)}{p(\mathbf{x} \mid c = 1) p(c = 1) + p(\mathbf{x} \mid c = 0) p(c = 0)} \\ &= \frac{1}{1 + \frac{p(\mathbf{x} \mid c = 0) p(c = 0)}{p(\mathbf{x} \mid c = 1) p(c = 1)}} \\ &= \frac{1}{1 + \exp\left( -\log\frac{p(\mathbf{x} \mid c = 1) p(c = 1)}{p(\mathbf{x} \mid c = 0) p(c = 0)} \right)} \\ &= \sigma\left( \sum_i \log \frac{p(x_i \mid c = 1)}{p(x_i \mid c = 0)} + \log \frac{p(c = 1)}{p(c = 0)} \right) \end{align}$

Lucas

Thank you for the derivation, which I now understand. Can you explain the notation in equation 2 and below (u, h(x_i), phi(x_i), etc.)? Is P(x_i | c) under an exponential family simply the value taken from the pdf?

Kevin Pei

There are different ways you can express one and the same distribution. The second equation is an exponential family distribution in canonical form. Many distributions are exponential families (Gaussian, Laplace, Dirichlet, Bernoulli, binomial, just to name a few), but their density/mass function is typically not given in canonical form. So you first have to reparametrize the distribution. This table tells you how to compute $\mathbf{u}$ (natural parameters) and $\phi$ (sufficient statistics) for various distributions: en.wikipedia.org/wiki/Exponential_family#Table_of_distributions

Lucas
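As a worked illustration of this reparametrization, the univariate Gaussian density can be rearranged into canonical form by expanding the square in the exponent. The feature index $i$ and class index $c$ are dropped here for readability.

$\begin{align} p(x \mid \mu, \sigma) &= \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \\ &= \frac{1}{\sqrt{2\pi}} \exp\left( \frac{\mu}{\sigma^2}\, x - \frac{1}{2\sigma^2}\, x^2 - \frac{\mu^2}{2\sigma^2} - \log \sigma \right), \end{align}$

so that $h(x) = 1/\sqrt{2\pi}$, $\phi(x) = (x, x^2)$, $\mathbf{u} = \left( \mu/\sigma^2,\; -1/(2\sigma^2) \right)$ and $A(\mathbf{u}) = \mu^2/(2\sigma^2) + \log \sigma$. The parameters $\mu$ and $\sigma$ appear only in $\mathbf{u}$ and $A$, never in $\phi$, which is why the sufficient statistic is $(x, x^2)$ whenever both the mean and the variance are unknown.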

Notice the important point that $\phi(x) = (x, x^2)$. What this means is that linear classifiers are a linear combination of weights $\mathbf{w}$ and potentially non-linear functions of the features! So, to the original poster's point, a plot of the datapoints may not show that they are separable by a line.

RMurphy

I find this answer misleading: as pointed out in the comment just above, and in the answer just below, Gaussian naive Bayes is not linear in the original feature space, but in a non-linear transform of it. Hence it is not a conventional linear classifier.

Gael Varoquaux

Why, if $p(x_i|c)$ is Gaussian, is $\phi_i(x_i)=(x_i,x_i^2)$? I think the sufficient statistic $T(x)$ for a Gaussian distribution should be $x/\sigma$.

Naomi

It is linear only if the class-conditional variance matrices are the same for both classes. To see this, write down the ratio of the log posteriors: you only get a linear function out of it if the corresponding variances are the same. Otherwise it is quadratic.

axk
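This claim can be checked numerically for a single Gaussian feature. The sketch below uses a second difference of the log-odds as a curvature probe: it is zero for an affine function and equals twice the $x^2$ coefficient for a quadratic one when the step is 1. The parameter values and function names are hypothetical.

```python
import math

def log_normal(x, mu, sigma):
    """Log-density of a univariate normal distribution."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - (x - mu) ** 2 / (2 * sigma ** 2)

def log_odds(x, mu1, sigma1, mu0, sigma0):
    """Log-likelihood ratio of class 1 to class 0 (equal priors assumed)."""
    return log_normal(x, mu1, sigma1) - log_normal(x, mu0, sigma0)

def second_difference(f, x, h=1.0):
    """f(x+h) - 2 f(x) + f(x-h): zero for affine f, 2*a*h**2 for f = a*x**2 + ..."""
    return f(x + h) - 2 * f(x) + f(x - h)

# Same variance in both classes: the log-odds has no curvature.
shared = second_difference(lambda x: log_odds(x, 1.0, 1.5, -1.0, 1.5), 0.3)

# Different variances: the x^2 coefficient is 1/(2*1.0**2) - 1/(2*2.0**2) = 0.375.
unequal = second_difference(lambda x: log_odds(x, 1.0, 2.0, -1.0, 1.0), 0.3)
```

Here `shared` is zero up to rounding while `unequal` is 0.75, twice the quadratic coefficient, so the boundary is only a line in the original features when the per-class variances coincide.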

I'd like to add one additional point: the reason for some of the confusion rests on what it means to be performing "Naive Bayes classification".

Under the broad topic of "Gaussian Discriminant Analysis (GDA)" there are several techniques: QDA, LDA, GNB, and DLDA (quadratic DA, linear DA, gaussian naive bayes, diagonal LDA). [UPDATED] LDA and DLDA should be linear in the space of the given predictors. (See, e.g., Murphy, 4.2, pg. 101 for DA and pg. 82 for NB. Note: GNB is not necessarily linear. Discrete NB (which uses a multinomial distribution under the hood) is linear. You can also check out Duda, Hart & Stork section 2.6). QDA is quadratic as other answers have pointed out (and which I think is what is happening in your graphic - see below).

These techniques form a lattice with a nice set of constraints on the "class-wise covariance matrices" $\Sigma_c$ :

QDA: $\Sigma_c$ arbitrary: arbitrary ftr. cov. matrix per class
LDA: $\Sigma_c = \Sigma$ : shared cov. matrix (over classes)
GNB: $\Sigma_c = {diag}_c$ : class wise diagonal cov. matrices (the assumption of ind. in the model $\rightarrow$ diagonal cov. matrix)
DLDA: $\Sigma_c = diag$ : shared & diagonal cov. matrix
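The four constraints above can be sketched as different ways of post-processing the per-class sample covariances. This is an illustrative sketch only: `class_covariances` and its helpers are hypothetical names, the covariances are maximum-likelihood estimates (dividing by n), and the shared matrix is assumed to be the count-weighted average of the per-class matrices.

```python
def covariance(rows):
    """Maximum-likelihood covariance matrix (divides by n) of a list of feature vectors."""
    n = len(rows)
    mu = [sum(col) / n for col in zip(*rows)]
    d = len(mu)
    return [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / n
             for j in range(d)] for i in range(d)]

def diagonal(matrix):
    """Zero out every off-diagonal entry (the independence assumption)."""
    return [[v if i == j else 0.0 for j, v in enumerate(row)]
            for i, row in enumerate(matrix)]

def pooled(covs, counts):
    """Count-weighted average of the per-class covariance matrices."""
    total = sum(counts)
    d = len(covs[0])
    return [[sum(n * c[i][j] for c, n in zip(covs, counts)) / total
             for j in range(d)] for i in range(d)]

def class_covariances(data, method):
    """data maps class label -> list of feature vectors; returns label -> covariance."""
    per_class = {c: covariance(rows) for c, rows in data.items()}
    if method == "QDA":   # arbitrary matrix per class
        return per_class
    if method == "GNB":   # diagonal matrix per class
        return {c: diagonal(s) for c, s in per_class.items()}
    shared = pooled(list(per_class.values()), [len(rows) for rows in data.values()])
    if method == "LDA":   # one full matrix shared by all classes
        return {c: shared for c in data}
    if method == "DLDA":  # one diagonal matrix shared by all classes
        return {c: diagonal(shared) for c in data}
    raise ValueError("unknown method: " + method)

# Hypothetical toy data with two features.
data = {
    0: [[0.0, 0.0], [2.0, 2.0]],
    1: [[0.0, 1.0], [4.0, 0.0], [2.0, 2.0]],
}
qda = class_covariances(data, "QDA")
gnb = class_covariances(data, "GNB")
lda = class_covariances(data, "LDA")
dlda = class_covariances(data, "DLDA")
```

Only the two shared-covariance variants (LDA, DLDA) give every class the same matrix, which is the condition under which the quadratic terms cancel in the log-odds.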

While the docs for e1071 claim that it is assuming class-conditional independence (i.e., GNB), I'm suspicious that it is actually doing QDA. Some people conflate "naive Bayes" (making independence assumptions) with "simple Bayesian classification rule". All of the GDA methods are derived from the latter, but only GNB and DLDA use the former.

A big warning, I haven't read the e1071 source code to confirm what it is doing.

MrDrFenner