LASSO i grzbiet z perspektywy Bayesa: co z parametrem strojenia?

Mówi się, że estymatory regresji karnej, takie jak LASSO i kalenica, odpowiadają estymatorom bayesowskim z pewnymi priorytetami.

Tak to jest poprawne. Ilekroć mamy problem z optymalizacją, polegający na maksymalizacji funkcji logarytmu prawdopodobieństwa plus funkcji kary na parametrach, jest to matematycznie równoważne późniejszej maksymalizacji, gdzie funkcja kary jest logarytmem poprzedniego jądra. Aby to zobaczyć, załóżmy, że mamy funkcję kary używając parametru strojenia . Funkcję celu w tych przypadkach można zapisać jako: $^\dagger$ $w$ $\lambda$

\begin{aligned} H_{x} (θ | λ) & = ℓ_{x} (θ) - w (θ | λ) \\ = \ln (L_{x} (θ) \cdot \exp (- w (θ | λ))) \\ = \ln (\frac{L_{x} (θ) π (θ | λ)}{\int L_{x} (θ) π (θ | λ) d θ}) + const \\ = \ln π (θ | x, λ) + const, \end{aligned}

$\begin{equation} \begin{aligned} H_\mathbf{x}(\theta|\lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) \\[6pt] &= \ln \Big( L_\mathbf{x}(\theta) \cdot \exp ( -w(\theta|\lambda)) \Big) \\[6pt] &= \ln \Bigg( \frac{L_\mathbf{x}(\theta) \pi (\theta|\lambda)}{\int L_\mathbf{x}(\theta) \pi (\theta|\lambda) d\theta} \Bigg) + \text{const} \\[6pt] &= \ln \pi(\theta|\mathbf{x}, \lambda) + \text{const}, \\[6pt] \end{aligned} \end{equation}$

gdzie używamy wcześniejszego $\pi(\theta|\lambda) \propto \exp ( -w(\theta|\lambda))$ . Zauważ, że parametr strojenia w optymalizacji jest traktowany jako stały hiperparametr we wcześniejszym rozkładzie. Jeśli przeprowadzasz klasyczną optymalizację ze stałym parametrem strojenia, jest to równoważne z przeprowadzeniem optymalizacji bayesowskiej ze stałym hiperparametrem. W przypadku regresji LASSO i Ridge funkcje karne i odpowiadające im wcześniejsze odpowiedniki to:

\begin{aligned} LASSO Regression & π (θ | λ) & = \prod_{k = 1}^{m} Laplace (0, \frac{1}{λ}) = \prod_{k = 1}^{m} \frac{λ}{2} \cdot \exp (- λ | θ_{k} |), \\ Ridge Regression & π (θ | λ) & = \prod_{k = 1}^{m} Normal (0, \frac{1}{2 λ}) = \prod_{k = 1}^{m} \sqrt{λ / π} \cdot \exp (- λ θ_{k}^{2}) . \end{aligned}

$\begin{equation} \begin{aligned} \text{LASSO Regression} & & \pi(\theta|\lambda) &= \prod_{k=1}^m \text{Laplace} \Big( 0, \frac{1}{\lambda} \Big) = \prod_{k=1}^m \frac{\lambda}{2} \cdot \exp ( -\lambda |\theta_k| ), \\[6pt] \text{Ridge Regression} & & \pi(\theta|\lambda) &= \prod_{k=1}^m \text{Normal} \Big( 0, \frac{1}{2\lambda} \Big) = \prod_{k=1}^m \sqrt{\lambda/\pi} \cdot \exp ( -\lambda \theta_k^2 ). \\[6pt] \end{aligned} \end{equation}$

The former method penalises the regression coefficients according to their absolute magnitude, which is the equivalent of imposing a Laplace prior located at zero. The latter method penalises the regression coefficients according to their squared magnitude, which is the equivalent of imposing a normal prior located at zero.

Now a frequentist would optimize the tuning parameter by cross validation. Is there a Bayesian equivalent of doing so, and is it used at all?

Tak długo, jak metoda częstokrzyska może być przedstawiana jako problem optymalizacyjny (zamiast powiedzieć, że obejmuje test hipotezy lub coś w tym rodzaju), będzie analogia Bayesa z użyciem równoważnego wcześniejszego. Podobnie jak częste osoby mogą traktować parametr strojenia $\lambda$ jako nieznany i oszacować to na podstawie danych, tak Bayesian może podobnie traktować hiperparametr $\lambda$ jako nieznany. W pełnej analizie bayesowskiej wymagałoby to nadania hiperparametru własnemu przejęciu i znalezienia tylnego maksimum pod tym przełożeniem, co byłoby analogiczne do maksymalizacji następującej funkcji celu:

\begin{aligned} H_{x} (θ, λ) & = ℓ_{x} (θ) - w (θ | λ) - h (λ) \\ = \ln (L_{x} (θ) \cdot \exp (- w (θ | λ)) \cdot \exp (- h (λ))) \\ = \ln (\frac{L_{x} (θ) π (θ | λ) π (λ)}{\int L_{x} (θ) π (θ | λ) π (λ) d θ}) + const \\ = \ln π (θ, λ | x) + const . \end{aligned}

$\begin{equation} \begin{aligned} H_\mathbf{x}(\theta, \lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) - h(\lambda) \\[6pt] &= \ln \Big( L_\mathbf{x}(\theta) \cdot \exp ( -w(\theta|\lambda)) \cdot \exp ( -h(\lambda)) \Big) \\[6pt] &= \ln \Bigg( \frac{L_\mathbf{x}(\theta) \pi (\theta|\lambda) \pi (\lambda)}{\int L_\mathbf{x}(\theta) \pi (\theta|\lambda) \pi (\lambda) d\theta} \Bigg) + \text{const} \\[6pt] &= \ln \pi(\theta, \lambda|\mathbf{x}) + \text{const}. \\[6pt] \end{aligned} \end{equation}$

This method is indeed used in Bayesian analysis in cases where the analyst is not comfortable choosing a specific hyperparameter for their prior, and seeks to make the prior more diffuse by treating it as unknown and giving it a distribution. (Note that this is just an implicit way of giving a more diffuse prior to the parameter of interest $\theta$ .)

(Comment from statslearner2 below) I'm looking for numerical equivalent MAP estimates. For instance, for a fixed penalty Ridge there is a gaussian prior that will give me the MAP estimate exactly equal the ridge estimate. Now, for k-fold CV ridge, what is the hyper-prior that would give me the MAP estimate which is similar to the CV-ridge estimate?

Before proceeding to look at $K$ -fold cross-validation, it is first worth noting that, mathematically, the maximum a posteriori (MAP) method is simply an optimisation of a function of the parameter $\theta$ and the data $\mathbf{x}$ . If you are willing to allow improper priors then the scope encapsulates any optimisation problem involving a function of these variables. Thus, any frequentist method that can be framed as a single optimisation problem of this kind has a MAP analogy, and any frequentist method that cannot be framed as a single optimisation of this kind does not have a MAP analogy.

$K$ $\lambda$ $\mathbb{x}$ $K$ $\mathbf{x}_1,...,\mathbf{x}_K$ $k=1,...,K$ $\mathbf{x}_{-k}$ and then measure the fit of the model with the "testing" data $\mathbf{x}_k$ . In each fit you get an estimator for the model parameters, which then gives you predictions of the testing data, which can then be compared to the actual testing data to give a measure of "loss":

\begin{matrix} Estimator & \hat{θ} (x_{- k}, λ), \\ Predictions & {\hat{x}}_{k} (x_{- k}, λ), \\ Testing loss & L_{k} ({\hat{x}}_{k}, x_{k} | x_{- k}, λ) . \end{matrix}

$\begin{matrix} \text{Estimator} & & \hat{\theta}(\mathbf{x}_{-k}, \lambda), \\[6pt] \text{Predictions} & & \hat{\mathbf{x}}_k(\mathbf{x}_{-k}, \lambda), \\[6pt] \text{Testing loss} & & \mathscr{L}_k(\hat{\mathbf{x}}_k, \mathbf{x}_k| \mathbf{x}_{-k}, \lambda). \\[6pt] \end{matrix}$

The loss measures for each of the $K$ "folds" can then be aggregated to get an overall loss measure for the cross-validation:

L (x, λ) = \sum_{k} L_{k} ({\hat{x}}_{k}, x_{k} | x_{- k}, λ)

$\mathscr{L}(\mathbf{x}, \lambda) = \sum_k \mathscr{L}_k(\hat{\mathbf{x}}_k, \mathbf{x}_k| \mathbf{x}_{-k}, \lambda)$

One then estimates the tuning parameter by minimising the overall loss measure:

\hat{λ} \equiv \hat{λ} (x) \equiv \underset{λ}{arg min} L (x, λ) .

$\hat{\lambda} \equiv \hat{\lambda}(\mathbf{x}) \equiv \underset{\lambda}{\text{arg min }} \mathscr{L}(\mathbf{x}, \lambda).$

We can see that this is an optimisation problem, and so we now have two seperate optimisation problems (i.e., the one described in the sections above for $\theta$ , and the one described here for $\lambda$ ). Since the latter optimisation does not involve $\theta$ , we can combine these optimisations into a single problem, with some technicalities that I discuss below. To do this, consider the optimisation problem with objective function:

\begin{aligned} H_{x} (θ, λ) & = ℓ_{x} (θ) - w (θ | λ) - δ L (x, λ), \end{aligned}

where $\delta > 0$ is a weighting value on the tuning-loss. As $\delta \rightarrow \infty$ the weight on optimisation of the tuning-loss becomes infinite and so the optimisation problem yields the estimated tuning parameter from $K$ -fold cross-validation (in the limit). The remaining part of the objective function is the standard objective function conditional on this estimated value of the tuning parameter. Now, unfortunately, taking $\delta = \infty$ screws up the optimisation problem, but if we take $\delta$ to be a very large (but still finite) value, we can approximate the combination of the two optimisation problems up to arbitrary accuracy.

From the above analysis we can see that it is possible to form a MAP analogy to the model-fitting and $K$ -fold cross-validation process. This is not an exact analogy, but it is a close analogy, up to arbitrarily accuracy. It is also important to note that the MAP analogy no longer shares the same likelihood function as the original problem, since the loss function depends on the data and is thus absorbed as part of the likelihood rather than the prior. In fact, the full analogy is as follows:

\begin{aligned} H_{x} (θ, λ) & = ℓ_{x} (θ) - w (θ | λ) - δ L (x, λ) \\ = \ln (\frac{L_{x}^{*} (θ, λ) π (θ, λ)}{\int L_{x}^{*} (θ, λ) π (θ, λ) d θ}) + const, \end{aligned}

$\begin{equation} \begin{aligned} \mathcal{H}_\mathbf{x}(\theta, \lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) - \delta \mathscr{L}(\mathbf{x}, \lambda) \\[6pt] &= \ln \Bigg( \frac{L_\mathbf{x}^*(\theta, \lambda) \pi (\theta, \lambda)}{\int L_\mathbf{x}^*(\theta, \lambda) \pi (\theta, \lambda) d\theta} \Bigg) + \text{const}, \\[6pt] \end{aligned} \end{equation}$

where $L_\mathbf{x}^*(\theta, \lambda) \propto \exp( \ell_\mathbf{x}(\theta) - \delta \mathscr{L}(\mathbf{x}, \lambda))$ and $\pi (\theta, \lambda) \propto \exp( -w(\theta|\lambda))$ , with a fixed (and very large) hyper-parameter $\delta$ .

$^\dagger$ This gives an improper prior in cases where the penalty does not correspond to the logarithm of a sigma-finite density.

Reinstate Monica
źródło

Ok +1 already, but for the bounty I'm looking for these more precise answers.

statslearner2

1. I do not get how (since frequentists generally use classical hypothesis tests, etc., which have no Bayesian equivalent) connects to the rest of what I or you are saying; parameter tuning has nothing to do with hypothesis tests, or does it? 2. Do I understand you correctly that there is no Bayesian equivalent to frequentist regularized estimation when the tuning parameter is selected by cross validation? What about empirical Bayes that amoeba mentions in the comments to the OP?

Richard Hardy

3. Since regularization with cross validation seems to be quite effective for, say, prediction, doesn't point 2. suggest that the Bayesian approach is somehow inferior?

Richard Hardy

@Ben, thanks for your explicit answer and the subsequent clarifications. You have once again done a wonderful job! Regarding 3., yes, it was quite a jump; it certainly is not a strict logical conclusion. But looking at your points w.r.t. 2. (that a Bayesian method can approximate the frequentist penalized optimization with cross validation), I no longer think that Bayesian must be "inferior". The last quibble on my side is, could you perhaps explain how the last, complicated formula could arise in practice in the Bayesian paradigm? Is it something people would normally use or not?

Richard Hardy

@Ben (ctd) My problem is that I know little about Bayes. Once it gets technical, I may easily lose the perspective. So I wonder whether this complicated analogy (the last formula) is something that is just a technical possibility or rather something that people routinely use. In other words, I am interested in whether the idea behind cross validation (here in the context of penalized estimation) is resounding in the Bayesian world, whether its advantages are utilized there. Perhaps this could be a separate question, but a short description will suffice for this particular case.

Richard Hardy

LASSO i grzbiet z perspektywy Bayesa: co z parametrem strojenia?

Odpowiedzi: