I have read the most popular books in statistical learning:
1- The Elements of Statistical Learning.
2- An Introduction to Statistical Learning.
Both mention that ridge regression has two equivalent formulations. Is there an understandable mathematical proof of this result?
I have also gone through Cross Validated, but I cannot find a definitive proof there.
Furthermore, would LASSO use the same kind of proof?
Answers:
The classic Ridge Regression (Tikhonov Regularization) is given by:

$$\arg \min_{x} \frac{1}{2} \left\| x - y \right\|_{2}^{2} + \lambda \left\| x \right\|_{2}^{2}$$

The claim above is that the following problem is equivalent:

$$\arg \min_{x} \frac{1}{2} \left\| x - y \right\|_{2}^{2} \quad \text{subject to} \quad \left\| x \right\|_{2}^{2} \leq t$$
Let's define $\hat{x}$ as the optimal solution of the first problem and $\tilde{x}$ as the optimal solution of the second problem.
The claim of equivalence means that $\forall t, \; \exists \lambda \geq 0$ such that $\hat{x} = \tilde{x}$. Namely, you can always find a pair of $t$ and $\lambda \geq 0$ such that the solutions of the two problems are the same.
How could we find a pair?
Well, by solving the problems and looking at the properties of the solution.
Both problems are convex and smooth, so that should make things simpler.
The solution of the first problem is given at the point where the gradient vanishes, which means:

$$\hat{x} - y + 2 \lambda \hat{x} = 0$$
The KKT conditions of the second problem state:

$$\tilde{x} - y + 2 \mu \tilde{x} = 0$$

and

$$\mu \left( \left\| \tilde{x} \right\|_{2}^{2} - t \right) = 0$$
The last equation suggests that either $\mu = 0$ or $\left\| \tilde{x} \right\|_{2}^{2} = t$.
Pay attention that the two stationarity equations are equivalent. Namely, if $\hat{x} = \tilde{x}$ and $\mu = \lambda$, both equations hold.
So it means that in case $\left\| y \right\|_{2}^{2} \leq t$ one must set $\mu = 0$ (since then the unconstrained solution $\tilde{x} = y$ is already feasible), which means that for $t$ large enough, in order for both to be equivalent, one must set $\lambda = 0$.
In the other case one should find $\mu$ where:

$$\frac{\left\| y \right\|_{2}^{2}}{\left( 1 + 2 \mu \right)^{2}} = t$$

This is basically when $\left\| \tilde{x} \right\|_{2}^{2} = t$, using $\tilde{x} = \frac{y}{1 + 2 \mu}$ from the stationarity equation.
Once you find that $\mu$, the solutions will coincide.
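To make this concrete, here is a minimal numerical sketch (my own illustration, not part of the original derivation, with hypothetical function names) of the correspondence for this denoising form: both problems are solved in closed form, and the constrained solution with multiplier $\mu$ matches the penalized solution with $\lambda = \mu$.

```python
import numpy as np

def penalized(y, lam):
    # argmin_x 0.5 * ||x - y||_2^2 + lam * ||x||_2^2  =>  x = y / (1 + 2 * lam)
    return y / (1.0 + 2.0 * lam)

def constrained(y, t):
    # argmin_x 0.5 * ||x - y||_2^2  s.t.  ||x||_2^2 <= t
    if y @ y <= t:
        return y, 0.0                                    # constraint inactive: mu = 0
    mu = (np.linalg.norm(y) / np.sqrt(t) - 1.0) / 2.0    # solves ||y||^2 / (1 + 2 mu)^2 = t
    return y / (1.0 + 2.0 * mu), mu

rng = np.random.default_rng(0)
y = rng.normal(size=5)
t = 0.5 * (y @ y)                        # budget small enough for the constraint to bind
x_con, mu = constrained(y, t)
print(np.allclose(x_con, penalized(y, mu)))   # True: same solution with lambda = mu
print(np.isclose(x_con @ x_con, t))           # True: the constraint binds
```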
Regarding the $L_1$ (LASSO) case, it works with the same idea. The only difference is that we don't have a closed-form solution in general, hence deriving the connection is trickier.
Have a look at my answers at StackExchange Cross Validated Q291962 and StackExchange Signal Processing Q21730 - Significance of $\lambda$ in Basis Pursuit.
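As an illustration of the same idea in the $L_1$ case (again my own sketch, not code from the linked answers): for the denoising form $\arg \min_x \frac{1}{2} \|x - y\|_2^2 + \lambda \|x\|_1$ the solution is soft-thresholding, and since the $L_1$ norm of the solution decreases monotonically in $\lambda$, the $\lambda$ matching a given budget $t$ can be found by bisection.

```python
import numpy as np

def soft_threshold(y, lam):
    # argmin_x 0.5 * ||x - y||_2^2 + lam * ||x||_1, solved coordinate-wise
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def lam_for_budget(y, t, iters=100):
    # bisection on lam: ||soft_threshold(y, lam)||_1 is non-increasing in lam
    lo, hi = 0.0, np.max(np.abs(y))        # at lam = hi the solution is exactly zero
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.abs(soft_threshold(y, mid)).sum() > t:
            lo = mid
        else:
            hi = mid
    return hi

rng = np.random.default_rng(1)
y = rng.normal(size=6)
t = 0.5 * np.abs(y).sum()                  # budget small enough to bind
lam = lam_for_budget(y, t)
print(np.isclose(np.abs(soft_threshold(y, lam)).sum(), t, atol=1e-8))   # True
```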
Remark
What's actually happening?
In both problems, $x$ tries to be as close as possible to $y$. In the first case, $x = y$ will vanish the first term (the $L_2$ distance), and in the second case it will make the objective function vanish.
The difference is that in the first case one must balance the $L_2$ norm of $x$. As $\lambda$ gets higher, the balance means you should make $x$ smaller.
In the second case there is a wall: you bring $x$ closer and closer to $y$ until you hit the wall, which is the constraint on its norm (by $t$).
If the wall is far enough (a high value of $t$) relative to the norm of $y$, then it has no meaning, just like $\lambda$ is relevant only when its value multiplied by the norm of $y$ starts to be meaningful.
The exact connection is by the Lagrangian stated above.
Resources
I found this paper today (03/04/2019):
source
A less mathematically rigorous, but possibly more intuitive, approach to understanding what is going on is to start with the constraint version (equation 3.42 in the question) and solve it using the method of Lagrange multipliers (https://en.wikipedia.org/wiki/Lagrange_multiplier or your favorite multivariable calculus text). Just remember that in calculus $x$ is the vector of variables, but in our case $x$ is constant and $\beta$ is the variable vector. Once you apply the Lagrange multiplier technique you end up with the first equation (3.41) (after throwing away the extra $-\lambda t$ term, which is constant relative to the minimization and can be ignored).
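For concreteness, here is a sketch of that step (assuming, as in ESL, that (3.41) is the penalized criterion and (3.42) the constrained one):

```latex
% Constrained form (3.42):
\min_{\beta}\; \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2
\quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t

% Lagrangian with multiplier \lambda \ge 0:
L(\beta, \lambda) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2
  + \lambda \Big( \sum_{j=1}^{p} \beta_j^2 - t \Big)

% Dropping the constant -\lambda t yields the penalized form (3.41):
\min_{\beta}\; \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2
  + \lambda \sum_{j=1}^{p} \beta_j^2
```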
This also shows that this works for lasso and other constraints.
source
It's perhaps worth reading about Lagrangian duality and the broader relation (at times equivalence) between:
- optimization subject to hard constraints, and
- optimization with penalties added to the objective.
Quick intro to weak duality and strong duality
Assume we have some function $f(x, y)$ of two variables. For any $\hat{x}$ and $\hat{y}$, we have:

$$\min_x f(x, \hat{y}) \leq f(\hat{x}, \hat{y}) \leq \max_y f(\hat{x}, y)$$

Since that holds for any $\hat{x}$ and $\hat{y}$, it also holds that:

$$\max_y \min_x f(x, y) \leq \min_x \max_y f(x, y)$$

This is known as weak duality. In certain circumstances, you also have strong duality (also known as the saddle point property):

$$\max_y \min_x f(x, y) = \min_x \max_y f(x, y)$$
When strong duality holds, solving the dual problem also solves the primal problem. They're in a sense the same problem!
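As a quick sanity check (my own sketch, with an arbitrary example function), weak duality can be verified numerically on a grid:

```python
import numpy as np

# f(x, y) = (x - 1)^2 + x * y on a grid; rows index x, columns index y
xs = np.linspace(-2.0, 2.0, 401)
ys = np.linspace(-2.0, 2.0, 401)
F = (xs[:, None] - 1.0) ** 2 + xs[:, None] * ys[None, :]

max_min = F.min(axis=0).max()   # max over y of (min over x of f)
min_max = F.max(axis=1).min()   # min over x of (max over y of f)
print(max_min <= min_max)       # True: weak duality always holds
```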
Lagrangian for constrained Ridge Regression
Let me define the function $\mathcal{L}$ as:

$$\mathcal{L}(b, \lambda) = \sum_{i=1}^{n} \left( y_i - x_i' b \right)^2 + \lambda \left( \sum_{j=1}^{p} b_j^2 - t \right)$$
The min-max interpretation of the Lagrangian
The Ridge regression problem subject to hard constraints is:

$$\min_{b} \max_{\lambda \geq 0} \mathcal{L}(b, \lambda)$$

You pick $b$ to minimize the objective, cognizant that after $b$ is picked, your opponent will set $\lambda$ to infinity if you chose $b$ such that $\sum_{j=1}^{p} b_j^2 > t$.
If strong duality holds (which it does here because Slater's condition is satisfied for $t > 0$), you then achieve the same result by reversing the order:

$$\max_{\lambda \geq 0} \min_{b} \mathcal{L}(b, \lambda)$$

Here, your opponent chooses $\lambda$ first! You then choose $b$ to minimize the objective, already knowing their choice of $\lambda$. The $\min_{b} \mathcal{L}(b, \lambda)$ part (taking $\lambda$ as given) is equivalent to the 2nd form of your Ridge regression problem.
As you can see, this isn't a result particular to Ridge regression. It is a broader concept.
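To illustrate the correspondence numerically (a sketch under assumed names, not code from this answer): find the dual-optimal $\lambda^{\ast}$ for a hard budget $t$ by bisection, and check that the penalized solution at $\lambda^{\ast}$ binds the budget and beats every other candidate on the budget.

```python
import numpy as np

def ridge(X, y, lam):
    # penalized form: argmin_b ||y - X b||^2 + lam * ||b||^2
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(2)
X, y = rng.normal(size=(30, 4)), rng.normal(size=30)
b_ols = ridge(X, y, 0.0)                   # requires X'X invertible
t = 0.25 * (b_ols @ b_ols)                 # budget small enough to bind

# bisection for lambda*: ||b(lam)||^2 is decreasing in lam
lo, hi = 0.0, 1.0
while ridge(X, y, hi) @ ridge(X, y, hi) > t:
    hi *= 2.0                              # bracket lambda*
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if ridge(X, y, mid) @ ridge(X, y, mid) > t:
        lo = mid
    else:
        hi = mid
b_star = ridge(X, y, hi)                   # penalized solution at lambda* = hi

rss = lambda b: np.sum((y - X @ b) ** 2)
rivals = [np.sqrt(t) * v / np.linalg.norm(v) for v in rng.normal(size=(100, 4))]
print(np.isclose(b_star @ b_star, t))                    # True: constraint binds
print(all(rss(b_star) <= rss(b) for b in rivals))        # True: optimal on the budget
```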
References
(I started this post following an exposition I read from Rockafellar.)
Rockafellar, R.T., Convex Analysis
You might also examine lectures 7 and 8 from Prof. Stephen Boyd's course on convex optimization.
source
They are not equivalent.
For a constrained minimization problem

$$\min_{b} \sum_{i=1}^{n} \left( y_i - x_i' b \right)^2 \quad \text{s.t.} \quad \sum_{j=1}^{p} b_j^2 \leq t \tag{1}$$

we solve by minimizing over $b$ the corresponding Lagrangean

$$\mathcal{L} = \sum_{i=1}^{n} \left( y_i - x_i' b \right)^2 + \lambda \left( \sum_{j=1}^{p} b_j^2 - t \right) \tag{2}$$
Here, $t$ is a bound given exogenously, $\lambda \geq 0$ is a Karush-Kuhn-Tucker non-negative multiplier, and both the vector $b$ and $\lambda$ are to be determined optimally through the minimization procedure, given $t$.
Comparing $(2)$ with eq. $(3.41)$ in the OP's post, it appears that the Ridge estimator can be obtained as the solution to

$$\min_{b} \left\{ \mathcal{L} + \lambda t \right\} = \min_{b} \left\{ \sum_{i=1}^{n} \left( y_i - x_i' b \right)^2 + \lambda \sum_{j=1}^{p} b_j^2 \right\} \tag{3}$$

Since in $(3)$ the function to be minimized appears to be the Lagrangean of the constrained minimization problem plus a term that does not involve $b$, it would appear that indeed the two approaches are equivalent...
But this is not correct, because in Ridge regression we minimize over $b$ given $\lambda > 0$. But, in the lens of the constrained minimization problem, assuming $\lambda > 0$ imposes the condition that the constraint is binding, i.e. that

$$\sum_{j=1}^{p} b_j^2 = t$$
The general constrained minimization problem allows for $\lambda = 0$ also, and essentially it is a formulation that includes as special cases the basic least-squares estimator ($\lambda^* = 0$) and the Ridge estimator ($\lambda^* > 0$).
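A small numerical sketch of this point (my own illustration, with a hypothetical ridge helper): every $\lambda > 0$ in the penalized form induces a budget $t(\lambda) = \|\hat{b}(\lambda)\|^2$ at which the constraint binds exactly, and every such induced budget lies strictly below $\|\hat{b}_{OLS}\|^2$; the slack regime is reached only at $\lambda^* = 0$.

```python
import numpy as np

def ridge(X, y, lam):
    # argmin_b ||y - X b||^2 + lam * ||b||^2
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(3)
X, y = rng.normal(size=(40, 3)), rng.normal(size=40)
b_ols = ridge(X, y, 0.0)

for lam in [0.1, 1.0, 10.0]:
    b = ridge(X, y, lam)
    t_induced = b @ b                          # budget at which lam's constraint binds
    print(lam, t_induced < b_ols @ b_ols)      # True for every lam > 0
```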
So the two formulations are not equivalent. Nevertheless, Matthew Gunn's post shows in another and very intuitive way how the two are very closely connected. But duality is not equivalence.
source