Is the cost function of a neural network non-convex?


The cost function of a neural network is J(W,b), and it is claimed to be non-convex. I don't quite understand why that is, since as far as I can see it is quite similar to the cost function of logistic regression, right?

If it is not convex, then the second-order derivative ∂²J/∂W² < 0, right?

UPDATE

Thanks to the answers below, as well as @gung's comment, I understand that if there are no hidden layers, it is convex, just like logistic regression. But if there are hidden layers, then by permuting the nodes in the hidden layers, as well as the weights in the subsequent connections, we can have multiple weight solutions that result in the same loss.

Now, some more questions:

1) There are multiple local minima, and some of them should have the same value, since they correspond to permutations of nodes and weights, right?

2) If the nodes and weights are not permuted at all, then it is convex, right? And the minima will be global minima. If so, the answer to 1) would be: all those local minima have the same value, right?

avocado
It is not convex, because there can be multiple local minima.
gung - Reinstate Monica
It depends on the neural network. Neural networks with linear activation functions and squared loss yield a convex optimization problem (if my memory serves me, the same holds for radial basis function networks with fixed variances). However, neural networks are mostly used with nonlinear activation functions (e.g. sigmoid), so the optimization becomes non-convex.
Cagdas Ozgenc
@gung, so I got it right; and now I have more questions, please see my update :-)
avocado
At this point (2 years later), it might be better to roll the question back to its previous version, accept one of the answers below, and ask a new follow-up question that links here for context.
gung - Reinstate Monica
@gung, yes, you're right, but now I'm not sure about some aspects of the answer I upvoted earlier. Well, since I've left some new comments on the answers below, I'll wait a while to see whether it's necessary to ask a new one.
avocado

Answers:


The cost function of a neural network is in general neither convex nor concave. This means that the matrix of all second partial derivatives (the Hessian) is neither positive semidefinite nor negative semidefinite. Since the second derivative is a matrix, it's possible that it's neither one nor the other.

To make this analogous to one-variable functions, one could say that the cost function is neither shaped like the graph of x² nor like the graph of −x². Another example of a non-convex, non-concave function is sin(x) on R. One of the most striking differences is that ±x² has only one extremum, whereas sin has infinitely many maxima and minima.
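This can be checked numerically with the midpoint inequality: a convex function satisfies f((a+b)/2) ≤ (f(a)+f(b))/2 for all a, b, a concave function satisfies the reverse, and sin violates each of them on a suitable interval. A minimal sketch:

```python
import math

def midpoint_gap(f, a, b):
    """Convexity requires f((a+b)/2) <= (f(a)+f(b))/2 for all a, b;
    the sign of this gap separates convex-shaped from concave-shaped regions."""
    return f((a + b) / 2) - (f(a) + f(b)) / 2

# x^2 is convex: the gap is never positive.
assert midpoint_gap(lambda x: x * x, -3.0, 5.0) <= 0

# sin violates BOTH inequalities, depending on where you look,
# so it is neither convex nor concave on R.
assert midpoint_gap(math.sin, 0.0, math.pi) > 0           # violates convexity
assert midpoint_gap(math.sin, math.pi, 2 * math.pi) < 0   # violates concavity
```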

How does this relate to our neural network? A cost function J(W,b) also has a number of local maxima and minima, as you can see in this picture, for example.

The fact that J has multiple minima can also be interpreted in a nice way. In each layer, you use multiple nodes which are assigned different parameters to make the cost function small. Except for the values of the parameters, these nodes are the same. So you could exchange the parameters of the first node in one layer with those of the second node in the same layer, accounting for this change in the subsequent layers. You'd end up with a different set of parameters, but the value of the cost function would be unchanged (basically you just moved a node to another place, but kept all the inputs/outputs the same).
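This swap can be verified numerically. A minimal sketch (the layer sizes, the tanh activation, and the random parameters are arbitrary illustrative choices): permute the hidden units of a tiny two-layer network together with the matching rows of the first weight matrix and columns of the second, and check that the output, hence the loss, is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer network: x -> h = tanh(W1 @ x + b1) -> W2 @ h + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)

def forward(W1, b1, W2, b2, x):
    h = np.tanh(W1 @ x + b1)  # hidden layer
    return W2 @ h + b2        # output layer

x = rng.normal(size=3)
y_original = forward(W1, b1, W2, b2, x)

# Swap hidden units: permute the rows of W1/b1 and, to account
# for the change in the next layer, the columns of W2.
perm = np.array([2, 0, 3, 1])
y_permuted = forward(W1[perm], b1[perm], W2[:, perm], b2, x)

# Different parameter sets, identical output, hence identical loss.
assert np.allclose(y_original, y_permuted)
```

Since any loss only compares outputs to targets, every permutation of the hidden units yields a distinct parameter vector with exactly the same loss value.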

Roland
OK, I understand the permutation explanation you made; I think it makes sense. But now I wonder: is this the authentic reason why the neural net is non-convex?
avocado
What do you mean with 'authentic one'?
Roland
I mean, this is how it should be interpreted, not just an analogy.
avocado
@loganecolss You are correct that this is not the only reason why cost functions are non-convex, but it is one of the most obvious ones. Depending on the network and the training set, there might be other reasons why there are multiple minima. But the bottom line is: the permutation alone creates non-convexity, regardless of other effects.
Roland
Sorry, I cannot understand the last paragraph. Also, I don't understand why max(0,x) was mentioned here. In any case, I think the correct way to show that there may be multiple modes (multiple local minima) is to prove it in some way. P.s. an indefinite Hessian by itself says nothing: a quasiconvex function can have an indefinite Hessian but still be unimodal.
bruziuz

If you permute the neurons in the hidden layer and apply the same permutation to the weights of the adjacent layers, then the loss doesn't change. Hence, if there is a global minimum at a non-zero weight vector, it can't be unique, since permuting the weights gives another minimum. Hence the function is not convex.

Abhinav

Whether the objective function is convex or not depends on the details of the network. In the case where multiple local minima exist, you ask whether they're all equivalent. In general, the answer is no, but the chance of finding a local minimum with good generalization performance appears to increase with network size.

This paper is of interest:

Choromanska et al. (2015). The Loss Surfaces of Multilayer Networks

http://arxiv.org/pdf/1412.0233v3.pdf

From the introduction:

  • For large-size networks, most local minima are equivalent and yield similar performance on a test set.

  • The probability of finding a "bad" (high value) local minimum is non-zero for small-size networks and decreases quickly with network size.

  • Struggling to find the global minimum on the training set (as opposed to one of the many good local ones) is not useful in practice and may lead to overfitting.

They also cite some papers describing how saddle points are a bigger issue than local minima when training large networks.

user20160

Some answers for your updates:

  1. Yes, there are in general multiple local minima. (If there were only one, it would be called the global minimum.) The local minima will not necessarily be of the same value. In general, there may be no local minima sharing the same value.

  2. No, it's not convex unless it's a one-layer network. In the general multiple-layer case, the parameters of the later layers (the weights and activation parameters) can be highly recursive functions of the parameters in previous layers. Generally, multiplication of decision variables introduced by some recursive structure tends to destroy convexity. Another great example of this is MA(q) models in time series analysis.
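The effect of multiplying decision variables can be shown with a one-line example: the hypothetical function f(a, x) = (a*x - 1)^2 is convex in a alone and in x alone, but not jointly, since the midpoint of two of its minimizers lies above the chord connecting them.

```python
# f(a, x) = (a*x - 1)^2: convex in a for fixed x, convex in x for
# fixed a, but NOT jointly convex, due to the product of variables.
def f(a, x):
    return (a * x - 1) ** 2

p, q = (1.0, 1.0), (-1.0, -1.0)               # two global minima (a*x = 1)
mid = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)  # their midpoint: (0, 0)

lhs = f(*mid)                  # f(0, 0) = 1
rhs = (f(*p) + f(*q)) / 2      # (0 + 0) / 2 = 0
assert lhs > rhs               # midpoint lies ABOVE the chord: not convex
```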

Side note: I don't really know what you mean by permuting nodes and weights. If the activation function varies across nodes, for instance, and you permute the nodes, you're essentially optimizing a different neural network. That is, while the minima of this permuted network may be the same minima, it is not the same network, so you can't make a statement about the multiplicity of the same minima. For an analogy in the least-squares framework, you are, for example, swapping some rows of y and X and saying that, since the minimum of ||y − Xβ||² is the same as before, there are as many minimizers as there are permutations.

Mustafa S Eisa
"one-layer network" would be just what "softmax" or logistic regression looks like, right?
avocado
By "permuting nodes and weights" I mean "swapping", and that's what I got from the two older answers above. As I understood them, by "swapping" nodes and weights in the hidden layers we might end up with the same output in theory, and that's why we might have multiple minima. Do you mean this explanation is not correct?
avocado
You have the right idea, but it's not quite the same. For networks, the loss may not necessarily be binomial loss, the activation functions may not necessarily be sigmoids, etc.
Mustafa S Eisa
Yes, I don't think it's correct. Even though it's true that you'll get the same performance whether you permute these terms or not, this doesn't define the convexity or non-convexity of any problem. The optimization problem is convex if, for a fixed loss function (not any permutation of the terms in the loss), the objective function is convex in the model parameters and the feasible region upon which you are optimizing is convex and closed.
Mustafa S Eisa
I see, so if it's "one-layer", it might not be "softmax".
avocado

You will have one global minimum if the problem is convex or quasiconvex.

About convex "building blocks" used when building neural networks (computer-science version):

I think there are several of them which can be mentioned:

  1. max(0, x) - convex and increasing.

  2. log-sum-exp - convex and increasing in each argument.

  3. y = Ax is affine, and so convex, in A for fixed x; likewise it is affine, and so convex, in x for fixed A (increasing or decreasing depending on signs). Unfortunately, it is not convex in (A, x) jointly, because it looks like an indefinite quadratic form.

  4. The usual discrete convolution (by "usual" I mean defined with a repeating signal) Y = h∗X is an affine function of h alone and of X alone, so it is convex in h and convex in X separately. In both variables jointly, I don't think so, because when h and X are scalars the convolution reduces to an indefinite quadratic form.

  5. max(f, g) - if f and g are convex, then max(f, g) is also convex.

If you substitute one function into another to create a composition y = h(g(x), q(x)), then to stay in the convex world, h should be convex and non-decreasing in each argument.
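These claims can be spot-checked numerically. The sketch below uses random midpoint tests (a necessary condition only: such tests can refute convexity but never prove it) on the building blocks above and on the bilinear product:

```python
import numpy as np

rng = np.random.default_rng(1)

def is_midpoint_convex(f, dim, trials=1000):
    """Random midpoint checks: returns False as soon as the convexity
    inequality f((a+b)/2) <= (f(a)+f(b))/2 is violated."""
    for _ in range(trials):
        a, b = rng.normal(size=dim), rng.normal(size=dim)
        if f((a + b) / 2) > (f(a) + f(b)) / 2 + 1e-12:
            return False
    return True

relu = lambda x: float(np.maximum(0.0, x[0]))
logsumexp = lambda x: float(np.log(np.sum(np.exp(x))))
max_of_convex = lambda x: max(x[0] ** 2, abs(x[1]))  # max(f, g), f and g convex

assert is_midpoint_convex(relu, 1)           # building block 1
assert is_midpoint_convex(logsumexp, 3)      # building block 2
assert is_midpoint_convex(max_of_convex, 2)  # building block 5

# The bilinear map (a, x) -> a*x (scalar case of A, x) is NOT jointly convex:
bilinear = lambda z: z[0] * z[1]
assert not is_midpoint_convex(bilinear, 2)
```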

Why neural networks are non-convex:

  1. I think the convolution Y = h∗X is not necessarily increasing in h. So if you don't use any extra assumptions about the kernel, you leave convex optimization immediately after applying a convolution; composition alone does not save you.

  2. Also, convolution and matrix multiplication are not convex if you consider the pair of parameters jointly, as mentioned above. So there is even a problem with matrix multiplication: it is a non-convex operation in the parameters (A, x).

  3. y = Ax can be quasiconvex in (A, x), but extra assumptions need to be taken into account.

Please let me know if you disagree or have any extra consideration. The question is also very interesting to me.

p.s. Max-pooling, which is downsampling by selecting the max, looks like a modification of the elementwise max operation with an affine pre-composition (to pull out the needed blocks), and it looks convex to me.

About other questions

  1. No, the logistic regression likelihood is neither convex nor concave, but it is log-concave. This means that after applying the logarithm you have a concave function of the parameters, so the max log-likelihood trick works well here.

  2. If there is more than one global minimum, nothing can be said about the relation between the local minima. Or at least you cannot use convex optimization and its extensions for it, because that area of math is deeply based on a global underestimator.
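On point 1: concavity of the log-likelihood is equivalent to convexity of the negative log-likelihood, and the latter can be spot-checked numerically with random midpoint tests. A minimal sketch with made-up synthetic data (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data for logistic regression (synthetic, for illustration only).
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)

def nll(w):
    """Negative log-likelihood of logistic regression: convex in w.
    Per sample it is log(1 + exp(z)) - y*z with z = x.w, computed stably."""
    z = X @ w
    return float(np.sum(np.logaddexp(0.0, z) - y * z))

# Random midpoint checks: the convexity inequality should never fail.
for _ in range(200):
    a, b = rng.normal(size=3), rng.normal(size=3)
    assert nll((a + b) / 2) <= (nll(a) + nll(b)) / 2 + 1e-9
```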

Maybe this is where the confusion comes from: people who create such schemes really just do "something" and receive "something", because unfortunately we don't have a perfect mechanism for tackling non-convex optimization in general.

But there are even simpler things besides neural networks that cannot be solved, like non-linear least squares: https://youtu.be/l1X4tOoIHYo?t=2992 (EE263, L8, 50:10)

bruziuz