Why is tanh almost always better than sigmoid as an activation function?

33

In Andrew Ng's Neural Networks and Deep Learning course on Coursera, he says that using $\tanh$ is almost always preferable to using $\text{sigmoid}$.

The reason he gives is that the outputs of $\tanh$ are centered around 0 rather than around 0.5 as with $\text{sigmoid}$, and this "makes learning for the next layer a little bit easier".

Why does centering the activation's output speed up learning? I assume he is referring to the previous layer, since learning happens during backprop?
Are there any other features that make $\tanh$ preferable? Would the steeper gradient delay vanishing gradients?
Are there any situations where $\text{sigmoid}$ would be preferable?

Mathematical, intuitive answers are preferred.

machine-learning neural-networks backpropagation sigmoid-curve Tom Hale

13

A sigmoid function is S-shaped (hence the name). Presumably you are talking about the logistic function $\frac{e^x}{1+e^x}$. Apart from scale and location, the two are essentially the same: $\text{logistic}(x)=\frac12 +\frac12\tanh(\frac{x}2)$. So the real choice is whether you want outputs in the interval $(-1,1)$ or in the interval $(0,1)$.
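The relation between the logistic function and tanh can be checked numerically. The following is a minimal sketch using only the standard library; the function names and the sample points are chosen here for illustration and are not part of the answer.

```python
import math

def logistic(x: float) -> float:
    """Logistic function e^x / (1 + e^x), written in the equivalent form 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def logistic_via_tanh(x: float) -> float:
    """The same function expressed through tanh: 1/2 + 1/2 * tanh(x / 2)."""
    return 0.5 + 0.5 * math.tanh(x / 2.0)

# Sample points are arbitrary; any moderate real values work.
for x in (-6.0, -1.5, 0.0, 0.7, 4.0):
    assert math.isclose(logistic(x), logistic_via_tanh(x), abs_tol=1e-12)
    print(f"x={x:+.1f}  logistic={logistic(x):.6f}  via tanh={logistic_via_tanh(x):.6f}")
```

Both columns agree to floating-point precision, which is what "the same apart from scale and location" means in practice.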

Henry

21

Yann LeCun and others argue in Efficient BackProp that

Convergence is usually faster if the average of each input variable over the training set is close to zero. To see this, consider the extreme case where all the inputs are positive. Weights to a particular node in the first weight layer are updated by an amount proportional to $\delta x$ where $\delta$ is the (scalar) error at that node and $x$ is the input vector (see equations (5) and (10)). When all of the components of an input vector are positive, all of the updates of weights that feed into a node will be the same sign (i.e. sign($\delta$)). As a result, these weights can only all decrease or all increase together for a given input pattern. Thus, if a weight vector must change direction it can only do so by zigzagging, which is inefficient and thus very slow.

This is why you should normalize your inputs so that the average is zero.

The same logic applies to the middle layers:

This heuristic should be applied at all layers, which means that we want the average of the outputs of a node to be close to zero because these outputs are the inputs to the next layer.
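The sign argument in the quote can be checked with a toy calculation. The sketch below assumes a single node whose weight update is proportional to $\delta x$; the input vectors, the learning rate, and the values of `delta` are made up for illustration and do not come from the paper.

```python
def weight_update(delta: float, x: list[float], lr: float = 0.1) -> list[float]:
    """Gradient-descent step for one node: each weight moves by -lr * delta * x_i."""
    return [-lr * delta * xi for xi in x]

def signs(values: list[float]) -> set[int]:
    """Set of signs (+1 or -1) present among the non-zero entries."""
    return {1 if v > 0 else -1 for v in values if v != 0}

# Hypothetical inputs: one all-positive vector, one centered around zero.
x_positive = [0.2, 0.9, 0.5, 0.7]
x_centered = [-0.4, 0.3, -0.1, 0.2]

for delta in (0.8, -0.8):
    upd_pos = weight_update(delta, x_positive)
    upd_cen = weight_update(delta, x_centered)
    # All-positive inputs: every weight moves in the same direction.
    print(f"delta={delta:+.1f}  positive inputs -> signs {signs(upd_pos)}")
    # Zero-centered inputs: weights can move in different directions.
    print(f"delta={delta:+.1f}  centered inputs -> signs {signs(upd_cen)}")
```

With all-positive inputs the update has a single sign for each `delta`, so the weight vector can only move along a restricted set of directions in one step; with centered inputs both signs appear.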

Postscript: @craq makes the point that this quote doesn't make sense for ReLU(x) = max(0, x), which has become a very popular activation function. While ReLU does avoid the first zigzag problem mentioned by LeCun, it doesn't address LeCun's second point, that it is important to push the average to zero. I would love to know what LeCun has to say about this. In any case, there is a paper called Batch Normalization, which builds on LeCun's work and offers a way to address this issue:

It has been long known (LeCun et al., 1998b; Wiesler & Ney, 2011) that the network training converges faster if its inputs are whitened, i.e., linearly transformed to have zero means and unit variances, and decorrelated. As each layer observes the inputs produced by the layers below, it would be advantageous to achieve the same whitening of the inputs of each layer.
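To make "zero means and unit variances" concrete, here is a minimal sketch that standardizes a single feature. It covers only the mean/variance part of whitening (no decorrelation), and it is not the batch normalization algorithm itself, which additionally learns a scale and shift per activation; the sample values are invented.

```python
import statistics

def standardize(values: list[float]) -> list[float]:
    """Shift to zero mean and rescale to unit (population) variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical activations from a logistic unit: all positive, mean near 0.6.
activations = [0.55, 0.70, 0.62, 0.48, 0.66]
z = standardize(activations)

print("mean before:", statistics.fmean(activations))
print("mean after: ", statistics.fmean(z))
print("std after:  ", statistics.pstdev(z))
```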

By the way, this video by Siraj explains a lot about activation functions in 10 fun minutes.

@elkout says "The real reason that tanh is preferred compared to sigmoid (...) is that the derivatives of the tanh are larger than the derivatives of the sigmoid."

I think this is a non-issue. I have never seen this being a problem in the literature. If it bothers you that one derivative is smaller than another, you can just scale it.

The logistic function has the form $\sigma(x)=\frac{1}{1+e^{-kx}}$ . Usually we use $k=1$ , but nothing forbids you from using another value of $k$ to make your derivatives larger, if that was your problem.
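To see what rescaling does, note that the derivative of $\sigma(x)=\frac{1}{1+e^{-kx}}$ is $k\,\sigma(x)(1-\sigma(x))$, which peaks at $k/4$ when $x=0$. The sketch below checks this with a central finite difference; the helper names and the step size `h` are illustrative choices, not something from the answer.

```python
import math

def logistic_k(x: float, k: float = 1.0) -> float:
    """Logistic function with steepness k: 1 / (1 + exp(-k * x))."""
    return 1.0 / (1.0 + math.exp(-k * x))

def slope_at_zero(f, h: float = 1e-6) -> float:
    """Central finite-difference estimate of f'(0)."""
    return (f(h) - f(-h)) / (2.0 * h)

for k in (1.0, 2.0, 4.0):
    slope = slope_at_zero(lambda x: logistic_k(x, k))
    print(f"k={k}: slope at 0 is about {slope:.4f} (analytic value k/4 = {k / 4})")

# For comparison, tanh has slope 1 at the origin, the same as k = 4 here.
print(f"tanh: slope at 0 is about {slope_at_zero(math.tanh):.4f}")
```

So choosing $k=4$ gives the logistic function the same maximum slope as tanh, which is the sense in which the derivative can "just be scaled".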

Nitpick: tanh is also a sigmoid function. Any function with an S shape is a sigmoid. What you are calling sigmoid is the logistic function. The logistic function is more popular for historical reasons: it has been used by statisticians for a long time. Besides, some feel that it is more biologically plausible.

Ricardo Cruz

1

You don't need a citation to show that $\max_x \sigma^\prime(x) < \max_x \tanh^\prime(x)$ , just high-school calculus. $\sigma^\prime(x) = \sigma(x) (1 - \sigma(x)) \le 0.25$ . We know this is true because $0 < \sigma(x) < 1$ , so you just have to maximize a concave quadratic. $\tanh^\prime(x) = \text{sech}^2(x) = \left(\frac{2}{\exp(x) + \exp(-x)}\right)^2 \le 1.0$ , which can be verified by inspection.
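Spelling out the two maximizations mentioned above, with the shorthand $s = \sigma(x)$ introduced here only for the derivation:

$$
\sigma'(x) = s(1-s) = \tfrac14 - \left(s - \tfrac12\right)^2 \le \tfrac14, \quad \text{with equality at } s = \tfrac12 \text{, i.e. } x = 0,
$$

$$
\tanh'(x) = 1 - \tanh^2(x) \le 1, \quad \text{with equality at } x = 0.
$$

Both derivatives peak at the origin, and the peak of $\tanh'$ is four times that of $\sigma'$.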

Sycorax says Reinstate Monica

Apart from that, I said that in most cases the derivatives of tanh are larger than the derivatives of the sigmoid. This happens mostly when we are around 0. You are welcome to have a look at this link and at the clear answers provided in this question, which also state that the derivatives of $\tanh$ are usually larger than the derivatives of the $\text{sigmoid}$ .

ekoulier

hang on... that sounds plausible, but if middle layers should have an average output of zero, how come ReLU works so well? Isn't that a contradiction?

craq

@ekoulier, the derivative of $\text{tanh}$ being larger than that of $\text{sigmoid}$ is a non-issue. You can just scale it if it bothers you.

Ricardo Cruz

@craq, good point, I think that's a flaw in LeCun's argument indeed. I have added a link to the batch normalization paper where it discusses more about that issue and how it can be ameliorated. Unfortunately, that paper doesn't compare relu with tanh, it only compares relu with logistic (sigmoid).

Ricardo Cruz

14

It's not that it is necessarily better than $\text{sigmoid}$ . In other words, it's not the center of an activation function that makes it better. The idea behind both functions is the same, and they also share a similar "trend". Needless to say, the $\tanh$ function can be described as a shifted and rescaled version of the $\text{sigmoid}$ function.

The real reason that $\text{tanh}$ is preferred compared to $\text{sigmoid}$ , especially when it comes to big data, where you are usually struggling to quickly find the local (or global) minimum, is that the derivatives of the $\text{tanh}$ are larger than the derivatives of the $\text{sigmoid}$ . In other words, you minimize your cost function faster if you use $\text{tanh}$ as an activation function.

But why does the hyperbolic tangent have larger derivatives? Just to give you a very simple intuition you may observe the following graph:

The fact that the range is between -1 and 1, compared to 0 and 1, makes the function more convenient for neural networks. Apart from that, with some math, I can prove that:

$\tanh{x} = 2\sigma(2x)-1$

And in general, we may prove that in most cases $\Big|\frac{\partial\tanh (x)}{\partial x}\Big| > \Big|\frac{\partial\sigma (x)}{\partial x}\Big|$ .
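How often the inequality holds can be probed numerically. The sketch below first checks the identity and its differentiated form $\tanh'(x) = 4\sigma'(2x)$, then uses bisection to locate the positive $x$ at which the two derivatives are equal. The helper names, the search bracket, and the sample points are assumptions made for this illustration.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x: float) -> float:
    """sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x: float) -> float:
    """tanh'(x) = 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

# Differentiating tanh(x) = 2*sigma(2x) - 1 gives tanh'(x) = 4*sigma'(2x).
for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
    assert math.isclose(math.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0, abs_tol=1e-12)
    assert math.isclose(d_tanh(x), 4.0 * d_sigmoid(2.0 * x), abs_tol=1e-12)

def crossover(lo: float = 0.5, hi: float = 3.0, steps: int = 60) -> float:
    """Bisection for the x > 0 where tanh'(x) and sigma'(x) are equal."""
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if d_tanh(mid) > d_sigmoid(mid):
            lo = mid  # tanh' still larger: the crossing lies to the right
        else:
            hi = mid
    return (lo + hi) / 2.0

x_star = crossover()
print(f"tanh'(x) > sigma'(x) for |x| < {x_star:.3f}")
```

The crossing sits at roughly $|x| \approx 1.66$: inside that band the tanh derivative is the larger one, while further out both derivatives are small and the sigmoid's is slightly larger. "In most cases" therefore refers to inputs near zero.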

ekoulier

So why would Prof. Ng say that it's an advantage to have the output of the function averaging around $0$ ?

Tom Hale

2

It's not the fact that the average is around 0 that makes $\tanh$ faster. It's the fact that being around zero means that the range is also greater (compared to being around 0.5 in the case of $\text{sigmoid}$ ), which leads to larger derivatives, which almost always leads to faster convergence to the minimum. I hope that it is clear now. Ng is right that we prefer the $\tanh$ function because it is centered around 0, but he just didn't provide the complete justification.

ekoulier

Zero-centering is more important than the $2x$ ratio, because it skews the distribution of activations and that hurts the performance. If you take sigmoid(x) - 0.5 and a $2x$ smaller learning rate, it will learn on par with tanh.

Maxim

@Maxim Which "it" skews the distribution of activations, zero-centering or $2x$ ? If zero-centering is a Good Thing, I still don't feel that the "why" of that has been answered.

Tom Hale

3

Answering the part of the question so far unaddressed:

Andrew Ng says that using the logistic function (commonly known as sigmoid) really only makes sense in the final layer of a binary classification network.

As the output of the network is expected to be between $0$ and $1$ , the logistic is a perfect choice as its range is exactly $(0, 1)$ . No scaling and shifting of $\tanh$ required.

Tom Hale

For the output, the logistic function makes sense if you want to produce probabilities, we can all agree on that. What is being discussed is why tanh is preferred over the logistic function as an activation for the middle layers.

Ricardo Cruz

How do you know that's what the OP intended? It seems he was asking a general question.

Tom Hale

2

It all essentially depends on the derivatives of the activation function. The main problem with the sigmoid function is that the maximum value of its derivative is 0.25, which means that the updates to the values of W and b will be small.

The tanh function, on the other hand, has a derivative of up to 1.0, making the updates of W and b much larger.

This makes the tanh function almost always a better activation function (for hidden layers) than the sigmoid function.

To prove this myself (at least in a simple case), I coded a simple neural network and used sigmoid, tanh and relu as activation functions, then I plotted how the error value evolved and this is what I got.

The full notebook I wrote is here https://www.kaggle.com/moriano/a-showcase-of-how-relus-can-speed-up-the-learning

If it helps, here are the charts of the derivatives of the tanh function and the sigmoid one (pay attention to the vertical axis!)
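As a rough illustration of why the maximum derivative matters more as depth grows: backprop multiplies one activation derivative per layer, so the activation's contribution to the gradient is bounded by its maximum derivative raised to the number of layers. The sketch below prints that bound for a few depths. It deliberately ignores the weights and the actual pre-activation values, so it is an upper bound on the activation factor only, not a prediction of training speed; the depths and names are arbitrary choices for this example.

```python
MAX_DERIVATIVE = {"sigmoid": 0.25, "tanh": 1.0}  # maxima quoted in the answer

def activation_factor_bound(name: str, depth: int) -> float:
    """Upper bound on the product of `depth` activation derivatives."""
    return MAX_DERIVATIVE[name] ** depth

for depth in (1, 5, 10):
    s = activation_factor_bound("sigmoid", depth)
    t = activation_factor_bound("tanh", depth)
    print(f"depth={depth:2d}  sigmoid bound={s:.2e}  tanh bound={t:.2e}")
```

The sigmoid bound shrinks geometrically with depth while the tanh bound stays at 1, although in practice tanh derivatives are also below 1 away from the origin.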

Juan Antonio Gomez Moriano

(-1) Although this is an interesting idea, it doesn't stand on its own. In particular, most optimization methods used for DL/NN are first-order gradient methods, which have a learning rate $\alpha$ . If the max derivative with regard to one activation function is too small, one could easily just increase the learning rate.

Cliff AB

Don't you run the risk of not having a stable learning curve with a higher learning rate?

Juan Antonio Gomez Moriano

Well, if the derivatives are more stable, then increasing the learning rate is less likely to destabilize the estimation.

Cliff AB

That's a fair point, do you have a link where I could learn more of this?

Juan Antonio Gomez Moriano
