When do I need Softmax activation function and when Sigmoid activation function?


When training a neural network, I sometimes have to use Sigmoid as the activation function, while other models use Softmax. What are the differences, and when should I pick one over the other?

The Sigmoid activation function converts any input into a number ranging from zero to one:

$$\begin{equation} y = f(x) = \frac{1}{1 + e^{-x}} \end{equation}$$

For this reason, it is used in, e.g., binary classification problems, where you must predict one of two classes: the output is the predicted probability that the sample belongs to class 1 (from classes 0, 1).

An output of 0.33 means that class 0 is more likely; 0.50 is exactly in between, and anything > 0.50 signals class 1.

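For this reason, the actual class from a Sigmoid outcome can be determined by thresholding the single output value at 0.50, i.e. the class is 1 if $$Sigmoid(x) > 0.5$$ and 0 otherwise. Because Sigmoid produces one number rather than an array, there is nothing to take an argmax over.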
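The Sigmoid formula and the 0.50 cut-off can be sketched in plain Python. This is only a minimal illustration: the helper names `sigmoid` and `predict_class` and the example logits are made up for this answer, and in a real model your framework's built-in activation would normally be used instead.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + e^{-x}), split by sign to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe: x < 0, so exp(x) < 1
    return z / (1.0 + z)

def predict_class(x: float, threshold: float = 0.5) -> int:
    """Return class 1 if the Sigmoid output is above the threshold, else class 0."""
    return 1 if sigmoid(x) > threshold else 0

# Hypothetical logits, chosen only for illustration.
for logit in (-0.7, 0.0, 2.0):
    print(logit, round(sigmoid(logit), 2), predict_class(logit))
```

A logit of -0.7 gives roughly 0.33 and is therefore assigned to class 0, while a logit of 2.0 gives roughly 0.88 and is assigned to class 1.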

The Softmax activation function, however, converts an array of inputs into a discrete probability distribution by dividing the exponential of each input value by the sum of the exponentials of all the input values.

$$Softmax(x_i) = \frac{\exp(x_i)}{\sum_{j} \exp(x_j)}$$

The inputs (called the logits) and the outputs of the Softmax-activated layer are each equal in number to the number of classes. For this reason, and because Softmax generates outputs that adhere to Kolmogorov’s probability axioms (i.e. for each output, $$\text{output} \in [0, 1]$$ and the sum of outputs equals $$1.0$$, i.e. 100%), Softmax generates a probability distribution over the classes. While it can definitely be used for binary classification problems, it’s more commonly used in multiclass classifiers, i.e. where the number of classes > 2. The predicted class is found by taking the argmax of the output array, i.e. $$\text{argmax} \ Softmax(\textbf{x})$$.
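As a rough sketch of the Softmax formula and the argmax step, here is a plain-Python version. The helper names `softmax` and `argmax` and the three logits are assumptions made for this example. Subtracting the largest logit before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math

def softmax(logits):
    """Turn a list of logits into a discrete probability distribution."""
    m = max(logits)                           # shift by the max for numerical stability
    exps = [math.exp(v - m) for v in logits]  # exponential of each (shifted) logit
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value, i.e. the predicted class."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [2.0, 1.0, 0.1]   # hypothetical logits for three classes
probs = softmax(logits)

print([round(p, 3) for p in probs])   # one probability per class
print(argmax(probs))                  # index of the most likely class
```

With these logits the probabilities come out at roughly 0.659, 0.242 and 0.099, so the predicted class is index 0.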

If you are training a binary classifier with a neural network, you can therefore choose between Sigmoid (and then use binary crossentropy loss) and Softmax (with categorical or sparse categorical crossentropy loss, depending on the structure of your targets); if it’s multiclass, you use Softmax (with one of those two categorical crossentropy losses).
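To illustrate why both choices work for the binary case, the sketch below compares a single Sigmoid output against a two-class Softmax whose class-0 logit is fixed at 0. The function names, the logit value and the target are all assumptions for this example. The loss functions are simplified single-sample versions, not any library's implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def binary_crossentropy(y_true, p):
    """Loss for one Sigmoid output p = P(class 1), target y_true in {0, 1}."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def categorical_crossentropy(one_hot, probs):
    """Loss for a Softmax output when the target is one-hot encoded."""
    return -sum(t * math.log(q) for t, q in zip(one_hot, probs))

def sparse_categorical_crossentropy(class_index, probs):
    """Same loss when the target is an integer class index."""
    return -math.log(probs[class_index])

x = 1.5   # hypothetical logit for class 1
y = 1     # hypothetical true class

p_sigmoid = sigmoid(x)            # P(class 1) from a single Sigmoid unit
p_softmax = softmax([0.0, x])     # two Softmax units, class-0 logit fixed at 0

loss_sigmoid = binary_crossentropy(y, p_sigmoid)
loss_one_hot = categorical_crossentropy([0, 1], p_softmax)
loss_sparse = sparse_categorical_crossentropy(y, p_softmax)

print(p_sigmoid, p_softmax[1])
print(loss_sigmoid, loss_one_hot, loss_sparse)
```

Under these assumptions the class-1 probability and all three losses agree up to floating-point error. The two categorical variants differ only in how the target is encoded: one-hot vector versus integer index.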

Hope this helps!

MachineCurve. (2019, September 4). ReLU, sigmoid and tanh: Today’s most used activation functions – MachineCurve. https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/

MachineCurve. (2020, January 12). How does the Softmax activation function work? – MachineCurve. https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/

MachineCurve. (2020, November 11). 3 variants of classification problems in machine learning – MachineCurve. https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/#variant-1-binary-classification