Why is Softmax called ‘Softmax’?
A week back, I was studying for the finals of Machine Learning for Graphs and Sequential Data. Everything was business as usual: I was working through the heavy math present in almost every slide of the course. However, I came across something that many people in the field of Machine Learning have probably overlooked. And believe me when I say this: neither I nor most others would ever have cared enough to figure it out on our own.
There was an optimization problem involving the $argmax()$ function, and the instructor said:
Since the argmax function is not differentiable, we use softmax instead
I paused the lecture, went back, and listened to it again (thanks to online classes ;)). I knew that $argmax()$ is not differentiable when there are equal arguments. I also knew that $softmax()$ transforms an input vector into a probability distribution. What came as a surprise was that there is possibly a relation between the two. At that moment, I had to let the thought go; I could not afford another adventure besides the one where I had to study for and pass the course.
Fast forward to today: I googled “relation between argmax and softmax” and was blown away by what I found. It just so happens that $softmax()$ is a smooth (or soft) approximation of the $argmax()$ function. But then why is it named softmax? Actually, the name is a misnomer. As per the Wikipedia page for softmax:
The name “softmax” is misleading; the function is not a smooth maximum (a smooth approximation to the maximum function), but is rather a smooth approximation to the arg max function: the function whose value is which index has the maximum. In fact, the term “softmax” is also used for the closely related LogSumExp function, which is a smooth maximum. For this reason, some prefer the more accurate term “softargmax”, but the term “softmax” is conventional in machine learning.
In other words, $max()$ gives the hard maximum of the vector, whereas $logsumexp()$ is a smooth approximation to it. Similarly, $softmax()$ is a smooth approximation of $argmax()$. For this reason, softmax should ideally be called softargmax, and LogSumExp would then be called (drum roll) RealSoftMax.
This information has always been out there, yet it felt as though I had stumbled upon a very recent development in Machine Learning. To confirm the result myself, I fired up a Jupyter notebook to write some code. Before we begin, let’s look at the mathematical expressions of softmax and LogSumExp.
The softmax function is defined as:
\[\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum^{K}_{j=1}e^{z_j}} \quad \text{for }i=1,...,K\text{ and }\mathbf{z} = (z_1,...,z_K) \in \mathbb{R}^K\]

And the LogSumExp is defined as:

\[\text{LSE}(\mathbf{z}) = \log\left(e^{z_1} + ... + e^{z_K}\right) \quad \text{for }\mathbf{z} = (z_1,...,z_K) \in \mathbb{R}^K\]

Let’s create functions for both using NumPy.
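Here is a minimal sketch of the two functions in plain NumPy (the names softmax and logsumexp are my own choice, and the max-shift is just a standard numerical-stability trick; it does not change the results):

```python
import numpy as np

def softmax(z):
    # Exponentiate after shifting by the max (for numerical stability), then normalise.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def logsumexp(z):
    # Same shift, added back outside the log so the value is unchanged.
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))
```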
Now, let’s create a simple 1D array. Since softmax returns a probability distribution over the array, I created a one-hot representation of the argmax instead of the raw index, which makes the two easier to compare.
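A comparison along these lines should do (the array [1.0, 2.0, 3.0, 6.0] is just an arbitrary example with one clear maximum):

```python
z = np.array([1.0, 2.0, 3.0, 6.0])

# One-hot representation of the argmax, for comparison with softmax.
one_hot = np.zeros_like(z)
one_hot[np.argmax(z)] = 1.0

print("softmax:", np.round(softmax(z), 3))  # ≈ [0.006 0.017 0.046 0.93]
print("argmax :", one_hot)                  # [0. 0. 0. 1.]
```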
From the results, the softmax is pretty close to the one-hot encoded argmax output. Neat!
Now, for the max and LogSumExp relation:
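With the same example array, a quick sanity check could look like this:

```python
print("max      :", np.max(z))      # 6.0
print("logsumexp:", logsumexp(z))   # ≈ 6.072
```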
Again, the results are pretty similar.
Now that you know them, these relations turn out to be genuinely useful. Both softmax and LogSumExp are differentiable everywhere, while argmax and max are not when there are equal arguments. This is why softmax and LogSumExp are used so extensively in Machine Learning as smooth approximations. If you liked this post, be sure to share it with others and leave a comment below.