https://d2l.ai/chapter_appendix-mathematics-for-deep-learning/geometry-linear-algebraic-ops.html

The majority of deep learned classification models end with a linear layer fed into a softmax, so one can interpret the role of the deep neural network to be to find a non-linear embedding such that the target classes can be separated cleanly by hyperplanes.

Hi @astonzhang, could you please explain more about how **softmax** and **hyperplane** are related? Is it **softmax** that is the **non-linear embedding** method here?

Hi @ming_chen, great question. Here is the answer from the author Brent (his account is currently locked):

Consider the case of a two-component softmax (the others work similarly) on top of a linear layer. In this case the two outputs are $\frac{e^{-w_1\cdot x}}{e^{-w_1 \cdot x}+e^{-w_2 \cdot x}}$ and $\frac{e^{-w_2\cdot x}}{e^{-w_1 \cdot x}+e^{-w_2 \cdot x}}$. To make our decision, we need to see which of these probabilities is larger. The boundary between the two decisions occurs where they are equal, which happens when $\frac{e^{-w_1\cdot x}}{e^{-w_1 \cdot x}+e^{-w_2 \cdot x}}=\frac{e^{-w_2\cdot x}}{e^{-w_1 \cdot x}+e^{-w_2 \cdot x}}$. The denominators are the same, so they can be discarded, leaving $e^{-w_1\cdot x} = e^{-w_2\cdot x}$. Taking logarithms gives $-w_1\cdot x = -w_2\cdot x$, i.e. $w_1\cdot x = w_2\cdot x$, which can be rearranged (subtract the right side from the left) to give $(w_1-w_2)\cdot x = 0$. This is the equation of a hyperplane with normal vector $w_1-w_2$ (passing through the origin, since there is no bias term here).
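To make the derivation concrete, here is a minimal numerical check in plain Python. It is only a sketch: the weight vectors `w1` and `w2`, the helper names, and the sampling range are made up for illustration, and it keeps the sign convention used above (logits are $-w_i \cdot x$). Under that convention, the first output is the larger one exactly when $(w_1-w_2)\cdot x < 0$.

```python
import math
import random

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def two_class_softmax(w1, w2, x):
    """Two-component softmax with logits -w1.x and -w2.x (convention from the post)."""
    z1, z2 = -dot(w1, x), -dot(w2, x)
    m = max(z1, z2)  # subtract the max for numerical stability
    e1, e2 = math.exp(z1 - m), math.exp(z2 - m)
    return e1 / (e1 + e2), e2 / (e1 + e2)

# Illustrative weights (assumed values, not from the book).
w1 = [1.0, -2.0, 0.5]
w2 = [-0.5, 1.0, 2.0]
normal = [a - b for a, b in zip(w1, w2)]  # w1 - w2, the hyperplane normal

random.seed(0)
agree = True
for _ in range(1000):
    x = [random.uniform(-3, 3) for _ in range(3)]
    side = dot(normal, x)
    if abs(side) < 1e-9:
        continue  # skip points numerically on the boundary
    p1, p2 = two_class_softmax(w1, w2, x)
    # The softmax decision should match the side of the hyperplane.
    if (p1 > p2) != (side < 0):
        agree = False

# A point chosen to lie on the hyperplane: (w1 - w2) . x_on = 1.5 - 1.5 = 0.
x_on = [1.0, 0.0, 1.0]
p_on = two_class_softmax(w1, w2, x_on)

print("decision matches hyperplane side:", agree)
print("probabilities on the boundary:", p_on)
```

Every sampled point is classified according to which side of the hyperplane it falls on, and a point on the hyperplane gets equal probabilities for the two outputs.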

Indeed, if you have many potential outputs, the boundary between any pair of them is a hyperplane, so you can view the action of the softmax on a linear layer as dividing the layer's input space into regions, one per output, separated by hyperplanes.
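The multi-output case can be checked the same way. The sketch below assumes a hypothetical random weight matrix `W` with 5 outputs and 4 input features (both numbers are arbitrary) and again uses logits $-w_i \cdot x$. It verifies that whenever output $i$ has the largest probability, the point lies on the side $(w_i - w_j)\cdot x \le 0$ of every pairwise hyperplane, so each output's region is an intersection of half-spaces.

```python
import math
import random

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(1)
# Hypothetical linear layer: 5 outputs, 4 input features (no bias).
W = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]

def predicted_class(x):
    """Index of the largest softmax output, with logits -w_i . x."""
    probs = softmax([-dot(w, x) for w in W])
    return max(range(len(W)), key=lambda k: probs[k])

def in_half_spaces(i, x, tol=1e-9):
    """True if (w_i - w_j) . x <= 0 for every other output j."""
    return all(
        dot([a - b for a, b in zip(W[i], W[j])], x) <= tol
        for j in range(len(W))
        if j != i
    )

samples = [[random.uniform(-3, 3) for _ in range(4)] for _ in range(1000)]
all_consistent = all(in_half_spaces(predicted_class(x), x) for x in samples)

print("winning output always lies inside its half-spaces:", all_consistent)
```

Note that the softmax itself never changes which output wins: it is monotone in the logits, so the regions are determined entirely by the linear layer.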