# How to do backpropogation with Softmax and Mean Square Error?

I want to compare the losses for NumPy and PyTorch implementation with a softmax layer and mean square error loss. Here is my code with 2 hidden layers and final softmax layer (output layer) and MSE loss.

Below is my code for NumPy and PyTorch:

``````import numpy as np
from copy import deepcopy
np.random.seed(99)

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 16, 100, 10, 2

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

xx = np.random.randn(D_in, H)
yy = np.random.randn(H, D_out)
w1 = deepcopy(xx)
w2 = deepcopy(yy)

def softmax(x):
func = np.exp(x - np.max(x))
return func / func.sum(axis=0)

learning_rate = 1e-4
for t in range(500):
# Forward pass: compute predicted y
h = np.dot(x, w1)
u = np.maximum(h, 0)

z = np.dot(u, w2)
y_pred = softmax(z)

# Compute and print loss
loss = np.square(y_pred - y).sum()
if t % 50 == 0:
print("Epoch [{:3d}/{:3d}] Loss: {:.10f}".format(t, 500, loss))

# Backprop to compute gradients of w1 and w2 with respect to loss
grad_y_pred = 2.0 * (y_pred - y)

# Update weights

print()

# ~~~~~~~~~~~~~torch implemantation~~~~~~~~~~~~~~

import torch
import torch.nn as nn

m = nn.Softmax(dim=0)

x_ = torch.from_numpy(x).type(torch.float64)
y_ = torch.from_numpy(y).type(torch.float64)

w1_ = torch.from_numpy(deepcopy(xx)).type(torch.float64)
w2_ = torch.from_numpy(deepcopy(yy)).type(torch.float64)

for t in range(500):
h_ = x_.mm(w1_)
u_ = h_.clamp(min = 0)
z = u_.mm(w2_)
y_pred = m(z)

loss_ = (y_pred - y_).pow(2).sum()
if t % 50 == 0:
print("Epoch [{:3d}/{:3d}] Loss: {:.10f}".format(t, 500, loss_.item()))
loss_.backward()

``````

and here is the output.

``````Output for NumPy implementation:

Epoch [  0/500] Loss: 29.7599750339
Epoch [ 50/500] Loss: 29.7568945381
Epoch [100/500] Loss: 29.7530540477
Epoch [150/500] Loss: 29.7481408013
Epoch [200/500] Loss: 29.7416476795
Epoch [250/500] Loss: 29.7326987989
Epoch [300/500] Loss: 29.7196529397
Epoch [350/500] Loss: 29.6990826814
Epoch [400/500] Loss: 29.6626811106
Epoch [450/500] Loss: 29.5857445210

Output for PyTorch implementation:

Epoch [  0/500] Loss: 29.7599750339
Epoch [ 50/500] Loss: 29.7555138011
Epoch [100/500] Loss: 29.7493562321
Epoch [150/500] Loss: 29.7403440862
Epoch [200/500] Loss: 29.7260022571
Epoch [250/500] Loss: 29.7000586454
Epoch [300/500] Loss: 29.6418893830
Epoch [350/500] Loss: 29.4537482662
Epoch [400/500] Loss: 28.9717000115
Epoch [450/500] Loss: 28.8537886818
``````

But I am not getting a similar output. Am I doing anything wrong while backpropagating in my NumPy implementation? Please help. Thank in advance.

I guess, the reason is that pytorchâ€™s internal encapsulation reduces the margin of error of float `grad`. So it goes down quickly.
I havenâ€™t seen the source code to confirm it.

1 Like

And Iâ€™m confused about your `softmax(x)`

``````def softmax(x):
func = np.exp(x - np.max(x))
return func / func.sum(axis=0)
``````

And my `softmax(x)` is different.

``````def softmax(x):
func = np.exp(x)
return func / func.sum(axis=0)
``````

But the result is same:

Why?

Hi @StevenJokes,

Thanks for the reply. Implementation is different but the output are the same. Here you can verify that.

``````import numpy as np

def mySoftmax(x):
func = np.exp(x - np.max(x))
return func / func.sum(axis=0)

def StevenSoftmax(x):
func = np.exp(x)
return func / func.sum(axis=0)

t = np.arange(0, 1, 0.1)
print("My Output: ", mySoftmax(t))
print()
print("Steven Output: ", StevenSoftmax(t))
``````

Ouput:

``````My Output:  [0.06120702 0.06764422 0.07475843 0.08262084 0.09131015 0.10091332
0.11152647 0.12325581 0.13621874 0.15054499]

Steven Output:  [0.06120702 0.06764422 0.07475843 0.08262084 0.09131015 0.10091332
0.11152647 0.12325581 0.13621874 0.15054499]``````

We minus the â€śmax(x)â€ť to avoid numerical overflow from large elements.

I got it, But if the \$x - max(x)\$ is a too small negative number, it also could cause numerical overflow.
Or we should judge how to avoid numerical overflow from the overall numbersâ€™ distribution
Because \$x - any_number\$ works too. any_number should be close to any x.

Thanks. Divided by a same number wonâ€™t change the value. I forgot it!

Good call. Check https://d2l.ai/chapter_linear-networks/softmax-regression-concise.html#softmax-implementation-revisited for more tricks!