StevenJokes
1. When test accuracy increases most quickly and reaches a high value, can we say that this hyperparameter is the best value?
Unless you are sure the given optimization problem is convex, we hardly ever speak of the "best" model or the "best" set of hyperparameters.
def net(X):
    X = X.reshape((-1, num_inputs))
    H = relu(X@W1 + b1)  # Here '@' stands for matrix multiplication
    return (H@W2 + b2)
In the last line, shouldn't we have applied the softmax function to the return value H@W2 + b2? Isn't there a chance that this operation would yield a negative value or a value greater than 1?
When I do use the softmax function, the train loss disappears and the accuracy suddenly drops to 0. What could be the cause behind this?
In the last line shouldn't we have applied the softmax function to the return value H@W2 + b2? Isn't there a chance that this operation would yield a negative value or a value greater than 1?
We use the loss function to process the output values of net(X), so it does not need to yield a value in (0, 1).
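To illustrate this, here is a minimal sketch (illustrative values, assuming the training loop uses nn.CrossEntropyLoss as in the book): the loss takes the raw logits and applies log-softmax internally, so net(X) can return values that are negative or greater than 1.
```python
import torch
from torch import nn

# Minimal sketch (illustrative values): nn.CrossEntropyLoss expects raw
# logits and applies log-softmax internally, so net(X) does not need to
# return values in (0, 1).
logits = torch.tensor([[2.0, -1.0, 0.5]])  # hypothetical output of H@W2 + b2
label = torch.tensor([0])

loss = nn.CrossEntropyLoss()
manual = -torch.log_softmax(logits, dim=1)[0, label.item()]  # same computation done by hand
print(loss(logits, label).item(), manual.item())  # both print the same number
```
If softmax is additionally applied inside net, the loss applies log-softmax on top of already-normalized values, which compresses the outputs and shrinks the gradients; that can stall training, which may be related to the problem described above.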
When I do use the softmax function, the train loss disappears and the accuracy suddenly drops to 0. What could be the cause behind this?
Could you show us the code so that we can reproduce the results?
What's your IDE? I'm curious. Thanks.
I think it's VS Code with a plugin for viewing notebooks.
What’s your IDE?
@Xiaomut
I am using VS Code, and I can't find this.
@Kushagra_Chaturvedy
@ccpvirus
We use softmax to calculate the probabilities first, and then find the one with the maximum probability.
Hi @ccpvirus, as @StevenJokes mentioned, we use the "maximum" value across the 10 class outputs as our final label. Since softmax is just a "rescaling" function, it doesn't affect whether a prediction output (i.e., a class label) is or isn't the maximum over all classes. Let me know whether this is clear to you.
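As a quick check of this argument, a minimal sketch with made-up logits: softmax preserves the ordering of the outputs, so the argmax is the same whether it is taken over the raw outputs or over the softmax probabilities.
```python
import torch

# Minimal sketch (made-up logits): softmax preserves the ordering of the
# outputs, so the predicted class (argmax) is unchanged by the rescaling.
logits = torch.tensor([[1.2, -0.3, 4.0, 0.7]])
probs = torch.softmax(logits, dim=1)
print(torch.argmax(logits, dim=1), torch.argmax(probs, dim=1))  # same index (2)
```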
Dear all, may I know why we use torch.randn to initialize the parameters here instead of torch.normal as in the softmax regression implementation? Are there any advantages, or is there no big difference so we can use either of them? Thanks.
W1 = nn.Parameter(torch.randn(
    num_inputs, num_hiddens, requires_grad=True) * 0.01)
@Gavin
For a standard normal distribution (i.e., mean = 0 and variance = 1), you can use torch.randn. For your case of a custom mean and std, you can use torch.normal.
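For concreteness, a minimal sketch (using num_inputs = 784 and num_hiddens = 256 as in this section) showing that the two calls draw from the same N(0, 0.01²) distribution; only the API differs.
```python
import torch
from torch import nn

# Minimal sketch: both parameters are drawn from N(0, 0.01^2).
num_inputs, num_hiddens = 784, 256

W1_a = nn.Parameter(torch.randn(num_inputs, num_hiddens) * 0.01)
W1_b = nn.Parameter(torch.normal(0, 0.01, (num_inputs, num_hiddens)))
print(W1_a.std().item(), W1_b.std().item())  # both approximately 0.01
```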
Hello. Can you please advise why the result is multiplied by 0.01 after generating the random numbers?
W1 = nn.Parameter(torch.randn(
    num_inputs, num_hiddens, requires_grad=True) * 0.01)
Hi, I have a question on the last exercise. What would be a smart way to search over hyperparameters? Is it possible to apply a grid search or randomized search over hyperparameters, like with scikit-learn algorithms?
If not, how should one iterate through a set of hyperparameters?
Thanks in advance for your answers.
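There is no built-in GridSearchCV here, but a plain loop works. A minimal sketch (the helper train_and_eval is hypothetical; it is assumed to build the MLP with the given settings, train it, and return the final test accuracy):
```python
import itertools

# Minimal sketch of a manual grid search; `train_and_eval` is a hypothetical
# helper that trains the model with the given hyperparameters and returns
# the test accuracy.
grid = {
    'lr': [0.05, 0.1, 0.5],
    'num_hiddens': [64, 256],
    'num_epochs': [10, 20],
}

best_acc, best_cfg = 0.0, None
for values in itertools.product(*grid.values()):
    cfg = dict(zip(grid.keys(), values))
    acc = train_and_eval(**cfg)  # hypothetical training call
    if acc > best_acc:
        best_acc, best_cfg = acc, cfg
print(best_cfg, best_acc)
```
For larger search spaces, randomly sampling configurations (in the spirit of RandomizedSearchCV) usually covers the space more efficiently than an exhaustive grid.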
Hi,
How can I download the book in HTML, like from the website? It is more interactive than learning from the PDF book. Does the d2l.ai book contain all the course code in PyTorch?
Thanks, I am starting it.
Hi @machine_machine, please check my response at Full pytorch code book for d2l.ai [help]. Thanks for your patience.
torch.randn (per the PyTorch docs) generates values with mean 0 and variance 1. We multiply the tensor by 0.01 to scale the initial parameters down to a small range.
Ans 2. Initializing with small numbers is required for stable training; it does not have to be the range 0 to 1 in particular, and a distribution from -1 to 1 works as well. The main thing to keep in mind for stable training of deep neural nets is to set the parameters in a way that avoids both exploding and vanishing gradients.
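To see why the small scale matters, a minimal sketch (made-up batch and layer sizes) comparing the pre-activation scale after one linear layer with scaled versus unscaled standard-normal weights:
```python
import torch

# Minimal sketch: with unscaled N(0, 1) weights the pre-activations are
# roughly 100x larger, which makes exploding gradients much more likely.
X = torch.randn(256, 784)                 # a made-up batch of inputs
W_small = torch.randn(784, 256) * 0.01    # scaled initialization
W_large = torch.randn(784, 256)           # unscaled initialization
print((X @ W_small).abs().mean().item())  # small pre-activations
print((X @ W_large).abs().mean().item())  # about 100x larger
```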
My answers:
1. Test accuracy stays roughly constant.
# Increasing the number of hidden layers
W1 = nn.Parameter(torch.randn(num_inputs, 128) * 0.01, requires_grad=True)
b1 = nn.Parameter(torch.zeros(128), requires_grad=True)
W2 = nn.Parameter(torch.randn(128, 64) * 0.01, requires_grad=True)
b2 = nn.Parameter(torch.zeros(64), requires_grad=True)
W3 = nn.Parameter(torch.randn(64, num_outputs) * 0.01, requires_grad=True)
b3 = nn.Parameter(torch.zeros(num_outputs), requires_grad=True)

def net(X):
    X = X.reshape(-1, num_inputs)
    out = relu(torch.matmul(X, W1) + b1)
    out = relu(torch.matmul(out, W2) + b2)
    return torch.matmul(out, W3) + b3
How does changing the learning rate alter your results? Fixing the model architecture and other hyperparameters (including the number of epochs), what learning rate gives you the best results?
What is the best result you can get by optimizing over all the hyperparameters (learning rate, number of epochs, number of hidden layers, number of hidden units per layer) jointly?
Thanks
My answers: (I tried to maximize test accuracy)
def net(X):
    X = X.reshape((-1, num_inputs))
    H = relu(X@W1 + b1)  # Here '@' stands for matrix multiplication
    return (H@W2 + b2)
For this block of code, I understand we want to use the reshape method to flatten the input, but I don't understand why we need the -1 in reshape to let the computer automatically match the first axis. I think the result will always have the shape (1, num_inputs), so there seems to be no need to auto-match the first axis.
Please let me know where my thinking is wrong.
Thank you.
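A minimal sketch (assuming Fashion-MNIST minibatches as in this chapter) showing why the first axis is not 1: each batch X has shape (batch_size, 1, 28, 28), and -1 lets reshape infer the batch size automatically.
```python
import torch

# Minimal sketch: the data loader yields batches of images, so the first
# axis after flattening is the batch size, not 1.
num_inputs = 28 * 28
X = torch.randn(256, 1, 28, 28)           # a batch of 256 Fashion-MNIST-sized images
print(X.reshape((-1, num_inputs)).shape)  # torch.Size([256, 784])
```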
My solutions to the exercises: 5.2
My exercise answers:
Q1: When num_hiddens < 4, val_acc decreases dramatically, but when num_hiddens is increased well beyond 256, val_acc stays roughly constant.
Q2: A new hidden layer with sufficient neurons didn't affect the result, but one with insufficient neurons decreases accuracy.
Q3: A linear predictor can only use the single feature provided by the representation; it does not have enough information to work properly.
Q4: With too small a learning rate the model may not converge within the given number of epochs; with too large a learning rate the model can't be trained efficiently (accuracy is not smooth).
Q5: 5.2 The parameter space is too large.
5.3 MCMC?
Q6: TBD
Q7: Didn't see a big difference on my device, but theoretically well-aligned sizes should be faster?
Q8: Tested sigmoid and tanh. With the other parameters fixed, accuracy: ReLU ≈ tanh > sigmoid (see the sketch after this list).
Q9: Given enough epochs, it doesn't matter.
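For Q8, a minimal sketch (hypothetical helper, not the book's code) of how the activation can be swapped while every other hyperparameter stays fixed:
```python
import torch

# Minimal sketch for comparing activation functions (Q8); only the
# activation changes, everything else stays fixed.
activations = {'relu': torch.relu, 'sigmoid': torch.sigmoid, 'tanh': torch.tanh}

def hidden_layer(X, W, b, act='relu'):
    """Hypothetical helper: one fully connected layer with a pluggable activation."""
    return activations[act](X @ W + b)
```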