Here is my code:
Hey,
The CrossEntropyLoss function already computes the softmax internally.
What’s your IDE? I’m curious. Thanks.
I think it’s VS Code with a plugin for viewing notebooks.
How do you use the softmax during testing? Since softmax is implemented inside the loss function, net(x) returns raw logits with no softmax applied, so the argmax is taken over the logits rather than over probabilities. I wonder how softmax is used at test time.
Hi @ccpvirus, as @StevenJokes mentioned, we use the “maximum” value across the 10 class outputs as our final label. As softmax is just a monotonic “rescaling” function, it doesn’t affect which class output (i.e., which class label) is the maximum over all classes. Let me know whether this is clear to you.
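A quick way to check this yourself, as a minimal sketch: the logits tensor below is made-up random data standing in for the output of net(x), not the real model from the chapter.

```python
import torch

torch.manual_seed(0)

# Made-up logits standing in for net(X): 5 examples, 10 classes (assumption).
logits = torch.randn(5, 10)

# Softmax rescales each row into probabilities that sum to 1.
probs = torch.softmax(logits, dim=1)

# Softmax is strictly increasing in each logit, so the position of the
# largest entry in each row stays the same.
pred_from_logits = logits.argmax(dim=1)
pred_from_probs = probs.argmax(dim=1)

print(pred_from_logits)
print(pred_from_probs)
print(torch.equal(pred_from_logits, pred_from_probs))
```

So at test time you only need softmax if you want the probabilities themselves; for the predicted label, argmax over the logits is enough.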
Dear all, may I know why we use torch.randn to initialize the parameter here instead of using torch.normal as in Softmax Regression implementation? Are there any advantages? Or actually there are no big differences, we can use both of them? Thanks.
W1 = nn.Parameter(torch.randn(
    num_inputs, num_hiddens, requires_grad=True) * 0.01)
@Gavin
For a standard normal distribution (i.e. mean=0 and variance=1), you can use torch.randn.
For your case of custom mean and std, you can use torch.normal.
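To make the comparison concrete, here is a small sketch of both options. The sizes 784 and 256 are assumed from the chapter's Fashion-MNIST setup; any shape would do.

```python
import torch

torch.manual_seed(0)
num_inputs, num_hiddens = 784, 256  # sizes assumed from the chapter's setup

# Option 1: standard normal samples, scaled down by 0.01.
W_a = torch.randn(num_inputs, num_hiddens) * 0.01

# Option 2: sample directly with mean 0 and standard deviation 0.01.
W_b = torch.normal(0.0, 0.01, size=(num_inputs, num_hiddens))

# Both tensors have mean close to 0 and standard deviation close to 0.01.
print(W_a.mean().item(), W_a.std().item())
print(W_b.mean().item(), W_b.std().item())
```

The two tensors follow the same distribution, so which one you use is mostly a matter of taste.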
Hello. Can you please advise why 0.01 is being multiplied after generating the random numbers?
W1 = nn.Parameter(torch.randn(
    num_inputs, num_hiddens, requires_grad=True) * 0.01)
Hi, I have a question about the last exercise. What would be a smart way to search over hyperparameters? Is it possible to apply something like scikit-learn's GridSearch or RandomGridSearch to these hyperparameters?
If not, how should I iterate through a set of hyperparameters?
Thanks in advance for any answers.
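One way to iterate by hand, as a rough sketch in plain Python: train_and_evaluate below is a hypothetical placeholder that returns a made-up score so the example runs on its own. In practice you would replace its body with your training loop and return a validation metric. The search space values are also just illustrative.

```python
import itertools
import random


def train_and_evaluate(lr, num_hiddens, num_epochs):
    """Hypothetical stand-in for training the MLP and returning validation
    accuracy. It returns a made-up score so the sketch runs without training."""
    return 1.0 - abs(lr - 0.1) - abs(num_hiddens - 256) / 1024 - 0.01 / num_epochs


# Illustrative candidate values (assumptions, not recommendations).
search_space = {
    "lr": [0.01, 0.1, 0.5],
    "num_hiddens": [64, 256, 1024],
    "num_epochs": [5, 10],
}


def grid_search(space):
    """Evaluate every combination of values in the search space."""
    names = list(space)
    best_config, best_score = None, float("-inf")
    for values in itertools.product(*(space[n] for n in names)):
        config = dict(zip(names, values))
        score = train_and_evaluate(**config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def random_search(space, num_trials, seed=0):
    """Evaluate a fixed number of randomly chosen combinations."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(num_trials):
        config = {name: rng.choice(values) for name, values in space.items()}
        score = train_and_evaluate(**config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


grid_config, grid_score = grid_search(search_space)
rand_config, rand_score = random_search(search_space, num_trials=5)
print("grid search:", grid_config, grid_score)
print("random search:", rand_config, rand_score)
```

Grid search tries all 18 combinations here, while random search only tries as many as you ask for, which matters once each trial is a full training run.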
Hi
How can I download the book in HTML, like the version on the website?
It seems more interactive than learning from the PDF book.
Does the d2l.ai book include all the course code in PyTorch?
Thanks, I am just starting it.
Hi @machine_machine, please check my response at Full pytorch code book for d2l.ai [help]. Thanks for your patience.
While creating the parameters we multiply the tensor by 0.01. Can anyone explain why we do so?
Also, when initializing weights, are they required to lie between 0 and 1?
Ans 1. According to the PyTorch docs, torch.randn draws from a normal distribution with mean 0 and variance 1. Multiplying the tensor by 0.01 shrinks the standard deviation to 0.01, so the initial weights are small values centered at 0.
Ans 2. Initializing with small numbers is required for stable training, but the values do not have to lie between 0 and 1 in particular; we can take a distribution from -1 to 1 as well. The only thing to keep in mind for stable training of deep neural nets is that we should set the parameters in a way that avoids both exploding and vanishing gradients.
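To see what the 0.01 factor does in practice, here is a small illustrative sketch. The inputs are made-up unit-variance random data, not real images, and the sizes 784 and 256 are assumed from the chapter's setup. For inputs like these, the standard deviation of the first layer's pre-activations is roughly sqrt(784) = 28 times the weight scale.

```python
import torch

torch.manual_seed(0)
num_inputs, num_hiddens = 784, 256  # sizes assumed from the chapter's setup

# Made-up inputs with unit variance (an assumption; real pixel data differs).
X = torch.randn(1024, num_inputs)

stds = {}
for scale in (1.0, 0.01):
    W = torch.randn(num_inputs, num_hiddens) * scale
    H = X @ W  # pre-activations of the first hidden layer
    stds[scale] = H.std().item()
    print(f"scale={scale}: pre-activation std = {stds[scale]:.3f}")
```

With scale 1.0 the pre-activations are already large after a single layer, and stacking layers makes this worse; with scale 0.01 they stay small.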
My answers
Exercises
- Change the value of the hyperparameter num_hiddens and see how this hyperparameter influences your results. Determine the best value of this hyperparameter, keeping all others constant.
- Try adding an additional hidden layer to see how it affects the results.
# increasing number of hidden layers
W1 = nn.Parameter(torch.randn(num_inputs, 128) * 0.01, requires_grad=True)
b1 = nn.Parameter(torch.zeros(128), requires_grad=True)
W2 = nn.Parameter(torch.randn(128, 64) * 0.01, requires_grad=True)
b2 = nn.Parameter(torch.zeros(64), requires_grad=True)
W3 = nn.Parameter(torch.randn(64, num_outputs) * 0.01, requires_grad=True)
b3 = nn.Parameter(torch.zeros(num_outputs), requires_grad=True)

def net(X):
    X = X.reshape(-1, num_inputs)
    out = relu(torch.matmul(X, W1) + b1)
    out = relu(torch.matmul(out, W2) + b2)
    return torch.matmul(out, W3) + b3
- How does changing the learning rate alter your results? Fixing the model architecture and other hyperparameters (including number of epochs), what learning rate gives you the best results?
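- It changes the rate of convergence: too small a learning rate converges slowly, and too large a learning rate can make the loss diverge.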
- What is the best result you can get by optimizing over all the hyperparameters (learning rate, number of epochs, number of hidden layers, number of hidden units per layer) jointly?
- loss of 0.5
- Describe why it is much more challenging to deal with multiple hyperparameters.
- Combinatorial explosion: the number of combinations to try multiplies with every hyperparameter added.
- What is the smartest strategy you can think of for structuring a search over multiple hyperparameters?
- Create a grid of all hyperparameter values and then train over the combinations to find the best result. Some heuristic may be required.
- Pass
- For me it was better if I increased the hidden layer from 1 to 2
- pass
- How do we define best result? Is it the minimum test loss, or maximum accuracy on test dataset? Do we keep epochs constant?
- If we have multiple hyperparameters, mixing and matching all of them creates an exponential number of combinations to evaluate.
- Grid search can be a good way to go about it. Increase the value exponentially. Maybe try binary search. The idea is to step on a logarithmic scale rather than linearly.
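The "order of log" idea can be sketched as follows. The ranges are illustrative assumptions, and sample_log_uniform is a hypothetical helper written for this example, not a library function.

```python
import math
import random

# Log-spaced candidates: each step multiplies the value by a constant factor
# instead of adding a constant increment. The ranges are illustrative only.
learning_rates = [10.0 ** e for e in range(-4, 1)]  # 1e-4, 1e-3, ..., 1
hidden_sizes = [2 ** k for k in range(5, 11)]       # 32, 64, ..., 1024


def sample_log_uniform(low, high, rng):
    """Hypothetical helper: draw a value uniformly on a log scale in [low, high]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
random_lrs = [sample_log_uniform(1e-4, 1.0, rng) for _ in range(5)]

print(learning_rates)
print(hidden_sizes)
print(random_lrs)
```

Once a coarse log-spaced pass finds a promising region, you can search more finely inside it, which is close to the binary search idea mentioned above.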
Thanks