Here the code calls a method named clip_gradients, which requires knowledge of gradient clipping. I searched through the book, and this concept is not introduced until Section 9.5.3. Since this section is supposed to implement linear regression "from scratch", I suggest replacing this method with something more intuitive for students to understand.
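For example, a minimal sketch of what the parameter update could look like with no clipping involved (the names params, lr, and batch_size are placeholders in the spirit of the book's earlier from-scratch SGD helper, not the exact code):

```python
import torch

def sgd(params, lr, batch_size):
    """Plain minibatch SGD step, no gradient clipping involved."""
    with torch.no_grad():
        for param in params:
            # Average the accumulated gradient over the minibatch and step.
            param -= lr * param.grad / batch_size
            param.grad.zero_()
```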
Regarding question #7 in the exercises: "If the number of examples cannot be divided by the batch size, what happens to data_iter at the end of an epoch?"
This topic can make for a good teachable moment, but the question needs to be designed more methodically. What happens to a particular variable in the code is beside the point. The real question worth thinking about is: "What should we (or the code) do when a dataset of size M cannot be evenly divided into batches of size N?"
Just thinking about it logically, we have a few options:
A) throw an error
B) force the batch size to closest fit where it evenly divides the dataset
C) divide the dataset and leave one final batch of differing size
D) divide the dataset and trim the final batch of differing size
A) can work, but it is not ideal since the problem is feasibly solvable. B) can also work; however, when the length of the dataset is a prime number, this effectively turns mini-batch gradient descent off and runs one big batch with all of the data. For example, a dataset of 101 samples would become a single batch of 101. This may not be ideal for whoever is using our code.
Ultimately, C) and D) seem to be what modern libraries go for. PyTorch, for example, addresses this in its DataLoader constructor via the drop_last argument, as shown in the sketch below.
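As a minimal sketch (the toy dataset of 10 examples and batch size of 3 are chosen only for illustration), here is how the two policies play out with PyTorch's DataLoader:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy dataset: 10 examples with batch size 3, so 10 % 3 == 1 example is left over.
X = torch.arange(10, dtype=torch.float32).reshape(-1, 1)
y = 2 * X + 1
dataset = TensorDataset(X, y)

# Option C: keep the final, smaller batch (the default, drop_last=False).
keep_last = DataLoader(dataset, batch_size=3, shuffle=False, drop_last=False)
print([len(xb) for xb, _ in keep_last])   # [3, 3, 3, 1]

# Option D: drop the final, smaller batch (drop_last=True).
trim_last = DataLoader(dataset, batch_size=3, shuffle=False, drop_last=True)
print([len(xb) for xb, _ in trim_last])   # [3, 3, 3]
```

If I recall correctly, the from-scratch data_iter in this section behaves like option C, since slicing the shuffled indices with indices[i: i + batch_size] simply yields a shorter final batch.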
I would like the authors/editors here to focus a bit more on asking thought-provoking questions and exploring pertinent problems in greater depth. Many exercises go off on a tangent from the lesson, and others dive into more advanced topics, which is a problem when the student has not consistently been given the content needed to master the fundamentals.
In 3.4.2 they say: "In the implementation, we need to transform the true value y into the predicted value's shape y_hat."
Isn't this inconsistent? Shouldn't we be transforming the predicted value's shape to match the true value instead?
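For reference, this is roughly the squared loss from that subsection (a sketch from memory, not the exact book code). The reshape exists because y may arrive with shape (n,) while y_hat has shape (n, 1); without aligning them, broadcasting would silently produce an (n, n) matrix. Which tensor gets reshaped matters less than the fact that the two shapes end up matching:

```python
import torch

def squared_loss(y_hat, y):
    """Squared loss; align y with y_hat's shape before subtracting."""
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2

y_hat = torch.ones(3, 1)            # predictions, shape (3, 1)
y = torch.tensor([1.0, 2.0, 3.0])   # labels, shape (3,)
print(squared_loss(y_hat, y).shape) # torch.Size([3, 1])
print(((y_hat - y) ** 2 / 2).shape) # torch.Size([3, 3]) -- the broadcasting pitfall
```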
with torch.no_grad():
    loss.backward()
Shouldn't loss.backward() be called before entering torch.no_grad()?
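For what it's worth, a minimal sketch with made-up tensors of the pattern in question: torch.no_grad() only stops new operations from being recorded in the autograd graph, so calling backward() on a graph that was already built outside the context still computes gradients; the context is presumably there so that the parameter update itself is not tracked.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)
x = torch.tensor([3.0])
loss = (w * x).sum()      # graph built here, with gradient tracking on

with torch.no_grad():
    loss.backward()       # fine: traverses the existing graph, fills w.grad
    w -= 0.1 * w.grad     # the update itself is not tracked, which is the point
    w.grad.zero_()

print(w)                  # tensor([1.7000], requires_grad=True)
```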
For question 3 in the exercises: I thought about using a wide range of wavelengths (lambdas) and temperatures to calculate the spectral density, then using the temperature as my X (predictor) and the spectral density as my y (target).
Am I missing something?
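In case it helps anyone exploring the same idea, here is a minimal sketch of generating such a dataset with Planck's law. The wavelength grid, temperature grid, and the choice to treat temperature as the feature are my own assumptions mirroring the approach described above, not the book's reference solution:

```python
import torch

# Physical constants (SI units)
h = 6.626e-34   # Planck constant
c = 3.0e8       # speed of light
k = 1.381e-23   # Boltzmann constant

def planck(lam, T):
    """Spectral density B(lambda, T) from Planck's law."""
    return (2 * h * c**2 / lam**5) / (torch.exp(h * c / (lam * k * T)) - 1)

# Arbitrary grids of wavelengths (meters) and temperatures (kelvin).
lams = torch.linspace(100e-9, 3000e-9, 50)
temps = torch.linspace(200.0, 2000.0, 100)

# Treat temperature as the predictor X and the spectral density at each
# wavelength as the target y, giving one regression problem per wavelength.
# Extreme (short wavelength, low temperature) combinations underflow to 0,
# which is physically sensible.
X = temps.reshape(-1, 1)
Y = torch.stack([planck(lam, temps) for lam in lams], dim=1)
print(X.shape, Y.shape)   # torch.Size([100, 1]) torch.Size([100, 50])
```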
I am posting this answer to the very first question in the exercises, in case there are people like me who, after reading all the discussions about breaking symmetry in weight initialization, still don't really understand why initializing the weights to zero still works in this specific exercise, and why it doesn't work in general.
So, for this specific exercise, where we only try to fit one linear function, it works even if we initialize the weight vector with all zeros. Our objective is to minimize the loss with respect to the weights, and since the prediction's gradient with respect to the weights (dy_hat/dw) equals the input vector, the loss gradient is (y_hat - y) times the input. This is generally non-zero even when the weights start at zero (and each component of the input vector is typically different), so the weights still get updated and everything works.
However, for a multi-layer neural network this no longer works: with zero weights, every hidden layer from the first one onward produces zero activations on the first batch, and all hidden units within a layer receive identical gradients during backpropagation. They therefore stay identical to one another across updates, leading to the so-called symmetry problem, where every hidden neuron ends up learning the same function.
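A minimal sketch contrasting the two cases (the toy data and hyperparameters are my own, not from the book): zero initialization still trains a single linear layer, while a zero-initialized MLP leaves all hidden units identical.

```python
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(256, 2)
y = X @ torch.tensor([[2.0], [-3.4]]) + 4.2   # a toy linear target

def train(net, steps=200, lr=0.1):
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        ((net(X) - y) ** 2).mean().backward()
        opt.step()

# 1) Plain linear regression with zero-initialized weights: still learns,
#    because the loss gradient (y_hat - y) * x is non-zero from the start.
lin = nn.Linear(2, 1)
nn.init.zeros_(lin.weight); nn.init.zeros_(lin.bias)
train(lin)
print(lin.weight.data, lin.bias.data)   # close to [2.0, -3.4] and 4.2

# 2) An MLP with every parameter zero: all hidden units receive identical
#    updates, so their weight rows stay identical -- the symmetry problem.
net = nn.Sequential(nn.Linear(2, 4), nn.Sigmoid(), nn.Linear(4, 1))
for p in net.parameters():
    nn.init.zeros_(p)
train(net)
print(net[0].weight.data)               # every row is the same
```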
I feel that this Quora answer explains quite well what happens, for stubborn heads like mine.
I'm confused by a few of the exercises. For #4, there is no reshape in the loss function. In question #7, there is no data_iter. Are these out of date?