Here the code calls a method named clip_gradients, which requires knowledge of gradient clipping. I searched through the book, and this concept is not introduced until Section 9.5.3. Since this section is supposed to implement linear regression "from scratch", I suggest replacing this method with something more intuitive for students to understand.
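For example, a minimal sketch of what the parameter update could look like with no clipping involved (the names params, lr, and batch_size are placeholders in the spirit of the book's earlier from-scratch SGD helper, not the exact code):

```python
import torch

def sgd(params, lr, batch_size):
    """Plain minibatch SGD step, no gradient clipping involved."""
    with torch.no_grad():
        for param in params:
            # Average the accumulated gradient over the minibatch and step.
            param -= lr * param.grad / batch_size
            param.grad.zero_()
```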
Regarding question #7 in the exercises: "If the number of examples cannot be divided by the batch size, what happens to data_iter at the end of an epoch?"
This topic can make for a good teachable moment, but the question needs to be designed more methodically. What happens to a particular variable in the code is beside the point. The real question worth thinking about is: "What should we (or the code) do when a dataset of size M cannot be evenly divided into batches of size N?"
Just thinking about it logically, we have a few options:
A) throw an error
B) force the batch size to closest fit where it evenly divides the dataset
C) divide the dataset and leave one final batch of differing size
D) divide the dataset and trim the final batch of differing size
A) can work, but it is not ideal since the problem is feasibly solvable. B) can also work; however, when the length of the dataset is a prime number, this effectively turns mini-batch gradient descent off and runs one big batch with all of the data. For example, a dataset of 101 samples would become a single batch of 101. This may not be ideal for whoever is using our code.
Ultimately, C) and D) seem to be what modern libraries go for. PyTorch, for example, addresses this in its DataLoader constructor via the drop_last argument, as shown in the sketch below.
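As a minimal sketch (the toy dataset of 10 examples and batch size of 3 are chosen only for illustration), here is how the two policies play out with PyTorch's DataLoader:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy dataset: 10 examples with batch size 3, so 10 % 3 == 1 example is left over.
X = torch.arange(10, dtype=torch.float32).reshape(-1, 1)
y = 2 * X + 1
dataset = TensorDataset(X, y)

# Option C: keep the final, smaller batch (the default, drop_last=False).
keep_last = DataLoader(dataset, batch_size=3, shuffle=False, drop_last=False)
print([len(xb) for xb, _ in keep_last])   # [3, 3, 3, 1]

# Option D: drop the final, smaller batch (drop_last=True).
trim_last = DataLoader(dataset, batch_size=3, shuffle=False, drop_last=True)
print([len(xb) for xb, _ in trim_last])   # [3, 3, 3]
```

If I recall correctly, the from-scratch data_iter in this section behaves like option C, since slicing the shuffled indices with indices[i: i + batch_size] simply yields a shorter final batch.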
I would like the authors/editors here to focus a bit more on asking thought-provoking questions and exploring pertinent problems in greater depth. Many exercises go off on a tangent from the lesson, and others dive into more advanced topics, which is a problem when the student has not consistently been given the content needed to master the fundamentals.
In 3.4.2 they say: "In the implementation, we need to transform the true value y into the predicted value's shape y_hat."
Isn't this inconsistent? Shouldn't we be transforming the predicted value's shape to match the true value instead?
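For reference, this is roughly the squared loss from that subsection (a sketch from memory, not the exact book code). The reshape exists because y may arrive with shape (n,) while y_hat has shape (n, 1); without aligning them, broadcasting would silently produce an (n, n) matrix. Which tensor gets reshaped matters less than the fact that the two shapes end up matching:

```python
import torch

def squared_loss(y_hat, y):
    """Squared loss; align y with y_hat's shape before subtracting."""
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2

y_hat = torch.ones(3, 1)            # predictions, shape (3, 1)
y = torch.tensor([1.0, 2.0, 3.0])   # labels, shape (3,)
print(squared_loss(y_hat, y).shape) # torch.Size([3, 1])
print(((y_hat - y) ** 2 / 2).shape) # torch.Size([3, 3]) -- the broadcasting pitfall
```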
with torch.no_grad():
    loss.backward()
Shouldn't loss.backward() be called before entering torch.no_grad()?
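For what it's worth, a minimal sketch with made-up tensors of the pattern in question: torch.no_grad() only stops new operations from being recorded in the autograd graph, so calling backward() on a graph that was already built outside the context still computes gradients; the context is presumably there so that the parameter update itself is not tracked.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)
x = torch.tensor([3.0])
loss = (w * x).sum()      # graph built here, with gradient tracking on

with torch.no_grad():
    loss.backward()       # fine: traverses the existing graph, fills w.grad
    w -= 0.1 * w.grad     # the update itself is not tracked, which is the point
    w.grad.zero_()

print(w)                  # tensor([1.7000], requires_grad=True)
```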
For question 3 in the exercises: I thought about using a wide range of wavelengths (lambdas) and temperatures to calculate the spectral density, then using the temperature as my X (predictor) and the spectral density as my y (target).
Am I missing something?
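In case it helps anyone exploring the same idea, here is a minimal sketch of generating such a dataset with Planck's law. The wavelength grid, temperature grid, and the choice to treat temperature as the feature are my own assumptions mirroring the approach described above, not the book's reference solution:

```python
import torch

# Physical constants (SI units)
h = 6.626e-34   # Planck constant
c = 3.0e8       # speed of light
k = 1.381e-23   # Boltzmann constant

def planck(lam, T):
    """Spectral density B(lambda, T) from Planck's law."""
    return (2 * h * c**2 / lam**5) / (torch.exp(h * c / (lam * k * T)) - 1)

# Arbitrary grids of wavelengths (meters) and temperatures (kelvin).
lams = torch.linspace(100e-9, 3000e-9, 50)
temps = torch.linspace(200.0, 2000.0, 100)

# Treat temperature as the predictor X and the spectral density at each
# wavelength as the target y, giving one regression problem per wavelength.
# Extreme (short wavelength, low temperature) combinations underflow to 0,
# which is physically sensible.
X = temps.reshape(-1, 1)
Y = torch.stack([planck(lam, temps) for lam in lams], dim=1)
print(X.shape, Y.shape)   # torch.Size([100, 1]) torch.Size([100, 50])
```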
I am posting this answer to the very first question in the exercises, in case there are people like me who, after reading all the discussions about breaking symmetry in weight initialization, still don't really understand why initializing the weights to zero still works in this specific exercise, and why it doesn't work in general.
So, for this specific exercise, where we only try to fit one linear function, it works even if we initialize the weight vector with all zeros. Our objective is to minimize the loss with respect to the weights, and since the prediction's gradient with respect to the weights (dy_hat/dw) equals the input vector, the loss gradient is (y_hat - y) times the input. This is generally non-zero even when the weights start at zero (and each component of the input vector is typically different), so the weights still get updated and everything works.
However, for a multi-layer neural network this no longer works: with zero weights, every hidden layer from the first one onward produces zero activations on the first batch, and all hidden units within a layer receive identical gradients during backpropagation. They therefore stay identical to one another across updates, leading to the so-called symmetry problem, where every hidden neuron ends up learning the same function.
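A minimal sketch contrasting the two cases (the toy data and hyperparameters are my own, not from the book): zero initialization still trains a single linear layer, while a zero-initialized MLP leaves all hidden units identical.

```python
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(256, 2)
y = X @ torch.tensor([[2.0], [-3.4]]) + 4.2   # a toy linear target

def train(net, steps=200, lr=0.1):
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        ((net(X) - y) ** 2).mean().backward()
        opt.step()

# 1) Plain linear regression with zero-initialized weights: still learns,
#    because the loss gradient (y_hat - y) * x is non-zero from the start.
lin = nn.Linear(2, 1)
nn.init.zeros_(lin.weight); nn.init.zeros_(lin.bias)
train(lin)
print(lin.weight.data, lin.bias.data)   # close to [2.0, -3.4] and 4.2

# 2) An MLP with every parameter zero: all hidden units receive identical
#    updates, so their weight rows stay identical -- the symmetry problem.
net = nn.Sequential(nn.Linear(2, 4), nn.Sigmoid(), nn.Linear(4, 1))
for p in net.parameters():
    nn.init.zeros_(p)
train(net)
print(net[0].weight.data)               # every row is the same
```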
I feel that this Quora answer explains quite well what happens, for stubborn heads like mine.
I'm confused by a few of the exercises. For #4, there is no reshape in the loss function. In question #7, there is no data_iter. Are these out of date?