Linear Regression Implementation from Scratch

Here the code uses a method called “clip_gradients”, which presumes knowledge of gradient clipping. I searched through the book, and this concept is not introduced until Section 9.5.3. Since this section is meant to implement linear regression “from scratch”, I suggest replacing this method with something more intuitive for students to understand.
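As a concrete alternative, a plain minibatch SGD step with no clipping could look something like the sketch below. This is only an illustration of what I mean by “more intuitive”, not the book’s code; it assumes `params` is a list of tensors created with `requires_grad=True` and that `backward()` has already populated their gradients.

```python
import torch

def sgd(params, lr, batch_size):
    """Plain minibatch SGD update with no gradient clipping."""
    with torch.no_grad():
        for param in params:
            # Average the accumulated gradient over the minibatch and step.
            param -= lr * param.grad / batch_size
            # Reset the gradient so the next backward() starts from zero.
            param.grad.zero_()
```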

Regarding question 7 in the exercises: “If the number of examples cannot be divided by the batch size, what happens to data_iter at the end of an epoch?”

This topic can be a good teachable moment, but the question needs to be designed more methodically. What happens to a particular variable in the code is not the interesting part. The real question worth thinking about is: “What should we (or the code) do when a dataset of size M cannot be evenly divided into batches of size N?”

Just thinking about it logically, we have a few options:

A) throw an error
B) force the batch size to closest fit where it evenly divides the dataset
C) divide the dataset and leave one final batch of differing size
D) divide the dataset and trim the final batch of differing size

A) can work, but is not ideal, since this problem can feasibly be solved. B) can also work; however, when the length of the dataset is a prime number, the only sizes that divide it evenly are 1 and the dataset length itself, so this effectively turns minibatch gradient descent off and just does one big batch with all of the data. For example, a dataset of 101 samples would become a single batch of 101. This may not be ideal for whoever is using our code.

Ultimately, C) and D) are what modern libraries seem to go for. PyTorch, for example, addresses this in its DataLoader constructor, specifically the drop_last argument; a sketch of how the same choice could look in a from-scratch data_iter follows below.
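The following sketch (my own illustration, not the book’s implementation) shows how a from-scratch data_iter could expose that choice between C) and D) with a drop_last flag, mirroring PyTorch’s DataLoader:

```python
import random
import torch

def data_iter(features, labels, batch_size, drop_last=False):
    """Yield shuffled minibatches; optionally drop the short final batch."""
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        batch_indices = indices[i: i + batch_size]
        if drop_last and len(batch_indices) < batch_size:
            break  # option D: trim the final, smaller batch
        # option C: yield whatever is left, even if smaller than batch_size
        yield features[batch_indices], labels[batch_indices]
```

The built-in equivalent would be something like `torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True, drop_last=True)`.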

I would like the authors/editors to focus more on asking thought-provoking questions and exploring pertinent problems in greater depth. Many of the exercises go off on tangents from the lesson, and others dive into more advanced topics, which is a problem when the student has not consistently been given the content needed to master the fundamentals first.

In 3.4.2 they say

In the implementation, we need to transform the true value y into the predicted value’s shape y_hat

Isn’t this inconsistent? Shouldn’t we be transforming the predicted value’s shape to match the true value instead?
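For context, here is a minimal sketch of the kind of loss implementation that sentence refers to (my paraphrase, assuming y_hat comes out of the model with shape (n, 1) while y is stored with shape (n,)):

```python
def squared_loss(y_hat, y):
    """Squared loss; y is reshaped to y_hat's shape so the elementwise
    subtraction does not broadcast (n, 1) against (n,) into an (n, n) matrix."""
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2
```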