http://d2l.ai/chapter_linear-networks/linear-regression.html

After training for some predetermined number of iterations (or until some other stopping criteria are met), we record the estimated model parameters, denoted $\hat{\mathbf{w}}, \hat{b}$. Note that even if our function is truly linear and noiseless, these parameters **will not be the exact minimizers of the loss because, although the algorithm converges slowly towards the minimizers it cannot achieve it exactly in a finite number of steps.**

I have a question about the part in bold. If we choose a large learning rate, the algorithm can overshoot the parameter values for which the loss function is minimized. That tells me we should be able to find w, b that minimize the loss exactly. What could I be missing?
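One way to probe this is a small experiment. The sketch below is a toy, not the book's minibatch SGD: it runs plain gradient descent on the one-dimensional loss f(w) = (w - target)^2, using exact rational arithmetic so that floating-point rounding does not blur the picture. The names `gradient_descent` and `target`, the starting point and the learning rates are all illustrative choices of mine.

```python
from fractions import Fraction


def gradient_descent(lr, w0=Fraction(0), target=Fraction(3), steps=20):
    """Minimize f(w) = (w - target)**2; return the error w - target after each step."""
    w = w0
    errors = []
    for _ in range(steps):
        grad = 2 * (w - target)  # derivative of (w - target)**2
        w = w - lr * grad        # each step scales the error by (1 - 2 * lr)
        errors.append(w - target)
    return errors


small = gradient_descent(Fraction(1, 10))  # error shrinks by 4/5 per step
large = gradient_descent(Fraction(9, 10))  # error flips sign each step (overshoot)
lucky = gradient_descent(Fraction(1, 2))   # factor (1 - 2 * lr) is exactly zero

print("small lr, final error:", float(small[-1]))
print("large lr, first errors:", [float(e) for e in large[:4]])
print("lr = 1/2, error after one step:", lucky[0])
```

In this toy case each step multiplies the error by the fixed factor (1 - 2 * lr). Overshooting only flips the sign of the error; it does not make it zero. The error hits zero in finitely many steps only for the one special learning rate that makes the factor vanish, and choosing that rate requires knowing the curvature of the loss in advance.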

For Q1 from the exercises, the solution for b would be the sample mean of the data. How does it relate to the normal distribution - could someone help?

Assume that we have some data $x_1, \ldots, x_n \in \mathbb{R}$. Our goal is to find a constant $b$ such that $\sum_i (x_i - b)^2$ is minimized.

- Find an analytic solution for the optimal value of $b$.
- How does this problem and its solution relate to the normal distribution?

I believe the optimal value of b is the mean of the whole dataset, which corresponds to the mean of a normal distribution. This makes (X_i - b) the same as the (X_i - mu) term in the exponent of e in the normal density.
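A short derivation sketch of both claims, under the assumption (mine, following the chapter's maximum likelihood discussion) that the data are modeled as i.i.d. draws from $\mathcal{N}(\mu, \sigma^2)$ with $\sigma$ held fixed:

```latex
\begin{aligned}
\frac{d}{db} \sum_{i=1}^n (x_i - b)^2
  &= -2 \sum_{i=1}^n (x_i - b) = 0
  \quad\Longrightarrow\quad
  b = \frac{1}{n} \sum_{i=1}^n x_i, \\
-\log \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)
  &= \frac{n}{2} \log(2\pi\sigma^2)
   + \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 .
\end{aligned}
```

The second derivative of the sum of squares is $2n > 0$, so the stationary point is a minimum. In the second line, only the last term depends on $\mu$, so minimizing the negative log-likelihood over $\mu$ is the same problem as minimizing $\sum_i (x_i - b)^2$ over $b$, and the maximum likelihood estimate of $\mu$ is the sample mean.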

In question 3, should the distribution be Laplace or double exponential?

I found that the code can't be run in Colab because mxnet can't be imported.