I think one way to circumvent this is to use the squared loss when the error is small (near a stationary point) and the absolute error otherwise. This loss function is known as the Huber loss.

Info on Huber loss
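A minimal sketch of that idea in NumPy, assuming a threshold parameter `delta` (hypothetical name) marking where the loss switches from quadratic to linear:

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones.

    delta is the (assumed) crossover point between the two regimes.
    """
    r = y_true - y_pred
    small = np.abs(r) <= delta
    # 0.5 * r^2 when |r| <= delta, else delta * (|r| - 0.5 * delta);
    # the two branches and their derivatives match at |r| = delta.
    return np.where(small, 0.5 * r**2, delta * (np.abs(r) - 0.5 * delta))
```

Because the linear branch has constant-magnitude gradient `delta`, large outliers do not blow up the update the way squared error does, while the quadratic branch keeps the gradient shrinking smoothly near the optimum.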

Wanted to point out that this chapter is missing a fairly important point. Although linear regression is often used on data with a linear trend, it can also fit nonlinear data once the features are expanded (e.g., with polynomial terms). With gradient descent on mean squared error, the gradient of the MSE is still linear **with respect to** the weights. Although often labeled “polynomial regression,” it is essentially linear regression under the hood.
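A small sketch of that point, using a hypothetical quadratic target and NumPy's least-squares solver: the model is nonlinear in `x` but linear in the weights, so an ordinary linear regression solve recovers them exactly.

```python
import numpy as np

# Hypothetical data generated from y = 2 + 3x + 0.5x^2 (no noise).
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = 2 + 3 * x + 0.5 * x**2

# Polynomial feature expansion: columns [1, x, x^2].
X = np.stack([np.ones_like(x), x, x**2], axis=1)

# A plain linear least-squares fit on the expanded features --
# "polynomial regression" is just linear regression in this basis.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Since the loss is quadratic in `w`, the same gradient-descent machinery used for linear regression applies unchanged; only the design matrix differs.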

In Chapter 2.3 the authors mention that they use “dimensionality” to refer to the number of elements along an axis: “Oftentimes, the word “dimension” gets overloaded to mean both the number of axes and the length along a particular axis. To avoid this confusion, we use *order* to refer to the number of axes and *dimensionality* exclusively to refer to the number of components.”

Talking about polynomial regression, what do you mean by “linear regression under the hood”? Linear regression is a special case of polynomial regression, but as far as I know they are fairly different.