Weight Decay

Ok, I didn’t know that the learning rate could cause this much disparity in results. Thanks!

Is the b in b^2 the same as the b in wx+b?

I think the idea is that W is a really high-dimensional vector because there are so many weights. b is relatively low-dimensional, so regularizing b has a much smaller effect.

Hello. I have two questions about this notebook. (1) Do I understand correctly that “weight decay” is a generic term for any regularization that involves adding some kind of norm of W (regardless of which norm) to the loss? (2) Is it not possible to add L1 regularization to the optim.SGD input? The docstring only lists L2 norm.

Thanks!

Great question! Weight decay should ideally be mathematically equivalent to L2 regularization. But be careful when you implement it in code: this article indicates that different frameworks might behave slightly differently.
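To make the equivalence concrete, here is a minimal sketch for plain SGD without momentum. It assumes PyTorch is installed; the data, `lr`, and `wd` values are made up for illustration. One step with `weight_decay=wd` is compared against one step where the penalty (wd/2) * ||w||^2 is added to the loss by hand.

```python
import torch

torch.manual_seed(0)
X = torch.randn(8, 3)
y = torch.randn(8, 1)
lr, wd = 0.1, 0.5  # illustrative values, not taken from the thread


def one_step(use_optimizer_decay):
    # Both runs start from the same weights.
    w = torch.ones(3, 1, requires_grad=True)
    decay = wd if use_optimizer_decay else 0.0
    opt = torch.optim.SGD([w], lr=lr, weight_decay=decay)
    loss = ((X @ w - y) ** 2).mean()
    if not use_optimizer_decay:
        # Explicit penalty; the factor 1/2 makes its gradient equal wd * w.
        loss = loss + 0.5 * wd * (w ** 2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return w.detach()


w_decay = one_step(True)
w_penalty = one_step(False)
print(torch.allclose(w_decay, w_penalty, atol=1e-6))
```

The match relies on the 1/2 factor and on using vanilla SGD. With adaptive optimizers such as Adam, an L2 penalty in the loss and decoupled weight decay generally give different updates, which is one way frameworks and optimizers can end up differing.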

You may be right.
Does that mean we can penalize w and b at the same time, or each separately?

I think it’s possible. Can we add? @goldpiggy

Hi @StevenJokes and @Steven_Hearnt, great questions! L1 regularization is applicable if you specify how to handle the non-differentiable case (x = 0).
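One common way to handle x = 0 is to pick a subgradient there. A minimal pure-Python sketch, assuming we choose 0 at that point (any value in [-1, 1] would be a valid subgradient of |x|); the function names are hypothetical:

```python
def l1_subgradient(w):
    # Subgradient of |w|: sign(w) away from zero.
    # At w == 0 we choose 0, which lies in the subdifferential [-1, 1].
    if w > 0:
        return 1.0
    if w < 0:
        return -1.0
    return 0.0


def sgd_step_l1(w, grad_loss, lr, lam):
    # One SGD update on loss(w) + lam * |w| for a single scalar weight.
    return w - lr * (grad_loss + lam * l1_subgradient(w))


print(sgd_step_l1(1.0, 0.0, lr=0.1, lam=0.5))
print(sgd_step_l1(0.0, 0.0, lr=0.1, lam=0.5))
```

With this choice, a weight that is exactly zero and has zero loss gradient stays at zero instead of being pushed away from it.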

Hi @StevenJokes, I might be late to the party but would like to take a stab at it from a theoretical point of view rather than a practical one (which has already been covered). If we regularize the bias term (b) by adding the penalty term b^2, the network would end up learning a very small bias when we constrain the model heavily (i.e. lambda is very large). In that case, the model would lose its offset and could not capture the average value of the targets, so it would predict some value near zero every time. To avoid such a scenario, maybe the bias term isn’t regularized in the last layer.
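On the practical side, leaving the bias out of the penalty can be sketched with optimizer parameter groups. This assumes PyTorch; the layer sizes and the `wd` value are made up for illustration.

```python
import torch
from torch import nn

net = nn.Linear(4, 1)
wd = 3.0  # illustrative value

opt = torch.optim.SGD(
    [
        {"params": [net.weight], "weight_decay": wd},  # decay the weights
        {"params": [net.bias]},  # bias uses the default weight_decay of 0
    ],
    lr=0.01,
)

# Inspect the decay setting that each parameter group ended up with.
decays = [group["weight_decay"] for group in opt.param_groups]
print(decays)
```

Each group can carry its own options, so the weights and the bias can be penalized together, separately, or with different strengths.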

Why
“the network would end up learning a very small value of bias term”?

What does your “lambda” mean?

Lambda is the constraint imposed on the L2 norm in the loss function. It is defined in 4.5.2. If we set lambda to a very large number and include the bias term in the L2 norm as well, gradient descent will drive the bias term to an extremely small number too. All of this is mentioned in Section 4.5.


Is it?
@kusur

Yep. If you include the bias term in the l2_penalty, set lambda to a huge value, and train the model, you will notice that the bias itself becomes very small. This removes the affine aspect of the transformation in the neural network, i.e. the constant offset b in wx+b.
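A tiny worked example of that shrinkage, as a sketch under simplifying assumptions: the model predicts a single constant b (no weights), and the objective is mean((y - b)^2) + lam * b^2. Setting the derivative to zero gives b = mean(y) / (1 + lam). The targets and the exact form of the penalty are assumptions for illustration, not the book's code.

```python
def best_bias(targets, lam):
    # Minimizer of mean((y - b)^2) + lam * b^2 over the scalar b.
    mean_y = sum(targets) / len(targets)
    return mean_y / (1.0 + lam)


targets = [9.0, 10.0, 11.0]  # made-up data with mean 10
for lam in (0.0, 1.0, 100.0):
    print(lam, best_bias(targets, lam))
```

With lam = 0 the bias equals the target mean; as lam grows the bias is pulled toward zero, so the model loses its offset and underpredicts targets whose average is far from zero.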

I still can’t understand “the affine aspect”.
Are there any related papers? @kusur

In 4.5.1

The technique is motivated by the basic intuition that among all functions f, the function f=0 (assigning the value 0 to all inputs) is in some sense the simplest , and that we can measure the complexity of a function by its distance from zero.

Why is f = 0 considered the simplest? What does it mean for a function to be simple/complex in this context? Why is a function’s distance from zero a measure of its complexity?

The most common method for ensuring a small weight vector is to add its norm as a penalty term to the problem of minimizing the loss

What does a weight vector being “small” mean? Like, number of elements in the weight vector (length of the weight vector)?

Hi @tinkuge, great questions!

f = 0 means no regularization, that’s why it is the simplest (no computation here :wink: ). Check the Lp distance and it should give you a better answer.

“small” means that each weight element’s value is small in magnitude, such as falling within the range [-1, 1].
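A quick sketch of the distinction, in plain Python with made-up vectors: both have the same number of elements, but very different L2 norms, and it is the norm that the penalty measures.

```python
import math


def l2_norm(w):
    # Euclidean length of the vector: square root of the sum of squares.
    return math.sqrt(sum(x * x for x in w))


w_small = [0.1, -0.2, 0.05]   # illustrative values
w_large = [10.0, -20.0, 5.0]  # same number of elements, larger magnitudes

print(len(w_small) == len(w_large))
print(l2_norm(w_small) < l2_norm(w_large))
```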

Any idea how I can implement L1-regularized weight decay using torch.optim.SGD?

According to the PyTorch documentation (torch.optim, PyTorch 2.1 documentation), it seems that the weight_decay parameter only implements L2 regularization.

It looks like the L1 norm has to be implemented using the lower-level API.
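One way to do that, as a minimal sketch rather than a definitive recipe: leave `weight_decay` at its default of 0 and add the L1 term to the loss yourself, so autograd handles its gradient. This assumes PyTorch; the data, the `lam` value, and the helper name `l1_penalty` are made up for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
net = nn.Linear(5, 1)
opt = torch.optim.SGD(net.parameters(), lr=0.1)  # weight_decay stays at 0
lam = 0.01  # illustrative L1 strength

X, y = torch.randn(16, 5), torch.randn(16, 1)


def l1_penalty(w):
    # Sum of absolute values of the weights.
    return w.abs().sum()


for _ in range(3):
    opt.zero_grad()
    # Penalize only the weights; the bias is left unregularized here.
    loss = nn.functional.mse_loss(net(X), y) + lam * l1_penalty(net.weight)
    loss.backward()
    opt.step()

print(loss.item())
```

Note that plain (sub)gradient steps like this shrink weights toward zero but rarely make them exactly zero; that would need something like a proximal or thresholding step, which this sketch does not include.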

Does anyone have an idea about question 6? Thanks.

Could someone share their thoughts on question 6? Thanks.