Networks Using Blocks (VGG)

I was wondering if anyone can help me with a weird bug I’ve encountered. I’ll try to summarise: I’m training the network on Fashion-MNIST images resized to 96x96, and I’ve adjusted the architecture accordingly. The problem is pretty strange: if I define the optimizer first and then pass it into my training function:

optimizer = torch.optim.SGD(net.parameters(), lr=lr)
train(net, all_iter, test_iter, num_epochs, lr, optimizer)

then the training breaks.

But if I copy-paste the exact same line of code that defines the optimizer inside the train function, and change nothing else:

def train(net, train_iter, test_iter, num_epochs, lr, optimizer=None, device=d2l.try_gpu()):
  optimizer = torch.optim.SGD(params=net.parameters(), lr=lr)

then it works fine.

Here is my train function code — essentially the same as d2l’s, I just wanted to type it out myself:

def train(net, train_iter, test_iter, num_epochs, lr, optimizer=None, device=d2l.try_gpu()):
    """Trains a network 'net'."""
    # 1: initialise weights
    def init_weights_test(m):
        if type(m) == nn.Linear or type(m) == nn.Conv2d:
            nn.init.xavier_uniform_(m.weight)
    net.apply(init_weights_test)
    # 2: move model to device for training
    net.to(device)
    # 3: set up optimizer, loss function, and animation stuff
    loss = nn.CrossEntropyLoss()
    if optimizer is None:
        optimizer = torch.optim.SGD(params=net.parameters(), lr=lr)
    animator = d2l.Animator(xlabel="epoch number", xlim=[0, num_epochs],
                            legend=["train loss", "train acc", "test acc"])
    # 4: training loop
    for epoch in range(num_epochs):
        net.train()
        for i, (X, y) in enumerate(train_iter):
            X, y = X.to(device), y.to(device)
            optimizer.zero_grad()
            y_hat = net(X)
            l = loss(y_hat, y)
            l.backward()
            optimizer.step()
            # temporarily disable grad to calculate metrics
            with torch.no_grad():
                train_loss = l
                _, preds = torch.max(y_hat, 1)
                train_acc = ((preds == y).sum()) / float(X.shape[0])
            if (i + 1) % 50 == 0:
                animator.add(epoch + (i / len(train_iter)),
                             (train_loss, train_acc, None))
        test_acc = evaluate_accuracy_gpu(net, test_iter, device)
        animator.add(epoch + 1, (None, None, test_acc))
    print(f'loss {train_loss:.3f}, train acc {train_acc:.3f}, test acc {test_acc:.3f}')

The only thing I’m changing is not passing in a value for the optimizer parameter, so that it takes the default value of None and is set inside the function body.

Any idea why this could cause an issue?

Can you publish all your code and share a GitHub URL?
I’m still wondering about the part that you don’t show.

Ok, I figured it out, I was just being stupid. Basically I was doing something like this:

optimizer = optimizer(net)  # optimizer built from the old net instance...
net = net()                 # ...then the net is re-created
train(net, optimizer)

So of course the optimizer was not attached to the network. If you still want to laugh at my mistake you can read it all here
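For anyone who hits the same thing, here’s a minimal sketch of why the ordering matters — using a toy `nn.Linear` standing in for the VGG net:

```python
import torch
from torch import nn

# Build the optimizer from one model instance, then re-create the model:
# the optimizer still holds references to the *old* instance's parameters.
net = nn.Linear(4, 2)
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
net = nn.Linear(4, 2)  # new instance -- optimizer is not attached to it

loss = net(torch.ones(1, 4)).sum()
loss.backward()
before = net.weight.clone()
optimizer.step()  # steps the discarded parameters, not `net`'s

# `net`'s weights never move:
assert torch.equal(net.weight, before)
```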

Hey @Nish, you are not stupid. Asking questions is always smarter than hiding them. :wink:


@Nish Wow, I learnt a lot from your GitHub and now I know I need to learn more.
I have starred your repo. Keep going and communicating. :rofl:


Thank you both! I finished up with this chapter by testing out VGG-19 on our upscaled Fashion-MNIST -

Not bad!


The vgg_block function in the PyTorch version is different from the other frameworks: it takes 3 inputs and does not match the book’s description.

Thanks for raising this, appreciate it. Fixed now in master.

@anirudh @goldpiggy I was trying to run this in my Colab environment. Any reason the training is slow?
I get a rate of 675 examples/sec, whereas the chapter reports 2547.7 examples/sec. I am using a GPU.

@sushmit86 not all GPUs are the same in terms of the FLOPs they offer. AFAIK Colab uses a Tesla K80, which is okay for some basic deep learning but may not be as good as some of the other high-end GPU offerings.

Hence the difference!
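If you want to confirm which accelerator the runtime actually assigned you, a quick check (the device name you see will of course vary):

```python
import torch

# Print the CUDA device the runtime assigned, if any.
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "Tesla K80" on older Colab runtimes
else:
    print("no CUDA device; running on CPU")
```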


I am slightly frustrated at not being able to run VGG models, even their reduced cousins, on my GPU or on Kaggle. When I try my own variants, more often than not I am not able to train the network.

With much headache I finally managed to work it out; see my notes after question 4.

  1. When printing out the dimensions of the layers we only saw 8 results rather than 11. Where
    did the remaining 3 layer information go?
  • In the max-pool layers.
  2. Compared with AlexNet, VGG is much slower in terms of computation, and it also needs
    more GPU memory. Analyze the reasons for this.
  • It has more conv layers, plus the fact that it has 3 linear layers. When looking at the network I found that the linear layers take most of the memory compared to the conv layers.
  3. Try changing the height and width of the images in Fashion-MNIST from 224 to 96. What
    influence does this have on the experiments?
  • It will run faster (this needs to be checked). I tried changing to some other values, but it breaks: the first Linear layer’s input size is fixed for a 224x224 input, and if the image is too small it shrinks to nothing after the repeated max-pools.
  4. Refer to Table 1 in the VGG paper (Simonyan & Zisserman, 2014) to construct other common
    models, such as VGG-16 or VGG-19.
  • It can be done, but my GPU says hi. Per Table 1, the conv counts per block are:
VGG_19_arch = ((2, 64), (2, 128), (4, 256), (4, 512), (4, 512))

VGG_16_arch = ((2, 64), (2, 128), (3, 256), (3, 512), (3, 512))

I finally was able to train VGG, but I had to remove one of the 4096-unit Linear layers and replace it with a 512-unit one.
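For reference, here is a sketch of how those architecture tuples plug into the chapter’s `vgg_block` pattern — these `vgg_block`/`vgg` definitions are my re-typed versions, not necessarily identical to the book’s, and the 512-unit linear layers reflect the memory-saving tweak above:

```python
import torch
from torch import nn

def vgg_block(num_convs, in_channels, out_channels):
    """A block of `num_convs` 3x3 convs (padding 1) followed by 2x2 max-pool."""
    layers = []
    for _ in range(num_convs):
        layers.append(nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

def vgg(conv_arch, in_channels=1, num_classes=10):
    blocks = []
    for num_convs, out_channels in conv_arch:
        blocks.append(vgg_block(num_convs, in_channels, out_channels))
        in_channels = out_channels
    return nn.Sequential(
        *blocks, nn.Flatten(),
        # a 224x224 input gives a 7x7 feature map after the 5 max-pools
        nn.Linear(out_channels * 7 * 7, 512), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(512, 512), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(512, num_classes))

# VGG-19 per Table 1: 2+2+4+4+4 = 16 conv layers + 3 fully-connected = 19
VGG_19_arch = ((2, 64), (2, 128), (4, 256), (4, 512), (4, 512))
net = vgg(VGG_19_arch)
X = torch.randn(1, 1, 224, 224)
print(net(X).shape)  # torch.Size([1, 10])
```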