==
The PyTorch adaptation of this section was initially contributed by @StevenJokess, as reviewed and revised by @anirudh. The PR may not show up on Git due to the suspended account of the former.

With this loss where reduction is ‘sum’, I think the model does not consider the data size(batch size) in gradient descent.

Isn’t it better to use loss = nn.BCEWithLogitsLoss(reduction='mean') with metric.add(update_D(X, Z, net_D, net_G, loss, trainer_D)* batch_size ,update_G(Z, net_D, net_G, loss, trainer_G)* batch_size, batch_size)

If the generator does a perfect job, then 𝐷(𝐱′)≈1D(x′)≈1 so the above loss near 0, which results the gradients are too small to make a good progress for the discriminator. So commonly we minimize the following loss: … which is just feed 𝐱′=𝐺(𝐳)x′=G(z) into the discriminator but giving label 𝑦=1y=1.

The sentences above look quite confusing to me… There might be some grammatical errors in there…
Could you please rewrite these so they look coherent and natural?

The equation states that we need the parameters of the generator that maximize the loss, and the parameters of the discriminator that minimize the loss. However in the loss plot we see that the discriminator loss increases, while the generator loss decreases. Can someone please clarify?

As the model gets trained, the discriminator loss decreases as it is increasingly being fooled by the generator. The generator loss decreases as the the generator outputs are increasingly being predicted as 1 (Loss_G = loss(D(G(x)), ones)), so we have smaller and smaller loss_G