It is so quick and it only spent <5 mins to train on my cpu.
loss 0.246, train acc 0.909, test acc 0.878
3313.6 examples/sec on cpu
Exercises and mysilly answers
 Can we remove the bias parameter from the fullyconnected layer or the convolutional layer
before the batch normalization? Why?
 Since batchnorm already includes bias, and also shift by mean any constant would be cancelled out therefore, it makes no sense to add bias.

Compare the learning rates for LeNet with and without batch normalization.

Plot the increase in training and test accuracy.
 with batchnorm the accuracy = 0.962 at lr 1.0
 without batchnorm the accuracy 0.931 at lr 0.1

How large can you make the learning rate?
 tried with 1.0

Do we need batch normalization in every layer? Experiment with it?
 okay. tried it is surprisingly giving accuracy of 94.6

Can you replace dropout by batch normalization? How does the behavior change?
In the Ioffe and Szegedy 2015, the authors state that â€śwe would like to ensure that for any parameter values, the network always produces activations with the desired distributionâ€ť. So the Batch Normalization Layer is actually inserted right after a Conv Layer/Fully Connected Layer, but before feeding into ReLu (or any other kinds of) activation. See this video at around time 53 min for more details.
As far as dropout goes, I believe dropout is applied after activation layer. In the dropout paper figure 3b, the dropout factor/probability matrix r(l) for hidden layer l is applied to it on y(l), where y(l) is the result after applying activation function f.
So in summary, the order of using batch normalization and dropout is:
> CONV/FC > BatchNorm > ReLu(or other activation) > Dropout > CONV/FC >

Fix the parameters beta and gamma, and observe and analyze the results.
 okay. If it means fixing moving averages then I did it.

Review the online documentation for BatchNorm from the highlevel APIs to see the other
applications for batch normalization.
https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html
``Our method draws its strength from making normalization a part of the model architecture and performing the
normalization for each training minibatch. Batch Normalization allows us to use much higher learning rates and
be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout.
Applied to a stateoftheart image classification model,
Batch Normalization achieves the same accuracy with 14
times fewer training steps, and beats the original model
by a significant margin.```
 Research ideas: think of other normalization transforms that you can apply?
Can you apply the probability integral transform?
in probability theory, the probability integral transform (also known as universality of the uniform) relates to the result that data values that are modeled as being random variables from any given continuous distribution can be converted to random variables having a standard uniform distribution.[1] This holds exactly provided that the distribution being used is the true distribution of the random variables; if the distribution is one fitted to the data, the result will hold approximately in large samples.
How about a full rank covariance estimate?
 did not get it
yeah lenet is refreshingly easy to train.
Do we need to extract the attribute data
from two tensors moving_mean
and moving_var
as in the last line of the function batch_norm
? I understand that the attribute getter data
of a torch.nn.Parameter
object gives us its backbone torch.Tensor
, but in this case moving_mean
and moving_var
are already torch.Tensor
objects. In other words, shoudnâ€™t
return Y, moving_mean.data, moving_var.data
be simply
return Y, moving_mean, moving_var
How the scale parameter đťś¸ and shift parameter đťś· are being updated? Does it update the same way as conv layerâ€™s weights?
Yes, they are trainable parameters too, just like every other weight or bias in the network.
In the final section the author leave a few sections about analyzing the convergence and trainabillity of DNNâ€™s in the air.
I reacommend the book: Asymptotic Analysis and Peturbation Theory by William Paulsen.
This covers a range of tequniques for constructing such proofs.