Thank you, good explanation.

I was just wondering why the description doesn't mention that skip connections diminish the vanishing-gradient problem (similar to highway networks from Schmidhuber). Or have I overlooked something? Nevertheless: great book.

@chris_elgoog I believe you bring up a fair point here.

Although vanishing gradients aren't the authors' primary concern, the issue is indeed mentioned (along with the highway networks you refer to) in the *Shortcut Connections* section of their paper. I'm guessing here, but perhaps they felt that ReLU activations dealt with vanishing gradients sufficiently well that it was no longer a primary concern by that point?

For the curious, the primary concern the authors are addressing is the *degradation problem*:

“When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error […]”

How should I understand the statement "If the identity mapping f(x)=x is the desired underlying mapping, the residual mapping is easier to learn: we only need to push the weights and biases of the upper weight layer within the dotted-line box to zero"?

I think we can view the process like this:

If we add the block in the red box to our big model but find that the loss becomes larger than before, what happens in the backward pass? Backpropagation will push the parameters of the red box toward smaller values. Remember dropout: if we want to drop some neurons, we set their outputs to zero. In the same way, we can push the parameters (w, b) of the block to zero, so its output becomes zero too, and the red-box block contributes nothing; the skip connection just passes x through unchanged.

So just as we can prune individual neurons, we can also prune whole parts of the model.
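The "push the weights to zero" idea can be checked directly. Below is a minimal sketch (a hypothetical toy block using linear layers instead of the paper's conv layers, in PyTorch) showing that once the upper weight layer's parameters are zeroed, the residual branch outputs zero and the block computes exactly the identity:

```python
import torch
import torch.nn as nn

class ToyResidualBlock(nn.Module):
    """Toy residual block: output = F(x) + x.
    F is two weight layers with a ReLU in between (a sketch,
    not the book's or the paper's exact architecture)."""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)  # the "upper weight layer" in the quote

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x))) + x

block = ToyResidualBlock(4)
# Push the upper layer's weights and biases to zero...
with torch.no_grad():
    block.fc2.weight.zero_()
    block.fc2.bias.zero_()
# ...so F(x) = 0 and the block reduces to the identity mapping.
x = torch.randn(2, 4)
assert torch.allclose(block(x), x)
```

In other words, the block only has to learn the *deviation* from the identity; doing nothing (all-zero weights in the last layer) already recovers f(x) = x.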

The ResNet paper you reference (He et al., 2016a) uses softmax after the penultimate linear layer, and they use Kaiming He weight initialization. I got the impression that you were replicating the architecture described in the paper, but it seems your code does not softmax the final output and that the weight initialization is Xavier.

Great catch! Yes, we did make a small modification as a best practice. Most researchers now prefer Xavier initialization since it is more stable.
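For reference, here is a short sketch of how either scheme can be applied to a network in PyTorch (the network and helper names are my own, not from the book's code; `nn.init.xavier_uniform_` and `nn.init.kaiming_normal_` are the standard PyTorch initializers):

```python
import torch.nn as nn

# A hypothetical small network just to demonstrate the initializers.
net = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 2))

def init_xavier(m):
    # Xavier/Glorot: variance scaled by fan_in + fan_out.
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

def init_kaiming(m):
    # Kaiming/He: variance scaled for ReLU nonlinearities (used in the paper).
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        nn.init.zeros_(m.bias)

net.apply(init_xavier)   # or: net.apply(init_kaiming)
```

Xavier uniform draws weights from [-b, b] with b = sqrt(6 / (fan_in + fan_out)), while Kaiming scales by fan_in alone with a gain for ReLU, so for ReLU-heavy nets the two mostly differ by a constant factor.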

The residual block in Fig. 7.6.2 shows the network learning F(x) - x, but the original paper shows it the other way around. Am I missing something?