Hi, this is a great chapter on pooling, but I think it could be made more comprehensive by also stating the explicit formula for the output dimensions just as was done for convolutions, padding and strides. What do you think?
This probably wasn’t done because pooling simply collapses the channel dimension.
Hi! Thank you for this great book. I have a question from this section, namely from the third paragraph, where it reads:
For instance, if we take the image X with a sharp delineation between black and white and shift the whole image by one pixel to the right, i.e., Z[i, j] = X[i, j + 1], then the output for the new image Z might be vastly different.
Shouldn’t it be one pixel to the left? If we want to shift the whole image one pixel to the right, then the correct equation should be Z[i, j] = X[i, j - 1], right?
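A tiny NumPy check supports this reading (the 1×5 “image” and the zero-padding of the vacated column are my own illustration, not from the book):

```python
import numpy as np

# Hypothetical 1-row "image": a black-to-white edge starting at column 2.
X = np.array([[0, 0, 1, 1, 1]])

# Z[i, j] = X[i, j - 1] copies each pixel one column to the RIGHT,
# so the edge moves right (the vacated left column is padded with 0 here).
Z = np.zeros_like(X)
Z[:, 1:] = X[:, :-1]

print(X)  # [[0 0 1 1 1]]
print(Z)  # [[0 0 0 1 1]]  -- the edge has shifted one pixel to the right
```

With Z[i, j] = X[i, j + 1] the edge would instead move one pixel to the left, which is what the question is pointing out.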
This is not true. The pooling operation is applied to each channel separately, resulting in a tensor with the same number of channels as the input tensor, as described in the book.
At each step of the pooling operation, the information contained in the pooling window (i.e. the values of all the pixels inside the window) is “collapsed” into a single pixel. This is similar to what a convolution layer does, with a single difference: pooling works on each channel separately, thus preserving the number of channels in the output (while a convolution layer with a 3-dimensional kernel sums over all the channels and “collapses” the channel dimension in the output). Thus, the formula you’re looking for is the same formula that was introduced for the convolutional layer, but with the kernel shape replaced by the shape of the pooling window and the channel dimension kept.
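A minimal sketch of that formula, with a hand-rolled max pooling to verify it (the helper names and the no-padding, plain-stride setting are my own assumptions for illustration):

```python
import numpy as np

# Output size per spatial dimension: same formula as for a convolution
# window, floor((n - window) / stride) + 1 (no padding assumed here).
def pool_out_dim(n, window, stride):
    return (n - window) // stride + 1

def max_pool2d(X, window, stride):
    """Max pooling over a (channels, h, w) array, channel by channel."""
    c, h, w = X.shape
    oh = pool_out_dim(h, window, stride)
    ow = pool_out_dim(w, window, stride)
    Y = np.empty((c, oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Each window is reduced to one value per channel, so the
            # channel dimension is preserved rather than collapsed.
            Y[:, i, j] = X[:, i * stride:i * stride + window,
                              j * stride:j * stride + window].max(axis=(1, 2))
    return Y

X = np.arange(3 * 4 * 4).reshape(3, 4, 4)
Y = max_pool2d(X, window=2, stride=2)
print(Y.shape)  # (3, 2, 2): spatial dims halved, channels unchanged
```

Padding would enter the formula exactly as it does for convolutions, as floor((n + 2*pad - window) / stride) + 1.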
Yes, I’m aware. “Reduce” might have been better semantics than “collapse”. Probably good that we’re clarifying this, however.