Hi, this is a great chapter on pooling, but I think it could be made more comprehensive by also stating the explicit formula for the output dimensions just as was done for convolutions, padding and strides. What do you think?

Hi @wilfreddesert, we would like to hear more about your feedback. Could you post a suggestion/PR on how we should improve it? Please feel free to contribute!

This probably wasn’t done b/c pooling simply collapses the channel dimension

Hi! Thank you for this great book. I have a question from this section, namely from the third paragraph, where it reads:

For instance, if we take the image X with a sharp delineation between black and white and shift the whole image by one pixel to the right, i.e., Z[i, j] = X[i, j + 1] , then the output for the new image Z might be vastly different.

Shouldn’t it be one pixel to the left? If we want to shift the whole image one pixel to the right, then the correct equation should be Z[i, j] = X[i, j - 1], right?
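A quick way to check the direction is to apply the indexing rule to a toy image. The sketch below is only an illustration: the helper name `shift_columns`, the 1×5 image, and the choice to fill the column that has no source pixel with 0 are all assumptions, not something taken from the book.

```python
# Toy 1x5 "image" with a single bright pixel in column 2 (illustrative values).
X = [[0, 0, 1, 0, 0]]

def shift_columns(X, fill=0):
    """Build Z with Z[i][j] = X[i][j + 1].

    The last column of Z has no source pixel in X, so it is set to `fill`
    (an assumption; the book does not say how the blank is handled).
    """
    h, w = len(X), len(X[0])
    return [[X[i][j + 1] if j + 1 < w else fill for j in range(w)]
            for i in range(h)]

Z = shift_columns(X)
print(Z)  # [[0, 1, 0, 0, 0]]
```

With this rule the bright pixel moves from column 2 to column 1, so the image content moves to the left, which matches the reading in the question above.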

This is not true. The pooling operation is applied to each channel separately, resulting in a tensor with the same number of channels as the input tensor, as described in the book.

At each step of the pooling operation, the information contained in the pooling window (i.e. the values of all the pixels inside the window) is “collapsed” in a single pixel; this is similar to what a convolution layer does, with a single difference: the pooling works on each channel separately, thus preserving the number of channels in the output (while the convolution layer with a 3-dimensional kernel sums over all the channels and “collapses” the channel dimension in the output). Thus, the formula you’re looking for is the same formula that was introduced when dealing with a convolutional layer, but replacing the kernel shape with the shape of the pooling window and keeping the channel dimension.
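To make the formula explicit, here is a minimal sketch of the output-shape computation. The function name `pool_output_shape` is hypothetical, and two conventions are assumed: `padding` counts the rows/columns added on each side, and the stride defaults to the window shape when it is not given (as common framework pooling layers do).

```python
def pool_output_shape(c, h, w, window, stride=None, padding=(0, 0)):
    """Return (c, h_out, w_out) for a 2D pooling layer.

    Assumptions: `padding` is per side, so the padded height is h + 2 * pad_h;
    the stride defaults to the window shape when omitted.
    """
    ph, pw = window
    sh, sw = stride if stride is not None else window
    pad_h, pad_w = padding
    # Same arithmetic as for a convolution, with the kernel shape
    # replaced by the pooling window shape.
    h_out = (h + 2 * pad_h - ph) // sh + 1
    w_out = (w + 2 * pad_w - pw) // sw + 1
    # Pooling acts on each channel separately, so c is unchanged.
    return (c, h_out, w_out)

print(pool_output_shape(1, 4, 4, (3, 3)))                                 # (1, 1, 1)
print(pool_output_shape(1, 4, 4, (3, 3), stride=(2, 2), padding=(1, 1)))  # (1, 2, 2)
print(pool_output_shape(2, 4, 4, (2, 3), stride=(2, 3), padding=(0, 1)))  # (2, 2, 2)
```

The channel count passes through untouched, which is the only difference from the convolution case.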

Yes, I’m aware. “Reduce” might have been better semantics than “collapse”. Probably good that we’re clarifying this however.

My opinion, just for reference:
I think “shift the whole image” here means shifting the image-capturing window rather than the image itself, since shifting the image itself by some pixels is somewhat hard to explain or imagine. If we did shift the image itself, what would we do with the blank pixels?

Exercises and my answers

  1. Can you implement average pooling as a special case of a convolution layer? If so, do it.

  2. Can you implement maximum pooling as a special case of a convolution layer? If so, do it.

  • It can be done, but we need to modify the convolution layer or find a special kernel. I couldn't, so I modified the operation itself, which I guess is not the right answer.

  3. What is the computational cost of the pooling layer? Assume that the input to the pooling layer is of size c×h×w, the pooling window has a shape of ph×pw with a padding of (ph, pw) and a stride of (sh, sw).

  • Ignoring padding, I expect it to be ⌊(h - ph + sh)/sh⌋ × ⌊(w - pw + sw)/sw⌋ × c × the time taken to find the max value in one window.
  4. Why do you expect maximum pooling and average pooling to work differently?
  • Because max pooling returns the maximum value among the neighbors, while average pooling takes all the neighbors into account. I expect max pooling to be faster.
  5. Do we need a separate minimum pooling layer? Can you replace it with another operation?
  • This seems like a trick question, but one way is to multiply X by -1, do max pooling, and then multiply the result by -1 again.
  6. Is there another operation between average and maximum pooling that you could consider (hint: recall the softmax)? Why might it not be so popular?

  • Taking the average of the logs and then computing the maximum value divided by the sum of the logs. It might be computationally intensive.
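For anyone who wants to check these answers numerically, here is a rough pure-Python sketch, not the book's solution. All helper names (`pool2d`, `corr2d`, `softmax_weighted`) are made up for this post, padding is ignored, and the softmax-weighted reduction is just one possible reading of the hint in the last exercise. It shows average pooling as a strided cross-correlation with a constant kernel, min pooling as negated max pooling, and a softmax-weighted value that lands between the average and the max. Max pooling is not a linear function of the window, so no single fixed kernel of this kind reproduces it.

```python
import math

def pool2d(X, ph, pw, sh, sw, reduce):
    """Apply `reduce` to every ph x pw window of the 2D list X (no padding)."""
    h, w = len(X), len(X[0])
    out_h = (h - ph) // sh + 1
    out_w = (w - pw) // sw + 1
    Y = [[0.0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            window = [X[r * sh + a][c * sw + b]
                      for a in range(ph) for b in range(pw)]
            Y[r][c] = reduce(window)
    return Y

def corr2d(X, K, sh, sw):
    """Strided cross-correlation of X with kernel K, built on pool2d."""
    flat_k = [k for row in K for k in row]
    dot = lambda window: sum(v * k for v, k in zip(window, flat_k))
    return pool2d(X, len(K), len(K[0]), sh, sw, dot)

def softmax_weighted(window):
    """Weighted mean of the window with softmax weights (one reading of the hint)."""
    m = max(window)  # subtract the max for numerical stability
    weights = [math.exp(v - m) for v in window]
    return sum(v * wt for v, wt in zip(window, weights)) / sum(weights)

X = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]

# Exercise 1: average pooling equals correlation with a constant 1/(ph*pw) kernel.
avg_direct = pool2d(X, 2, 2, 2, 2, lambda win: sum(win) / len(win))
avg_conv = corr2d(X, [[0.25, 0.25], [0.25, 0.25]], 2, 2)

# Exercise 5: min pooling as -maxpool(-X).
neg_X = [[-v for v in row] for row in X]
min_pool = [[-v for v in row] for row in pool2d(neg_X, 2, 2, 2, 2, max)]

# Exercise 6: a softmax-weighted value sits between the average and the max.
max_pool = pool2d(X, 2, 2, 2, 2, max)
soft_pool = pool2d(X, 2, 2, 2, 2, softmax_weighted)

print(avg_direct)  # [[2.5, 4.5], [10.5, 12.5]]
print(min_pool)    # [[0, 2], [8, 10]]
print(max_pool)    # [[5, 7], [13, 15]]
```

The softmax-weighted version needs an exponential per input value, which is one plausible reason it would be less popular than plain max or average pooling.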