Anchor Boxes

https://d2l.ai/chapter_computer-vision/anchor.html

Can anyone please explain this statement in more detail:

Therefore, we are usually only interested in a combination containing the size $s_1$ or the aspect ratio $r_1$, that is, $(s_1, r_1), (s_1, r_2), \ldots, (s_1, r_m), (s_2, r_1), (s_3, r_1), \ldots, (s_n, r_1)$.

And this…

That is, the number of anchor boxes centered on the same pixel is $n+m-1$.

Thanks for your time.

Hey @Wolf_Rage, great question!

It can be interpreted as

Therefore, we only cross-combine the size $s_1$ with all the ratios $\{r_1, r_2, \ldots, r_m\}$, and cross-combine the ratio $r_1$ with all the sizes $\{s_1, s_2, \ldots, s_n\}$, that is, $(s_1, r_1), (s_1, r_2), \ldots, (s_1, r_m), (s_2, r_1), (s_3, r_1), \ldots, (s_n, r_1)$.

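Hence, we have $n+m-1$ combinations in total: $m$ pairs containing $s_1$, plus $n-1$ further pairs containing $r_1$ (the pair $(s_1, r_1)$ is counted only once). Does it make sense to you now?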
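A minimal Python sketch of this enumeration; the function name `anchor_combinations` and the example sizes and ratios are illustrative assumptions, not code from the chapter:

```python
def anchor_combinations(sizes, ratios):
    """Return the (size, ratio) pairs used for anchors centered on one pixel:
    s1 paired with every ratio, plus every other size paired with r1."""
    s1, r1 = sizes[0], ratios[0]
    with_s1 = [(s1, r) for r in ratios]     # (s1, r1), ..., (s1, rm): m pairs
    with_r1 = [(s, r1) for s in sizes[1:]]  # (s2, r1), ..., (sn, r1): n - 1 pairs
    return with_s1 + with_r1


sizes = [0.75, 0.5, 0.25]  # n = 3 (example values)
ratios = [1, 2, 0.5]       # m = 3 (example values)
combos = anchor_combinations(sizes, ratios)
print(combos)  # [(0.75, 1), (0.75, 2), (0.75, 0.5), (0.5, 1), (0.25, 1)]
print(len(combos) == len(sizes) + len(ratios) - 1)  # True
```

Starting the second list at `sizes[1:]` is what avoids counting $(s_1, r_1)$ twice.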


Hello, I have a question about the way anchors are labeled. It seems we need a loop to generate labels in the step shown in Figure 13.4.2, and it might be slow. I found another way to generate labels: for each anchor, just find the largest IoU and check whether it is larger than the threshold. What's the difference between the two ways?
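To make the comparison concrete, here is a hedged NumPy sketch of both schemes operating on a precomputed IoU matrix of shape (num_anchors, num_gt_boxes). The helper names `assign_threshold_only` and `assign_greedy` are hypothetical, and the greedy version assumes there are at least as many anchors as ground-truth boxes:

```python
import numpy as np

def assign_threshold_only(iou, threshold=0.5):
    """Each anchor takes its best ground-truth box if that IoU is
    >= threshold; otherwise it is labeled -1 (background)."""
    best = iou.argmax(axis=1)
    return np.where(iou.max(axis=1) >= threshold, best, -1)

def assign_greedy(iou, threshold=0.5):
    """Threshold step plus greedy matching in the spirit of Fig. 13.4.2:
    repeatedly take the largest remaining IoU, assign that anchor to that
    ground-truth box, then discard its row and column. The loop runs once
    per ground-truth box, not once per anchor."""
    labels = assign_threshold_only(iou, threshold)
    iou = iou.astype(float)  # astype returns a copy, so the input is untouched
    for _ in range(iou.shape[1]):
        anc, gt = np.unravel_index(np.argmax(iou), iou.shape)
        labels[anc] = gt
        iou[anc, :] = -1  # discard this anchor's row
        iou[:, gt] = -1   # discard this ground-truth box's column
    return labels

# Example IoUs: 3 anchors x 2 ground-truth boxes (made-up numbers)
iou = np.array([[0.6, 0.1],
                [0.2, 0.3],
                [0.1, 0.2]])
print(assign_threshold_only(iou).tolist())  # [0, -1, -1]
print(assign_greedy(iou).tolist())          # [0, 1, -1]
```

In this toy case the threshold-only scheme leaves ground-truth box 1 without any anchor, because its best IoU (0.3) is below 0.5. The greedy step guarantees every ground-truth box is matched to at least one anchor, and since the loop only runs once per ground-truth box, it is usually cheap.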

Yes, now it totally makes sense. Thanks a lot.

Assume the size is s∈(0,1]

What is meant by the size 's' of anchor boxes as stated in Section 13.4.1? How should I visualize it in my head? I can understand that a box has a height and a width, but what does this number called size denote?

Thank You.

Hi @Aman_Singh, great question! As stated in Section 13.4.1,

Assume the size is π‘ βˆˆ(0,1], the aspect ratio is π‘Ÿ>0, and the width and height of the anchor box are π‘€π‘ βˆšπ‘Ÿand β„Žπ‘ /βˆšπ‘Ÿ, respectively.

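You can think of the size $s$ as a scaling factor applied to the width and height of the input image: with $r = 1$, the anchor box is $ws$ wide and $hs$ tall, so $s = 0.5$ gives a box half as wide and half as tall as the image.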
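A small sketch of the quoted formula, assuming an example image of 728 × 561 pixels (hypothetical values); `anchor_wh` is an illustrative helper, not a function from the chapter:

```python
import math

def anchor_wh(w, h, s, r):
    """Width and height (in pixels) of an anchor box with size s and aspect
    ratio r on a w x h image, using width = w*s*sqrt(r), height = h*s/sqrt(r)."""
    return w * s * math.sqrt(r), h * s / math.sqrt(r)

w, h = 728, 561  # example image width and height in pixels
for s, r in [(0.75, 1), (0.5, 1), (0.5, 2)]:
    bw, bh = anchor_wh(w, h, s, r)
    print(f"s={s}, r={r}: {bw:.1f} x {bh:.1f} pixels")
```

With $r = 1$ the size simply shrinks both image sides by the factor $s$; changing $r$ then stretches the width by $\sqrt{r}$ and shrinks the height by the same factor.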

Got it. Thanks a lot !!

Hi, in training mode it is not hard to assign a category label (by Fig. 13.4.2) and an offset value (by Eq. 13.4.3) to an anchor box, because we have the ground truth. But how do we set the category label and offset value in prediction mode, when we don't have a ground-truth bounding box? We still need the category label and offset value to compute the class probability for each anchor box. The example given for prediction mode is trivial, since it sets all of the anchor boxes as the predicted bounding boxes. Thank you.

Hey @rezahabibi96, great question! If I understand correctly, you are asking how to deal with having no "ground truth" labels for the training set? In that case, the most efficient way is to label some images manually and then apply transfer learning. If it is a commonly seen object, you only need ~100 labels with transfer learning.

No, I am not asking about annotating the training set. I am asking about prediction mode: how are the offset values calculated and the category labels assigned in prediction mode?

Check the YOLO section here.
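A hedged sketch of the prediction side: at prediction time the offsets and class probabilities are not computed from ground truth; the network predicts them for every anchor, and the predicted offsets are applied to the anchors by inverting the offset transform of Eq. 13.4.3. This sketch assumes center-format boxes (x, y, w, h) and the default constants $\mu = 0$, $\sigma_x = \sigma_y = 0.1$, $\sigma_w = \sigma_h = 0.2$; `decode_offsets` is a hypothetical helper, and the "predictions" are made-up numbers standing in for network output:

```python
import numpy as np

def decode_offsets(anchors, offsets, sigma_xy=0.1, sigma_wh=0.2):
    """Invert the offset transform (assuming mu = 0). anchors and offsets
    are (N, 4) arrays in center format (x, y, w, h); returns predicted boxes."""
    xy = anchors[:, :2] + offsets[:, :2] * sigma_xy * anchors[:, 2:]
    wh = anchors[:, 2:] * np.exp(offsets[:, 2:] * sigma_wh)
    return np.concatenate([xy, wh], axis=1)

# Two anchors (center x, center y, width, height) in normalized coordinates
anchors = np.array([[0.5, 0.5, 0.4, 0.4],
                    [0.3, 0.3, 0.2, 0.2]])
# Made-up network outputs: per-anchor offsets and class probabilities
# (column 0 = background, columns 1-2 = object classes)
pred_offsets = np.array([[1.0, 0.0, 0.0, 0.0],
                         [0.0, 0.0, 0.0, 0.0]])
pred_probs = np.array([[0.1, 0.8, 0.1],
                       [0.7, 0.2, 0.1]])

boxes = decode_offsets(anchors, pred_offsets)
classes = pred_probs.argmax(axis=1)  # index 0 means background
print(boxes)
print(classes.tolist())  # [1, 0]
```

With all-zero predicted offsets (the second anchor here) the decoded box is exactly the anchor, which matches the trivial case described in the question. In the chapter the decoded boxes are then filtered with non-maximum suppression.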

I’m not sure I understand the following line of code:

X = np.random.uniform(size=(1, 3, h, w))  # Construct input data

What exactly are the input data, and why is Y not affected by the first two dimensions?

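Hi @tomsoya, great question. In this line, we generate random placeholder values; only the shape of X matters, since it determines the size of the input and hence of the output. X should be a 4D tensor (i.e., a batch of RGB images) with shape (batch_size, RGB_channels, height, width). The anchors depend only on the last two dimensions (height and width), which is why Y is not affected by the batch size or the number of channels.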
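To illustrate why only the last two dimensions matter, here is a minimal NumPy anchor generator that reads nothing but the height and width from the input's shape. `make_anchors` is a hypothetical name, not the chapter's `multibox_prior`, though it follows the same idea (normalized (xmin, ymin, xmax, ymax) coordinates and the $n+m-1$ combination scheme); the sizes and ratios are example values:

```python
import numpy as np

def make_anchors(X, sizes, ratios):
    """Return anchors of shape (1, h*w*(n+m-1), 4) as normalized
    (xmin, ymin, xmax, ymax). Only X.shape[-2:] is used; the batch and
    channel dimensions (and the pixel values) are ignored."""
    h, w = X.shape[-2:]
    # (size, ratio) pairs: s1 with every ratio, every other size with r1
    pairs = [(sizes[0], r) for r in ratios] + [(s, ratios[0]) for s in sizes[1:]]
    # Half-widths/heights in normalized units; the h / w factor keeps the
    # box aspect ratio equal to r on a non-square image (as MXNet's multibox_prior does)
    half_w = np.array([s * np.sqrt(r) * h / w for s, r in pairs]) / 2
    half_h = np.array([s / np.sqrt(r) for s, r in pairs]) / 2
    # Pixel centers in normalized coordinates
    cy, cx = np.meshgrid((np.arange(h) + 0.5) / h, (np.arange(w) + 0.5) / w,
                         indexing="ij")
    cx, cy = cx.reshape(-1, 1), cy.reshape(-1, 1)  # (h*w, 1) each
    # Broadcasting (h*w, 1) against (k,) gives (h*w, k); stacking adds the 4 coords
    boxes = np.stack([cx - half_w, cy - half_h, cx + half_w, cy + half_h], axis=-1)
    return boxes.reshape(1, -1, 4)

sizes, ratios = [0.75, 0.5, 0.25], [1, 2, 0.5]  # example values
h, w = 4, 6
for shape in [(1, 3, h, w), (8, 1, h, w)]:
    X = np.random.uniform(size=shape)  # placeholder input; values are unused
    print(shape, "->", make_anchors(X, sizes, ratios).shape)
# (1, 3, 4, 6) -> (1, 120, 4)
# (8, 1, 4, 6) -> (1, 120, 4)
```

Both inputs yield 4 × 6 × 5 = 120 anchors, regardless of batch size or channel count.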

Let me know if that makes sense to you.

I understand. Thank you!

How do we do exercise 2 in this section?

In Section 13.4.1, I understand that

  • $w$ is the width of the image
  • $h$ is the height of the image
  • $r$ is the aspect ratio of the anchor box
  • $s$ is the size of the anchor box

If this were true, and the width and height of the anchor box are computed as described in the text by

  • $w_b = ws\sqrt{r}$
  • $h_b = hs/\sqrt{r}$

then we would get

  • $r = w_b/h_b = r \cdot w/h$

which is obviously wrong unless $w = h$. With the definitions above, the width and height of the anchor box would have to be computed by

  • $g := \sqrt{wh}$
  • $w_b = gs\sqrt{r}$
  • $h_b = gs/\sqrt{r}$

So the definitions of $s$ or $r$ are probably not what I assume. How exactly are they defined?
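A quick numeric check of this derivation, using hypothetical values for a 728 × 561 pixel image: with the text's formulas the box's width-to-height ratio comes out as $r \cdot w/h$, whereas the geometric-mean variant with $g = \sqrt{wh}$ gives exactly $r$ and a box area of $s^2 wh$:

```python
import math

w, h = 728, 561  # example image width and height (pixels)
s, r = 0.5, 2.0  # example size and aspect ratio

# Formulas as written in the text
wb_text, hb_text = w * s * math.sqrt(r), h * s / math.sqrt(r)
# Variant proposed above, with g the geometric mean of w and h
g = math.sqrt(w * h)
wb_g, hb_g = g * s * math.sqrt(r), g * s / math.sqrt(r)

print(math.isclose(wb_text / hb_text, r * w / h))  # True: ratio is r * w / h, not r
print(math.isclose(wb_g / hb_g, r))                # True: ratio is exactly r
print(math.isclose(wb_g * hb_g, s * s * w * h))    # True: area is s^2 of the image area
```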

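From https://github.com/apache/incubator-mxnet/blob/master/src/operator/contrib/multibox_prior.cc
I worked out how the height and width of the bounding boxes are computed. Here it is as Python code for readability:

import numpy as np
s, r, h, w = 0.5, 2.0, 561, 728    # example size, aspect ratio, image height and width
hb_factor = s / np.sqrt(r)          # normalized height of the box
hb = hb_factor * h                  # height of bounding box in pixels
wb_factor = s * np.sqrt(r) * h / w  # normalized width, rescaled by h / w
wb = wb_factor * w                  # width of bounding box in pixels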

So the width of the bounding box is computed as $w_b = w \cdot s \cdot (h/w) \cdot \sqrt{r} = hs\sqrt{r}$. This contradicts the text, which gives $w_b = ws\sqrt{r}$.
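Expanding the MXNet formulas: $w_b = hs\sqrt{r}$ and $h_b = hs/\sqrt{r}$, so $w_b/h_b = r$ for any image shape, and the size $s$ is effectively measured relative to the image height ($\sqrt{w_b h_b} = hs$). A short check with hypothetical example values:

```python
import numpy as np

w, h = 728, 561  # example image width and height (pixels)
s, r = 0.5, 2.0  # example size and aspect ratio

# Formulas transcribed from the MXNet-based snippet
hb = s / np.sqrt(r) * h
wb = s * np.sqrt(r) * h / w * w

print(np.isclose(wb / hb, r))               # True: aspect ratio no longer depends on w / h
print(np.isclose(np.sqrt(wb * hb), s * h))  # True: size is relative to the image height
```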

Do you compare the pixels of both boxes when determining the intersection of a bounding box and an anchor box? E.g., do you compare the pixel value at position (x, y) in an anchor box with the pixel value at position (i, j) in a bounding box within the intersection region?

Is intersection the same as overlapping, i.e., the two boxes show exactly the same picture in that portion?

Let's say I take two pictures of one cat from different angles. What is the logic or criterion, mathematically, for telling which parts of these two pictures overlap?

The formula and calculation seem simple; however, I am puzzled about how to determine the overlapping portion.

Thanks!
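On the intersection question: when computing IoU (the Jaccard index) between an anchor box and a bounding box on the same image, no pixel values are compared; the overlap is computed purely from the box coordinates, so both boxes must be expressed in the same image's coordinate system. A minimal sketch, assuming corner-format boxes (xmin, ymin, xmax, ymax); `box_iou` here is a hypothetical standalone helper:

```python
def box_iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax) in the same image
    coordinates. Only coordinates are used, never pixel values."""
    # Intersection rectangle: overlap of the x-ranges times overlap of the y-ranges
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

anchor = (0, 0, 4, 4)
gt_box = (2, 2, 6, 6)
print(box_iou(anchor, gt_box))  # 4 / (16 + 16 - 4) = 1/7, about 0.1429
```

Here "overlap" means geometric overlap of the two rectangles on one image, regardless of what the pixels inside them look like.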

Thanks for providing this additional explanation. I think it should be included in the chapter itself :smiley: