I want to train SSD on multiple GPUs, so I modified part of the code.
The original code is:
device, net = d2l.try_gpu(), TinySSD(num_classes=1)
net = net.to(device)
After commenting out all of the original code, I changed it to:
net = TinySSD(num_classes=1)
devices = d2l.try_all_gpus()
net = nn.DataParallel(net, device_ids=devices)
I also adjusted how X and Y are moved to the device inside each training epoch:
X, Y = features.to(devices[0]), target.to(devices[0])
Running it raises the following error:
RuntimeError Traceback (most recent call last)
<ipython-input-50-edd9900ab59b> in <module>
14 X, Y = features.to(devices[0]), target.to(devices[0])
15 # Generate multiscale anchor boxes and predict the class and offset for each one
---> 16 anchors, cls_preds, bbox_preds = net(X)
17 # Label the class and offset of each anchor box
18 bbox_labels, bbox_masks, cls_labels = d2l.multibox_target(anchors, Y)
D:\Anaconda3\envs\chtorch\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
887 result = self._slow_forward(*input, **kwargs)
888 else:
--> 889 result = self.forward(*input, **kwargs)
890 for hook in itertools.chain(
891 _global_forward_hooks.values(),
D:\Anaconda3\envs\chtorch\lib\site-packages\torch\nn\parallel\data_parallel.py in forward(self, *inputs, **kwargs)
153 raise RuntimeError("module must have its parameters and buffers "
154 "on device {} (device_ids[0]) but found one of "
--> 155 "them on device: {}".format(self.src_device_obj, t.device))
156
157 inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu
How should I fix this? Thanks, everyone!
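The error message says the model's parameters are still on the CPU: the modified code no longer calls `net.to(...)`, and `nn.DataParallel` expects the module to already be on `device_ids[0]`. Below is a minimal sketch of one likely fix, not a tested patch for the SSD notebook. The helper name `to_data_parallel` is hypothetical, and `nn.Linear` is only a stand-in for `TinySSD` so that the snippet runs on its own (it falls back to the CPU when no GPU is present).

```python
import torch
from torch import nn


def to_data_parallel(net):
    """Hypothetical helper: move `net` to the first GPU, then wrap it.

    Returns the (possibly wrapped) model and the list of devices used.
    Falls back to the plain CPU model when no GPU is available.
    """
    n_gpus = torch.cuda.device_count()
    if n_gpus == 0:
        return net, [torch.device('cpu')]
    devices = [torch.device(f'cuda:{i}') for i in range(n_gpus)]
    # Parameters and buffers must live on device_ids[0] before forward runs.
    net = net.to(devices[0])
    return nn.DataParallel(net, device_ids=devices), devices


# Stand-in model; in the notebook this would be TinySSD(num_classes=1).
net, devices = to_data_parallel(nn.Linear(4, 2))

# Inputs go to the first device, as in the training loop above.
X = torch.randn(8, 4).to(devices[0])
out = net(X)
```

In the notebook itself, the equivalent change would presumably be `net = nn.DataParallel(net, device_ids=devices).to(devices[0])`. One more thing to watch, which I have not verified on the SSD code: `DataParallel` concatenates each output along dimension 0, so `anchors` may come back with one copy per GPU instead of a leading dimension of 1. If so, passing `anchors[:1]` to `d2l.multibox_target` would likely be needed.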
I have a question.
anchors, cls_preds, bbox_preds = net(X)
In this line, anchors is the result of merging the anchor boxes output by the different layers across the batch.
def multibox_target(anchors, labels):
    …
    for i in range(batch_size):
        label = labels[i, :, :]
        anchors_bbox_map = assign_anchor_to_bbox(label[:, 1:], anchors,
                                                 device)
        …
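But when label and anchors are matched, one image's label seems to be matched against all of the anchor boxes in the batch? As I understand it, only the anchor boxes that belong to that image should take part in the matching. Why is it done this way?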
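A possible explanation, as far as I can tell from the d2l code: the anchor boxes are generated from the spatial size of each feature map only, not from the image content, so `anchors` has a leading dimension of 1 and is merged across layers but not across images. Every image in the batch shares the same set of anchors, which is why each `label` is matched against the whole `anchors` tensor. The sketch below illustrates the idea with a hypothetical, heavily simplified stand-in for `d2l.multibox_prior` (one fixed-size box per feature-map pixel); it is not the library's implementation.

```python
import torch


def toy_anchors(feature_map, half_size=0.1):
    """Hypothetical simplified anchor generator: one square box per pixel.

    Only the spatial size (h, w) of `feature_map` is used, so the result
    does not depend on the batch size or on the pixel values.
    """
    h, w = feature_map.shape[-2:]
    ys = (torch.arange(h, dtype=torch.float32) + 0.5) / h  # center rows
    xs = (torch.arange(w, dtype=torch.float32) + 0.5) / w  # center columns
    cy = ys.repeat_interleave(w)  # row-major order: each y repeated w times
    cx = xs.repeat(h)             # xs tiled once per row
    boxes = torch.stack(
        [cx - half_size, cy - half_size, cx + half_size, cy + half_size],
        dim=1)
    return boxes.unsqueeze(0)     # shape: (1, h * w, 4)


# Two batches with different sizes and contents, same feature-map size.
a = toy_anchors(torch.zeros(2, 3, 4, 4))
b = toy_anchors(torch.rand(8, 3, 4, 4))
print(a.shape, torch.equal(a, b))
```

Under that assumption, what differs per image is only `labels[i]`; the anchors being matched are the same set each time, so there are no "anchors of another image" in the tensor.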