Concise Implementation for Multiple GPUs

https://d2l.ai/chapter_computational-performance/multiple-gpus-concise.html

X, y = X.to(devices[0]), y.to(devices[0])
In the code, we send the data to device 0. In that case, how is each batch divided among the different GPUs for data parallelism?

In the concise PyTorch implementation, we use nn.DataParallel.
It splits the input across the specified devices by chunking in the batch dimension (other objects will be copied once per device). In the forward pass, the module is replicated on each device, and each replica handles a portion of the input. During the backwards pass, gradients from each replica are summed into the original module.
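
A minimal sketch of the behaviour described above, not the book's own code: the toy network, the batch size, and the names `device_ids`, `main_device` and `out_shape` are made up for illustration. It assumes PyTorch is installed and falls back to plain CPU execution when no GPU is visible.

```python
import torch
from torch import nn

# Hypothetical toy model and batch, chosen only for illustration.
net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

# Indices of the visible GPUs; an empty list on a CPU-only machine.
device_ids = list(range(torch.cuda.device_count()))

if device_ids:
    # The wrapped module's parameters must live on device_ids[0].
    main_device = torch.device(f"cuda:{device_ids[0]}")
    net = nn.DataParallel(net.to(main_device), device_ids=device_ids)
else:
    # No GPU available: run the plain module on the CPU.
    main_device = torch.device("cpu")

# The whole batch goes to the main device, as in the snippet quoted above.
X = torch.randn(32, 8).to(main_device)
y = torch.randint(0, 4, (32,)).to(main_device)

# With DataParallel, X is chunked along dim 0, one chunk per GPU,
# and the per-replica outputs are gathered back on the main device.
out = net(X)
out_shape = tuple(out.shape)
print(out_shape)
```

So sending `X` and `y` to device 0 is only the staging step; the split across GPUs happens inside the call to `net(X)`.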

In the PyTorch implementation, if multiple devices (GPUs) are used, the loss l will be an array of losses (one per device), so l.backward() will raise an error. We would need to call l.sum().backward() or l.mean().backward() instead. Am I missing something?
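
One way to check this is to look at the shape of `l` before calling `backward()`. The sketch below is an assumption-laden stand-in, not the book's training loop: `logits` is a made-up tensor playing the role of the network outputs after nn.DataParallel has gathered them on the main device, and it runs on the CPU. It shows that the default `reduction='mean'` gives a scalar loss, while `reduction='none'` gives one loss per example, which does need `.sum()` or `.mean()` first.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for the gathered outputs of a wrapped network (hypothetical sizes).
logits = torch.randn(6, 3, requires_grad=True)
y = torch.randint(0, 3, (6,))

# Default reduction='mean': the loss is a 0-dim (scalar) tensor.
scalar_loss = nn.CrossEntropyLoss()(logits, y)
scalar_loss.backward()  # works without any extra reduction
logits.grad = None      # reset before the second experiment

# reduction='none': one loss value per example, shape (6,).
per_example = nn.CrossEntropyLoss(reduction='none')(logits, y)
try:
    per_example.backward()  # non-scalar, so autograd refuses
    raised = False
except RuntimeError:
    raised = True

# Reducing to a scalar first makes backward() valid again.
per_example.sum().backward()
print(scalar_loss.dim(), tuple(per_example.shape), raised)
```

If the loss were instead computed inside the wrapped module's forward, each replica would return its own scalar and the gathered result would be a vector with one entry per GPU, in which case a reduction before `backward()` would be required.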