I’m a bit confused as to how the cost of ring synchronization becomes (n-1)/n. From what I understand, if you aggregate chunks, you need a full pass of n-1 steps across all GPUs to aggregate a specific chunk. Could you please clarify?
I think the key point is that every GPU sends a chunk to the next GPU ‘simultaneously.’
Thus, it suffices to consider the overhead of one GPU.
‘Each chunk is of size 1/n’
Each GPU sends a chunk n-1 times.
Thus, the time is O((n-1)/n) of the cost of sending the full gradient, where n is the number of GPUs.
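To make the counting above concrete, here is a small sketch. It assumes the gradient is split into n equal chunks and that all GPUs send at the same time, so one GPU's traffic determines the wall-clock cost. The function name `ring_pass_cost` is just made up for illustration.

```python
def ring_pass_cost(n, total=1.0):
    """Data one GPU sends during a single pass around a ring of n GPUs.

    Assumes `total` (the full gradient) is split into n equal chunks and that
    all GPUs send simultaneously, so one GPU's traffic sets the elapsed time.
    """
    chunk = total / n      # size of one chunk: 1/n of the gradient
    steps = n - 1          # each GPU sends a chunk n - 1 times
    return steps * chunk   # (n - 1) / n of the full gradient


for n in (2, 4, 8, 64):
    print(n, ring_pass_cost(n))
```

The result approaches 1 as n grows, so under these assumptions the cost of a pass is essentially independent of the number of GPUs.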
I am a bit confused about the ring synchronization algorithm. After (n-1)/n time, each GPU should hold only one fully aggregated chunk, not the full gradient. Is that right?
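One way to check this is to simulate it. The sketch below assumes the commonly described two-phase form of ring all-reduce (a reduce-scatter pass followed by an all-gather pass); the chapter may describe it differently. To keep it small, each GPU's gradient has exactly n numbers, one per chunk, and `ring_allreduce` is a hypothetical helper, not a library function.

```python
def ring_allreduce(grads):
    """Simulate ring all-reduce on per-GPU gradients (each a list of n numbers).

    Returns (state after the reduce-scatter pass, state after the all-gather pass).
    """
    n = len(grads)
    buf = [list(g) for g in grads]  # buf[i][c]: GPU i's copy of chunk c

    # Pass 1 (reduce-scatter): n - 1 steps. In step s, GPU i sends chunk
    # (i - s) % n to GPU (i + 1) % n, which adds it to its own copy.
    for s in range(n - 1):
        sent = [(i, (i - s) % n, buf[i][(i - s) % n]) for i in range(n)]
        for i, c, val in sent:  # apply all sends of this step "simultaneously"
            buf[(i + 1) % n][c] += val
    after_reduce_scatter = [list(b) for b in buf]

    # Pass 2 (all-gather): n - 1 more steps. GPU i forwards the finished chunk
    # (i + 1 - s) % n, and the receiver overwrites its stale copy.
    for s in range(n - 1):
        sent = [(i, (i + 1 - s) % n, buf[i][(i + 1 - s) % n]) for i in range(n)]
        for i, c, val in sent:
            buf[(i + 1) % n][c] = val
    return after_reduce_scatter, buf


grads = [[1, 2, 3, 4],
         [10, 20, 30, 40],
         [100, 200, 300, 400],
         [1000, 2000, 3000, 4000]]
mid, final = ring_allreduce(grads)
print("after pass 1:", mid)
print("after pass 2:", final)
```

In this simulation each GPU holds exactly one fully summed chunk after the first n-1 steps, and only after a second pass of n-1 steps does every GPU hold the full aggregated gradient. Under that assumption the total is 2(n-1)/n, which would match the factor of 2 in the 160 MB example.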
‘If we use the same example of synchronizing 160 MB across 8 V100 GPUs we arrive at approximately 2⋅160MB/(3⋅18GB/s)≈6ms.’ Can you explain why? I think the value should be 2⋅160MB/(18GB/s)≈18 ms.
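The two expressions in this question can be evaluated directly. The sketch below only checks the arithmetic, taking 1 GB as 1000 MB; it does not by itself explain where the factor of 3 in the denominator comes from. The helper name `sync_ms` is made up for illustration.

```python
size_mb = 160    # gradient size from the example, in MB
link_gb_s = 18   # bandwidth figure from the example, in GB/s


def sync_ms(size_mb, bandwidth_gb_s):
    """Time in ms to move 2 * size_mb at bandwidth_gb_s (1 GB taken as 1000 MB)."""
    seconds = 2 * size_mb / (bandwidth_gb_s * 1000)
    return seconds * 1000


print(round(sync_ms(size_mb, 3 * link_gb_s), 1))  # denominator 3 * 18 GB/s
print(round(sync_ms(size_mb, link_gb_s), 1))      # denominator 18 GB/s
```

Under these assumptions the first expression comes out just under 6 ms and the second just under 18 ms, so the two differ by exactly the factor of 3 in the bandwidth term.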