Asynchronous Computation

I'm not sure why, but when running on the GPU, tensor multiplication appears slower with synchronization than without it. On my GPU, the running time is 0.0004s without synchronizing and 0.0026s with it.
Interestingly, on the CPU I got the “expected” results: 0.2002s without synchronizing and 0.1061s with it.
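
One possible explanation, offered as a sketch rather than a diagnosis of the numbers above: GPU work is usually launched asynchronously, so a timer stopped without synchronizing may only measure the cost of queuing the operation, not the operation itself. The snippet below is a CPU-only analogy built on Python's standard library, not the original benchmark. `fake_kernel`, `compare_timings`, and the 0.05s duration are hypothetical names and values, and a worker thread stands in for the device. In PyTorch, the blocking step would correspond to `torch.cuda.synchronize()`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_kernel(duration):
    """Hypothetical stand-in for a device kernel: just sleeps for `duration` seconds."""
    time.sleep(duration)
    return duration


def compare_timings(duration=0.05):
    """Time the same 'kernel' twice: once stopping the clock right after
    launch, once stopping it only after waiting for the result."""
    with ThreadPoolExecutor(max_workers=1) as device:
        # Without synchronizing: submit() returns as soon as the work is queued,
        # so the timer only sees the launch overhead.
        start = time.perf_counter()
        pending = device.submit(fake_kernel, duration)
        launch_only = time.perf_counter() - start
        pending.result()  # drain the queue before the second measurement

        # With synchronizing: result() blocks until the work has finished,
        # so the timer covers launch plus execution.
        start = time.perf_counter()
        device.submit(fake_kernel, duration).result()
        with_sync = time.perf_counter() - start
    return launch_only, with_sync


if __name__ == "__main__":
    launch_only, with_sync = compare_timings()
    print(f"without synchronizing: {launch_only:.4f}s")
    print(f"with synchronizing:    {with_sync:.4f}s")
```

If the same thing is happening on the GPU, the synchronized time would be the one that reflects the actual cost of the multiplication, and the smaller unsynchronized number would mostly be launch overhead.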