使用 GPU

goldpiggy · January 14, 2021, 5:39am

https://zh.d2l.ai/chapter_deep-learning-computation/use-gpu.html

11122 · July 21, 2021, 3:27pm

torch.cuda.device可以被用来做什么呢？看起来在gpu上操作变量只需要torch.device就可以了。

11122 · July 22, 2021, 6:26am

以及5.6.3.中的代码，不需要写net = net.to(try_gpu())吧？直接net.to(try_gpu())就可以了。测试看来.to对于nn.Module是in-place操作。

learn_pytorch · September 3, 2021, 12:20pm

Mac m1 咋用GPU，需要特殊的设置吗？

Kaiyuan_Gan · September 21, 2021, 2:45pm

首先确认电脑上是否有GPU，可以使用nvidia-smi查看；
如果有的话，可以先安装显卡驱动，然后安装pytorch的GPU版本就行。

Hugo_Yang · October 14, 2021, 2:29pm

目前几大框架支持最好的还是NVIDIA的cuda，AMD的ROCm感兴趣可以试试，m1还是算了吧，soc估计不好支持

blessyyyu · December 6, 2021, 6:02am

测量同时在两个GPU上执行两个矩阵乘法与在一个GPU上按顺序执行两个矩阵乘法所需的时间。提示：你应该看到近乎线性的缩放

请问一下大家，这个该如何验证？

Aaron-Ma · February 2, 2022, 2:49pm

运行nvidia-smi显示Failed to initialize NVML: Unknown Error是为什么啊？

mingtianyebuchi · February 14, 2022, 9:04am

安装Cuda驱动就好了，网上有很多这样的教程

CrazyTianC · March 20, 2022, 4:20am

代码当中如何实现”最好是为GPU内部的日志分配内存，并且只移动较大的日志。” 一般都使用logging的模块，这个可以控制是否记录到GPU？

Sandra · March 21, 2022, 9:44am

torch.cuda.device 一般用于with 环境管理器; e.g with torch.cuda.device(0)

BabyXin · July 26, 2022, 11:51am

100个500x500矩阵的矩阵乘法所需的时间，并记录输出矩阵的Frobenius范数的时间对比：

import time
import torch

def try_gpu(i=0):  #@save
    """如果存在，则返回gpu(i)，否则返回cpu()"""
    if torch.cuda.device_count() >= i + 1:
        return torch.device(f'cuda:{i}')
    return torch.device('cpu')

startTime1=time.time()
for i in range(100):
    A = torch.ones(500,500)
    B = torch.ones(500,500)
    C = torch.matmul(A,B)
endTime1=time.time()

startTime2=time.time()
for i in range(100):
    A = torch.ones(500,500,device=try_gpu())
    B = torch.ones(500,500,device=try_gpu())
    C = torch.matmul(A,B)
endTime2=time.time()

print('cpu计算总时长:', round((endTime1 - startTime1)*1000, 2),'ms')
print('gpu计算总时长:', round((endTime2 - startTime2)*1000, 2),'ms')

运行结果：

cpu运行总时长: 275.99 ms
gpu运行总时长: 22.0 ms

xixi · September 2, 2022, 1:34am

为什么我用pycharm跑出来显示GPU用的时间有2430.96 ms比CPU慢好多，但是用jupyter跑出来时间就很短

Wei_pung · September 11, 2022, 3:56am

你好,这个问题,我很好奇,我尝试了一下,
pycharm上
cpu计算总时长: 1484.38 ms
gpu计算总时长: 1985.22 ms
jupyter上
cpu计算总时长: 1479.2 ms
gpu计算总时长: 8.0 ms
我在python IDE上也试了一下,发现gpu计算时长还是比cpu长好多.
你有找到答案吗?

xixi · September 11, 2022, 7:20am

还没找到答案，但我看有说是因为数据从cpu到gpu上也需要时间的，所以gpu优势不太明显

DJ_Zhu · September 19, 2022, 8:49am

Q1: 尝试一个计算量更大的任务，比如大矩阵的乘法，看看CPU和GPU之间的速度差异。再试一个计算量很小的任务呢？
A1: 当计算量较大时，GPU明显要比CPU快；当计算量很小时，两者差距不明显。

# 计算量较大的任务
X = torch.rand((10000, 10000))
Y = X.cuda(0)
time_start = time.time()
Z = torch.mm(X, X)
time_end = time.time()
print(f'cpu time cost: {round((time_end - time_start) * 1000, 2)}ms')
time_start = time.time()
Z = torch.mm(Y, Y)
time_end = time.time()
print(f'gpu time cost: {round((time_end - time_start) * 1000, 2)}ms')

# 计算量很小的任务
X = torch.rand((100, 100))
Y = X.cuda(0)
time_start = time.time()
Z = torch.mm(X, X)
time_end = time.time()
print(f'cpu time cost: {round((time_end - time_start) * 1000)}ms')
time_start = time.time()
Z = torch.mm(Y, Y)
time_end = time.time()
print(f'gpu time cost: {round((time_end - time_start) * 1000)}ms')

Q2: 我们应该如何在GPU上读写模型参数？
A2: 使用net.to(device=torch.device(‘cuda’))将模型迁移到gpu上，然后再按照之前的方法读写参数。

Q3: 测量计算1000个 100*100 的矩阵乘法所需的时间。记录输出矩阵的Frobenius范数，一次记录一个结果 vs 在GPU上保存并仅传输最终结果。
（中文版翻译有点问题，英文原版这句话是log the Frobenius norm of the output matrix one result at a time vs. keeping a log on the GPU and transferring only the final result，所以实质是要我们作对比）

# 一次记录一个结果
time_start = time.time()
for i in range(1000):
    Y = torch.mm(Y, Y)
    Z = torch.norm(Y)
time_end = time.time()
print(f'gpu time cost: {round((time_end - time_start) * 1000)}ms')
Y = X.cuda(0)
# 在GPU上保存并仅传输最终结果
time_start = time.time()
for i in range(1000):
    Y = torch.mm(Y, Y)
Z = torch.norm(Y)
time_end = time.time()
print(f'gpu time cost: {round((time_end - time_start) * 1000)}ms')

Q4: 测量同时在两个GPU上执行两个矩阵乘法与在一个GPU上按顺序执行两个矩阵乘法所需的时间。提示：你应该看到近乎线性的缩放。
A4: 由于只有一个GPU，所以没法实验验证，但个人推测后者的时间应该是前者的两倍。

ChenJoy · October 23, 2022, 4:00am

最后一问，在2个GPU计算矩阵乘法为什么比在1个GPU计算需要花费更多的时间呢？希望大佬帮忙解答下
代码如下：
X=torch.rand((10000,10000))
Y=torch.rand((10000,10000))
#在2个GPU上执行两个矩阵乘法
time_start=time.time()
X=X.cuda(0)
Y=Y.cuda(1)
Z1=torch.mm(X,X);Z2=torch.mm(Y,Y)
time_end = time.time()
print(f’use 2 gpu time cost: {round((time_end - time_start) * 1000)}ms’)
#在1个GPU上按顺序执行上面两个矩阵乘法
time_start=time.time()
Y=Y.cuda(0)
Z11=torch.mm(X,X)
Z22=torch.mm(Y,Y)
time_end = time.time()
print(f’use 1 gpu time cost: {round((time_end - time_start) * 1000)}ms’)

在kaggle执行结果：
use 2 gpu time cost: 218ms
use 1 gpu time cost: 2ms

Inamori_Mika · March 3, 2023, 1:11am

M1的话不能用哦，还是建议用有nvidia显卡的电脑

Inamori_Mika · March 3, 2023, 1:31am

首先上，你在第一次测试的时候已经把x放入gpu所以第二次直接计算会很快。建议为x重新分配内存后再进行测试。

protectorjy · March 9, 2023, 1:50pm

你可以在设变量设置两个变量，就不要复制了，就像我这样，对比还是挺明显的
微信截图_20230309215020
微信截图_20230309215030