epochs: 40, lr: 0.1, 2 hidden layers, hidden units (layers 1 & 2): 256
Are the initial parameter values the same as in the book? It feels like different initial values are the cause; they may be too large.
A brute-force workaround is to put the whole program inside the following statement:
`if __name__ == '__main__':`
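This presumably refers to the DataLoader multiprocessing error that appears when `num_workers > 0` and worker processes are started with "spawn" (e.g. on Windows). A minimal sketch of the workaround; the dataset, network, and hyperparameters below are my own assumptions, not from the original post:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def main():
    # Everything that spawns DataLoader worker processes stays inside main().
    train_data = datasets.FashionMNIST(root='./data', train=True, download=True,
                                       transform=transforms.ToTensor())
    train_iter = DataLoader(train_data, batch_size=256, shuffle=True, num_workers=4)
    net = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
    trainer = torch.optim.SGD(net.parameters(), lr=0.1)
    loss = nn.CrossEntropyLoss()
    for X, y in train_iter:
        trainer.zero_grad()
        loss(net(X), y).backward()
        trainer.step()
        break  # one step is enough to show the guard works

if __name__ == '__main__':  # required where workers are started with "spawn"
    main()
```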
Effects of several hyperparameters
- Loss reduction: the default is mean; with sum the gradients become larger and convergence is harder, which can be fixed by reducing batch_size and lr accordingly (see the sketch after this list).
- num_hiddens: too small gives low accuracy; too large leads to overfitting, which shows up as train_acc >> test_acc.
- lr: too large causes oscillation; too small makes optimization slow.
- The hyperparameters num_hiddens, num_layers, batch_size, lr, loss_reduction, num_epochs, etc. interact with each other, so they have to be tuned jointly to find the best combination, and the search space is very large.
- How to search over multiple hyperparameters?
  - Experience-based search: set batch_size and num_hiddens by referring to similar tasks; tune lr mainly by watching learning speed and oscillation; tune num_epochs by watching the loss curve and overfitting.
  - Random search: trusting to luck.
  - Grid search: brute-force enumeration.
  - Gradient-based optimization: let the hyperparameters take part in backpropagation.
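A minimal sketch of the reduction point above (the toy data and layer sizes are assumptions): with reduction='sum' the gradient is batch_size times larger than with reduction='mean', so shrinking lr by the batch size gives the same SGD update.

```python
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(256, 784)               # fake batch of 256 samples
y = torch.randint(0, 10, (256,))

def one_sgd_step(reduction, lr):
    torch.manual_seed(0)                # same initial weights in both runs
    net = nn.Linear(784, 10)
    loss = nn.CrossEntropyLoss(reduction=reduction)(net(X), y)
    loss.backward()
    with torch.no_grad():
        return net.weight - lr * net.weight.grad

w_mean = one_sgd_step('mean', lr=0.1)
w_sum = one_sgd_step('sum', lr=0.1 / 256)   # lr shrunk by the batch size
print(torch.allclose(w_mean, w_sum, atol=1e-6))  # True: identical updates
```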
A question for the experts here: in the live session, 李老师 mentioned that initializing the weights to 0 causes problems. I tried it myself, and the loss and accuracy stay at constant values. My understanding is that in SGD the gradients with respect to the weights are 0, so the parameters never change. Is this correct? Is there a mathematical derivation of this anywhere? Thanks.
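A minimal sketch to check this (the network sizes and the fake batch are assumptions): with every weight and bias set to 0, the hidden activations are ReLU(0) = 0 and the output weights are 0, so the gradients of both weight matrices (and of the hidden bias) come out exactly 0; only the output-layer bias receives a nonzero gradient, which is why the curves barely move.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
for m in net.modules():
    if isinstance(m, nn.Linear):        # zero out every weight *and* bias
        nn.init.zeros_(m.weight)
        nn.init.zeros_(m.bias)

X = torch.randn(32, 1, 28, 28)          # fake Fashion-MNIST-shaped batch
y = torch.randint(0, 10, (32,))
nn.CrossEntropyLoss()(net(X), y).backward()

# Hidden activations are ReLU(0) = 0 and the output weights are 0, so the
# weight gradients vanish; only the output bias gets a nonzero gradient.
for name, p in net.named_parameters():
    print(name, p.grad.abs().sum().item())
```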
-
Different widths of the hidden layer
256 seems to be good enough. -
Different depths of hidden layers with the same width (256)
1 layer
2 layers
3 layers
Given the same batches of samples and the same other hyperparameters, the more hidden layers there are, the more time is needed for convergence, and possibly the less accurate the model becomes. This is likely related to the ReLU activation function, which can prevent the model from learning certain features and thereby diminish the model's expressiveness. -
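For reference, a minimal sketch of how the three models compared above can be built with the same width; the input/output sizes (784 → 10, Fashion-MNIST style) are assumptions:

```python
from torch import nn

def make_mlp(num_hidden_layers, num_hiddens=256, num_inputs=784, num_outputs=10):
    """Stack `num_hidden_layers` ReLU layers of the same width."""
    layers = [nn.Flatten()]
    in_features = num_inputs
    for _ in range(num_hidden_layers):
        layers += [nn.Linear(in_features, num_hiddens), nn.ReLU()]
        in_features = num_hiddens
    layers.append(nn.Linear(in_features, num_outputs))
    return nn.Sequential(*layers)

nets = {depth: make_mlp(depth) for depth in (1, 2, 3)}  # the 1/2/3-layer models
```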
The learning rate mainly affects the speed and the final result of convergence.
-
List all discrete candidate values for each hyperparameter, then use DFS to traverse every combination.
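A minimal sketch of that idea; the candidate values and the `evaluate` objective below are placeholders (a real run would train a model per configuration and return validation accuracy):

```python
def grid_search_dfs(grid, evaluate):
    """Enumerate every combination of the discrete values in `grid` via DFS."""
    names = list(grid)
    best = (None, float('-inf'))

    def dfs(i, config):
        nonlocal best
        if i == len(names):                  # one full combination assembled
            score = evaluate(dict(config))
            if score > best[1]:
                best = (dict(config), score)
            return
        for value in grid[names[i]]:
            config[names[i]] = value
            dfs(i + 1, config)

    dfs(0, {})
    return best

grid = {
    'num_hiddens': [64, 256, 1024],
    'lr': [0.03, 0.1, 0.3],
    'batch_size': [64, 256],
}
# Dummy objective (assumption) standing in for a real training/validation run:
best_config, best_score = grid_search_dfs(
    grid, evaluate=lambda c: -abs(c['lr'] - 0.1) - abs(c['num_hiddens'] - 256) / 1000)
print(best_config, best_score)
```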