mmdetection报错汇总
one of the variables needed for gradient computation has been modified by an inplace operation
详细报错信息
Traceback (most recent call last): File "tools/train.py", line 242, in <module> main() File "tools/train.py", line 231, in main train_detector( File "/home/lgh/code/mmlab/mmdetection/mmdet/apis/train.py", line 244, in train_detector runner.run(data_loaders, cfg.workflow) File "/home/lgh/miniconda3/envs/mmlab/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run epoch_runner(data_loaders[i], **kwargs) File "/home/lgh/miniconda3/envs/mmlab/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train self.call_hook('after_train_iter') File "/home/lgh/miniconda3/envs/mmlab/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 309, in call_hook getattr(hook, fn_name)(self) File "/home/lgh/miniconda3/envs/mmlab/lib/python3.8/site-packages/mmcv/runner/hooks/optimizer.py", line 56, in after_train_iter runner.outputs['loss'].backward() File "/home/lgh/miniconda3/envs/mmlab/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs) File "/home/lgh/miniconda3/envs/mmlab/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward Variable._execution_engine.run_backward( RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2, 2048, 25, 25]], which is output 0 of ReluBackward1, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
问题分析: 所谓inplace操作就是直接修改地址上的值。
首先是torch中所有加 _ 的函数,x.squeeze*(),x.unsqueeze*()。这些操作直接修改变量,不返回值。
其次一些函数可以通过设置是否inplace,比如Pytorch中 torch.relu()和torch.sigmoid()等激活函数不是inplace操作,其中ReLU可通过设置inplace=True进行inplace操作。
以及一些算术操作,x += res是inplace操作,x = x + res不是。以及一些赋值操作。
如果一些用于backward的值被inplace操作修改了就会报错,所以最好不使用inplace操作,以及将赋值操作放在各种需要计算梯度运算之前,gather要在赋值之后。
解决方案: 将
ReLu
的inplace=True
移除参考文章: https://blog.csdn.net/qq_38163755/article/details/110957133