mmdetection报错汇总

Huia2022/7/21大约 1 分钟

one of the variables needed for gradient computation has been modified by an inplace operation

详细报错信息

Traceback (most recent call last):
File "tools/train.py", line 242, in <module>
    main()
File "tools/train.py", line 231, in main
    train_detector(
File "/home/lgh/code/mmlab/mmdetection/mmdet/apis/train.py", line 244, in train_detector
    runner.run(data_loaders, cfg.workflow)
File "/home/lgh/miniconda3/envs/mmlab/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
File "/home/lgh/miniconda3/envs/mmlab/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
    self.call_hook('after_train_iter')
File "/home/lgh/miniconda3/envs/mmlab/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 309, in call_hook
    getattr(hook, fn_name)(self)
File "/home/lgh/miniconda3/envs/mmlab/lib/python3.8/site-packages/mmcv/runner/hooks/optimizer.py", line 56, in after_train_iter
    runner.outputs['loss'].backward()
File "/home/lgh/miniconda3/envs/mmlab/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/lgh/miniconda3/envs/mmlab/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2, 2048, 25, 25]], which is output 0 of ReluBackward1, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

问题分析:
所谓inplace操作就是直接修改地址上的值。
首先是torch中所有加 _ 的函数，x.squeeze*()，x.unsqueeze*()。这些操作直接修改变量，不返回值。
其次一些函数可以通过设置是否inplace，比如Pytorch中 torch.relu()和torch.sigmoid()等激活函数不是inplace操作，其中ReLU可通过设置inplace=True进行inplace操作。
以及一些算术操作，x += res是inplace操作，x = x + res不是。以及一些赋值操作。
如果一些用于backward的值被inplace操作修改了就会报错，所以最好不使用inplace操作，以及将赋值操作放在各种需要计算梯度运算之前，gather要在赋值之后。

解决方案：
将ReLu的inplace=True移除
参考文章：
https://blog.csdn.net/qq_38163755/article/details/110957133