Training fails with "unhandled cuda error"
Created by: xiegegege
- Version and environment: 1) PaddlePaddle version: Paddle 1.5 2) GPU: P40, CUDA 9, cuDNN 7 3) System environment: CentOS, Python 2.7
- Training info: 1) Single machine, multiple GPUs
- Problem description: I am training the yolov3 model from the detection library with the command: python ppdet/train.py --cfg_file=configs/yolov3_ResNet34_1x_syncbn.yml. After running normally for many days, training exited abnormally with an error. The run log is:

2019-07-09 18:08:16.698352, iter: 346094, lr: 0.001000, 'loss': 78.400635, time: 1.747
terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
what(): unhandled cuda error at [/ssd1/xiege/paddle_wheel/Paddle_2.7/Paddle/paddle/fluid/platform/nccl_helper.h:70]
PaddlePaddle Call Stacks:
0 0x7fa2e3830eb0p void paddle::platform::EnforceNotMet::Init<char const*>(char const*, char const*, int) + 352
1 0x7fa2e3831229p paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 137
2 0x7fa2e538fac9p
3 0x7fa2e539f50dp
4 0x7fa2e539f50dp
5 0x7fa2e539f50dp
6 0x7fa2e539f50dp
7 0x7fa2e539f50dp
8 0x7fa2e539f50dp
9 0x7fa2e539f50dp
10 0x7fa2e539f50dp
11 0x7fa2e53a0244p paddle::framework::details::OpHandleBase::RunAndRecordEvent(std::function<void ()()> const&) + 116
12 0x7fa2e538fb52p paddle::framework::details::AllReduceOpHandle::RunAllReduceFuncs(std::vector<std::function<void ()()>, std::allocator<std::function<void ()()> > > const&) + 98
13 0x7fa2e5391658p paddle::framework::details::AllReduceOpHandle::RunImpl() + 3176
14 0x7fa2e53a07e0p paddle::framework::details::OpHandleBase::Run(bool) + 160
15 0x7fa2e5381b56p paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync(paddle::framework::details::OpHandleBase*) + 310
16 0x7fa2e53807bfp paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp(paddle::framework::details::OpHandleBase*, std::shared_ptr<paddle::framework::BlockingQueue > const&, unsigned long*) + 47
17 0x7fa2e5380b7fp
18 0x7fa2e4b663d3p std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()(), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result, std::__future_base::_Result_base::_Deleter>, void> >::_M_invoke(std::_Any_data const&) + 35
19 0x7fa2e38fbeb7p std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()()>&, bool&) + 39
20 0x7fa39d5a0e03p pthread_once + 83
21 0x7fa2e537c202p
22 0x7fa2e38fd434p _ZZN10ThreadPoolC1EmENKUlvE_clEv + 404
23 0x7fa3036e5470p
24 0x7fa39d59baa1p
25 0x7fa39cc5dc4dp clone + 109

GPU status: