Problem at the epoch boundary
Created by: yyhlvdl
@pkuyym DeepSpeech trains the first epoch with the audio samples sorted by duration. However, when I trained the model myself (following the Docker setup in the README), the program did not start the next epoch after the first one finished. In fact, it hung there, and then exhausted GPU memory — this is the memory leak I have been suspecting for a while.

```
Pass: 0, Batch: 1, TrainCost: 93.167046
Pass: 0, Batch: 2, TrainCost: 95.089401
Pass: 0, Batch: 3, TrainCost: 83.036194
Pass: 0, Batch: 4, TrainCost: 80.692101
Pass: 0, Batch: 5, TrainCost: 67.172005
Pass: 0, Batch: 6, TrainCost: 55.272041
Pass: 0, Batch: 7, TrainCost: 54.719227
Pass: 0, Batch: 8, TrainCost: 48.882935
Pass: 0, Batch: 9, TrainCost: 52.184387
Pass: 0, Batch: 10, TrainCost: 35.945839
Pass: 0, Batch: 11, TrainCost: 37.550327
Pass: 0, Batch: 12, TrainCost: 37.640144
Pass: 0, Batch: 13, TrainCost: 37.709978
F0109 01:49:19.022586 845 hl_cuda_device.cc:273] Check failed: cudaSuccess == cudaStat (0 vs. 2) Cuda Error: out of memory
*** Check failure stack trace: ***
    @     0x7f9687cd504d  google::LogMessage::Fail()
    @     0x7f9687cd7398  google::LogMessage::SendToLog()
    @     0x7f9687cd4b5b  google::LogMessage::Flush()
    @     0x7f9687cd826e  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f9687c7da9f  hl_malloc_device()
    @     0x7f9687ad17c7  paddle::GpuAllocator::alloc()
    @     0x7f9687abe458  paddle::PoolAllocator::alloc()
    @     0x7f9687abde33  paddle::GpuMemoryHandle::GpuMemoryHandle()
    @     0x7f9687a1d30b  paddle::GemmConvFunction<>::calc()
    @     0x7f96878b30f8  paddle::ExpandConvLayer::forward()
    @     0x7f96879629ff  paddle::NeuralNetwork::forward()
    @     0x7f968796f92c  paddle::TrainerThread::forward()
    @     0x7f96879731c8  paddle::TrainerThread::computeThread()
    @     0x7f96d6b5cc80  (unknown)
    @     0x7f96dd17e6ba  start_thread
    @     0x7f96dceb43dd  clone
    @              (nil)  (unknown)
Aborted (core dumped)
```

To clarify: I am using 103 LibriSpeech samples with durations between 0s and 1.5s, and batch_size is 8. After batch 13, training should in theory move on to the next epoch. Instead, the error above occurs, the program hangs, and memory eventually blows up.

I then commented out this code in data.py:

```python
#if self._epoch == 0 and sortagrad:
#    manifest.sort(key=lambda x: x["duration"])
#else:
```

The same problem still occurs. Could you help me resolve this?
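For reference, the batch arithmetic confirms the crash lands exactly at the epoch boundary: 103 samples with batch_size 8 give 13 batches (12 full plus one partial). Below is a minimal sketch of that arithmetic and of the SortaGrad ordering the commented-out block implements; the `order_manifest` helper is my own illustration, not the actual data.py code.

```python
import math

# 103 LibriSpeech clips (0s-1.5s) with batch_size 8 -> 13 batches per epoch,
# so failing right after "Batch: 13" means the failure is at the epoch boundary.
num_samples = 103
batch_size = 8
batches_per_epoch = math.ceil(num_samples / batch_size)
print(batches_per_epoch)  # 13

# Hypothetical sketch of the SortaGrad behavior from the snippet above:
# sort the manifest by clip duration only in epoch 0, leave it as-is afterwards.
def order_manifest(manifest, epoch, sortagrad=True):
    """Return the manifest in SortaGrad order for epoch 0, unchanged otherwise."""
    if epoch == 0 and sortagrad:
        return sorted(manifest, key=lambda x: x["duration"])
    return list(manifest)

clips = [{"duration": d} for d in (1.2, 0.3, 0.9)]
print([c["duration"] for c in order_manifest(clips, epoch=0)])  # [0.3, 0.9, 1.2]
print([c["duration"] for c in order_manifest(clips, epoch=1)])  # [1.2, 0.3, 0.9]
```

Since commenting out the sort did not change the behavior, the hang is unlikely to be caused by the SortaGrad ordering itself and more likely tied to what happens when the data iterator restarts for the next pass.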