[AMBER] REMD jobs crashing repeatedly on GPUs

From: Joe Passman <joepassman.comcast.net>
Date: Fri, 6 Sep 2013 17:57:42 +0000 (UTC)

Hi.


I am running a series of replica exchange (REMD) jobs with AMBER12 on Keeneland. The jobs crash repeatedly, and the error comes at a different point in the simulation each time. I am not sure what the cause is.


Here is an excerpt of an example output file:

===================
started run 240

Running multipmemd version of pmemd Amber12
Total processors = 30
Number of groups = 30


Running multipmemd version of pmemd Amber12
Total processors = 30
Number of groups = 30


Running multipmemd version of pmemd Amber12
Total processors = 30
Number of groups = 30

started run 241

Running multipmemd version of pmemd Amber12
Total processors = 30
Number of groups = 30

UNKNOWN
cudaMemcpyToSymbol: SetSim copy to cSim failed unknown error
UNKNOWN
Error: unknown error launching kernel kCalculatePMENonbondForces
UNKNOWN
cudaMemcpy GpuBuffer::Upload failed unknown error
UNKNOWN
cudaMalloc GpuBuffer::Allocate failed unknown error
UNKNOWN
cudaMemcpy GpuBuffer::Upload failed unknown error
UNKNOWN
cudaMemcpy GpuBuffer::Upload failed unknown error
UNKNOWN
cudaMemcpy GpuBuffer::Upload failed unknown error
UNKNOWN
cudaMemcpyToSymbol: SetSim copy to cSim failed unknown error
UNKNOWN
Error: unknown error launching kernel kCalculatePMENonbondForces
UNKNOWN
Error: unknown error launching kernel kCalculatePMENonbondForces
UNKNOWN
Error: unknown error launching kernel kUpdate
UNKNOWN
Error: unknown error launching kernel kCalculatePMENonbondForces
UNKNOWN
Error: unknown error launching kernel kUpdate
UNKNOWN
gpu_download_partial_forces: download failed unknown error
UNKNOWN
cudaMemcpy GpuBuffer::Upload failed unknown error
UNKNOWN
cudaMalloc GpuBuffer::Allocate failed unknown error
UNKNOWN
UNKNOWN
Error: unknown error launching kernel kUpdate

Running multipmemd version of pmemd Amber12
Total processors = 30
Number of groups = 30

UNKNOWN
UNKNOWN
UNKNOWN
===================
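
To see which runs fail and which CUDA calls are involved, the output file can be tallied with a short script. The following is only a rough sketch: the function name tally_errors is made up for illustration, and it assumes that each run is announced by a "started run N" line and that failures contain the phrases seen in the excerpt above. It says nothing about the underlying cause.

```python
import re
from collections import Counter, defaultdict

# Assumption: each run is announced by a line like "started run 241".
RUN_RE = re.compile(r"^started run (\d+)")
# Assumption: failures contain one of the phrases seen in the excerpt.
ERR_RE = re.compile(r"failed unknown error|unknown error launching kernel")


def tally_errors(lines):
    """Count distinct error lines per run (hypothetical helper).

    Errors seen before any "started run" line are filed under run -1.
    """
    counts = defaultdict(Counter)
    run = -1
    for raw in lines:
        line = raw.strip()
        match = RUN_RE.match(line)
        if match:
            run = int(match.group(1))
            continue
        if ERR_RE.search(line):
            counts[run][line] += 1
    return counts


# A few lines in the same shape as the excerpt, for demonstration only.
sample = """started run 240
Running multipmemd version of pmemd Amber12
started run 241
UNKNOWN
cudaMemcpyToSymbol: SetSim copy to cSim failed unknown error
UNKNOWN
Error: unknown error launching kernel kUpdate
UNKNOWN
Error: unknown error launching kernel kUpdate
""".splitlines()

result = tally_errors(sample)
for run_id, errors in sorted(result.items()):
    for message, n in errors.most_common():
        print(run_id, n, message)
```

For a real output file, pass the open file object in place of sample, e.g. tally_errors(open(path)), where path is whatever the job's output file is called.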


I have been talking to Shiquan Su at Keeneland. He found this seemingly relevant thread from 2010:

http://archive.ambermd.org/201009/0180.html


Does anyone have an idea what is happening here?


Thank you!

-- 
Joe Passman 
E-mail: joseph.passman.gmail.com 
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Sep 06 2013 - 11:00:04 PDT