Re: [AMBER] REMD jobs crashing repeatedly on GPUs

From: Christina Bergonzo <cbergonzo.gmail.com>
Date: Fri, 6 Sep 2013 12:07:59 -0600

Hi Joe,

I saw similar problems and have been working with someone from Keeneland to
sort it out. I forwarded this to him.
It may be a Keeneland issue and not a problem with the code since we're
able to run this elsewhere.

-Christina


On Fri, Sep 6, 2013 at 11:57 AM, Joe Passman <joepassman.comcast.net> wrote:

>
> Hi.
>
> I am running a series of replica exchange jobs using AMBER12 on Keeneland.
> The jobs crash repeatedly, and the error occurs at a different point in the
> simulation each time. I am not sure what the issue is.
>
> Here is an excerpt of an example output file:
>
> ===================
> started run 240
>
> Running multipmemd version of pmemd Amber12
> Total processors = 30
> Number of groups = 30
>
>
> Running multipmemd version of pmemd Amber12
> Total processors = 30
> Number of groups = 30
>
>
> Running multipmemd version of pmemd Amber12
> Total processors = 30
> Number of groups = 30
>
> started run 241
>
> Running multipmemd version of pmemd Amber12
> Total processors = 30
> Number of groups = 30
>
> UNKNOWN
> cudaMemcpyToSymbol: SetSim copy to cSim failed unknown error
> UNKNOWN
> Error: unknown error launching kernel kCalculatePMENonbondForces
> UNKNOWN
> cudaMemcpy GpuBuffer::Upload failed unknown error
> UNKNOWN
> cudaMalloc GpuBuffer::Allocate failed unknown error
> UNKNOWN
> cudaMemcpy GpuBuffer::Upload failed unknown error
> UNKNOWN
> cudaMemcpy GpuBuffer::Upload failed unknown error
> UNKNOWN
> cudaMemcpy GpuBuffer::Upload failed unknown error
> UNKNOWN
> cudaMemcpyToSymbol: SetSim copy to cSim failed unknown error
> UNKNOWN
> Error: unknown error launching kernel kCalculatePMENonbondForces
> UNKNOWN
> Error: unknown error launching kernel kCalculatePMENonbondForces
> UNKNOWN
> Error: unknown error launching kernel kUpdate
> UNKNOWN
> Error: unknown error launching kernel kCalculatePMENonbondForces
> UNKNOWN
> Error: unknown error launching kernel kUpdate
> UNKNOWN
> gpu_download_partial_forces: download failed unknown error
> UNKNOWN
> cudaMemcpy GpuBuffer::Upload failed unknown error
> UNKNOWN
> cudaMalloc GpuBuffer::Allocate failed unknown error
> UNKNOWN
> UNKNOWN
> Error: unknown error launching kernel kUpdate
>
> Running multipmemd version of pmemd Amber12
> Total processors = 30
> Number of groups = 30
>
> UNKNOWN
> UNKNOWN
> UNKNOWN
> ===================
>
>
> I have been talking to Shiquan Su at Keeneland. He found
> this seemingly relevant thread from 2010:
>
> http://archive.ambermd.org/201009/0180.html
>
>
> Does anyone have an idea what is happening here?
>
>
> Thank you!
>
> --
>
> Joe Passman
> E-mail: joseph.passman.gmail.com
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>



-- 
---------------------------------------------------------------------------------------
Christina Bergonzo, PhD
Department of Medicinal Chemistry, University of Utah
30 South 2000 East, Rm. 201
Salt Lake City, UT 84112-5820
Office: L.S. Skaggs Pharmacy Research Institute, Rm.4290
http://home.chpc.utah.edu/~cheatham/
(801) 587-9652 / Fax: (801) 585-9119
---------------------------------------------------------------------------------------
Received on Fri Sep 06 2013 - 11:30:03 PDT