Re: [AMBER] REMD jobs crashing repeatedly on GPUs

From: Scott Le Grand <varelse2005.gmail.com>
Date: Fri, 6 Sep 2013 11:13:27 -0700

One flaky GPU in the mix is more than enough to trigger this.
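
One way to narrow down which GPU is misbehaving is to tally the CUDA failures per log file. This is only a rough sketch, and it assumes each replica's (or each node's) output is captured in its own file. The excerpt quoted below is a single merged stream, so that setup is an assumption, not something shown in this thread. The function name tally_cuda_errors is hypothetical, and the patterns are taken only from the error lines quoted below.

```python
import re
from collections import Counter

# Patterns derived from the error lines quoted in this thread.
# Lines that match none of them are ignored.
_PATTERNS = [
    (re.compile(r"launching kernel (\w+)"), "kernel launch: {}"),
    (re.compile(r"^(cuda\w+)"), "{}"),
    (re.compile(r"^(gpu_\w+)"), "{}"),
]


def tally_cuda_errors(log_text):
    """Count 'unknown error' lines by the CUDA call or kernel that reported them."""
    counts = Counter()
    for raw in log_text.splitlines():
        # Drop any mail-style quote prefixes and surrounding whitespace.
        line = raw.lstrip("> ").strip()
        if "unknown error" not in line:
            continue
        for pattern, label in _PATTERNS:
            match = pattern.search(line)
            if match:
                counts[label.format(match.group(1))] += 1
                break
    return counts


if __name__ == "__main__":
    sample = (
        "cudaMemcpyToSymbol: SetSim copy to cSim failed unknown error\n"
        "Error: unknown error launching kernel kCalculatePMENonbondForces\n"
        "cudaMemcpy GpuBuffer::Upload failed unknown error\n"
    )
    for source, n in tally_cuda_errors(sample).most_common():
        print(n, source)
```

If the same node keeps producing these counts across otherwise unrelated runs, that node's GPU is the first one to have checked.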



On Fri, Sep 6, 2013 at 11:07 AM, Christina Bergonzo <cbergonzo.gmail.com> wrote:

> Hi Joe,
>
> I saw similar problems and have been working with someone from Keeneland to
> sort it out. I forwarded this to him.
> It may be a Keeneland issue and not a problem with the code since we're
> able to run this elsewhere.
>
> -Christina
>
>
> On Fri, Sep 6, 2013 at 11:57 AM, Joe Passman <joepassman.comcast.net>
> wrote:
>
> >
> >
> >
> > Hi.
> >
> >
> > I am running a series of replica exchange jobs using AMBER12 on Keeneland.
> > The jobs crash repeatedly. The error ALWAYS comes at different times during
> > the simulation. I am not sure what the issue is.
> >
> >
> > Here is an excerpt of an example output file:
> >
> > ===================
> > started run 240
> >
> > Running multipmemd version of pmemd Amber12
> > Total processors = 30
> > Number of groups = 30
> >
> >
> > Running multipmemd version of pmemd Amber12
> > Total processors = 30
> > Number of groups = 30
> >
> >
> > Running multipmemd version of pmemd Amber12
> > Total processors = 30
> > Number of groups = 30
> >
> > started run 241
> >
> > Running multipmemd version of pmemd Amber12
> > Total processors = 30
> > Number of groups = 30
> >
> > UNKNOWN
> > cudaMemcpyToSymbol: SetSim copy to cSim failed unknown error
> > UNKNOWN
> > Error: unknown error launching kernel kCalculatePMENonbondForces
> > UNKNOWN
> > cudaMemcpy GpuBuffer::Upload failed unknown error
> > UNKNOWN
> > cudaMalloc GpuBuffer::Allocate failed unknown error
> > UNKNOWN
> > cudaMemcpy GpuBuffer::Upload failed unknown error
> > UNKNOWN
> > cudaMemcpy GpuBuffer::Upload failed unknown error
> > UNKNOWN
> > cudaMemcpy GpuBuffer::Upload failed unknown error
> > UNKNOWN
> > cudaMemcpyToSymbol: SetSim copy to cSim failed unknown error
> > UNKNOWN
> > Error: unknown error launching kernel kCalculatePMENonbondForces
> > UNKNOWN
> > Error: unknown error launching kernel kCalculatePMENonbondForces
> > UNKNOWN
> > Error: unknown error launching kernel kUpdate
> > UNKNOWN
> > Error: unknown error launching kernel kCalculatePMENonbondForces
> > UNKNOWN
> > Error: unknown error launching kernel kUpdate
> > UNKNOWN
> > gpu_download_partial_forces: download failed unknown error
> > UNKNOWN
> > cudaMemcpy GpuBuffer::Upload failed unknown error
> > UNKNOWN
> > cudaMalloc GpuBuffer::Allocate failed unknown error
> > UNKNOWN
> > UNKNOWN
> > Error: unknown error launching kernel kUpdate
> >
> > Running multipmemd version of pmemd Amber12
> > Total processors = 30
> > Number of groups = 30
> >
> > UNKNOWN
> > UNKNOWN
> > UNKNOWN
> > ===================
> >
> >
> > I have been talking to Shiquan Su at Keeneland. He found
> > this seemingly relevant thread from 2010.
> >
> > http://archive.ambermd.org/201009/0180.html
> >
> >
> > Does anyone have an idea what is happening here?
> >
> >
> > Thank you!
> >
> > --
> >
> > Joe Passman
> > E-mail: joseph.passman.gmail.com
> > _______________________________________________
> > AMBER mailing list
> > AMBER.ambermd.org
> > http://lists.ambermd.org/mailman/listinfo/amber
> >
>
>
>
> --
>
> ---------------------------------------------------------------------------------------
> Christina Bergonzo, PhD
> Department of Medicinal Chemistry, University of Utah
> 30 South 2000 East, Rm. 201
> Salt Lake City, UT 84112-5820
> Office: L.S. Skaggs Pharmacy Research Institute, Rm.4290
> http://home.chpc.utah.edu/~cheatham/
> (801) 587-9652 / Fax: (801) 585-9119
>
> ---------------------------------------------------------------------------------------
Received on Fri Sep 06 2013 - 11:30:03 PDT