Re: [AMBER] CUDA errors on a 700k system

From: David Cerutti <dscerutti.gmail.com>
Date: Sun, 14 Jul 2019 17:25:53 -0400

As Bill said, this is probably not an issue with kNLSkinTest, but rather
something that went wrong in the step before the skin test kernel was
launched. Some particle has experienced a large force and is flying into
the stratosphere. Make sure that your system is well equilibrated, with no
clashes, ring stabs, or other sources of chronic strain. If worse comes to
worst, there was an mdgx trick on the forum recently to help you pinpoint
strain in your system, but for 700k atoms it may take some time to run.
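
On question 1 in the quoted thread: below is a minimal sketch of the clamping
that the vlimit description Bill quotes seems to imply, in plain Python purely
for illustration. The function name clamp_velocities and the sample numbers
are hypothetical, and this is not Amber source code. It assumes the limit is
applied to each velocity component with the sign preserved, and that a limit
of 0 disables the check.

```python
def clamp_velocities(velocities, vlimit):
    """Clamp each velocity component to [-|vlimit|, +|vlimit|], keeping its sign.

    Returns the clamped velocities and the number of components modified.
    Treating vlimit == 0 as "no limit" is an assumption of this sketch.
    """
    limit = abs(vlimit)
    if limit == 0.0:
        return [tuple(v) for v in velocities], 0

    clamped = []
    n_modified = 0
    for v in velocities:
        new_v = []
        for c in v:
            if abs(c) > limit:
                # Component exceeds the limit: reset it to +/- limit.
                c = limit if c > 0 else -limit
                n_modified += 1
            new_v.append(c)
        clamped.append(tuple(new_v))
    return clamped, n_modified


# Hypothetical velocities (arbitrary units): the second atom is "flying off".
vels = [(1.5, -3.0, 0.2), (250.0, -4.0, -31.0)]
clamped, n = clamp_velocities(vels, 10.0)
print(clamped, n)
```

If a run only survives with a tight limit such as vlimit=10, the clamp is
likely masking whatever produces the large force rather than removing it, so
the strain check described above still applies.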

Dave


On Sun, Jul 14, 2019 at 2:08 PM Bill Ross <ross.cgl.ucsf.edu> wrote:

> Have you tried on a CPU? Maybe just to get started to some degree, in
> case there's a GPU-specific numerical hump in Amber or GPUs.
>
> Also it's a normal type of problem if you are equilibrating too fast.
>
> The manual likely describes vlimit, but searching "vlimit sander amber" gives:
>
> The variable vlimit resets the velocity to the value of VLIMIT once it
> becomes greater than abs(VLIMIT). This can be used to avoid occasional
> instabilities in molecular dynamics runs, and is especially important
> for simulated annealing runs because of the high temperature. It should
> be set to some value between 10 and 20, which is well above the most
> probable velocity in a Maxwell-Boltzmann distribution at room
> temperature. A warning message will be printed whenever the velocities
> are modified.
>
> http://ambermd.org/tutorials/advanced/tutorial4/index.htm
>
> Bill
>
> On 7/14/19 11:00 AM, Dmitry Suplatov wrote:
> > Dear Amber Users,
> >
> > I run a classical NVT simulation of a 700k system on Tesla P100's. I run
> > 10 MD replicas of the same system on different nodes.
> >
> > All MDs generally run for 80-100 ns (i.e., the production run after EM,
> > EQ, etc.) then I get some problems.
> >
> > When running on *two GPUs* in the *peer2peer* mode (single node = 2
> > cards) I get the following error:
> > gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was
> > encountered
> >
> > When running on *one GPU* I get the following error:
> > Error: an illegal memory access was encountered launching kernel
> > kNLSkinTest
> >
> > When adding the *vlimit=20,* option to my config file *some MDs run
> > normally on one GPU* while others encounter the same error. Nothing
> > changes in the peer-to-peer mode.
> >
> > When setting the *vlimit=10,* option to my config file *all MDs run
> > normally* *on one GPU*. Nothing changes in the peer-to-peer mode.
> >
> > My QUESTIONS are:
> >
> > 1/ What does the "vlimit" option do? I googled it from the amber mailing
> > lists but can not find the meaning.
> >
> > 2/ Does setting the vlimit affect performance or biological output of my
> > simulations?
> >
> > 3/ What would you suggest?
> >
> > Thank you,
> > Dmitry
> > _______________________________________________
> > AMBER mailing list
> > AMBER.ambermd.org
> > http://lists.ambermd.org/mailman/listinfo/amber
Received on Sun Jul 14 2019 - 14:30:02 PDT