Re: [AMBER] Jobs hanging on parallel CUDA without producing output

From: Jason Swails <jason.swails.gmail.com>
Date: Mon, 23 Sep 2013 07:36:38 -0400

On Mon, Sep 23, 2013 at 6:12 AM, Nils Oberhauser
<Nils.Oberhauser.unige.ch> wrote:

> Hello,
>
> I am having some trouble running parallel CUDA jobs on 2 and 3 GPUs.
> I'm observing the following problems following the lipid 11 tutorial on
>
> http://ambermd.org/tutorials/advanced/tutorial16/An_Amber_Lipid_Force_Field_Tutorial.html :
> To be sure I did not do anything wrong in the setup, I downloaded the
> example files from the tutorial page as input.
> During the production phase the calculations break up after a short time,
> returning:
> "Nonbond cells need to be recalculated, restart simulation from
> previous checkpoint with a higher value for skinnb."
> I restart the calculation, as was suggested by previous posts in the
> mailing list. For a while the calculations seem to be running smoothly,
> but after a certain time period (sometimes minutes, sometimes hours) there
> is no more write access to the trajectory or any other output file. The
> pmemd.cuda.MPI processes keep running forever without any output or error
> message until the process is killed manually. Since the jobs just keep
> running without any output, I'm not sure how to further debug this
> problem...
> When I use only one GPU or 10 parallel CPU cores however, the job finishes
> normally.
>
> The machine I'm using has 3 GeForce GTX 680 Cards and two Intel Xeon
> E5-2620 (2 GHz, 6 cores each) CPUs, running CentOS 64bit (kernel:
> 2.6.32-358.14.1.el6.x86_64) and the latest updates of amber12.
>
> Here is the most confusing part:
> We have another machine with identical hardware but a slightly different
> linux kernel (2.6.32-358.11.1.el6.x86_64) and a not up-to-date version of
> amber12. Since there are currently jobs running on the machine, I cannot
> upgrade the kernel or amber to see if it makes any difference.
> However when I run the same job from the tutorial as described above on
> this machine, I don't get any errors. The jobs run and terminate without
> any problems on 3 GPUs. Even the "Nonbond cells.." error never occurs
> (maybe because of the older amber version?).
> I ran the jobs several times on both machines, to be quite sure that the
> different outcomes are not due to the non-deterministic algorithms of the
> software, but the results are always the same.
>

You can check the output of $AMBERHOME/update_amber --version to see what
updates have been applied to which installation.
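If you save that output from each machine to a text file, a short script can show which entries differ. This is only a minimal sketch: the exact output format of update_amber is not shown in this thread, so the code just compares non-empty lines, and the function name and the sample strings in the usage note are hypothetical.

```python
def diff_update_reports(report_a, report_b):
    """Compare two saved update reports line by line.

    Returns (only_in_a, only_in_b), each a list of lines in original order.
    Assumption: one update entry per line; the real format may differ.
    """
    lines_a = [ln.strip() for ln in report_a.splitlines() if ln.strip()]
    lines_b = [ln.strip() for ln in report_b.splitlines() if ln.strip()]
    set_a, set_b = set(lines_a), set(lines_b)
    only_in_a = [ln for ln in lines_a if ln not in set_b]
    only_in_b = [ln for ln in lines_b if ln not in set_a]
    return only_in_a, only_in_b
```

For example, with made-up reports "update.1\nupdate.2" for the older machine and "update.1\nupdate.2\nupdate.18" for the newer one, the second list would contain only "update.18".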

That said, the indication that the nonbonded cells need to be recalculated
was introduced by an update to address the fact that for certain types of
simulations (particularly lipids built with the CHARMM lipid builder, I
think), the volume changes a _lot_ during constant pressure simulations --
too much for the nonbonded cell decomposition scheme that was originally
set up.

This led to _very_ strange behavior. So if an 'updated' version of Amber
reports that the nonbonded cells need to be recalculated, then you should
either do your equilibration with that updated version or use CPUs instead
(because the 'old' GPU implementation that does not emit the error will be
wrong). See the description for bugfix.18.bz2 at
http://ambermd.org/bugfixes12.html

> So, before I go through all the trouble of installing an older kernel to the
> first machine, I'm asking here for help. Can it at all be the kernel?
> Maybe someone has a different suggestion to debug this problem.
>

Perhaps use only a single GPU to equilibrate the density, then try multiple
GPUs for the production stage if you really need the speed-up of multiple
GPUs (it's quite limited, isn't it?). Also, is the place where the
simulation stalls deterministic (i.e., does it always hang on exactly the
same step each time)?
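
One way to check the determinism question is to compare the last step reported in the mdout file of each hung run. The sketch below assumes the usual "NSTEP = ..." energy records in mdout; the function names are hypothetical. Keep in mind that the last printed step is only known to within the print interval (ntpr), and buffered output may hide the final record.

```python
import re

# Matches the step counter in mdout energy records, e.g. " NSTEP =     1000 ..."
NSTEP_RE = re.compile(r"NSTEP\s*=\s*(\d+)")

def last_step(mdout_text):
    """Return the last NSTEP value found in the text, or None if there is none."""
    steps = NSTEP_RE.findall(mdout_text)
    return int(steps[-1]) if steps else None

def hang_is_deterministic(mdout_texts):
    """True if every run stopped printing at the same step.

    Assumption: each element is the full text of one hung run's mdout file.
    """
    last_steps = {last_step(text) for text in mdout_texts}
    return len(last_steps) == 1 and None not in last_steps
```

If the runs stop at different steps, the hang is more likely a communication or driver problem than something tied to a particular configuration of the system.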

HTH,
Jason

-- 
Jason M. Swails
BioMaPS,
Rutgers University
Postdoctoral Researcher
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Sep 23 2013 - 05:00:03 PDT