Re: [AMBER] Jobs hanging on parallel CUDA without producing output from Nils Oberhauser on 2013-09-26 (Amber Archive Sep 2013)

From: Nils Oberhauser <Nils.Oberhauser.unige.ch>
Date: Thu, 26 Sep 2013 17:26:30 +0200

Okay, I was not aware of this. I did indeed only run the update_amber
script. We got this machine pre-configured and I'm quite new to Amber...^^
I'll try recompiling as soon as there are no more jobs running.

So thank you again for your help,
Nils

-----Original Message-----
From: Jason Swails [mailto:jason.swails.gmail.com]
Sent: Thursday, 26 September 2013 11:40
To: AMBER Mailing List
Subject: Re: [AMBER] Jobs hanging on parallel CUDA without producing output

On Thu, Sep 26, 2013 at 2:58 AM, Nils Oberhauser
<Nils.Oberhauser.unige.ch> wrote:

> Thank you Jason for your reply.
> I finally updated Amber on the other machine. So now both machines run
> the latest version.
>

Simply running "./update_amber --update" is not enough to update. You also
have to recompile (it's not clear from your statement that you did that).
Make sure you recompile both pmemd.cuda and pmemd.cuda.MPI... I've used
several different kernels and have never seen it make a difference before.
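
A minimal sketch of what the recompile step might look like, assuming an
Amber 12 source install built with the GNU compilers, AMBERHOME already set,
and the CUDA toolkit and an MPI library already configured. The helper names
`run` and `rebuild_gpu_engines` and the DRY_RUN variable are made up for this
illustration, and the configure flags should be checked against the notes for
your own installation:

```shell
#!/bin/sh
# Sketch only: apply updates, then rebuild both GPU engines.
# Assumptions (not from the thread): GNU compilers, AMBERHOME is set.
# DRY_RUN defaults to 1, so the commands are printed rather than executed.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"          # show what would be run
    else
        "$@"               # actually run it
    fi
}

rebuild_gpu_engines() {
    run cd "$AMBERHOME" || return 1
    run ./update_amber --update || return 1      # fetch and apply the patches
    run ./configure -cuda gnu || return 1        # serial GPU build: pmemd.cuda
    run make install || return 1
    run ./configure -cuda -mpi gnu || return 1   # parallel GPU build: pmemd.cuda.MPI
    run make install || return 1
}
```

Calling `rebuild_gpu_engines` as-is only lists the six commands; setting
DRY_RUN=0 would execute them, which should wait until no jobs are running on
the machine.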

> However, I'm still having the same problem: On one machine the job
> runs through without any interruption, while on the other machine I
> get the nonbond cells error every time and have to restart the
> calculation at least twice. The parallel GPU jobs also still hang
> after a while on this machine, but you are right, the parallel GPU
> processing does not give a lot of speed-up.^^ So the only difference
> between the two machines seems to be the kernel... Is it possible
> that this causes the strange behavior?
>
> Maybe someone else on the list has tried the Lipid11 tutorial lately
> and can tell me if they had any errors?
>
> Thanks,
> Nils
>
> -----Original Message-----
> From: Jason Swails [mailto:jason.swails.gmail.com]
> Sent: Monday, 23 September 2013 13:37
> To: AMBER Mailing List
> Subject: Re: [AMBER] Jobs hanging on parallel CUDA without producing output
>
> On Mon, Sep 23, 2013 at 6:12 AM, Nils Oberhauser
> <Nils.Oberhauser.unige.ch> wrote:
>
> > Hello,
> >
> > I am having some trouble running parallel CUDA jobs on 2 and 3 GPUs.
> > I'm observing the following problems when following the Lipid11
> > tutorial at
> > http://ambermd.org/tutorials/advanced/tutorial16/An_Amber_Lipid_Force_Field_Tutorial.html :
> > To be sure I did not do anything wrong in the setup, I downloaded
> > the example files from the tutorial page as input.
> > During the production phase the calculations abort after a short
> > time, returning:
> > "Nonbond cells need to be recalculated, restart simulation
> > from previous checkpoint with a higher value for skinnb."
> > I restart the calculation, as was suggested by previous posts in the
> > mailing list. For a while the calculations seem to be running
> > smoothly, but after a certain time period (sometimes minutes,
> > sometimes hours) there is no more write access to the trajectory or
> > any other output file. The pmemd.cuda.MPI processes keep running
> > forever without any output or error message until the process is
> > killed manually. Since the jobs just keep running without any
> > output, I'm not sure how to further debug this problem...
> > However, when I use only one GPU or 10 parallel CPU cores, the job
> > finishes normally.
> >
> > The machine I'm using has 3 GeForce GTX 680 cards and two Intel Xeon
> > E5-2620 CPUs (2 GHz, 6 cores each), running 64-bit CentOS (kernel:
> > 2.6.32-358.14.1.el6.x86_64) and the latest updates of Amber12.
> >
> > Here is the most confusing part:
> > We have another machine with identical hardware but a slightly
> > different Linux kernel (2.6.32-358.11.1.el6.x86_64) and a version of
> > Amber12 that is not up to date. Since there are currently jobs
> > running on that machine, I cannot upgrade the kernel or Amber to see
> > if it makes any difference.
> > However when I run the same job from the tutorial as described above
> > on this machine, I don't get any errors. The jobs run and terminate
> > without any problems on 3 GPUs. Even the "Nonbond cells.." error
> > never occurs (maybe because of the older amber version?).
> > I ran the jobs several times on both machines, to be quite sure that
> > the different outcomes are not due to the non-deterministic
> > algorithms of the software, but the results are always the same.
> >
>
> You can check the output of $AMBERHOME/update_amber --version to see
> what updates have been applied to which installation.
>
> That said, the indication that the nonbonded cells need to be
> recalculated was introduced by an update to address the fact that for
> certain types of simulations (particularly lipids built with the
> CHARMM lipid builder, I think), the volume changes a _lot_ during
> constant pressure simulations -- too much for the nonbonded cell
> decomposition scheme that was originally set up.
>
> This led to _very_ strange behavior. So if an 'updated' version of
> Amber indicates that non-bonded cells need to be updated, then you
> should either do your equilibration with that updated version or use
> CPUs instead (because the 'old' GPU implementation that does not emit
> the error will be wrong).
> See the description for bugfix.18.bz2 at
> http://ambermd.org/bugfixes12.html
>
> > So, before I go through all the trouble of installing an older kernel
> > on the first machine, I'm asking here for help. Can it be the kernel
> > at all? Maybe someone has a different suggestion for debugging this
> > problem.
> >
>
> Perhaps use only a single GPU to equilibrate the density, then try
> multiple GPUs for the production stage if you really need the speed-up
> of multiple GPUs (it's quite limited, isn't it?) Also, is the place
> where the simulation stalls deterministic (i.e., does it always hang
> on exactly the same step each time?)
>
> HTH,
> Jason
>
> --
> Jason M. Swails
> BioMaPS,
> Rutgers University
> Postdoctoral Researcher
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>

--
Jason M. Swails
BioMaPS,
Rutgers University
Postdoctoral Researcher

Received on Thu Sep 26 2013 - 08:30:03 PDT