[AMBER] Jobs hanging on parallel CUDA without producing output

From: Nils Oberhauser <Nils.Oberhauser.unige.ch>
Date: Mon, 23 Sep 2013 12:12:59 +0200

Hello,

I am having some trouble running parallel CUDA jobs on 2 and 3 GPUs.
I'm observing the following problems while following the Lipid11 tutorial
(http://ambermd.org/tutorials/advanced/tutorial16/An_Amber_Lipid_Force_Field_Tutorial.html):
To be sure I did not do anything wrong in the setup, I downloaded the
example files from the tutorial page as input.
During the production phase the calculations abort after a short time,
returning:
        "Nonbond cells need to be recalculated, restart simulation from
        previous checkpoint with a higher value for skinnb."
I then restart the calculation, as was suggested in previous posts on the
mailing list. For a while the calculations seem to run smoothly, but after a
certain time (sometimes minutes, sometimes hours) nothing more is written to
the trajectory or to any other output file. The pmemd.cuda.MPI processes keep
running indefinitely, without any output or error message, until they are
killed manually. Since the jobs just keep running without any output, I'm not
sure how to debug this problem any further...
When I use only one GPU or 10 parallel CPU cores however, the job finishes
normally.
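
Since the hang shows up only as output files that stop growing, one way to
catch it without watching the job by hand might be a small watchdog that
checks when the output files were last modified. The following is only a
minimal sketch in Python: the file names in the usage note, the 30-minute
idle threshold and the polling interval are hypothetical choices, not values
taken from the tutorial or from Amber.

```python
import os
import time


def seconds_since_last_write(path, now=None):
    """Return the seconds elapsed since `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def is_stalled(paths, max_idle_seconds, now=None):
    """Return True if none of the existing files was written recently.

    Files that do not exist yet are ignored; if none exist, the job is
    treated as not stalled because there is nothing to judge.
    """
    ages = [seconds_since_last_write(p, now) for p in paths if os.path.exists(p)]
    return bool(ages) and min(ages) > max_idle_seconds


def watch(paths, max_idle_seconds=1800, poll_seconds=60):
    """Poll the output files and return once they look stalled."""
    while not is_stalled(paths, max_idle_seconds):
        time.sleep(poll_seconds)
    print("No output written for more than", max_idle_seconds, "seconds:", paths)
```

It could be started alongside the job with something like
watch(["prod.out", "prod.nc"]), where both file names are placeholders for
the actual mdout and trajectory files. The threshold has to be longer than
the interval between two regular writes of the job, otherwise a healthy run
would be reported as stalled.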

The machine I'm using has three GeForce GTX 680 cards and two Intel Xeon
E5-2620 CPUs (2 GHz, 6 cores each), running 64-bit CentOS (kernel:
2.6.32-358.14.1.el6.x86_64) and Amber12 with the latest updates.

Here is the most confusing part:
We have another machine with identical hardware but a slightly different
Linux kernel (2.6.32-358.11.1.el6.x86_64) and a version of Amber12 that is
not up to date. Since there are currently jobs running on that machine, I
cannot upgrade the kernel or Amber to see if it makes any difference.
However, when I run the same job from the tutorial as described above on this
machine, I don't get any errors. The jobs run and terminate without any
problems on 3 GPUs. Even the "Nonbond cells.." error never occurs (maybe
because of the older Amber version?).
I ran the jobs several times on both machines, to be reasonably sure that the
different outcomes are not due to the non-deterministic algorithms of the
software, but the results are always the same.
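
To narrow down what actually differs between the two machines besides the
kernel, it might help to record the same facts on both and compare them. The
following Python sketch is only an illustration: the choice of which items to
record (kernel release, GPU name and driver version from nvidia-smi, CUDA
compiler version from nvcc) is an assumption about what could be relevant,
not a diagnosis. Tools that are not installed are simply reported as None.

```python
import platform
import shutil
import subprocess


def run(cmd):
    """Run `cmd` and return its stripped stdout, or None if it is unavailable."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def snapshot():
    """Collect a few environment facts that may differ between machines."""
    return {
        "kernel": platform.release(),
        "gpus": run(["nvidia-smi", "--query-gpu=name,driver_version",
                     "--format=csv,noheader"]),
        "cuda_compiler": run(["nvcc", "--version"]),
    }


def diff(a, b):
    """Return only the keys whose values differ, as (a_value, b_value) pairs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

Running snapshot() on each machine, saving the two dictionaries (for example
as JSON) and passing them to diff() would list only the entries that differ,
which should at least show whether the kernel is the only difference or
whether the driver or CUDA versions differ as well.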

So, before I go through all the trouble of installing an older kernel on the
first machine, I'm asking here for help. Could the kernel be the cause at all?
Maybe someone has a different suggestion for debugging this problem.

Thanks in advance and best regards,
Nils




_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Sep 23 2013 - 03:30:02 PDT