Re: [AMBER] AMBER GPU Job Write Fails from Jan-Philip Gehrcke on 2013-07-09 (Amber Archive Jul 2013)

From: Jan-Philip Gehrcke <jgehrcke.googlemail.com>
Date: Tue, 09 Jul 2013 19:31:05 +0200

Hello,

we're running a cluster with a couple of Tesla C2070 as well as GTX 580
cards with the newest GPU code of Amber 12 (as of today). Over the past
weeks I have occasionally observed frozen or hung GPU jobs with the same
symptoms as described by Mona (no output written anymore). In fact, I am
now in the process of systematically collecting information about this
issue, and my plan was to consult the Amber community within the next
few days to ask for help and recommendations. Currently, I am trying to
find out whether this issue is reproducible. I also want to run the
reproducibility tests mentioned several times in the other discussion
about the issue related to the GTX Titans, in order to see whether our
hardware meets this requirement or must be considered unreliable
per se.
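
For illustration, a reproducibility check of this kind could be scripted
roughly as follows. This is only a minimal sketch, not the procedure from
the other thread: it assumes the same input was run twice on the same GPU
with the same random seed, and that the energy lines of interest in the
two output files can be picked out by the marker string "Etot" (an
assumption about the output format). The function names are hypothetical.

```python
def extract_marked_lines(text, marker="Etot"):
    """Return the stripped lines of `text` that contain `marker`.

    The default marker "Etot" is an assumption about how energy lines
    are labelled in the output; adjust it to the actual file format.
    """
    return [line.strip() for line in text.splitlines() if marker in line]


def runs_match(text_a, text_b, marker="Etot"):
    """True if both outputs contain the same non-empty sequence of marked lines."""
    lines_a = extract_marked_lines(text_a, marker)
    lines_b = extract_marked_lines(text_b, marker)
    # An empty list means no energies were found, which is not a match.
    return len(lines_a) > 0 and lines_a == lines_b


# Usage (file names are placeholders):
#   with open("run1.out") as fa, open("run2.out") as fb:
#       print(runs_match(fa.read(), fb.read()))
```

Comparing the lines as text rather than as parsed numbers is deliberate
here: the point of such a test is that two identical runs should print
identical values, so any difference at all is worth looking at.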

What I know so far: this has only happened with jobs running on the
GTX 580 cards. In this erroneous state, the pmemd.cuda process still
consumes 100 % CPU, but no output is written anymore. This state lasts
until the pmemd.cuda process is actively killed. After that, the
affected GPU is immediately able to run another job without obvious
problems. When idle, the GTX 580s we're using have a temperature of
about 50 degrees. Under load, they are at about 90 degrees. They are in
a cooled rack, so conditions are pretty stable. That is why I believe I
have observed that, in the erroneous state described above, they are a
few (maybe 5) degrees cooler than 90 degrees (we monitor temperatures
with munin). Meaning: in the error state they are basically as hot as
under load, which would support your observation that they are still
doing something (if our two observations are comparable at all), but
maybe just a tiny bit cooler, which might be a helpful hint to an
expert.
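
Since the visible symptom is that the process keeps running while the
output stops growing, such a state could in principle be detected
automatically by watching the modification time of the output file. The
following is a minimal sketch under that assumption. The function names
and the idle threshold are hypothetical, and it does not check CPU usage
or GPU temperature.

```python
import os
import time


def seconds_since_last_write(path, now=None):
    """Return how many seconds ago `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def is_stalled(path, max_idle_seconds, now=None):
    """True if the output file has not been written for longer than the threshold.

    `max_idle_seconds` has to be chosen well above the normal interval
    between two writes of the job, otherwise healthy jobs are flagged.
    """
    return seconds_since_last_write(path, now) > max_idle_seconds


# Usage (path and threshold are placeholders):
#   if is_stalled("mdout", max_idle_seconds=1800):
#       print("output has not changed for 30 minutes; job may be hung")
```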

In a comparable discussion, using the latest code version was the
solution to a similar problem: http://archive.ambermd.org/201110/0387.html

That does not seem to be the case for 'our' problem here, since we are
already running the newest code.

I will get back to the mailing list when I have more information.

Cheers,

JP

On 09.07.2013 18:42, Mona Minkara wrote:
> Hi,
> On numerous occasions I have run AMBER jobs on a GPU which, although they
> appear to be running, fail to write output. Our HPC staff suggests that:
>
> It seems to be running, although I do see that it
> keeps calling cudaThreadSynchronize in gpu_calculate_kinetic_energy_. Similar
> with the two GPU case, it might be looping over some calls.
>
>
>
> If anyone has encountered this type of problem with GPU jobs failing to
> write please let me know if there are suggestions for fixing this or if
> there is a bug fix.
>
> Thanks,
>
> Mona
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>

Received on Tue Jul 09 2013 - 11:00:02 PDT