Re: [AMBER] AMBER GPU Job Write Fails

From: Ross Walker <>
Date: Tue, 09 Jul 2013 12:57:34 -0700

Hi Jan-Philip

Hmmmm. I thought we'd worked around all the GTX580 problems. Maybe one
crept back in with recent updates. Can you try doing the following:

make clean
./configure -cuda gnu

Then edit config.h

And modify the following line from:





make install

And see if the problem goes away. At least for single GPU runs.
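
While testing, it may help to catch the hang described below (pmemd.cuda
still at 100 % CPU but nothing written anymore) automatically rather than
by eye. The following is only a minimal sketch, not part of AMBER: it
assumes the job writes to a regular output file and flags the job as
possibly hung when that file has not been modified for a chosen time. The
function names and the threshold are hypothetical and chosen for
illustration.

```python
import os
import time


def seconds_since_last_write(path, now=None):
    """Return how many seconds have passed since `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def looks_hung(path, max_idle_seconds, now=None):
    """Return True if `path` has not been modified for more than
    `max_idle_seconds`. This is a heuristic only: a stale output file
    suggests a hang but does not prove one."""
    return seconds_since_last_write(path, now) > max_idle_seconds


# Example (hypothetical file name and threshold):
#     if looks_hung("mdout", 1800):
#         print("mdout not updated for 30 minutes, job may be hung")
```

The threshold needs to be much longer than the normal interval between
writes for the run in question (which depends on settings such as ntpr and
on output buffering), otherwise healthy jobs will be flagged.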

All the best

On 7/9/13 10:31 AM, "Jan-Philip Gehrcke" <> wrote:

>we're running a cluster with a couple of Tesla C2070 as well as GTX 580
>cards with newest GPU code of Amber 12 (as of today). Within the past
>weeks I sometimes observed frozen or locked up / hung up GPU jobs with
>the same symptoms as described by Mona (no output written anymore). In
>fact, I am now in the process of systematically collecting information
>about this issue and my plan was to consult the Amber community within
>the next days to ask for help and recommendations. Currently, I am trying
>to find out whether this issue is reproducible. I also want to run the
>reproducibility tests mentioned various times in the other discussion
>about the issue related to the GTX Titans in order to see if our
>hardware fulfills this condition or if it must be considered unreliable
>per se.
>What I know so far: this only happened with jobs running on the GTX 580
>cards. In this erroneous state, the pmemd.cuda process still consumes
>100 % CPU, but no output is written anymore. This state lasts
>forever until the pmemd.cuda process is actively killed. After
>that, the affected GPU is immediately able to run another job without
>obvious problems. If idle, the GTX580s we're using have a temperature of
>about 50 degrees. If under load, they have about 90 degrees. They are in
>a cooled rack, so conditions are pretty stable. That's why I believe I
>have observed that in the erroneous state as described above they are a
>few -- maybe 5 -- degrees cooler than 90 degrees (we monitor
>temperatures with munin). Meaning: when in error state they are
>basically as hot as under load which would support your observation
>that they are still doing something (if our two observations are
>comparable at all), but maybe just a tiny bit cooler, which might be a
>helpful indication to an expert.
>In a comparable discussion, using the latest code version was the
>solution to a similar problem:
>Looks like this is not the case for 'our' problem here.
>I will get back to the mailing list when I have more information.
>On 09.07.2013 18:42, Mona Minkara wrote:
>> Hi,
>> On numerous occasions I have run AMBER jobs on a GPU which, although
>> they appear to be running, fail to write output. Our HPC staff suggests
>> It seems to be running, although I do see that it
>> keeps calling cudaThreadSynchronize in gpu_calculate_kinetic_energy_.
>> with the two GPU case, it might be looping over some calls.
>> If anyone has encountered this type of problem with GPU jobs failing to
>> write please let me know if there are suggestions for fixing this or if
>> there is a bug fix.
>> Thanks,
>> Mona
>> _______________________________________________
>> AMBER mailing list
Received on Tue Jul 09 2013 - 13:00:05 PDT