Re: [AMBER] AMBER GPU Job Write Fails

From: Ross Walker <ross.rosswalker.co.uk>
Date: Tue, 09 Jul 2013 12:57:34 -0700

Hi Jan-Philip

Hmmmm. I thought we'd worked around all the GTX580 problems. Maybe one
crept back in with recent updates. Can you try doing the following:

cd $AMBERHOME
make clean
./configure -cuda gnu

Then edit config.h

And modify the following line from:

PMEMD_CU_DEFINES=-DCUDA -Duse_SPFP

to

PMEMD_CU_DEFINES=-DCUDA -Duse_SPFP -DNODPTEXTURE

Then

make install

And see if the problem goes away. At least for single GPU runs.
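
If it helps, the whole sequence in one go would be roughly the following
(just a sketch, untested; the sed line simply appends the flag to the
PMEMD_CU_DEFINES line in config.h, editing the file by hand is of course
equally fine):

cd $AMBERHOME
make clean
./configure -cuda gnu
sed -i 's/^PMEMD_CU_DEFINES=.*/& -DNODPTEXTURE/' config.h
make install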

All the best
Ross



On 7/9/13 10:31 AM, "Jan-Philip Gehrcke" <jgehrcke.googlemail.com> wrote:

>Hello,
>
>we're running a cluster with a couple of Tesla C2070 as well as GTX 580
>cards, using the newest GPU code of Amber 12 (as of today). Within the
>past weeks I have occasionally observed frozen or hung GPU jobs with the
>same symptoms Mona describes (no output written anymore). In fact, I am
>now systematically collecting information about this issue, and my plan
>was to consult the Amber community within the next few days to ask for
>help and recommendations. Right now I am trying to find out whether the
>issue is reproducible at all. I also want to run the reproducibility
>tests mentioned several times in the other discussion about the GTX
>Titan issue, in order to see whether our hardware passes them or has to
>be considered unreliable per se.
>
>What I know so far: this has only happened with jobs running on the
>GTX 580 cards. In this erroneous state, the pmemd.cuda process still
>consumes 100 % CPU, but no output is written anymore. The state lasts
>until the pmemd.cuda process is actively killed; after that, the
>affected GPU can immediately run another job without obvious problems.
>When idle, the GTX 580s we're using sit at about 50 degrees C; under
>load, at about 90 degrees C. They are in a cooled rack, so conditions
>are pretty stable. That's why I believe I have observed that in the
>erroneous state described above they run a few -- maybe 5 -- degrees
>cooler than the usual 90 degrees (we monitor temperatures with munin).
>In other words, in the error state they are basically as hot as under
>load, which would support your observation that they are still doing
>something (if our two observations are comparable at all), but perhaps
>a tiny bit cooler, which might be a helpful hint for an expert.
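>
>In case it is useful here, a rough way to catch this state automatically
>(just a sketch, untested; it assumes a single pmemd.cuda process and
>that the run's output file is called 'mdout') would be to flag a process
>that still burns CPU while its output has been silent for ten minutes:
>
>MDOUT=mdout                      # output file of the run (adjust)
>PID=$(pgrep -o pmemd.cuda)       # oldest matching pmemd.cuda process
>while sleep 600; do
>    AGE=$(( $(date +%s) - $(stat -c %Y "$MDOUT") ))
>    CPU=$(ps -o %cpu= -p "$PID")
>    if [ "$AGE" -gt 600 ]; then
>        echo "possible hang: pmemd.cuda ($PID) at ${CPU}% CPU," \
>             "$MDOUT silent for ${AGE}s"
>    fi
>done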
>
>In a comparable discussion, updating to the latest code version solved
>a similar problem: http://archive.ambermd.org/201110/0387.html
>
>That does not seem to be the case for 'our' problem here.
>
>I will get back to the mailing list when I have more information.
>
>Cheers,
>
>JP
>
>
>On 09.07.2013 18:42, Mona Minkara wrote:
>> Hi,
>> On numerous occasions I have run AMBER jobs on a GPU which, although
>> they appear to be running, fail to write output. Our HPC staff
>> suggests that:
>>
>> It seems to be running, although I do see that it keeps calling
>> cudaThreadSynchronize in gpu_calculate_kinetic_energy_. Similarly, in
>> the two-GPU case, it might be looping over some calls.
>>
>> If anyone has encountered this type of problem with GPU jobs failing
>> to write, please let me know if there are suggestions for fixing this
>> or if there is a bug fix.
>>
>> Thanks,
>>
>> Mona
>
>



_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Jul 09 2013 - 13:00:05 PDT