I've had this problem on my faulty TITANs: 100% CPU utilisation and 80
degrees C. Quite annoying, as I thought the runs were working; I use
gpu-info.sh to gauge whether the cards are busy or not. So a week's worth
of what I thought was useful work was actually just heat. Reaction not
favourable.
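
For what it's worth, checking whether the output file is still being
updated has turned out to be a more reliable progress check for me than
GPU utilisation or temperature, since a hung pmemd.cuda process shows
both. A rough sketch of the kind of check I mean (the output file name
and the 30-minute threshold are just placeholders, nothing from AMBER or
from gpu-info.sh):

#!/bin/sh
# Rough watchdog sketch: utilisation and temperature look normal on a hung
# pmemd.cuda run, so instead flag runs whose output file has gone stale.
# The file name and 30-minute threshold are placeholders; adjust to taste.
OUTFILE=${1:-mdout}

if [ ! -e "$OUTFILE" ]; then
    echo "no output file '$OUTFILE' found" >&2
    exit 1
fi

# find ... -mmin +30 matches the file only if it was last modified more
# than 30 minutes ago
if [ -n "$(find "$OUTFILE" -mmin +30)" ]; then
    echo "WARNING: $OUTFILE not updated for >30 min; run may be hung"
else
    echo "$OUTFILE still being updated; run appears to be making progress"
fi
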
On 9 July 2013 20:57, Ross Walker <ross.rosswalker.co.uk> wrote:
> Hi Jan-Philip
>
> Hmmmm. I thought we'd worked around all the GTX580 problems. Maybe one
> crept back in with recent updates. Can you try doing the following:
>
> cd $AMBERHOME
> make clean
> ./configure -cuda gnu
>
> Then edit config.h
>
> And modify the following line from:
>
> PMEMD_CU_DEFINES=-DCUDA -Duse_SPFP
>
> to
>
> PMEMD_CU_DEFINES=-DCUDA -Duse_SPFP -DNODPTEXTURE
>
> Then
>
> make install
>
> And see if the problem goes away. At least for single GPU runs.
>
> All the best
> Ross
>
>
>
> On 7/9/13 10:31 AM, "Jan-Philip Gehrcke" <jgehrcke.googlemail.com> wrote:
>
> >Hello,
> >
> >we're running a cluster with a couple of Tesla C2070 as well as GTX 580
> >cards with the newest GPU code of Amber 12 (as of today). Within the
> >past weeks I have sometimes observed frozen or hung GPU jobs with the
> >same symptoms as described by Mona (no output written anymore). In fact,
> >I am now in the process of systematically collecting information about
> >this issue, and my plan was to consult the Amber community within the
> >next few days to ask for help and recommendations. Currently, I am
> >trying to find out whether this issue is reproducible or not. I also
> >want to run the reproducibility tests mentioned various times in the
> >other discussion about the GTX Titan issue, in order to see whether our
> >hardware passes them or must be considered unreliable per se.
> >
> >What I know so far: this has only happened with jobs running on the GTX
> >580 cards. In this erroneous state, the pmemd.cuda process still
> >consumes 100 % CPU, but no output is written anymore. The state persists
> >until the pmemd.cuda process is actively killed. After that, the
> >affected GPU is immediately able to run another job without obvious
> >problems. When idle, the GTX 580s we're using sit at a temperature of
> >about 50 degrees; under load, at about 90 degrees. They are in a cooled
> >rack, so conditions are pretty stable. That's why I believe I have
> >observed that in the erroneous state described above they run a few --
> >maybe 5 -- degrees cooler than the usual 90 degrees (we monitor
> >temperatures with munin). Meaning: in the error state they are basically
> >as hot as under load, which would support your observation that they are
> >still doing something (if our two observations are comparable at all),
> >but maybe just a tiny bit cooler, which might be a helpful indication to
> >an expert.
> >
> >In a comparable discussion, using the latest code version was the
> >solution to a similar problem:
> >http://archive.ambermd.org/201110/0387.html
> >
> >It looks like that is not the case for 'our' problem here.
> >
> >I will get back to the mailing list when I have more information.
> >
> >Cheers,
> >
> >JP
> >
> >
> >On 09.07.2013 18:42, Mona Minkara wrote:
> >> Hi,
> >> On numerous occasions I have run AMBER jobs on a GPU which, although
> >> they appear to be running, fail to write output. Our HPC staff
> >> suggests that:
> >>
> >> It seems to be running, although I do see that it keeps calling
> >> cudaThreadSynchronize in gpu_calculate_kinetic_energy_. Similarly,
> >> with the two-GPU case, it might be looping over some calls.
> >>
> >>
> >>
> >> If anyone has encountered this type of problem with GPU jobs failing
> >> to write output, please let me know if there are suggestions for
> >> fixing it or if there is a bug fix.
> >>
> >> Thanks,
> >>
> >> Mona
> >>
> >
> >
>
>
>
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Jul 10 2013 - 06:30:02 PDT