Re: [AMBER] GTX Titan Xs slowing down after 200ns

From: Scott Le Grand <varelse2005.gmail.com>
Date: Thu, 23 Jul 2015 08:12:15 -0700

1. What display driver? If <346.82, upgrade.

2. Do single GPU runs show the same behavior?
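
For the driver question, a minimal sketch of how the installed version could be compared against the 346.82 threshold mentioned above. The helper names are made up for illustration; the only external call is `nvidia-smi --query-gpu=driver_version --format=csv,noheader`, and the sketch assumes the driver string is a plain dotted number.

```python
import subprocess

# Threshold taken from the message above; everything else is illustrative.
MIN_DRIVER = (346, 82)


def parse_version(text):
    """Turn a driver string such as '346.82' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def needs_upgrade(version_text, minimum=MIN_DRIVER):
    """Return True if the given driver version is older than the minimum."""
    return parse_version(version_text) < minimum


def installed_driver_versions():
    """Ask nvidia-smi for the driver version, one line per GPU."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        universal_newlines=True,
    )
    return [line.strip() for line in out.splitlines() if line.strip()]
```

On a machine with the NVIDIA driver installed, something like `[needs_upgrade(v) for v in installed_driver_versions()]` would report one flag per GPU.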

On Thu, Jul 23, 2015 at 7:51 AM, Ross Walker <ross.rosswalker.co.uk> wrote:

> Hi Mohamed,
>
> Very very weird. A couple of things to try:
>
> 1) If you run the single GPU code rather than the MPI code, does the same
> thing happen?
>
> 2) Try using mpich3 rather than openMPI and see if the same problem
> occurs. It's possible there is a memory leak in openMPI that is causing an
> issue - or causing P2P to stop working - or some other weirdness.
>
> All the best
> Ross
>
> > On Jul 23, 2015, at 7:47 AM, Mohamed Faizan Momin <mmomin9.student.gsu.edu> wrote:
> >
> > Hi Ross,
> >
> > The production file stays the same throughout the entire run, since I want
> > to run for a full microsecond, and no other jobs are being started on this
> > machine, as I'm the only one with access to it. I'm running the full
> > microsecond continuously using this script:
> >
> > #!/bin/csh
> >
> > setenv CUDA_HOME /usr/local/cuda-6.5
> > setenv LD_LIBRARY_PATH "/usr/local/cuda-6.5/lib64:/software/openmpi.1.8.1/lib:${LD_LIBRARY_PATH}"
> > setenv PATH "/usr/local/cuda-6.5/bin:${PATH}"
> > setenv CUDA_VISIBLE_DEVICES "0,1"
> >
> > set prv=M
> >
> > foreach cur (N O P Q R S T U V W X Y Z)
> >
> > /software/openmpi.1.8.1/bin/mpirun -v -np 2 pmemd.cuda.MPI -O -i production.in -p ../protein.prmtop -c production.$prv.restrt -o production.$cur.out -r production.$cur.restrt -x production.$cur.mdcrd
> >
> > set prv=$cur
> > end
> >
> > --
> > Mohamed Faizan Momin
> >
> > ________________________________________
> > From: Ross Walker <ross.rosswalker.co.uk>
> > Sent: Thursday, July 23, 2015 10:41 AM
> > To: AMBER Mailing List
> > Subject: Re: [AMBER] GTX Titan Xs slowing down after 200ns
> >
> > Hi Mohamed,
> >
> > My first thought here was temperature throttling, but since you say it
> > always happens at the same point, that hypothesis goes out the window. I've
> > never seen this behavior before and am not even sure how to speculate on
> > what might be causing it. First off, given that you say the performance is
> > half, are you certain it is not related to your input files in some way?
> > Is there any difference between them: are you suddenly dropping the time
> > step from 2 fs to 1 fs? Has anything else changed: do you change the
> > ensemble or the barostat?
> >
> > My guess is it has to be something related to your simulation settings
> > rather than the machine or GPUs, since it happens when you start the next
> > simulation. The other possibility is that multiple runs are somehow being
> > fired up on the same GPU. E.g. I could envision forgetting to set
> > CUDA_VISIBLE_DEVICES again after the first run on GPU 0 completes, so that
> > the second run that was supposed to go on GPU 0 ends up on GPU 1, where
> > another job is already running. Look for things like this in your scripts
> > and by watching with nvidia-smi etc.
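
One way the "multiple runs on the same GPU" check could be automated is sketched below. It assumes the text comes from `nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name --format=csv,noheader`; the function names and the `expected` threshold are hypothetical. With the MPI build, more than one process per GPU may be normal, so the threshold is something to tune rather than a fixed rule.

```python
from collections import Counter


def processes_per_gpu(csv_text):
    """Count compute processes per GPU UUID.

    csv_text is assumed to be 'gpu_uuid, pid, process_name' lines, as printed
    by nvidia-smi --query-compute-apps=... --format=csv,noheader.
    """
    counts = Counter()
    for line in csv_text.splitlines():
        fields = [field.strip() for field in line.split(",", 2)]
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        counts[fields[0]] += 1
    return dict(counts)


def oversubscribed(csv_text, expected=1):
    """Return the GPU UUIDs running more processes than expected."""
    counts = processes_per_gpu(csv_text)
    return sorted(gpu for gpu, n in counts.items() if n > expected)
```

Called periodically while the job loop runs, this would show whether a second process lands on an already busy GPU at the point where the slowdown starts.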
> >
> > All the best
> > Ross
> >
> >> On Jul 23, 2015, at 7:12 AM, Mohamed Faizan Momin <mmomin9.student.gsu.edu> wrote:
> >>
> >> Hi all,
> >>
> >>
> >> I have two GTX Titan Xs paired with an i7-5930K @ 3.5 GHz processor in
> >> an ASUS Rampage V motherboard with 16 GB of 2133 MHz DDR4 RAM. I'm running
> >> a relatively small ~15K atom system and doing a normal MD simulation with
> >> dt=0.002. My production file setup saves files every 100 ns. I get an
> >> average of 275 ns/day on the system, but for some reason the Titan Xs slow
> >> down to a mere 100 ns/day after two runs, or 200 ns. This happens exactly
> >> after the 2nd run is completed. I have to stop the current job and start
> >> it up again to continue onward. The err.log is empty and the temperatures
> >> are not an issue, as I have the fans running at ~50%, which keeps both GPUs
> >> under 65 °C. Any suggestions?
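
To put the reported numbers in perspective, here is a small worked calculation. It assumes dt is in picoseconds (AMBER's unit for dt) and that each run covers 100 ns, as described above; the function names are illustrative only.

```python
def steps_for(ns, dt_ps):
    """Number of MD steps needed to cover `ns` nanoseconds at a step of dt_ps picoseconds."""
    return round(ns * 1000.0 / dt_ps)


def hours_for(ns, ns_per_day):
    """Wall-clock hours needed to simulate `ns` nanoseconds at a given throughput."""
    return ns / ns_per_day * 24.0


steps_per_run = steps_for(100, 0.002)   # 50,000,000 steps per 100 ns run
fast_hours = hours_for(100, 275)        # about 8.7 hours per run at 275 ns/day
slow_hours = hours_for(100, 100)        # 24 hours per run at 100 ns/day
```

So each 100 ns segment goes from roughly 8.7 hours to a full day once the slowdown sets in, which is large enough to spot from the timestamps of the output files alone.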
> >>
> >>
> >> --
> >> Mohamed Faizan Momin
> >> _______________________________________________
> >> AMBER mailing list
> >> AMBER.ambermd.org
> >> http://lists.ambermd.org/mailman/listinfo/amber
> >
> >
Received on Thu Jul 23 2015 - 08:30:02 PDT