Re: [AMBER] GTX Titan Xs slowing down after 200ns

From: Scott Le Grand <varelse2005.gmail.com>
Date: Thu, 23 Jul 2015 08:39:28 -0700

So um, why are you running 2 GPU processes per GPU? That would explain a
lot...
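
A quick way to see which compute processes are attached to each GPU is nvidia-smi's query interface (a minimal sketch only; the field list follows nvidia-smi --help-query-compute-apps, and plain nvidia-smi shows the same information in table form):

    # List every compute process per GPU with its memory footprint
    nvidia-smi --query-compute-apps=gpu_bus_id,pid,process_name,used_memory --format=csv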

On Thu, Jul 23, 2015 at 8:26 AM, Mohamed Faizan Momin <mmomin9.student.gsu.edu> wrote:

> Hi Scott,
>
> This is my nvidia-smi output. I'm running the latest version of the
> driver, and I've yet to try it on just one GPU.
>
> +------------------------------------------------------+
> | NVIDIA-SMI 352.21     Driver Version: 352.21          |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |===============================+======================+======================|
> |   0  GeForce GTX TIT...   On  | 0000:01:00.0     Off |                  N/A |
> | 60%   65C    P2   190W / 250W |    392MiB / 12287MiB |     95%      Default |
> +-------------------------------+----------------------+----------------------+
> |   1  GeForce GTX TIT...   On  | 0000:02:00.0     Off |                  N/A |
> | 60%   55C    P2   156W / 250W |    361MiB / 12287MiB |     84%      Default |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                       GPU Memory |
> |  GPU       PID  Type  Process name                               Usage      |
> |=============================================================================|
> |    0      3176     C  pmemd.cuda.MPI                               110MiB   |
> |    0      3177     C  pmemd.cuda.MPI                               255MiB   |
> |    1      3176     C  pmemd.cuda.MPI                               223MiB   |
> |    1      3177     C  pmemd.cuda.MPI                               110MiB   |
> +-----------------------------------------------------------------------------+
>
>
> --
> Mohamed Faizan Momin
>
> ________________________________________
> From: Scott Le Grand <varelse2005.gmail.com>
> Sent: Thursday, July 23, 2015 11:12 AM
> To: AMBER Mailing List
> Subject: Re: [AMBER] GTX Titan Xs slowing down after 200ns
>
> 1. What display driver? If <346.82, upgrade.
>
> 2. Do single GPU runs show the same behavior?
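
Both checks take only a couple of lines (a sketch; the input and topology file names come from the script quoted further down, the test_single.* output names are placeholders, and a single-GPU variant of the full production loop is sketched right after that script):

    # Report the installed display driver version
    nvidia-smi --query-gpu=driver_version --format=csv,noheader

    # Restrict a test run to GPU 0, then launch the serial GPU binary
    setenv CUDA_VISIBLE_DEVICES 0
    pmemd.cuda -O -i production.in -p ../protein.prmtop -c production.M.restrt \
        -o test_single.out -r test_single.restrt -x test_single.mdcrd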
>
> On Thu, Jul 23, 2015 at 7:51 AM, Ross Walker <ross.rosswalker.co.uk> wrote:
>
> > Hi Mohamed,
> >
> > Very very weird. A couple of things to try:
> >
> > 1) If you run the single GPU code rather than the MPI code, does the same
> > thing happen?
> >
> > 2) Try using mpich3 rather than openMPI and see if the same problem
> > occurs. It's possible there is a memory leak in openMPI that is causing an
> > issue - or causing P2P to stop working - or some other weirdness.
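
One low-effort way to try the mpich3 route is to point the environment at an MPICH install and launch through its mpiexec (a sketch only; the /software/mpich-3.x prefix is a placeholder for the real install location, and pmemd.cuda.MPI has to be rebuilt against MPICH before this will run):

    # Placeholder MPICH3 install prefix - adjust to the actual path
    setenv MPI_HOME /software/mpich-3.x
    setenv PATH "${MPI_HOME}/bin:${PATH}"
    setenv LD_LIBRARY_PATH "${MPI_HOME}/lib:${LD_LIBRARY_PATH}"

    # Same run command as before, launched through MPICH's mpiexec
    mpiexec -np 2 pmemd.cuda.MPI -O -i production.in -p ../protein.prmtop \
        -c production.M.restrt -o production.N.out -r production.N.restrt \
        -x production.N.mdcrd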
> >
> > All the best
> > Ross
> >
> > > On Jul 23, 2015, at 7:47 AM, Mohamed Faizan Momin <mmomin9.student.gsu.edu> wrote:
> > >
> > > Hi Ross,
> > >
> > > The production file stays the same throughout the entire run, since I want
> > > to run for a full microsecond, and no other jobs are being fired up on this
> > > machine as I'm the only one with access to it. I'm running this over a full
> > > microsecond continuously using this script:
> > >
> > > #!/bin/csh
> > >
> > > setenv CUDA_HOME /usr/local/cuda-6.5
> > > setenv LD_LIBRARY_PATH "/usr/local/cuda-6.5/lib64:/software/openmpi.1.8.1/lib:${LD_LIBRARY_PATH}"
> > > setenv PATH "/usr/local/cuda-6.5/bin:${PATH}"
> > > setenv CUDA_VISIBLE_DEVICES "0,1"
> > >
> > > set prv=M
> > >
> > > foreach cur (N O P Q R S T U V W X Y Z)
> > >
> > >     /software/openmpi.1.8.1/bin/mpirun -v -np 2 pmemd.cuda.MPI -O -i production.in \
> > >         -p ../protein.prmtop -c production.$prv.restrt -o production.$cur.out \
> > >         -r production.$cur.restrt -x production.$cur.mdcrd
> > >
> > >     set prv=$cur
> > > end
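
A single-GPU version of the same loop, for the comparison test suggested above, only needs the serial binary and one visible device (a sketch; the segment letters and file names follow the script above, and the choice of GPU 0 is arbitrary):

    # Same segment chain, but pinned to one card and using the non-MPI binary
    setenv CUDA_VISIBLE_DEVICES "0"

    set prv=M
    foreach cur (N O P Q R S T U V W X Y Z)
        pmemd.cuda -O -i production.in -p ../protein.prmtop -c production.$prv.restrt \
            -o production.$cur.out -r production.$cur.restrt -x production.$cur.mdcrd
        set prv=$cur
    end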
> > >
> > > --
> > > Mohamed Faizan Momin
> > >
> > > ________________________________________
> > > From: Ross Walker <ross.rosswalker.co.uk>
> > > Sent: Thursday, July 23, 2015 10:41 AM
> > > To: AMBER Mailing List
> > > Subject: Re: [AMBER] GTX Titan Xs slowing down after 200ns
> > >
> > > Hi Mohamed,
> > >
> > > My first thought here was temperature throttling, but when you say it
> > > always happens at the same point, that hypothesis goes out the window. I've
> > > never seen this behavior before and am not even sure how to speculate on
> > > what might be causing it. First off, given you say the performance is halved,
> > > are you certain it is not related to your input files in some way? Is there
> > > any difference between them - are you suddenly dropping the time step to 1fs
> > > from 2fs? Is anything else changed - do you change the ensemble or the
> > > barostat?
> > >
> > > My guess is it has to be something related to your simulation settings
> > > rather than the machine or GPUs, since it happens when you start the next
> > > simulation. The other possibility is that somehow multiple runs are being
> > > fired up on the same GPU. E.g. I could envision forgetting to set
> > > CUDA_VISIBLE_DEVICES again after the first run on GPU 0 completes, so that
> > > the second run which was supposed to go on GPU 0 ends up on GPU 1, where
> > > another job is already running. Look for things like this in your scripts /
> > > by watching with nvidia-smi etc.
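
One way to rule that scenario out is to pin each independent job to its own card explicitly when it is launched (a sketch; the job0/job1 file names are placeholders for whatever each job actually uses):

    # Shell 1: job pinned to GPU 0
    setenv CUDA_VISIBLE_DEVICES 0
    pmemd.cuda -O -i job0.in -p job0.prmtop -c job0.restrt -o job0.out -r job0.new.restrt

    # Shell 2: job pinned to GPU 1
    setenv CUDA_VISIBLE_DEVICES 1
    pmemd.cuda -O -i job1.in -p job1.prmtop -c job1.restrt -o job1.out -r job1.new.restrt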
> > >
> > > All the best
> > > Ross
> > >
> > >> On Jul 23, 2015, at 7:12 AM, Mohamed Faizan Momin <mmomin9.student.gsu.edu> wrote:
> > >>
> > >> Hi all,
> > >>
> > >>
> > >> I have two GTX Titan Xs paired with an i7-5930K 3.5 GHz processor in
> > >> an ASUS Rampage V motherboard with 16GB 2133 MHz DDR4 RAM. I'm running a
> > >> relatively small ~15K atom system and doing normal MD simulation with
> > >> dt=0.002. My production setup writes a new set of files every 100ns. I get
> > >> an average of 275ns/day on the system, but for some reason the Titan Xs
> > >> slow down to a mere 100ns/day after two runs, i.e. 200ns. This happens
> > >> exactly after the 2nd run is completed. I have to stop the current job and
> > >> start it up again to continue onward. The err.log is empty and the
> > >> temperatures are not an issue, as I have the fans running at ~50%, which
> > >> keeps both GPUs under 65C. Any suggestions?
> > >>
> > >>
> > >> --
> > >> Mohamed Faizan Momin
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jul 23 2015 - 09:00:03 PDT