So, why are you running 2 GPU processes per GPU?  That would explain a
lot...
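
For what it's worth, here is a minimal sketch of one way to count the compute processes on each GPU from the "Processes" table that nvidia-smi prints. It assumes the row layout of the 352.21 output quoted below (GPU index, PID, type, process name); the function name processes_per_gpu and the SAMPLE text are illustrative only and are not part of AMBER or the NVIDIA tools.

```python
import re
from collections import defaultdict

# Start of a row in the nvidia-smi "Processes" table:
# GPU index, PID, type (C, G or C+G), process name.
# Assumption: the layout matches the 352.21 output in this thread.
ROW = re.compile(r"\|\s+(\d+)\s+(\d+)\s+(C\+G|C|G)\s+(\S+)")


def processes_per_gpu(smi_text):
    """Map each GPU index to the sorted list of PIDs listed on it."""
    pids = defaultdict(set)
    for line in smi_text.splitlines():
        match = ROW.search(line)
        if match:
            gpu, pid = int(match.group(1)), int(match.group(2))
            pids[gpu].add(pid)
    return {gpu: sorted(found) for gpu, found in sorted(pids.items())}


# Process rows copied from the nvidia-smi output in this thread.
SAMPLE = """\
|    0      3176    C   pmemd.cuda.MPI                    110MiB |
|    0      3177    C   pmemd.cuda.MPI                    255MiB |
|    1      3176    C   pmemd.cuda.MPI                    223MiB |
|    1      3177    C   pmemd.cuda.MPI                    110MiB |
"""

if __name__ == "__main__":
    for gpu, found in processes_per_gpu(SAMPLE).items():
        print(f"GPU {gpu}: {len(found)} process(es), PIDs {found}")
```

On the rows above this reports two PIDs (3176 and 3177) on each of GPU 0 and GPU 1, which is what prompted the question.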
On Thu, Jul 23, 2015 at 8:26 AM, Mohamed Faizan Momin <
mmomin9.student.gsu.edu> wrote:
> Hi Scott,
>
> This is my nvidia-smi output. I'm running the latest version of the
> driver, and I've yet to try it on just one GPU.
>
> +------------------------------------------------------+
> | NVIDIA-SMI 352.21     Driver Version: 352.21         |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |===============================+======================+======================|
> |   0  GeForce GTX TIT...  On   | 0000:01:00.0     Off |                  N/A |
> | 60%   65C    P2   190W / 250W |    392MiB / 12287MiB |     95%      Default |
> +-------------------------------+----------------------+----------------------+
> |   1  GeForce GTX TIT...  On   | 0000:02:00.0     Off |                  N/A |
> | 60%   55C    P2   156W / 250W |    361MiB / 12287MiB |     84%      Default |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                       GPU Memory |
> |  GPU       PID  Type  Process name                               Usage      |
> |=============================================================================|
> |    0      3176    C   pmemd.cuda.MPI                                 110MiB |
> |    0      3177    C   pmemd.cuda.MPI                                 255MiB |
> |    1      3176    C   pmemd.cuda.MPI                                 223MiB |
> |    1      3177    C   pmemd.cuda.MPI                                 110MiB |
> +-----------------------------------------------------------------------------+
>
>
> --
> Mohamed Faizan Momin
>
> ________________________________________
> From: Scott Le Grand <varelse2005.gmail.com>
> Sent: Thursday, July 23, 2015 11:12 AM
> To: AMBER Mailing List
> Subject: Re: [AMBER] GTX Titan Xs slowing down after 200ns
>
> 1. What display driver?  If <346.82, upgrade.
>
> 2. Do single GPU runs show the same behavior?
>
> On Thu, Jul 23, 2015 at 7:51 AM, Ross Walker <ross.rosswalker.co.uk>
> wrote:
>
> > Hi Mohamed,
> >
> > Very very weird. A couple of things to try:
> >
> > 1) If you run the single GPU code rather than the MPI code, does the
> > same thing happen?
> >
> > 2) Try using mpich3 rather than openMPI and see if the same problem
> > occurs. It's possible there is a memory leak in openMPI that is causing
> > an issue - or causing P2P to stop working - or some other weirdness.
> >
> > All the best
> > Ross
> >
> > > On Jul 23, 2015, at 7:47 AM, Mohamed Faizan Momin <
> > mmomin9.student.gsu.edu> wrote:
> > >
> > > Hi Ross,
> > >
> > > The production file stays the same throughout the entire run, since
> > > I want to run for a full microsecond, and no other jobs are being
> > > fired up on this machine, as I'm the only one with access to it. I'm
> > > running the full microsecond continuously using this script:
> > >
> > > #!/bin/csh
> > >
> > > setenv CUDA_HOME /usr/local/cuda-6.5
> > > setenv LD_LIBRARY_PATH "/usr/local/cuda-6.5/lib64:/software/openmpi.1.8.1/lib:${LD_LIBRARY_PATH}"
> > > setenv PATH "/usr/local/cuda-6.5/bin:${PATH}"
> > > setenv CUDA_VISIBLE_DEVICES "0,1"
> > >
> > > set prv=M
> > >
> > > foreach cur (N O P Q R S T U V W X Y Z)
> > >
> > >  /software/openmpi.1.8.1/bin/mpirun -v -np 2 pmemd.cuda.MPI -O \
> > >    -i production.in -p ../protein.prmtop -c production.$prv.restrt \
> > >    -o production.$cur.out -r production.$cur.restrt \
> > >    -x production.$cur.mdcrd
> > >
> > >  set prv=$cur
> > > end
> > >
> > > --
> > > Mohamed Faizan Momin
> > >
> > > ________________________________________
> > > From: Ross Walker <ross.rosswalker.co.uk>
> > > Sent: Thursday, July 23, 2015 10:41 AM
> > > To: AMBER Mailing List
> > > Subject: Re: [AMBER] GTX Titan Xs slowing down after 200ns
> > >
> > > Hi Mohamed,
> > >
> > > My first thought here was temperature throttling, but when you say
> > > it always happens at the same point, that hypothesis goes out the
> > > window. I've never seen this behavior before and am not even sure how
> > > to speculate on what might be causing it. First off, given you say the
> > > performance is half, are you certain it is not related to your input
> > > files in some way? Is there any difference between them - are you
> > > suddenly dropping the time step from 2fs to 1fs? Is anything else
> > > changed - do you change the ensemble or the barostat?
> > >
> > > My guess is it has to be something related to your simulation
> > > settings rather than the machine or GPUs, since it happens when you
> > > start the next simulation. The other possibility is that somehow
> > > multiple runs are being fired up on the same GPU. E.g. I could
> > > envision forgetting to set CUDA_VISIBLE_DEVICES again after the first
> > > run on GPU 0 completes, so that the second run that was supposed to go
> > > on GPU 0 ends up on GPU 1, where another job is already running. Look
> > > for things like this in your scripts / by watching with nvidia-smi etc.
> > >
> > > All the best
> > > Ross
> > >
> > >> On Jul 23, 2015, at 7:12 AM, Mohamed Faizan Momin <
> > mmomin9.student.gsu.edu> wrote:
> > >>
> > >> Hi all,
> > >>
> > >>
> > >> I have two GTX Titan Xs paired with an i7 5930K @ 3.5 GHz processor
> > >> in an ASUS Rampage V motherboard with 16GB of 2133 MHz DDR4 RAM. I'm
> > >> running a relatively small ~15K atom system and doing a normal MD
> > >> simulation with dt=0.002. My production file setup saves files every
> > >> 100ns. I get an average of 275ns/day on the system, but for some
> > >> reason the Titan Xs slow down to a mere 100ns/day after two runs, or
> > >> 200ns. This happens exactly after the 2nd run is completed. I have to
> > >> stop the current job and start it up again to continue onward. The
> > >> err.log is empty and the temperatures are not an issue, as I have the
> > >> fans running at ~50%, which keeps both GPUs under 65C. Any
> > >> suggestions?
> > >>
> > >>
> > >> --
> > >> Mohamed Faizan Momin
> > >> _______________________________________________
> > >> AMBER mailing list
> > >> AMBER.ambermd.org
> > >> http://lists.ambermd.org/mailman/listinfo/amber
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jul 23 2015 - 09:00:03 PDT