Hi Scott,
This is my nvidia-smi output. I'm running the latest driver (352.21), and I haven't yet tried the run on just one GPU.
+------------------------------------------------------+
| NVIDIA-SMI 352.21     Driver Version: 352.21         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...   On  | 0000:01:00.0     Off |                  N/A |
| 60%   65C    P2   190W / 250W |    392MiB / 12287MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...   On  | 0000:02:00.0     Off |                  N/A |
| 60%   55C    P2   156W / 250W |    361MiB / 12287MiB |     84%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      3176     C  pmemd.cuda.MPI                                 110MiB |
|    0      3177     C  pmemd.cuda.MPI                                 255MiB |
|    1      3176     C  pmemd.cuda.MPI                                 223MiB |
|    1      3177     C  pmemd.cuda.MPI                                 110MiB |
+-----------------------------------------------------------------------------+
--
Mohamed Faizan Momin
________________________________________
From: Scott Le Grand <varelse2005.gmail.com>
Sent: Thursday, July 23, 2015 11:12 AM
To: AMBER Mailing List
Subject: Re: [AMBER] GTX Titan Xs slowing down after 200ns
1. What display driver? If <346.82, upgrade.
2. Do single GPU runs show the same behavior?
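For the single-GPU check, something like this works (a sketch - input and
restart names are borrowed from the script quoted below, the single_gpu.*
output names are just placeholders):

    setenv CUDA_VISIBLE_DEVICES 0
    pmemd.cuda -O -i production.in -p ../protein.prmtop \
        -c production.M.restrt -o single_gpu.out \
        -r single_gpu.restrt -x single_gpu.mdcrd

If that runs a full segment at full speed, the problem is in the MPI/P2P
layer rather than the cards themselves.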
On Thu, Jul 23, 2015 at 7:51 AM, Ross Walker <ross.rosswalker.co.uk> wrote:
> Hi Mohamed,
>
> Very very weird. A couple of things to try:
>
> 1) If you run the single GPU code rather than the MPI code, does the same
> thing happen?
>
> 2) Try using mpich3 rather than openMPI and see if the same problem
> occurs. It's possible there is a memory leak in openMPI that is causing an
> issue - or causing P2P to stop working - or some other weirdness.
>
> All the best
> Ross
>
> > On Jul 23, 2015, at 7:47 AM, Mohamed Faizan Momin <mmomin9.student.gsu.edu> wrote:
> >
> > Hi Ross,
> >
> > The production input file stays the same throughout, since I'm running a
> > single continuous microsecond, and no other jobs get fired up on this
> > machine - I'm the only one with access to it. I run the full microsecond
> > continuously using this script:
> >
> > #!/bin/csh
> >
> > # CUDA 6.5 + OpenMPI 1.8.1 environment
> > setenv CUDA_HOME /usr/local/cuda-6.5
> > setenv LD_LIBRARY_PATH "/usr/local/cuda-6.5/lib64:/software/openmpi.1.8.1/lib:${LD_LIBRARY_PATH}"
> > setenv PATH "/usr/local/cuda-6.5/bin:${PATH}"
> > setenv CUDA_VISIBLE_DEVICES "0,1"
> >
> > # each segment restarts from the previous segment's restrt file
> > set prv=M
> >
> > foreach cur (N O P Q R S T U V W X Y Z)
> >     /software/openmpi.1.8.1/bin/mpirun -v -np 2 pmemd.cuda.MPI -O \
> >         -i production.in -p ../protein.prmtop \
> >         -c production.$prv.restrt -o production.$cur.out \
> >         -r production.$cur.restrt -x production.$cur.mdcrd
> >     set prv=$cur
> > end
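> >
> > (I may add a bail-out right after the mpirun line, so a failed segment
> > can't feed a stale restart into the next one - a sketch, relying on csh's
> > $status holding mpirun's exit code:
> >
> >     if ($status != 0) then
> >         echo "segment $cur failed, stopping"
> >         exit 1
> >     endif
> >
> > though so far every segment has completed, just slowly.)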
> >
> > --
> > Mohamed Faizan Momin
> >
> > ________________________________________
> > From: Ross Walker <ross.rosswalker.co.uk>
> > Sent: Thursday, July 23, 2015 10:41 AM
> > To: AMBER Mailing List
> > Subject: Re: [AMBER] GTX Titan Xs slowing down after 200ns
> >
> > Hi Mohamed,
> >
> > My first thought here was temperature throttling, but since you say it
> > always happens at the same point, that hypothesis goes out the window. I've
> > never seen this behavior before and am not even sure how to speculate on
> > what might be causing it. First off, given that you say the performance is
> > halved, are you certain it is not related to your input files in some way?
> > Is there any difference between them - are you suddenly dropping the time
> > step from 2 fs to 1 fs? Has anything else changed - the ensemble or the
> > barostat?
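> >
> > (One way to settle the throttling question for good is to log the SM
> > clocks alongside temperature and power while the job runs, e.g.
> >
> >     nvidia-smi --query-gpu=timestamp,clocks.sm,temperature.gpu,power.draw \
> >         --format=csv -l 10 >> gpu_clocks.csv
> >
> > If clocks.sm stays flat when the slowdown hits, the hardware isn't
> > throttling. Those query fields are standard nvidia-smi ones, but run
> > 'nvidia-smi --help-query-gpu' to see what your driver supports.)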
> >
> > My guess is it has to be something related to your simulation settings
> > rather than the machine or GPUs, since it happens when you start the next
> > simulation. The other possibility is that somehow multiple runs are being
> > fired up on the same GPU. E.g. I could envision forgetting to set
> > CUDA_VISIBLE_DEVICES again after the first run on GPU 0 completes, so that
> > the second run, which was supposed to go on GPU 0, ends up on GPU 1 where
> > another job is already running. Look for things like this in your scripts
> > and by watching with nvidia-smi etc.
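> >
> > (If you do split the work into two single-GPU jobs, pinning each one
> > explicitly avoids that failure mode entirely - a sketch, with the run1/run2
> > file names as placeholders:
> >
> >     setenv CUDA_VISIBLE_DEVICES 0
> >     pmemd.cuda -O -i run1.in -p run1.prmtop -c run1.inpcrd -o run1.out &
> >     setenv CUDA_VISIBLE_DEVICES 1
> >     pmemd.cuda -O -i run2.in -p run2.prmtop -c run2.inpcrd -o run2.out &
> >
> > Each backgrounded job inherits the environment it was launched with, so
> > the second setenv can't move the first job.)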
> >
> > All the best
> > Ross
> >
> >> On Jul 23, 2015, at 7:12 AM, Mohamed Faizan Momin <mmomin9.student.gsu.edu> wrote:
> >>
> >> Hi all,
> >>
> >>
> >> I have two GTX Titan Xs paired with an i7 5930K 3.5 GHz processor on an
> >> ASUS Rampage V motherboard with 16 GB of 2133 MHz DDR4 RAM. I'm running a
> >> relatively small ~15K atom system, doing a normal MD simulation with
> >> dt=0.002 (2 fs). My production setup writes out files every 100 ns. I get
> >> an average of 275 ns/day on this system, but for some reason the Titan Xs
> >> slow down to a mere 100 ns/day after two runs, i.e. 200 ns. This happens
> >> exactly after the 2nd run completes. I have to stop the current job and
> >> start it up again to continue onward. The err.log is empty, and
> >> temperature is not an issue: with the fans running at ~50%, both GPUs stay
> >> under 65C. Any suggestions?
> >>
> >>
> >> --
> >> Mohamed Faizan Momin
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jul 23 2015 - 08:30:03 PDT