Re: [AMBER] GTX Titan Xs slowing down after 200ns

From: Ross Walker <ross.rosswalker.co.uk>
Date: Thu, 23 Jul 2015 09:05:01 -0700

Scott - I think this is a red herring. When running with P2P enabled, nvidia-smi always shows 2 processes per GPU; I've never tracked down why.
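If you want to double-check what is actually running on each card over time, something like the following should work (assuming your nvidia-smi supports the query flags, which 352.21 should); it logs the compute processes every 10 seconds:

nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv -l 10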

Mohamed, I would strongly advise you to switch to mpich3 and see if that fixes your problems. I've had a number of people report problems with OpenMPI and P2P - yours is undoubtedly a new one, but I would not be surprised if that is the problem.

Just download mpich-3.1.4 (or whatever the latest version is) and then:

edit your bashrc to set

export MPI_HOME=~/mpich-3.1.4
export PATH=$MPI_HOME/bin:$PATH

then

source ~/.bashrc
cd ~/
tar xvzf mpich-3.1.4.tar.gz
mv mpich-3.1.4 mpich-3.1.4_source
cd mpich-3.1.4_source
export FC=gfortran
export CC=gcc
export CXX=g++

./configure --prefix=$MPI_HOME
make
make install

Then clean AMBER and rebuild it with the new MPI and hopefully you will be good.
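For reference, the rebuild is roughly the following on an AMBER 14-era tree - adjust for your version, and note this assumes $AMBERHOME is set and that the new MPI's bin directory is first in your PATH so configure picks up the right mpicc/mpif90:

cd $AMBERHOME
make clean
./configure -cuda -mpi gnu
make install

(-cuda -mpi is the combination that builds pmemd.cuda.MPI.)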

All the best
Ross

> On Jul 23, 2015, at 8:46 AM, Mohamed Faizan Momin <mmomin9.student.gsu.edu> wrote:
>
> My normal way to run the script is nohup ./jobp.in >& err.log &, and ever
> since I installed the GPUs I've always received this output for my MD runs.
> How would I go about fixing it?
>
> --
> Mohamed Faizan Momin
>
> ________________________________________
> From: Scott Le Grand <varelse2005.gmail.com>
> Sent: Thursday, July 23, 2015 11:39 AM
> To: AMBER Mailing List
> Subject: Re: [AMBER] GTX Titan Xs slowing down after 200ns
>
> So um, why are you running 2 GPU processes per GPU? That would explain a
> lot...
>
> On Thu, Jul 23, 2015 at 8:26 AM, Mohamed Faizan Momin <mmomin9.student.gsu.edu> wrote:
>
>> Hi Scott,
>>
>> This is my nvidia-smi output. I'm running the latest version of the
>> driver, and I've yet to try it on just one GPU.
>>
>> +------------------------------------------------------+
>> | NVIDIA-SMI 352.21     Driver Version: 352.21         |
>> |-------------------------------+----------------------+----------------------+
>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>> |===============================+======================+======================|
>> |   0  GeForce GTX TIT...    On | 0000:01:00.0     Off |                  N/A |
>> | 60%   65C    P2   190W / 250W |    392MiB / 12287MiB |     95%      Default |
>> +-------------------------------+----------------------+----------------------+
>> |   1  GeForce GTX TIT...    On | 0000:02:00.0     Off |                  N/A |
>> | 60%   55C    P2   156W / 250W |    361MiB / 12287MiB |     84%      Default |
>> +-------------------------------+----------------------+----------------------+
>>
>> +-----------------------------------------------------------------------------+
>> | Processes:                                                       GPU Memory |
>> |  GPU       PID  Type  Process name                                    Usage |
>> |=============================================================================|
>> |    0      3176     C  pmemd.cuda.MPI                                 110MiB |
>> |    0      3177     C  pmemd.cuda.MPI                                 255MiB |
>> |    1      3176     C  pmemd.cuda.MPI                                 223MiB |
>> |    1      3177     C  pmemd.cuda.MPI                                 110MiB |
>> +-----------------------------------------------------------------------------+
>>
>>
>> --
>> Mohamed Faizan Momin
>>
>> ________________________________________
>> From: Scott Le Grand <varelse2005.gmail.com>
>> Sent: Thursday, July 23, 2015 11:12 AM
>> To: AMBER Mailing List
>> Subject: Re: [AMBER] GTX Titan Xs slowing down after 200ns
>>
>> 1. What display driver? If <346.82, upgrade.
>>
>> 2. Do single GPU runs show the same behavior?
>>
>> On Thu, Jul 23, 2015 at 7:51 AM, Ross Walker <ross.rosswalker.co.uk> wrote:
>>
>>> Hi Mohamed,
>>>
>>> Very very weird. A couple of things to try:
>>>
>>> 1) If you run the single GPU code rather than the MPI code, does the same
>>> thing happen?
>>>
>>> 2) Try using mpich3 rather than openMPI and see if the same problem
>>> occurs. It's possible there is a memory leak in openMPI that is causing
>>> an issue - or causing P2P to stop working - or some other weirdness.
>>>
>>> All the best
>>> Ross
>>>
>>>> On Jul 23, 2015, at 7:47 AM, Mohamed Faizan Momin <mmomin9.student.gsu.edu> wrote:
>>>>
>>>> Hi Ross,
>>>>
>>>> The production file stays the same throughout the entire run since I
>>>> want to run for a full microsecond, and no other jobs are being fired up
>>>> on this machine as I'm the only one with access to it. I'm running it
>>>> continuously using this script:
>>>>
>>>> #!/bin/csh
>>>>
>>>> setenv CUDA_HOME /usr/local/cuda-6.5
>>>> setenv LD_LIBRARY_PATH "/usr/local/cuda-6.5/lib64:/software/openmpi.1.8.1/lib:${LD_LIBRARY_PATH}"
>>>> setenv PATH "/usr/local/cuda-6.5/bin:${PATH}"
>>>> setenv CUDA_VISIBLE_DEVICES "0,1"
>>>>
>>>> set prv=M
>>>>
>>>> foreach cur (N O P Q R S T U V W X Y Z)
>>>>
>>>> /software/openmpi.1.8.1/bin/mpirun -v -np 2 pmemd.cuda.MPI -O -i production.in -p ../protein.prmtop -c production.$prv.restrt -o production.$cur.out -r production.$cur.restrt -x production.$cur.mdcrd
>>>>
>>>> set prv=$cur
>>>> end
>>>>
>>>> --
>>>> Mohamed Faizan Momin
>>>>
>>>> ________________________________________
>>>> From: Ross Walker <ross.rosswalker.co.uk>
>>>> Sent: Thursday, July 23, 2015 10:41 AM
>>>> To: AMBER Mailing List
>>>> Subject: Re: [AMBER] GTX Titan Xs slowing down after 200ns
>>>>
>>>> Hi Mohamed,
>>>>
>>>> My first thought here was temperature throttling, but when you say it
>>>> always happens at the same point that hypothesis goes out the window.
>>>> I've never seen this behavior before and am not even sure how to
>>>> speculate on what might be causing it. First off, given that you say the
>>>> performance is halved, are you certain it is not related to your input
>>>> files in some way? Is there any difference between them - are you
>>>> suddenly dropping the time step from 2 fs to 1 fs? Has anything else
>>>> changed - do you change the ensemble or the barostat?
>>>>
>>>> My guess is it has to be something related to your simulation settings
>>>> rather than the machine or GPUs, since it happens when you start the next
>>>> simulation. The other possibility is that multiple runs are somehow being
>>>> fired up on the same GPU. E.g. I could envision forgetting to set
>>>> CUDA_VISIBLE_DEVICES again after the first run on GPU 0 completes, so the
>>>> second run that was supposed to go on GPU 0 ends up on GPU 1, where
>>>> another job is already running. Look for things like this in your scripts
>>>> and by watching with nvidia-smi etc.
>>>>
>>>> All the best
>>>> Ross
>>>>
>>>>> On Jul 23, 2015, at 7:12 AM, Mohamed Faizan Momin <mmomin9.student.gsu.edu> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>>
>>>>> I have two GTX Titan Xs paired with an i7-5930K 3.5 GHz processor in
>>>>> an ASUS Rampage V motherboard with 16GB 2133 MHz DDR4 RAM. I'm running
>>>>> a relatively small ~15K atom system and doing normal MD simulation with
>>>>> dt=0.002. My production setup saves files every 100 ns. I get an
>>>>> average of 275 ns/day on the system, but for some reason the Titan Xs
>>>>> slow down to a mere 100 ns/day after two runs, or 200 ns. This happens
>>>>> exactly after the 2nd run is completed. I have to stop the current job
>>>>> and start it up again to continue onward. The err.log is empty and the
>>>>> temperatures are not an issue, as I have the fans running at ~50%,
>>>>> which keeps both GPUs under 65C. Any suggestions?
>>>>>
>>>>>
>>>>> --
>>>>> Mohamed Faizan Momin


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jul 23 2015 - 09:30:02 PDT