Re: [AMBER] GTX Titan Xs slowing down after 200ns

From: Ross Walker <ross.rosswalker.co.uk>
Date: Thu, 23 Jul 2015 07:51:31 -0700

Hi Mohamed,

Very very weird. A couple of things to try:

1) If you run the single GPU code rather than the MPI code, does the same thing happen?

2) Try using MPICH 3 rather than Open MPI and see if the same problem occurs. It's possible there is a memory leak in Open MPI that is causing an issue - or causing P2P to stop working - or some other weirdness.
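
To compare the two tests above against the original runs, it may help to pull the throughput figure out of each segment's output file. The following is only a rough sketch, assuming the timing section of each production.*.out file contains lines of the form "ns/day = <number>" and that the last such line is the average over all steps; the helper names (parse_ns_per_day, slow_segments, collect) and the 0.6 cutoff are made up for illustration and are not part of AMBER.

```python
import re
from pathlib import Path

# Assumed format: timing lines such as
# "|         ns/day =     275.10   seconds/ns =     314.07"
NS_PER_DAY = re.compile(r"ns/day\s*=\s*([0-9]+(?:\.[0-9]+)?)")


def parse_ns_per_day(text):
    """Return the last ns/day figure found in the text, or None if absent."""
    matches = NS_PER_DAY.findall(text)
    return float(matches[-1]) if matches else None


def slow_segments(rates, fraction=0.6):
    """Return the names of segments running below `fraction` of the fastest one."""
    known = {name: rate for name, rate in rates.items() if rate is not None}
    if not known:
        return []
    best = max(known.values())
    return [name for name, rate in known.items() if rate < fraction * best]


def collect(directory=".", pattern="production.*.out"):
    """Map each matching output file name to its parsed ns/day (or None)."""
    rates = {}
    for path in sorted(Path(directory).glob(pattern)):
        rates[path.name] = parse_ns_per_day(path.read_text(errors="replace"))
    return rates


if __name__ == "__main__":
    rates = collect()
    for name, rate in rates.items():
        print(name, rate)
    print("slow segments:", slow_segments(rates))
```

Running this in the directory holding the output files, once for the MPI runs and once for the single GPU runs, should show whether the drop appears at the same segment in both cases.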

All the best
Ross

> On Jul 23, 2015, at 7:47 AM, Mohamed Faizan Momin <mmomin9.student.gsu.edu> wrote:
>
> Hi Ross,
>
> The production file stays the same throughout the entire run since I'm wanting to run for a full microsecond, and no other jobs are being fired up on this machine as I'm the only one with access to it. I'm running this over a full microsecond continuously using this script:
>
> #!/bin/csh
>
> setenv CUDA_HOME /usr/local/cuda-6.5
> setenv LD_LIBRARY_PATH "/usr/local/cuda-6.5/lib64:/software/openmpi.1.8.1/lib:${LD_LIBRARY_PATH}"
> setenv PATH "/usr/local/cuda-6.5/bin:${PATH}"
> setenv CUDA_VISIBLE_DEVICES "0,1"
>
> set prv=M
>
> foreach cur (N O P Q R S T U V W X Y Z)
>
> /software/openmpi.1.8.1/bin/mpirun -v -np 2 pmemd.cuda.MPI -O -i production.in -p ../protein.prmtop -c production.$prv.restrt -o production.$cur.out -r production.$cur.restrt -x production.$cur.mdcrd
>
> set prv=$cur
> end
>
> --
> Mohamed Faizan Momin
>
> ________________________________________
> From: Ross Walker <ross.rosswalker.co.uk>
> Sent: Thursday, July 23, 2015 10:41 AM
> To: AMBER Mailing List
> Subject: Re: [AMBER] GTX Titan Xs slowing down after 200ns
>
> Hi Mohamed,
>
> My first thought here was temperature throttling, but when you say it always happens at the same point that hypothesis goes out the window. I've never seen this behavior before and am not even sure how to speculate on what might be causing it. First off, given you say the performance is half, are you certain it is not related to your input files in some way? Is there any difference between them - are you suddenly dropping the time step to 1fs from 2fs? Is anything else changed - do you change the ensemble or the barostat?
>
> My guess is it has to be something related to your simulation settings rather than the machine or GPUs since it happens when you start the next simulation. The other possibility is somehow multiple runs are being fired up on the same GPU. E.g. I could envision forgetting to set CUDA_VISIBLE_DEVICES again after the first run on GPU 0 completes and so ending up with the second run that was supposed to go on GPU 0 ending up on GPU 1 where another job is already running. Look for things like this in your scripts / by watching with nvidia-smi etc.
>
> All the best
> Ross
>
>> On Jul 23, 2015, at 7:12 AM, Mohamed Faizan Momin <mmomin9.student.gsu.edu> wrote:
>>
>> Hi all,
>>
>>
>> I have two GTX Titan Xs paired with an i7 5930K at 3.5 GHz in an ASUS Rampage V motherboard with 16GB of 2133 MHz DDR4 RAM. I'm running a relatively small ~15K atom system and doing normal MD simulation with dt=0.002. My production file setup saves files every 100ns. I get an average of 275ns/day on the system, but for some reason the Titan Xs slow down to a mere 100ns/day after two runs, or 200ns. This happens exactly after the 2nd run is completed. I have to stop the current job and start it up again to continue onward. The err.log is empty and the temperatures are not an issue, as I have the fans running at ~50%, which keeps both GPUs under 65 °C. Any suggestions?
>>
>>
>> --
>> Mohamed Faizan Momin
>> _______________________________________________
>> AMBER mailing list
>> AMBER.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber

Received on Thu Jul 23 2015 - 08:00:04 PDT