Re: [AMBER] GTX Titan Xs slowing down after 200ns

From: Mohamed Faizan Momin <mmomin9.student.gsu.edu>
Date: Thu, 23 Jul 2015 14:47:34 +0000

Hi Ross,

The production input file stays the same throughout the entire run since I want to run for a full microsecond, and no other jobs are being fired up on this machine because I'm the only one with access to it. I'm chaining the segments continuously with this script:

#!/bin/csh

setenv CUDA_HOME /usr/local/cuda-6.5
setenv LD_LIBRARY_PATH "/usr/local/cuda-6.5/lib64:/software/openmpi.1.8.1/lib:${LD_LIBRARY_PATH}"
setenv PATH "/usr/local/cuda-6.5/bin:${PATH}"
setenv CUDA_VISIBLE_DEVICES "0,1"

# Restart from the last completed segment (M) and chain the rest.
set prv=M

foreach cur (N O P Q R S T U V W X Y Z)

  # Each segment restarts from the previous segment's restart file.
  /software/openmpi.1.8.1/bin/mpirun -v -np 2 pmemd.cuda.MPI -O -i production.in -p ../protein.prmtop -c production.$prv.restrt -o production.$cur.out -r production.$cur.restrt -x production.$cur.mdcrd

  set prv=$cur
end
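A sketch of the same chain in POSIX sh with an exit-status check, so a failed segment never silently feeds a stale restart file into the next run (the run_segment function is a placeholder standing in for the mpirun command above):

```shell
#!/bin/sh
# Sketch: abort the chain if a segment fails, so the next segment
# never starts from a missing or stale restart file.
run_segment() {
    # Placeholder for the real command, e.g.:
    # mpirun -np 2 pmemd.cuda.MPI -O -i production.in -p ../protein.prmtop \
    #   -c production.$prv.restrt -o production.$cur.out \
    #   -r production.$cur.restrt -x production.$cur.mdcrd
    true
}

prv=M
for cur in N O P Q R S T U V W X Y Z; do
    if ! run_segment; then
        echo "segment $cur failed; stopping chain" >&2
        exit 1
    fi
    prv=$cur
done
echo "chain complete: last restart is production.$prv.restrt"
```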

--
Mohamed Faizan Momin
________________________________________
From: Ross Walker <ross.rosswalker.co.uk>
Sent: Thursday, July 23, 2015 10:41 AM
To: AMBER Mailing List
Subject: Re: [AMBER] GTX Titan Xs slowing down after 200ns
Hi Mohamed,
My first thought here was temperature throttling, but when you say it always happens at the same point, that hypothesis goes out the window. I've never seen this behavior before and am not even sure how to speculate on what might be causing it. First off, given that you say the performance is halved, are you certain it is not related to your input files in some way? Is there any difference between them? Are you suddenly dropping the time step from 2 fs to 1 fs? Has anything else changed, such as the ensemble or the barostat?
My guess is it has to be something related to your simulation settings rather than the machine or GPUs, since it happens when you start the next simulation. The other possibility is that multiple runs are somehow being fired up on the same GPU. For example, I could envision forgetting to set CUDA_VISIBLE_DEVICES again after the first run on GPU 0 completes, so the second run that was supposed to go on GPU 0 ends up on GPU 1, where another job is already running. Look for things like this in your scripts and by watching with nvidia-smi, etc.
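One hedged sketch of that nvidia-smi check (the query fields and the sample output below are assumptions; verify them with `nvidia-smi --help-query-compute-apps` on your machine): pipe `nvidia-smi --query-compute-apps=gpu_uuid,pid --format=csv,noheader` into awk and flag any GPU UUID that appears more than once.

```shell
# Sketch: detect doubled-up jobs by counting compute processes per GPU.
# On a live system "sample" would come from:
#   nvidia-smi --query-compute-apps=gpu_uuid,pid --format=csv,noheader
# (UUIDs and PIDs below are illustrative, not real output.)
sample="GPU-aaaa, 1234
GPU-aaaa, 5678
GPU-bbbb, 9012"

doubled=$(echo "$sample" | awk -F', ' '{n[$1]++} END {for (g in n) if (n[g] > 1) print g}')
echo "GPUs running more than one process: $doubled"
```

If two pmemd.cuda.MPI PIDs ever share one UUID mid-chain, that would explain a roughly halved throughput.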
All the best
Ross
> On Jul 23, 2015, at 7:12 AM, Mohamed Faizan Momin <mmomin9.student.gsu.edu> wrote:
>
> Hi all,
>
>
> I have two GTX Titan Xs paired with an i7-5930K 3.5 GHz processor in an ASUS Rampage V motherboard with 16 GB of 2133 MHz DDR4 RAM. I'm running a relatively small ~15K-atom system and doing normal MD simulation with dt=0.002. My production setup saves files every 100 ns. I get an average of 275 ns/day on the system, but for some reason the Titan Xs slow down to a mere 100 ns/day after two runs, i.e. 200 ns. This happens exactly after the 2nd run is completed, and I have to stop the current job and start it up again to continue. The err.log is empty, and temperatures are not an issue: I have the fans running at ~50%, which keeps both GPUs under 65 C. Any suggestions?
>
>
> --
> Mohamed Faizan Momin
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jul 23 2015 - 08:00:03 PDT