Re: [AMBER] problem of GTX470 running pmemd.cuda_DPDP/input file access provided from Sasha Buzko on 2010-09-08 (Amber Archive Sep 2010)

From: Sasha Buzko <obuzko.ucla.edu>
Date: Wed, 08 Sep 2010 14:17:52 -0700

Ross,
My system is a bit different from yours: CentOS 5.5, Intel Fortran
(10.1.012), CUDA toolkit 3.1 and Nvidia driver 256.40 (both downloaded
from the Nvidia developers page). Amber11 with bugfixes applied.
I started a JAC simulation with your parameters (except no netcdf
output, since I compiled without it). I will let you know how it goes.

Best,

Sasha

Ross Walker wrote:
> Hi Sasha,
>
> Thanks for your help on this. There is a lot of noise going on right now, which
> makes it really tough to actually debug things: SPDP vs DPDP, various different
> systems, etc. Thus it would be good if we can get some specific, concrete
> information to see what is going on.
>
> Here is what I am running right now.
>
> Amber 11 Vanilla copy with bugfixes 1 to 8 applied.
>
> Redhat 4.8 x86_64, gfortran 4.1.2-44, nvcc 3.1 v0.2.1221, NVIDIA Driver
> v256.44
>
> I have taken the JAC NPT benchmark from
> http://ambermd.org/gpus/AMBER11_GPU_Benchmarks.tar.bz2 and modified it to
> run 100,000,000 steps. The input file is below and the files I am using are
> attached to this email.
>
> &cntrl
> ntx=5, irest=1,
> ntc=2, ntf=2,
> nstlim=100000000,
> ntpr=1000, ntwx=1000,
> ntwr=10000,
> dt=0.002, cut=8.,
> ntt=1, tautp=10.0,
> temp0=300.0,
> ntb=2, ntp=1, taup=10.0,
> ioutfm=1,
> /
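
For reference, the run length implied by this input follows from nstlim and dt. The sketch below is a minimal worked calculation using only the values in the namelist above; the function name run_summary and the dictionary keys are hypothetical, and the counts ignore any extra final write at the end of a run.

```python
# Values copied from the &cntrl namelist above.
NSTLIM = 100_000_000   # number of MD steps
DT_PS = 0.002          # time step in picoseconds
NTPR = 1000            # energy print interval, in steps
NTWX = 1000            # trajectory write interval, in steps
NTWR = 10000           # restart write interval, in steps


def run_summary(nstlim, dt_ps, ntpr, ntwx, ntwr):
    """Derive total simulated time and output counts from the step settings."""
    return {
        "total_ns": nstlim * dt_ps / 1000.0,     # ps -> ns
        "energy_prints": nstlim // ntpr,
        "trajectory_frames": nstlim // ntwx,
        "restart_writes": nstlim // ntwr,
        "ps_between_restarts": ntwr * dt_ps,
    }


summary = run_summary(NSTLIM, DT_PS, NTPR, NTWX, NTWR)
for key, value in summary.items():
    print(f"{key}: {value}")
```

With these settings the full run corresponds to about 200 ns, with a restart file written roughly every 20 ps, so a crash should cost at most about 20 ps of simulated time.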
>
> I compiled amber11 with './configure -cuda gnu'
>
> I am currently running this on the following:
>
> 1) 8xE5462 MPI - Has so far completed 2.572ns without issue.
>
> 2) Tesla C1060 - Has so far completed 7.890ns without issue.
>
> 3) Tesla C2050 - Has so far completed 14.664ns without issue.
>
> 4) GTX295 - Has so far completed 7.552ns without issue.
>
> Could you try running this exact same simulation on your GTX480 / 470 with
> the same toolkit and drivers if possible and see what happens? This way we
> will have a consistent set of data we can look at rather than 100 different
> theories.
>
> Thanks,
>
> All the best
> Ross
>
>
>> -----Original Message-----
>> From: Sasha Buzko [mailto:obuzko.ucla.edu]
>> Sent: Wednesday, September 08, 2010 1:43 PM
>> To: AMBER Mailing List
>> Subject: Re: [AMBER] problem of GTX470 running pmemd.cuda_DPDP/input
>> file access provided
>>
>> Scott,
>> in my experience, the errors ALWAYS came at different times in the
>> simulation. I even wrote up a wrapper script that would run 1 ns chunks,
>> catch these errors and restart the failed simulation until it worked.
>> This way I could squeeze through a decent number of ns until the whole
>> thing froze (no output printed, but 100% load, as reported by other
>> people as well).
>> These observations, among others, have led me to believe that the
>> problem is outside of the pmemd.cuda port and is either hardware or
>> driver related.
>>
>> Sasha
>>
>>
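
The wrapper approach described above (run in chunks, retry a chunk when it fails) might look roughly like the sketch below. This is not Sasha's actual script. The file names, the chunk naming scheme and the helper names (pmemd_command, run_once, run_chunks) are hypothetical, and it assumes that a failed run exits with a nonzero status, which may not hold for every failure mode. The optional timeout is one possible way to catch a run that hangs without printing output.

```python
import subprocess


def pmemd_command(chunk, prmtop="prmtop", mdin="mdin"):
    """Build the command line for one chunk (all file names are hypothetical).

    Chunk 0 starts from 'inpcrd'; later chunks start from the restart file
    written by the previous chunk.
    """
    start = "inpcrd" if chunk == 0 else f"chunk{chunk - 1:03d}.rst"
    tag = f"chunk{chunk:03d}"
    return ["pmemd.cuda", "-O", "-i", mdin, "-p", prmtop, "-c", start,
            "-o", f"{tag}.out", "-r", f"{tag}.rst", "-x", f"{tag}.mdcrd"]


def run_once(cmd, timeout_s=None):
    """Return True if the command exits with status 0 within the timeout."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        # Treat a hang (timeout) or a missing executable as a failed attempt.
        return False


def run_chunks(n_chunks, max_retries=5, runner=run_once, make_cmd=pmemd_command):
    """Run chunks in order, rerunning a failed chunk from the same start file.

    Returns the number of attempts each chunk needed. Raises RuntimeError if
    a chunk still fails after max_retries attempts.
    """
    attempts = []
    for chunk in range(n_chunks):
        for attempt in range(1, max_retries + 1):
            if runner(make_cmd(chunk)):
                attempts.append(attempt)
                break
        else:
            raise RuntimeError(f"chunk {chunk} failed {max_retries} times")
    return attempts
```

A call such as run_chunks(10) would attempt ten consecutive chunks. The length of each chunk is set by nstlim in the mdin file, for example 500,000 steps at dt=0.002 for 1 ns.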
>> Scott Le Grand wrote:
>>
>>> Running full double-precision changes the balance of computation and
>>> memory access. This could have the effect of cooling the chip.
>>>
>>> Running NPT versus NVT also traverses different code paths. This
>>> could also have the effect of cooling the chip.
>>>
>>> But the big question is: if you run the same simulation twice, does
>>> it crash on exactly the same iteration? This is *the* *biggest*
>>> question. If it does, then this is a code issue. If not, then it's
>>> something else outside of the pmemd.cuda application(s). These
>>> simulations are deterministic. Two independent runs on the same
>>> hardware configuration with the same input files and command line should
>>> produce the *same* output.
>>
>>> Scott
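
One way to carry out the repeat-run check described above is to compare the energy output of the two runs and report where they first differ. The sketch below is a minimal illustration under assumptions: it only looks at lines containing the strings "NSTEP" or "Etot" (so that timestamps and timing sections do not trigger false differences), and the function names energy_lines and first_divergence are hypothetical.

```python
def energy_lines(text, keys=("NSTEP", "Etot")):
    """Keep only the lines that contain one of the given keys, stripped."""
    return [ln.strip() for ln in text.splitlines() if any(k in ln for k in keys)]


def first_divergence(text_a, text_b):
    """Return (index, line_a, line_b) for the first differing energy line.

    Returns None when the filtered lines of both runs are identical. If one
    run simply stops earlier, the missing side is reported as None.
    """
    a, b = energy_lines(text_a), energy_lines(text_b)
    for i, (line_a, line_b) in enumerate(zip(a, b)):
        if line_a != line_b:
            return i, line_a, line_b
    if len(a) != len(b):
        i = min(len(a), len(b))
        return i, (a[i] if i < len(a) else None), (b[i] if i < len(b) else None)
    return None


def compare_files(path_a, path_b):
    """Read two output files and report their first divergence, if any."""
    with open(path_a) as fa, open(path_b) as fb:
        return first_divergence(fa.read(), fb.read())
```

For example, compare_files("run1/mdout", "run2/mdout") (hypothetical paths) returning None would indicate that the two runs printed identical energies. A result where one side is None would show that one run stopped earlier than the other.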
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Sergio R Aragon [mailto:aragons.sfsu.edu]
>>> Sent: Wednesday, September 08, 2010 11:35
>>> To: AMBER Mailing List
>>> Cc: Duncan Poole
>>> Subject: [AMBER] problem of GTX470 running pmemd.cuda_DPDP/input file
>>> access provided
>>
>>> Hello Ross,
>>>
>>> The job that I wrote to you about, 1faj, just failed with the DPDP
>>> program on my 470 card after accumulating 2.3 ns in the NVT ensemble. The
>>> error messages captured were the following (a little different from
>>> previous failures):
>>>
>>> Error: the launch timed out and was terminated launching kernel kPMEGetGridWeights
>>>
>>> Error: the launch timed out and was terminated launching kernel kCalculatePMENonbondForces
>>>
>>> A second kernel time out occurred in addition to the usual one. The
>>> DPDP model allowed the system to run a bit more before crashing. It
>>> would be very nice if you could try this system on your C2050 card.
>>> This 1faj system is also running on an 8 processor machine under Amber
>>> 10 and has accumulated 3.66 ns so far under NPT. The density is around
>>> 1.07 in both the Amber 10 run and the Cuda_DPDP run (determined by a 1 ns
>>> NPT simulation before starting NVT), at 300K. As I mentioned before,
>>> this is a 6 subunit protein, inorganic pyrophosphatase. This system
>>> has 65,000 atoms.
>>>
>>> An even better system to try to reproduce the error on is 1cts,
>>> citrate synthase. This is only a dimeric protein whose file is too big
>>> to run under the cuda DPDP program on my 470 card (malloc error). I am
>>> running it on Amber 10 and it has accumulated 20.1 ns under NPT.
>>> Under pmemd.cuda, it crashes with the usual kernel time out error (#1
>>> above) in the first ns of NVT md. The density of this system is 1.04
>>> under Amber 10 NPT, and under pmemd.cuda (determined with 1 ns of NPT
>>> before starting NVT), at 300K. This system has 79,000 atoms.
>>>
>>> I don't know what systems Sasha Buzko is running, but they appear to
>>> be smaller than mine. We are trying the 1faj system at SFSU with a GTX
>>> 240 card in the default SPDP model. I'm afraid that card does not have
>>> enough memory to run this system - we'll find out soon.
>>>
>>> I have made an account on my system for you to log in; data is
>>> provided off list. Thanks!
>>
>>> Sergio
>>>
>>> Sergio Aragon
>>> Professor of Chemistry
>>> SfSU
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Ross Walker [mailto:ross.rosswalker.co.uk]
>>> Sent: Monday, September 06, 2010 5:54 PM
>>> To: 'AMBER Mailing List'
>>> Cc: 'Duncan Poole'
>>> Subject: Re: [AMBER] problem of GTX480 running pmemd.cuda
>>>
>>> Hi All,
>>>
>>> Can we please get a very simple example of the input and output that is
>>> effectively 'guaranteed' to produce this problem. I would like to start by
>>> confirming for sure that this works fine on GTX295, C1060 and C2050. Once
>>> this is confirmed we will know that it is something related specifically to
>>> GTX480 / 470. Unfortunately I do not have any GTX480's so cannot reproduce
>>> things myself. I want to make sure though that it definitely does not occur
>>> on other hardware.
>>>
>>> All the best
>>> Ross
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: Sasha Buzko [mailto:obuzko.ucla.edu]
>>>> Sent: Monday, September 06, 2010 2:21 PM
>>>> To: AMBER Mailing List
>>>> Subject: Re: [AMBER] problem of GTX480 running pmemd.cuda
>>>>
>>>> Hi Yi,
>>>> yes, this issue does happen to other people, and we are in the process
>>>> of figuring out why these things happen on consumer cards and don't
>>>> happen on Tesla. As far as I know, there is no clear solution to this
>>>> yet, although maybe Ross and Scott could make some suggestions.
>>>>
>>>> As a side note, have you seen any simulation failures with "the launch
>>>> timed out" error? Also, what are your card/CUDA driver versions?
>>>>
>>>> Thanks
>>>>
>>>> Sasha
>>>>
>>>>
>>>> Yi Xue wrote:
>>>>
>>>>
>>>>> Dear Amber users,
>>>>>
>>>>> I've been running pmemd.cuda on GTX480 for two months (implicit solvent
>>>>> simulation). Occasionally, the program would get stuck: the process is
>>>>> running ok when typing "top"; output file "md.out" just prints out energy
>>>>> terms at some time point but does not get updated any more; temperature of
>>>>> GPU will decrease by ~13C, but it is still higher than the idle temperature
>>>>> by ~25C. After I restart the current trajectory, the problem would be gone
>>>>> in most cases.
>>>>>
>>>>> It seems like in that case the job cannot be submitted to (or executed in)
>>>>> GPU unit. I'm wondering if this issue also happens to other people...
>>>>>
>>>>> Thanks for any response.
>>>>> _______________________________________________
>>>>> AMBER mailing list
>>>>> AMBER.ambermd.org
>>>>> http://lists.ambermd.org/mailman/listinfo/amber
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Sep 08 2010 - 14:30:13 PDT