Dear debug effort,
I need help with patching Amber, please see below.
I restarted my JAC job without changing anything, following Jason's suggestion. We want to see if the failure point is reproducible on this machine with this card.
Because there has been some question about temperatures, I am running the X server and the nvidia-settings tool this time. Since the job is small I don't expect any interference, but this will allow me to record temperatures. I will re-run the job without X again to eliminate any doubt about X server interference. The temperatures are 83 C GPU, 56 C board, 55% fan.

The job has already passed the 120 ps mark where it crashed last time. Thus, sorry to report, the iteration where the error occurs is not reproducible - this agrees with Sasha's report. I will report the state in which it crashes again. I have all the files produced - let me know who wants to see what. Then again, maybe the X server is interfering and causing the crash to move to a different point. I will run a third time without X, but I'll let this run go for a while first to see what happens.
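For reference, something along these lines can log the readings periodically while the job runs (a minimal sketch only, assuming DISPLAY points at the running X server so nvidia-settings can query the card; the GPUCoreTemp attribute name and the one-minute interval are just illustrative):

# Poll the card once a minute and append a timestamped core-temperature reading
while true; do
    temp=$(nvidia-settings -q '[gpu:0]/GPUCoreTemp' -t)
    echo "$(date '+%Y-%m-%d %H:%M:%S')  ${temp} C" >> gpu_temp.log
    sleep 60
done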
Please help: I ran patch with the new bugfix.all for Amber 11 to pick up patches 7 and 8, which I did not have. It appears that patch 4, which I had previously applied, did not get skipped, and I got a message saying:
Patching file src/pmemd/src/cuda/gpu.cpp
Hunk #4 FAILED at 2072
Hunk #5 FAILED at 2259
Hunk #6 FAILED at 2773
3 out of 6 hunks failed - saving rejects in ...
The rest of the patch job went fine: three more patches were skipped, and then the pbsa patch and a makefile patch were applied. It is curious that I had previously applied bugfix.all with patches 1-6, and this patch job reported a total of 6 patches skipped, yet in the middle the above message appeared.
At any rate, I have not recompiled Amber 11. I know I should have just applied patches 7 and 8 individually, but instead I got lazy and used the entire bugfix.all.
My question is: do I now undo this patch job, or do I ignore it and compile? Please suggest a course of action. Ultimately I'm trying to get as close to Ross's setup as I can to continue the debug process. BTW, I configure and compile exactly as Ross does. Thanks!
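In case it helps, this is roughly what I plan to check before recompiling (a rough sketch only - the -p0 level and the per-patch file names bugfix.7 and bugfix.8 are assumptions on my part, based on the individual fixes posted on ambermd.org, and I assume the rejects landed in a gpu.cpp.rej file next to the source):

cd $AMBERHOME
# Inspect what the failed hunks actually tried to change
cat src/pmemd/src/cuda/gpu.cpp.rej
# Dry runs of the individual fixes: -N refuses to re-apply an already-applied
# patch, and --dry-run reports what would happen without touching any files
patch -p0 -N --dry-run < bugfix.7
patch -p0 -N --dry-run < bugfix.8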
Sergio
-----Original Message-----
From: Jason Swails [mailto:jason.swails.gmail.com]
Sent: Wednesday, September 08, 2010 7:26 PM
To: AMBER Mailing List
Subject: Re: [AMBER] JAC test on GTX 470 error produced
Hello,
On Wed, Sep 8, 2010 at 9:17 PM, Sergio R Aragon <aragons.sfsu.edu> wrote:
> Hello Scott, Ross and Sasha,
>
> My JAC job just quit, after only 120 ps of run with the error:
>
Run the exact same thing again with the exact same inputs everywhere. Do
you get the same error in the same place? From the sounds of the thread,
reproducibility of the error is the name of the game here. If you do get
the same error in the same place, save the restrt file that occurs right
before that error and send it out with the applicable input files and such.
This bug isn't sounding like much fun to hunt down (ALMOST makes me glad I
don't have a 4xx card to test it out on :) ).
Thanks!
Jason
> Error: the launch timed out and was terminated launching kernel
> kPMEGetGridWeights.
>
> I ran the job using inpcrd, prmtop and the input file mdin provided by Ross
> w/o doing any energy minimization or constrained MD beforehand. That is, I
> went directly into production. Was that the intention?
>
> Here is the description of my system again for easy reference:
> MSI GTX 470
> Amber 11 Vanilla copy with bugfixes 1 to 6 applied.
> Redhat 4.8 x86_64, gfortran 4.1.2-44, nvcc 3.1 v0.2.1221, NVIDIA Driver
> v256.35
>
> I will now update Amber and the driver, and start over.
>
> Thanks, Sergio
>
> -----Original Message-----
> From: Scott Le Grand [mailto:SLeGrand.nvidia.com]
> Sent: Wednesday, September 08, 2010 3:42 PM
> To: AMBER Mailing List
> Subject: Re: [AMBER] JAC test on GTX 470 started
>
> Not really...
>
> They're all just general GPU and/or driver failures. Try not to read
> anything into it beyond that... To diagnose this specifically, we're going
> to need a relatively solid repro case and that's what we're all doing here.
>
> Scott
>
>
>
>
> -----Original Message-----
> From: Sergio R Aragon [mailto:aragons.sfsu.edu]
> Sent: Wednesday, September 08, 2010 15:23
> To: AMBER Mailing List
> Subject: [AMBER] JAC test on GTX 470 started
>
> Hello Ross and Sasha,
>
> I've started the job with your input files and mdin file on my GTX 470. My
> system is presently described as follows:
>
> Amber 11 Vanilla copy with bugfixes 1 to 6 applied.
>
> Redhat 4.8 x86_64, gfortran 4.1.2-44, nvcc 3.1 v0.2.1221, NVIDIA Driver
> v256.35
>
> This is very similar to Ross's system: my driver is a touch older, and I
> haven't applied Amber bugfixes 7 and 8. If I have a problem, I'll update the
> Amber bugfixes and report. I found a driver 256.53; I don't know how to get
> 256.44 - is that important?
>
> Other comment: we are testing a different error situation than the launch
> time out at the GetGrid kernel here - aren't we Sasha? This is an error
> condition in which there is no output but the simulation just keeps going.
> I have observed that when the X server was interfering, but not at other
> times. Nevertheless, in the effort to provide input from a 470 card, I'm
> also running this job.
>
> Sergio
>
> -----Original Message-----
> From: Ross Walker [mailto:ross.rosswalker.co.uk]
> Sent: Wednesday, September 08, 2010 2:02 PM
> To: 'AMBER Mailing List'
> Cc: 'Duncan Poole'
> Subject: Re: [AMBER] problem of GTX470 running pmemd.cuda_DPDP/input file
> access provided
>
> Hi Sasha,
>
> Thanks for your help on this. There is a lot of noise going on right now
> which makes it really tough to actually debug things: SPDP vs DPDP, various
> different systems, etc. Thus it would be good if we could get some specific,
> concrete information to see what is going on.
>
> Here is what I am running right now.
>
> Amber 11 Vanilla copy with bugfixes 1 to 8 applied.
>
> Redhat 4.8 x86_64, gfortran 4.1.2-44, nvcc 3.1 v0.2.1221, NVIDIA Driver
> v256.44
>
> I have taken the JAC NPT benchmark from
> http://ambermd.org/gpus/AMBER11_GPU_Benchmarks.tar.bz2 and modified it to
> run 100,000,000 steps. The input file is below and the files I am using are
> attached to this email.
>
> &cntrl
> ntx=5, irest=1,
> ntc=2, ntf=2,
> nstlim=100000000,
> ntpr=1000, ntwx=1000,
> ntwr=10000,
> dt=0.002, cut=8.,
> ntt=1, tautp=10.0,
> temp0=300.0,
> ntb=2, ntp=1, taup=10.0,
> ioutfm=1,
> /
>
> I compiled amber11 with './configure -cuda gnu'
>
> I am currently running this on the following:
>
> 1) 8xE5462 MPI - Has so far completed 2.572ns without issue.
>
> 2) Tesla C1060 - Has so far completed 7.890ns without issue.
>
> 3) Tesla C2050 - Has so far completed 14.664ns without issue.
>
> 4) GTX295 - Has so far completed 7.552ns without issue.
>
> Could you try running this exact same simulation on your GTX480 / 470 with
> the same toolkit and drivers if possible and see what happens? This way we
> will have a consistent set of data we can look at rather than 100 different
> theories.
>
> Thanks,
>
> All the best
> Ross
>
> > -----Original Message-----
> > From: Sasha Buzko [mailto:obuzko.ucla.edu]
> > Sent: Wednesday, September 08, 2010 1:43 PM
> > To: AMBER Mailing List
> > Subject: Re: [AMBER] problem of GTX470 running pmemd.cuda_DPDP/input
> > file access provided
> >
> > Scott,
> > in my experience, the errors ALWAYS came at different times in the
> > simulation. I even wrote up a wrapper script that would run 1 ns
> > chunks, catch these errors and restart the failed simulation until it
> > worked.
> > This way I could squeeze through a decent number of ns until the whole
> > thing froze (no output printed, but 100% load, as reported by other
> > people as well).
> > These observations, among others, have led me to believe that the
> > problem is outside of the pmemd.cuda port and is either hardware or
> > driver related.
> >
> > Sasha
> >
> >
> > Scott Le Grand wrote:
> > > Running full double-precision changes the balance of computation and
> > > memory access. This could have the effect of cooling the chip.
> > >
> > > Running NPT versus NVT also traverses different code paths. This
> > > could also have the effect of cooling the chip.
> > >
> > > But the big question is: if you run the same simulation twice, does
> > > it crash on exactly the same iteration? This is *the* *biggest*
> > > question. If it does, then this is a code issue. If not, then it's
> > > something else outside of the pmemd.cuda application(s). These
> > > simulations are deterministic. Two independent runs on the same
> > > hardware configuration with the same input files and command line
> > > should produce the *same* output.
> > >
> > > Scott
> > >
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Sergio R Aragon [mailto:aragons.sfsu.edu]
> > > Sent: Wednesday, September 08, 2010 11:35
> > > To: AMBER Mailing List
> > > Cc: Duncan Poole
> > > Subject: [AMBER] problem of GTX470 running pmemd.cuda_DPDP/input
> > > file
> > access provided
> > >
> > > Hello Ross,
> > >
> > > The job that I wrote to you about, 1faj, just failed with the DPDP
> > > program on my 470 card after accumulating 2.3 ns of NVT ensemble. The
> > > error messages captured were the following (a little different from
> > > previous failures):
> > >
> > > Error: the launch timed out and was terminated launching kernel
> > > kPMEGetGridWeights
> > > Error: the launch timed out and was terminated launching kernel
> > > kCalculatePMENonbondForces
> > >
> > > A second kernel time out occurred in addition to the usual one. The
> > > DPDP model allowed the system to run a bit longer before crashing. It
> > > would be very nice if you could try this system on your C2050 card.
> > > This 1faj system is also running on an 8 processor machine under Amber
> > > 10 and has accumulated 3.66 ns so far under NPT. The density is around
> > > 1.07 in both the Amber 10 run and the Cuda_DPDP run (determined by 1 ns
> > > of NPT simulation before starting NVT), at 300 K. As I mentioned before,
> > > this is a 6-subunit protein, inorganic pyrophosphatase. This system
> > > has 65,000 atoms.
> > >
> > > An even better system to try to reproduce the error on is 1cts,
> > > citrate synthase. This is only a dimeric protein, but its files are too
> > > big to run under the cuda DPDP program on my 470 card (malloc error).
> > > I am running it on Amber 10 and it has accumulated 20.1 ns under NPT.
> > > Under pmemd.cuda, it crashes with the usual kernel time out error (#1
> > > above) in the first ns of NVT MD. The density of this system is 1.04
> > > under Amber 10 NPT and under pmemd.cuda (determined with 1 ns of NPT
> > > before starting NVT), at 300 K. This system has 79,000 atoms.
> > >
> > > I don't know what systems Sasha Buzko is running, but they appear to
> > > be smaller than mine. We are trying the 1faj system at SFSU with a
> > > GTX 240 card in the default SPDP model. I'm afraid that card does not
> > > have enough memory to run this system - we'll find out soon.
> > >
> > > I have made an account on my system for you to log in; data is
> > > provided off list. Thanks!
> > >
> > > Sergio
> > >
> > > Sergio Aragon
> > > Professor of Chemistry
> > > SfSU
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Ross Walker [mailto:ross.rosswalker.co.uk]
> > > Sent: Monday, September 06, 2010 5:54 PM
> > > To: 'AMBER Mailing List'
> > > Cc: 'Duncan Poole'
> > > Subject: Re: [AMBER] problem of GTX480 running pmemd.cuda
> > >
> > > Hi All,
> > >
> > > Can we please get a very simple example of the input and output that is
> > > effectively 'guaranteed' to produce this problem. I would like to start by
> > > confirming for sure that this works fine on GTX295, C1060 and C2050. Once
> > > this is confirmed we will know that it is something related specifically to
> > > GTX480 / 470. Unfortunately I do not have any GTX480's so cannot reproduce
> > > things myself. I want to make sure though that it definitely does not occur
> > > on other hardware.
> > >
> > > All the best
> > > Ross
> > >
> > >
> > >> -----Original Message-----
> > >> From: Sasha Buzko [mailto:obuzko.ucla.edu]
> > >> Sent: Monday, September 06, 2010 2:21 PM
> > >> To: AMBER Mailing List
> > >> Subject: Re: [AMBER] problem of GTX480 running pmemd.cuda
> > >>
> > >> Hi Yi,
> > >> yes, this issue does happen to other people, and we are in the process
> > >> of figuring out why these things happen on consumer cards and don't
> > >> happen on Tesla. As far as I know, there is no clear solution to this
> > >> yet, although maybe Ross and Scott could make some suggestions.
> > >>
> > >> As a side note, have you seen any simulation failures with "the launch
> > >> timed out" error? Also, what's your card/CUDA driver versions?
> > >>
> > >> Thanks
> > >>
> > >> Sasha
> > >>
> > >>
> > >> Yi Xue wrote:
> > >>
> > >>> Dear Amber users,
> > >>>
> > >>> I've been running pmemd.cuda on GTX480 for two months (implicit solvent
> > >>> simulation). Occasionally, the program would get stuck: the process is
> > >>> running ok when typing "top"; the output file "md.out" just prints out
> > >>> energy terms at some time point but does not get updated any more; the
> > >>> temperature of the GPU will decrease by ~13 C, but it is still higher
> > >>> than the idle temperature by ~25 C. After I restart the current
> > >>> trajectory, the problem would be gone in most cases.
> > >>>
> > >>> It seems like in that case the job cannot be submitted to (or executed
> > >>> in) the GPU unit. I'm wondering if this issue also happens to other
> > >>> people...
> > >>>
> > >>> Thanks for any response.
>
--
Jason M. Swails
Quantum Theory Project,
University of Florida
Ph.D. Graduate Student
352-392-4032