I've been experiencing this for a while now, and I'll outline my
current observations below:
As Dmitry says, it isn't specific to iwrap; it is a general problem.
What I've found is that at some point some 'NaN' entries appear, and
over time the remaining atoms disappear as well. I've looked at
trajectories for this, and I don't see things flying off into space;
the error is quite sudden, going from a totally normal system at one
instant to a destroyed system one timestep later.
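In case it is useful to anyone trying to reproduce this, below is a
rough Python sketch of the kind of check I mean: it walks an ASCII
trajectory and reports the first NaN/Inf (or overflowed '****')
coordinate. The file name, atom count, and the assumption of a plain
ASCII .mdcrd are only examples; adjust them for your own output
(NetCDF trajectories would need a different reader).

  #!/usr/bin/env python
  # Rough sketch: find the first frame of an ASCII .mdcrd that contains
  # NaN/Inf (or overflowed '****') coordinates.  File name and atom
  # count are placeholders; the frame estimate ignores box lines, so
  # treat it as approximate for iwrap=1 trajectories.
  import math, sys

  natoms = 100000                    # e.g. the 100k atom system
  values_per_frame = 3 * natoms      # x, y, z for every atom

  count = 0
  with open("md.mdcrd") as f:        # hypothetical file name
      f.readline()                   # skip the title line
      for lineno, line in enumerate(f, start=2):
          for i in range(0, len(line.rstrip("\n")), 8):  # 10F8.3 fields
              field = line[i:i+8]
              if not field.strip():
                  continue
              try:
                  x = float(field)
              except ValueError:     # '********' won't parse
                  x = float("nan")
              if math.isnan(x) or math.isinf(x):
                  frame = count // values_per_frame + 1
                  print("first bad value at line %d (roughly frame %d)"
                        % (lineno, frame))
                  sys.exit(0)
              count += 1
  print("no NaN/Inf found")
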
With all the latest bugfixes, and iwrap=1, this is what I've discovered:
1) The error rate is dependent on the size of the system; smaller
systems are more resilient. Whether this is just a result of there
being fewer total calculations, and therefore less opportunity to
experience a catastrophe, or something else, I don't know.
2) The error rate is very hardware dependent. I've tested 3 different
nVidia cards with exactly the same compilation of AMBER (all bugfixes,
same nvcc, same gcc):
On Tesla M2070 cards (the super expensive ones): the error happens very
rarely for a 100k atom system (probably ~1/100 times).
On a GTX570: the error happens with considerable frequency for a 100k
atom system (~1/2 times), but very rarely for a 50k atom system
(~1/100 times).
On a GTX580: the error happens every single time for a 100k atom system.
3) The error is NOT dependent on hardware temperature:
Using the GTX580 I found for a 100k atom system that, in general, all
100k atoms went from normal coordinates in the first and second
timesteps to all NaN by the 3rd timestep. Sometimes it got past the
3rd timestep and was destroyed by the 4th. Since the time for this on a
GTX580 is measured in msec at most, there was simply no time for the
card to heat beyond its resting temperature of ~35 degrees, so I don't
think it is temperature dependent.
4) The error MAY be dependent on hardware settings, particularly voltage:
I haven't had time to test this fully, but I've noticed fewer errors
when running on the GTX570 if the nVidia settings are set to prefer
maximum performance rather than adaptive performance. Since I notice
most of the errors happening very early in a simulation, this suggests
that if the card is running at low clock speeds and suddenly gets a lot
of requests from AMBER, there may be a period in which the voltage is
unstable as the card clocks up to full performance, whereas keeping it
at full clock speeds before starting the simulation bypasses this
problem (a rough way to watch the clock behavior during startup is
sketched below). As I say, I've tested this a bit on the GTX570, but
haven't had a chance to test the less stable GTX580, which would be the
real test. I can imagine, though, that the M2070 does not suffer from
these issues.
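If anyone wants to check the clock-ramping idea on their own card,
something like the rough sketch below can poll the clocks during the
first seconds of a run. It assumes nvidia-smi is on the PATH and that
'nvidia-smi -q -d CLOCK' prints a 'Graphics ... MHz' line; the exact
output format may differ between driver versions, so treat this only
as a starting point.

  #!/usr/bin/env python
  # Rough sketch: poll the GPU graphics clock for the first ~10 seconds
  # after a pmemd.cuda job starts, to see whether the card is still
  # ramping up from its idle clocks.
  import subprocess, time

  for _ in range(20):                # ~10 s at 0.5 s intervals
      out = subprocess.check_output(["nvidia-smi", "-q", "-d", "CLOCK"])
      for line in out.decode(errors="replace").splitlines():
          if "Graphics" in line:
              print(time.strftime("%H:%M:%S"), line.strip())
              break
      time.sleep(0.5)

Run it in one terminal, start the simulation in another, and watch
whether the reported clock is still climbing during the first few
steps.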
So that is what I've seen so far. If anyone else has found any more
information, that would be great. If I manage to find the time to test
the very unstable GTX580 with the clock speeds fixed rather than
variable, I'll report back. Obviously I'm not suggesting that all
GTX580s would have this problem; since it is clearly hardware
dependent, there may be many other hardware factors involved here,
perhaps the reliability of the system's power supply (the GTX580
system actually has a 550W power supply, whereas the recommended
minimum for that card is 600W).
~Aron
On Mon, Feb 20, 2012 at 12:46 PM, Dmitry Mukha <dvmukha.gmail.com> wrote:
> Dear Amber developers and users,
>
> may I revive the discussion about the PMEMD 'Not a Number' error during
> calculations on a CUDA device? After applying the latest bug fixes, from
> the 1st to the 20th for Amber and from the 1st to the 15th for
> AmberTools 1.5, I still get annoying NaNs on 'wrapping'. But after
> spending two days I think I've found something pointing to the reason
> for the error.
>
> The conditions under which the error appears are:
> 1) minimization, heating, density equilibration, and then a long MD run
> with ntt=1
> 2) all are done on CUDA, so heating is performed without temperature
> scaling (nmropt=0)
> 3) all runs prior to the last one are done with the Langevin thermostat;
> the last one is switched to Berendsen
>
> The error behavior is rather unpredictable; you cannot find a point at
> which it appears 100% of the time. But the time lag is definitely
> dependent on the thermostat settings (not on iwrap). Most of the 'soft'
> crashes (NaN in the output) occur during the last run in the
> aforementioned chain. But my guess is that something goes wrong with
> the velocity distribution at the very beginning. It cannot be seen from
> the energies in the output. And the Berendsen thermostat, which
> preserves that distribution, acts like a time bomb: it waits until
> something ultrafast happens in the system, and then pmemd crashes.
>
> The systems I model are rather well equilibrated a priori, being based
> on X-ray structures. They don't contain any extra tension introduced
> 'by hand'. To overcome the NaN issue I have divided the heating run
> into a few smaller ones with a very short time step (0.0001 ps) and
> strong thermostat coupling (gamma_ln > 20). These runs revealed an
> initial rise of the temperature due to relaxation, even though the
> minimization was exhaustive and finished with 'linmin failure'
> messages. It is strange, but after that the error doesn't appear. So
> it seems like it was delayed from the first MD runs and came out only
> after a few ns. Such behavior was tested on 50 different proteins. MD
> of all of them was modeled successfully when the initial runs were
> calculated with nmropt=1 and smooth temperature scaling at the
> beginning. That was a few months ago, before bugfixes 18-20 had been
> released.
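>
> (In case it helps anyone reproduce this, a rough sketch of how such a
> chain of short heating segments could be driven is below. The file
> names, step counts and the exact &cntrl values are placeholders, not
> my actual inputs; adapt them to your own system.)
>
>   #!/usr/bin/env python
>   # Rough sketch: run a chain of short heating segments with a very
>   # small time step and strong Langevin coupling, restarting each
>   # segment from the previous one.  All names/values are placeholders.
>   import subprocess
>
>   MDIN = """short heating segment
>    &cntrl
>     imin=0, irest={irest}, ntx={ntx},
>     nstlim=5000, dt=0.0001,
>     ntt=3, gamma_ln=20.0, tempi=10.0, temp0=300.0,
>     ntb=1, cut=8.0, ntpr=500, iwrap=1,
>    /
>   """
>
>   prev = "min.rst"                   # restart from the minimization
>   for i in range(1, 6):              # five short segments
>       irest, ntx = (0, 1) if i == 1 else (1, 5)
>       with open("heat%02d.in" % i, "w") as f:
>           f.write(MDIN.format(irest=irest, ntx=ntx))
>       subprocess.check_call(
>           ["pmemd.cuda", "-O", "-i", "heat%02d.in" % i, "-p", "prmtop",
>            "-c", prev, "-o", "heat%02d.out" % i, "-r", "heat%02d.rst" % i])
>       prev = "heat%02d.rst" % i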
>
> The only question is why the CPU code compiled with gcc is resistant to
> the error. Maybe this is the reason why NaNs are so unexpected in the
> output during regular MD runs with pmemd.cuda.
>
> --
> Sincerely,
> Dmitry Mukha
> Institute of Bioorganic Chemistry, NAS, Minsk, Belarus
--
Aron Broom M.Sc
PhD Student
Department of Chemistry
University of Waterloo
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Feb 20 2012 - 11:00:02 PST