Dear Amber developers and users,
may I revive the discussion about the PMEMD 'Not a Number' (NaN) error during
calculations on CUDA devices? After applying the latest bug fixes (1 through
20 for Amber and 1 through 15 for AmberTools 1.5), I still get annoying NaNs
on 'wrapping'. But after spending two days on this, I believe I have found
something pointing to the cause of the error.
The conditions under which the error appears are:
1) minimization, heating, density equilibration, and then a long MD run with
ntt=1;
2) all stages are run on CUDA, so heating is performed without temperature
scaling (nmropt=0);
3) all runs prior to the last use the Langevin thermostat; the last one is
switched to Berendsen.
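For concreteness, the thermostat settings in question look roughly like the
following mdin fragments (a minimal sketch; the step counts, temperatures,
and coupling constants here are illustrative placeholders, not my exact
inputs):

```
Equilibration stages: Langevin thermostat (ntt=3)
 &cntrl
   imin=0, irest=1, ntx=5,
   nstlim=500000, dt=0.002,
   ntt=3, gamma_ln=2.0, temp0=300.0,
   ntb=2, ntp=1, ntc=2, ntf=2,
   cut=8.0, iwrap=1,
 /

Final long MD run: switched to Berendsen (ntt=1)
 &cntrl
   imin=0, irest=1, ntx=5,
   nstlim=5000000, dt=0.002,
   ntt=1, tautp=1.0, temp0=300.0,
   ntb=2, ntp=1, ntc=2, ntf=2,
   cut=8.0, iwrap=1,
 /
```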
The error's behavior is rather unpredictable: there is no point at which it
appears with 100% certainty. But the time lag clearly depends on the
thermostat settings (not on iwrap). Most 'soft' crashes (NaN in the output)
occur during the last run in the chain described above. My guess, however, is
that something goes wrong with the velocity distribution at the very
beginning. It cannot be seen in the energies in the output, and the Berendsen
thermostat, which preserves that distribution, acts like a time bomb: it
waits until something ultrafast happens in the system, and then pmemd
crashes.
The systems I model are reasonably well equilibrated a priori, being based on
X-ray structures; they contain no extra strain introduced by hand. To work
around the NaN issue, I divided the heating run into several shorter ones
with a very short time step (0.0001 ps) and strong thermostat coupling
(gamma_ln > 20). These runs revealed an initial rise in temperature due to
relaxation, even though minimization was exhaustive and finished with 'limin
failure' messages. Strangely, after that the error no longer appears. So it
seems the error was seeded in the first MD runs and only surfaced after a few
ns. I tested this behavior on 50 different proteins: all of them were modeled
successfully when the initial runs were calculated with nmropt=1 and smooth
temperature scaling at the beginning. That was a few months ago, before bug
fixes 18-20 had been released.
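The workaround with smooth temperature scaling can be sketched as an mdin
heating input like the one below, using the standard nmropt=1 / &wt TEMP0
ramp mechanism (again a minimal illustration; the step counts and target
temperature are placeholders rather than my exact settings):

```
Heating with a smooth TEMP0 ramp (nmropt=1), short dt, strong coupling
 &cntrl
   imin=0, irest=0, ntx=1,
   nstlim=50000, dt=0.0001,
   ntt=3, gamma_ln=20.0,
   tempi=0.0, temp0=300.0,
   ntb=1, ntc=2, ntf=2,
   cut=8.0, nmropt=1,
 /
 &wt type='TEMP0', istep1=0, istep2=45000,
     value1=0.0, value2=300.0 /
 &wt type='END' /
```

The &wt TEMP0 record raises the target temperature linearly from value1 to
value2 over steps istep1..istep2, so the velocity distribution is never
shocked by a sudden jump to the final temperature.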
The only remaining question is why the CPU code compiled with gcc is
resistant to this error. Perhaps that is also why NaNs are so unexpected in
the output of regular MD runs with pmemd.cuda.
--
Sincerely,
Dmitry Mukha
Institute of Bioorganic Chemistry, NAS, Minsk, Belarus
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Feb 20 2012 - 10:00:02 PST