Aron,
did you try to find out whether the problem depends on ECC being
enabled or disabled (in the case of the M2070, which admittedly looks like
hard work to investigate given an error rate of about 1/100 :-))?
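
(In case it helps, here is a minimal sketch of how the current ECC state
could be checked from a script. It assumes the installed nvidia-smi accepts
the "-q -d ECC" query; treat it as an illustration rather than a tested
recipe.)

import subprocess

# Dump only the ECC section of nvidia-smi's full report (assumes the
# driver's nvidia-smi understands "-q -d ECC").
report = subprocess.check_output(["nvidia-smi", "-q", "-d", "ECC"]).decode()

# Show the lines that mention the ECC mode (current / pending).
for line in report.splitlines():
    if "Ecc Mode" in line or "Current" in line or "Pending" in line:
        print(line.strip())
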
On 20.02.2012 19:37, Aron Broom wrote:
> I've been experiencing this for a while now; I'll outline my current
> observations below.
>
> As Dmitry says, it isn't specific to iwrap; it is a general problem. What
> I've actually found is that at some point some 'NaN' entries appear, and
> over time the remaining atoms disappear as well. I've looked at
> trajectories for this, and I don't see anything flying off into space; the
> failure seems to be quite sudden, going from a totally normal system at one
> instant to a destroyed system one timestep later.
>
> With all the latest bugfixes, and iwrap=1, this is what I've discovered:
>
> 1) The error rate depends on the size of the system; smaller systems
> are more resilient. Whether this is simply because there are fewer
> total calculations and therefore fewer opportunities to hit a
> catastrophe, or something else, I don't know.
>
> 2) The error rate is very hardware dependent. I've tested 3 different
> nVidia cards with exactly the same compilation of AMBER (all bugfixes,
> same nvcc, same gcc):
>
> On Tesla M2070 cards (the super expensive ones): the error happens very
> rarely for a 100k-atom system (roughly 1 in 100 runs).
>
> On a GTX570: the error happens with considerable frequency for a 100k-atom
> system (roughly 1 in 2 runs), but very rarely for a 50k-atom system
> (roughly 1 in 100 runs).
>
> On a GTX580: the error happens every single time for a 100k-atom system.
>
> 3) The error is NOT dependent on hardware temperature:
>
> Using the GTX580 with a 100k-atom system, I found that in general all
> 100k atoms went from normal coordinates in the first and second timesteps
> to all NaN by the 3rd timestep. Sometimes the run made it past the 3rd
> timestep and was destroyed by the 4th. Since the time this takes on a
> GTX580 is measured in msec at most, there was simply no time for the card
> to heat beyond its resting temperature of ~35 degrees, so I don't think it
> is temperature dependent.
>
> 4) The error MAY depend on hardware settings, particularly voltage:
>
> I haven't had the time to test this fully, but I've noticed fewer errors
> when running on the GTX570 if the nVidia settings are set to prefer
> maximum performance rather than adaptive performance. Since most of the
> errors happen very early in a simulation, this suggests the possibility
> that if the card is running at low clock speeds and suddenly gets a lot of
> requests from AMBER, there may be a period in which the voltage is
> unstable as the card clocks up to full performance, whereas pinning it at
> full clock speeds before starting the simulation bypasses this problem.
> As I say, I've tested this a bit on the GTX570, but haven't had a chance
> to test the less stable 580, which would be the real test. I can imagine,
> though, that the M2070 does not suffer from these issues.
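>
> (To be concrete about what I mean by pinning the clocks: a minimal sketch,
> assuming a driver that exposes the GPUPowerMizerMode attribute through
> nvidia-settings and a running X server; the pmemd.cuda file names are just
> placeholders.)
>
> import subprocess
>
> # Ask the driver to prefer maximum performance (constant full clocks) on
> # GPU 0 before launching the simulation.  Value 1 should correspond to
> # "Prefer Maximum Performance" and 0 to "Adaptive".
> subprocess.check_call(
>     ["nvidia-settings", "-a", "[gpu:0]/GPUPowerMizerMode=1"])
>
> # Then start the run as usual, e.g.:
> subprocess.check_call(
>     ["pmemd.cuda", "-O", "-i", "md.in", "-p", "prmtop", "-c", "inpcrd",
>      "-o", "md.out", "-r", "restrt", "-x", "mdcrd"])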
>
> So that is what I've seen so far. If anyone else has found more
> information, that would be great. If I manage to find the time to test
> fixed (non-variable) clock speeds on the very unstable GTX580, I'll report
> back. Obviously I'm also not suggesting that all GTX580s would have this
> problem; since it is clearly hardware dependent, there may be many other
> hardware factors involved here, perhaps the reliability of the system's
> power supply (the GTX580 system actually has a 550W power supply, whereas
> the recommended minimum for that card is 600W).
>
> ~Aron
>
> On Mon, Feb 20, 2012 at 12:46 PM, Dmitry Mukha <dvmukha.gmail.com> wrote:
>
>> Dear Amber developers and users,
>>
>> may I revive the discussion about the PMEMD 'Not a Number' error during
>> calculations on a CUDA device? After applying the latest bug fixes (1
>> through 20 for Amber and 1 through 15 for AmberTools 1.5), I still get the
>> annoying NaNs on 'wrapping'. But after spending two days on this, I think
>> I've found something pointing to the reason for the error.
>>
>> The conditions under which the error appears are:
>> 1) minimization, heating, density equilibration, and then a long MD run
>> with ntt=1
>> 2) everything is done on CUDA, so heating is performed without temperature
>> scaling (nmropt=0)
>> 3) all runs prior to the last one use the Langevin thermostat; the last
>> one is switched to Berendsen (a rough sketch of such a final input follows
>> the list)
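>>
>> (A minimal sketch of how such a final Berendsen segment might be set up;
>> the parameter values and file name are illustrative only.)
>>
>> # Write the mdin for the last segment of the chain: constant-pressure MD
>> # with the Berendsen thermostat (ntt=1), restarting from the previous run.
>> berendsen_mdin = """Production, Berendsen thermostat
>>  &cntrl
>>   imin=0, irest=1, ntx=5,
>>   ntb=2, ntp=1, cut=8.0,
>>   ntt=1, temp0=300.0, tautp=1.0,
>>   dt=0.002, nstlim=500000,
>>   ntpr=1000, ntwx=1000, ntwr=10000,
>>   iwrap=1,
>>  /
>> """
>>
>> with open("md_berendsen.in", "w") as f:
>>     f.write(berendsen_mdin)
>>
>> The file is then passed to pmemd.cuda with -i as usual.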
>>
>> The error's behavior is rather unpredictable: you cannot find a point at
>> which it appears 100% of the time. But the time lag definitely depends on
>> the thermostat settings (not on iwrap). Most of the 'soft' crashes (NaN in
>> the output) occur during the last run in the chain described above. My
>> guess is that something goes wrong with the velocity distribution at the
>> very beginning; it cannot be seen from the energies in the output. The
>> Berendsen thermostat, which preserves that distribution, then acts like a
>> time bomb, waiting until something ultrafast happens in the system, at
>> which point pmemd crashes.
>>
>> The systems I model are reasonably well equilibrated a priori, being based
>> on X-ray structures; they don't contain any extra strain introduced 'by
>> hand'. To work around the NaN issue I divided the heating run into several
>> smaller ones with a very short time step (0.0001 ps) and strong thermostat
>> coupling (gamma_ln > 20); a sketch of such staged inputs is given below.
>> These runs revealed an initial rise in temperature due to relaxation, even
>> though minimization was exhaustive and finished with 'linmin failure'
>> messages. Strangely, after that the error does not appear, so it seems it
>> was carried over from the first MD runs and only came out after a few ns.
>> This behavior was tested on 50 different proteins; MD for all of them ran
>> successfully when the initial runs were calculated with nmropt=1 and
>> smooth temperature scaling at the beginning. That was a few months ago,
>> before bugfixes 18-20 had been released.
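>>
>> (A sketch of what the staged heating inputs might look like; the number of
>> stages, target temperatures, and file names are illustrative only.)
>>
>> # Generate several short heating stages with a very small time step and
>> # strong Langevin coupling; temp0 is ramped stage by stage instead of
>> # using nmropt temperature scaling.
>> stages = [50.0, 100.0, 150.0, 200.0, 250.0, 300.0]  # target temps, K
>>
>> template = """Heating stage {i}: ramp to {t} K
>>  &cntrl
>>   imin=0, irest={irest}, ntx={ntx},
>>   ntb=1, cut=8.0,
>>   ntt=3, gamma_ln=20.0, tempi={tempi}, temp0={t},
>>   dt=0.0001, nstlim=10000,
>>   ntpr=500, ntwr=5000,
>>   ig=-1,
>>  /
>> """
>>
>> for i, t in enumerate(stages, start=1):
>>     first = (i == 1)
>>     mdin = template.format(i=i, t=t,
>>                            irest=0 if first else 1,
>>                            ntx=1 if first else 5,
>>                            tempi=0.0 if first else stages[i - 2])
>>     with open("heat{0}.in".format(i), "w") as f:
>>         f.write(mdin)
>>
>> Each stage is then run with pmemd.cuda, restarting from the previous one.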
>>
>> The only remaining question is why the CPU code compiled with gcc is
>> resistant to the error. Maybe this is why the NaNs are so unexpected in
>> the output of regular MD runs with pmemd.cuda.
>>
>> --
>> Sincerely,
>> Dmitry Mukha
>> Institute of Bioorganic Chemistry, NAS, Minsk, Belarus
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Feb 20 2012 - 11:00:03 PST