Re: [AMBER] Amber20 Performance on RTX (Turing): known problems, with a patch forthcoming

From: David Cerutti <>
Date: Tue, 21 Jul 2020 12:06:31 -0400

This is a timely question. We have had the fix in our master branch for
some time, but other issues need to be resolved before we can commit the
result as a patch to the release version. We've recently had the
original author of pmemd.cuda come back to our team, and he is a wealth of
knowledge about the direction NVIDIA is going and the methods in their
toolkit for ferreting out vulnerabilities. Things that caused compiler
warnings but didn't seem to interfere with test cases or calculations are
getting hunted down and solved. Even so, a patch will be ready soon.

When it arrives, the patch will affect all versions of CUDA 9 and 10.
CUDA 11 will be able to compile the code for Ampere but will still produce a
slowdown on Turing, Volta, and Pascal. This is a curious case of a newer
compiler not giving better results, but it is due to new functionality
and the deprecation of old methods, with no opt-out offered to users.
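
For anyone wondering what a compiler-specific directive of the kind
described in the quoted May 29 message can look like, here is a minimal
sketch. It is not the actual Amber patch. The names SYNC_MIN_ARCH,
ARCH_SYNC, ARCH_SYNC_ENABLED, HOST_DEVICE, and sum_with_sync are
hypothetical, and the cutoff of 800 is an assumption chosen only for
illustration. What is real: nvcc defines __CUDA_ARCH__ while compiling
device code (750 for sm_75, 800 for sm_80), and __syncwarp() is a CUDA
intrinsic.

```cpp
#include <cassert>

// Hypothetical cutoff (compute capability x 100) at and above which the
// extra synchronization is kept; 800 is an assumption for illustration only.
#ifndef SYNC_MIN_ARCH
#define SYNC_MIN_ARCH 800
#endif

// nvcc defines __CUDA_ARCH__ only while compiling device code
// (750 for sm_75, 800 for sm_80). A host-only C++ build never defines it.
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= SYNC_MIN_ARCH)
#define ARCH_SYNC() __syncwarp()   // CUDA intrinsic, kept on newer chips
#define ARCH_SYNC_ENABLED 1
#else
#define ARCH_SYNC() ((void)0)      // compiled out where it is not needed
#define ARCH_SYNC_ENABLED 0
#endif

// Lets the same helper build as plain C++ or under nvcc.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Stand-in for a kernel helper: sums n values, with a sync point per step.
HOST_DEVICE inline int sum_with_sync(const int* values, int n) {
    assert(n >= 0);
    int total = 0;
    for (int i = 0; i < n; ++i) {
        total += values[i];
        ARCH_SYNC();  // expands to nothing below the cutoff
    }
    return total;
}
```

Because the choice is made by the preprocessor, each target architecture
gets its own code path at compile time and the older chips pay no runtime
cost for a check.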


On Tue, Jul 21, 2020 at 11:17 AM Tim Travers <> wrote:

> Hi Dave,
> I wanted to inquire if the patch for Amber20 that you mentioned below will
> be made available? We are currently using CUDA 10.2 on Turing cards and
> have seen the same slowdown. Not sure if these cards are among those that
> do not perform the thread synchronization, and so would not benefit here by
> upgrading to CUDA 11.
> Thanks,
> Tim
> On Fri, May 29, 2020 at 12:30 AM David Cerutti <> wrote:
> > Dear Users,
> >
> > As has been shared on this listserv, many users are finding that Amber20
> > is not as fast on Turing architectures for PME simulations as Amber18. The
> > source of this problem has now been identified, and indeed it affects much
> > more than just Turing; the 15-20% slowdown seen on Turing is merely the
> > most severe case.
> >
> > The slowdown itself does NOT reflect any bugs or issues that would
> > necessitate repeating experiments. The problem, rather, is that some
> > future-proofing that our collaborators at NVIDIA kindly performed for us
> > has led to more GPU effort in synchronization. The benefit of this is
> > that, come CUDA 11 and the new Ampere chipset, pmemd.cuda is already
> > prepared to run on the cards (at a substantially greater speed than is
> > currently possible with a V100, which in my view competes with RTX-6000
> > for top dog). However, legacy chipsets that do not need to perform the
> > synchronization required for CUDA 11 to work properly will suffer in
> > performance.
> >
> > Contrary to what I warned yesterday afternoon, a fix is possible and we
> > already have it. A compiler-specific directive will create separate code
> > paths for the various chipsets and mask out the synchronization where it
> > is not needed, recovering the Amber18 performance while still keeping the
> > code in a state that is ready for the next architecture.
> >
> > I would like to thank Scott Legrand, Peng Wang, and others at NVIDIA who
> > contributed either to the future-proofing or the short-term recovery
> > effort. As you sit at home preparing your new simulations, please enjoy
> > some ice cream or other simple treat while you await the forthcoming
> > patch that will put Amber20 back where it should be on the benchmarks.
> >
> > Sincerely,
> >
> > Dave Cerutti
> > _______________________________________________
> > AMBER mailing list
Received on Tue Jul 21 2020 - 09:30:03 PDT