Re: [AMBER] Amber20 pmemd.cuda installation and performance problems

From: David Cerutti <dscerutti.gmail.com>
Date: Thu, 28 May 2020 18:51:16 -0400

I have access to a machine in Darrin York's group that has a properly
situated RTX-2080Ti, and I have been able to benchmark the code and see the
performance hit (also, contrary to what I reported on another thread about
this, the Pascal GP100 is seeing a 5% DECREASE in performance, not an
increase with Amber20, so the problem is widespread). After further
consultation with Scott Legrand and one of our NVIDIA technical
collaborators, we have narrowed the problem down to changes now made in the
non-bonded kernels to protect us against deprecations that NVIDIA plans to
make in CUDA 11 and future releases.
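
To give a concrete picture of the kind of change involved (the code below
is an illustration of my own, not the actual pmemd.cuda source): CUDA has
been deprecating the old implicitly warp-synchronous intrinsics in favor of
_sync variants that take an explicit participation mask, paired with
__syncwarp() barriers where the old code could assume lockstep execution.

    // Toy warp reduction showing the old versus the new intrinsic forms.
    // Kernel and variable names are hypothetical, and blockDim.x is
    // assumed to be a multiple of 32.
    __global__ void warpSumIllustration(const float *in, float *out, int n)
    {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      float f = (idx < n) ? in[idx] : 0.0f;

      // Pre-CUDA 9 style (deprecated; relied on implicit warp synchrony):
      //   for (int s = 16; s > 0; s >>= 1) f += __shfl_down(f, s);

      // Style required going forward: an explicit mask on every warp
      // intrinsic, plus a warp barrier where lockstep used to be assumed.
      const unsigned fullMask = 0xffffffffu;
      for (int s = 16; s > 0; s >>= 1) {
        f += __shfl_down_sync(fullMask, f, s);
      }
      __syncwarp(fullMask);

      if ((threadIdx.x & 31) == 0 && idx < n) {
        out[idx >> 5] = f;   // one partial sum per warp
      }
    }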

(Thomas, thank you for your efforts, but I think we have already solved the
problem, at least to the extent that testing on your platforms could inform
us. I may yet log in if we devise a fix and want to test it on a broad
array of RTX hardware, but that is an IF.)

In particular, Scott and I looked over the kernel register spills, and
Amber20 is overall better in this regard than Amber18, although neither
code has a performance problem in this respect. Scott has also checked in
some safer, hardware-specific kernel launch bounds, and these will be
applied in a future patch, but what is there now is not causing a performance
problem nor, as far as we can tell, creating any other issue. Some users, and
even NVIDIA technicians themselves, have brought up possible performance
drag resulting from the cmake installation as opposed to the legacy build,
but I compiled with the legacy build system and I still see a 15% hit on
RTX-2080Ti, so any impact of cmake is marginal.
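
For anyone unfamiliar with launch bounds, the idea looks something like the
sketch below; the thread and block counts are placeholders, not the values
Scott actually checked in.

    // Illustrative only: architecture-specific launch bounds selected at
    // compile time, with placeholder numbers.
    #if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 750)   // Turing
    #  define NB_THREADS_PER_BLOCK 256
    #  define NB_MIN_BLOCKS_PER_SM 4
    #else                                                  // Pascal, Volta
    #  define NB_THREADS_PER_BLOCK 320
    #  define NB_MIN_BLOCKS_PER_SM 3
    #endif

    __global__ void
    __launch_bounds__(NB_THREADS_PER_BLOCK, NB_MIN_BLOCKS_PER_SM)
    nonbondKernelIllustration(const float *crd, float *frc, int natom)
    {
      // The qualifier caps the registers each thread may use so that the
      // requested number of blocks can stay resident per SM on the target
      // architecture; exceeding the cap forces spills instead of silently
      // lowering occupancy.
    }

If you want to check register counts and spills on your own build, compiling
with nvcc -Xptxas -v prints both for every kernel.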

In summary, we have traced the performance hit to two kernels, and
unfortunately they are the most time-consuming kernels in most MD
applications. While I am still trying to understand the degree to which each
of the necessary changes contributes to the slowdown, the changes are not
things we can simply revert, and I do not know whether even a major
overhaul of the non-bonded kernel could mitigate the new CUDA calls that
will be required in all code going forward. Furthermore, two years ago I
attempted a major rewrite of the non-bonded routine. After climbing that
mountain and reaching over the last rock, I looked down to see that the
code was going to become much more complex, that many niche features would
be affected, and that while it might be nice for the directions I would like
to see MD go, the new method was not a performance win across the board.

I wish I had better news, but it appears that this slowdown in
Amber20 is about where CUDA is going, not the result of any mistakes we
have made. We may be able to macro-out some of the calls at compile time
to recover a few percent on legacy chips like Pascal, and the Volta
architecture (V100 and Titan-V) seems to be resilient in the face of the
added synchronization calls. However, it looks like the Turing
architecture performance is going to suffer for the foreseeable future.
The hope I can offer is that the Ampere chips on the horizon appear to
resume an upward trend in the compute capacity of a single card, so in time
simulations will again start to get faster.
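
To make the macro-out idea concrete, the sketch below shows the general
shape (it is not a committed patch, and the details may change): on chips
older than Volta, which still run the threads of a warp in lockstep, some
of the added warp barriers could in principle be compiled away behind a
preprocessor guard.

    // Sketch only: compile out the added warp barriers on pre-Volta chips
    // (compute capability < 7.0), which execute a warp in lockstep.
    #if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 700)
    #  define WARP_SYNC(mask)                  // no-op on Pascal and earlier
    #else
    #  define WARP_SYNC(mask) __syncwarp(mask)
    #endif

    __device__ float tileSumIllustration(float f)
    {
      const unsigned fullMask = 0xffffffffu;
      for (int s = 16; s > 0; s >>= 1) {
        f += __shfl_down_sync(fullMask, f, s);
      }
      WARP_SYNC(fullMask);   // emitted only where the hardware requires it
      return f;
    }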

Dave


On Thu, May 28, 2020 at 2:26 PM Thomas Zeiser <thomas.zeiser.fau.de> wrote:

> On Wed, May 27, 2020 at 08:35:38PM +0200, Thomas Zeiser wrote:
> > Hi Dave,
> >
> > please send me your SSH public key. I'll come back tomorrow with
> > further instructions.
>
> You have to connect with SSH as user w17p0001 to port 22622 of
> grid.rrze.uni-erlangen.de, e.g.
>   ssh -4 -p 22622 w17p0001@grid.rrze.uni-erlangen.de
>
> That should give you a shell on "testfront1". There, you can use e.g.
> srun -w medusa --time=10:0:0 --pty /bin/bash
> to get an interactive job on the node "medusa" which hosts the GPUs.
>
> CUDA, etc. are only installed on "medusa". Both testfront1 and
> medusa can directly access data from the internet (through NAT).
>
> While you have a job running on medusa, you can also SSH to medusa.
> Once the job ends, all processes are killed. Thus, if you want to use
> "screen", run it on testfront1.
>
> $HOME has a quota of 10 GB; $WORK of 333 GB.
>
>
> Best
>
> thomas
>
> > Best
> >
> > thomas
> >
> > On Wed, May 27, 2020 at 02:29:58PM -0400, David Cerutti wrote:
> > > This sounds like a great plan. I am about to test amber18 and amber20
> > > on a local machine in another lab. If I can get access to the server
> > > with a range of different Turing GPUs, I can start to look at how your
> > > problem takes place.
> > >
> > > Thanks,
> > >
> > > Dave
> > >
> > >
> > > On Wed, May 27, 2020 at 9:08 AM Thomas Zeiser <thomas.zeiser.fau.de>
> > > wrote:
> > >
> > > > Hi Dave,
> > > >
> > > > On Wed, May 27, 2020 at 07:54:32AM -0400, David Cerutti wrote:
> > > > > I cannot make a meaningful test on an RTX-2080Ti because the card
> > > > > I have access to is not sufficiently powered to give the right
> > > > > numbers. I see about a 20% degradation relative to what Ross was
> > > > > able to get. Ditto for an RTX-6000, which is nearly as fast as a
> > > > > V100 despite having 20% too little power feeding it.
> > > >
> > > > we could provide you with temporary access to our HPC systems to
> > > > support you in investigating the possible performance degradation.
> > > >
> > > > I could either offer one host with four different Turing-based
> > > > GPUs (Geforce 2070 Super, 2080 Super, Quadro RTX 5000, and 6000) or
> > > > a node with 4x Geforce 2080Ti.
> > > >
> > > > Both systems are running Ubuntu 18.04. CUDA toolkits 9.0 up to 10.2
> > > > are installed together with driver 440.64.00. Persistence mode is
> > > > enabled on all GPUs. cmake/3.11.1 would also be available as a
> > > > module.
> > > >
> > > >
> > > > Best
> > > >
> > > > thomas
> > > >
> > > > > Dave
>
> --
> Dr.-Ing. Thomas Zeiser, HPC Services
> Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)
> Regionales Rechenzentrum Erlangen (RRZE)
> Martensstraße 1, 91058 Erlangen, Germany
> Tel.: +49 9131 85-28737, Fax: +49 9131 302941
> thomas.zeiser.fau.de
> https://www.rrze.de/hpc & https://hpc.fau.de
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu May 28 2020 - 16:00:02 PDT