Re: [AMBER] Amber20 pmemd.cuda performance problems (especially on Turing-based GPUs)

From: Thomas Zeiser <>
Date: Mon, 25 May 2020 11:01:09 +0200

Dear All,

we also observe a significant performance degradation of pmemd.cuda
from Amber20 especially on Turing-based GPUs (e.g. RTX2080Ti)
compared to Amber18.

Our environment:
- dual socket nodes with Intel Xeon CPUs (either Broadwell or
  Skylake) - but host processor should not matter
- typically four identical GPUs per node
- Operating system: Ubuntu 18.04.4 LTS and gcc-7 as
  default compiler
- Cuda: 9.2, 10.0, 10.1, 10.2 tested for Amber20
- exclusive access to the nodes; persistence mode enabled on the
  GPUs; no thermal throttling as confirmed by the system monitoring
- identical nodes have been used for all runs

The Amber18 reference binary (18p14-at19p03) has been compiled
using Cuda-10.0.

For Amber20 with Cuda-10.2 we compared the new CMake build
(./quick_cmake_install -cuda) and the old configure build (using
the same protocol as used for building Amber18 at our site). Both
yield comparable performance for Amber20 - which is good.

For Amber20 with Cuda-10.2 we also tried gcc-8/gfortran-8 instead
of the default gcc-7 toolchain. No significant change (at least on
Geforce RTX2080Ti) - as expected, since all the work is offloaded
to the GPU.

On Geforce RTX2080Ti we also tested Singularity images of CentOS 7,
Ubuntu 16.04 and 18.04 built using Gerald Monard's amberity script.
Again, no significant change.

The Amber18 benchmark suite is used for performance evaluation.

Attached are graphs which show the relative performance difference
between Amber18 and Amber20 (i.e. ns/day from Amber18 divided by
the value from Amber20, minus 1) for "PME (Optimized)" and "GB",
respectively. The average of the four values reported for the four
GPUs of a node is used. For Cuda-10.2 we repeated the simulations
twice; the results are comparable. A positive value means that
Amber18 is that much faster (i.e. 0.1 means Amber18 is 10% faster
than Amber20).
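For reference, the metric described above can be computed with a short
sketch; the example values are the per-GPU RTX2080Ti Cellulose NPT
numbers quoted later in this mail:

```python
# Relative performance difference as defined above:
# ns/day(Amber18) / ns/day(Amber20) - 1; positive => Amber18 is faster.
def rel_diff(nsday_amber18: float, nsday_amber20: float) -> float:
    return nsday_amber18 / nsday_amber20 - 1.0

# Per-GPU Cellulose_production_NPT_4fs values on Geforce RTX2080Ti (ns/day)
amber18 = [71.01, 71.14, 70.56, 71.52]
amber20 = [63.73, 64.46, 63.83, 64.74]

mean18 = sum(amber18) / len(amber18)   # ~71.06
mean20 = sum(amber20) / len(amber20)   # ~64.19
print(f"Amber18 is {rel_diff(mean18, mean20):.1%} faster")  # ~10.7%
```

This matches the 10-15% degradation reported for the RTX2080Ti below.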

General findings: Cuda-9.2 and Cuda-10.0 generally result in worse
performance for Amber20; Cuda-10.1 and Cuda-10.2 give comparable
performance.

Results for different (generations of) GPUs:
- on Tesla V100 the larger PME cases are up to 5% faster with
  Amber20 than with Amber18
- for Geforce GTX1080 and GTX1080Ti, Amber18 and Amber20 give similar
  mean performance over all benchmarks; most PME cases are ~2% slower
  with Amber20, while TRPCage is faster
- on Geforce RTX2080Ti all PME benchmarks and myoglobin are 10-15%
  slower with Amber20; that is a really severe problem.
  Other Turing-based cards (Quadro RTX5000, Quadro RTX6000, Geforce
  RTX2070 Super and RTX2080 Super) also show a performance degradation
  of 10-20% for the PME cases. There is not much difference for
  TRPCage and nucleosome.

For those who prefer plain numbers: here, as an example, the raw
output for "Cellulose_production_NPT_4fs PME (Optimized)" from
the Amber18-Benchmark runscript and Amber20 compiled with Cuda-10.2.
The four columns are the values from the four GPUs in the node.
Semicolons have been added for easier processing.
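As a minimal parsing sketch (assuming the exact line layout shown in the
raw output below, including the trailing "!!!!" markers on some lines):

```python
def parse_bench_line(line: str):
    """Split 'host benchmark ;label ; v1; v2; ...' into its parts."""
    line = line.rstrip().rstrip('!').rstrip()        # drop "!!!!" markers
    head, label, *values = (f.strip() for f in line.split(';'))
    host, benchmark = head.split(None, 1)            # first token is the host
    return host, benchmark, label, [float(v) for v in values]

raw = ("geforce-rtx2080ti Cellulose_production_NPT_4fs "
       ";PME (Optimized) ; 71.01; 71.14; 70.56; 71.52 !!!!")
host, bench, label, nsday = parse_bench_line(raw)
print(host, label, sum(nsday) / len(nsday))  # mean ns/day over the four GPUs
```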

***AMBER18 (18p14-at19p03)***
tesla-v100 Cellulose_production_NPT_4fs ;PME (Optimized) ; 85.76; 86.12; 85.76; 85.59
geforce-rtx2080ti Cellulose_production_NPT_4fs ;PME (Optimized) ; 71.01; 71.14; 70.56; 71.52 !!!!
geforce-gtx1080ti Cellulose_production_NPT_4fs ;PME (Optimized) ; 43.03; 43.51; 43.50; 43.86
geforce-gtx1080 Cellulose_production_NPT_4fs ;PME (Optimized) ; 30.44; 29.82; 30.47; 29.78

quadro-rtx5000 Cellulose_production_NPT_4fs ;PME (Optimized) ; 53.44 !!!!
quadro-rtx6000 Cellulose_production_NPT_4fs ;PME (Optimized) ; 75.67 !!!!
geforce-rtx2070super Cellulose_production_NPT_4fs ;PME (Optimized) ; 49.20 !!!!
geforce-rtx2080super Cellulose_production_NPT_4fs ;PME (Optimized) ; 55.36 !!!!

The Amber18 numbers roughly match the published "boost" values.

***AMBER20 (20p00-at20p01)***
tesla-v100 Cellulose_production_NPT_4fs ;PME (Optimized) ; 91.52; 91.22; 91.05; 91.26
geforce-rtx2080ti Cellulose_production_NPT_4fs ;PME (Optimized) ; 63.73; 64.46; 63.83; 64.74 !!!!
rtx2080ti(configure) Cellulose_production_NPT_4fs ;PME (Optimized) ; 64.15; 64.48; 63.82; 64.82 !!!!
geforce-gtx1080ti Cellulose_production_NPT_4fs ;PME (Optimized) ; 42.89; 43.54; 43.78; 43.74
geforce-gtx1080 Cellulose_production_NPT_4fs ;PME (Optimized) ; 31.44; 30.73; 30.97; 30.81

quadro-rtx5000 Cellulose_production_NPT_4fs ;PME (Optimized) ; 48.26 !!!!
quadro-rtx6000 Cellulose_production_NPT_4fs ;PME (Optimized) ; 68.50 !!!!
geforce-rtx2070super Cellulose_production_NPT_4fs ;PME (Optimized) ; 43.94 !!!!
geforce-rtx2080super Cellulose_production_NPT_4fs ;PME (Optimized) ; 49.70 !!!!

The complete raw data as well as logs from the build process can be
provided upon request.

Best regards

Thomas Zeiser

On Sun, May 17, 2020 at 11:43:26AM +0300, Filip Fratev wrote:
> Hi,
> I was able to install pmemd.cuda only if I use the old ./configure
> method. That way the link between gcc-7 and Cuda 10.2, e.g.
> sudo ln -s /usr/bin/gcc-7.xx /usr/local/cuda/bin/gcc, is
> possible. However, this is not possible using the new cmake procedure.
> Further, I noticed a significant performance drop of Amber20 in comparison
> to Amber18. I don't know whether this is due to the compilation process
> (make vs. cmake), as this has already been noticed for Sander.
> These are the numbers obtained by GTX 2080Ti and Factor X system:
> Steps    Amber20           Amber18
> 10K      177.06 ns/day     198.31 ns/day
> 50K      175.33 ns/day     196.18 ns/day
> Any comments and sharing experience by other users could be helpful.
> Regards,
> Filip

Dr.-Ing. Thomas Zeiser, HPC Services
Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)
Regionales Rechenzentrum Erlangen (RRZE)
Martensstraße 1, 91058 Erlangen, Germany

AMBER mailing list

Received on Mon May 25 2020 - 02:30:03 PDT