Re: [AMBER] Amber20 pmemd.cuda performance problems (especially on Turing-based GPUs)

From: Brent Krueger <kruegerb.hope.edu>
Date: Mon, 25 May 2020 09:04:28 -0400

As long as we are talking about this, let me add two observations that I've
seen on my systems:
1) With AMBER20, using CUDA 10.2 gives me slightly slower performance than
CUDA 10.1 (less than 1% difference). I have not done the comparison of
AMBER18 to AMBER20 with identical CUDA versions. This performance
difference is consistent for me across two PME-based benchmarks (cellulose
and factor-ix).

2) With AMBER20 and CUDA 10.X the memory use is slightly higher compared to
AMBER18 with CUDA 9.1. The result is that I can get the STMV benchmark to
run on a GTX 780 (3 GB of RAM) with the AMBER18 setup if I massage the input
slightly so that it uses ntt=1 rather than ntt=3. These simulations use
2963 MB of RAM. But I cannot get these simulations to run on the same
hardware using AMBER20 with CUDA 10.X.
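
For anyone who wants to reproduce that massage, here is a minimal sketch
(the mdin file name and the polling interval are assumptions; adjust them
to your copy of the STMV benchmark):

    # switch the thermostat from Langevin (ntt=3) to Berendsen (ntt=1)
    sed -i 's/ntt=3/ntt=1/' mdin
    # poll GPU memory use during the run to see whether it fits in 3 GB
    nvidia-smi --query-gpu=memory.used --format=csv -l 10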

Neither of these issues is anything I'm really all that concerned about.
Just thought I'd toss out a couple more datapoints for anyone who chooses
to dig into this.

Cheers,
Brent



On Mon, May 25, 2020 at 5:41 AM David Cerutti <dscerutti.gmail.com> wrote:

> This is troubling. I have just run new tests for Amber20 on some of those
> architectures. Like you, I saw a slight improvement for V100 (and a big
> one for Titan-V, although I suspect most of that owes to compiler
> improvements). There were not that many things done in the code, but some
> optimizations were added by NVIDIA developers to manage memory access and
> caching methods, particularly in the non-bonded inner loop. I have seen
> comparable performance to old Amber18 benchmarks across most of the systems
> I tested--a few percent improvement is common throughout the hardware lines
> I tested, and that's actually consistent with your results given that I've
> been testing mostly on Pascal and Volta. However, even in these
> experiments I have NOT made an explicit test against a recently compiled
> Amber18.
>
> The only Turing cards I can compare to are some RTX-2080Tis that are
> powered by an anemic supply and so do not deliver full performance
> regardless. I also have an RTX-5000 but have not yet done a head-to-head
> comparison of Amber20 against Amber18. That's something I will try to do
> soon.
>
> Dave
>
>
> On Mon, May 25, 2020 at 5:01 AM Thomas Zeiser <
> thomas.zeiser.rrze.uni-erlangen.de> wrote:
>
> > Dear All,
> >
> > we also observe a significant performance degradation of pmemd.cuda
> > from Amber20 especially on Turing-based GPUs (e.g. RTX2080Ti)
> > compared to Amber18.
> >
> > Our environment:
> > - dual socket nodes with Intel Xeon CPUs (either Broadwell or
> > Skylake) - but host processor should not matter
> > - typically four identical GPUs per node
> > - Operating system: Ubuntu 18.04.4 LTS and gcc-7 as
> > default compiler
> > - Cuda: 9.2, 10.0, 10.1, 10.2 tested for Amber20
> > - exclusive access to the nodes; persistence mode enabled on the
> > GPUs; no thermal throttling as confirmed by the system monitoring
> > - identical nodes have been used for all runs
> >
> > The Amber18 reference binary (18p14-at19p03) has been compiled
> > using Cuda-10.0.
> >
> >
> > For Amber20 with Cuda-10.2 we compared the new CMake build
> > (./quick_cmake_install -cuda) and the old configure build (using
> > the same protocol as used for building Amber18 at our site). Both
> > give comparable results for Amber20 - which is good.
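> >
> > For reference, the two build routes were roughly the following (a minimal
> > sketch; the configure invocation reflects our Amber18-era protocol and may
> > differ at other sites):
> >
> >     # new CMake route, as shipped with Amber20
> >     ./quick_cmake_install -cuda
> >     # legacy route, same protocol as our Amber18 build
> >     ./configure -cuda gnu && make install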
> >
> > For Amber20 with Cuda-10.2 we also tried gcc-8/gfortran-8 instead
> > of the default gcc-7 toolchain. No significant change (at least on
> > Geforce RTX2080Ti) - as expected, since all work is offloaded to the GPU.
> >
> > On Geforce RTX2080Ti we also tested Singularity images of CentOS7,
> > Ubuntu 16.04 and 18.04 built using Gerald Monard's amberity script.
> > Again, no significant change.
> >
> >
> > The Amber18 benchmark suite is used for performance evaluation.
> > (./runBenchmarks.sh -SKIP_CPU -RUNID $JOBID)
> >
> >
> > Attached are graphs which show the relative performance difference
> > between Amber18 and Amber20 (i.e. ns/day from Amber18 divided by
> > the value from Amber20 minus 1) for "PME (Optimized)" and "GB",
> > respectively. The average of the four values reported for the
> > four GPUs in a node is taken. For Cuda-10.2 we repeated
> > the simulations twice and the results are comparable. A reported
> > result with a positive value means that Amber18 is that much faster
> > (i.e. 0.1 means 10% faster than Amber20).
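> >
> > As a worked example with the Cellulose numbers listed further below, the
> > metric can be computed from the per-node mean ns/day values (plain awk;
> > the two means are the averages of the four GPU columns):
> >
> >     # relative difference = ns/day(Amber18) / ns/day(Amber20) - 1
> >     awk 'BEGIN { a18 = 71.06; a20 = 64.19; printf "%.3f\n", a18/a20 - 1 }'
> >     # prints 0.107, i.e. Amber18 is ~10.7% faster on the RTX2080Ti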
> >
> >
> > General findings: Cuda-9.2 and Cuda-10.0 generally result in worse
> > performance for Amber20; Cuda-10.1 and Cuda-10.2 give comparable
> > results.
> >
> >
> > Results for different (generations of) GPUs:
> > - on Tesla V100 the larger PME cases are up to 5% faster than with
> > Amber18
> > - for Geforce GTX1080 and GTX1080Ti, Amber18 and 20 give similar mean
> > performance over all benchmarks; most PME cases are ~2% slower
> > while TRPCage is faster
> > - on Geforce RTX2080Ti all PME benchmarks and myoglobin are 10-15%
> > slower for Amber20; that's really a severe problem.
> > Other Turing-based cards (Quadro RTX5000, Quadro RTX6000, Geforce
> > 2070 Super and 2080 Super) also show a performance degradation of
> > 10-20% for the PME cases. There is not much difference for TRPCage
> > and nucleosome.
> >
> >
> > For those who prefer plain numbers: here, as an example, the raw
> > output for "Cellulose_production_NPT_4fs PME (Optimized)" from
> > the Amber18-Benchmark runscript and Amber20 compiled with Cuda-10.2.
> > The four columns are the values for the four GPUs in the node. Semicolons
> > have been added for easier processing.
> >
> > ***AMBER18 (18p14-at19p03)***
> > tesla-v100 Cellulose_production_NPT_4fs ;PME (Optimized)
> > ; 85.76; 86.12; 85.76; 85.59
> > geforce-rtx2080ti Cellulose_production_NPT_4fs ;PME (Optimized)
> > ; 71.01; 71.14; 70.56; 71.52 !!!!
> > geforce-gtx1080ti Cellulose_production_NPT_4fs ;PME (Optimized)
> > ; 43.03; 43.51; 43.50; 43.86
> > geforce-gtx1080 Cellulose_production_NPT_4fs ;PME (Optimized)
> > ; 30.44; 29.82; 30.47; 29.78
> >
> > quadro-rtx5000 Cellulose_production_NPT_4fs ;PME (Optimized)
> > ; 53.44 !!!!
> > quadro-rtx6000 Cellulose_production_NPT_4fs ;PME (Optimized)
> > ; 75.67 !!!!
> > geforce-rtx2070super Cellulose_production_NPT_4fs ;PME (Optimized)
> > ; 49.20 !!!!
> > geforce-rtx2080super Cellulose_production_NPT_4fs ;PME (Optimized)
> > ; 55.36 !!!!
> >
> > The Amber18 numbers roughly match the "boost" values on
> > https://ambermd.org/GPUPerformance.php
> >
> >
> > ***AMBER20 (20p00-at20p01)***
> > tesla-v100 Cellulose_production_NPT_4fs ;PME (Optimized)
> > ; 91.52; 91.22; 91.05; 91.26
> > geforce-rtx2080ti Cellulose_production_NPT_4fs ;PME (Optimized)
> > ; 63.73; 64.46; 63.83; 64.74 !!!!
> > rtx2080ti(configure) Cellulose_production_NPT_4fs ;PME (Optimized)
> > ; 64.15; 64.48; 63.82; 64.82 !!!!
> > geforce-gtx1080ti Cellulose_production_NPT_4fs ;PME (Optimized)
> > ; 42.89; 43.54; 43.78; 43.74
> > geforce-gtx1080 Cellulose_production_NPT_4fs ;PME (Optimized)
> > ; 31.44; 30.73; 30.97; 30.81
> >
> > quadro-rtx5000 Cellulose_production_NPT_4fs ;PME (Optimized)
> > ; 48.26 !!!!
> > quadro-rtx6000 Cellulose_production_NPT_4fs ;PME (Optimized)
> > ; 68.50 !!!!
> > geforce-rtx2070super Cellulose_production_NPT_4fs ;PME (Optimized)
> > ; 43.94 !!!!
> > geforce-rtx2080super Cellulose_production_NPT_4fs ;PME (Optimized)
> > ; 49.70 !!!!
> >
> >
> > The complete raw data as well as logs from the build process can be
> > provided upon request.
> >
> >
> > Best regards
> >
> > Thomas Zeiser
> >
> > On Sun, May 17, 2020 at 11:43:26AM +0300, Filip Fratev wrote:
> > > Hi,
> > >
> > > I was able to install pmemd.cuda only if I used the old ./configure
> > > method. That way, linking gcc-7 into cuda 10.2 is possible, for
> > > example: sudo ln -s /usr/bin/gcc-7.xx /usr/local/cuda/bin/gcc.
> > > However, this is not possible using the new cmake procedure.
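> > >
> > > Spelled out, the workaround looks roughly like this (a sketch only; the
> > > paths are assumptions for a stock Ubuntu 18.04 install with the gcc-7
> > > package, and the g++ link is added by analogy):
> > >
> > >     # make nvcc pick up gcc-7 as its host compiler instead of the system default
> > >     sudo ln -s /usr/bin/gcc-7 /usr/local/cuda/bin/gcc
> > >     sudo ln -s /usr/bin/g++-7 /usr/local/cuda/bin/g++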
> > >
> > > Further, I noticed a significant performance drop of Amber20 in comparison
> > > to Amber18. I don't know whether this is due to the compilation process
> > > (make vs. cmake), as this has already been noticed for Sander.
> > >
> > > These are the numbers obtained with an RTX 2080Ti and the Factor X system:
> > >
> > > Steps   Amber20          Amber18
> > >
> > > 10K     177.06 ns/day    198.31 ns/day
> > > 50K     175.33 ns/day    196.18 ns/day
> > >
> > > Any comments or shared experience from other users would be helpful.
> > >
> > >
> > > Regards,
> > >
> > > Filip
> >
> > --
> > Dr.-Ing. Thomas Zeiser, HPC Services
> > Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)
> > Regionales Rechenzentrum Erlangen (RRZE)
> > Martensstraße 1, 91058 Erlangen, Germany
> > https://www.rrze.de/hpc & https://hpc.fau.de


-- 
_______________________________________________
Brent P. Krueger  (he/him/his)......phone:   616 395 7629
Professor......................................fax:       616 395 7118
Hope College...............................Schaap Hall 2120
Department of Chemistry
Holland, MI     49423
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber