[AMBER] How fast is Amber's new GPU code? Are CPUs ever coming back?

From: David Cerutti <dscerutti.gmail.com>
Date: Tue, 17 Apr 2018 11:40:14 -0400

And now for a public service announcement...

Recently on the list people have been writing in asking about the speed of
pmemd and its GPU acceleration, which simulation program to use, and so
forth. One newcomer even asked whether the GPU code is faster than sander,
and reading between the lines I can see a PI or two wondering whether the
Amber package, as opposed to the freely distributed AmberTools, is worth
the investment. While I'm not here to advertise per se, I will offer the
following observations, which should help anyone trying to understand the
current state of the field, how fast Amber is, and where the technology is
taking us.

Elmar Krieger, perhaps my European doppelganger for the way he ebbs from
force fields to simulation methodology and back, published a fascinating
article back in 2015 showing how they took a single Core i7-5960X Haswell
processor and pushed the dihydrofolate reductase (DHFR, explicit water
molecules) simulation throughput from a humdrum 7.5 ns / day reference (1
fs time step, long cutoff) to run more than 20x faster through a series of
approximations that they demonstrate to be safe:

https://onlinelibrary.wiley.com/doi/full/10.1002/jcc.23899

This is a big achievement--given the way they talk about AVX instructions
and their proximity to the GROMACS folks, I'm going to assume they coded it
'right' and can't get their YASARA program significantly faster under the
same approximations on the same hardware. The paper says "a single Core
i7-5960X" but that's a capital C--they use all eight cores, 16 threads, of
the hyperthreaded Core i7 processor, and I am going to assume that they're
not turbo-boosting because their program is running consistently enough
that there's not much opportunity for the chip to safely ramp up the clock
speed on any particular core.

First, let's compare the nuts and bolts of YASARA to what Amber's primary
simulation engine, pmemd, does. We have something almost identical to DHFR
in terms of atom counts and run conditions (most significantly, PME and
8.0A cutoff): we call it JAC and you can find it on our benchmarks page:

http://ambermd.org/gpus/benchmarks.htm#Benchmarks

Off the cuff, YASARA looks like it might be a bit faster than our pmemd CPU
code, but not tremendously so until they start invoking more drastic
approximations (we kind of part ways once they get to the lime green line
on Figure 8, when we use heavy hydrogens to get to a 4 fs time step and they
use LINCS with multiple time steps). We maintain an 'airtight' pair list,
which guarantees that every interaction within the stated cutoff is going
to be counted (there was a kerfuffle on the listserv a few months back when
I said there was a chink in the armor with Amber16, but we've patched
that). YASARA takes a GROMACS-like approach with what they call a "sloppy"
pair list, which one can reasonably argue is no more an approximation than
hard truncation of vdW interactions or the tail of the erfc(r)/r function
in PME. Krieger and Vriend then get another impressive boost from atomic
instructions (for those unfamiliar, these address a fundamental problem of
parallel computing: they work like a library checkout system, so that any
common piece of memory can have only one thread operating on it at a time).
The additional boost that they then get from a more advanced method for
ensuring that threads don't negate one another's results is small, but they
get another big jump by writing their program to avoid the need to worry
about threads crossing paths in the first place. Newer vector instruction
sets add a little more, and their "densostat," which approximates the
effect of a barostat but merely works to keep the simulation box size where
the user thinks it should be as opposed to where the Newtonian physics
would actually take it, is an 8% boost, comparable to the effect we see for
invoking or removing a virial calculation.
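
To make the "library checkout" idea concrete, here is a minimal CUDA sketch
of a pairwise force accumulation that relies on atomic adds. The kernel, the
array names, and the toy interaction are my own inventions for illustration;
this is not how pmemd.cuda actually lays out its force computation, but the
role of atomicAdd is the same.

  // Minimal sketch: accumulate forces for a list of atom pairs.  Many
  // pairs touch the same atom, so the additions to a given accumulator
  // must be serialized; atomicAdd lets only one thread update a given
  // memory location at a time (the "library checkout").
  __global__ void accumulateForces(const int2 *pairList, int nPairs,
                                   const float4 *crd, float *frcX,
                                   float *frcY, float *frcZ)
  {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= nPairs) {
      return;
    }
    int i = pairList[idx].x;
    int j = pairList[idx].y;
    float dx = crd[j].x - crd[i].x;
    float dy = crd[j].y - crd[i].y;
    float dz = crd[j].z - crd[i].z;
    float r2 = dx*dx + dy*dy + dz*dz;
    float invr2 = 1.0f / r2;

    // Toy repulsive interaction, standing in for the real Lennard-Jones
    // and PME direct-space terms.
    float fmag = invr2 * invr2;

    atomicAdd(&frcX[i], -fmag * dx);
    atomicAdd(&frcY[i], -fmag * dy);
    atomicAdd(&frcZ[i], -fmag * dz);
    atomicAdd(&frcX[j],  fmag * dx);
    atomicAdd(&frcY[j],  fmag * dy);
    atomicAdd(&frcZ[j],  fmag * dz);
  }

The cost shows up when many pairs hit the same atom at once and the hardware
has to serialize those updates; the fancier schemes Krieger describes (and
that GPU codes use as well) are largely about reducing how often that
collision happens.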

Now back to logistics. There are a couple of facts pertinent to the
questions on the listserv that are not in the paper: the price of that
high-performance Haswell processor (currently about $900, on clearance as
it has been succeeded by the new Skylake line), and the power envelope
(140W). The price of the modern equivalent Core i9-7960X Skylake is about
$1400. Skylake doubles the number of cores and threads, but the clock
speed is back down, the cache is not that much bigger, and the power
envelope has gone up to 165W. Tick, tock, tick, tock. I'm just going to
estimate that on a new Skylake Krieger and colleagues might be able to get
DHFR to run at 250 ns / day. As stated above, the approximations made by
pmemd and YASARA are not comparable, and not everything that Krieger does
on a CPU will map to a GPU although a good portion of it will. But let's
go ahead and compare the results to our GTX-1080Ti benchmarks. The new
pmemd.cuda is somewhat faster than the Amber16 code, and I have more
optimizations in my own branch, working within our original set of
approximations and running the JAC benchmark at 730 ns / day. A GTX-1080Ti
cost $700 when it was new, and while the forthcoming GPUs will almost
certainly be more expensive thanks to the diligence of coin farmers, I
don't expect the price of a new Volta-like GTX to crest much above $1100.
Let's just stick with last year's model, though.

The GTX-1080Ti is a 250W card, but on the JAC/DHFR benchmarks it doesn't
ramp up completely (system isn't big enough), so the power consumption is
around 175W. The GPU is therefore beating the CPU by a factor of nearly 3
in speed, for half the cost, and only slightly more electricity consumption
(running a 250W GPU round the clock for a year at $0.12 per kWh will
run you about $260). As I said, JAC can't fully occupy the GPU--bigger
systems run up to 30% more effectively, with the throughput reaching a
plateau for systems about 3-4 times the size of JAC. In contrast, Krieger
shows that the CPU offers nearly linear returns as a function of system
size, with slight degradation due to the O(N log N) property of the FFT.
(The FFT scaling is a weak effect on GPUs, which in fact do FFTs very
efficiently and reach peak efficiency in the FFT at about the same time
they reach peak efficiency in other aspects of the algorithm.) So on bigger
systems the GPU will retain its advantage in terms of power consumption and
grow its lead substantially in terms of throughput. Of course, if we were
to invoke some of the approximations that Krieger shows on the CPU, our GPU
performance would likewise benefit.
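
For anyone who wants to check that electricity figure, the arithmetic behind
my estimate is simply

  \[
    0.250\,\mathrm{kW} \times 24\,\mathrm{h/day} \times 365\,\mathrm{days}
      \approx 2190\,\mathrm{kWh}, \qquad
    2190\,\mathrm{kWh} \times \$0.12/\mathrm{kWh} \approx \$263,
  \]

which assumes the card is pinned at its full 250W rating around the clock;
at the ~175W that JAC actually draws, the yearly bill would be closer to
$185.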

As I said, I'm not doing this to advertise, and I'm certainly not doing it
to call out Krieger and Vriend for their statement that the technological
revolution that took computational science to GPUs may be shifting back to
CPUs. There is truth to that statement, and Elmar has done some fabulous
work, which I also admire for the way he thought outside the box and ran his
dynamics without a Verlet pair list (have a look at mdgx). However, the
comparison underscores the current state of things: while CPUs are making impressive
gains of their own, GPUs retain a considerable lead in terms of simulation
speed and power consumption. As long as NVIDIA continues to let us use GTX
for scientific applications, the GPUs have a considerable cost advantage as
well.

Another way to look at this is that CPUs devote a lot of their silicon to
cache, a dustbin of information that, at some time in the last few hundred
thousand transactions, was hauled in from RAM. GPUs, by contrast, devote
much more of their silicon to arithmetic units, have very little cache, and
run at about half the clock speed of a CPU. Aside from those differences,
the lithography and circuits are pretty much the same. Looking at what each
device is doing with its silicon pretty well explains the differences in
power consumption and throughput. In that sense, I would
expect GPUs to maintain their logistical advantages over CPUs for
compute-intensive operations.

In terms of programming, one must be mindful of the scarcity of cache on
each GPU "core" (the streaming multi-processor, of which the new Volta
architecture has 80 on an immense 81 cm2 die). But, I feel that in their
paper Krieger and Vriend have touched on another aspect of the computing
revolution that is now flowing back to CPUs:

"Although compilers could in theory do [Advanced Vector Extensions]
automatically, it does not work well enough in practice. Instead, the
developer must write code that explicitly uses these instruction sets,
either by programming directly in assembly language, or by using
'intrinsics,' small C/C++ functions that operate on vector data types and
map almost directly to the corresponding assembly instructions, so that the
compiler has an easy job... one needs to rewrite or at least adapt the code
for each SIMD instruction set (and almost every new CPU comes with
additional SIMD instructions)."
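
To give a flavor of what that rewriting looks like, here is a toy example
(my own, not code from YASARA or Amber): a scalar loop and the same loop
written with AVX intrinsics, which operate on eight single-precision floats
per instruction. The function names are invented; the intrinsics are the
real ones from immintrin.h, and this is ordinary host-side C that would
compile the same way inside a .cu file.

  #include <immintrin.h>   // AVX intrinsics

  /* Scalar version: the compiler may or may not vectorize this on its
     own. */
  void scale_add_scalar(const float *x, const float *y, float a,
                        float *out, int n)
  {
    for (int i = 0; i < n; i++) {
      out[i] = a * x[i] + y[i];
    }
  }

  /* Explicit AVX version: each intrinsic maps almost directly to one
     assembly instruction.  To keep the sketch short it assumes n is a
     multiple of 8. */
  void scale_add_avx(const float *x, const float *y, float a,
                     float *out, int n)
  {
    __m256 va = _mm256_set1_ps(a);
    for (int i = 0; i < n; i += 8) {
      __m256 vx = _mm256_loadu_ps(x + i);
      __m256 vy = _mm256_loadu_ps(y + i);
      _mm256_storeu_ps(out + i, _mm256_add_ps(_mm256_mul_ps(va, vx), vy));
    }
  }

Move to AVX-512 and the register width, the vector types, and most of the
intrinsic names change again, which is exactly the maintenance burden the
quote is pointing at.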

For a long time, GPU programming was considered a barrier that saddled the
scientific computing revolution on the backs of a group of people who might
not be computer science geniuses but who had at least invested something in
learning an advanced language. In practice, the threshold for CUDA
programming is to understand that whatever instructions one writes will be
executed by some combination of 32 threads (32-way SIMT), how to manage a
very small cache effectively, and how big the GPU really is. Otherwise,
CUDA is more or less like C. The computer scientists behind GPUs work in
close concert with software developers to provide solutions to fundamental
parallel computing problems like atomic memory transactions and thread
synchronization, and the innovations have been truly impressive. The
onus may now be on Intel and other chip makers to provide compiler
support for C- and Fortran-level programming in ways that effectively
utilize vectorization and all cores of these increasingly multi-core CPUs.
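
As a concrete (and entirely generic) illustration of those habits of mind,
here is a textbook block-wide sum reduction in CUDA. It is not taken from
pmemd.cuda and the names are mine, but it shows the two things a newcomer
has to internalize: the explicitly managed __shared__ memory, and the fact
that instructions are issued to warps of 32 threads whether or not every
lane has useful work to do.

  // Sum n floats, one partial sum per thread block.  Assumes a launch
  // with 256 threads per block, e.g. blockSum<<<nBlocks, 256>>>(...).
  __global__ void blockSum(const float *in, float *out, int n)
  {
    __shared__ float tile[256];        // small, explicitly managed cache

    int tid  = threadIdx.x;
    int gidx = blockIdx.x * blockDim.x + tid;

    // All 32 threads of a warp execute this together (32-way SIMT);
    // out-of-range threads contribute zero instead of branching away.
    tile[tid] = (gidx < n) ? in[gidx] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory.  Once the stride drops below 32,
    // only part of a warp does useful work, but the hardware still
    // issues the instruction to all 32 lanes.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
      if (tid < stride) {
        tile[tid] += tile[tid + stride];
      }
      __syncthreads();
    }

    if (tid == 0) {
      out[blockIdx.x] = tile[0];
    }
  }

Everything else about the syntax really is just C.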

The good, and the bad, of the GPU computing revolution is returning to CPUs
themselves. The rosiest future I can foresee is one in which there are
multiple stages between the single-core CPUs of 2000 and the GPUs of today,
where we can all select our price points, the transitions for coding are
incremental, and the device distinctions boil down to how much of the
silicon is for math or short-term memory. Until then, please just know
that Amber is as fast as we can make it go, and for anyone interested in
undertaking a significant MD project it is worthwhile to buy into the GPU
power offered with the licensed version.
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber