Re: [AMBER] Sufficient CPU cores/GPU ratio ?

From: Ross Walker <ross.rosswalker.co.uk>
Date: Tue, 13 Sep 2011 11:40:56 -0700

Hi Peter,

> to run GPUs it is essential to put them in PCI-E X16 slots.

This is not actually essential, at least not for single-GPU runs. Dropping to
x8 won't make a huge difference as long as NTPR and NTWX are large enough. The
X16 issue really comes into play when you want to run a parallel MPI GPU job.
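
If you want to see what the slot is actually costing you, a quick host-to-device
bandwidth measurement on the GPU in question gives a good idea. Something along
these lines - just an illustration, the 64 MB buffer and 20 iterations are
arbitrary choices and not anything the GPU code does internally:

// bw_single.cu - rough host->device bandwidth check for one GPU.
// Build with: nvcc -O2 bw_single.cu -o bw_single
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 64UL * 1024 * 1024;   /* 64 MB test buffer */
    const int    iters = 20;

    void *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);             /* pinned host memory */
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaEventRecord(t0, 0);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1, 0);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("Host->Device: %.2f GB/s\n", (double)bytes * iters / (ms * 1.0e6));
    /* With large NTPR and NTWX these transfers happen rarely. */

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}

An x8 slot will report roughly half the x16 number, but for a single GPU run
with sensible output settings you will struggle to see that in the throughput.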

> On a normal motherboard there are only 1 or 2 PCI-E X16 slots
> available, but
> there are some special motherboards on the market with additional chip
> sets,
> which supply more PCI-E X16 slots.

Yes, but these do nothing to boost the actual CPU memory bandwidth, which
becomes the limiting factor with a single-socket system and 4 GPUs all at X16
(or an IB card at X16). Try it: run a 3-GPU MPI job on the node itself and
then run a bandwidth test on the IB card. I bet you can't max out the IB
bandwidth. Dual-socket motherboards actually boost the effective CPU memory
bandwidth by giving each CPU its own memory banks - hence why they can drive
4 X16 slots flat out, two on one CPU and two on the other.
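
If you want to see that ceiling directly, kick off copies on all the GPUs at
once and look at the aggregate figure. A rough sketch along the same lines as
the single-GPU check above (the 256 MB buffers and the cap of 8 devices are
again arbitrary):

// bw_concurrent.cu - aggregate host->device bandwidth with every GPU copying at once.
// Build with: nvcc -O2 bw_concurrent.cu -o bw_concurrent
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 256UL * 1024 * 1024;  /* 256 MB per GPU */
    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);
    if (ngpu > 8) ngpu = 8;

    void *h[8], *d[8];
    cudaStream_t s[8];
    for (int i = 0; i < ngpu; ++i) {
        cudaSetDevice(i);
        cudaMallocHost(&h[i], bytes);          /* pinned, so the copies can overlap */
        cudaMalloc(&d[i], bytes);
        cudaStreamCreate(&s[i]);
    }

    auto start = std::chrono::steady_clock::now();

    /* One async copy per GPU so they all pull from host RAM at the same time. */
    for (int i = 0; i < ngpu; ++i) {
        cudaSetDevice(i);
        cudaMemcpyAsync(d[i], h[i], bytes, cudaMemcpyHostToDevice, s[i]);
    }
    for (int i = 0; i < ngpu; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(s[i]);
    }

    double sec = std::chrono::duration<double>(
                     std::chrono::steady_clock::now() - start).count();
    printf("Aggregate Host->Device: %.2f GB/s across %d GPUs\n",
           (double)bytes * ngpu / sec / 1.0e9, ngpu);
    /* On a single-socket board this stalls well below ngpu times the
       single-GPU figure: the CPU's memory controller is the shared bottleneck. */
    return 0;
}

On a dual-socket board, with the work spread across both sockets, the aggregate
climbs because each CPU brings its own memory banks to the party.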

Again though this only applies when running parallel jobs.

> We did some benchmarks on our HP SL390 system.
> This node is equipped with an additional chipset and supplies 4 PCI-E X16
> slots, one for the onboard QDR Infiniband and the others to be used by 3 GPUs.
> The performance degradation was less than 2% when we ran 3 separate GPU jobs

This is what I'd expect.

> and one non-GPU pmemd.MPI with 8 cores in parallel on this dual-CPU 6-core

Note the DUAL-CPU here. That is the key difference. Here you have twice the
effective memory bandwidth of the single socket system Marek was originally
mentioning.

> On our system, the GPU utilization was only 90% but the CPU runs at 100%,
> so using faster CPUs (with a higher clock rate than our 2.67 GHz CPUs) may
> be useful.

This won't make any difference. The CPU is pegged at 100% because it is just
spinning at a barrier waiting for the GPU kernel to complete. You can use the
slowest CPUs you want and it shouldn't make a difference for serial runs on
the GPU. Again, where the difference will appear is when running in parallel
across multiple GPUs, since faster CPUs typically have a faster front-side bus
and so more CPU memory bandwidth, which will improve things.
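
To be clear about what that 100% actually is: the CUDA runtime's default
host-side synchronisation is a spin (or spin/yield) wait, so the core looks
saturated even though it is doing no useful work. A toy illustration - this is
not AMBER code, just the mechanism:

// spinwait.cu - why a GPU-bound process can still show a core at 100%.
// Build with: nvcc -O2 spinwait.cu -o spinwait
#include <cstdio>
#include <cuda_runtime.h>

/* A do-nothing kernel that just burns time on the GPU. */
__global__ void busy(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main(void)
{
    /* The default scheduling policy spins/yields while waiting, so the host
       core polls the GPU and the OS reports it as 100% busy. Uncommenting the
       next line switches to a blocking (interrupt-driven) wait, which drops
       host CPU usage to near idle at the cost of a little wake-up latency. */
    /* cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync); */

    busy<<<1, 1>>>(2000000000LL);   /* roughly a second or two, clock dependent */
    cudaDeviceSynchronize();        /* host waits here - spinning by default */

    printf("done\n");
    return 0;
}

Run it and watch top: you will typically see one core pinned for the duration
of the kernel even though the CPU is contributing nothing, which is exactly
what you are seeing with your GPU jobs.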

Of course, once one can do true DMA-style GPU-to-GPU and GPU-to-IB transfers,
all these CPU memory bandwidth arguments go away.
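
CUDA 4.x can already do the GPU-to-GPU half of that on suitable hardware via
peer-to-peer copies (GPUDirect for the IB side is a separate, driver-level
story). A minimal sketch, assuming GPUs 0 and 1 sit where peer access is
possible:

// p2p.cu - direct GPU-to-GPU copy when the hardware and driver allow it.
// Build with: nvcc -O2 p2p.cu -o p2p
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 32UL * 1024 * 1024;

    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) {
        printf("Peer access not available between GPUs 0 and 1.\n");
        return 1;
    }

    void *d0, *d1;
    cudaSetDevice(0);
    cudaMalloc(&d0, bytes);
    cudaDeviceEnablePeerAccess(1, 0);   /* let device 0 reach device 1 */
    cudaSetDevice(1);
    cudaMalloc(&d1, bytes);
    cudaDeviceEnablePeerAccess(0, 0);

    /* The copy goes GPU to GPU over PCIe with no staging buffer in host RAM,
       which is why the CPU memory bandwidth argument above stops applying. */
    cudaMemcpyPeer(d1, 1, d0, 0, bytes);
    cudaDeviceSynchronize();

    printf("Copied %zu bytes from device 0 to device 1 peer-to-peer.\n", bytes);
    return 0;
}

Getting the MPI layer and the IB card into that picture is the part that still
has to arrive.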

All the best
Ross


/\
\/
|\oss Walker

---------------------------------------------------------
| Assistant Research Professor |
| San Diego Supercomputer Center |
| Adjunct Assistant Professor |
| Dept. of Chemistry and Biochemistry |
| University of California San Diego |
| NVIDIA Fellow |
| http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
---------------------------------------------------------

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.





_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Sep 13 2011 - 12:00:03 PDT