Re: [AMBER] Multiple simulations on one GPU

From: David Cerutti <dscerutti.gmail.com>
Date: Tue, 29 Oct 2019 13:46:10 -0400

Be sure to engage the MPS system when you do this. I still need to add this
to the website, but the commands to start it are:

set -e
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d

and to terminate the MPS daemon, do:

echo quit | nvidia-cuda-mps-control
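Putting the pieces together, a complete session might look like the following minimal sketch. The filenames (md.in, sys.prmtop, sys.inpcrd), the four-job count, and the tool-existence guard are placeholders of mine, not anything from the original post.

```shell
#!/bin/sh
# Hypothetical sketch: start MPS, run four copies of one system
# concurrently, wait for them, then shut the daemon down.
# md.in, sys.prmtop, sys.inpcrd and njobs=4 are illustrative placeholders.
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log

njobs=4
if command -v nvidia-cuda-mps-control >/dev/null 2>&1; then
  nvidia-cuda-mps-control -d             # start the MPS daemon
  i=1
  while [ "$i" -le "$njobs" ]; do
    pmemd.cuda -O -i md.in -p sys.prmtop -c sys.inpcrd \
               -o run$i.out -r run$i.rst -x run$i.nc &
    i=$((i + 1))
  done
  wait                                   # block until every copy finishes
  echo quit | nvidia-cuda-mps-control    # terminate the daemon
else
  echo "CUDA MPS tools not found; nothing launched"
fi
```

Each backgrounded pmemd.cuda process becomes a separate MPS client, so the daemon shares the one GPU among all four runs.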

The MPS system was designed for MPI-parallel programs whose ranks all
launch their own GPU kernels, but it can subdivide the resources among
multiple independent executables as well. Because it partitions the GPU
evenly between concurrent kernels, it's best to run systems of roughly the
same size (better yet, multiple copies of the same system) and to run a
number of processes that divides your GPU's SM count evenly (2 or 4 works
well for all architectures currently on the market); MPS supports at most
16 client processes. You can get up to 5x throughput on the smallest GB
(Generalized Born) systems, and perhaps 2.5x on very small PME problems.
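As a quick arithmetic illustration (my numbers, not from the post): an 80-SM card such as a Tesla V100 splits evenly under 2, 4, 8, or 16 concurrent processes, which a few lines of shell can confirm:

```shell
# Illustrative only: list process counts (up to the 16-process MPS cap)
# that divide the SM count evenly. sms=80 corresponds to a Tesla V100.
sms=80
picks=""
for n in 2 4 8 16; do
  if [ $((sms % n)) -eq 0 ]; then
    picks="$picks $n"
  fi
done
echo "even splits for $sms SMs:$picks"
```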

For systems of 100k or more atoms, even the big cards are being well
utilized and you will see little to no additional throughput.

For very small problems, I have made a GPU Generalized Born implementation
in my mdgx code that runs Amber topologies: any number of unique systems,
or replicas thereof. It uses the GPU like a miniature Beowulf cluster.
You can schedule hundreds of simulations at once and the GPU will be 100%
utilized (the NVIDIA folks tell me that the kernel uses up to 80% of the
card's instruction throughput, so it is pretty much topped out). This can
reach up to 80x the pmemd throughput on 250-atom systems (16x greater than
is possible with MPS), or more than 100x the throughput for even smaller
problems (the program will use different block sizes to load additional
problems onto each SM and continue to scale).

GPUs are remarkable devices.

Dave


On Tue, Oct 29, 2019 at 1:00 PM Piotr Fajer <pfajer.fsu.edu> wrote:

> Aiden,
>
> I routinely run 2 pmemd.cuda jobs per GPU with about a 20-30% loss versus
> a single job per GPU. Heavier loading becomes essentially linear
> timesharing, so there is little further gain. You can follow the GPU
> loading with nvidia-smi, as you know, and the mdinfo files will give you
> the ns/day throughput.
>
> Peter
>
> On 10/29/19, 11:56 AM, "Aiden Aceves" <ajaceves.gmail.com> wrote:
>
> Hello,
>
> I have one single and very powerful GPU. Is there a way to run multiple
> simulations in parallel on it? I know I could proceed in series, but
> given that the GPU is mostly underutilized in terms of memory and
> processing power (based on its power draw during the run), I thought it
> might help to run multiple simulations simultaneously.
>
> Thanks,
> Aiden Aceves
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>
Received on Tue Oct 29 2019 - 11:00:03 PDT