Peter, Dave, many thanks to you both. I performed some benchmarks and
found that the optimum on my system is 6 replicates running in parallel,
for a net 1.85x speedup over running one simulation at a time. I
appreciate your input!
On Tue, Oct 29, 2019 at 10:46 AM David Cerutti <dscerutti.gmail.com> wrote:
>
> Be sure to engage the MPS system when you do this: I still need to add this
> to the website, but the commands to start it are:
>
> set -e
> export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
> export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
> nvidia-cuda-mps-control -d
>
> and to terminate the MPS daemon, do:
>
> echo quit | nvidia-cuda-mps-control
>
> The MPS system was designed for MPI-parallel programs whose threads all
> launch their own GPU kernels, but it can subdivide the resources for
> multiple independent executables as well. Because it partitions the GPU
> evenly between concurrent kernels, it's best to run systems of roughly the
> same size (better yet, multiple copies of the same system) and to run a
> number of processes that is a multiple of your GPU SM count (2 or 4 works
> great for all architectures currently on the market), up to a maximum of
> 16 processes. You can get up to 5x throughput on the smallest GB systems,
> and perhaps 2.5x throughput on very small PME problems.
>
> For systems of 100k or more atoms, even the big cards are being well
> utilized and you will see little to no additional throughput.
>
> For very small problems, I have made a GPU Generalized Born implementation
> in my mdgx code. It runs Amber topologies, with any number of unique
> systems or replicas thereof, and uses the GPU like a miniature Beowulf
> cluster. You can schedule hundreds of simulations at once and the GPU will
> be 100% utilized (the NVIDIA folks tell me that the kernel uses up to 80%
> of the instruction throughput of the card, so it is pretty much topped
> out). This can give up to 80x the pmemd throughput on 250-atom systems
> (16x more than is possible with MPS), or more than 100x the throughput for
> even smaller problems (the program will use different block sizes to load
> additional problems onto each SM and continue to scale).
>
> GPUs are remarkable devices.
>
> Dave
>
>
> On Tue, Oct 29, 2019 at 1:00 PM Piotr Fajer <pfajer.fsu.edu> wrote:
>
> > Aiden,
> >
> > I routinely run 2 pmemd.cuda jobs per GPU with about a 20-30% loss versus
> > a single job per GPU. With heavier loading the slowdown becomes roughly
> > linear, because the jobs are simply timesharing the card. You can follow
> > the GPU load with nvidia-smi, as you know, and the mdinfo (-inf) files
> > will give you the ns/day throughput.
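> >
> > For example (directory and file names here are just placeholders, with
> > each job writing its own mdinfo file via the -inf flag):
> >
> > watch -n 10 nvidia-smi
> > grep 'ns/day' run*/md.info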
> >
> > Peter
> >
> > On 10/29/19, 11:56 AM, "Aiden Aceves" <ajaceves.gmail.com> wrote:
> >
> > Hello,
> > I have a single, very powerful GPU. Is there a way to run multiple
> > simulations in parallel on it? I know I could proceed in series, but
> > given that the GPU is mostly underutilized in terms of memory and
> > processing power (based on its power draw during a run), I thought it
> > might help to run multiple simulations simultaneously.
> > Thanks,
> > Aiden Aceves
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Oct 30 2019 - 07:30:04 PDT