Peter, Dave, many thanks to you both. I performed some benchmarks and
found that the optimum on my system is six replicates in parallel, for a
net 1.85x speedup over running one simulation at a time. I appreciate
your input!
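
In case it is useful to anyone searching the archives later, a minimal
launcher for this kind of setup looks something like the sketch below
(the input and output file names are just placeholders; adapt them to
your own runs):

#!/bin/bash
# Start the MPS control daemon (the same commands Dave gives below)
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d

# Launch six replicates of the same system in the background
for i in 1 2 3 4 5 6; do
  pmemd.cuda -O -i md.in -p system.prmtop -c system.inpcrd \
             -o md_rep${i}.out -r md_rep${i}.rst -x md_rep${i}.nc \
             -inf md_rep${i}.info &
done
wait   # block until all six runs have finished

# Shut down the MPS daemon
echo quit | nvidia-cuda-mps-control
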
On Tue, Oct 29, 2019 at 10:46 AM David Cerutti <dscerutti.gmail.com> wrote:
>
> Be sure to engage the MPS system when you do this.  I still need to add this
> to the website, but the commands to start it are:
>
> set -e
> export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
> export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
> nvidia-cuda-mps-control -d
>
> and to terminate the MPS daemon, do:
>
> echo quit | nvidia-cuda-mps-control
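>
> To check that the control daemon is alive, you can pipe other queries to it
> in the same way, e.g.
>
> echo get_server_list | nvidia-cuda-mps-control
>
> which should print the PIDs of any MPS servers that have been spawned (it
> will simply report an error if no daemon is running).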
>
> The MPS system was designed for MPI-parallel programs whose threads all
> launch their own GPU kernels, but it can subdivide the resources for
> multiple independent executables as well.  Because it partitions the GPU
> evenly between concurrent kernels, it's best to run systems of roughly the
> same size (better yet, multiple copies of the same system) and to run a
> number of processes that divides evenly into your GPU's SM count (2 or 4
> works well for all architectures currently on the market), up to a maximum
> of 16 processes.  You can get up to 5x throughput on the smallest GB
> systems, and perhaps 2.5x throughput on very small PME problems.
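>
> If you don't know your card's SM count offhand and you have the CUDA
> samples built, the deviceQuery program will report it, e.g. something like
>
> /usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery | grep -i multiprocessor
>
> (adjust the path to wherever your copy of the samples was built).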
>
> For systems of 100k or more atoms, even the big cards are being well
> utilized and you will see little to no additional throughput.
>
> For very small problems, I have made a GPU Generalized Born implementation
> in my mdgx code that runs Amber topologies, with any number of unique systems
> or replicas thereof.  It uses the GPU like a miniature Beowulf cluster.  You
> can schedule hundreds of simulations at once, and the GPU will be 100%
> utilized (well, the NVIDIA folks tell me that the kernel uses up to 80% of
> the instruction throughput of the card, so it is pretty much topped out).
> This can deliver up to 80x the pmemd throughput on 250-atom systems (16x more
> than is possible with MPS), or more than 100x the throughput for even smaller
> problems (the program will use different block sizes to load additional
> problems onto each SM and continue to scale).
>
> GPUs are remarkable devices.
>
> Dave
>
>
> On Tue, Oct 29, 2019 at 1:00 PM Piotr Fajer <pfajer.fsu.edu> wrote:
>
> > Aiden,
> >
> > I routinely run 2 pmemd.cuda jobs per GPU with about a 20-30% loss versus a
> > single job per GPU.  Heavier loading scales roughly linearly with job count
> > (no further net gain) because of time-sharing.  You can follow the GPU load
> > with nvidia-smi, as you know, and the mdinfo (-inf) files will give you the
> > ns/day throughput.
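> >
> > For example, something along these lines covers both (here the jobs are
> > assumed to run in directories rep1/, rep2/, ... and to write the default
> > mdinfo file; adjust the paths to your own -inf settings):
> >
> > watch -n 5 nvidia-smi            # GPU utilization and per-process memory
> > grep -H "ns/day" rep*/mdinfo     # current throughput of each running job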
> >
> > Peter
> >
> > On 10/29/19, 11:56 AM, "Aiden Aceves" <ajaceves.gmail.com> wrote:
> >
> >     Hello,
> >     I have a single, very powerful GPU. Is there a way to run multiple
> >     simulations in parallel on it? I know I could proceed in series, but
> >     given that the GPU is mostly underutilized in terms of memory and
> >     processing power (based on its power draw during the run), I thought
> >     it might help to run multiple simulations simultaneously.
> >     Thanks,
> >     Aiden Aceves
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Oct 30 2019 - 07:30:04 PDT