Re: [AMBER] Multi-GPU Bug in Amber20

From: Ross Walker <ross.rosswalker.co.uk>
Date: Tue, 11 Jan 2022 14:22:57 -0500

Hi James,

This is cosmetic. It comes down to how the peer-to-peer communication is done over PCI-E and the way nvidia-smi reports it: the extra entries on GPU 0 are peer-to-peer contexts, not additional compute processes.
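If you want to double check what the peer-to-peer traffic is actually going over, nvidia-smi can print the link topology directly (this is a standard nvidia-smi subcommand; the exact matrix layout depends on driver version):

  nvidia-smi topo -m

On a PCI-E-only box the GPU-to-GPU entries will show up as PIX/PXB/PHB/NODE/SYS rather than NV# links.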

It is indeed running correctly, although as you note multi-GPU performance is worse than a single GPU. The advice for the last 5+ years has been to stick with single-GPU runs: GPUs are now so fast that the communication between them can't keep up. AMBER's design has always focused on making the code as efficient as possible, so it makes such aggressive use of a single GPU that scaling to multiple GPUs is ineffective. NVLink systems offer better GPU-to-GPU bandwidth, but little effort has been put into optimizing the code for them since such systems are typically too expensive to justify it. The exceptions where using multiple GPUs for a single run makes sense are:

1) Very large (>5000 atom) implicit solvent GB simulations.
2) Replica exchange runs, where individual GPUs each run a 'full' MD simulation (see the groupfile sketch after this list).
3) TI calculations where lambda windows can be run on different GPUs.
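For case 2 (and similarly for TI lambda windows in case 3) the usual launch pattern is a groupfile, where each MPI group runs its own simulation on its own GPU. A rough sketch, with placeholder file names and without the REMD-specific settings (exchange parameters in each mdin, plus the appropriate -rem flag) that a real run would need:

  # remd.groupfile - one line per replica/group
  -O -i mdin.rep0 -o rep0.out -p JAC.prmtop -c rep0.rst
  -O -i mdin.rep1 -o rep1.out -p JAC.prmtop -c rep1.rst
  -O -i mdin.rep2 -o rep2.out -p JAC.prmtop -c rep2.rst
  -O -i mdin.rep3 -o rep3.out -p JAC.prmtop -c rep3.rst

  export CUDA_VISIBLE_DEVICES=0,1,2,3
  mpirun -np 4 $AMBERHOME/bin/pmemd.cuda_SPFP.MPI -ng 4 -groupfile remd.groupfile

With 4 ranks, 4 groups and 4 visible devices, each replica ends up on its own GPU.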

The general recommendation is that if you have 4 GPUs it's better to run 4 independent simulations than to try to run a single, slightly longer simulation across all 4.
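For example, with your 4 RTX3090s something along these lines (a minimal sketch; the output/restart names are placeholders, the serial pmemd.cuda binary is used since each run only needs one GPU, and in practice you'd give each copy its own input, e.g. a different random seed or starting structure):

  CUDA_VISIBLE_DEVICES=0 nohup $AMBERHOME/bin/pmemd.cuda -O -i mdinOPT.GPU -o run0.out -p JAC.prmtop -c JAC.inpcrd -r run0.rst &
  CUDA_VISIBLE_DEVICES=1 nohup $AMBERHOME/bin/pmemd.cuda -O -i mdinOPT.GPU -o run1.out -p JAC.prmtop -c JAC.inpcrd -r run1.rst &
  CUDA_VISIBLE_DEVICES=2 nohup $AMBERHOME/bin/pmemd.cuda -O -i mdinOPT.GPU -o run2.out -p JAC.prmtop -c JAC.inpcrd -r run2.rst &
  CUDA_VISIBLE_DEVICES=3 nohup $AMBERHOME/bin/pmemd.cuda -O -i mdinOPT.GPU -o run3.out -p JAC.prmtop -c JAC.inpcrd -r run3.rst &

The aggregate throughput from 4 independent runs like that will far exceed the 606 ns/day you are seeing from the single 3-GPU run.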

Hope that helps.

All the best
Ross

> On Jan 11, 2022, at 13:51, James Kress <jimkress_58.kressworks.org> wrote:
>
> I am running the mdinOPT.GPU benchmark on my 4 RTX3090 GPU system. It has
> dual AMD CPUs each with 64 cores. The system also has 2TB of RAM.
>
> The OS is RHEL 8 and Amber20 is compiled with CUDA kit 11.4.1 and OpenMPI
> 4.1.1 to give $AMBERHOME/bin/pmemd.cuda_SPFP.MPI
>
> When I set export CUDA_VISIBLE_DEVICES=0,1,2 and run this command:
>
> mpirun -np 3 $AMBERHOME/bin/pmemd.cuda_SPFP.MPI -i mdinOPT.GPU -o
> JACOPT_3.out -p JAC.prmtop -c JAC.inpcrd
>
> I observe in nvidia-smi that 3 processes are routed to GPU 0 with one
> process routed to GPU 1 and another to GPU 2 for a total of 5 GPU processes.
> See attached screen shot (if it clears the posting process).
>
> Why are 3 processes being routed to GPU 0? I also tried the same command
> after setting OMP_NUM_THREADS=1 but that made no difference. Neither did
> mpirun --bind-to none -np 3 ...
>
> This behavior is not observed when I set export CUDA_VISIBLE_DEVICES=0,1
> and run this command:
>
> mpirun -np 2 $AMBERHOME/bin/pmemd.cuda_SPFP.MPI -i mdinOPT.GPU -o
> JACOPT_0_1.out -p JAC.prmtop -c JAC.inpcrd
>
> or
>
> export CUDA_VISIBLE_DEVICES=2,3
> mpirun -np 2 $AMBERHOME/bin/pmemd.cuda_SPFP.MPI -i mdinOPT.GPU -o
> JACOPT_2_3.out -p JAC.prmtop -c JAC.inpcrd
>
> What I get there is 2 GPU processes on each GPU.
>
> Is there something else I need to do to get only 1 process per GPU or is
> this normal behavior?
>
> Also note, the performance for the 2 GPU runs is ns/day = 1308.25 while
> for the 3 GPU run it is ns/day = 606.08.
>
> Any suggestions on what I am doing wrong and how to fix it?
>
> Thanks.
>
> Jim Kress
>
> <3GPUSnipImage.JPG>


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Jan 11 2022 - 11:30:22 PST