Re: [AMBER] Multi-GPU Bug in Amber20

From: James Kress <jimkress_58.kressworks.org>
Date: Tue, 11 Jan 2022 18:57:03 -0500

Hello,

I noticed another oddity in the GPU output section.

------------------- GPU DEVICE INFO --------------------
|
| CUDA_VISIBLE_DEVICES: 3
| CUDA Capable Devices Detected: 1
| CUDA Device ID in use: 0
| CUDA Device Name: NVIDIA GeForce RTX 3090
| CUDA Device Global Mem Size: 24268 MB
| CUDA Device Num Multiprocessors: 82
| CUDA Device Core Freq: 1.70 GHz
|
|--------------------------------------------------------

This apparent anomaly came up while I was benchmarking each GPU
individually.

I monitored the Amber pmemd.cuda process using nvidia-smi; GPU 3 was the
only active GPU. I had set CUDA_VISIBLE_DEVICES=3, and Amber picks that up
OK. However, the device ID in use by Amber is reported as 0. Shouldn't that
be 3?
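
For context, the comparison I am making is roughly the following (the output
file name below is just illustrative):

nvidia-smi -L                   # lists all four physical cards as GPU 0-3
export CUDA_VISIBLE_DEVICES=3   # restrict the run to the fourth physical card
$AMBERHOME/bin/pmemd.cuda -i mdinOPT.GPU -o JACOPT_gpu3.out -p JAC.prmtop -c JAC.inpcrd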

Thanks.

Jim


-----Original Message-----
From: James Kress <jimkress_58.kressworks.org>
Sent: Tuesday, January 11, 2022 3:15 PM
To: 'Ross Walker' <ross.rosswalker.co.uk>; 'AMBER Mailing List'
<amber.ambermd.org>
Subject: RE: [AMBER] Multi-GPU Bug in Amber20

Hi Ross,

Thanks for the reply and description of what is going on.

Is there a specific way to invoke NVLink, since my system does have it?

Thanks again.

Jim

-----Original Message-----
From: Ross Walker <rosscwalker.gmail.com> On Behalf Of Ross Walker
Sent: Tuesday, January 11, 2022 2:23 PM
To: James Kress <jimkress_58.kressworks.org>; AMBER Mailing List
<amber.ambermd.org>
Subject: Re: [AMBER] Multi-GPU Bug in Amber20

Hi James,

This is cosmetic. It has to do with how the peer-to-peer communication is
done over PCI-E and the way nvidia-smi reports it.
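
If you're curious what path that peer-to-peer traffic actually takes between
each pair of cards, the driver will print the interconnect topology for you,
independent of Amber:

# Interconnect matrix for every GPU pair: NV# = NVLink,
# PIX/PXB/PHB = PCI-E paths, SYS = across the CPU interconnect.
nvidia-smi topo -m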

It is indeed running correctly, although, as you note, performance on
multiple GPUs is worse than on a single GPU. The advice for the last 5+ years
has been to stick with single-GPU runs, since GPUs are now so fast that the
communication between them is too slow to keep up. AMBER has always been
designed to be as efficient as possible, so the code makes such aggressive
use of a single GPU that scaling to multiple GPUs is ineffective. NVLink
systems offer better inter-GPU bandwidth, but little effort has been expended
on optimizing the code for them, since such systems are typically too
expensive to justify the effort. Exceptions to using multiple GPUs for a
single run are:

1) Very large (>5,000 atom) implicit solvent GB simulations.
2) Replica exchange runs where individual GPUs run a 'full' MD simulation.
3) TI calculations where lambda windows can be run on different GPUs.

The general recommendation is that if you have 4 GPUs, it's better to run 4
independent simulations than to try to run a single, slightly longer
simulation across all 4 GPUs; a sketch of that pattern is below.
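
In practice that just means pinning one single-GPU job to each card, along
these lines (input and output names here are placeholders):

# One independent pmemd.cuda job pinned to each physical card; give every
# run its own output, restart and trajectory files so they don't clobber
# each other.
for i in 0 1 2 3; do
    CUDA_VISIBLE_DEVICES=$i $AMBERHOME/bin/pmemd.cuda -i md.in -p sys.prmtop -c sys.inpcrd \
        -o md_gpu$i.out -r md_gpu$i.rst -x md_gpu$i.nc &
done
wait

Each job then gets a whole GPU to itself, and the aggregate throughput is far
higher than running one system across all four cards.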

Hope that helps.

All the best
Ross

> On Jan 11, 2022, at 13:51, James Kress <jimkress_58.kressworks.org> wrote:
>
> I am running the mdinOPT.GPU benchmark on my 4 RTX3090 GPU system. It
> has dual AMD CPUs each with 64 cores. The system also has 2TB of RAM.
>
> The OS is RHEL 8 and Amber20 is compiled with CUDA kit 11.4.1 and
> OpenMPI
> 4.1.1 to give $AMBERHOME/bin/pmemd.cuda_SPFP.MPI
>
> When I set export CUDA_VISIBLE_DEVICES=0,1,2 and run this command:
>
> mpirun -np 3 $AMBERHOME/bin/pmemd.cuda_SPFP.MPI -i mdinOPT.GPU -o
> JACOPT_3.out -p JAC.prmtop -c JAC.inpcrd
>
> I observe in nvidia-smi that 3 processes are routed to GPU 0 with one
> process routed to GPU 1 and another to GPU 2 for a total of 5 GPU
> processes.
> See attached screen shot (if it clears the posting process).
>
> Why are 3 processes being routed to GPU 0? I also tried the same
> command after setting OMP_NUM_THREADS=1 but that made no difference.
> Neither did mpirun --bind-to none -np 3 ...
>
> This behavior is not observed when I set export
> CUDA_VISIBLE_DEVICES=0,1 and run this command:
>
> mpirun -np 2 $AMBERHOME/bin/pmemd.cuda_SPFP.MPI -i mdinOPT.GPU -o
> JACOPT_0_1.out -p JAC.prmtop -c JAC.inpcrd
>
> or
>
> export CUDA_VISIBLE_DEVICES=2,3
> mpirun -np 2 $AMBERHOME/bin/pmemd.cuda_SPFP.MPI -i mdinOPT.GPU -o
> JACOPT_2_3.out -p JAC.prmtop -c JAC.inpcrd
>
> What I get there is 2 GPU processes on each GPU.
>
> Is there something else I need to do to get only 1 process per GPU or
> is this normal behavior?
>
> Also note that the performance for the 2-GPU runs is 1308.25 ns/day, while
> for the 3-GPU run it is 606.08 ns/day.
>
> Any suggestions on what I am doing wrong and how to fix it?
>
> Thanks.
>
> Jim Kress
>
> <3GPUSnipImage.JPG>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Jan 11 2022 - 16:00:02 PST