Re: [AMBER] GPU bug for replica MD with pmemd.cuda.MPI

From: Adrian Roitberg via AMBER <amber.ambermd.org>
Date: Wed, 17 Jan 2024 09:33:20 -0500

Hi

What happens if you try running the SAME script, but without using remd,
just 8 jobs ? Do they run ?

BTW: even if it worked for you in the past, it is not a good idea to run
more than one gpu job in the same gpu. Amber will swap jobs, and really
slow down things.

Adrian


On 1/16/24 10:18 PM, Zhenquan Hu via AMBER wrote:
> [External Email]
>
> Note: For the replica MD jobs I submitted to older 8 * RTX 2080ti nodes,
> normally I used 32~48 replicas in total. Each GPU equally takes 4~6
> subjobs, not only one single. No more data-exchange GPU memory usage on
> GPU0.
>
> Zhenquan Hu <zhqhu.sioc.gmail.com> 于2024年1月17日周三 09:26写道:
>
>> Dear all,
>>
>> Recently I tried to run multiple replica MD simulations with
>> pmemd.cuda.MPI on a single machine with 4 GPU cards ( NVIDIA RTX A6000).
>> There are 8 replicas in total. GPU1-3 cards each takes 2 jobs, which is
>> normal. But GPU0 takes 8 jobs in total. 2 of them are the same as the other
>> GPUs, but for the other 6 jobs each requires ~1/3 of GPU memory, these
>> jobs seemed like for data exchange, because each of them has the same
>> unique PID which could be found in GPU1-3 PIDs. As a result, GPU0 needs
>> much more GPU memory for replica version MD simulations.
>> I have tried this kind of MD simulations on some older machines (NVIDIA
>> RTX 2080Ti, 8 GPU cards on a single node, CUDA 10.2), on which GPU0 took
>> the same GPU memory compared with other GPUs, no memory usage for data
>> exchaunge.
>> Because the least supported CUDA version for RTX A6000 is CUDA 11, I tried
>> with both CUDA 11.6 and CUDA 11.7, and also tried with another node with
>> RTX 3080 which also needs CUDA 11 at least. All need quite a lot of GPU
>> memory on GPU0 for data-exchange.
>> I suppose it should be a multi-GPU bug as cpu core could not communicate
>> with GPU1-3 directly but via GPU0 instead, am I right?
>> Is there any way to solve this problem either from pmemd or NVIDIA?
>>
>> Best regards,
>> Zhenquan Hu
>>
>> [image: gpuinfo02.png]
>> [image: gpuinfo01.png]
>>
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber

-- 
Dr. Adrian E. Roitberg
V.T. and Louise Jackson Professor in Chemistry
Department of Chemistry
University of Florida
roitberg.ufl.edu
352-392-6972
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Jan 17 2024 - 07:00:02 PST
Custom Search