Re: [AMBER] ghost pmemd.cuda process ?

From: Jordi Bujons <jordi.bujons.iqac.csic.es>
Date: Tue, 25 Jan 2022 16:52:59 +0100

Hi Ross and thanks for your answer.

I have just recompiled AMBER 20 + AmberTools 21 on one of my workstations,
with the latest patches as of yesterday. This workstation runs
OpenSuse 15.3 (kernel 5.3.18-59.27-default), gcc 7.5.0 (SUSE Linux)
and CUDA 11.2.
The tests ran mostly OK, as shown:

> tail test_amber_cuda/2022-01-25_11-30-40.log
make[2]: Entering directory '/Programs/amber21/test'

Finished CUDA test suite for Amber 20 at Tue 25 Jan 2022 11:34:49 CET.

make[2]: Leaving directory '/Programs/amber21/test'
241 file comparisons passed
0 file comparisons failed
0 tests experienced errors
Test log file saved as /Programs/amber21/logs/test_amber_cuda/2022-01-25_11-30-40.log
No test diffs to save!

> tail test_at_cuda/at_summary
27 file comparisons passed
3 file comparisons failed (3 of which can be ignored)
0 tests experienced errors
Test log file saved as /Programs/amber21///logs/test_at_cuda/2022-01-25_11-29-37.log
Test diffs file saved as /Programs/amber21///logs/test_at_cuda/2022-01-25_11-29-37.diff


After that, I ran a test calculation with pmemd.cuda without specifying
anything (note that Persistence and Exclusive compute modes are on by default
on this machine), directly and without going through the queueing system, by
executing

> $AMBERHOME/bin/pmemd.cuda -O -i prod.in -o prod5.out -p ../COMPLEX.solv.prmtop -c prod4.rst -r prod5.rst -x prod5.nc &
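
(For reference, the driver modes mentioned above can be confirmed with standard nvidia-smi queries; nothing here is AMBER-specific, it is just a quick check:)

> # report persistence and compute mode for every GPU
> nvidia-smi --query-gpu=index,name,persistence_mode,compute_mode --format=csv
> # equivalent detailed view (look for "Compute Mode" in the output)
> nvidia-smi -q -d COMPUTE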

Now, the command nvidia-smi shows

> nvidia-smi
Tue Jan 25 15:43:57 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...   On  | 00000000:03:00.0 Off |                  N/A |
| 20%   36C    P8    10W / 250W |    226MiB / 11176MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...   On  | 00000000:81:00.0 Off |                  N/A |
| 68%   83C    P2   202W / 250W |    611MiB / 11178MiB |     97%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      7647      G   /usr/bin/X                         35MiB |
|    0   N/A  N/A      7671      G   /usr/bin/sddm-greeter              29MiB |
|    0   N/A  N/A     28850      C   ...ms/amber21/bin/pmemd.cuda      155MiB |
|    1   N/A  N/A     28850      C   ...ms/amber21/bin/pmemd.cuda      607MiB |
+-----------------------------------------------------------------------------+

This shows two pmemd.cuda entries occupying both GPUs, although the one on
GPU 0 (same PID, 28850) is doing nothing: indeed, running the top command
shows only one pmemd.cuda process. And if I try to launch a second
pmemd.cuda, I get the error

> Error selecting compatible GPU all CUDA-capable devices are busy or unavailable

This "ghost" pmemd.cuda is still there even after several minutes, making
impossible to launch a second calculation. So the problem has nothing to do
with the queuing system.
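
(A quick way to see from the shell which processes actually hold a context on each GPU, using standard nvidia-smi and fuser invocations; included here only as a diagnostic sketch:)

> # list compute processes per GPU as plain CSV
> nvidia-smi --query-compute-apps=gpu_bus_id,pid,process_name,used_memory --format=csv
> # cross-check which PIDs have the NVIDIA device files open (needs psmisc)
> fuser -v /dev/nvidia*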

In contrast, if I set the CUDA_VISIBLE_DEVICES environment variable, I don't
see the "ghost" pmemd.cuda:

> export CUDA_VISIBLE_DEVICES=0
> $AMBERHOME/bin/pmemd.cuda -O -i prod.in -o prod5.out -p ../COMPLEX.solv.prmtop -c prod4.rst -r prod5.rst -x prod5.nc &
[1] 28961
> nvidia-smi
Tue Jan 25 15:58:18 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...   On  | 00000000:03:00.0 Off |                  N/A |
| 20%   54C    P2   189W / 250W |    678MiB / 11176MiB |     97%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...   On  | 00000000:81:00.0 Off |                  N/A |
| 45%   56C    P8    13W / 250W |      2MiB / 11178MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      7647      G   /usr/bin/X                         35MiB |
|    0   N/A  N/A      7671      G   /usr/bin/sddm-greeter              29MiB |
|    0   N/A  N/A     28961      C   ...ms/amber21/bin/pmemd.cuda      607MiB |
+-----------------------------------------------------------------------------+

After this, if I set CUDA_VISIBLE_DEVICES=1, I can launch a second
pmemd.cuda calculation that runs nicely without perturbing the previous
process, as expected.
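
In other words, the workaround amounts to pinning each run to one device from its own shell, for example (same commands as above; the file names of the second run are purely illustrative):

> # shell 1: first run pinned to GPU 0
> export CUDA_VISIBLE_DEVICES=0
> $AMBERHOME/bin/pmemd.cuda -O -i prod.in -o prod5.out -p ../COMPLEX.solv.prmtop -c prod4.rst -r prod5.rst -x prod5.nc &
>
> # shell 2: second run pinned to GPU 1
> export CUDA_VISIBLE_DEVICES=1
> $AMBERHOME/bin/pmemd.cuda -O -i prod.in -o prod6.out -p ../COMPLEX.solv.prmtop -c prod5.rst -r prod6.rst -x prod6.nc &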

I am pretty sure that this started after installing AMBER 20 several months
ago (I have simply been using the CUDA_VISIBLE_DEVICES workaround since then),
and that it did not happen with the previous versions we had installed
(AMBER 18 and 16). I have checked that the same problem is reproduced on 5
different machines that are essentially equivalent (i.e. same OS, gcc and
CUDA versions).

So, do you think this could be due to an incompatible combination of OS, gcc
and CUDA, and that updating gcc and CUDA to the versions you mentioned
(GCC 9.3.1, CUDA 11.4) and recompiling AMBER could solve it?
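
For what it is worth, this is roughly how I would point the rebuild at a newer CUDA toolkit (a sketch only; the toolkit path and source location are assumptions for these machines, and the compiler selection is set inside run_cmake):

> # make the newer toolkit the one picked up at configure time (nvcc from PATH)
> export CUDA_HOME=/usr/local/cuda-11.4
> export PATH=$CUDA_HOME/bin:$PATH
> cd /Programs/amber21_src/build   # hypothetical source location
> ./run_cmake                      # standard AMBER cmake wrapper; -DCUDA=TRUE enabled in it
> make install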

Best regards,

Jordi





Message: 6
Date: Mon, 24 Jan 2022 09:45:16 -0500
From: Ross Walker <ross.rosswalker.co.uk>
To: AMBER Mailing List <amber.ambermd.org>
Subject: Re: [AMBER] ghost pmemd.cuda process ?
Message-ID: <0EB4926B-5ED6-4044-AF0A-B0730F650B5B.rosswalker.co.uk>
Content-Type: text/plain; charset=us-ascii

Hi Jordi,

I am unable to replicate the behavior you describe. Running AMBER 20 +
AmberTools 21 (latest patches as of yesterday), GCC 9.3.1, CUDA 11.4, and a
single-process pmemd.cuda on a system with 2 GPUs, I get:

1 x run on GPU 0
Mon Jan 24 06:34:51 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 470.74       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 74%   72C    P2   208W / 310W |     287MiB / 7977MiB |     95%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:21:00.0 Off |                  N/A |
|  0%   54C    P0     1W / 310W |       0MiB / 7982MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     99756      C   ...al/amber20/bin/pmemd.cuda      285MiB |
+-----------------------------------------------------------------------------+

1x run on GPU 1
Mon Jan 24 06:37:02 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 470.74       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 47%   58C    P0     1W / 310W |       0MiB / 7977MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:21:00.0 Off |                  N/A |
| 70%   71C    P2   221W / 310W |     287MiB / 7982MiB |     95%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A     99858      C   ...al/amber20/bin/pmemd.cuda      285MiB |
+-----------------------------------------------------------------------------+


2x individual runs on each of the two GPUs
Mon Jan 24 06:42:44 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 470.74       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 83%   80C    P2   216W / 310W |     287MiB / 7977MiB |     95%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:21:00.0 Off |                  N/A |
| 73%   73C    P2   225W / 310W |     287MiB / 7982MiB |     95%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     99987      C   ...al/amber20/bin/pmemd.cuda      285MiB |
|    1   N/A  N/A     99988      C   ...al/amber20/bin/pmemd.cuda      285MiB |
+-----------------------------------------------------------------------------+

You should only see additional processes when running pmemd.cuda.MPI, since
that is part of the way the code communicates using peer-to-peer copies. When
running single-GPU pmemd.cuda you should only see the one process.
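
For comparison, a multi-GPU run would look something like the following (a sketch only; the input/output names are placeholders), and there you would expect one pmemd.cuda.MPI process per GPU:

> export CUDA_VISIBLE_DEVICES=0,1
> mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i prod.in -o prod.out -p prmtop -c inpcrd -r restrt -x mdcrd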

Note, with regard to your CUDA_VISIBLE_DEVICES workaround: this is something
the queuing system should be setting automatically for you, based on which
GPU consumable resource it allocates. The alternative is for the queuing
system to use cgroups to essentially virtualize the resources and prevent
jobs from using resources outside of those allocated. If neither of those
things is happening, it suggests that the queuing system is not properly
configured for nodes with GPUs.
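
As an illustration (assuming Slurm, since the scheduler in use was not named): with the GPUs declared as a gres, a job script like the one below gets CUDA_VISIBLE_DEVICES set by the scheduler, and adding ConstrainDevices=yes to cgroup.conf (with the task/cgroup plugin) hides the unallocated GPUs from the job entirely. The file names just mirror your earlier command.

#!/bin/bash
#SBATCH --job-name=prod5
#SBATCH --gres=gpu:1    # request one GPU; Slurm exports CUDA_VISIBLE_DEVICES for the job
$AMBERHOME/bin/pmemd.cuda -O -i prod.in -o prod5.out -p ../COMPLEX.solv.prmtop -c prod4.rst -r prod5.rst -x prod5.nc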

All the best
Ross



_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Jan 25 2022 - 08:00:02 PST