[AMBER] ghost pmemd.cuda process ?

From: Jordi Bujons <jordi.bujons.iqac.csic.es>
Date: Sat, 22 Jan 2022 22:02:53 +0100

Dear all,

 

I am running a series of simulations on several nodes of a cluster that are
equipped with different numbers and types of GPUs. I had done this in the
past and it used to be very simple, since all nodes are set with Persistence
and Compute Exclusive modes on, so that any number of GPU jobs could be
launched through the queue system (SGE 6.2) and distributed among the nodes
with available GPUs, each job using a single GPU, without the user having to
do anything else.
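
(For reference, the node configuration corresponds roughly to the sketch
below; the exact commands our admins use may differ, but these are the
standard nvidia-smi switches for the two modes:)

# Sketch: enable the two modes on every GPU of a node (run as root)
nvidia-smi -pm 1                  # Persistence mode on
nvidia-smi -c EXCLUSIVE_PROCESS   # Compute Exclusive (one process per GPU)

# Verify the current settings:
nvidia-smi --query-gpu=index,persistence_mode,compute_mode --format=csv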

 

However, I recently found that this no longer works, because each GPU job
seems to start two pmemd.cuda processes: one that performs the actual
calculation, and a second one that does essentially nothing but keeps a GPU
busy and unavailable to other jobs. Thus, running nvidia-smi after launching
a single pmemd.cuda calculation gives the following output:

 

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...   On  | 00000000:03:00.0 Off |                  N/A |
| 20%   36C    P8     9W / 250W |    226MiB / 11176MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...   On  | 00000000:81:00.0 Off |                  N/A |
| 67%   84C    P2   212W / 250W |    627MiB / 11178MiB |     98%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2121      G   /usr/bin/X                         35MiB |
|    0   N/A  N/A      2165      G   /usr/bin/sddm-greeter              29MiB |
|    0   N/A  N/A     31248      C   ...ms/amber21/bin/pmemd.cuda      155MiB |
|    1   N/A  N/A     31248      C   ...ms/amber21/bin/pmemd.cuda      623MiB |
+-----------------------------------------------------------------------------+

 

As shown, GPU 0 is occupied by a pmemd.cuda process showing 0% GPU-Util,
while GPU 1 is running at almost 100%. The calculation itself runs just
fine, but trying to launch a second job on this node fails because no GPUs
appear to be available.
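
A quick way to see that it is really one and the same process holding both
GPUs (and not two independent ones) is the standard nvidia-smi compute-apps
query:

# The same PID (31248 above) is listed once per GPU it holds:
nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name,used_memory --format=csv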

 

The only way I have found to use all the available GPUs is to set the
CUDA_VISIBLE_DEVICES environment variable. With it set, each job runs a
single pmemd.cuda process and occupies a single GPU. The inconvenience is
that launching a relatively large number of simulations through the queue
system requires setting CUDA_VISIBLE_DEVICES for each job, and jobs may
fail to run even when GPUs are available, because the free GPUs' IDs differ
from the one that was set.
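
In principle the GPU selection could be automated with a small wrapper
around each job. The following is only a sketch (the pmemd.cuda arguments
are placeholders for each job's real files, and it is racy if two jobs
start on a node at the same instant):

#!/bin/bash
# Sketch: pin the job to the first GPU that has no compute process.
# Assumes the nvidia-smi query flags available in driver 460.x.

unset CUDA_VISIBLE_DEVICES

# UUIDs of GPUs that already run a compute process
busy=$(nvidia-smi --query-compute-apps=gpu_uuid --format=csv,noheader | sort -u)

# Take the first GPU whose UUID does not appear in the busy list
while IFS=', ' read -r idx uuid; do
    if ! printf '%s\n' "$busy" | grep -qF "$uuid"; then
        export CUDA_VISIBLE_DEVICES=$idx
        break
    fi
done < <(nvidia-smi --query-gpu=index,uuid --format=csv,noheader)

[ -n "$CUDA_VISIBLE_DEVICES" ] || { echo "no free GPU on $(hostname)" >&2; exit 1; }

# Placeholder file names; substitute each job's actual inputs/outputs
exec pmemd.cuda -O -i md.in -p prmtop -c inpcrd -o md.out -r md.rst -x md.nc

A cleaner long-term solution would presumably be an SGE consumable resource
that hands each job a device ID, but a wrapper like this at least avoids
setting CUDA_VISIBLE_DEVICES by hand.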

 

I do not know whether this is a problem with a faulty compilation, with the
NVIDIA driver, or with something else. Has anybody run into the same
problem, and is there a fix for it? Any suggestion will be appreciated.

 

Regards,

 

Jordi

 

 

--------------------------------------------------------------------------------------
Jordi Bujons, PhD
Dept. of Biological Chemistry (QB)
Institute of Advanced Chemistry of Catalonia (IQAC)
National Research Council of Spain (CSIC)
Address: Jordi Girona 18-26, 08034 Barcelona, Spain
Phone: +34 934006100 ext. 1291
FAX: +34 932045904
jordi.bujons.iqac.csic.es
http://www.iqac.csic.es

 

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Sat Jan 22 2022 - 13:30:02 PST