Re: [AMBER] ghost pmemd.cuda process ?

From: Ross Walker <ross.rosswalker.co.uk>
Date: Mon, 24 Jan 2022 09:45:16 -0500

Hi Jordi,

I am unable to replicate the behavior you describe. Running AMBER 20 + AmberTools 21 (latest patches as of yesterday), GCC 9.3.1, CUDA 11.4, and single-process pmemd.cuda on a system with 2 GPUs, I get:

1 x run on GPU 0
Mon Jan 24 06:34:51 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 470.74       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 74%   72C    P2   208W / 310W |    287MiB /  7977MiB |     95%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:21:00.0 Off |                  N/A |
|  0%   54C    P0     1W / 310W |      0MiB /  7982MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     99756      C   ...al/amber20/bin/pmemd.cuda      285MiB |
+-----------------------------------------------------------------------------+

1 x run on GPU 1
Mon Jan 24 06:37:02 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 470.74       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 47%   58C    P0     1W / 310W |      0MiB /  7977MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:21:00.0 Off |                  N/A |
| 70%   71C    P2   221W / 310W |    287MiB /  7982MiB |     95%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A     99858      C   ...al/amber20/bin/pmemd.cuda      285MiB |
+-----------------------------------------------------------------------------+


2 x individual runs, one on each of the two GPUs
Mon Jan 24 06:42:44 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 470.74       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 83%   80C    P2   216W / 310W |    287MiB /  7977MiB |     95%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:21:00.0 Off |                  N/A |
| 73%   73C    P2   225W / 310W |    287MiB /  7982MiB |     95%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     99987      C   ...al/amber20/bin/pmemd.cuda      285MiB |
|    1   N/A  N/A     99988      C   ...al/amber20/bin/pmemd.cuda      285MiB |
+-----------------------------------------------------------------------------+

You should only see additional processes when running pmemd.cuda.MPI, since that is part of the way the code communicates using peer-to-peer copies. When running single-GPU pmemd.cuda you should see only the one process.
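
As a quick way to confirm that on a node, something along these lines should show exactly one compute process for a single-GPU run (a minimal sketch only; the md.in/prmtop/inpcrd file names and the use of $AMBERHOME are placeholders, not anything from your setup):

export CUDA_VISIBLE_DEVICES=0      # pin the run to GPU 0
$AMBERHOME/bin/pmemd.cuda -O -i md.in -p prmtop -c inpcrd -o md.out -r md.rst &

sleep 30                           # give the run time to create its GPU context
nvidia-smi --query-compute-apps=gpu_bus_id,pid,process_name --format=csv

For a non-MPI run that query should return a single line per job, on the bus ID of the GPU you selected.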

Note, with regard to your CUDA_VISIBLE_DEVICES workaround: this is something the queuing system should be setting automatically for you, based on which GPU consumable resource it allocates. The alternative is for the queuing system to use cgroups to essentially virtualize the resources and prevent jobs from using resources outside of those allocated. If neither of those things is happening, it suggests that the queuing system is not properly configured for nodes with GPUs.
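
If the scheduler cannot be made to export the granted device itself, a stopgap that avoids hand-setting IDs per job is to let the job script probe for an idle GPU before starting pmemd.cuda. A rough sketch of such an SGE job script is below; the gpu=1 consumable name is hypothetical (it depends entirely on how your queues are configured), the probe is race-prone compared with a proper consumable/cgroup setup, and the pmemd.cuda file names are placeholders:

#!/bin/bash
#$ -N pmemd_gpu
#$ -l gpu=1       # hypothetical consumable; the real name depends on your SGE config

# Pick the first GPU that currently reports no compute processes.
for i in $(nvidia-smi --query-gpu=index --format=csv,noheader); do
    if [ -z "$(nvidia-smi -i "$i" --query-compute-apps=pid --format=csv,noheader)" ]; then
        export CUDA_VISIBLE_DEVICES=$i
        break
    fi
done

$AMBERHOME/bin/pmemd.cuda -O -i md.in -p prmtop -c inpcrd -o md.out -r md.rst -x md.nc

With Compute Exclusive mode on, the worst case when two jobs race for the same GPU is that the second one fails at startup rather than sharing the device, but having the queuing system hand out the device IDs is still the cleaner solution.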

All the best
Ross

> On Jan 22, 2022, at 16:02, Jordi Bujons <jordi.bujons.iqac.csic.es> wrote:
>
> Dear all,
>
>
>
> I am running a series of simulations on several nodes of a cluster which are
> equipped with different numbers and types of GPUs. I had done this in the past
> and it used to be very simple, since all nodes are set with Persistence and
> Compute Exclusive modes on, such that any number of GPU jobs could be
> launched through the queue system (SGE 6.2) and the jobs would be
> distributed among the nodes with available GPUs, using a single GPU per job
> and without doing anything else on the user side.
>
>
>
> However, more recently I found that this is not possible anymore, because
> each GPU job seems to initiate 2 pmemd.cuda processes: one which performs
> the actual calculation and a second one that does little but keeps one GPU
> busy and unavailable for other jobs. Thus, if I run the command nvidia-smi
> after launching a single pmemd.cuda calculation, this is the output:
>
>
>
> +-----------------------------------------------------------------------------+
> | NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |                               |                      |               MIG M. |
> |===============================+======================+======================|
> |   0  GeForce GTX 108...  On   | 00000000:03:00.0 Off |                  N/A |
> | 20%   36C    P8     9W / 250W |    226MiB / 11176MiB |      0%   E. Process |
> |                               |                      |                  N/A |
> +-------------------------------+----------------------+----------------------+
> |   1  GeForce GTX 108...  On   | 00000000:81:00.0 Off |                  N/A |
> | 67%   84C    P2   212W / 250W |    627MiB / 11178MiB |     98%   E. Process |
> |                               |                      |                  N/A |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                                  |
> |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
> |        ID   ID                                                   Usage      |
> |=============================================================================|
> |    0   N/A  N/A      2121      G   /usr/bin/X                         35MiB |
> |    0   N/A  N/A      2165      G   /usr/bin/sddm-greeter              29MiB |
> |    0   N/A  N/A     31248      C   ...ms/amber21/bin/pmemd.cuda      155MiB |
> |    1   N/A  N/A     31248      C   ...ms/amber21/bin/pmemd.cuda      623MiB |
> +-----------------------------------------------------------------------------+
>
>
>
> As shown, GPU ID 0 is busy with a pmemd.cuda process that shows 0% GPU-Util,
> while GPU ID 1 is running at almost 100%. The calculation runs just fine, but
> trying to launch a second job on this node fails because there are no
> available GPUs.
>
>
>
> The only way that I found to use all the available GPUs is by setting the
> CUDA_VISIBLE_DEVICES variable. Doing this, each job runs a single
> pmemd.cuda process and occupies a single GPU. The inconvenience of this is
> that launching a relatively large number of simulations through the queue
> system requires setting CUDA_VISIBLE_DEVICES in each case, and it is
> possible that jobs won't run even if there are GPUs available, because
> those GPUs have different IDs than the one set.
>
>
>
> I do not know if this is a problem with a wrong compilation, with the NVIDIA
> driver, or with something else. I wonder if somebody has found the same
> problem and whether there is any fix for it. Any suggestion will be
> appreciated.
>
>
>
> Regards,
>
>
>
> Jordi
>
>
>
>
>
> --------------------------------------------------------------------------------------
> Jordi Bujons, PhD
> Dept. of Biological Chemistry (QB)
> Institute of Advanced Chemistry of Catalonia (IQAC)
> National Research Council of Spain (CSIC)
> Address: Jordi Girona 18-26, 08034 Barcelona, Spain
> Phone: +34 934006100 ext. 1291
> FAX: +34 932045904
> jordi.bujons.iqac.csic.es
> http://www.iqac.csic.es
>
>
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Jan 24 2022 - 07:00:02 PST