Re: [AMBER] Quenstions about pmemd.cuda.MPI from Pengfei Li on 2016-11-06 (Amber Archive Nov 2016)

From: Pengfei Li <lipengfei_mail.126.com>
Date: Mon, 7 Nov 2016 14:27:21 +0800 (CST)

Dear Ross,
Thanks for your reply.
With two GPU cards, the parallel calculation works.
But with four GPU cards, the parallel calculation fails and the machine crashes.

Here is part of my job submission script:
#!/bin/sh
export CUDA_VISIBLE_DEVICES="0,1,2,3"
.........
mpirun -np 4 $AMBERHOME/bin/pmemd.cuda.MPI -i md.in -c heat.rst7 -p complex_dc.parm7 -O -o md001.out -inf md001.info -r md001.rst7 -x md001.nc -l md001.log </dev/null
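A minimal sketch of keeping the MPI rank count in step with the device list, assuming (as in the commands in this thread) one rank per visible GPU. The helper count_visible_devices and the NGPU variable are illustrative names, not part of Amber or CUDA, and the last line only prints the launch command rather than running it:

```shell
#!/bin/sh
# Count the comma-separated entries in a CUDA_VISIBLE_DEVICES-style list.
# (Hypothetical helper, for illustration only.)
count_visible_devices() {
    # An empty list means no devices are visible.
    [ -n "$1" ] || { echo 0; return; }
    printf '%s\n' "$1" | awk -F',' '{ print NF }'
}

export CUDA_VISIBLE_DEVICES="0,1,2,3"
NGPU=$(count_visible_devices "$CUDA_VISIBLE_DEVICES")

# Print the command that would be launched: one MPI rank per visible GPU.
echo "mpirun -np $NGPU \$AMBERHOME/bin/pmemd.cuda.MPI ..."
```

Deriving -np from the list means the two values cannot drift apart when the device list is edited.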
Then nvidia-smi reported the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:0F:00.0     Off |                    0 |
| N/A   21C    P0    53W / 149W |    255MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:10:00.0     Off |                    0 |
| N/A   25C    P0    69W / 149W |    255MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:17:00.0     Off |                    0 |
| N/A   29C    P0    54W / 149W |    255MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:18:00.0     Off |                    0 |
| N/A   25C    P0    70W / 149W |    257MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     34209    C   .../software/amber16/bin/pmemd.cuda.MPI         65MiB |
|    0     34210    C   .../software/amber16/bin/pmemd.cuda.MPI         61MiB |
|    0     34211    C   .../software/amber16/bin/pmemd.cuda.MPI         61MiB |
|    0     34212    C   .../software/amber16/bin/pmemd.cuda.MPI         61MiB |
|    1     34209    C   .../software/amber16/bin/pmemd.cuda.MPI         61MiB |
|    1     34210    C   .../software/amber16/bin/pmemd.cuda.MPI         65MiB |
|    1     34211    C   .../software/amber16/bin/pmemd.cuda.MPI         61MiB |
|    1     34212    C   .../software/amber16/bin/pmemd.cuda.MPI         61MiB |
|    2     34209    C   .../software/amber16/bin/pmemd.cuda.MPI         61MiB |
|    2     34210    C   .../software/amber16/bin/pmemd.cuda.MPI         61MiB |
|    2     34211    C   .../software/amber16/bin/pmemd.cuda.MPI         65MiB |
|    2     34212    C   .../software/amber16/bin/pmemd.cuda.MPI         61MiB |
|    3     34209    C   .../software/amber16/bin/pmemd.cuda.MPI         61MiB |
|    3     34210    C   .../software/amber16/bin/pmemd.cuda.MPI         61MiB |
|    3     34211    C   .../software/amber16/bin/pmemd.cuda.MPI         61MiB |
|    3     34212    C   .../software/amber16/bin/pmemd.cuda.MPI         65MiB |
+-----------------------------------------------------------------------------+

So I wonder why it fails in this situation.
I look forward to your reply.

Best,
Pengfei Li

-------- Forwarded message --------
From: "Ross Walker" <ross.rosswalker.co.uk>
Date: 2016-11-04 11:48:16
To: "AMBER Mailing List" <amber.ambermd.org>
Subject: Re: [AMBER] Quenstions about pmemd.cuda.MPI
Hi Pengfei,

Yeah, I've never understood this either but it works. ;-)

It's something to do with how P2P copies are handled by CUDA and the driver, so it's really just an issue with how nvidia-smi identifies which process is running on which GPU. Note that there are actually only two unique PIDs. Short answer: don't worry about it.
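
To check the point about unique PIDs, here is a rough sketch that counts the distinct PIDs in the Processes section of nvidia-smi's table output. The function name count_unique_pids is made up for illustration, and the parsing assumes the column layout shown in this thread (driver 367.48); other driver versions may print different columns:

```shell
#!/bin/sh
# Count the distinct PIDs listed in the "Processes" section of nvidia-smi's
# table output, read from standard input.
# (Hypothetical helper; assumes the row layout "| GPU PID Type name Usage |".)
count_unique_pids() {
    awk '
        # Everything after the "Processes:" banner belongs to the process table.
        /Processes:/ { in_procs = 1; next }
        # Data rows have a numeric GPU index in field 2 and a numeric PID in field 3.
        in_procs && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ { pids[$3] = 1 }
        END { n = 0; for (p in pids) n++; print n }
    '
}
```

Piping the two-GPU listing from this thread through it (nvidia-smi | count_unique_pids) should print 2, one PID per MPI rank, even though the table has four rows.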

All the best
Ross

> On Nov 3, 2016, at 20:20, Pengfei Li <lipengfei_mail.126.com> wrote:
>
> Dear all,
> Recently, I ran a single simulation on multiple GPUs using pmemd.cuda.MPI.
> Here is part of my job submission script:
>
> #!/bin/sh
> export CUDA_VISIBLE_DEVICES="0,1"
> .........
> mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -i md.in -c heat.rst7 -p complex_dc.parm7 -O -o md001.out -inf md001.info -r md001.rst7 -x md001.nc -l md001.log </dev/null
>
> nvidia-smi reported the following:
> +-----------------------------------------------------------------------------+
> | NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |===============================+======================+======================|
> |   0  Tesla K80           Off  | 0000:0F:00.0     Off |                    0 |
> | N/A   42C    P0   116W / 149W |   1259MiB / 11439MiB |     63%      Default |
> +-------------------------------+----------------------+----------------------+
> |   1  Tesla K80           Off  | 0000:10:00.0     Off |                    0 |
> | N/A   58C    P0   145W / 149W |    935MiB / 11439MiB |     99%      Default |
> +-------------------------------+----------------------+----------------------+
> |   2  Tesla K80           Off  | 0000:17:00.0     Off |                    0 |
> | N/A   29C    P8    26W / 149W |      2MiB / 11439MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
> |   3  Tesla K80           Off  | 0000:18:00.0     Off |                    0 |
> | N/A   28C    P8    29W / 149W |      2MiB / 11439MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                       GPU Memory |
> |  GPU       PID  Type  Process name                               Usage      |
> |=============================================================================|
> |    0     30442    C   .../software/amber16/bin/pmemd.cuda.MPI       1194MiB |
> |    0     30443    C   .../software/amber16/bin/pmemd.cuda.MPI         61MiB |
> |    1     30442    C   .../software/amber16/bin/pmemd.cuda.MPI         61MiB |
> |    1     30443    C   .../software/amber16/bin/pmemd.cuda.MPI        870MiB |
> +-----------------------------------------------------------------------------+
>
> I do not understand why GPU 0 shows two processes in the listing above, and likewise GPU 1.
> And why do GPU 0 and GPU 1 list the same PIDs, 30442 and 30443?
>
> Best,
> Pengfei Li
>
>
> --
>
> -------------------------------------------------------------------------
> Pengfei Li
> Email:lipengfei_mail.126.com
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Sun Nov 06 2016 - 22:30:02 PST