Hi Peng Fei,
It shouldn't 'fail', but I suspect your hardware may not be capable of running 4 cards at once. In that case it should still run, just very slowly. Have you gone through the GPU page, which describes how to determine whether all 4 cards can do peer-to-peer communication correctly with each other? Most dual-socket motherboards only have pairs of PCI-E slots that can communicate via P2P. Here's some background I wrote a while ago:
http://exxactcorp.com/blog/exploring-the-complexities-of-pcie-connectivity-and-peer-to-peer-communication/
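
A quick way to check P2P capability yourself is `nvidia-smi topo -m`, or a minimal sketch along these lines (this is just the standard CUDA runtime query, not AMBER's own check tool; it needs the CUDA toolkit and a GPU machine to compile and run):

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int n = 0;
    cudaGetDeviceCount(&n);
    /* Print whether each ordered pair of GPUs can do P2P transfers. */
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int ok = 0;
            cudaDeviceCanAccessPeer(&ok, i, j);
            printf("GPU %d -> GPU %d : P2P %s\n", i, j, ok ? "YES" : "NO");
        }
    }
    return 0;
}
```

If any pair in the set you run on reports NO, the transfers for that pair fall back to staging through host memory, which is typically where multi-GPU scaling dies.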
As for it actually crashing: with no details at all as to what happens, error messages, etc., it is impossible to offer any help. That said, you are unlikely to see a performance improvement on 4 GPUs, so it probably isn't worth the effort to debug.
All the best
Ross
> On Nov 7, 2016, at 00:27, Pengfei Li <lipengfei_mail.126.com> wrote:
>
> Dear Ross,
> Firstly, thanks for your reply.
> When I use two GPU cards for a parallel calculation, it works correctly.
> But when I use four GPU cards, the machine crashes and the parallel calculation fails.
>
> Here is part of my job submission script:
> #!/bin/sh
> export CUDA_VISIBLE_DEVICES="0,1,2,3"
> .........
> mpirun -np 4 $AMBERHOME/bin/pmemd.cuda.MPI -i md.in -c heat.rst7 -p complex_dc.parm7 -O -o md001.out -inf md001.info -r md001.rst7 -x md001.nc -l md001.log </dev/null
> Then I got the following from the nvidia-smi command:
> +-----------------------------------------------------------------------------+
> | NVIDIA-SMI 367.48 Driver Version: 367.48 |
> |-------------------------------+----------------------+----------------------+
> | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
> | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
> |===============================+======================+======================|
> | 0 Tesla K80 Off | 0000:0F:00.0 Off | 0 |
> | N/A 21C P0 53W / 149W | 255MiB / 11439MiB | 0% Default |
> +-------------------------------+----------------------+----------------------+
> | 1 Tesla K80 Off | 0000:10:00.0 Off | 0 |
> | N/A 25C P0 69W / 149W | 255MiB / 11439MiB | 0% Default |
> +-------------------------------+----------------------+----------------------+
> | 2 Tesla K80 Off | 0000:17:00.0 Off | 0 |
> | N/A 29C P0 54W / 149W | 255MiB / 11439MiB | 0% Default |
> +-------------------------------+----------------------+----------------------+
> | 3 Tesla K80 Off | 0000:18:00.0 Off | 0 |
> | N/A 25C P0 70W / 149W | 257MiB / 11439MiB | 0% Default |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes: GPU Memory |
> | GPU PID Type Process name Usage |
> |=============================================================================|
> | 0 34209 C .../software/amber16/bin/pmemd.cuda.MPI 65MiB |
> | 0 34210 C .../software/amber16/bin/pmemd.cuda.MPI 61MiB |
> | 0 34211 C .../software/amber16/bin/pmemd.cuda.MPI 61MiB |
> | 0 34212 C .../software/amber16/bin/pmemd.cuda.MPI 61MiB |
> | 1 34209 C .../software/amber16/bin/pmemd.cuda.MPI 61MiB |
> | 1 34210 C .../software/amber16/bin/pmemd.cuda.MPI 65MiB |
> | 1 34211 C .../software/amber16/bin/pmemd.cuda.MPI 61MiB |
> | 1 34212 C .../software/amber16/bin/pmemd.cuda.MPI 61MiB |
> | 2 34209 C .../software/amber16/bin/pmemd.cuda.MPI 61MiB |
> | 2 34210 C .../software/amber16/bin/pmemd.cuda.MPI 61MiB |
> | 2 34211 C .../software/amber16/bin/pmemd.cuda.MPI 65MiB |
> | 2 34212 C .../software/amber16/bin/pmemd.cuda.MPI 61MiB |
> | 3 34209 C .../software/amber16/bin/pmemd.cuda.MPI 61MiB |
> | 3 34210 C .../software/amber16/bin/pmemd.cuda.MPI 61MiB |
> | 3 34211 C .../software/amber16/bin/pmemd.cuda.MPI 61MiB |
> | 3 34212 C .../software/amber16/bin/pmemd.cuda.MPI 65MiB |
> +-----------------------------------------------------------------------------+
>
> So, I wonder why it fails in this situation.
> I look forward to your reply.
>
> Best,
> Pengfei Li
>
>
>
>
> -------- Forwarding messages --------
> From: "Ross Walker" <ross.rosswalker.co.uk>
> Date: 2016-11-04 11:48:16
> To: "AMBER Mailing List" <amber.ambermd.org>
> Subject: Re: [AMBER] Quenstions about pmemd.cuda.MPI
> Hi Pengfei,
>
> Yeah, I've never understood this either but it works. ;-)
>
> It's something to do with how P2P copies are handled by CUDA and the driver, so it's really just a quirk of how nvidia-smi identifies which process is running on which GPU. Note there are actually only two unique PIDs. Short answer: don't worry about it.
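>
> Roughly, when P2P is available each rank enables peer access to the other GPUs, which creates a small context on each peer device; nvidia-smi then lists that rank's PID under every GPU it has peer access to. This is not pmemd's actual source, just the generic CUDA pattern (`my_gpu` and `ngpus` are placeholder names):
>
> ```c
> /* Sketch: each MPI rank, pinned to its own GPU, enables peer access. */
> cudaSetDevice(my_gpu);
> for (int peer = 0; peer < ngpus; ++peer) {
>     if (peer == my_gpu) continue;
>     int ok = 0;
>     cudaDeviceCanAccessPeer(&ok, my_gpu, peer);
>     if (ok)
>         cudaDeviceEnablePeerAccess(peer, 0); /* allocates a small context on the peer GPU */
> }
> ```
>
> That per-peer context is the ~61 MiB entry you see for each "extra" PID on each card.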
>
> All the best
> Ross
>
>> On Nov 3, 2016, at 20:20, Pengfei Li <lipengfei_mail.126.com> wrote:
>>
>> Dear all,
>> Recently, I employed multiple GPUs in a single simulation using pmemd.cuda.MPI.
>> Here is part of my job submission script:
>>
>> #!/bin/sh
>> export CUDA_VISIBLE_DEVICES="0,1"
>> .........
>> mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -i md.in -c heat.rst7 -p complex_dc.parm7 -O -o md001.out -inf md001.info -r md001.rst7 -x md001.nc -l md001.log </dev/null
>>
>> I got the following from the nvidia-smi command:
>> +-----------------------------------------------------------------------------+
>> | NVIDIA-SMI 367.48 Driver Version: 367.48 |
>> |-------------------------------+----------------------+----------------------+
>> | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
>> | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
>> |===============================+======================+======================|
>> | 0 Tesla K80 Off | 0000:0F:00.0 Off | 0 |
>> | N/A 42C P0 116W / 149W | 1259MiB / 11439MiB | 63% Default |
>> +-------------------------------+----------------------+----------------------+
>> | 1 Tesla K80 Off | 0000:10:00.0 Off | 0 |
>> | N/A 58C P0 145W / 149W | 935MiB / 11439MiB | 99% Default |
>> +-------------------------------+----------------------+----------------------+
>> | 2 Tesla K80 Off | 0000:17:00.0 Off | 0 |
>> | N/A 29C P8 26W / 149W | 2MiB / 11439MiB | 0% Default |
>> +-------------------------------+----------------------+----------------------+
>> | 3 Tesla K80 Off | 0000:18:00.0 Off | 0 |
>> | N/A 28C P8 29W / 149W | 2MiB / 11439MiB | 0% Default |
>> +-------------------------------+----------------------+----------------------+
>>
>> +-----------------------------------------------------------------------------+
>> | Processes: GPU Memory |
>> | GPU PID Type Process name Usage |
>> |=============================================================================|
>> | 0 30442 C .../software/amber16/bin/pmemd.cuda.MPI 1194MiB |
>> | 0 30443 C .../software/amber16/bin/pmemd.cuda.MPI 61MiB |
>> | 1 30442 C .../software/amber16/bin/pmemd.cuda.MPI 61MiB |
>> | 1 30443 C .../software/amber16/bin/pmemd.cuda.MPI 870MiB |
>> +-----------------------------------------------------------------------------+
>>
>> I do not understand why GPU 0 shows two tasks as above, and likewise GPU 1.
>> And why do GPU 0 and GPU 1 share the same task PIDs, 30442 and 30443?
>>
>> Best,
>> Pengfei Li
>>
>>
>> --
>>
>> -------------------------------------------------------------------------
>> Pengfei Li
>> Email:lipengfei_mail.126.com
>> _______________________________________________
>> AMBER mailing list
>> AMBER.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber
>
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Nov 07 2016 - 06:30:04 PST