Hi Chris,
That's really slow out of the blocks for 200K atoms, so it's no wonder it scales well. What does your mdin file look like for this? Are you running with a large cutoff? For reference, NPT with 400K atoms gets 20.84 ns/day on a 16GB P100 and 20.14 ns/day on a 1080TI. A single Quadro GP100 gets 24.49 ns/day, and two Quadros with NVLink (an $800 option) get 34.62 ns/day. So the scaling to 2 GPUs looks similar to what you show here, but that is for twice as many atoms, and it doesn't scale as well to 4 GPUs over NVLink as what you show. You might want to check your settings - with the right settings you might be able to get that 51.3 ns/day on just 1 GPU.
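For reference, the numbers above come from a fairly standard NPT benchmark setup. Just as an illustration (these are not the exact benchmark inputs, only typical values I'd start from: 2 fs timestep with SHAKE, an 8 Angstrom cutoff, Langevin thermostat and the default Berendsen barostat), the mdin would look something like:

NPT production, example settings only
 &cntrl
  imin=0, irest=1, ntx=5,
  nstlim=250000, dt=0.002,
  ntc=2, ntf=2, cut=8.0,
  ntb=2, ntp=1, taup=2.0,
  ntt=3, gamma_ln=2.0, temp0=300.0,
  ntpr=2500, ntwx=2500, ntwr=25000,
  ioutfm=1, iwrap=1,
 /

If your cut is much larger than 8.0, or you are writing coordinates or restarts very frequently, that alone can explain a big chunk of the single-GPU performance gap.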
All the best
Ross
> On Apr 4, 2017, at 13:28, Chris Neale <candrewn.gmail.com> wrote:
>
> Thank you Andreas and Ross!
>
> Indeed, even when all 3 GPUs can do peer-to-peer communication in any
> combination of pairs, when I ask for exactly 3 GPUs, peer-to-peer is
> reported by AMBER as not possible.
>
> At least on a DGX, I can get quite good scaling to 4 GPUs (which all allow
> peer-to-peer). For example:
>
> 1 GPU: 22.7 ns/day
> 2 GPU: 34.4 ns/day (76 % efficient)
> 4 GPU: 51.3 ns/day (56 % efficient)
>
> Sure, the efficiency goes down, but 51 vs 34 ns/day is a noticeable
> improvement for this 200,000-atom system (2 fs timestep).
>
> On Tue, Apr 4, 2017 at 9:26 AM, Ross Walker <ross.rosswalker.co.uk> wrote:
>
>> Hi Chris,
>>
>> The P2P algorithm used in AMBER 16 only supports power-of-two GPU
>> counts, so you will always see poor performance on 3 GPUs. For such a
>> machine, indeed for almost all machines, your best option is to run 3
>> independent calculations, one per GPU; you'll get much better overall
>> sampling that way, since the multi-GPU scaling is never great. You could
>> also run a 1 x 2 GPU job and a 1 x 1 GPU job. On a DGX I wouldn't
>> recommend going above 2 GPUs per run. Sure, it will scale to 4, but the
>> improvement is not great and you mostly end up wasting resources for a
>> few extra percent. On a DGX system (or any 8-GPU system for that
>> matter), your best option with AMBER 16 is probably to run either
>> 8 x 1 GPU or 4 x 2 GPU, or a combination of those. The exception is a
>> large GB calculation, in which case you can get almost linear scaling
>> out to 8 GPUs - even over regular PCI-E (no need for gold-plated DGX
>> nodes).
>>
>> All the best
>> Ross
>>
>>
>>> On Apr 3, 2017, at 19:43, Chris Neale <candrewn.gmail.com> wrote:
>>>
>>> Dear AMBER users:
>>>
>>> I have a system with ~200,000 atoms that scales quite well on 4 GPUs on
>>> a DGX machine with Amber16. I now have access to a different node for
>>> testing purposes that has 3 Tesla P100 GPUs. I find that 1 GPU gives
>>> 21 ns/day, 2 GPUs give 31 ns/day, and 3 GPUs give 21 ns/day. The
>>> strange thing is that 2 GPUs give a consistent speed whether I use
>>> GPUs 0,1 or 1,2 or 0,2 -- leading me to think that there is PCI-based
>>> peer-to-peer across all 3 GPUs (though I don't know how to verify
>>> that). So then why does performance drop off with 3 GPUs? I don't
>>> currently have the ability to re-test with 3 GPUs on a DGX, though I
>>> will look into testing that, since it could give a definitive answer.
>>>
>>> I'm wondering whether there is something obviously inherent to the code
>>> that doesn't like 3 GPUs (vs. 2 or 4)? Any thoughts?
>>>
>>> Thank you for your help,
>>> Chris.
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber