Thank you Andreas and Ross!
Indeed, even when all 3 GPUs can do peer-to-peer communication in any
combination of pairs, when I ask for exactly 3 GPUs, Amber reports
peer-to-peer as not possible.
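For anyone who wants to check pairwise P2P capability outside of Amber, a minimal CUDA sketch along these lines should work (assuming the CUDA toolkit is installed; `nvidia-smi topo -m` reports similar information):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Ask the driver, for every ordered pair of devices, whether direct
// peer-to-peer access between them is possible.
int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int ok = 0;
            cudaDeviceCanAccessPeer(&ok, i, j);
            printf("GPU %d -> GPU %d : P2P %s\n", i, j, ok ? "yes" : "no");
        }
    }
    return 0;
}
```

Note this only reports whether the hardware/driver allow P2P for each pair, not whether Amber will actually use it for a given GPU count.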
At least on a DGX, I can get quite good scaling to 4 GPUs (which all allow
peer-to-peer). For example:
1 GPU: 22.7 ns/day
2 GPUs: 34.4 ns/day (76% efficient)
4 GPUs: 51.3 ns/day (56% efficient)
Sure, the efficiency goes down, but 51 vs. 34 ns/day is a noticeable
improvement for this 200,000-atom system (2 fs timestep).
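For what it's worth, the efficiency figures above are just speedup over 1 GPU divided by GPU count; a quick Python check of the arithmetic:

```python
# Throughputs from the runs above, in ns/day.
rates = {1: 22.7, 2: 34.4, 4: 51.3}

def efficiency(n_gpus: int) -> float:
    """Parallel efficiency: speedup over the 1-GPU run divided by GPU count."""
    return (rates[n_gpus] / rates[1]) / n_gpus

for n in (2, 4):
    print(f"{n} GPUs: {efficiency(n):.0%} efficient")
# 2 GPUs: 76% efficient
# 4 GPUs: 56% efficient
```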
On Tue, Apr 4, 2017 at 9:26 AM, Ross Walker <ross.rosswalker.co.uk> wrote:
> Hi Chris,
>
> The P2P algorithm used in AMBER 16 only supports power-of-two GPU counts,
> so you will always see poor performance on 3 GPUs. For such a machine, indeed
> for almost all machines, your best option is to run 3 independent
> calculations, one per GPU; you'll get much better overall sampling that way,
> since the multi-GPU scaling is never great. You could also run a 1 x 2 GPU
> job and a 1 x 1 GPU job. On a DGX I wouldn't recommend going above 2 GPUs
> per run. Sure, it will scale to 4, but the improvement is not great and you
> mostly end up just wasting resources for a few extra percent. On a DGX
> system (or any 8 GPU system for that matter) your best option with AMBER 16
> is probably to run either 8 x 1 GPU or 4 x 2 GPU or a combination of those.
> Unless you are running a large GB calculation, in which case you can get
> almost linear scaling out to 8 GPUs - even over regular PCI-E (no need for
> gold-plated DGX nodes).
>
> All the best
> Ross
>
>
> > On Apr 3, 2017, at 19:43, Chris Neale <candrewn.gmail.com> wrote:
> >
> > Dear AMBER users:
> >
> > I have a system with ~200,000 atoms that scales quite well on 4 GPUs on a
> > DGX machine with Amber16. I now have access to a different node for
> > testing purposes that has 3 Tesla P100 GPUs. I find that 1 GPU gives 21
> > ns/day, 2 GPUs give 31 ns/day, and 3 GPUs give 21 ns/day. The strange
> > thing is that 2 GPUs give a consistent speed whether I use GPUs 0,1 or
> > 1,2 or 0,2 -- leading me to think that there is PCI-based peer-to-peer
> > access across all 3 GPUs (though I don't know how to verify that). So why
> > does performance drop off with 3 GPUs? I don't currently have the ability
> > to re-test with 3 GPUs on a DGX, though I will look into testing that,
> > since it could give a definitive answer.
> >
> > I'm wondering whether there is something obviously inherent to the code
> > that doesn't like 3 GPUs (vs. 2 or 4)? Any thoughts?
> >
> > Thank you for your help,
> > Chris.
> > _______________________________________________
> > AMBER mailing list
> > AMBER.ambermd.org
> > http://lists.ambermd.org/mailman/listinfo/amber
>
>
Received on Tue Apr 04 2017 - 13:30:03 PDT