Re: [AMBER] 3 GPUs ? from Chris Neale on 2017-04-04 (Amber Archive Apr 2017)

From: Chris Neale <candrewn.gmail.com>
Date: Tue, 4 Apr 2017 18:06:04 -0600

Dear Ross:

thanks for the tip. I presume the issue is just the Charmm forcefield
defined cutoffs of 12 A, but a sanity check on my input file would
certainly be welcome. In actual fact, I run my production simulations with
a 4 fs timestep, but the benchmarking that I reported (and this associated
input file) are for a 2 fs timestep. I actually thought the performance was
quite good (it certainly is very good in comparison to gromacs, which was
my engine of choice until recently), but if I can improve on this then that
would be excellent.
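
As a rough check on the benchmark figures quoted further down in this thread (22.7, 34.4 and 51.3 ns/day on 1, 2 and 4 GPUs of the DGX), parallel efficiency can be computed as the multi-GPU speed divided by N times the single-GPU speed. This is only a sketch; the function and variable names are illustrative and not part of Amber.

```python
# Parallel efficiency = speed on N GPUs / (N * speed on 1 GPU).
# The ns/day values are the DGX benchmark numbers quoted in this thread;
# the function name is illustrative only.

def parallel_efficiency(ns_per_day, n_gpus, single_gpu_ns_per_day):
    """Return the fraction of ideal linear scaling achieved on n_gpus."""
    return ns_per_day / (n_gpus * single_gpu_ns_per_day)

# GPUs -> ns/day for the ~200,000 atom system at a 2 fs timestep
benchmarks = {1: 22.7, 2: 34.4, 4: 51.3}

efficiencies = {
    n: parallel_efficiency(speed, n, benchmarks[1])
    for n, speed in benchmarks.items()
}

for n, eff in efficiencies.items():
    print(f"{n} GPU: {benchmarks[n]:.1f} ns/day ({eff:.0%} efficient)")
```

The rounded values agree with the 76 % and 56 % reported for 2 and 4 GPUs.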

Additional questions if you have the time:

1) What cutoff and timestep do you use to get 20.84 ns/day for 400,000
atoms on the P100?

2) Also, am I correct to understand that it's the NVlink that costs $800
and the Quadro GP100 itself will cost many thousands, similar to the
P100's, or is it really that the Quadro GP100 outperforms the GTX-1080TI at
the same price point and therefore that the Quadro GP100 is now the
commodity GPU of choice?
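
To make the gap behind question 1 concrete, the single-GPU numbers can be compared as atoms times ns/day. This is a crude sketch: it assumes cost grows roughly linearly with atom count (an assumption made here for illustration, not a statement from anyone in the thread) and it ignores differences in cutoff, timestep and system composition, which are exactly what the question is about. The atom counts are the approximate figures quoted in the thread.

```python
# Rough comparison of single-GPU throughput in atom*ns/day.
# Assumption: cost scales about linearly with atom count; differences in
# cutoff, timestep and system composition are ignored.

runs = {
    # label: (approximate atom count, ns/day), as quoted in the thread
    "200K atoms, 1 GPU on the DGX": (200_000, 22.7),
    "400K atoms, 16GB P100": (400_000, 20.84),
    "400K atoms, 1080TI": (400_000, 20.14),
    "400K atoms, Quadro GP100": (400_000, 24.49),
}

throughput = {name: atoms * speed for name, (atoms, speed) in runs.items()}
baseline = throughput["200K atoms, 1 GPU on the DGX"]

for name, value in throughput.items():
    ratio = value / baseline
    print(f"{name}: {value / 1e6:.2f} M atom*ns/day ({ratio:.2f}x the 200K run)")
```

Under that assumption the reference P100 run delivers a bit under twice the per-atom throughput of the 200,000 atom benchmark.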

A NPT simulation for common production-level simulations -- params
generally from Charmm-gui + some modifications by CN
&cntrl
    imin=0, ! No minimization
    irest=0, ! irest=1 for restart and irest=0 for new start
    ntx=1, ! ntx=5 to use velocities from inpcrd and ntx=1 to not use them
    ntb=2, ! constant pressure simulation

    ! Temperature control
    ntt=3, ! Langevin dynamics
    gamma_ln=1.0, ! Friction coefficient (ps^-1)
    temp0=310.0, ! Target temperature
    tempi=310.0, ! Initial temperature -- has no effect if ntx>3

    ! Potential energy control
    cut=12.0, ! nonbonded cutoff, in Angstroms
    fswitch=10.0, ! for charmm.... note charmm-gui suggested cut=0.8 and no use of fswitch

    ! MD settings
    nstlim=10000, ! 10,000 steps (20 ps) for benchmarking; production is 0.25B steps, 1 us total at 4 fs
    dt=0.002, ! time step (ps)

    ! SHAKE
    ntc=2, ! Constrain bonds containing hydrogen
    ntf=2, ! Do not calculate forces of bonds containing hydrogen

    ! Control how often information is printed
    ntpr=5000, ! Print energy frequency
    ntwx=5000, ! Print coordinate frequency
    ntwr=5000, ! Print restart file frequency
    ntxo=2, ! Write NetCDF format
    ioutfm=1, ! Write NetCDF format (always do this!)

    ! Wrap coordinates when printing them to the same unit cell
    iwrap=1,

    ! Constant pressure control. Note that ntp=3 requires barostat=1
    barostat=2, ! 2 = Monte Carlo barostat (1 = Berendsen)
    ntp=3, ! 1=isotropic, 2=anisotropic, 3=semi-isotropic w/ surften
    pres0=1.01325, ! Target external pressure, in bar
    taup=4, ! Berendsen coupling constant (ps)
    comp=45, ! compressibility

    ! Constant surface tension (needed for semi-isotropic scaling). Uncomment
    ! for this feature. csurften must be nonzero if ntp=3 above
    csurften=3, ! Interfaces in 1=yz plane, 2=xz plane, 3=xy plane
    gamma_ten=0.0, ! Surface tension (dyne/cm). 0 gives pure semi-iso scaling
    ninterface=2, ! Number of interfaces (2 for bilayer)

    ! Set water atom/residue names for SETTLE recognition
    watnam='SOL', ! Water residues are named SOL (not TIP3)
    owtnm='OW', ! Water oxygens are named OW (not OH2)
    hwtnm1='HW1',
    hwtnm2='HW2',
&end
&ewald
        vdwmeth = 0,
&end

#####
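
As a sanity check on the MD settings in the input file above, the simulated time is simply the number of steps times the timestep. The sketch below assumes that the "0.25B steps, 1 us total" figure in the nstlim comment refers to the 4 fs production runs, while the benchmark input itself uses nstlim=10000 with dt=0.002. The helper name is illustrative.

```python
# Simulated time = number of steps * timestep.
# dt is given in ps, as in the &cntrl namelist; the helper name is illustrative.

def simulated_time_ns(nstlim, dt_ps):
    """Return the total simulated time in nanoseconds."""
    return nstlim * dt_ps / 1000.0

# Benchmark input: nstlim=10000, dt=0.002 ps (2 fs)
benchmark_ns = simulated_time_ns(10_000, 0.002)

# Assumed production setup: 0.25 billion steps at dt=0.004 ps (4 fs)
production_ns = simulated_time_ns(250_000_000, 0.004)

print(f"benchmark:  {benchmark_ns:.3f} ns")
print(f"production: {production_ns:.1f} ns")
```

So the benchmark input covers 20 ps, and 0.25 billion steps reach 1 us only at the 4 fs timestep.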

Thanks for your help!
Chris.

On Tue, Apr 4, 2017 at 5:46 PM, Ross Walker <ross.rosswalker.co.uk> wrote:

> Hi Chris,
>
> That's really slow out of the blocks for 200K atoms so no wonder it scales
> well. What does your mdin file look like for this? Are you running with a
> large cutoff? For reference NPT 400K atoms gets 20.84 ns/day on a 16GB P100
> and 20.14 on a 1080TI. A single Quadro GP100 gets 24.49 ns/day and two
> Quadros with NVLink (an $800 option) gets 34.62 ns/day. So the scaling to 2
> GPU looks similar to what you show here but this is for twice as many atoms
> and doesn't scale as well to 4 GPUs over nvlink as you show. You might want
> to check your settings - you might be able to get that 51.3 ns/day on just 1
> GPU with the right settings.
>
> All the best
> Ross
>
> > On Apr 4, 2017, at 13:28, Chris Neale <candrewn.gmail.com> wrote:
> >
> > Thank you Andreas and Ross!
> >
> > Indeed, even when all 3 GPUs can do Peer to Peer communication in any
> > combination of pairs, when I ask for exactly 3 GPUs then Peer to Peer is
> > reported by amber as not possible.
> >
> > At least on a DGX, I can get quite good scaling to 4 GPUs (which all allow
> > peer-to-peer). For example:
> >
> > 1 GPU: 22.7 ns/day
> > 2 GPU: 34.4 ns/day (76 % efficient)
> > 4 GPU: 51.3 ns/day (56 % efficient)
> >
> > Sure, the efficiency goes down, but 51 vs 34 ns/day is a noticeable
> > improvement for this 200,000 atom system (2 fs timestep)
> >
> > On Tue, Apr 4, 2017 at 9:26 AM, Ross Walker <ross.rosswalker.co.uk> wrote:
> >
> >> Hi Chris,
> >>
> >> The P2P algorithm used in AMBER 16 only supports power of 2 GPUs. As such
> >> you will always see poor performance on 3 GPUs. For such a machine, indeed
> >> for almost all machines, your best option is to run 3 independent
> >> calculations, one per GPU, you'll get much better overall sampling that way
> >> since the multi-GPU scaling is never great. You could also run a 1 x 2 GPU
> >> job and a 1 x 1 GPU job. On a DGX I wouldn't recommend going above 2 GPUs
> >> per run. Sure it will scale to 4 but the improvement is not great and you
> >> mostly end up just wasting resources for a few extra %. On a DGX system (or
> >> any 8 GPU system for that matter) your best option with AMBER 16 is
> >> probably to run either 8 x 1 GPU or 4 x 2 GPU or a combination of those.
> >> Unless you are running a large GB calculation in which case you can get
> >> almost linear scaling out to 8 GPUs - even over regular PCI-E (no need for
> >> gold plated DGX nodes).
> >>
> >> All the best
> >> Ross
> >>
> >>
> >>> On Apr 3, 2017, at 19:43, Chris Neale <candrewn.gmail.com> wrote:
> >>>
> >>> Dear AMBER users:
> >>>
> >>> I have a system with ~ 200,000 atoms that scales quite well on 4 GPUs on a
> >>> DGX machine with Amber16. I now have access to a different node for testing
> >>> purposes that has 3 Tesla P100 GPUs. I find that 1 GPU gives 21 ns/day, 2
> >>> GPUs give 31 ns/day and 3 GPUs give 21 ns/day. Strange thing is that 2 GPUs
> >>> gives a consistent speed when I use GPUs 0,1 or 1,2 or 0,2 -- leading me to
> >>> think that there is PCI-based peer-to-peer across all 3 GPUs (though I
> >>> don't know how to verify that). So then why does performance drop off with
> >>> 3 GPUs? I don't currently have the ability to re-test with 3 GPUs on a DGX,
> >>> though I will look into testing that, since it could give a definitive
> >>> answer.
> >>>
> >>> I'm wondering whether there is something obviously inherent to the code
> >>> that doesn't like 3 GPUs (vs. 2 or 4)? Any thoughts?
> >>>
> >>> Thank you for your help,
> >>> Chris.
> >>> _______________________________________________
> >>> AMBER mailing list
> >>> AMBER.ambermd.org
> >>> http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Apr 04 2017 - 17:30:03 PDT