Re: [AMBER] 3 GPUs ?

From: Chris Neale <candrewn.gmail.com>
Date: Wed, 5 Apr 2017 18:04:27 -0600

Dear Ross:

This is all very helpful information. I really appreciate it.

Chris.

On Wed, Apr 5, 2017 at 1:41 PM, Ross Walker <ross.rosswalker.co.uk> wrote:

> Hi Chris,
>
> Ah yes, the 12 angstrom cutoff is the majority of the reason you are seeing
> lower performance. The 20.84 ns/day on a 408K-atom system on a P100 was
> obtained with the following input:
>
> Typical Production MD NPT, MC Bar 2fs
> &cntrl
> ntx=5, irest=1,
> ntc=2, ntf=2,
> nstlim=10000,
> ntpr=1000, ntwx=1000,
> ntwr=10000,
> dt=0.002, cut=8.,
> ntt=1, tautp=10.0,
> temp0=300.0,
> ntb=2, ntp=1, barostat=2,
> ioutfm=1,
> /
> So that's an 8 angstrom cutoff, which is fine most of the time; one can raise
> it to 9 if needed without costing too much performance. 12 angstroms can hurt
> quite a bit and, in my opinion, is overkill unless one is using the Charmm
> lipid force field. From a performance perspective it's a shame they chose to
> parameterize that force field exclusively against a 12 angstrom cutoff. Since
> the Charmm use of a 12 angstrom cutoff is focused on the VDW interactions,
> one could, in principle, get back some performance by backing off the grid
> spacing for the reciprocal space and still achieve the same overall Ewald
> error, but that might be too much effort for the potential gain. The use of a
> force switch also saps performance; again, that's probably only needed if
> using the Charmm lipid parameters.
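>
> As a rough sketch of the grid-spacing idea (the nfft values below are purely
> hypothetical -- pick them for your box and check the Ewald error estimate
> reported in mdout):
>
> &ewald
>   nfft1 = 96, nfft2 = 96, nfft3 = 96,  ! coarser PME grid than pmemd would pick automatically
>   dsum_tol = 1.0e-5,                   ! direct-sum tolerance (the default value)
> /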
>
> In terms of other minor tweaks, I'd set ntwr = nstlim unless you have a
> real need to have it at 5000. Restarts tend to be messy, output-file wise,
> so if a job crashes partway through a run I tend to just restart that run
> from the beginning. You thus get a small performance boost by only
> writing a restart file at the end of a run.
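>
> For example (using your production step count purely as an illustration):
>
>   nstlim = 250000000,   ! 0.25B steps
>   ntwr   = 250000000,   ! write a restart file only once, at the end of the run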
>
> You are using the Monte Carlo barostat so that's good - note I don't think
> the following settings do anything in the case of barostat=2.
>
> > taup=4, ! Berendsen coupling constant (ps)
> > comp=45, ! compressibility
>
> The other thing probably affecting performance is the use of XY isotropic
> scaling (constant surface tension). You may not need that with the latest
> versions of the various lipid force fields. I'm not sure about the latest
> Charmm parameters, but AMBER's Lipid14, for example, does not need it.
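>
> If your force field allows it, a sketch of the simpler alternative (please do
> check that this is appropriate for your membrane model and Amber version
> before using it) would be:
>
>   barostat=2,       ! Monte Carlo barostat
>   ntp=2,            ! anisotropic scaling; no csurften/gamma_ten/ninterface needed
>   pres0=1.01325,    ! target pressure (bar)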
>
> With regards to the GP100 and NVLink: yeah, if only it were $800 for that.
> The $800 is just for the cable that connects the two cards together. :-(
> The Quadro cards themselves will be many thousands of dollars, although
> cheaper than P100 cards (and they are PCI-E, so you can put them in a
> cost-effective node rather than the very expensive SXM2 nodes).
>
> Ultimately the card of choice today is almost certainly the GTX-1080TI at
> ~$699 a card - but no NVLink is possible there, so you are limited to mediocre
> P2P scaling to 2 GPUs and/or just running multiple runs (one per GPU) from
> different initial conditions, which is almost always a good idea anyway.
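>
> For the multiple-runs approach, something along these lines is typical (the
> file names here are just placeholders):
>
> CUDA_VISIBLE_DEVICES=0 nohup pmemd.cuda -O -i md.in -p sys.prmtop -c run0.rst \
>   -o run0.out -r run0.ncrst -x run0.nc &
> CUDA_VISIBLE_DEVICES=1 nohup pmemd.cuda -O -i md.in -p sys.prmtop -c run1.rst \
>   -o run1.out -r run1.ncrst -x run1.nc &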
>
> Hope that helps.
>
> All the best
> Ross
>
> > On Apr 4, 2017, at 17:06, Chris Neale <candrewn.gmail.com> wrote:
> >
> > Dear Ross:
> >
> > Thanks for the tip. I presume the issue is just the Charmm-forcefield-defined
> > cutoff of 12 A, but a sanity check on my input file would certainly be
> > welcome. In actual fact, I run my production simulations with a 4 fs
> > timestep, but the benchmarking that I reported (and this associated input
> > file) is for a 2 fs timestep. I actually thought the performance was quite
> > good (it certainly is very good in comparison to gromacs, which was my
> > engine of choice until recently), but if I can improve on this then that
> > would be excellent.
> >
> > Additional questions if you have the time:
> >
> > 1) What cutoff and timestep do you use to get 20.84 ns/day for 400,000
> > atoms on the P100?
> >
> > 2) Also, am I correct to understand that it's the NVLink that costs $800
> > and that the Quadro GP100 itself will cost many thousands, similar to the
> > P100s? Or is it really that the Quadro GP100 outperforms the GTX-1080TI at
> > the same price point, and therefore that the Quadro GP100 is now the
> > commodity GPU of choice?
> >
> >
> >
> > An NPT simulation for common production-level simulations -- params
> > generally from Charmm-gui + some modifications by CN
> > &cntrl
> > imin=0, ! No minimization
> > irest=0, ! irest=1 for a restart and irest=0 for a new start
> > ntx=1, ! ntx=5 to use velocities from inpcrd, ntx=1 to not use them
> > ntb=2, ! constant pressure simulation
> >
> > ! Temperature control
> > ntt=3, ! Langevin dynamics
> > gamma_ln=1.0, ! Friction coefficient (ps^-1)
> > temp0=310.0, ! Target temperature
> > tempi=310.0, ! Initial temperature -- has no effect if ntx>3
> >
> > ! Potential energy control
> > cut=12.0, ! nonbonded cutoff, in Angstroms
> > fswitch=10.0, ! for charmm... note charmm-gui suggested cut=0.8 and no use of fswitch
> >
> > ! MD settings
> > nstlim=10000, ! number of MD steps (10000 for this benchmark; production uses 0.25B steps, 1 us total)
> > dt=0.002, ! time step (ps)
> >
> > ! SHAKE
> > ntc=2, ! Constrain bonds containing hydrogen
> > ntf=2, ! Do not calculate forces of bonds containing hydrogen
> >
> > ! Control how often information is printed
> > ntpr=5000, ! Print energy frequency
> > ntwx=5000, ! Print coordinate frequency
> > ntwr=5000, ! Print restart file frequency
> > ntxo=2, ! Write NetCDF-format restart files
> > ioutfm=1, ! Write NetCDF-format trajectories (always do this!)
> >
> > ! Wrap coordinates when printing them to the same unit cell
> > iwrap=1,
> >
> > ! Constant pressure control. Note that ntp=3 requires barostat=1
> > barostat=2, ! 1=Berendsen barostat, 2=Monte Carlo barostat
> > ntp=3, ! 1=isotropic, 2=anisotropic, 3=semi-isotropic w/ surften
> > pres0=1.01325, ! Target external pressure, in bar
> > taup=4, ! Berendsen coupling constant (ps)
> > comp=45, ! compressibility
> >
> > ! Constant surface tension (needed for semi-isotropic scaling).
> > ! csurften must be nonzero if ntp=3 above
> > csurften=3, ! Interfaces in 1=yz plane, 2=xz plane, 3=xy plane
> > gamma_ten=0.0, ! Surface tension (dyne/cm). 0 gives pure semi-iso scaling
> > ninterface=2, ! Number of interfaces (2 for bilayer)
> >
> > ! Set water atom/residue names for SETTLE recognition
> > watnam='SOL', ! Water residues are named SOL
> > owtnm='OW', ! Water oxygens are named OW
> > hwtnm1='HW1',
> > hwtnm2='HW2',
> > &end
> > &ewald
> > vdwmeth = 0,
> > &end
> >
> >
> > #####
> >
> > Thanks for your help!
> > Chris.
> >
> > On Tue, Apr 4, 2017 at 5:46 PM, Ross Walker <ross.rosswalker.co.uk> wrote:
> >
> >> Hi Chris,
> >>
> >> That's really slow out of the blocks for 200K atoms, so no wonder it
> >> scales well. What does your mdin file look like for this? Are you running
> >> with a large cutoff? For reference, NPT with 400K atoms gets 20.84 ns/day
> >> on a 16GB P100 and 20.14 on a 1080TI. A single Quadro GP100 gets 24.49
> >> ns/day and two Quadros with NVLink (an $800 option) get 34.62 ns/day. So
> >> the scaling to 2 GPUs looks similar to what you show here, but this is for
> >> twice as many atoms and doesn't scale as well to 4 GPUs over NVLink as you
> >> show. You might want to check your settings - you might be able to get
> >> that 51.3 ns/day on just 1 GPU with the right settings.
> >>
> >> All the best
> >> Ross
> >>
> >>> On Apr 4, 2017, at 13:28, Chris Neale <candrewn.gmail.com> wrote:
> >>>
> >>> Thank you Andreas and Ross!
> >>>
> >>> Indeed, even when all 3 GPUs can do peer-to-peer communication in any
> >>> combination of pairs, when I ask for exactly 3 GPUs then peer-to-peer is
> >>> reported by amber as not possible.
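> >>>
> >>> (For reference, one way to inspect the link topology is "nvidia-smi topo -m",
> >>> which prints the GPU-to-GPU connectivity matrix; the exact legend varies
> >>> with the driver version.)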
> >>>
> >>> At least on a DGX, I can get quite good scaling to 4 GPUs (which all
> >>> allow peer-to-peer). For example:
> >>>
> >>> 1 GPU: 22.7 ns/day
> >>> 2 GPU: 34.4 ns/day (76 % efficient)
> >>> 4 GPU: 51.3 ns/day (56 % efficient)
> >>>
> >>> Sure, the efficiency goes down, but 51 vs. 34 ns/day is a noticeable
> >>> improvement for this 200,000-atom system (2 fs timestep).
> >>>
> >>> On Tue, Apr 4, 2017 at 9:26 AM, Ross Walker <ross.rosswalker.co.uk> wrote:
> >>>
> >>>> Hi Chris,
> >>>>
> >>>> The P2P algorithm used in AMBER 16 only supports power-of-2 GPU counts,
> >>>> so you will always see poor performance on 3 GPUs. For such a machine,
> >>>> indeed for almost all machines, your best option is to run 3 independent
> >>>> calculations, one per GPU; you'll get much better overall sampling that
> >>>> way since the multi-GPU scaling is never great. You could also run a
> >>>> 1 x 2-GPU job and a 1 x 1-GPU job. On a DGX I wouldn't recommend going
> >>>> above 2 GPUs per run. Sure it will scale to 4, but the improvement is
> >>>> not great and you mostly end up just wasting resources for a few extra
> >>>> %. On a DGX system (or any 8-GPU system for that matter) your best
> >>>> option with AMBER 16 is probably to run either 8 x 1 GPU or 4 x 2 GPU
> >>>> or a combination of those - unless you are running a large GB
> >>>> calculation, in which case you can get almost linear scaling out to 8
> >>>> GPUs, even over regular PCI-E (no need for gold-plated DGX nodes).
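> >>>>
> >>>> For a 2-GPU run, the usual pattern is something like the following (the
> >>>> file names here are just placeholders):
> >>>>
> >>>> export CUDA_VISIBLE_DEVICES=0,1
> >>>> mpirun -np 2 pmemd.cuda.MPI -O -i md.in -p sys.prmtop -c sys.rst \
> >>>>   -o md_2gpu.out -r md_2gpu.ncrst -x md_2gpu.nc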
> >>>>
> >>>> All the best
> >>>> Ross
> >>>>
> >>>>
> >>>>> On Apr 3, 2017, at 19:43, Chris Neale <candrewn.gmail.com> wrote:
> >>>>>
> >>>>> Dear AMBER users:
> >>>>>
> >>>>> I have a system with ~200,000 atoms that scales quite well on 4 GPUs on
> >>>>> a DGX machine with Amber16. I now have access to a different node for
> >>>>> testing purposes that has 3 Tesla P100 GPUs. I find that 1 GPU gives 21
> >>>>> ns/day, 2 GPUs give 31 ns/day, and 3 GPUs give 21 ns/day. The strange
> >>>>> thing is that 2 GPUs give a consistent speed whether I use GPUs 0,1 or
> >>>>> 1,2 or 0,2 -- leading me to think that there is PCI-based peer-to-peer
> >>>>> across all 3 GPUs (though I don't know how to verify that). So then why
> >>>>> does performance drop off with 3 GPUs? I don't currently have the
> >>>>> ability to re-test with 3 GPUs on a DGX, though I will look into
> >>>>> testing that, since it could give a definitive answer.
> >>>>>
> >>>>> I'm wondering whether there is something obviously inherent to the code
> >>>>> that doesn't like 3 GPUs (vs. 2 or 4)? Any thoughts?
> >>>>>
> >>>>> Thank you for your help,
> >>>>> Chris.
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Apr 05 2017 - 17:30:02 PDT