Re: [AMBER] 3 GPUs ?

From: Ross Walker <ross.rosswalker.co.uk>
Date: Wed, 5 Apr 2017 12:41:33 -0700

Hi Chris,

Ah yes, the 12 angstrom cutoff is the majority of the reason you are seeing lower performance. The 20.84 ns/day for the 408K-atom system on a P100 is with the following input:

Typical Production MD NPT, MC Bar 2fs
 &cntrl
  ntx=5, irest=1,
  ntc=2, ntf=2,
  nstlim=10000,
  ntpr=1000, ntwx=1000,
  ntwr=10000,
  dt=0.002, cut=8.,
  ntt=1, tautp=10.0,
  temp0=300.0,
  ntb=2, ntp=1, barostat=2,
  ioutfm=1,
 /
So that's an 8 angstrom cutoff, which is good most of the time; one can raise it to 9 if needed without costing too much performance. 12 angstroms can hurt quite a bit and, in my opinion, is overkill unless one is using the Charmm lipid force field - from a performance perspective it's a shame they chose to parameterize that force field exclusively against a 12 angstrom cutoff. The use of a force switch also saps performance, and again that's probably only needed if you are using the Charmm lipid parameters. Since the Charmm use of a 12 angstrom cutoff is focused on the VDW interactions, one could, in principle, get back some of that performance by backing off the grid spacing for the reciprocal space and still achieve the same Ewald error, though that might be too much effort for the potential gain.
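Concretely, backing off the grid would mean something along these lines in the &ewald namelist - the grid sizes below are purely illustrative and would need to be matched to your actual box dimensions and checked against the Ewald error reported in mdout:

 &ewald
  nfft1=96, nfft2=96, nfft3=96,  ! coarser PME grid than pmemd would pick by default for the same box
  dsum_tol=1.0e-5,               ! direct sum tolerance (the default); with cut=12 this gives a smaller
                                 ! ew_coeff, which is what lets the grid be coarsened at the same accuracy
 /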

In terms of other minor tweaks: I'd set ntwr = nstlim unless you have a real need to have it at 5000. Restarts tend to be messy, output-file wise, so if a job crashes partway through a run I tend to just restart that run from the beginning. That way you also get a small performance boost from only writing a restart file at the end of the run.
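For your production input that would just mean something like this (step count taken from the 0.25B-step runs you describe below):

  nstlim=250000000,   ! total steps for the production run
  ntwr=250000000,     ! = nstlim, so a single restart file is written at the end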

You are using the Monte Carlo barostat, which is good - but note that I don't think the following settings do anything when barostat=2:

> taup=4, ! Berendsen coupling constant (ps)
> comp=45, ! compressibility

The other thing probably affecting performance is the use of semi-isotropic (XY-coupled, constant surface tension) pressure scaling. You may not need that with the latest versions of the various lipid force fields. I'm not sure about the latest Charmm lipids, but AMBER's Lipid14, for example, does not need surface tension scaling.
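If you do end up with a lipid force field that doesn't need the surface tension term, the pressure section simplifies to something like the following (a sketch only - check the recommendations for whichever force field you actually use):

  barostat=2,      ! MC barostat
  ntp=2,           ! anisotropic scaling; no csurften/gamma_ten/ninterface needed
  pres0=1.01325,   ! target pressure (bar)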

With regards to the GP100 and NVLink: yeah, if only it were $800 for that. The $800 is just for the cable that connects the two cards together. :-( The Quadro cards themselves will be many thousands of dollars, although cheaper than P100 cards (and they are PCI-E, so you can put them in a cost-effective node rather than the very expensive SXM2 nodes).

Ultimately the card of choice today is almost certainly the GTX-1080TI at ~$699 a card - but no NVLink is possible there, so you are limited to mediocre P2P scaling to 2 GPUs and/or just running multiple runs (1 per GPU) from different initial conditions, which is almost always a good idea anyway.

Hope that helps.

All the best
Ross

> On Apr 4, 2017, at 17:06, Chris Neale <candrewn.gmail.com> wrote:
>
> Dear Ross:
>
> Thanks for the tip. I presume the issue is just the Charmm forcefield
> defined cutoffs of 12 A, but a sanity check on my input file would
> certainly be welcome. In actual fact, I run my production simulations with
> a 4 fs timestep, but the benchmarking that I reported (and this associated
> input file) are for a 2 fs timestep. I actually thought the performance was
> quite good (it certainly is very good in comparison to gromacs, which was
> my engine of choice until recently), but if I can improve on this then that
> would be excellent.
>
> Additional questions if you have the time:
>
> 1) What cutoff and timestep do you use to get 20.84 ns/day for 400,000
> atoms on the P100?
>
> 2) Also, am I correct to understand that it's the NVlink that costs $800
> and the Quadro GP100 itself will cost many thousands, similar to the
> P100's, or is it really that the Quadro GP100 outperforms the GTX-1080TI at
> the same price point and therefore that the Quadro GP100 is now the
> commodity GPU of choice?
>
>
>
> An NPT simulation for common production-level simulations -- params
> generally from Charmm-gui + some modifications by CN
>  &cntrl
>   imin=0,          ! No minimization
>   irest=0,         ! irest=1 for restart and irest=0 for new start
>   ntx=1,           ! ntx=5 to use velocities from inpcrd and ntx=1 to not use them
>   ntb=2,           ! constant pressure simulation
>
>   ! Temperature control
>   ntt=3,           ! Langevin dynamics
>   gamma_ln=1.0,    ! Friction coefficient (ps^-1)
>   temp0=310.0,     ! Target temperature
>   tempi=310.0,     ! Initial temperature -- has no effect if ntx>3
>
>   ! Potential energy control
>   cut=12.0,        ! nonbonded cutoff, in Angstroms
>   fswitch=10.0,    ! for Charmm... note charmm-gui suggested cut=0.8 and no use of fswitch
>
>   ! MD settings
>   nstlim=10000,    ! number of steps (reduced for benchmarking; production is 0.25B steps, 1 us total)
>   dt=0.002,        ! time step (ps)
>
>   ! SHAKE
>   ntc=2,           ! Constrain bonds containing hydrogen
>   ntf=2,           ! Do not calculate forces of bonds containing hydrogen
>
>   ! Control how often information is printed
>   ntpr=5000,       ! Print energy frequency
>   ntwx=5000,       ! Print coordinate frequency
>   ntwr=5000,       ! Print restart file frequency
>   ntxo=2,          ! Write NetCDF format restart file
>   ioutfm=1,        ! Write NetCDF format trajectory (always do this!)
>
>   ! Wrap coordinates when printing them to the same unit cell
>   iwrap=1,
>
>   ! Constant pressure control. Note that ntp=3 requires barostat=1
>   barostat=2,      ! 1 = Berendsen, 2 = MC barostat
>   ntp=3,           ! 1=isotropic, 2=anisotropic, 3=semi-isotropic w/ surften
>   pres0=1.01325,   ! Target external pressure, in bar
>   taup=4,          ! Berendsen coupling constant (ps)
>   comp=45,         ! compressibility
>
>   ! Constant surface tension (needed for semi-isotropic scaling). Uncomment
>   ! for this feature. csurften must be nonzero if ntp=3 above
>   csurften=3,      ! Interfaces in 1=yz plane, 2=xz plane, 3=xy plane
>   gamma_ten=0.0,   ! Surface tension (dyne/cm). 0 gives pure semi-iso scaling
>   ninterface=2,    ! Number of interfaces (2 for bilayer)
>
>   ! Set water atom/residue names for SETTLE recognition
>   watnam='SOL',    ! Water residues are named SOL
>   owtnm='OW',      ! Water oxygens are named OW
>   hwtnm1='HW1',
>   hwtnm2='HW2',
>  &end
>  &ewald
>   vdwmeth = 0,
>  &end
>
>
> #####
>
> Thanks for your help!
> Chris.
>
> On Tue, Apr 4, 2017 at 5:46 PM, Ross Walker <ross.rosswalker.co.uk> wrote:
>
>> Hi Chris,
>>
>> That's really slow out of the blocks for 200K atoms so no wonder it scales
>> well. What does your mdin file look like for this? Are you running with a
>> large cutoff? For reference NPT 400K atoms gets 20.84 ns/day on a 16GB P100
>> and 20.14 on a 1080TI. A single Quadro GP100 gets 24.49 ns/day and two
>> Quadros with NVLink (an $800 option) gets 34.62 ns/day. So the scaling to 2
>> GPU looks similar to what you show here but this is for twice as many atoms
>> and doesn't scale as well to 4 GPUs over nvlink as you show. You might want
>> to check your settings - you might be able to get that 51.3 ns/day on just 1
>> GPU with the right settings.
>>
>> All the best
>> Ross
>>
>>> On Apr 4, 2017, at 13:28, Chris Neale <candrewn.gmail.com> wrote:
>>>
>>> Thank you Andreas and Ross!
>>>
>>> Indeed, even when all 3 GPUs can do Peer to Peer communication in any
>>> combination of pairs, when I ask for exactly 3 GPUs then Peer to Peer is
>>> reported by amber as not possible.
>>>
>>> At least on a DGX, I can get quite good scaling to 4 GPUs (which all
>> allow
>>> peer-to-peer). For example:
>>>
>>> 1 GPU: 22.7 ns/day
>>> 2 GPU: 34.4 ns/day (76 % efficient)
>>> 4 GPU: 51.3 ns/day (56 % efficient)
>>>
>>> Sure, the efficiency goes down, but 51 vs 34 ns/day is a noticeable
>>> improvement for this 200,000 atom system (2 fs timestep)
>>>
>>> On Tue, Apr 4, 2017 at 9:26 AM, Ross Walker <ross.rosswalker.co.uk>
>> wrote:
>>>
>>>> Hi Chris,
>>>>
>>>> The P2P algorithm used in AMBER 16 only supports power of 2 GPUs. As
>> such
>>>> you will always see poor performance on 3 GPUs. For such a machine,
>> indeed
>>>> for almost all machines, your best option is to run 3 independent
>>>> calculations, one per GPU, you'll get much better overall sampling that
>> way
>>>> since the multi-GPU scaling is never great. You could also run a 1 x
>> 2GPU
>>>> job and a 1 x 1 GPU job. On a DGX I wouldn't recommend going above 2
>> GPUs
>>>> per run. Sure it will scale to 4 but the improvement is not great and
>> you
>>>> mostly end up just wasting resources for a few extra %. On a DGX system
>> (or
>>>> any 8 GPU system for that matter) your best option with AMBER 16 is
>>>> probably to run either 8 x 1 GPU or 4 x 2 GPU or a combination of those.
>>>> Unless you are running a large GB calculation in which case you can get
>>>> almost linear scaling out to 8 GPUs - even over regular PCI-E (no need
>> for
>>>> gold plated DGX nodes).
>>>>
>>>> All the best
>>>> Ross
>>>>
>>>>
>>>>> On Apr 3, 2017, at 19:43, Chris Neale <candrewn.gmail.com> wrote:
>>>>>
>>>>> Dear AMBER users:
>>>>>
>>>>> I have a system with ~ 200,000 atoms that scales quite well on 4 GPUs
>> on
>>>> a
>>>>> DGX machine with Amber16. I now have access to a different node for
>>>> testing
>>>>> purposes that has 3 Tesla P100 GPUs. I find that 1 GPU gives 21
>> ns/day, 2
>>>>> GPUs give 31 ns/day and 3 GPUs give 21 ns/day. Strange thing is that 2
>>>> GPUs
>>>>> gives a consistent speed when I use GPUs 0,1 or 1,2 or 0,2 -- leading
>> me
>>>> to
>>>>> think that there is PCI-based peer-to-peer across all 3 GPUs (though I
>>>>> don't know how to verify that). So then why does performance drop off
>>>> with
>>>>> 3 GPUs? I don't currently have the ability to re-test with 3 GPUs on a
>>>> DGX,
>>>>> though I will look into testing that, since it could give a definitive
>>>>> answer.
>>>>>
>>>>> I'm wondering whether there is something obviously inherent to the code
>>>>> that doesn't like 3 GPUs (vs. 2 or 4)? Any thoughts?
>>>>>
>>>>> Thank you for your help,
>>>>> Chris.

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Apr 05 2017 - 16:30:02 PDT