Re: [AMBER] PME Direct Force takes up 40% CPU time? from yunshi11 . on 2013-11-19 (Amber Archive Nov 2013)

From: yunshi11 . <yunshi09.gmail.com>
Date: Tue, 19 Nov 2013 08:35:51 -0800

On Mon, Nov 18, 2013 at 10:10 AM, Jason Swails <jason.swails.gmail.com> wrote:

> On Mon, 2013-11-18 at 09:46 -0800, yunshi11 . wrote:
> > Hello everyone,
> >
> > I wonder if it is normal for my parallel pmemd calculation to have such
> > timing performance. First, I have:
> >
> > | NonSetup CPU Time in Major Routines, Average for All Tasks:
> > |
> > |     Routine           Sec        %
> > |     ------------------------------
> > |     DataDistrib   18042.04    30.33
> > |     Nonbond       39946.88    67.16
> > |     Bond             10.77     0.02
> > |     Angle           127.51     0.21
> > |     Dihedral        421.16     0.71
> > |     Shake           227.38     0.38
> > |     RunMD           700.59     1.18
> > |     Other             1.29     0.00
> > |     ------------------------------
> > |     Total         59477.62
> >
> >
> > But it seems that "Nonbond" represents only the nonbonded electrostatic
> > interaction (its CPU time equals the sum of PME Nonbond Pairlist + PME
> > Direct Force + PME Reciprocal Force + PME Load Balancing)?
> >
> >
> > So where is the timing for vdW interaction?
>
> The direct force/energy includes _all_ of the direct-space calculation:
> vdW and the direct portion of the PME electrostatic energy.
>
> The CPU code computes the direct-space term using the following scheme,
> illustrated with pseudo-code:
>
> for i=1; i<NATOM; i++
>    for j=i+1; j<=NATOM; j++
>       elec_nrg += direct_space_elec(charge(i), charge(j), rij)
>       vdw_nrg += compute_vdw(acoef(i,j), bcoef(i,j), rij)
>    end for
> end for
>
> As a result, there is no way to separate the timings of vdW and
> electrostatic energies in the direct sum. Putting the calculations in
> separate loops would waste a perfect opportunity to reduce cache
> misses.
>
>
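The fused loop described above can be sketched in Python. This is only an illustration, not the pmemd implementation: the function name, the `beta` Ewald coefficient value, and the omission of unit conversion factors, periodic imaging, and excluded pairs are all assumptions made to keep the example short.

```python
import math

def direct_space_energy(coords, charges, acoef, bcoef, cutoff=8.0, beta=0.35):
    """Accumulate electrostatic and vdW energy in a single pass over atom pairs.

    Assumptions for illustration only: no periodic images, no excluded
    pairs, no Coulomb constant, and an arbitrary Ewald coefficient beta.
    """
    n = len(coords)
    elec_nrg = 0.0
    vdw_nrg = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            # The distance is computed once and reused by both terms.
            rij = math.dist(coords[i], coords[j])
            if rij > cutoff:
                continue
            # Ewald direct-space term: erfc-screened Coulomb interaction.
            elec_nrg += charges[i] * charges[j] * math.erfc(beta * rij) / rij
            # 12-6 term in the A/r^12 - B/r^6 form.
            r6 = rij ** 6
            vdw_nrg += acoef[i][j] / (r6 * r6) - bcoef[i][j] / r6
    return elec_nrg, vdw_nrg
```

Because both terms share the same pair distance inside one loop body, a timer placed around the loop cannot attribute time to one term or the other.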
Understood. So in the TIMINGS section, the CPU time for PME Nonbond
Pairlist + PME Direct Force actually accounts for the time spent
calculating both electrostatic and vdW interactions within the cutoff
distance?

> > Then, I have:
> >
> > | PME Direct Force CPU Time, Average for All Tasks:
> > |
> > |     Routine              Sec        %
> > |     ---------------------------------
> > |     NonBonded Calc   23837.41    40.08
> > |     Exclude Masked     446.79     0.75
> > |     Other             1756.72     2.95
> > |     ---------------------------------
> > |     Total            26040.91    43.78
> >
> > | PME Reciprocal Force CPU Time, Average for All Tasks:
> > |
> > |     Routine              Sec        %
> > |     ---------------------------------
> > |     1D bspline         529.80     0.89
> > |     Grid Charges       505.84     0.85
> > |     Scalar Sum        1972.05     3.32
> > |     Gradient Sum       670.34     1.13
> > |     FFT               4564.17     7.67
> > |     ---------------------------------
> > |     Total             8242.20    13.86
> >
> > So these indicate a direct/reciprocal ratio of 3.16:1. Would this ratio
> > make it not very efficient?
>
> This is a loaded question. For low processor counts, it is more
> efficient to use a smaller cutoff to reduce the cost of the direct sum
> (which increases the cost of the reciprocal sum). However, the vdW
> energy is computed _only_ in the direct sum (there is no PME equivalent
> for the 12-6 potential implemented in Amber). So the direct space cutoff
> is limited by accuracy requirements of the vdW terms. A cutoff of 8
> Angstroms was deemed an appropriate compromise.
>
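To give a rough sense of why the cutoff matters so much for the direct sum, the sketch below estimates how the number of pairs inside the cutoff grows. It assumes a roughly uniform atom density, so the pair count scales with the volume of the cutoff sphere; the function name is hypothetical, and real timings also depend on pairlist handling and other overheads.

```python
def relative_direct_cost(cutoff, reference_cutoff=8.0):
    """Estimated direct-space pair count relative to a reference cutoff.

    Assumes uniform atom density, so the number of neighbours within the
    cutoff is proportional to the sphere volume, i.e. to cutoff**3.
    """
    return (cutoff / reference_cutoff) ** 3

for cutoff in (8.0, 9.0, 10.0, 12.0):
    print(f"cutoff {cutoff:4.1f} A: {relative_direct_cost(cutoff):.2f}x the pairs at 8 A")
```

Under this assumption, going from 8 to 10 Angstroms nearly doubles the direct-space pair count.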
> The second consideration is parallel scaling. The direct-space sum is very
> easily parallelizable -- just change the pseudo-code above to
>
> for i=1+rank; i<NATOM; i+=numtasks
>    for j=i+1; j<=NATOM; j++
>       ...
>
> where "rank" is the 'index' of the parallel thread and numtasks is the
> total number of CPUs that you have. Follow the double-loop with a
> reduction of the energy terms and you've effectively split up your
> workload quite evenly. The reciprocal sum is not nearly as scalable as
> the direct-space sum is. So the 'optimal' cutoff in terms of maximum
> ns/day depends on how many CPUs you're using, as well.
>
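The strided loop and the final reduction can be simulated in plain Python without MPI. This is a sketch of the decomposition idea only: `pair_energy` is a hypothetical stand-in for the real per-pair energy, the ranks run one after another in a single process, and pmemd's actual load balancing is more sophisticated than this.

```python
def pair_energy(i, j):
    """Hypothetical stand-in for the per-pair direct-space energy."""
    return 1.0 / (1 + abs(i - j))

def partial_sum(rank, numtasks, natom):
    """Energy and pair count for the outer-loop rows assigned to one rank."""
    total = 0.0
    pairs = 0
    # 0-based equivalent of "for i=1+rank; i<NATOM; i+=numtasks".
    for i in range(rank, natom - 1, numtasks):
        for j in range(i + 1, natom):
            total += pair_energy(i, j)
            pairs += 1
    return total, pairs

def reduce_all(numtasks, natom):
    """Run every rank in turn, then sum the partial energies (the reduction)."""
    results = [partial_sum(rank, numtasks, natom) for rank in range(numtasks)]
    energy = sum(e for e, _ in results)
    pair_counts = [p for _, p in results]
    return energy, pair_counts

if __name__ == "__main__":
    energy, pair_counts = reduce_all(numtasks=4, natom=100)
    print("total energy:", energy)
    print("pairs per rank:", pair_counts)
```

Every pair is visited by exactly one rank, so the reduced energy matches the serial result, and the per-rank pair counts come out close to one another because consecutive rows are dealt out round-robin.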
> > I assume pmemd would assign some PME-only CPUs for PME Reciprocal
> > calculations?
>
> I'm not sure what you mean here. In parallel, some of the cores handle
> all of the reciprocal-space work, and the remaining direct-space work is
> divided between the processors such that each processor is doing about
> the same amount of work. The load balancing is quite different between
> pmemd and sander (which is why pmemd scales better). But all of this
> only has to do with parallel scaling...
>
This is what I meant.

To improve performance, can we specify the number of reciprocal-space
CPUs and the number of direct-space (i.e., non-reciprocal) CPUs for
pmemd calculations?

Or does pmemd already choose an appropriate number of CPUs for each task
automatically?

> HTH,
> Jason
>
> --
> Jason M. Swails
> BioMaPS,
> Rutgers University
> Postdoctoral Researcher
>
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>
Received on Tue Nov 19 2013 - 09:00:02 PST