On Mon, 2013-11-18 at 09:46 -0800, yunshi11 . wrote:
> Hello everyone,
>
> I wonder if it is normal for my parallel pmemd calculation to have such
> timing performance. First, I have:
>
> | NonSetup CPU Time in Major Routines, Average for All Tasks:
> |
> | Routine Sec %
> | ------------------------------
> | DataDistrib 18042.04 30.33
> | Nonbond 39946.88 67.16
> | Bond 10.77 0.02
> | Angle 127.51 0.21
> | Dihedral 421.16 0.71
> | Shake 227.38 0.38
> | RunMD 700.59 1.18
> | Other 1.29 0.00
> | ------------------------------
> | Total 59477.62
>
>
> But it seems that the "Nonbond" only represents nonbonded electrostatic
> interaction (its CPU time equals that of PME Nonbond Pairlist + PME
> Direct Force + PME Reciprocal Force + PME Load Balancing)?
>
>
> So where is the timing for vdW interaction?
The direct force/energy includes _all_ of the direct-space calculation:
vdW and the direct portion of the PME electrostatic energy.
The CPU code computes the direct-space term using the following scheme,
illustrated here with pseudo-code:
   for i = 1; i < NATOM; i++
      for j = i+1; j <= NATOM; j++
         elec_nrg += direct_space_elec(charge(i), charge(j), rij)
         vdw_nrg  += compute_vdw(acoef(i,j), bcoef(i,j), rij)
      end for
   end for
As a result, there is no way to separate the timings of the vdW and
electrostatic energies in the direct sum. Putting them in separate loops
would mean looping over every pair (and computing rij) twice, throwing
away a perfect opportunity to reduce cache misses.
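If it helps to see it concretely, here is a bare-bones serial C sketch of
that combined loop (the names and the square acoef/bcoef arrays are purely
illustrative -- the real code works from a pairlist and atom-type tables and
handles periodic imaging, none of which I've included):

   #include <math.h>

   /* Minimal direct-space sum: the erfc-damped electrostatic term and the
    * 12-6 vdW term are accumulated for the same pair, from the same
    * distance, in the same pass -- which is why their timings cannot be
    * separated. */
   void direct_sum(int natom, const double *x, const double *q,
                   const double *acoef, const double *bcoef, double beta,
                   double cutoff, double *elec_nrg, double *vdw_nrg)
   {
       double cut2 = cutoff * cutoff;
       *elec_nrg = *vdw_nrg = 0.0;
       for (int i = 0; i < natom - 1; i++) {
           for (int j = i + 1; j < natom; j++) {
               double dx = x[3*i]   - x[3*j];
               double dy = x[3*i+1] - x[3*j+1];
               double dz = x[3*i+2] - x[3*j+2];
               double r2 = dx*dx + dy*dy + dz*dz;
               if (r2 > cut2) continue;
               double r  = sqrt(r2);
               double r6 = r2 * r2 * r2;
               /* short-range (direct) part of the PME electrostatics */
               *elec_nrg += q[i] * q[j] * erfc(beta * r) / r;
               /* 12-6 Lennard-Jones; pair coefficients assumed precombined */
               *vdw_nrg  += acoef[i*natom + j] / (r6 * r6)
                          - bcoef[i*natom + j] / r6;
           }
       }
   }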
> Then, I have:
>
> | PME Direct Force CPU Time, Average for All Tasks:
> |
> | Routine Sec %
> | ---------------------------------
> | NonBonded Calc 23837.41 40.08
> | Exclude Masked 446.79 0.75
> | Other 1756.72 2.95
> | ---------------------------------
> | Total 26040.91 43.78
>
> | PME Reciprocal Force CPU Time, Average for All Tasks:
> |
> | Routine Sec %
> | ---------------------------------
> | 1D bspline 529.80 0.89
> | Grid Charges 505.84 0.85
> | Scalar Sum 1972.05 3.32
> | Gradient Sum 670.34 1.13
> | FFT 4564.17 7.67
> | ---------------------------------
> | Total 8242.20 13.86
>
> So these indicate a direct/reciprocal ratio of 3.16:1. Would this ratio
> make it not very efficient?
This is a loaded question. For low processor counts, it is more
efficient to use a smaller cutoff to reduce the cost of the direct sum
(which increases the cost of the reciprocal sum). However, the vdW
energy is computed _only_ in the direct sum (there is no PME equivalent
for the 12-6 potential implemented in Amber). So the direct space cutoff
is limited by accuracy requirements of the vdW terms. The value of 8
was deemed an appropriate compromise.
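To get a feel for the tradeoff: the direct-space work per atom grows with the
cube of the cutoff, while the reciprocal (FFT) cost is set by the grid and is
largely independent of it. A quick back-of-the-envelope estimate (the ~0.1
atoms/A^3 density is my assumption for a typical solvated system, not
anything pmemd actually computes):

   #include <stdio.h>
   #include <math.h>

   int main(void)
   {
       const double pi = 3.14159265358979;
       const double density = 0.1;   /* atoms per cubic Angstrom (assumed) */
       /* neighbors inside the cutoff sphere, relative to the 8 A default */
       for (double cut = 8.0; cut <= 12.0; cut += 2.0) {
           double nbrs = 4.0 / 3.0 * pi * cut * cut * cut * density;
           printf("cutoff %4.1f A -> ~%4.0f neighbors/atom (%.2fx the 8 A cost)\n",
                  cut, nbrs, pow(cut / 8.0, 3));
       }
       return 0;
   }

Going from 8 to 10 A roughly doubles the direct-space work, which is why the
cutoff you pick shifts time between the two tables you quoted.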
The second consideration is parallel scaling. The direct-space sum is easy to
parallelize -- just change the pseudo-code above to
   for i = 1+rank; i < NATOM; i += numtasks
      for j = i+1; j <= NATOM; j++
         ...
      end for
   end for
where "rank" is the 'index' of the parallel thread and numtasks is the
total number of CPUs that you have. Follow the double-loop with a
reduction of the energy terms and you've effectively split up your
workload quite evenly. The reciprocal sum is not nearly as scalable as
the direct-space sum is. So the 'optimal' cutoff in terms of maximum
ns/day depends on how many CPUs you're using, as well.
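For concreteness, a bare-bones MPI sketch of that round-robin split plus the
energy reduction (0-based indexing, illustrative names only -- pmemd's real
decomposition and load balancer are considerably more sophisticated):

   #include <mpi.h>

   /* Each rank takes atoms i = rank, rank + numtasks, rank + 2*numtasks, ...,
    * computes its share of the direct sum, then all ranks combine their
    * partial energies with a single reduction. */
   void parallel_direct_sum(int natom, double *elec_total, double *vdw_total)
   {
       int rank, numtasks;
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

       double local[2] = {0.0, 0.0};   /* local[0] = elec, local[1] = vdw */
       for (int i = rank; i < natom - 1; i += numtasks) {
           for (int j = i + 1; j < natom; j++) {
               /* ... same per-pair work as the serial sketch above ... */
           }
       }

       double total[2];
       MPI_Allreduce(local, total, 2, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
       *elec_total = total[0];
       *vdw_total  = total[1];
   }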
> I assume pmemd would assign some PME-only CPUs for PME Reciprocal
> calculations?
I'm not sure what you mean here. In parallel, some of the cores handle
all of the reciprocal-space work, and the remaining direct-space work is
divided among the processors so that each one does roughly the same amount
of work. The load balancing is quite different between
pmemd and sander (which is why pmemd scales better). But all of this
only has to do with parallel scaling...
HTH,
Jason
--
Jason M. Swails
BioMaPS,
Rutgers University
Postdoctoral Researcher
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Nov 18 2013 - 10:30:03 PST