
From: Jason Swails <jason.swails.gmail.com>

Date: Mon, 18 Nov 2013 13:10:56 -0500

On Mon, 2013-11-18 at 09:46 -0800, yunshi11 . wrote:

> Hello everyone,
>
> I wonder if it is normal for my parallel pmemd calculation to have such
> timing performance. First, I have:
>
> | NonSetup CPU Time in Major Routines, Average for All Tasks:
> |
> |     Routine          Sec         %
> |     ------------------------------
> |     DataDistrib  18042.04    30.33
> |     Nonbond      39946.88    67.16
> |     Bond            10.77     0.02
> |     Angle          127.51     0.21
> |     Dihedral       421.16     0.71
> |     Shake          227.38     0.38
> |     RunMD          700.59     1.18
> |     Other            1.29     0.00
> |     ------------------------------
> |     Total        59477.62
>
> But it seems that the "Nonbond" only represents nonbonded electrostatic
> interaction (its CPU time equals that of PME Nonbond Pairlist + PME
> Direct Force + PME Reciprocal Force + PME Load Balancing)?
>
> So where is the timing for the vdW interaction?

The direct force/energy includes _all_ of the direct-space calculation:
vdW and the direct portion of the PME electrostatic energy. The CPU
code computes the direct-space term using the following scheme,
illustrated with pseudo-code:

    for i = 1; i < NATOM; i++
        for j = i+1; j <= NATOM; j++
            elec_nrg = direct_space_elec(charge(i), charge(j), rij)
            vdw_nrg = compute_vdw(acoef(i,j), bcoef(i,j), rij)
        end for
    end for

As a result, there is no way to separate the timings of the vdW and
electrostatic energies in the direct sum. Both terms need the same
interatomic distance rij, so computing them in the same pass over each
pair also reduces cache misses; splitting them into separate loops would
waste that opportunity.

> Then, I have:
>
> | PME Direct Force CPU Time, Average for All Tasks:
> |
> |     Routine             Sec         %
> |     ---------------------------------
> |     NonBonded Calc  23837.41    40.08
> |     Exclude Masked    446.79     0.75
> |     Other            1756.72     2.95
> |     ---------------------------------
> |     Total           26040.91    43.78
>
> | PME Reciprocal Force CPU Time, Average for All Tasks:
> |
> |     Routine             Sec         %
> |     ---------------------------------
> |     1D bspline        529.80     0.89
> |     Grid Charges      505.84     0.85
> |     Scalar Sum       1972.05     3.32
> |     Gradient Sum      670.34     1.13
> |     FFT              4564.17     7.67
> |     ---------------------------------
> |     Total            8242.20    13.86
>
> So these indicate a direct/reciprocal ratio of 3.16:1. Would this ratio
> make it not very efficient?

This is a loaded question, and there are two considerations. The first
is accuracy. For low processor counts, it is more efficient to use a
smaller cutoff to reduce the cost of the direct sum (at the expense of a
costlier reciprocal sum). However, the vdW energy is computed _only_ in
the direct sum (there is no PME equivalent for the 12-6 potential
implemented in Amber), so the direct-space cutoff is limited by the
accuracy requirements of the vdW terms. The default value of 8
Angstroms was deemed an appropriate compromise.

The second consideration is parallel scaling. The direct-space sum is
very easily parallelizable -- just change the pseudo-code above to

    for i = 1+rank; i < NATOM; i += numtasks
        for j = i+1; j <= NATOM; j++
            ...

where "rank" is the index of the parallel task and numtasks is the
total number of CPUs you have. Follow the double loop with a reduction
of the energy terms and you've split up the workload quite evenly. The
reciprocal sum is not nearly as scalable as the direct-space sum, so
the 'optimal' cutoff in terms of maximum ns/day also depends on how
many CPUs you're using.

> I assume pmemd would assign some PME-only CPUs for PME Reciprocal
> calculations?
*

I'm not sure what you mean here. In parallel, some of the cores handle
all of the reciprocal-space work, and the direct-space work is divided
among the processors such that each one does about the same total
amount of work. The load balancing is quite different between pmemd
and sander (which is why pmemd scales better). But all of this
pertains only to parallel scaling...

HTH,

Jason


--
Jason M. Swails
BioMaPS, Rutgers University
Postdoctoral Researcher

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber

Received on Mon Nov 18 2013 - 10:30:03 PST
