Hi Yun,
What do you mean by 48 CPUs + 12 GPUs? Do you mean you are trying to run
pmemd.cuda.MPI across 48 cores connected to 12 GPUs? For starters, parameters
like block_fft only apply to CPU runs; they are meaningless in GPU runs.
Secondly, I would suggest reading the following page:
http://ambermd.org/gpus/ which explains how to run AMBER GPU calculations.
Essentially, 48 CPUs + 12 GPUs does not make sense, and even if this were 12
cores + 12 GPUs the calculation would be unlikely to scale unless it was a
replica exchange run.
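For reference, the usual pattern on GPUs is one MPI task per GPU (or just the
serial pmemd.cuda for a single GPU). The commands below are only an
illustrative sketch; the input/output file names are placeholders:

  # single GPU (serial GPU code)
  export CUDA_VISIBLE_DEVICES=0
  pmemd.cuda -O -i mdin -o mdout -p prmtop -c inpcrd -r restrt -x mdcrd

  # e.g. 2 GPUs: one MPI task per GPU
  mpirun -np 2 pmemd.cuda.MPI -O -i mdin -o mdout -p prmtop -c inpcrd \
      -r restrt -x mdcrd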
In terms of the parameters for CPU-only runs: the README is attached,
although it has not been maintained and so may be out of date in places.
To summarize (note that these are mostly set automatically based on
processor count, so adjusting them by hand may or may not help; an example
input snippet follows the list):
block_fft = 0 - use slab FFT.
          = 1 - use block FFT; requires at least 4 processors, and not
                permitted for minimizations or if nrespa > 1.

fft_blk_y_divisor = 2 .. nfft2 (after axis optimization reorientation);
                    default = 2 or 4 depending on numtasks.

excl_recip = 0..1 - Exclusive reciprocal tasks flag. When set to 1, tasks
                    that do reciprocal force calculations will not also do
                    direct force calculations. This has some benefits at
                    higher task count. At lower task count, setting this
                    flag can result in significant underutilization of the
                    reciprocal tasks. This flag is automatically cleared if
                    block FFTs are not in use.

excl_master = 0..1 - Exclusive master task flag. When set to 1, the master
                     task will not do force and energy calculations. At high
                     scaling, this ensures that no tasks are waiting for the
                     master to initiate collective communications events; the
                     master is thus basically dedicated to handling load
                     balancing and output. At lower task count, this is
                     obviously wasteful. This flag is automatically cleared
                     if block FFTs are not in use or if excl_recip .ne. 1.
                     Note that block FFTs being in use implies that you are
                     not doing a minimization and are not using nrespa > 1.

atm_redist_freq = 16..1280 - The frequency (in pairlist build events) at
                     which atom ownership is reassigned to tasks. As a run
                     progresses, diffusion causes the atoms originally
                     collocated and assigned to one task to occupy a larger
                     volume. With time, this starts to cause a higher
                     communications load, though the increase is lower than
                     one might expect. Currently, by default, atoms are
                     reassigned to tasks every 320 pairlist builds at low to
                     medium task count and every 32 pairlist builds at higher
                     task counts (currently defined as >= 96 tasks,
                     redefinable in config.h). The user can, however, specify
                     a specific value. At low task count, frequent atom
                     redistribution tends to have a noticeable cost and
                     little benefit; at higher task count, the cost is lower
                     and the benefit is higher.
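If you did want to override them by hand, and assuming (as described in the
attached README) that they are read from the &ewald namelist of the mdin
file, the input would look something like the sketch below. The &cntrl
values are generic placeholders for a production run, not a recommendation:

  Illustrative production mdin with explicit &ewald performance parameters
   &cntrl
     imin=0, irest=1, ntx=5,
     nstlim=500000, dt=0.002,
     ntc=2, ntf=2, cut=8.0,
     ntt=3, gamma_ln=2.0, temp0=300.0,
     ntpr=1000, ntwx=1000, ntwr=10000,
   /
   &ewald
     block_fft=1, fft_blk_y_divisor=4,
     excl_recip=1, excl_master=1,
     atm_redist_freq=32,
   /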
All the best
Ross
On 11/19/13 8:45 AM, "yunshi11 ." <yunshi09.gmail.com> wrote:
>Hi there,
>
>I'm curious about these "PMEMD ewald parallel performance parameters",
>since I found pmemd assigns them differently on different computing
>facilities (exactly the same system, i.e. the MD run starts with the same
>.restrt file).
>
>With 8*16=128 CPUs, I have them as:
>
>| block_fft = 1
>| fft_blk_y_divisor = 4
>| excl_recip = 1
>| excl_master = 1
>| atm_redist_freq = 32
>
>
>In another run with 48 CPUs + 12 GPUs, I have:
>
>| block_fft = 0
>| fft_blk_y_divisor = 4
>| excl_recip = 0
>| excl_master = 0
>| atm_redist_freq = 320
>
>
>So I really wonder what these parameters mean, as I cannot access the
>"README under pmemd/src" for some reason.
>
>Best,
>Yun
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
- application/octet-stream attachment: README
Received on Tue Nov 19 2013 - 11:00:02 PST