Re: [AMBER] NTT=3 or NTT=1 from Jason Swails on 2010-05-11 (Amber Archive May 2010)

From: Jason Swails <jason.swails.gmail.com>
Date: Wed, 12 May 2010 00:09:55 -0400

On Tue, May 11, 2010 at 11:40 PM, Tom Joseph <ttjoseph.gmail.com> wrote:

> 2010/5/11 Ross Walker <ross.rosswalker.co.uk>:
> > If you set NO_NTT3_SYNC then the code just uses a different random seed on
> > each thread and thus an independent random number stream (it assumes you are
> > setting ig=-1 in the cntrl namelist). This removes a serial bottleneck from
> > the code and gives better scaling at higher thread counts for ntt=3.
>
> Thanks for the explanation - this looks worth a try.
>
> 2010/5/11 Jason Swails <jason.swails.gmail.com>:
> > A quick grep shows that there's an ifdef NO_NTT3_SYNC and an ifndef
> > NO_NTT3_SYNC in the pmemd source directory. What's the verdict on this?
>
> Looking at the pmemd source, it seems like the only effect of
> NO_NTT3_SYNC is three extra calls to gauss() - though I may well have
> missed something. I don't see any difference in the use of MPI
> primitives. Is gauss really so slow that this would have a significant
>

There is no difference in the use of MPI primitives, true, but it looks as
though it might be significant. My interpretation of the small block of
code this occurs in is the following:

This occurs in a loop over all atoms. If the atom belongs to a specific
processor, then that processor needs to shell out 3 random numbers,
presumably to provide a random hit in each cartesian direction (though I may
be misunderstanding something here). If it does not own that atom, it has
to simply create 3 unused random numbers for the sake of remaining in-sync
with the rest of the threads. This seems like it can add up to a LOT of
extra calls to gauss for some threads in a highly-multithreaded situation.
Suppose, for instance, that one of the processors is responsible for 400
atoms of a 20,000 atom system (so we're using ~50 processors). This means
that for the 19,600 atoms it loops through but does not own, that processor
has to shell out 58,800 random numbers that aren't used at all (same with
every other processor, so for 50 processors that's 2.94 E 6 random numbers
that aren't used... each time step). This is an exaggerated example, I'm
sure, but the point remains.
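
To make the bookkeeping concrete, here is a toy Python sketch of my reading of
that loop. It is not the pmemd code: the function name langevin_draws, the
"owned" set, and the use of Python's random.gauss in place of gauss() are all
assumptions made for illustration, and it models a single processor only.

```python
import random

def langevin_draws(natom, owned, sync=True, seed=0):
    """Count Gaussian draws made by one processor in one time step.

    natom : total atom count (assumed to correspond to atm_cnt)
    owned : set of atom indices this processor is responsible for
    sync  : True mimics the default path, False mimics NO_NTT3_SYNC
    All names here are hypothetical; this is not the pmemd implementation.
    """
    rng = random.Random(seed)
    used = wasted = 0
    for atom in range(natom):
        if atom in owned:
            # Three random kicks, one per Cartesian direction.
            kick = [rng.gauss(0.0, 1.0) for _ in range(3)]
            used += 3
        elif sync:
            # Draw and discard so this processor's stream stays in step
            # with the stream on every other processor.
            for _ in range(3):
                rng.gauss(0.0, 1.0)
            wasted += 3
    return used, wasted

natom = 20000
owned = set(range(400))  # this processor owns 400 of the 20,000 atoms

print(langevin_draws(natom, owned, sync=True))   # (1200, 58800)
print(langevin_draws(natom, owned, sync=False))  # (1200, 0)
```

With synchronization on, the processor makes 58,800 throwaway draws per step
on top of the 1,200 it actually uses; with it off, the throwaway draws vanish.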

This is, of course, assuming that atm_cnt is the same as natom in
sander-speak (and therefore is the total atom count). If this is the case,
defining NO_NTT3_SYNC seems very much worthwhile (since pmemd is the code
that can conceivably scale to far more processors than sander, these effects
may be more pronounced there).

Am I anywhere close? Possibly not. Comments welcome.

Thanks!
Jason

> effect? If so I wonder if jury-rigging in, say, an MKL seeded vector
> RNG would be helpful?
>
> (The obvious thing for me to do next is to try it out and benchmark of
> course...)
>
> Thanks,
> --Tom
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>

-- 
Jason M. Swails
Quantum Theory Project,
University of Florida
Ph.D. Graduate Student
352-392-4032

Received on Tue May 11 2010 - 21:30:03 PDT