Re: [AMBER] size of real

From: Robert Duke <rduke.email.unc.edu>
Date: Sun, 2 May 2010 08:31:47 -0400

Hi Tom,
Yes, you hit at least one of the nails on the head :-) First of all, for
current generation CPUs, the difference in actual execution time between dp
(double precision) and sp (single precision) arithmetic is pretty much
negligible, so you get essentially no gain in execution time from the
arithmetic itself. The remaining major issues are then:
1) the number of communications events per step, and the interconnect
   latency,
2) the quality, or tightness, of synchronization attainable, which depends
   not only on the quality of your application's load balancer but also on
   OS indeterminacy,
3) the actual communications volume, and the interconnect bandwidth, and
4) non-parallelizable tasks (Amdahl's law factors).

PMEMD uses asynchronous MPI (which allows compute and communications to
overlap), but nonetheless a pretty much "standard" MPI communications model.
Two things that really limit the scaling are: 1) the fact that the
communications events occur at discrete points in the steps, and 2) basic
MPI collectives overhead (MPI collectives typically require a number of
communications cycles that scales at best as O(log2 n), since there is a
binary tree-structured communications pattern).
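
To make the "compute/communications overlap, but synchronization at discrete
points" picture a bit more concrete, here is a minimal sketch of a per-step
nonblocking exchange in generic MPI. It is written in C for brevity; pmemd
itself is Fortran, and the function name, buffer layout, and neighbor list
here are invented for illustration, not pmemd's actual communication scheme:

#include <mpi.h>

#define MAX_NEIGHBORS 32   /* plenty for this sketch */

/* One step's data exchange with nonblocking MPI: post all the receives and
   sends, do whatever local work does not depend on the incoming data, then
   wait.  The MPI_Waitall is the discrete synchronization point - no matter
   how much was overlapped, every task has to get here before the step can
   finish. */
void exchange_step(double *send_buf, double *recv_buf, int n_per_neighbor,
                   const int *neighbors, int n_neighbors, MPI_Comm comm)
{
    MPI_Request req[2 * MAX_NEIGHBORS];
    int i, nreq = 0;

    for (i = 0; i < n_neighbors; i++)
        MPI_Irecv(recv_buf + i * n_per_neighbor, n_per_neighbor, MPI_DOUBLE,
                  neighbors[i], 0, comm, &req[nreq++]);
    for (i = 0; i < n_neighbors; i++)
        MPI_Isend(send_buf + i * n_per_neighbor, n_per_neighbor, MPI_DOUBLE,
                  neighbors[i], 0, comm, &req[nreq++]);

    /* ... compute the purely local interactions here, overlapping with the
       transfers in flight ... */

    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);

    /* ... now do the parts of the step that need the received data ... */
}

That MPI_Waitall, plus any collectives with their O(log2 n) rounds, happens
every step, which is where the latency and synchronization costs come from.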

With NAMD, they have their own communications layer that basically has the
concept of a task queue, and this allows for better decoupling of
communications and execution; this makes them less susceptible to
synchronization and latency issues, but ultimately they should hit an
interconnect bandwidth wall.

I know how to do all this sort of thing, but unfortunately I am not funded to
do the work in pmemd. My best guess, given what I know, is that pmemd could
be made between 2 and 4 times faster still with a "bit" of work. Then we
would be bumping up rather firmly against the bandwidth wall, and it would
make sense to start looking at using slightly less precision to decrease the
strain on the interconnect bandwidth.

I actually did a compression algorithm a while back, but the computation cost
overwhelmed any bandwidth gains; it was done in C, which probably didn't
help, in that it is harder to actually get the benefit of the Intel vector
instruction sets (SSE2, SSE3, etc.) from C (don't get me wrong, I really like
C for a lot of things). I still would not recommend a drop all the way to
single precision, but would instead look either at this compression mechanism
again (with which, if memory serves, I could cut bandwidth requirements by
somewhere between 25 and 33%) or at a fixed-point format as used by Shaw et
al.
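
To give a flavor of what dropping to 4 bytes per transmitted value can look
like, here is a minimal fixed-point packing sketch. This is only the generic
idea, not Shaw et al.'s actual scheme and not anything that exists in pmemd;
the scale factor and function names are invented for illustration:

#include <stdint.h>
#include <math.h>

/* Assumed scale: ~1e-6 resolution over a range of roughly +/-2048.  A real
   implementation would pick the scale from the ranges actually seen in the
   coordinates or forces being sent. */
#define FIXED_SCALE (1 << 20)

/* Scale and round doubles into 32-bit fixed-point integers before sending;
   the receiving side divides by the same scale to recover doubles. */
static void pack_fixed(const double *x, int32_t *out, int n)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = (int32_t)lround(x[i] * FIXED_SCALE);
}

static void unpack_fixed(const int32_t *in, double *x, int n)
{
    int i;
    for (i = 0; i < n; i++)
        x[i] = (double)in[i] / FIXED_SCALE;
}

The int32 buffers then go over the interconnect in place of 8-byte doubles,
halving the volume; the catch is that you have to know the range of the data
well enough to pick the scale, and the rounding is where the precision
question comes back in.
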
Regards - Bob
----- Original Message -----
From: "Tom Joseph" <ttjoseph.gmail.com>
To: "AMBER Mailing List" <amber.ambermd.org>
Sent: Sunday, May 02, 2010 2:17 AM
Subject: Re: [AMBER] size of real


2010/5/1 Robert Duke <rduke.email.unc.edu>:
> In theory, you COULD cut the communications load in half by going to 4 bytes
> on all transmitted data, but I got very little gain when I tried it. There

I guess that illustrates how important network latency can be in MD
simulations relative to sheer throughput...
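
A rough way to see it, with made-up numbers rather than anything measured: if
the cost of one message is modeled as latency plus size over bandwidth, e.g.

double msg_time_us(double bytes)
{
    const double alpha_us = 2.0;        /* assumed per-message latency, ~2 us */
    const double bytes_per_us = 1000.0; /* assumed ~1 GB/s per link */
    return alpha_us + bytes / bytes_per_us;
}

then an 8 KB message costs about 10 us while a 4 KB one costs about 6 us, so
halving the data saves noticeably less than half the time, and for the small
messages typical at high processor counts it saves almost nothing.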

Anyhow your response is very interesting and informative. Thanks!

--Tom

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Sun May 02 2010 - 06:00:11 PDT