On Sat, Oct 01, 2011, Jason Swails wrote:
I would like to augment and modify Jason's very helpful summary in a few
places:
> f(*) stores all of the forces. All forces are calculated, and
> this array is populated, in force.f (subroutine force).... Each
> processor then proceeds to calculate the force (according to all of the
> potential terms that are included) on each atom that *that processor is
> responsible for*.
This is not correct for sander: each processor calculates forces for
whatever part of the energy function has been assigned to it. At this
point, no account is taken of which atoms are involved, or whether they
"belong" to that processor or not. Every processor knows the coordinates
of all atoms, and is permitted to update the force array for any atoms it
chooses.
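In case a concrete picture helps, here is a toy sketch of that
replicated-data pattern. This is not the sander source: the atom count,
the stand-in coordinates, and the round-robin hand-out of pair terms
below are invented purely for illustration. Every rank holds coordinates
for all atoms, zeroes a full-length force array, and accumulates
contributions only for the energy terms it was assigned, touching
whichever atoms those terms involve:

   program partial_forces
     use mpi
     implicit none
     integer, parameter :: natom = 8
     double precision :: x(3*natom)   ! coordinates of ALL atoms, known to every rank
     double precision :: f(3*natom)   ! full-length force array on every rank
     integer :: ierr, rank, nproc, iterm, i, j, k

     call MPI_Init(ierr)
     call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
     call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)

     do k = 1, 3*natom
        x(k) = dble(k)      ! arbitrary stand-in coordinates
     end do
     f = 0.0d0              ! every rank starts from a zeroed force array

     ! Pretend the energy function is a list of bonded pair terms; each rank
     ! takes a round-robin slice of the *terms*, not a slice of the atoms.
     do iterm = rank + 1, natom - 1, nproc
        i = iterm            ! the two atoms in this (invented) pair term
        j = iterm + 1
        ! Accumulate equal-and-opposite contributions on both atoms,
        ! regardless of which rank "owns" them.
        f(3*i-2) = f(3*i-2) + 0.1d0*(x(3*j-2) - x(3*i-2))
        f(3*j-2) = f(3*j-2) - 0.1d0*(x(3*j-2) - x(3*i-2))
     end do

     ! f now holds only this rank's partial sum; the force distribution
     ! step (fdist, discussed below) combines the partial sums.
     call MPI_Finalize(ierr)
   end program partial_forces

The partial arrays only become the real forces once they are summed
across ranks, which is where the force distribution step comes in.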
>
> If you want to see what's happening, see the subroutines "fdist"
> and "fsum" in $AMBERHOME/src/sander/parallel.f. (The force distribution
> routine is called in force.f -- just search for the fdist call). The key to
> this is that the force distribution is done via a MPI_Reduce_scatter call
> (see around line 550 in parallel.f), rather than a call to MPI_Allreduce().
> The MPI_Reduce_scatter sums up all of the forces on all threads to the
> master thread, then scatters out the relevant parts for each slave processor
> so that it can propagate the coordinates of the atoms it owns. An
> MPI_Allreduce() call would have summed up all forces from all threads *to*
> all threads, which is a more costly communication (and this is what's done
> for the very first dynamics step if you're not doing a restart).
Minor addendum: the mpi_reduce_scatter() routine is logically what is
done, but is actually called only if the number of processes is not a
power of two. For the power-of-two case, the same result is achieved
through a binary tree algorithm that does not require reducing everything
to the master thread, then rebroadcasting. But Jason's basic point is
spot on: understanding the code can be approached by thinking about the
difference between mpi_reduce_scatter() and mpi_allreduce().
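To make that comparison concrete, here is a hedged sketch of the two
collectives. This is not the fdist code itself; the array sizes and the
near-even split of the coordinate array across ranks are my own toy
choices. The first call sums everyone's partial forces but hands each
rank only the block it needs for its own atoms; the second sums them and
leaves the whole array on every rank:

   program fdist_sketch
     use mpi
     implicit none
     integer, parameter :: natom = 8
     double precision :: f(3*natom)        ! this rank's partial forces on all atoms
     double precision :: fsum(3*natom)     ! fully summed forces (allreduce case)
     double precision, allocatable :: fmine(:)  ! summed forces for this rank's block
     integer, allocatable :: counts(:)
     integer :: ierr, rank, nproc

     call MPI_Init(ierr)
     call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
     call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)

     f = dble(rank + 1)                    ! stand-in partial forces

     ! Toy ownership split: near-even blocks of the coordinate array,
     ! with any remainder given to the last rank.
     allocate(counts(nproc))
     counts = 3*natom / nproc
     counts(nproc) = counts(nproc) + mod(3*natom, nproc)
     allocate(fmine(counts(rank+1)))

     ! What fdist does (logically): sum everyone's partial forces, but
     ! hand each rank only the block for the atoms it will move.
     call MPI_Reduce_scatter(f, fmine, counts, MPI_DOUBLE_PRECISION, &
                             MPI_SUM, MPI_COMM_WORLD, ierr)

     ! The costlier alternative: sum the partial forces and leave the
     ! WHOLE summed array on every rank (used only on the very first step).
     call MPI_Allreduce(f, fsum, 3*natom, MPI_DOUBLE_PRECISION, &
                        MPI_SUM, MPI_COMM_WORLD, ierr)

     call MPI_Finalize(ierr)
   end program fdist_sketch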
>
> (as you can see in runmd), each
> processor propagates only the atomic positions of the atoms it owns. (From
> runmd.f:
>
> do i3 = istart3, iend3
>    f(i3) = x(i3)
>    if (i3 >= cphfirst_sol3 - 3) &
>       x(i3) = x(i3) + v(i3)*dtx
> end do
>
> where istart3 and iend3 are set based on atom ownership). Then of course
> the positions have to be sent from everyone to everyone (xdist does this via
> an MPI_Allgatherv call).
The same very minor point as above: mpi_allgatherv() shows logically what is
being done, but different code is actually used when the number of processes
is a power of two.
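Purely as an illustration of that coordinate step (the contiguous
ownership ranges, counts, and time step below are invented, and this is
not the xdist code), each rank integrates only its own slice of the
coordinate array and an allgatherv-style exchange then gives every rank
the full updated set for the next force call:

   program xdist_sketch
     use mpi
     implicit none
     integer, parameter :: natom = 8
     double precision :: x(3*natom), v(3*natom), xall(3*natom)
     double precision, parameter :: dtx = 0.001d0
     integer, allocatable :: counts(:), displs(:)
     integer :: ierr, rank, nproc, p, istart3, iend3, i3

     call MPI_Init(ierr)
     call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
     call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)

     x = 0.0d0               ! stand-in coordinates
     v = 1.0d0               ! stand-in velocities

     ! Invented contiguous ownership ranges over the coordinate array,
     ! with any remainder given to the last rank.
     allocate(counts(nproc), displs(nproc))
     counts = 3*natom / nproc
     counts(nproc) = counts(nproc) + mod(3*natom, nproc)
     displs(1) = 0
     do p = 2, nproc
        displs(p) = displs(p-1) + counts(p-1)
     end do

     ! Each rank updates positions only for the coordinates it owns
     ! (the analogue of the istart3/iend3 loop quoted above).
     istart3 = displs(rank+1) + 1
     iend3   = displs(rank+1) + counts(rank+1)
     do i3 = istart3, iend3
        x(i3) = x(i3) + v(i3)*dtx
     end do

     ! Everyone sends its owned slice to everyone else, so each rank
     ! again holds up-to-date coordinates for all atoms.
     call MPI_Allgatherv(x(istart3), counts(rank+1), MPI_DOUBLE_PRECISION, &
                         xall, counts, displs, MPI_DOUBLE_PRECISION, &
                         MPI_COMM_WORLD, ierr)

     call MPI_Finalize(ierr)
   end program xdist_sketch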
[As an aside, a "good" implementation of MPI could detect the power-of-two
condition and use the appropriate more efficient algorithm. This was not the
case for most MPI implementations back when this code was put together 15
years ago. I'm not sure where things stand now -- some eager grad student
might want to time current mpi_reduce_scatter() routines against the binary
tree version to see which is better.]
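If someone does take that up, a rough harness along the following lines
would show the shape of such a measurement. The buffer size and
repetition count are arbitrary, and the hand-coded binary tree from
parallel.f is not reproduced here; it would simply be dropped into the
same timing loop in place of the library call:

   program time_fdist
     use mpi
     implicit none
     integer, parameter :: n = 3*100000   ! roughly 100k atoms' worth of forces (arbitrary)
     integer, parameter :: nrep = 100     ! repetitions, also arbitrary
     double precision, allocatable :: f(:), fmine(:)
     integer, allocatable :: counts(:)
     integer :: ierr, rank, nproc, irep
     double precision :: t0, t_rs

     call MPI_Init(ierr)
     call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
     call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)

     allocate(f(n), counts(nproc))
     counts = n / nproc
     counts(nproc) = counts(nproc) + mod(n, nproc)
     allocate(fmine(counts(rank+1)))
     f = 1.0d0

     ! Time the library collective; the binary-tree routine from
     ! parallel.f could be substituted inside the same loop for comparison.
     call MPI_Barrier(MPI_COMM_WORLD, ierr)
     t0 = MPI_Wtime()
     do irep = 1, nrep
        call MPI_Reduce_scatter(f, fmine, counts, MPI_DOUBLE_PRECISION, &
                                MPI_SUM, MPI_COMM_WORLD, ierr)
     end do
     t_rs = MPI_Wtime() - t0

     if (rank == 0) write(*,'(a,i4,a,f10.4,a)') &
          'nproc =', nproc, '   reduce_scatter total:', t_rs, ' s'

     call MPI_Finalize(ierr)
   end program time_fdist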
>
> Note, though, that pmemd's parallelization scheme is _completely_ different
> than sander's. For instance, it works via a spatial decomposition scheme
> rather than sequence decomposition (and pmemd's scaling and performance is
> obviously much better than sander's). Thus, any lessons learned about
> sander parallelization must be relearned if you want to start modifying or
> understanding pmemd.
...dac