Re: AMBER: Any experience on Dell two quad core system? from Robert Duke on 2007-08-14 (Amber Archive Aug 2007)

From: Robert Duke <rduke.email.unc.edu>
Date: Tue, 14 Aug 2007 12:13:07 -0400

Hi David,
Okay, if I read your comparison correctly, I don't think it is a good one,
because you are comparing 8 independent processes. What you have here in the
case of pmemd would be an 8x larger amount of memory in the pairlist on a 2x
dual-quad box, as compared to 8 processors running the same job (where the
pairlist is divided by 8). The important point is that you are running a
worst-case scenario for pairlists here, where there is a huge amount of
pairlist on the dual quad core, so there is going to be a larger caching
effect; pairlists do have a cost, but spline table access tends to dominate
in the "wallop the cache" effect, and I can partially overcome these
effects by preloading vector arrays in chunks (it's all tradeoffs of course,
between extra instructions and cache effects). It is indeed true that
pairlists become less effective as the stepsize goes up because they have to
be rebuilt more frequently. But so far in my experience, in the range of
1-2 fsec, pairlists remain more effective than a no-pairlist alternative,
based on a bit of prototyping as well as back-of-the-envelope calcs, done
both before and after you started talking about what your code will do. I
don't know this for a fact, but I strongly suspect that one factor in
enhancing performance that you have may be that you get better cache
locality by not having to go out and do reciprocal space calcs, and possibly
a few other things as well (I don't have specifics on what all your current
code handles, but at one point I know it was just water; I also don't know
what sort of precision you maintain). At very high scaling, I will actually
see some of this sort of benefit in pmemd as reciprocal and direct space
work segregates to different processors, but we are talking really high
scaling, 128+ processors. If you can whack pmemd without losing
functionality or changing results and make it significantly faster, feel
free. No guarantees, but I would be nuts to not pick up changes that make a
significant performance difference without losing functionality, changing
results, breaking the code, or making the engineering a bigger nightmare
than it already is (though so far there have been no long-lasting third-party
contributions - there was some stuff in the GB world done by SGI, and it was
fine, but when all the GB methods got stripped out the "patch" was not
maintained).
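
The pairlist memory argument above can be put in rough numbers. This is only a back-of-the-envelope sketch, not anything taken from pmemd: the atom count, list cutoff, bytes per entry, and the function names are all assumptions chosen for illustration. It compares the total pairlist memory on one node when 8 processes each run their own copy of a job against 8 processes sharing one job, where each holds about 1/8 of the list.

```python
import math

# All numbers here are illustrative assumptions, not pmemd measurements.
WATER_ATOM_DENSITY = 0.1   # atoms per cubic angstrom, roughly liquid water
BYTES_PER_ENTRY = 4        # assume one 32-bit atom index per pairlist entry


def full_pairlist_bytes(n_atoms, list_cutoff):
    """Estimate the size of one complete pairlist, with each pair stored once."""
    neighbors_per_atom = (4.0 / 3.0) * math.pi * list_cutoff ** 3 * WATER_ATOM_DENSITY
    return n_atoms * neighbors_per_atom / 2.0 * BYTES_PER_ENTRY


def node_pairlist_bytes(n_atoms, list_cutoff, n_procs, independent_jobs):
    """Total pairlist memory on the node for the two ways of using n_procs cores."""
    one_list = full_pairlist_bytes(n_atoms, list_cutoff)
    if independent_jobs:
        # Every process runs its own simulation and holds a full pairlist.
        return n_procs * one_list
    # One parallel job: each process holds ~1/n_procs of a single pairlist.
    return one_list


# Hypothetical system: 25,000 atoms, 9 angstrom list cutoff, 8 cores.
n_atoms, list_cutoff, n_procs = 25_000, 9.0, 8
separate = node_pairlist_bytes(n_atoms, list_cutoff, n_procs, independent_jobs=True)
shared = node_pairlist_bytes(n_atoms, list_cutoff, n_procs, independent_jobs=False)
print(f"8 independent jobs: {separate / 2**20:.1f} MiB of pairlist on the node")
print(f"1 job on 8 cores:   {shared / 2**20:.1f} MiB of pairlist on the node")
```

Whatever values are plugged in, the ratio between the two cases is the process count, which is the 8x figure referred to above. How much of that actually lands in cache at once is a separate question this estimate does not address.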
Regards - Bob

----- Original Message -----
From: "David Cerutti" <dcerutti.mccammon.ucsd.edu>
To: <amber.scripps.edu>
Sent: Tuesday, August 14, 2007 11:05 AM
Subject: Re: AMBER: Any experience on Dell two quad core system?

>I just wanted to note, based on some other comments about performance with
>eight single-threaded processes running in parallel on a 2x dual-quad box,
>I have a bare-bones MD program that I used for water simulations that loses
>no efficiency when running eight independent processes on a 2x dual-quad
>box (I didn't write the code to parallelize). In terms of the real-space
>nonbonded calculation it was about 60%-80% faster than PMEMD (PMEMD works
>more efficiently with smaller time steps, but my code's performance is the
>same across all time steps). If we consider the cost of PME electrostatics
>the speedup was only about 30%. I was able to run eight processes of it
>without losing any efficiency as compared to a single process.
>
> My code does not use neighbor lists, but instead relies on a quick
> distance^2 calculation between water oxygens to determine what to compute,
> so it is actually very light on the RAM bus. I have made recommendations
> to AMBER developers about the possibility of foregoing the neighbor lists
> in the case of water molecules to conserve cache (and, neighbor lists of
> protein atoms would be much longer-lived anyway). With other specialized
> routines for 3, 4, or 5-point water:water interactions, I would still
> expect a 20% speedup during typical simulations.
>
> There may be other improvements that can be made in PMEMD based on these
> midpoint methods and other parallel efficiency algorithms that will not
> only help cluster scaling but also parallel efficiency on multi-core
> processors.
>
> Dave
> -----------------------------------------------------------------------
> The AMBER Mail Reflector
> To post, send mail to amber.scripps.edu
> To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
>
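
The no-neighbor-list approach described in the quoted message, a quick distance^2 test between water oxygens, might look roughly like the sketch below. It is not Dave's code: the function name, the all-pairs loop, and the absence of periodic boundaries are simplifying assumptions made here, and a real program would presumably restrict the search to nearby cells.

```python
def water_pairs_within_cutoff(oxygens, cutoff):
    """Return (i, j) index pairs of waters whose oxygens lie within the cutoff.

    oxygens is a list of (x, y, z) oxygen coordinates, one per water molecule.
    Comparing squared distances avoids a square root for every candidate pair.
    """
    cutoff2 = cutoff * cutoff
    pairs = []
    for i in range(len(oxygens)):
        xi, yi, zi = oxygens[i]
        for j in range(i + 1, len(oxygens)):
            xj, yj, zj = oxygens[j]
            dx, dy, dz = xi - xj, yi - yj, zi - zj
            if dx * dx + dy * dy + dz * dz < cutoff2:
                # Only for these pairs would the full water:water
                # interaction (all sites on both molecules) be computed.
                pairs.append((i, j))
    return pairs


# Three hypothetical waters on a line; only the first two are within 9 angstroms.
example = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
print(water_pairs_within_cutoff(example, 9.0))
```

Nothing is stored between steps, so the only memory traffic is the oxygen coordinates themselves. The cost is repeating the distance test every step.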

Received on Wed Aug 15 2007 - 06:07:53 PDT