AMBER: Suggestions for newer pentium chips and hyperthreading with PMEMD

From: Robert Duke <rduke.nc.rr.com>
Date: Fri, 9 Apr 2004 17:13:59 -0400

Dear Amber Folks -
There was a bit of discussion about a week ago on using hyperthreading with
the newer Pentium/Xeon chips from Intel. I more-or-less intuitively
recommended against it, based on general system principles. In the
meantime, I have actually gotten a new dual Xeon system running that
supports hyperthreading, and done a little benchmarking. I still think
hyperthreading is not worth the trouble, but the results are a little
different than I was expecting, though it all makes sense now.

I also have a slightly different recommendation for building pmemd for the
newer Pentium chips with large L3 caches (1 MB or larger). Basically, for
anything with 1 MB or more of L3 cache, especially if you are going to run
mostly under MPI where nonbonded pairlist division will occur, I would
recommend trying to include -DDIRFRC_BIGCACHE_OPT in the PREPROCFLAGS
environment variable in config.h. This yields about the same performance for
1 processor, and slightly better performance for 2 processors. For more
processors, I would expect further gains. This defined constant
(-DDIRFRC_BIGCACHE_OPT) should replace -DDIRFRC_VECT_OPT if it exists, as it
does for the current default Pentium 4 build.

What is going on here is that under the "standard" direct
force optimization and the DIRFRC_VECT_OPT optimization, a pairlist
compression algorithm is used. This is highly beneficial on machines with
smaller caches, because it really helps keep the direct force calcs running
in the L3 cache to the extent possible. Once you move up to having 1 MB of
L3 cache, however, this is generally not necessary, and the pairlist
compression represents additional compute overhead, especially in list
building. The downside of doing this on a small number of processors is
that the pairlist will consume more memory - roughly 50 MB per 100,000 atoms
instead of 10 MB per 100,000 atoms for the pairlist (rough estimates for a
single processor). Anyway, it is something to consider, if you move to
chips with larger L3 caches.
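
To put the memory trade-off in concrete terms, here is a minimal Python sketch that scales the rough per-processor figures quoted above (about 10 MB per 100,000 atoms with pairlist compression, about 50 MB without) to the two benchmark systems. The linear scaling with atom count, the function name and the dictionary keys are assumptions made for illustration only; none of this is taken from the pmemd source.

```python
# Ballpark single-processor pairlist memory, using the rough estimates
# quoted in the text. Linear scaling with atom count is an assumption.
MB_PER_100K_ATOMS = {
    "compressed": 10.0,    # "standard" and DIRFRC_VECT_OPT builds
    "uncompressed": 50.0,  # DIRFRC_BIGCACHE_OPT build
}


def pairlist_mb(n_atoms, mode):
    """Extrapolate the quoted MB-per-100,000-atoms figure to n_atoms."""
    return MB_PER_100K_ATOMS[mode] * n_atoms / 100_000


if __name__ == "__main__":
    for name, n_atoms in (("Factor IX", 90906), ("JAC", 23558)):
        small = pairlist_mb(n_atoms, "compressed")
        large = pairlist_mb(n_atoms, "uncompressed")
        print(f"{name}: ~{small:.0f} MB compressed, ~{large:.0f} MB uncompressed")
```

On a 2 GB machine either figure is small for these systems, which is why the extra memory is usually an acceptable price when the cache is large.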

Here I present benchmark data for a couple of our "classic" benchmarks, the
Factor IX constant pressure benchmark (90906 atoms) and the JAC benchmark
(23558 atoms). The test system is a Dell Precision Workstation 650n with
dual Intel Xeon 3.2 GHz processors with 1 MB L3 Caches,
hyperthreading-capable, 2GB DDR266 SDRAM. It was loaded with Redhat
Enterprise Linux 3 smp, fully updated, the Intel fortran compiler 7.1.038,
and mpich 2.5.2. It works great, after incredible grief with the opterons.
I don't have an intel fortran compiler 8 solution yet, but we don't really
need one.

The benchmarks were run with 1 and 2 processes with hyperthreading disabled.
Then hyperthreading was enabled in the bios, and the benchmarks were run
with 1, 2 (mpirun -np 2), and 4 (mpirun -np 4) virtual processes (all this
stuff is on one smp box; there are no network interconnect issues). All runs
were for 250 steps, and all results are in psec/day. PMEMD was run in three
different optimized configurations - big cache optimized, vector-optimized,
and "standard" (no special optimization defines, the default code if you
don't specify -DDIRFRC_*_OPT). Runs were also made on the default build of
sander 8. Each run was done twice, and both values are shown to
give a feel for the reproducibility (not great).

Factor IX benchmark, 90906 atoms, constant pressure

#proc      pmemd      pmemd      pmemd      sander8
           bigcache   standard   vector

1 (real)   74, 76     70, 75     76, 76     54, 55
2 (real)   120, 123   111, 116   111, 116   88, 88
1 (HT)     75, 75     74, 74     75, 76     54, 54
2 (HT)     89, 91     85, 91     92, 96     67, 71
4 (HT)     124, 127   109, 117   120, 120   88, 89

JAC benchmark, 23558 atoms

#proc      pmemd      pmemd      pmemd      sander8
           bigcache   standard   vector

1 (real)   200, 204   193, 204   189, 208   153, 154
2 (real)   327, 328   309, 332   327, 338   247, 264
1 (HT)     202, 204   191, 206   204, 208   149, 151
2 (HT)     216, 296   245, 267   251, 257   191, 202
4 (HT)     313, 332   327, 332   313, 338   265, 265

So, hyperthreading with 4 virtual processors on 2 real processors may produce
very slight performance gains, but I am not convinced they are significant.
Also, note that if you run 2 processes with hyperthreading enabled on a dual
real processor machine, you actually do worse than with hyperthreading
disabled; I imagine this is due to the occasional scheduling of both
processes onto one real processor (kind of a gotcha).
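
For anyone who wants to see the ratios behind these statements, the following Python sketch averages the two runs for each configuration and compares configurations. It uses only the Factor IX numbers for the big-cache pmemd build from the table above; the dictionary layout and helper names are made up for illustration, and with only two runs per configuration the ratios are indicative at best.

```python
# Factor IX, pmemd big-cache build, psec/day (two runs per configuration),
# copied from the benchmark table.
FACTOR_IX_BIGCACHE = {
    "1 real": (74, 76),
    "2 real": (120, 123),
    "1 HT": (75, 75),
    "2 HT": (89, 91),
    "4 HT": (124, 127),
}


def mean(values):
    """Arithmetic mean of the repeated runs."""
    return sum(values) / len(values)


def ratio(results, config, baseline):
    """Mean throughput of `config` relative to the mean for `baseline`."""
    return mean(results[config]) / mean(results[baseline])


if __name__ == "__main__":
    comparisons = (("2 real", "1 real"), ("4 HT", "2 real"), ("2 HT", "2 real"))
    for config, baseline in comparisons:
        r = ratio(FACTOR_IX_BIGCACHE, config, baseline)
        print(f"{config} vs {baseline}: {r:.2f}x")
```

For this one column the ratios come out at roughly 1.62x for 2 real processors over 1, 1.03x for 4 hyperthreaded processes over 2 real processors, and 0.74x for 2 processes with hyperthreading enabled versus 2 with it disabled.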

Regarding overall performance on the new pentium chips with large L3 cache,
it looks pretty good to me, and is approaching what one sees with the
Itanium 2 1.3 GHz chip (70-80%). I was getting somewhere around 85 psec/day
for a 2.2 GHz opteron (1proc), but the machine was never stable. Sander 7
running the Factor IX benchmark, default build, cranks out 49 psec/day (1
proc) or 80-81 psec/day (2 proc) on the machine used above.

Regards - Bob Duke


-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
Received on Fri Apr 09 2004 - 22:53:00 PDT