RE: AMBER: Suggestions for newer pentium chips and hyperthreading with PMEMD

From: Ross Walker <ross.rosswalker.co.uk>
Date: Fri, 9 Apr 2004 14:38:18 -0700

Dear Robert,

The issue with running a 2 processor job on a 2 processor HT machine (2 real
+ 2 HT = 4 logical cpus) is something I have seen many times before. It comes
from the Linux kernel scheduler not differentiating between real and HT
processors, so two compute-bound processes can end up on the two halves of a
single physical cpu. I don't know if Windows can differentiate; I actually
suspect the hardware may not make such a differentiation possible, but I'm
not sure about that.
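
If you want to check what the kernel thinks it is scheduling onto,
/proc/cpuinfo is a quick sanity check. A sketch only - the "physical id"
and "siblings" fields only show up on SMP kernels built with HT support,
so your mileage may vary:

    # List the logical processors and which physical package each sits on.
    grep -E '^(processor|physical id|siblings)' /proc/cpuinfo
    # If two "processor" entries report the same "physical id", they are
    # HT siblings sharing one real cpu, not two real cpus.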

Thus I agree with you that hyperthreading, at least on multiprocessor
machines, should be switched off. I don't know whether this issue is unique
to the types of simulations we run; I suspect not. I think anything that
runs multiple threads doing the same work will suffer on a multiprocessor
(>1 real cpu) hyperthreaded machine. Hyperthreading is probably great for
SQL database searches, but for our types of calculations it is fundamentally
broken. I think Intel's implementation of hyperthreading was rushed for
commercial reasons and hence is not very efficient. It is fine for single
cpu machines (although you should still run jobs as 1 process per machine in
this environment), but as soon as you go above that the processor scheduling
goes haywire and you actually end up losing out.

So my advice to everyone: "If you have a multi-cpu machine with
hyperthreading, TURN IT OFF."
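
If you are stuck with a box where you can't touch the BIOS, something like
the following might limp you by. Purely a sketch, not something I have
benchmarked - taskset needs a kernel with sched_setaffinity, the job names
are placeholders, and the cpu numbering is an assumption you should verify
against /proc/cpuinfo first:

    # Pin one process to each real cpu so the HT siblings stay idle.
    # Masks assume logical cpus 0/1 share one package and 2/3 the other.
    taskset 0x1 ./job1 &    # bind to logical cpu 0 (first real cpu)
    taskset 0x4 ./job2 &    # bind to logical cpu 2 (second real cpu)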

Just my 2 cents
Ross

/\
\/
|\oss Walker

| Department of Molecular Biology TPC15 |
| The Scripps Research Institute |
| Tel:- +1 858 784 8889 | EMail:- ross.rosswalker.co.uk |
| http://www.rosswalker.co.uk/ | PGP Key available on request |
 

> -----Original Message-----
> From: owner-amber.scripps.edu
> [mailto:owner-amber.scripps.edu] On Behalf Of Robert Duke
> Sent: 09 April 2004 14:14
> To: amber.scripps.edu
> Subject: AMBER: Suggestions for newer pentium chips and
> hyperthreading with PMEMD
>
> Dear Amber Folks -
> There was a bit of discussion about a week ago on using hyperthreading
> with the newer Pentium/Xeon chips from Intel. I more-or-less intuitively
> recommended against it, based on general system principles. In the
> meantime, I have actually gotten a new dual Xeon system running that
> supports hyperthreading, and done a little benchmarking. I still think
> hyperthreading is not worth the trouble, but the results are a little
> different from what I was expecting, though it all makes sense now.
>
> I also have a slightly different recommendation for building pmemd for
> the newer pentium chips with large L3 caches (1 MB or larger). Basically,
> for anything with 1 MB or more of L3 cache, especially if you are going
> to run mostly under mpi, where nonbonded pairlist division will occur, I
> would recommend trying -DDIRFRC_BIGCACHE_OPT in the PREPROCFLAGS variable
> in config.h. This yields about the same performance for 1 processor and
> slightly better performance for 2 processors; for more processors, I
> would expect further gains. This defined constant should replace
> -DDIRFRC_VECT_OPT if it exists, as it does in the current default
> pentium 4 build.
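>
> To make that concrete, the change to config.h would look something like
> this (a sketch only - the -DMPI flag and anything else on the line just
> stand in for whatever your generated config.h already has):
>
>     # Before (current default pentium 4 build; other flags are
>     # placeholders for whatever your config.h already sets):
>     PREPROCFLAGS = -DMPI -DDIRFRC_VECT_OPT
>     # After (for chips with 1 MB or more of L3 cache):
>     PREPROCFLAGS = -DMPI -DDIRFRC_BIGCACHE_OPT
>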
> "standard" direct
> force optimization and the DIRFRC_VECT_OPT optimization, a pairlist
> compression algorithm is used. This is highly beneficial on
> machines with
> smaller caches, because it really helps keep the direct force
> calcs running
> in the L3 cache to the extent possible. Once you move up to
> having 1 MB of
> L3 cache, however, this is generally not necessary, and the pairlist
> compression represents additional compute overhead, especially in list
> building. The downside of doing this on a small number of
> processors is
> that the pairlist will consume more memory - roughly 50 MB
> per 100,000 atoms
> instead of 10 MB per 100,000 atoms for the pairlist (rough
> estimates for a
> single processor). Anyway, it is something to consider, if
> you move to
> chips with larger L3 caches.
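>
> As a back-of-envelope check on those numbers (using my rough
> single-processor estimates of 50 and 10 MB per 100,000 atoms):
>
>     # Estimated pairlist memory for the 90906-atom Factor IX system:
>     awk 'BEGIN { n = 90906
>                  printf "bigcache: ~%.0f MB, compressed: ~%.0f MB\n",
>                         n / 100000 * 50, n / 100000 * 10 }'
>     # prints: bigcache: ~45 MB, compressed: ~9 MB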
>
> Here I present benchmark data for a couple of our "classic" benchmarks:
> the Factor IX constant pressure benchmark (90906 atoms) and the JAC
> benchmark (23558 atoms). The test system is a Dell Precision Workstation
> 650n with dual Intel Xeon 3.2 GHz processors (1 MB L3 cache each,
> hyperthreading-capable) and 2 GB DDR266 SDRAM. It was loaded with Red Hat
> Enterprise Linux 3 smp, fully updated, the Intel Fortran compiler
> 7.1.038, and mpich 2.5.2. It works great, after incredible grief with the
> opterons. I don't have an Intel Fortran compiler 8 solution yet, but we
> don't really need one.
>
> The benchmarks were run with 1 and 2 processes with hyperthreading
> disabled. Then hyperthreading was enabled in the BIOS, and the benchmarks
> were run with 1, 2 (mpirun -np 2), and 4 (mpirun -np 4) virtual processes
> (all of this is on one smp box; there are no network interconnect
> issues). All runs were for 250 steps, and all results are in psec/day.
> PMEMD was run in three different optimized configurations: big cache
> optimized (-DDIRFRC_BIGCACHE_OPT), vector optimized (-DDIRFRC_VECT_OPT),
> and "standard" (no special optimization defines; the default code if you
> don't specify -DDIRFRC_*_OPT). Runs were also made on the default build
> of sander 8. Every run was repeated twice, with both values shown to
> give a feel for the reproducibility (not great).
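>
> For concreteness, each data point came from an invocation along these
> lines (the file names here are placeholders, not the actual benchmark
> inputs):
>
>     # 2-process run; for 4 virtual processes substitute -np 4.
>     mpirun -np 2 pmemd -O -i mdin -o mdout -p prmtop -c inpcrd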
>
> Factor IX benchmark, 90906 atoms, constant pressure (psec/day):
>
> #proc       pmemd       pmemd       pmemd      sander8
>             bigcache    standard    vector
>
> 1 (real)     74,  76     70,  75     76,  76    54, 55
> 2 (real)    120, 123    111, 116    111, 116    88, 88
> 1 (HT)       75,  75     74,  74     75,  76    54, 54
> 2 (HT)       89,  91     85,  91     92,  96    67, 71
> 4 (HT)      124, 127    109, 117    120, 120    88, 89
>
> JAC benchmark, 23558 atoms (psec/day):
>
> #proc       pmemd       pmemd       pmemd      sander8
>             bigcache    standard    vector
>
> 1 (real)    200, 204    193, 204    189, 208   153, 154
> 2 (real)    327, 328    309, 332    327, 338   247, 264
> 1 (HT)      202, 204    191, 206    204, 208   149, 151
> 2 (HT)      216, 296    245, 267    251, 257   191, 202
> 4 (HT)      313, 332    327, 332    313, 338   265, 265
>
> So, hyperthreading with 4 virtual processors on 2 real processors may
> produce very slight performance gains, but I am not convinced they are
> significant. Also, note that if you run 2 processes on a dual real
> processor machine with hyperthreading enabled, you actually do worse than
> would be expected; I imagine this is due to the occasional scheduling of
> both virtual processes to one real processor (kind of a gotcha).
>
> Regarding overall performance on the new pentium chips with large L3
> cache, it looks pretty good to me, approaching (at 70-80%) what one sees
> with the Itanium 2 1.3 GHz chip. I was getting somewhere around 85
> psec/day for a 2.2 GHz opteron (1 proc), but that machine was never
> stable. Sander 7 running the Factor IX benchmark, default build, cranks
> out 49 psec/day (1 proc) or 80-81 psec/day (2 proc) on the machine used
> above.
>
> Regards - Bob Duke
>

-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu