Re: AMBER: amber 10: sander and pmemd performance from Robert Duke on 2008-07-25 (Amber Archive Jul 2008)

From: Robert Duke <rduke.email.unc.edu>
Date: Fri, 25 Jul 2008 12:06:41 -0400

Hi Vlad -
Well, I don't think I did a very good job of directly addressing the
ifort - gfortran difference. What really happens is that the compiler
matters a lot more in some circumstances than in others. If something
else comes into play (gigabit ethernet, a mismatch between problem size
and cache, some other h/w or s/w config problem), then the speedup due to
the compiler can more or less disappear on you. And parallelization
overhead can start eating up single-processor performance gains pretty
quickly.

A good example is fftw. I went to a fair bit of trouble to add support
for fftw to pmemd, anticipating that it would help. I also happened to
optimize our own public fft's at the same time, and it turned out that on
intel p4 and later cpu's, the fftw advantage on a single cpu was only
around 10% (partly due to my public fft optimization effort). This is
sort of worth it. But by the time you are out to, say, 4 to 8 processors,
the advantage is barely noticeable, because the distributed transpose cost
of the fft is coming up along with other parallelization costs, and
running a bit faster on the individual processors is suddenly benefiting
you a percent or two at best.

So anyway, back to your problem. I could believe that gfortran is getting
better - I don't follow this stuff very closely because I believe that the
commercial guys will always be fastest by at least a bit (if they are not,
they are dead). Now, add in potential problems with misconfiguration of
the system overall, potential problems with nonoptimal configuration of
ifort 10 (which I am not yet using - I should be, but have been wrapped up
with other stuff), and potential problems hitting some amd opteron with
ifort (I vaguely recollect that ifort did not give the best performance on
amd chips - this is not particularly surprising, and I would not be at all
surprised if the open source guys are working a bit harder to hit the
opteron). Bottom line: there are lots of reasons you could see this
happen. The best comparison would be uniprocessor, with different
benchmarks and a careful look at optimization options in both the
compilers and in pmemd. That's what needs to be done as a start, and then
you move on to what happens with parallelization.

On mpich2 vs. openmpi: I have not tried to use openmpi, but presume it is
easier for the masses to configure than mpich2. With mpich2, they have
this daemons model, and I found that it required considerable dinking
around to get it to work properly if there were any complexities in your
system (I had multiple net cards, with a server net card in each node
dedicated to mpi i/o, and configuring this was easy to screw up, with
really bad performance or screwball connectivity problems as the result).
So possibly mpich2 on your machines has some problems of this sort.
Possibly also, openmpi has some slight improvements over mpich2 for some
combination of factors - one would hope things get better, but often they
don't. I still run mpich 1 because it is simpler to configure, more
reliable, and just as fast on my machines. I tend not to replace software
with new software unless the new software is demonstrably better in some
way that I care about.

Okay, enough philosophizing from me on this sort of thing; hope it helps,
or at least everyone better understands where I am coming from on pmemd
development.
Best Regards - Bob
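
The fftw example above can be put into rough numbers. What follows is a
minimal sketch, assuming a toy cost model in which one MD step costs
compute/P plus a communication cost that a faster kernel does not touch.
The function name and all timings are hypothetical illustrations, not
pmemd measurements.

```python
def overall_gain(compute_s, comm_s, n_procs, kernel_gain):
    """Fractional whole-run speedup from a faster compute kernel.

    compute_s   -- serial compute time per step (hypothetical value)
    comm_s      -- communication time per step when n_procs > 1
                   (assumed unaffected by the faster kernel)
    n_procs     -- number of processors sharing the compute work
    kernel_gain -- per-processor speedup of the kernel, e.g. 0.10 for 10%
    """
    # Toy assumption: no communication cost on a single processor.
    comm = comm_s if n_procs > 1 else 0.0
    base = compute_s / n_procs + comm
    tuned = compute_s / (n_procs * (1.0 + kernel_gain)) + comm
    return base / tuned - 1.0


if __name__ == "__main__":
    # Made-up timings: 1.0 s of compute and 0.2 s of communication per step.
    for p in (1, 2, 4, 8):
        gain = overall_gain(1.0, 0.2, p, 0.10)
        print(f"{p} procs: overall gain {gain * 100:.1f}%")
```

In this model the full 10% shows up only on one processor; as P grows,
the fixed communication term takes a larger share of each step and the
visible gain shrinks, which is the qualitative effect described above.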

----- Original Message -----
From: "Vlad Cojocaru" <Vlad.Cojocaru.eml-r.villa-bosch.de>
To: <amber.scripps.edu>
Sent: Friday, July 25, 2008 11:14 AM
Subject: Re: AMBER: amber 10: sander and pmemd performance

> Dear Bob,
>
> Thanks a lot for your email. For a person who is just realizing how
> important it is to compile and use a personalized build of a piece of
> software such as AMBER, the details you give here are very important. In
> the last week or so I learned a lot about compilation, libraries,
> compilers and so on... and it all started with that "output problem" I
> reported some weeks ago, when I realized that I am better off doing the
> compilation myself.
>
> So, for now I am far from wanting to run pmemd at the highest performance
> level possible. I managed to compile the versions I mentioned previously
> and they all run fine for now. For my system, the speed achieved on 4
> cores (1 AMD Opteron node with 2 dual-core CPUs) of about 0.27 ns/day on
> a 64K-atom system (NVE, time step = 1 fs) is fine for now. We have small
> gigabit ethernet clusters, so I didn't put too much effort into testing at
> higher processor counts using different nodes for the same run, because
> the scaling of pmemd and sander is very poor on our cluster. And we had
> some nasty I/O problems when running jobs on different nodes.
>
> I was just surprised that compiling with ifort did not improve the
> performance of pmemd compared to gfortran. Also, I was a bit surprised to
> see that pmemd compiled with ifort+mpich2 is about 5% slower than pmemd
> compiled with ifort+openmpi. I thought maybe there was some obvious option
> for the ifort compiler that I hadn't considered, and that was the reason I
> asked the question.
>
> However, soon we'll get an infiniband cluster of AMD Opterons (I do not
> know the exact configuration yet), and for sure I'll be using the
> information you sent here to build AMBER10 for that cluster. So, I'm happy
> that you took the time to write down all these details.
>
> Thanks again
>
> Cheers
> vlad
>
> Robert Duke wrote:
>> There are probably at least 10, if not 20, different things going on
>> here, some of which you are talking about, some of which you are not. I
>> have no idea how many processors you are using. I don't know your
>> interconnect. This stuff can be impacted by:
>>
>> 1) compiler choice,
>> 2) compiler options choice,
>> 3) mpi choice,
>> 4) how mpi was built,
>> 5) how mpi was configured,
>> 5a) how the system communications stacks are configured,
>> 6) how pmemd was configured to be optimized given the hardware and
>>    software in play,
>> 7) the hardware that is being used, in terms of specifics about a) cpu
>>    speed, b) cpu cache size, c) multicore impacts on memory and other
>>    communications bandwidths, d) the system buses in use, e) the net
>>    cards in use,
>> 8) the actual benchmarks in use. The size of the benchmark can make a
>>    big difference in performance, depending on how the modeled system
>>    size matches the cache size and the processor count. (For example,
>>    as the processor count comes down, more is done in each individual
>>    processor in terms of total memory requirements; at some point you
>>    run out of cache, and that can really make a difference in
>>    performance.)
>>
>> A wide range of options chosen in mdin can also totally whack
>> performance.
>>
>> So what I did in the amber 8 and 9 timeframes is cook up a bunch of
>> specific configurations with known characteristics, and I carefully
>> optimized the software and provided configuration options to target
>> these machines. It is not a simple matter to then move to any new
>> machine/new compiler/new implementation of any other supporting library
>> and see performance STAY THE SAME, LET ALONE GET BETTER. It is really
>> really really really easy to dink up the performance of this sort of
>> code; sad but true. It basically is optimized to sit on the edge of a
>> bunch of interlocking bottlenecks; push it a little in any direction,
>> and you start running slower for a different reason. So I am sorry about
>> that, but the community maybe needs to realize that this is not a simple
>> matter. I have spent a couple of decades working, to some extent or
>> another, on performance and reliability issues in the computer industry,
>> and I have to approach each new configuration carefully, or I won't get
>> particularly good performance (I actually build about two dozen
>> different versions of pmemd and just run them at present).
>>
>> I choose not to spend the rest of my life doing this for every
>> combination of hardware and software folks can dream up; I would really
>> recommend that if you are serious about running amber fast, you take a
>> look at what we support well currently, and consider making purchases in
>> that direction. Right now, that probably means that the best choice in
>> mpi (for cluster builders) is good infiniband hardware + mvapich and the
>> ifort compiler. I would choose faster cpu's, lower core count, hang the
>> additional cost, if molecular dynamics is your thing. If I had or really
>> wanted amd processors, I would choose pathscale compilers - they are
>> fast and work well. PGI is my third choice in compilers, but this may be
>> because of past issues that are not that big a deal anymore - they have
>> made an honest effort to respond to past problems and should be given
>> credit. For me, intel compilers have always been pretty darn fast, but a
>> bit of a pain to use in terms of them changing things; still, they
>> really know how to write a code optimizer.
>>
>> If you are stuck with ethernet, well, don't expect much, but we support
>> lam, mpich 1, mpich 2; they all work well, and they are really pretty
>> easy to install (I think lam in particular may be pretty easy; for
>> historical reasons mostly I settled in on mpich 1-vintage stuff myself;
>> I found that all the additional features of mpich 2 were mostly a hassle
>> for my small clusters).
>>
>> If you want to run a totally new configuration and see what it will do
>> with pmemd, then you need to 1) optimize the mpi configuration,
>> carefully, on your machine, 2) optimize the compiler configuration,
>> carefully, on your new machine, with pretty aggressive compiler options,
>> and 3) go through building pmemd every way possible (all the various
>> optimization options) to see what you get. Then run different size
>> benchmarks, different size runs, and on and on.
>>
>> And another big point. If you are not willing to spend money on the rest
>> of your configuration, to get stuff that is recommended and known to
>> work, and spend the effort to set up the recommended configurations,
>> then maybe you shouldn't be so surprised that you don't get the best
>> performance. I wouldn't be...
>>
>> Two potentially interesting notes:
>> 1) You should not really expect pmemd 10 to be much faster than pmemd 9
>> on a small cpu configuration unless you are running NVE or NVT, with the
>> default value for ene_avg_sampling; in the development work for 10, I
>> found very little aside from things associated with this option that
>> would improve performance at low processor count (and as I have said
>> elsewhere, for the right nve benchmarks on the right machines, the single
>> processor nve performance can be as much as 30% better, but this is sort
>> of best case).
>> 2) We will reasonably soon be supporting configurations using Intel MPI
>> on Infiniband. This stuff has better performance than anything I have
>> seen for commodity clusters - SUBSTANTIALLY better, and looks to me to be
>> worth the money.
>>
>> Regards - Bob Duke
>>
>> ----- Original Message ----- From: "Vlad Cojocaru"
>> <Vlad.Cojocaru.eml-r.villa-bosch.de>
>> To: "AMBER list" <amber.scripps.edu>
>> Sent: Friday, July 25, 2008 5:57 AM
>> Subject: AMBER: amber 10: sander and pmemd performance
>>
>>
>>> Dear ambers,
>>>
>>> I have compiled AMBER10 with the intel compilers ifort and icc (10.1)
>>> with 2 different mpi libs: v1 = version with mpich2 1.0.7; v2 = version
>>> with openmpi 1.2.6 (MKL was used in both). All MPI libraries were
>>> compiled with the same compilers as the AMBER package. Netcdf support
>>> was included in all compilations.
>>>
>>> I tested these compilations on a 60K-atom system on 4 cores of an AMD
>>> Opteron machine with 2 dual-core CPUs (OS: Debian Linux). After this
>>> I compared with an older compilation of AMBER9 with gcc 4.1 (gfortran)
>>> and openmpi 1.2.5 (v3). I used both the NPT and NVE ensembles for
>>> testing.
>>>
>>> To my surprise there is little difference in performance between the v1
>>> and v2 compilations of AMBER 10 and the old compilation of AMBER 9 with
>>> gcc. The new sander.MPI is just about 6-8% faster, while the new pmemd
>>> is just about the same speed. Interestingly, the v2 compilation
>>> (intel+openmpi) was slightly faster than v1 (compiled with mpich2).
>>> Also, a different compilation using pgi 7.1 and openmpi 1.2.5 is very
>>> similar in performance to the intel ones.
>>>
>>> In general sander.MPI runs 0.155 to 0.165 ns per day of NPT simulation,
>>> while pmemd runs 0.24 to 0.25 ns/day of NPT simulation and 0.245 to
>>> 0.2670 ns/day of NVE simulation. All simulations have a time step of
>>> 1 fs.
>>>
>>> I would like to ask whether, in your experience, these performance
>>> figures are what one would expect on such a machine. I was hoping that
>>> the intel compilers would produce significantly faster executables
>>> (around 15-20%) compared to gfortran, but this is not the case (or maybe
>>> the increase in performance comes with higher CPU counts?). Is there
>>> something one can play with during compilation with the intel compilers
>>> to increase performance? There are several messages in the AMBER
>>> archives suggesting that the intel compilers provide faster
>>> executables ....
>>>
>>> Best wishes
>>> vlad
>>>
>>>
>>> --
>>> ----------------------------------------------------------------------------
>>>
>>> Dr. Vlad Cojocaru
>>>
>>> EML Research gGmbH
>>> Schloss-Wolfsbrunnenweg 33
>>> 69118 Heidelberg
>>>
>>> Tel: ++49-6221-533266
>>> Fax: ++49-6221-533298
>>>
>>> e-mail:Vlad.Cojocaru[at]eml-r.villa-bosch.de
>>>
>>> http://projects.villa-bosch.de/mcm/people/cojocaru/
>>>
>>> ----------------------------------------------------------------------------
>>>
>>> EML Research gGmbH
>>> Amtgericht Mannheim / HRB 337446
>>> Managing Partner: Dr. h.c. Klaus Tschira
>>> Scientific and Managing Director: Prof. Dr.-Ing. Andreas Reuter
>>> http://www.eml-r.org
>>> ----------------------------------------------------------------------------
>>>

-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" (in the *body* of the email)
to majordomo.scripps.edu
Received on Sun Jul 27 2008 - 06:07:53 PDT
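
The throughput figures quoted in this thread are given in ns/day at a 1 fs
time step. A small helper, offered as a sketch with hypothetical function
names, converts such a figure to MD steps per second and expresses one
build's speed relative to another. This makes it easier to compare runs
that use different time steps.

```python
SECONDS_PER_DAY = 86_400
FS_PER_NS = 1_000_000


def steps_per_second(ns_per_day, timestep_fs):
    """Convert a throughput in ns/day to MD steps per wall-clock second."""
    steps_per_day = ns_per_day * FS_PER_NS / timestep_fs
    return steps_per_day / SECONDS_PER_DAY


def percent_faster(new_ns_per_day, old_ns_per_day):
    """How much faster the new build is than the old one, in percent."""
    return (new_ns_per_day / old_ns_per_day - 1.0) * 100.0


if __name__ == "__main__":
    # 0.27 ns/day at a 1 fs time step, as reported in the thread.
    print(f"{steps_per_second(0.27, 1.0):.3f} steps/s")
    # Midpoints of the reported NPT ranges: pmemd ~0.245, sander ~0.16.
    print(f"pmemd vs sander: {percent_faster(0.245, 0.16):.0f}% faster")
```

For the 0.27 ns/day reported at 1 fs, this works out to roughly 3.1 steps
per second on the 4 cores used.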