RE: AMBER: about parallelization in QM-MM

From: Ross Walker <ross.rosswalker.co.uk>
Date: Fri, 17 Aug 2007 11:45:08 -0700

In fact last time I checked
Gaussian 98 just did all its diagonalization in parallel.
                                                ^^^^^^^^

I meant to say "serial" here. Sorry...


> -----Original Message-----
> From: owner-amber.scripps.edu
> [mailto:owner-amber.scripps.edu] On Behalf Of Ross Walker
> Sent: Friday, August 17, 2007 09:59
> To: amber.scripps.edu
> Subject: RE: AMBER: about parallelization in QM-MM
>
>
> > USER (fr), PR(25), NI(0), VIRT(251m), RES(28m), SHR(from 10m
> > to 7000), S(R),
> > %CPU(100), %MEM(0.2), TIME(...), COMMND(sander.MPI)
> >
> > about the same for the other three nodes. The small %MEM for each
> > node reflects the availability of 4GB per node. As Ross warned,
> > drastically reducing the available memory per node (the nodes are
> > mainly used for "ab initio" calculations, so they are set up so that
> > all available memory can be used) might accelerate the procedure.
>
> No... What I said was that for 'other codes' such as Gaussian it may
> improve performance to tell them NOT to use as much memory. For AMBER
> QM/MM it will make no difference; you cannot tell it how much memory
> to use. Sander is smart enough (because we changed over to Fortran 95
> ages ago) to figure out the optimum amount of memory itself. So you
> really, really don't need to concern yourself with anything regarding
> memory except to make sure that you have at least enough that the
> code is not swapping to disk!
>
> > One has to know how the program works to judge about that.
>
> You have the source code, go take a look... That's how I learnt it.
>
> > The increase in speed with respect to previous, quite similar runs,
> > where I had forgotten to specify "mpirun -nc 4", is about 10-20%.
>
> The speed-up for pure QM runs will not be good right now, since the
> matrix diagonalization step dominates and this is not parallel. If
> you are feeling bored (or masochistic) then by all means please
> parallelize this step for me.
>
> I am working to improve the parallel performance (as well as the
> serial performance) of this section of the code in AMBER 10 and
> beyond, but how much gets done really depends on my funding situation
> over the next year. Right now it looks like certain people, who shall
> for the moment remain nameless, at NSF are intent on destroying SDSC
> and with it my ability to independently determine which projects I
> work on. Instead I am at the mercy of "what I can get funded" and
> thus can really only work on this stuff as a hobby.
>
> Note that for QM/MM runs with explicit water, periodic boundaries and
> PME you should see a reasonable speed-up on 4 processors of maybe 2
> to 2.5x. It will depend on the size of the QM system and the size of
> the MM system. The matrix diagonalization scales as N^3 and so
> quickly dominates as the atom count increases.
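>
> As a back-of-the-envelope illustration of that N^3 scaling, here is a
> quick sketch (just numpy calling a dense symmetric eigensolver on
> random matrices, nothing to do with sander itself; the absolute times
> depend entirely on your machine and LAPACK build):
>
>   import time
>   import numpy as np
>
>   # doubling the matrix dimension costs roughly 8x in the eigensolve
>   for n in (250, 500, 1000, 2000):
>       a = np.random.rand(n, n)
>       a = 0.5 * (a + a.T)        # symmetrize, like a Fock matrix
>       t0 = time.time()
>       np.linalg.eigh(a)          # dense symmetric diagonalization
>       print("n = %4d   time = %.3f s" % (n, time.time() - t0))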
>
> You might ask why codes such as Gaussian and NWChem scale reasonably
> well in parallel for pure QM calculations, so I shall jump the gun on
> this and attempt to explain. The key difference here is the time
> taken to do an SCF step. For a reasonably large ab initio calculation
> the time per SCF step can be seconds to hours. Here it is the
> calculation of the two-electron integrals that typically dominates,
> and NOT the matrix diagonalization. This is easily parallelized,
> which is why they see a speed-up. In fact, last time I checked,
> Gaussian 98 just did all its diagonalization in parallel. G03 might
> now do it but I haven't looked. The matrices for an ab initio QM
> calculation are also orders of magnitude larger, which means they
> take much, much longer to diagonalize, and so it is easier to do in
> parallel, with something like a block Jacobi method, since the
> communication latency doesn't kill you.
>
> The key point is that for an optimization using ab initio QM you
> might be looking at, say, a day's computation time, during which the
> code might do say 300 to 400 individual SCF steps and thus 300 to 400
> matrix diagonalizations. So even if you could make everything else
> take zero time you would still have roughly 4 minutes per matrix
> diagonalization. Thus there is a lot of scope for improving the
> efficiency in parallel, since there is plenty of work to be done for
> a given amount of communication.
>
> On the other hand, with semi-empirical QM, especially when you want
> to run MD, you do many orders of magnitude more SCF steps. E.g.
> assume you want to do 1ns of MD at a 1fs time step. This is 10^6 MD
> steps. Then assume you need 10 SCF steps per MD step, so 10^7 SCF
> steps. Say you want to do that 1ns in 24 hours. This equates to a
> rate of about 86ms per MD step, i.e. 8.6ms per SCF step. Hence, even
> if you make everything else go to zero, you have at most 8.6ms to do
> a matrix diagonalization.
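>
> Written out as a small Python sketch (nothing measured here, just the
> numbers quoted above, including the assumed 10 SCF steps per MD
> step):
>
>   day = 24 * 3600.0                   # seconds in 24 hours
>
>   # ab initio optimization: ~300-400 diagonalizations per day
>   for n_scf in (300, 400):
>       print("ab initio: %.1f min per diagonalization"
>             % (day / n_scf / 60.0))
>
>   # semi-empirical MD: 1 ns at 1 fs, ~10 SCF steps per MD step
>   md_steps = 10**6
>   per_md = day / md_steps             # ~86 ms per MD step
>   per_scf = per_md / 10               # ~8.6 ms per SCF step
>   print("MD: %.1f ms per MD step, %.2f ms per diagonalization"
>         % (per_md * 1e3, per_scf * 1e3))
>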
> And this then gets very, very hard to do in parallel. E.g.
> communication latency is typically around 4 microseconds or so, and
> bandwidth is maybe 2GB per sec sustained if we are just talking
> shared memory (note, for comparison, gigabit ethernet has an
> achievable bandwidth of around 100MB per sec at the top end). So if
> your matrix is say 1000 x 1000 (7.6MB), just distributing it to the
> other processors (or reading it into their cache) requires at a
> minimum 8 microseconds of latency + 3.7ms of transport time. This
> leaves less than 5ms for the computation, and even then you have to
> store the result somewhere. Hence you should be able to see the
> underlying problem here and why doing QM runs in parallel on MD-type
> timescales is very hard. Not to mention all the issues concerning
> cache coherency, race conditions etc. that accompany running
> calculations in parallel.
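>
> The same sort of sketch for the communication side (again only the
> figures quoted above -- the 2GB/s and 100MB/s bandwidths and the
> 8.6ms budget; the few microseconds of latency are negligible next to
> the transfer time at this matrix size):
>
>   n = 1000
>   matrix_mb = n * n * 8 / 1024.0**2       # 1000 x 1000 doubles ~ 7.6 MB
>   budget_ms = 8.6                         # per-diagonalization budget
>
>   for name, bw_mb_per_s in (("shared memory", 2 * 1024.0),
>                             ("gigabit ethernet", 100.0)):
>       transfer_ms = matrix_mb / bw_mb_per_s * 1e3
>       print("%-17s transfer %6.1f ms, leaves %6.1f ms of the budget"
>             % (name + ":", transfer_ms, budget_ms - transfer_ms))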
>
> Note that above about 90 QM atoms or so you may see better
> performance, assuming you have MKL installed and link against it, if
> you edit the config.h file and add -DUSE_LAPACK to the FPP_FLAGS
> line. Then make clean and build sander again. This uses the LAPACK
> diagonalization routine in place of the built-in routine. It can be
> quicker, depending on the machine and the LAPACK installation, for
> upwards of around 90 QM atoms or so. For fewer than this it is likely
> to be notably slower. Note this is UNDOCUMENTED, UNSUPPORTED and
> EXPERIMENTAL, so use it at your own risk, make sure you run all the
> test cases, and I suggest you keep two executables, one for small QM
> atom counts and one for large. I am hoping to make this all automatic
> by Amber 10, so that the code will pick what it believes will be the
> faster routine. Again, though, whether or not this gets done really
> depends on the stability of my funding over the coming months.
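>
> Purely as an illustration of that edit (the rest of config.h is
> platform- and compiler-specific, so everything in angle brackets
> below is a placeholder, not something to copy literally):
>
>   # in config.h: append the flag to the existing FPP_FLAGS line
>   FPP_FLAGS= <whatever was already there> -DUSE_LAPACK
>
>   # then, from the source directory:
>   make clean
>   <rebuild sander exactly as you built it originally>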
>
> > From the DivCon manual one learns that both parallelized and
> > non-parallelized versions are released. I was unable to find out
> > which one is used here.
>
> My understanding is that all parallel divcon actually does is run
> each of the fragments individually on different processors. Hence
> only the divide and conquer algorithm runs in parallel, and the
> parallelism is then determined by how many fragments you have. The
> parallel divide and conquer algorithm does not work with the divcon
> interface to sander, and it would likely take a lot of effort to get
> it to work. This is besides the question of whether divide and
> conquer approaches are even appropriate for MD simulations. I see no
> problem with them for minimization, but it seems to me that there may
> be certain discontinuities in the gradients that would prevent you
> from running an accurate MD simulation with it. In addition, if you
> want to make use of this you would also have to forgo periodic
> boundaries and, more importantly, PME electrostatics. It is not clear
> to me if or how a divide and conquer algorithm could be made to work
> with the concept of doing PME for QM/MM calculations. This would take
> significant theoretical work to determine the appropriate mathematics
> before one could even start to implement it.
>
> All the best
> Ross
>
> /\
> \/
> |\oss Walker
>
> | HPC Consultant and Staff Scientist |
> | San Diego Supercomputer Center |
> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> | http://www.rosswalker.co.uk | PGP Key available on request |
>
> Note: Electronic Mail is not secure, has no guarantee of
> delivery, may not
> be read every day, and should not be used for urgent or
> sensitive issues.
>


-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
Received on Sun Aug 19 2007 - 06:07:43 PDT