From: Ross Walker <>
Date: Thu, 4 Nov 2010 10:38:23 -0700

Hi Xiaohu,

> I was thinking about using amber(version 10) to do some qm/mm. I did
> some benchmark about a system around 2200 atoms where 184 atoms are

At this number of QM atoms the matrix diagonalization is going to dominate
so you won't gain a huge amount. You can try linking to one of the latest
MKL libraries and setting diag_routine=0 in the qmmm namelist. This is in
the AMBER 10 manual but for some reason seems to have been mysteriously
removed from the AMBER 11 manual. It should work in both AMBER 10 and AMBER
11 though. This will test all the diagonalizers and will probably give you a
60 or 70% performance increase right off the bat if you use MKL for this
size of matrix. In principle you could also link in the OpenMP version of
the MKL libraries - there is a -openmp switch in the amber10 configure if I
remember correctly, and an IFDEF QMMM_OMP in the code. You might be able to
get that to benefit over 2 to 4 processors. This is undocumented and
experimental though so if you try this you are on your own and may need to
do some hacking around in the code.
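As a concrete illustration, here is a minimal sketch of the relevant mdin
namelists (the QM mask and theory below are placeholders - substitute your
own 184-atom region and Hamiltonian):

```
 &cntrl
   ifqnt=1,           ! enable QM/MM
 /
 &qmmm
   qmmask=':1-12',    ! placeholder QM region - use your own mask
   qm_theory='PM3',   ! placeholder semi-empirical method
   diag_routine=0,    ! benchmark all available diagonalizers at startup
                      ! and use the fastest (fastest with MKL linked in)
 /
```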
> | QMMM hcore calc          0.82 ( 1.18% of QMMM )
> | QMMM fock build          2.13 ( 3.11% of QMMM )
> | QMMM elec-energy cal     0.17 ( 0.25% of QMMM )
> | QMMM full matrix dia    36.15 (52.64% of QMMM )
> | QMMM pseudo matrix d    21.90 (31.89% of QMMM )
> | QMMM density build       8.31 (12.10% of QMMM )
> | QMMM scf                68.67 (98.81% of QMMM )

> MPI 2 proc
> | QMMM hcore calc          0.38 ( 0.54% of QMMM )
> | QMMM fock build          1.17 ( 1.70% of QMMM )
> | QMMM fock dist           0.46 ( 0.67% of QMMM )
> | QMMM elec-energy cal     0.72 ( 1.04% of QMMM )
> | QMMM full matrix dia    18.45 (26.81% of QMMM )
> | QMMM pseudo matrix d    10.56 (15.33% of QMMM )
> | QMMM density build       4.05 ( 5.88% of QMMM )
> | QMMM density dist       33.43 (48.56% of QMMM )
> | QMMM scf                68.84 (99.45% of QMMM )

> as you can see, when the number of processors is doubled, the full matrix
> diagonalization and density build times are both reduced to half, which is
> contrary to what the manual says. In addition, for the MPI code, there is
> an additional term called QMMM density dist, which is significant. So
> although it seems that the matrix diagonalization and density build are
> reduced, this new term in MPI causes no change in the overall time spent
> in the QM part.

This is just cosmetic. What happens is that all the threads run through the
SCF routine and just the master calls the diagonalization routine. All the
others fly through until they get to the barrier waiting for the master to
send them the results of the diagonalization. Thus the master accumulates
time in the full matrix diag timer and the others accumulate time in the
timer waiting for the density distribution. What is printed at the end of
the output file is the average across threads, which is why you see this.

Set profile_mpi=1 in the cntrl namelist and run again. Then take a look at
the file it creates, called profile_mpi - this will contain the raw results from
the timers for each individual thread. This is really what you should be
looking at if you want to understand where the time is being used by each
thread in parallel.
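For example, the only change needed in your existing mdin is the one flag
(all your other cntrl settings stay as they are):

```
 &cntrl
   profile_mpi=1,     ! dump raw per-thread timings to a file named
                      ! profile_mpi instead of only the thread average
 /
```

With that file you can compare the full matrix diag time on the master
against the density dist wait time on the other threads directly.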

My first advice though is to get the latest Intel MKL. Compile against this
and set diag_routine=0. You will get a LARGE speedup. This should probably
be set up as the default, but not everyone has MKL.

I hope that helps.

All the best

|\oss Walker

| Assistant Research Professor |
| San Diego Supercomputer Center |
| Adjunct Assistant Professor |
| Dept. of Chemistry and Biochemistry |
| University of California San Diego |
| NVIDIA Fellow |
| Tel: +1 858 822 0854 | EMail:- |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.

AMBER mailing list
Received on Thu Nov 04 2010 - 11:00:03 PDT