RE: AMBER: Problem of QM/MM calculation with amber 9 parallel version

From: Ross Walker <ross.rosswalker.co.uk>
Date: Thu, 2 Nov 2006 09:05:38 -0800

Dear Lee,

It is the problem I guessed it might be: only Ewald-related QM/MM in
parallel causes trouble. Did you run 'make test.parallel' after
compiling Amber in parallel? Even though we have now found the problem, it is
imperative that people do this before running their own simulations, as it
will often uncover problems very quickly. In your case the initial parallel
QM/MM test cases would have run, and it would only have hung on the 9th QM/MM
test case, which is the one that tests PME for QM/MM.
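
For reference, the parallel test suite is run along these lines (the mpirun
line is only an example; the exact command depends on your MPI installation):

cd $AMBERHOME/test
export DO_PARALLEL='mpirun -np 2'
make test.parallel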

Anyway, here is the issue:

|QMMM: Quantum atom + link atom division among threads:
|QMMM:                  Start       End     Count
|QMMM: Thread(   0):        1->      11 (      11)
|QMMM: Thread(   1):       12->      22 (      11)

>>> Looks good here

|QMMM: KVector division among threads:
|QMMM:                  Start       End     Count
|QMMM: Thread(   0):        1->     155 (     155)
|QMMM: Thread(   0):      156->     309 (     154)

But this part confirms it. It is yet another bug in the compilers :'( :'(
:'( :'(

Here is the code that prints this:

#ifdef MPI
        if (qmmm_mpi%master) then
          write (6,'(/a)') '|QMMM: KVector division among threads:'
          write (6,'(a)')  '|QMMM:                  Start       End     Count'
          !Already know my own.
          write(6,'(a,i8,a,i8,a,i8,a)') &
                '|QMMM: Thread(   0): ',qmmm_mpi%kvec_start,'->',qmmm_mpi%kvec_end, &
                                       ' (',qmmm_mpi%kvec_end-qmmm_mpi%kvec_start+1,')'
          do i = 1, sandersize-1
            call mpi_recv(istartend,2,mpi_integer,i,0,commsander,istatus,ier)
            write(6,'(a,i4,a,i8,a,i8,a,i8,a)') &
                '|QMMM: Thread(',i,'): ',istartend(1),'->',istartend(2), &
                                       ' (',istartend(2)-istartend(1)+1,')'
          end do
        else
          !Send a message to the master with our counts in.
          istartend(1) = qmmm_mpi%kvec_start
          istartend(2) = qmmm_mpi%kvec_end
          call mpi_send(istartend,2,mpi_integer,0,0,commsander,ier)
        end if
#endif

So the issue is with the loop:

          do i = 1, sandersize-1
            call mpi_recv(istartend,2,mpi_integer,i,0,commsander,istatus,ier)
            write(6,'(a,i4,a,i8,a,i8,a,i8,a)') &
                '|QMMM: Thread(',i,'): ',istartend(1),'->',istartend(2), &
                                       ' (',istartend(2)-istartend(1)+1,')'
          end do

If you print the value of i in this loop, you see that it starts with
i = 1 and calls mpi_recv with i = 1, so it expects a message from the thread
with taskid = 1. This it receives okay, since the correct values of
istartend(1) and istartend(2) are printed:

|QMMM: KVector division among threads:
|QMMM:                  Start       End     Count
|QMMM: Thread(   0):        1->     155 (     155)
|QMMM: Thread(   0):      156->     309 (     154)
              ^^^^
BUT!!!--------!

Here i = 0. So after calling mpi_recv with i = 1, the value of i gets
corrupted and comes back as 0. The loop then executes again, and the master
posts another blocking receive from taskid 1, which never arrives since
taskid 1 is already off doing other things. Hence it just hangs forever...
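
If you want to check a compiler/MPI combination for this outside of Amber, a
minimal standalone program along the following lines should reproduce the
same communication pattern (this is just a sketch I am including for
illustration; it is not code from the Amber sources):

      program recv_loop_check
      implicit none
      include 'mpif.h'
      integer :: istartend(2), istatus(MPI_STATUS_SIZE)
      integer :: i, ier, rank, nprocs
      call mpi_init(ier)
      call mpi_comm_rank(MPI_COMM_WORLD, rank, ier)
      call mpi_comm_size(MPI_COMM_WORLD, nprocs, ier)
      if (rank == 0) then
        do i = 1, nprocs-1
          call mpi_recv(istartend, 2, MPI_INTEGER, i, 0, &
                        MPI_COMM_WORLD, istatus, ier)
          ! With a healthy compiler i is unchanged here; if it comes
          ! back corrupted you will see the wrong value printed, or a
          ! hang on the next pass through the loop.
          write(6,'(a,i4,a,2i8)') ' from rank ', i, ': ', istartend
        end do
      else
        istartend(1) = rank
        istartend(2) = rank + 1
        call mpi_send(istartend, 2, MPI_INTEGER, 0, 0, &
                      MPI_COMM_WORLD, ier)
      end if
      call mpi_finalize(ier)
      end program recv_loop_check

Build it with the same compiler and flags used for sander and run it on two
or more processors.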

I came across this myself about a month ago with ifort v9.1.039 on an x86_64
machine with our development version of Amber. In that case, however, the
problem only occurred when you turned on debug symbols for this routine and
turned off optimization, which is the opposite of most compiler bugs, which
only show up when optimization is on. Since I figured it was likely a
transient compiler problem that only affected debugging builds of the code, I
never bothered to produce a bugfix for Amber 9 that works around it.
However, it seems that on IA64 systems the bug exists even in
optimized code, so I will post a workaround.
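
One way to sidestep the issue entirely is sketched below: gather all the
ranges in a single collective call, so that no loop counter has to survive a
call to mpi_recv. This is only a sketch of the idea, not necessarily what the
attached patch does, and all_startend is an assumed scratch integer array
dimensioned (2,sandersize):

          istartend(1) = qmmm_mpi%kvec_start
          istartend(2) = qmmm_mpi%kvec_end
          !Every thread, master included, contributes its range in
          !one collective call - no receive loop on the master.
          call mpi_gather(istartend,2,mpi_integer,all_startend,2, &
                          mpi_integer,0,commsander,ier)
          if (qmmm_mpi%master) then
            do i = 0, sandersize-1
              write(6,'(a,i4,a,i8,a,i8,a,i8,a)') &
                  '|QMMM: Thread(',i,'): ',all_startend(1,i+1),'->', &
                  all_startend(2,i+1), &
                  ' (',all_startend(2,i+1)-all_startend(1,i+1)+1,')'
            end do
          end if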

Can you try the attached patch and see if it fixes the problem?

Save the file in $AMBERHOME/src and then do:

cd $AMBERHOME/src
make clean
patch -p0 <intel_bug_workaround.patch
make parallel
cd ../test
export DO_PARALLEL='mpirun -np 2'
make test.sander.QMMM.MPI

Let me know if this works and I'll post a proper bugfix for it.

All the best
Ross

/\
\/
|\oss Walker

| HPC Consultant and Staff Scientist |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
| http://www.rosswalker.co.uk | PGP Key available on request |


