[AMBER] Scale sander.MPI to 64 ncpus - Amber11

From: Leong Wye Kit Damien <damienl.acrc.a-star.edu.sg>
Date: Fri, 18 Feb 2011 01:05:09 +0800

Good Morning Amber user group

I hope someone can give me some clues or point me in the right direction.

I'm helping an Amber (Amber11) user scale his sander.MPI computation from 32 CPUs to 64 CPUs. Everything seems to work fine with 32 CPUs, but once we scale up to 64 CPUs (and eventually, if possible, 128 CPUs), the run does not produce any results. I believe sander.MPI has a limit of 256 CPUs (please correct me if I'm wrong).

When it computes correctly (with 32 CPUs), we see output like this:

Example
------------------------------
| Local SIZE OF NONBOND LIST = 139474
| TOTAL SIZE OF NONBOND LIST = 4808242
   NSTEP ENERGY RMS GMAX NAME NUMBER
      1 -4.0791E+04 3.1384E+01 2.8014E+03 O 1754
 BOND = 77.0003 ANGLE = 215.9942 DIHED = 295.2149
 VDWAALS = 6162.9204 EEL = -48661.2899 HBOND = 0.0000
 1-4 VDW = 81.5555 1-4 EEL = 1037.1123 RESTRAINT = 0.0000
   NSTEP ENERGY RMS GMAX NAME NUMBER
     50 -5.0768E+04 1.9930E+00 1.1115E+01 H1 12978
 BOND = 2690.2212 ANGLE = 57.6311 DIHED = 266.7564
 VDWAALS = 4728.9004 EEL = -59638.3394 HBOND = 0.0000
 1-4 VDW = 73.5857 1-4 EEL = 1053.2269 RESTRAINT = 0.0000
   NSTEP ENERGY RMS GMAX NAME NUMBER
    100 -5.2988E+04 9.6958E-01 7.1804E+00 H2 14272
---------------------------------------


OUR PROBLEM :


We have tried various methods on 64 CPUs, but we always end up here (incorrect):
---------------------------------------
4. RESULTS

 APPROXIMATING switch and d/dx switch using CUBIC SPLINE INTERPOLATION
 using 5000.0 points per unit in tabled values
 TESTING RELATIVE ERROR over r ranging from 0.0 to cutoff
| CHECK switch(x): max rel err = 0.2738E-14 at 2.422500
| CHECK d/dx switch(x): max rel err = 0.8314E-11 at 2.736960
 ---------------------------------------------------
| Local SIZE OF NONBOND LIST = 75749
| TOTAL SIZE OF NONBOND LIST = 4808242
---------------------------------------
Nothing further appears; the run simply does not continue. We have tried giving it a longer walltime, but it still does not progress.
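
A rough sanity check on the two outputs above, as a sketch only: it assumes the nonbond list is split roughly evenly across MPI ranks, which is not confirmed here for sander.MPI. The helper name even_share is hypothetical. It divides the reported total list size by the rank count:

```shell
#!/bin/bash
# Hypothetical helper: integer share of a total across a number of ranks.
# Assumption (unverified for sander.MPI): the list is split roughly evenly.
even_share() {
  echo $(( $1 / $2 ))
}

# TOTAL SIZE OF NONBOND LIST, copied from both outputs above
TOTAL=4808242

even_share "$TOTAL" 32   # prints 150257
even_share "$TOTAL" 64   # prints 75128
```

Both values are of the same order as the reported local sizes (139474 with 32 CPUs, 75749 with 64 CPUs), which would suggest the list was built across all 64 ranks before the run stalled.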

We are using a GNU/Linux x86 cluster with 200 compute nodes, each with 8 CPU cores, managed by the LSF job scheduler.
We are using Intel MPI (4.0.0.025).


Our LSF-sander.MPI script:
----------------------
#!/bin/bash

#BSUB -q normal
#BSUB -R "span[ptile=8]"
#BSUB -W 1:00
#BSUB -J "min64"
#BSUB -o lsf%J.o
#BSUB -e lsf%J.e
#BSUB -R "rusage[mem=1000]"

#BSUB -n 64
TOTAL_CPUS=64
NODES=8

export AMBERHOME=/apps/Amber11/amber11
export PATH=$PATH:$AMBERHOME/bin

MACHINEFILE=mymacs.$LSB_JOBID
for i in `echo $LSB_HOSTS`; do echo $i; done > $MACHINEFILE

/usr/local/intel/impi/4.0.0.025/intel64/bin/mpdboot -n $NODES -f $MACHINEFILE
if [ $? -ne 0 ] ; then
  exit 1
fi
RESULT=`/usr/local/intel/impi/4.0.0.025/intel64/bin/mpdtrace | wc -l`
if [ "$RESULT" != "$NODES" ] ; then
  /usr/local/intel/impi/4.0.0.025/intel64/bin/mpdallexit
  exit 1
fi

mpiexec -machinefile $MACHINEFILE -n $TOTAL_CPUS /apps/Amber11/amber11/bin/sander.MPI -O -i min.in -o min_ARNO.out -p ARNO.parmtop -c ARNO.inpcrd -r ARNO.rst

/usr/local/intel/impi/4.0.0.025/intel64/bin/mpdallexit

---------------------------------------------
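
The script above hardcodes NODES=8 and hands mpdboot the full per-slot host list. A minimal sketch of deriving both from the scheduler instead, assuming LSB_HOSTS holds one host name per allocated slot (so each 8-core node appears 8 times). The helper names unique_hosts and count_nodes and the file name mpd.hosts.$LSB_JOBID are hypothetical, and whether mpdboot tolerates duplicate entries in its host file is not verified here:

```shell
#!/bin/bash
# Hypothetical helpers, not part of the original script.

# Print each distinct host name once, given a space-separated slot list.
unique_hosts() {
  # $1 is left unquoted on purpose so the shell splits it into one word per host
  printf '%s\n' $1 | sort -u
}

# Count the distinct hosts in a space-separated slot list.
count_nodes() {
  unique_hosts "$1" | wc -l | tr -d ' '
}

# Inside an LSF job, LSB_HOSTS is set by the scheduler
# (assumption: one entry per slot). Outside a job this block is skipped.
if [ -n "${LSB_HOSTS:-}" ]; then
  MPD_HOSTFILE=mpd.hosts.${LSB_JOBID:-0}     # hypothetical file name
  unique_hosts "$LSB_HOSTS" > "$MPD_HOSTFILE"
  NODES=$(count_nodes "$LSB_HOSTS")
  echo "nodes=$NODES hostfile=$MPD_HOSTFILE"
fi
```

With this, `mpdboot -n $NODES -f $MPD_HOSTFILE` would start one daemon per distinct host, and the existing per-slot $MACHINEFILE could still be passed to mpiexec unchanged.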

More info:
-------------------
cat min.in
ringheterodim : initial minimisation prior to MD, whole system
 &cntrl
  imin = 1,
  maxcyc = 1000,
  ncyc = 500,
  ntb = 1,
  ntr = 0,
  cut = 10,
 /
---------------------

We believe the Amber11 build and binaries are in order; otherwise sander.MPI would not run successfully with 32 CPUs. What we do not understand is why we cannot scale to 64 CPUs and beyond.

Could some helpful folks please advise, or at least point us in the right direction? Many thanks.

If you need any clarification, please let us know. Thanks.
 


Cheers

Damien Leong
(Computing Systems Group )

A*STAR Compute Resource Centre (A*CRC). Biopolis

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Feb 17 2011 - 09:30:04 PST