Good morning, AMBER user group,
I hope someone can give me some clues or point me in the right direction.
I'm helping an AMBER (Amber11) user scale his sander.MPI computation from 32 CPUs to 64 CPUs. Everything works fine with 32 CPUs, but as soon as we scale up to 64 CPUs (and eventually, if possible, 128 CPUs), the run does not complete. I believe sander.MPI has a limit of 256 CPUs (please correct me if I'm wrong).
When it computes correctly (with 32 CPUs), we see output like this example:
------------------------------
| Local SIZE OF NONBOND LIST =     139474
| TOTAL SIZE OF NONBOND LIST =    4808242

   NSTEP       ENERGY          RMS            GMAX         NAME    NUMBER
      1      -4.0791E+04     3.1384E+01     2.8014E+03     O        1754

 BOND    =       77.0003  ANGLE   =      215.9942  DIHED      =      295.2149
 VDWAALS =     6162.9204  EEL     =   -48661.2899  HBOND      =        0.0000
 1-4 VDW =       81.5555  1-4 EEL =     1037.1123  RESTRAINT  =        0.0000

   NSTEP       ENERGY          RMS            GMAX         NAME    NUMBER
     50      -5.0768E+04     1.9930E+00     1.1115E+01     H1      12978

 BOND    =     2690.2212  ANGLE   =       57.6311  DIHED      =      266.7564
 VDWAALS =     4728.9004  EEL     =   -59638.3394  HBOND      =        0.0000
 1-4 VDW =       73.5857  1-4 EEL =     1053.2269  RESTRAINT  =        0.0000

   NSTEP       ENERGY          RMS            GMAX         NAME    NUMBER
    100      -5.2988E+04     9.6958E-01     7.1804E+00     H2      14272
---------------------------------------
OUR PROBLEM:
We have tried various approaches with 64 CPUs, and we always end up with this (incomplete) output:
---------------------------------------
4. RESULTS
APPROXIMATING switch and d/dx switch using CUBIC SPLINE INTERPOLATION
using 5000.0 points per unit in tabled values
TESTING RELATIVE ERROR over r ranging from 0.0 to cutoff
| CHECK switch(x): max rel err = 0.2738E-14 at 2.422500
| CHECK d/dx switch(x): max rel err = 0.8314E-11 at 2.736960
---------------------------------------------------
| Local SIZE OF NONBOND LIST = 75749
| TOTAL SIZE OF NONBOND LIST = 4808242
---------------------------------------
Nothing comes up after that; the run simply does not continue, and giving it a longer walltime does not help. Note that the TOTAL SIZE OF NONBOND LIST matches the 32-CPU run, so the setup looks consistent; the job appears to hang right after the nonbond list is built.
We are using a GNU/Linux x86 cluster of 200 compute nodes, each with 8 CPU cores, under the LSF job scheduler. We are using Intel MPI (4.0.0), so with span[ptile=8] a 64-CPU job spans 8 full nodes.
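In case it helps with diagnosis, we can re-run with Intel MPI's debug output turned up; I_MPI_DEBUG is a standard Intel MPI environment variable, and the level of 5 below is only our guess at a useful verbosity (the machine file is the same one built in the script below):
----------------------
# Assumption on our part: a higher I_MPI_DEBUG level prints startup
# and fabric details that may show where the 64-task run stalls.
export I_MPI_DEBUG=5
mpiexec -machinefile $MACHINEFILE -n 64 \
    /apps/Amber11/amber11/bin/sander.MPI -O -i min.in -o min_ARNO.out \
    -p ARNO.parmtop -c ARNO.inpcrd -r ARNO.rst
----------------------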
Our LSF-sander.MPI script:
----------------------
#!/bin/bash
#BSUB -q normal
#BSUB -R "span[ptile=8]"
#BSUB -W 1:00
#BSUB -J "min64"
#BSUB -o lsf%J.o
#BSUB -e lsf%J.e
#BSUB -R "rusage[mem=1000]"
#BSUB -n 64

TOTAL_CPUS=64
NODES=8    # 64 slots at 8 cores per node = 8 nodes

export AMBERHOME=/apps/Amber11/amber11
export PATH=$PATH:$AMBERHOME/bin

# $LSB_HOSTS holds one hostname per allocated slot, so with ptile=8
# each node's name appears 8 times in this machine file.
MACHINEFILE=mymacs.$LSB_JOBID
for i in $LSB_HOSTS; do echo $i; done > $MACHINEFILE

# Start the MPD ring across the nodes; bail out if it fails.
/usr/local/intel/impi/4.0.0.025/intel64/bin/mpdboot -n $NODES -f $MACHINEFILE
if [ $? -ne 0 ] ; then
    exit 1
fi

# Confirm that the expected number of mpd daemons is up.
RESULT=`/usr/local/intel/impi/4.0.0.025/intel64/bin/mpdtrace | wc -l`
if [ "$RESULT" != "$NODES" ] ; then
    /usr/local/intel/impi/4.0.0.025/intel64/bin/mpdallexit
    exit 1
fi

mpiexec -machinefile $MACHINEFILE -n $TOTAL_CPUS /apps/Amber11/amber11/bin/sander.MPI -O -i min.in -o min_ARNO.out -p ARNO.parmtop -c ARNO.inpcrd -r ARNO.rst

/usr/local/intel/impi/4.0.0.025/intel64/bin/mpdallexit
---------------------------------------------
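One detail we are unsure about and flag here in case it matters: because of span[ptile=8], the machine file above lists each node 8 times. If mpdboot expects one unique host per line, a deduplicated file would be needed for the boot step; a sketch (untested, our assumption):
----------------------
# Build a second file with each node listed once, and boot the
# MPD ring from that instead of the per-slot machine file.
sort -u $MACHINEFILE > uniq.$LSB_JOBID
/usr/local/intel/impi/4.0.0.025/intel64/bin/mpdboot -n $NODES -f uniq.$LSB_JOBID
----------------------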
More info:
-------------------
cat min.in
ringheterodim : initial minimisation prior to MD, whole system
&cntrl
imin = 1,
maxcyc = 1000,
ncyc = 500,
ntb = 1,
ntr = 0,
cut = 10,
/
---------------------
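For context, this input requests a straightforward minimisation: imin = 1 turns minimisation on, maxcyc = 1000 is the total number of cycles, ncyc = 500 switches from steepest descent to conjugate gradient after 500 steps, ntb = 1 keeps constant-volume periodic boundaries, ntr = 0 applies no positional restraints, and cut = 10 sets a 10 Angstrom nonbonded cutoff.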
We believe the Amber11 build and binaries are in order; otherwise we would not be able to run sander.MPI successfully with 32 CPUs. What we do not understand is why we cannot scale to 64 CPUs and beyond.
If anyone could advise us, or at least point us in the right direction, we would be very grateful. Many thanks.
If you need any clarification from us, please let us know. Thanks.
Cheers
Damien Leong
(Computing Systems Group)
A*STAR Compute Resource Centre (A*CRC). Biopolis