Re: [AMBER] Scale sander.MPI to 64 ncpus - Amber11

From: Leong Wye Kit Damien <damienl.acrc.a-star.edu.sg>
Date: Fri, 18 Feb 2011 10:22:51 +0800

Good morning Dac,

Actually, we were trying a bigger dataset before:

-------------------------
cat equilibrate.groupfile
-O -rem 0 -i equilibrate.mdin.001 -o equilibrate.mdout.001 -c min.rst -r equilibrate.rst.001 -x equilibrate.mdcrd.001 -inf equilibrate.mdinfo.001 -p pep.parmtop
-O -rem 0 -i equilibrate.mdin.002 -o equilibrate.mdout.002 -c min.rst -r equilibrate.rst.002 -x equilibrate.mdcrd.002 -inf equilibrate.mdinfo.002 -p pep.parmtop
-O -rem 0 -i equilibrate.mdin.003 -o equilibrate.mdout.003 -c min.rst -r equilibrate.rst.003 -x equilibrate.mdcrd.003 -inf equilibrate.mdinfo.003 -p pep.parmtop
-O -rem 0 -i equilibrate.mdin.004 -o equilibrate.mdout.004 -c min.rst -r equilibrate.rst.004 -x equilibrate.mdcrd.004 -inf equilibrate.mdinfo.004 -p pep.parmtop
-O -rem 0 -i equilibrate.mdin.005 -o equilibrate.mdout.005 -c min.rst -r equilibrate.rst.005 -x equilibrate.mdcrd.005 -inf equilibrate.mdinfo.005 -p pep.parmtop
-O -rem 0 -i equilibrate.mdin.006 -o equilibrate.mdout.006 -c min.rst -r equilibrate.rst.006 -x equilibrate.mdcrd.006 -inf equilibrate.mdinfo.006 -p pep.parmtop
-O -rem 0 -i equilibrate.mdin.007 -o equilibrate.mdout.007 -c min.rst -r equilibrate.rst.007 -x equilibrate.mdcrd.007 -inf equilibrate.mdinfo.007 -p pep.parmtop
-O -rem 0 -i equilibrate.mdin.008 -o equilibrate.mdout.008 -c min.rst -r equilibrate.rst.008 -x equilibrate.mdcrd.008 -inf equilibrate.mdinfo.008 -p pep.parmtop
-O -rem 0 -i equilibrate.mdin.009 -o equilibrate.mdout.009 -c min.rst -r equilibrate.rst.009 -x equilibrate.mdcrd.009 -inf equilibrate.mdinfo.009 -p pep.parmtop
-O -rem 0 -i equilibrate.mdin.010 -o equilibrate.mdout.010 -c min.rst -r equilibrate.rst.010 -x equilibrate.mdcrd.010 -inf equilibrate.mdinfo.010 -p pep.parmtop
-O -rem 0 -i equilibrate.mdin.011 -o equilibrate.mdout.011 -c min.rst -r equilibrate.rst.011 -x equilibrate.mdcrd.011 -inf equilibrate.mdinfo.011 -p pep.parmtop
-O -rem 0 -i equilibrate.mdin.012 -o equilibrate.mdout.012 -c min.rst -r equilibrate.rst.012 -x equilibrate.mdcrd.012 -inf equilibrate.mdinfo.012 -p pep.parmtop
-O -rem 0 -i equilibrate.mdin.013 -o equilibrate.mdout.013 -c min.rst -r equilibrate.rst.013 -x equilibrate.mdcrd.013 -inf equilibrate.mdinfo.013 -p pep.parmtop
-O -rem 0 -i equilibrate.mdin.014 -o equilibrate.mdout.014 -c min.rst -r equilibrate.rst.014 -x equilibrate.mdcrd.014 -inf equilibrate.mdinfo.014 -p pep.parmtop
-O -rem 0 -i equilibrate.mdin.015 -o equilibrate.mdout.015 -c min.rst -r equilibrate.rst.015 -x equilibrate.mdcrd.015 -inf equilibrate.mdinfo.015 -p pep.parmtop
-O -rem 0 -i equilibrate.mdin.016 -o equilibrate.mdout.016 -c min.rst -r equilibrate.rst.016 -x equilibrate.mdcrd.016 -inf equilibrate.mdinfo.016 -p pep.parmtop
--------------------------------------
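The 16 groupfile lines above differ only in the three-digit replica suffix, so they can be generated rather than typed by hand. Here is a minimal Python sketch of that, which is not how the file above was actually produced. The function name `groupfile_lines` and its parameters are hypothetical; the flags and file names are copied from the listing.

```python
def groupfile_lines(n_groups, prefix="equilibrate",
                    start="min.rst", top="pep.parmtop"):
    """Build one sander command line per group, numbered 001..n_groups."""
    lines = []
    for i in range(1, n_groups + 1):
        tag = f"{i:03d}"  # zero-padded suffix, e.g. 001
        lines.append(
            f"-O -rem 0 -i {prefix}.mdin.{tag} -o {prefix}.mdout.{tag} "
            f"-c {start} -r {prefix}.rst.{tag} -x {prefix}.mdcrd.{tag} "
            f"-inf {prefix}.mdinfo.{tag} -p {top}"
        )
    return lines


if __name__ == "__main__":
    # Print the 16 lines; redirect to equilibrate.groupfile if desired.
    print("\n".join(groupfile_lines(16)))
```

Generating the file this way mainly rules out a typo in one of the 16 lines as the cause of a replica stopping early.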

We are using the minimization for troubleshooting purposes. With the above dataset, the results are:
-------------------------------------
ls -alh equilibrate.mdout.*
-rwxrwxrwx 1 biiliuy bii 7.4K Feb 16 13:36 equilibrate.mdout.001
-rwxrwxrwx 1 biiliuy bii 11K Feb 16 13:49 equilibrate.mdout.002
-rwxrwxrwx 1 biiliuy bii 15K Feb 16 13:38 equilibrate.mdout.003
-rwxrwxrwx 1 biiliuy bii 7.4K Feb 16 13:36 equilibrate.mdout.004
-rwxrwxrwx 1 biiliuy bii 9.7K Feb 16 13:49 equilibrate.mdout.005
-rwxrwxrwx 1 biiliuy bii 7.4K Feb 16 13:36 equilibrate.mdout.006
-rwxrwxrwx 1 biiliuy bii 7.4K Feb 16 13:36 equilibrate.mdout.007
-rwxrwxrwx 1 biiliuy bii 7.4K Feb 16 13:36 equilibrate.mdout.008
-rwxrwxrwx 1 biiliuy bii 9.7K Feb 16 13:41 equilibrate.mdout.009
-rwxrwxrwx 1 biiliuy bii 7.4K Feb 16 13:36 equilibrate.mdout.010
-rwxrwxrwx 1 biiliuy bii 7.6K Feb 16 13:50 equilibrate.mdout.011
-rwxrwxrwx 1 biiliuy bii 7.4K Feb 16 13:36 equilibrate.mdout.012
-rwxrwxrwx 1 biiliuy bii 7.6K Feb 16 13:50 equilibrate.mdout.013
-rwxrwxrwx 1 biiliuy bii 7.6K Feb 16 13:50 equilibrate.mdout.014
-rwxrwxrwx 1 biiliuy bii 7.6K Feb 16 13:50 equilibrate.mdout.015
-rwxrwxrwx 1 biiliuy bii 7.4K Feb 16 13:36 equilibrate.mdout.016
-----------------------------------
Look closely at the file sizes: of the 16, only two completed the run (with 64 ncpus in this case). With 32 ncpus, all 16 computed their results successfully. So some did work, but not all.
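To make the size comparison less error-prone, the listing can be tallied by size, and the number of MPI processes each group receives can be worked out for both runs. This is a rough Python sketch. `tally_sizes` and `procs_per_group` are hypothetical helper names, and the even split of processes across groups is an assumption about how the multi-group run divides the job, not something confirmed here.

```python
from collections import Counter

# The ls -alh output from above, copied verbatim.
LISTING = """\
-rwxrwxrwx 1 biiliuy bii 7.4K Feb 16 13:36 equilibrate.mdout.001
-rwxrwxrwx 1 biiliuy bii 11K Feb 16 13:49 equilibrate.mdout.002
-rwxrwxrwx 1 biiliuy bii 15K Feb 16 13:38 equilibrate.mdout.003
-rwxrwxrwx 1 biiliuy bii 7.4K Feb 16 13:36 equilibrate.mdout.004
-rwxrwxrwx 1 biiliuy bii 9.7K Feb 16 13:49 equilibrate.mdout.005
-rwxrwxrwx 1 biiliuy bii 7.4K Feb 16 13:36 equilibrate.mdout.006
-rwxrwxrwx 1 biiliuy bii 7.4K Feb 16 13:36 equilibrate.mdout.007
-rwxrwxrwx 1 biiliuy bii 7.4K Feb 16 13:36 equilibrate.mdout.008
-rwxrwxrwx 1 biiliuy bii 9.7K Feb 16 13:41 equilibrate.mdout.009
-rwxrwxrwx 1 biiliuy bii 7.4K Feb 16 13:36 equilibrate.mdout.010
-rwxrwxrwx 1 biiliuy bii 7.6K Feb 16 13:50 equilibrate.mdout.011
-rwxrwxrwx 1 biiliuy bii 7.4K Feb 16 13:36 equilibrate.mdout.012
-rwxrwxrwx 1 biiliuy bii 7.6K Feb 16 13:50 equilibrate.mdout.013
-rwxrwxrwx 1 biiliuy bii 7.6K Feb 16 13:50 equilibrate.mdout.014
-rwxrwxrwx 1 biiliuy bii 7.6K Feb 16 13:50 equilibrate.mdout.015
-rwxrwxrwx 1 biiliuy bii 7.4K Feb 16 13:36 equilibrate.mdout.016
"""


def tally_sizes(listing):
    """Count how many files share each size (5th column of ls -l output)."""
    sizes = Counter()
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 9:  # skip blank or malformed lines
            sizes[fields[4]] += 1
    return dict(sizes)


def procs_per_group(ncpus, n_groups):
    """Processes per group, assuming an even split (an assumption)."""
    if ncpus % n_groups != 0:
        raise ValueError("ncpus is not a multiple of the number of groups")
    return ncpus // n_groups


if __name__ == "__main__":
    print(tally_sizes(LISTING))
    for ncpus in (64, 32):
        print(ncpus, "ncpus ->", procs_per_group(ncpus, 16), "per group")
```

Under that assumption, both 64 and 32 divide evenly by 16 (4 and 2 processes per group), so an uneven split would not by itself explain the difference between the two runs.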

Sander did not hang; it simply did not carry on with the computation for the rest.

We are using InfiniBand (IB) connections. I believe it could be a misconfiguration on the Amber side, but I could not figure out where.


Kindly advise. Thanks.



Damien Leong
(Computing Systems Group)

A*STAR Compute Resource Centre (A*CRC). Biopolis
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Feb 17 2011 - 18:30:04 PST