Re: AMBER: Amber 9 parallel test fail on 4096wat/Run.column_fft

From: Yu Chen <chen.hhmi.umbc.edu>
Date: Fri, 3 Nov 2006 10:18:35 -0500

Hello, Ross

Thanks for the reply. See my replies inline.

> Dear Yu,
>
> Assuming that you did not compile in support for binary
> trajectories, the -bintraj option to configure, then you can safely
> ignore the first error. The second error is strange. Are you
> certain DO_PARALLEL is set to 'mpirun -np 4'?

Yeah, I ignored the first error, and I am certain DO_PARALLEL is set
to "mpirun -np 4". Afterwards, I commented it out, and everything
finished nicely.

> Nonetheless we should try and track down where this is coming from.
> Can you try the following:
>
> export TESTsander=$AMBERHOME/exe/sander.MPI
> export DO_PARALLEL='mpirun -np 2'
> cd $AMBERHOME/test/4096wat
> ./Run.column_fft
> export DO_PARALLEL='mpirun -np 4'
> ./Run.column_fft
> export DO_PARALLEL='mpirun -np 8'
> ./Run.column_fft
>
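
The commands above can be wrapped in a loop so that more process counts
are covered in one go. This is only a minimal sketch: the function name
sweep_column_fft and the TEST_CMD override are hypothetical helpers, not
part of Amber, and the exit status is assumed to catch crashes such as a
failed assertion, not differences in the numerical output.

```shell
#!/bin/bash
# Sweep Run.column_fft over several MPI process counts and report PASS/FAIL.
# Assumption: sweep_column_fft and TEST_CMD are illustrative names only.

sweep_column_fft() {
    # "$@" holds the process counts; TEST_CMD defaults to the test script.
    local cmd="${TEST_CMD:-./Run.column_fft}"
    local np
    for np in "$@"; do
        export DO_PARALLEL="mpirun -np $np"
        # A nonzero exit status (e.g. "Program error") counts as a failure.
        if $cmd >/dev/null 2>&1; then
            echo "np=$np PASS"
        else
            echo "np=$np FAIL"
        fi
    done
}

# Run the sweep only when an Amber test tree is actually present.
if [ -n "${AMBERHOME:-}" ] && [ -d "$AMBERHOME/test/4096wat" ]; then
    export TESTsander="$AMBERHOME/exe/sander.MPI"
    ( cd "$AMBERHOME/test/4096wat" && sweep_column_fft 2 4 8 16 32 64 128 )
fi
```

The subshell around the cd keeps the caller's working directory
unchanged.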

Here is the interesting part. I ran the tests: they passed with np=2,
8, 32, and 128, but failed with np=4, 16, and 64 with the "ASSERTion
'processor == numtasks' failed in spatial_fft.f" error. Just to try
it, I also ran with process counts that are not powers of 2, and those
all failed.

Physically, we have a 25-node cluster plus the head node, each with
two AMD Athlon CPUs, and we use LAM/MPI.

BTW, do any other programs in Amber require the number of processors
to be a power of 2?

Thanks,
Chen

> And let us know which tests pass and which don't here. Then perhaps
> Mike Crowley might have a better chance of tracking down where the
> problem is coming from.
>
> All the best
> Ross
> /\
> \/
> |\oss Walker
>
> | HPC Consultant and Staff Scientist |
> | San Diego Supercomputer Center |
> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> | http://www.rosswalker.co.uk | PGP Key available on request |
>
> Note: Electronic Mail is not secure, has no guarantee of delivery,
> may not be read every day, and should not be used for urgent or
> sensitive issues.
>
>
>
> From: owner-amber.scripps.edu [mailto:owner-amber.scripps.edu] On
> Behalf Of Yu Chen
> Sent: Wednesday, November 01, 2006 08:23
> To: Amber Maillist
> Subject: AMBER: Amber 9 parallel test fail on 4096wat/Run.column_fft
>
> Hi, I have successfully compiled and installed Amber 9 on our RHEL
> AS 3 Linux cluster. It passed the serial tests, but in the parallel
> tests I got the following errors. I hope someone can help me with
> them. Thanks in advance!
>
> First, our configurations:
> =================================
> RHEL AS 3 on Athlon,
> LAM 7.0.6, compiled with the Intel compilers (icc/icpc/ifort),
> version 8.0
> Amber 9 was compiled using the same compilers
> DO_PARALLEL set to 'mpirun -np 4'
>
> Second, the error messages:
> ====================================
> ...
> ...
> cd bintraj; ./Run.bintraj
> sander and ptraj: test sander netCDF output and ptraj netCDF input
> ----------------------------------------------------------------------
> -------
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 29219 failed on node n1 (10.0.0.8) with exit status 1.
> ----------------------------------------------------------------------
> -------
> ./Run.bintraj: Program error
> make[1]: [test.sander.BASIC] Error 1 (ignored)
> make[1]: Leaving directory `/raid5/p2/raid1_p12/hhmi/software/Amber/
> v9/test'
> export TESTsander=/hhmi/software/Amber/v9/exe/sander.MPI; cd
> 4096wat; ./Run.column_fft
> ASSERTion 'processor == numtasks' failed in spatial_fft.f at line 488.
> ----------------------------------------------------------------------
> -------
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 29221 failed on node n1 (10.0.0.8) with exit status 1.
> ----------------------------------------------------------------------
> -------
> ./Run.column_fft: Program error
> make: *** [test.sander.BASIC.MPI] Error 1
>
> ===============================================
>
>
> Yu Chen
> chen.hhmi.umbc.edu
> Baltimore, MD 21250
>
>
>

Yu Chen
chen.hhmi.umbc.edu
Baltimore, MD 21250




-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
Received on Sun Nov 05 2006 - 06:07:42 PST