RE: [AMBER] amber parallel test fail

From: Nahoum Anthony <nahoum.anthony.strath.ac.uk>
Date: Fri, 23 Oct 2009 16:12:40 +0100

Hi Jason,

Thanks for your reply. I've tried what you said and have had no luck so
far...
After cleaning up the $AMBERHOME/test/jar_multi directory, ls -l gives:

total 528
-rw-rw-r-- 1 amber amber 92 Apr 4 2006 dist.RST
-rw-rw-r-- 1 amber amber 55000 Feb 14 2007 dist_vs_t.000.save
-rw-rw-r-- 1 amber amber 55000 Feb 14 2007 dist_vs_t.001.save
-rw-rw-r-- 1 amber amber 46822 Apr 4 2006 dna.crd.000
-rw-rw-r-- 1 amber amber 46822 Apr 4 2006 dna.crd.001
-rw-rw-r-- 1 amber amber 166 Oct 23 15:10 groups
-rw-rw-r-- 1 amber amber 17493 Mar 18 2008 mdout.jar.000.save
-rw-rw-r-- 1 amber amber 17551 Mar 18 2008 mdout.jar.001.save
-rw-rw-r-- 1 amber amber 259028 Apr 4 2006 prmtop
-rwxrwxr-x 1 amber amber 1520 Feb 27 2008 Run.jar

From there I run make test.parallel.MM < /dev/null and get my error for
jar_multi:

...
==============================================================
cd bintraj && ./Run.bintraj
diffing nc_headers.save with nc_headers
PASSED
==============================================================
make[1]: Leaving directory `/home/amber/AMBER/amber10/test'
export TESTsander=/home/amber/AMBER/amber10/exe/sander.MPI; cd 4096wat &&
./Run.column_fft
diffing mdout.column_fft.save with mdout.column_fft
PASSED
==============================================================
export TESTsander=/home/amber/AMBER/amber10/exe/sander.MPI; cd jar_multi &&
./Run.jar

 Running multisander version of sander amber10
    Total processors = 2
    Number of groups = 2
 

  Unit 6 Error on OPEN: mdout.jar.001

application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
rank 1 in job 393 imp1.sibs.strath.ac.uk_50001 caused collective abort of
all ranks
  exit status of rank 1: killed by signal 9
  ./Run.jar: Program error
make: *** [test.sander.BASIC.MPI] Error 1

After that, ls -l $AMBERHOME/test/jar_multi gives:

total 544
-rw-rw-r-- 1 amber amber 92 Apr 4 2006 dist.RST
-rw-rw-r-- 1 amber amber 55000 Feb 14 2007 dist_vs_t.000.save
-rw-rw-r-- 1 amber amber 55000 Feb 14 2007 dist_vs_t.001.save
-rw-rw-r-- 1 amber amber 46822 Apr 4 2006 dna.crd.000
-rw-rw-r-- 1 amber amber 46822 Apr 4 2006 dna.crd.001
-rw-rw-r-- 1 amber amber 396 Oct 23 15:17 gbin
-rw-rw-r-- 1 amber amber 459 Oct 23 15:17 gbin.000
-rw-rw-r-- 1 amber amber 459 Oct 23 15:17 gbin.001
-rw-rw-r-- 1 amber amber 166 Oct 23 15:17 groups
-rw-rw-r-- 1 amber amber 2900 Oct 23 15:17 mdout.jar.000
-rw-rw-r-- 1 amber amber 17493 Mar 18 2008 mdout.jar.000.save
-rw-rw-r-- 1 amber amber 17551 Mar 18 2008 mdout.jar.001.save
-rw-rw-r-- 1 amber amber 259028 Apr 4 2006 prmtop
-rwxrwxr-x 1 amber amber 1520 Feb 27 2008 Run.jar

So it looks as if mdout.jar.001 was never generated, and since mdout.jar.000
is much smaller than mdout.jar.000.save, here is its content, which clearly
shows that not much happened (a problem with mpich2?):


          -------------------------------------------------------
          Amber 10 SANDER 2008
          -------------------------------------------------------

| Run on 10/23/2009 at 16:40:33
  [-O]verwriting output

File Assignments:
| MDIN: gbin.000
| MDOUT: mdout.jar.000
|INPCRD: dna.crd.000
| PARM: prmtop
|RESTRT: restart.000
| REFC: refc
| MDVEL: mdvel
| MDEN: mden
| MDCRD: mdcrd.0
|MDINFO: mdinfo
|INPDIP: inpdip
|RSTDIP: rstdip
|INPTRA: inptraj
 
 Here is the input file:

 test of Jarzynski for a distance in DNA
 &cntrl
   nstlim=1000, cut=12.0, igb=1, saltcon=0.1,
   ntpr=100, ntwr=100000, ntt=3, gamma_ln=5.0,
   ntx=5, irest=1, ntwx=0, ig = 99931,
   ntc=2, ntf=2, tol=0.000001,
   dt=0.002, ntb=0, tempi=300., temp0=300.,
   jar=1,
 /
 &wt type='DUMPFREQ', istep1=1 /
 &wt type='END' /
DISANG=dist.RST
DUMPAVE=dist_vs_t.000
LISTIN=POUT
LISTOUT=POUT

--------------------------------------------------------------------------------
   1.  RESOURCE   USE:
--------------------------------------------------------------------------------
| Flags: MPI USE_MPI_IN_PLACE
| New format PARM file being parsed.
| Version =    1.000 Date = 07/12/01 Time = 15:10:28
My MPI_HOME is /usr/local, with all the MPI programs (mpiexec, ...) in
/usr/local/bin. Once again, any help would be most welcome.
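In case it helps, I think the failing test can be re-run on its own with
something like the commands below (DO_PARALLEL and TESTsander values taken
from the make output above; I'm assuming Run.jar picks both up from the
environment the same way the make target does):

   cd $AMBERHOME/test/jar_multi
   # clear any stale output from a previous attempt so sander can open it fresh
   rm -f mdout.jar.000 mdout.jar.001
   export TESTsander=$AMBERHOME/exe/sander.MPI
   export DO_PARALLEL='mpiexec -n 2'
   ./Run.jar
   # and check that the mpiexec on PATH is the mpich2 one sander.MPI was built against
   which mpiexec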
Cheers,
Nahoum
 
-----Original Message-----
From: amber-bounces.ambermd.org [mailto:amber-bounces.ambermd.org] On Behalf
Of Jason Swails
Sent: 23 October 2009 12:58
To: AMBER Mailing List
Subject: Re: [AMBER] amber parallel test fail

Nahoum,

It appears as though mdout.jar.001 already exists. This is probably because
the test has already been run in that directory, and Run.jar does not include
a command to remove the mdout files (it does, however, remove every other
output file it creates, so that fix is readily applied). If you go into
$AMBERHOME/test/jar_multi, rm -f mdout.jar.001 mdout.jar.000, and rerun the
tests, it should finish just fine (assuming, of course, that previous tests
don't have the same issue now that you've run them as well). The sander call
in Run.jar does not specify the -O (overwrite) flag, so it quits with an
error if it tries to open a 'new' file that already exists.
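(Untested, but the readily-applied fix I have in mind would just be one
cleanup line near the top of Run.jar, mirroring what it already does for the
other output files, e.g.

   # hypothetical addition to $AMBERHOME/test/jar_multi/Run.jar, run before
   # sander.MPI is launched, so stale mdouts never block the OPEN:
   /bin/rm -f mdout.jar.000 mdout.jar.001

so the test can be re-run without any manual cleanup.)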
Good luck!
Jason
On Fri, Oct 23, 2009 at 6:20 AM, Nahoum Anthony <nahoum.anthony.strath.ac.uk> wrote:
> Dear Amber users,
>
> I've compiled AMBER in parallel using ifort and the Math Kernel Libraries
> after successful installation in serial (passing all tests), make clean and
> configure with mpich2 (Myricom's version as we're using Myrinet
> interconnect). The make parallel command compiles without problem, but the
> jar_multi test fails and aborts the testing process... my terminal has this
> output:
>
> ...
>
> ==============================================================
> cd plane_plane_restraint && ./Run.dinuc_pln
> SANDER: Dinucleoside restrained with new plane-plane angle
>        restraint that was defined with new natural
>        language restraint input.
> diffing mdout.dinucAU_pln.save with mdout.dinucAU_pln
> PASSED
> ==============================================================
> diffing dinuc_pln_vs_t.save with dinuc_pln_vs_t
> PASSED
> ==============================================================
> cd bintraj && ./Run.bintraj
> diffing nc_headers.save with nc_headers
> PASSED
> ==============================================================
> make[1]: Leaving directory `/home/amber/AMBER/amber10/test'
> export TESTsander=/home/amber/AMBER/amber10/exe/sander.MPI; cd 4096wat &&
> ./Run.column_fft
> diffing mdout.column_fft.save with mdout.column_fft
> PASSED
> ==============================================================
> export TESTsander=/home/amber/AMBER/amber10/exe/sander.MPI; cd jar_multi &&
> ./Run.jar
>
>  Running multisander version of sander amber10
>    Total processors =     2
>    Number of groups =     2
>
>  Unit    6 Error on OPEN: mdout.jar.001
>
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
> rank 1 in job 197  imp1.sibs.strath.ac.uk_50001   caused collective abort of
> all ranks
>  exit status of rank 1: return code 1
>  ./Run.jar:  Program error
> make: *** [test.sander.BASIC.MPI] Error 1
>
> For the purpose of this test, I have DO_PARALLEL set to 'mpiexec -n 2' and
> I can see sander.MPI appearing on both nodes when I use 'top' to check
> processes whilst the test is running... I've checked the Amber mailing list
> archives and couldn't find anything to direct me as to what caused the
> problem. Anyone got an idea? Any more information required?
>
> Best regards and thanks for your time,
>
> Nahoum
>
--
---------------------------------------
Jason M. Swails
Quantum Theory Project,
University of Florida
Ph.D. Graduate Student
352-392-4032
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber