Re: Error running AMBER6 on Beowulf cluster

From: <arubin_at_unmc.edu>
Date: Fri 22 Nov 2002 15:38:12 -0600

Dear Dr. Case,
    Thanks a lot for your recommendations. This is indeed our first
  parallel run on this Beowulf cluster (RedHat, Myrinet, PG compiler). A
  calculation using two processors stops abnormally with a similar error
  message (see below). The calculation using one processor runs
  successfully.

  ********************************************************************
  # message ? sander.07732
  Warning: no access to tty (Bad file descriptor).
  Thus no job control in this shell.
  Name "main::arch" used only once: possible typo at
  /home/usr/mpich.pgi/bin/mpirun.ch_gm.pl line 26.
  | Atom division among processors:
  | 0 5610 11220
  | Atom division among processors for gb:
  | 0 5610 11220
  | Running AMBER/MPI version on 2 nodes


       Sum of charges from parm topology file = 0.00000000
       Forcing neutrality...
   ---------------------------------------------------
   APPROXIMATING switch and d/dx switch using CUBIC SPLINE INTERPOLATION
   using 5000.0 points per unit in tabled values
   TESTING RELATIVE ERROR over r ranging from 0.0 to cutoff
  | CHECK switch(x): max rel err = 0.3242E-14 at 2.436720
  | CHECK d/dx switch(x): max rel err = 0.8064E-11 at 2.761360
   ---------------------------------------------------
       Total number of mask terms = 12578
       Total number of mask terms = 25156
  | Total Ewald setup time = 0.14000000

  ------------------------------------------------------------------------------

  Unit 7 Error on OPEN:
     Unit 7 Error on OPEN:
  [1] MPI Abort by user Aborting program !
  [1] Aborting program!
  done
  ***************************************************************

  Thanks a lot.
  Sincerely yours,


Alexander Rubinshtein, Ph.D.
UNMC Eppley Cancer Center
Molecular Modeling Core Facility
_________________________________
University of Nebraska Medical Center
986805 Nebraska Medical Center
Omaha, Nebraska 68198-6805
USA
Office: (402) 559-5319
Fax: (402) 559-4651
E-mail: arubin_at_unmc.edu
WWW: http://www.unmc.edu/Eppley


                                                                                                      
                      "David A. Case"
                      <arubin_at_unmc.edu
> cc: amber_at_heimdal.compchem.ucsf.edu
                                               Subject: Re: Error running AMBER6 on Beowulf cluster
                      11/21/2002 07:31
                      PM
                      Please respond to
                      amber
                                                                                                      
                                                                                                      




On Thu, Nov 21, 2002, arubin_at_unmc.edu wrote:
>
> We ran into a problem with an MD simulation using AMBER6 on the Beowulf
> cluster (RedHat, Myrinet, PG compiler). To run an MPI job on 8 processors
> we used the "mpirun.ch_gm" script. The calculation stops abnormally. Could
> you help us find out what is going on? Does anyone have an idea? I am
> attaching the output file and error message (see below).
> ********************************************************************
> ---------------------------------------------------
> APPROXIMATING switch and d/dx switch using CUBIC SPLINE INTERPOLATION
> using 5000.0 points per unit in tabled values
> TESTING RELATIVE ERROR over r ranging from 0.0 to cutoff
> ---------------------------------------------------
> APPROXIMATING switch and d/dx switch using CUBIC SPLINE INTERPOLATION
> using 5000.0 points per unit in tabled values
> TESTING RELATIVE ERROR over r ranging from 0.0 to cutoff
> ---------------------------------------------------

etc.

I don't understand why all of the processors are printing out messages
like this...it should only happen for processor number 0. Somehow, all
of your jobs think they are processor 0, but I don't understand why.

> Unit 7 Error on OPEN:

This is the real error. Unit 7 is used for the "mdinfo" file. I'm
guessing that several processors are all trying to write to this file at
once, but again, I don't know why...
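
I/O like this is normally guarded so that only the master process opens
the file, which is why every rank believing it is processor 0 would
produce exactly this collision. A minimal sketch of the guard pattern
(illustrative only; not Amber 6's actual code, and "master" follows the
naming used in point 3 below):

      logical master
c     master should be .true. only on processor 0
      if (master) then
c        only one process touches unit 7, so two ranks can
c        never collide trying to open "mdinfo" at once
         open(unit=7, file='mdinfo', status='unknown')
         write(7,*) 'step summary goes here'
         close(7)
      end if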


Obvious questions:

1. Can you run a short job on one processor? Can you run a short job
on two processors?

2. Is this your first parallel run on this hardware/OS/software
configuration? That is, do other jobs work and only this particular one
fails, or do all parallel Amber jobs fail, etc.?

3. You will probably have to go into sander.f and (someplace after
mpi_init()) print out "mytaskid" for each processor, along with the value
of the "master" variable (which should be true on node 0 and false
everywhere else). Then maybe later on, say inside runmd(), print the same
info; a sketch follows below. Maybe something in memory is being
clobbered.
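
A minimal sketch of such a debug print, assuming the Amber 6 names
mytaskid, numtasks, and master (the MPI calls themselves are standard
MPI-1 Fortran bindings):

      include 'mpif.h'
      integer mytaskid, numtasks, ierr
      logical master

      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, mytaskid, ierr)
      call mpi_comm_size(MPI_COMM_WORLD, numtasks, ierr)
      master = mytaskid .eq. 0

c     every rank reports its identity; exactly one line should
c     print master = T if MPI initialization is working correctly
      write(6,*) 'rank ', mytaskid, ' of ', numtasks,
     &           ' master = ', master

If two ranks both print master = T here, that would explain both the
duplicated setup output and the collision on unit 7.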

..good luck...dac


--
==================================================================
David A. Case                     |  e-mail:      case_at_scripps.edu
Dept. of Molecular Biology, TPC15 |  fax:          +1-858-784-8896
The Scripps Research Institute    |  phone:        +1-858-784-9768
10550 N. Torrey Pines Rd.         |  home page:
La Jolla CA 92037  USA            |    http://www.scripps.edu/case
==================================================================
Received on Fri Nov 22 2002 - 13:38:12 PST