Re: Error running AMBER6 on Beowulf cluster

From: David A. Case <case_at_scripps.edu>
Date: Thu 21 Nov 2002 17:31:47 -0800

On Thu, Nov 21, 2002, arubin_at_unmc.edu wrote:
>
> We ran into a problem with an MD simulation using AMBER6 on a Beowulf
> cluster (RedHat, Myrinet, PG compiler). To run an MPI job on 8 processors
> we used the "mpirun.ch_gm" script. The calculation stops abnormally. Could
> you help us figure out what is going on? Does anyone have any ideas? I am
> attaching the output file and error message (see below).
> ********************************************************************
> ---------------------------------------------------
> APPROXIMATING switch and d/dx switch using CUBIC SPLINE INTERPOLATION
> using 5000.0 points per unit in tabled values
> TESTING RELATIVE ERROR over r ranging from 0.0 to cutoff
> ---------------------------------------------------
> APPROXIMATING switch and d/dx switch using CUBIC SPLINE INTERPOLATION
> using 5000.0 points per unit in tabled values
> TESTING RELATIVE ERROR over r ranging from 0.0 to cutoff
> ---------------------------------------------------

etc.

I don't understand why all of the processors are printing messages
like this...it should only happen for processor number 0. Somehow, all
of your jobs think they are processor 0, but I don't understand why.

> Unit 7 Error on OPEN:

This is the real error. Unit 7 is used for the "mdinfo" file. I'm guessing
that several processors are all trying to write to this file at once, but
again, I don't know why...
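
For what it's worth, only the master process is supposed to open unit 7.
The guard looks roughly like this (a sketch, not the literal sander
source):

      if (master) then
c        only the master task (node 0) should open or write mdinfo
         open(unit=7, file='mdinfo', status='unknown')
      end if

If every task believes it is the master, as the duplicated output above
suggests, then they will all race to open the same file, and an open
error on unit 7 is exactly what you would expect.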


Obvious questions:

1. Can you run a short job on one processor? Can you run a short job
on two processors? (Example commands are given below, after question 3.)

2. Is this your first parallel run on this hardware/OS/software configuration?
That is, do other parallel jobs work and only this one fail, or do all
parallel Amber jobs fail?

3. You will probably have to go into sander.f and, someplace after the
call to mpi_init(), print out "mytaskid" for each processor, along with
the value of the "master" variable (which should be true on node 0 and
false everywhere else). Then maybe later on, say inside runmd(), print
the same info. Maybe something in memory is being clobbered. (A minimal
sketch of this diagnostic is given below.)
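
For question 1, something along these lines should do (assuming your
mpirun.ch_gm takes the usual MPICH-style -np flag; the file names are
placeholders for your own):

   mpirun.ch_gm -np 1 sander -O -i mdin -o mdout.1p -p prmtop -c inpcrd
   mpirun.ch_gm -np 2 sander -O -i mdin -o mdout.2p -p prmtop -c inpcrd

If one processor works and two already show the duplicated banners, you
have a much smaller test case to debug.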
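
For question 3, here is a minimal sketch of the diagnostic as a
stand-alone program (illustrative only; the names mytaskid and master
follow the sander convention, everything else is made up):

c     print each task's rank and the derived "master" flag right
c     after mpi_init, which is what you want to check inside sander
      program rankcheck
      implicit none
      include 'mpif.h'
      integer ierr, mytaskid, numtasks
      logical master
      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, mytaskid, ierr)
      call mpi_comm_size(MPI_COMM_WORLD, numtasks, ierr)
      master = (mytaskid .eq. 0)
      write(6,*) 'mytaskid =', mytaskid, ' master =', master,
     &           ' numtasks =', numtasks
      call mpi_finalize(ierr)
      end

If every task prints mytaskid = 0 here as well, the problem is in your
MPI/GM installation rather than in sander.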

..good luck...dac


-- 
==================================================================
David A. Case                     |  e-mail:      case_at_scripps.edu
Dept. of Molecular Biology, TPC15 |  fax:          +1-858-784-8896
The Scripps Research Institute    |  phone:        +1-858-784-9768
10550 N. Torrey Pines Rd.         |  home page:                   
La Jolla CA 92037  USA            |    http://www.scripps.edu/case
==================================================================