We've successfully used the PGI compiler with MPICH, but we have not seen
this type of behavior. We have not used Myrinet, but we have used Giganet
cLAN and it also worked. Did your Myrinet setup come with any test
programs? I recommend getting those working before running Amber.
Carlos
===================================================================
Carlos L. Simmerling, Ph.D.
Assistant Professor Phone: (631) 632-1336
Center for Structural Biology Fax: (631) 632-1555
Stony Brook University Web: http://comp.chem.sunysb.edu/carlos
Stony Brook, NY 11794-5115 E-mail: carlos.simmerling_at_stonybrook.edu
===================================================================
----- Original Message -----
From: <arubin_at_unmc.edu>
To: <amber_at_heimdal.compchem.ucsf.edu>
Sent: Friday, November 22, 2002 4:38 PM
Subject: Re: Error running AMBER6 on Beowulf cluster
>
> Dear Dr. Case,
> Thanks a lot for your recommendations. This is really our first
> parallel run on this Beowulf cluster (RedHat, Myrinet, PGI compiler). A
> calculation using two processors stops abnormally with a similar error
> message (see below). A calculation using one processor runs
> successfully.
>
> ********************************************************************
> # message ? sander.07732
> Warning: no access to tty (Bad file descriptor).
> Thus no job control in this shell.
> Name "main::arch" used only once: possible typo at
> /home/usr/mpich.pgi/bin/mpirun.ch_gm.pl line 26.
> | Atom division among processors:
> | 0 5610 11220
> | Atom division among processors for gb:
> | 0 5610 11220
> | Running AMBER/MPI version on 2 nodes
>
>
> Sum of charges from parm topology file = 0.00000000
> Forcing neutrality...
> ---------------------------------------------------
> APPROXIMATING switch and d/dx switch using CUBIC SPLINE INTERPOLATION
> using 5000.0 points per unit in tabled values
> TESTING RELATIVE ERROR over r ranging from 0.0 to cutoff
> | CHECK switch(x): max rel err = 0.3242E-14 at 2.436720
> | CHECK d/dx switch(x): max rel err = 0.8064E-11 at 2.761360
> ---------------------------------------------------
> Total number of mask terms = 12578
> Total number of mask terms = 25156
> | Total Ewald setup time = 0.14000000
>
> --------------------------------------------------------------------------
>
> Unit 7 Error on OPEN:
> Unit 7 Error on OPEN:
> [1] MPI Abort by user Aborting program !
> [1] Aborting program!
> done
> ***************************************************************
>
> Thanks a lot.
> Sincerely yours,
>
>
> Alexander Rubinshtein, Ph.D.
> UNMC Eppley Cancer Center
> Molecular Modeling Core Facility
> _________________________________
> University of Nebraska Medical Center
> 986805 Nebraska Medical Center
> Omaha, Nebraska 68198-6805
> USA
> Office: (402) 559-5319
> Fax: (402) 559-4651
> E-mail: arubin_at_unmc.edu
> WWW: http://www.unmc.edu/Eppley
>
>
>
> "David A. Case"
> <arubin_at_unmc.edu
> > cc:
amber_at_heimdal.compchem.ucsf.edu
> Subject: Re: Error running
AMBER6 on Beowulf cluster
> 11/21/2002 07:31
> PM
> Please respond to
> amber
>
>
>
>
>
>
> On Thu, Nov 21, 2002, arubin_at_unmc.edu wrote:
> >
> > We ran into a problem with an MD simulation using AMBER6 on the Beowulf
> > cluster (RedHat, Myrinet, PGI compiler). To run an MPI job on 8
> > processors we used the "mpirun.ch_gm" script. The calculation stops
> > abnormally. Could you help us find out what is going on? Does anyone
> > have an idea? I am attaching the output file and error message (see
> > below).
> > ********************************************************************
> > ---------------------------------------------------
> > APPROXIMATING switch and d/dx switch using CUBIC SPLINE INTERPOLATION
> > using 5000.0 points per unit in tabled values
> > TESTING RELATIVE ERROR over r ranging from 0.0 to cutoff
> > ---------------------------------------------------
> > APPROXIMATING switch and d/dx switch using CUBIC SPLINE INTERPOLATION
> > using 5000.0 points per unit in tabled values
> > TESTING RELATIVE ERROR over r ranging from 0.0 to cutoff
> > ---------------------------------------------------
>
> etc.
>
> I don't understand why all of the processors are printing out messages
> like this...it should only happen for processor number 0. Somehow, all
> of your jobs think they are processor 0, but I don't understand why.
>
> > Unit 7 Error on OPEN:
>
> This is the real error. Unit 7 is used for the "mdinfo" file. I'm
> guessing
> that several processors are all trying to write to this file at once, but
> again, I don't know why...
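>
> A minimal sketch of the kind of guard one would expect around that
> open (hedged: the "master" flag and the unit 7 / "mdinfo" pairing come
> from this discussion; the rest is illustrative, not the actual sander
> source):
>
>       ! Only the master task (rank 0) should open and write mdinfo;
>       ! if "master" is wrongly .true. on every task, they all race
>       ! to open unit 7 and the later opens fail.
>       if (master) then
>          open(unit=7, file='mdinfo', status='unknown')
>       end if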
>
>
> Obvious questions:
>
> 1. Can you run a short job on one processor? Can you run a short job
> on two processors?
>
> 2. Is this your first parallel run on this hardware/OS/software
> configuration? That is, do other jobs work while this particular one
> fails, or do all parallel Amber jobs fail?
>
> 3. You will probably have to go into sander.f and, someplace after
> mpi_init(), print out "mytaskid" for each processor, along with the
> value of the "master" variable (which should be true on node 0 and
> false everywhere else); a sketch follows. Then maybe later on, say
> inside runmd(), print the same info. Maybe something in memory is
> being clobbered.
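>
> For example, a minimal sketch of that debugging print (the names
> "mytaskid" and "master" are sander's own, per the suggestion above;
> the placement and format here are illustrative):
>
>       ! after the existing call mpi_init(ierr) in sander.f:
>       call mpi_comm_rank(MPI_COMM_WORLD, mytaskid, ierr)
>       write(6,*) 'mytaskid =', mytaskid, '  master =', master
>
> If every task prints mytaskid = 0 and master = T, the ranks are not
> being assigned at all, which would point at the MPICH/Myrinet layer
> rather than at sander itself.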
>
> ..good luck...dac
>
> --
>
> ==================================================================
> David A. Case | e-mail: case_at_scripps.edu
> Dept. of Molecular Biology, TPC15 | fax: +1-858-784-8896
> The Scripps Research Institute | phone: +1-858-784-9768
> 10550 N. Torrey Pines Rd. | home page:
> La Jolla CA 92037 USA | http://www.scripps.edu/case
> ==================================================================
Received on Fri Nov 22 2002 - 15:33:49 PST