Re: AMBER: MPI parallel problem in AMBER 8.0

From: mkseo <seo.ualberta.ca>
Date: Tue, 2 May 2006 16:14:16 -0600

Thanks for the reply, Ross.

I tested my own example with the serial version and got the following
message again.

ELAN_EXCEPTION . --: 6 (Initialisation error)
    elan_init: No capability, can't continue
forrtl: error (76): IOT trap signal
    0: __FINI_00_remove_gp_range [0x3ff81a21488]
    1: __FINI_00_remove_gp_range [0x3ff81a2a910]
    2: __FINI_00_remove_gp_range [0x3ff800d9cc0]
    3: __FINI_00_remove_gp_range [0x3ff800ed7d4]
    4: __FINI_00_remove_gp_range [0x3000200f298]
    5: __FINI_00_remove_gp_range [0x3000200a9f0]
    6: __FINI_00_remove_gp_range [0x3000200d578]
    7: __FINI_00_remove_gp_range [0x3ffbff9f928]
    8: __FINI_00_remove_gp_range [0x3ffbffd4c70]
    9: __FINI_00_remove_gp_range [0x3ffbffd3ff0]
   10: multisander_ [_sander.f: 610, 0x12004da88]
   11: main [for_main.c: 203, 0x12013338c]
   12: __start [0x12001b038]


What is this about?

Below is the input I used for the test.
Do you see any problem?

------------------------------------------------------------------------
# Control section
  &cntrl
   ntwe = 200, ntwx = 200, ntwv = 200, ntpr = 200,
   ntt = 1, temp0 = 300.0, tempi = 5.0, tautp = 1.0,
   scnb = 2.0, scee = 1.2, dielc = 1,
   ntb = 0, ntc = 2, ntf = 2,
   nstlim = 250000, dt = 0.0010,
   ntt = 1, ntp = 0,
   ntx = 1, nmropt = 1
  &end

  &wt
   type = 'TEMP0', istep1 = 1, istep2 = 50000, value1 = 5.0, value2 = 300.0,
  &end

  &wt
   type = 'TEMP0', istep1 = 30001, istep2 = 200000, value1 = 300.0, value2 = 300.0,
  &end

  &wt
   type='END'
  &end

LISTOUT=POUT
DISANG = restraints_2.in
------------------------------------------------------------------------




On 2-May-06, at 3:42 PM, Ross Walker wrote:

> Hi MK
>
> These sorts of tracebacks are often not of much use; however, in your case
> it does appear to shed some light on your problem.
>
>> [0] MPI Abort by user Aborting program !
>> forrtl: error (76): IOT trap signal
>> 0: __FINI_00_remove_gp_range [0x3ff81a21488]
>> 1: __FINI_00_remove_gp_range [0x3ff81a2a910]
>> 2: __FINI_00_remove_gp_range [0x3ff800d9cc0]
>> 3: __FINI_00_remove_gp_range [0x3ff800ed7d4]
>> 4: __FINI_00_remove_gp_range [0x3ff802206a0]
>> 5: __FINI_00_remove_gp_range [0x3ff80140554]
>> 6: __FINI_00_remove_gp_range [0x3ff801d2748]
>> 7: __FINI_00_remove_gp_range [0x3ffbffa08d0]
>> 8: __FINI_00_remove_gp_range [0x3ffbff9c9c8]
>> 9: __FINI_00_remove_gp_range [0x3ffbff9ca08]
>> 10: mexit_ [_mexit.f: 355, 0x12011ad10]
>
> Line 10 here is the important line. This was a call to the function mexit(),
> which is sander's legitimate exit routine. This generated lines 0 to 9,
> which are some funky stuff from your MPI implementation. I would just ignore
> them. The thing to note is that the abort was by user, i.e. not a segfault;
> the code itself actually quit. In this case sander quit by calling mexit.
>
> If we look further back along the trace:
>
>> 11: mdread2_ [_mdread.f: 3690, 0x120095848]
>> 12: sander_ [_sander.f: 2497, 0x1200504e4]
>> 13: multisander_ [_sander.f: 885, 0x12004f240]
>> 14: main [for_main.c: 203, 0x12013338c]
>> 15: __start [0x12001b038]
>
> We see that the routine that called mexit was mdread2. This routine is
> responsible for reading in the namelist and other control variables. It
> evidently found some error and aborted. The error is probably printed
> somewhere in your output, either in the mdout file or in the stderr or
> stdout file generated by your batch scheduler. Alternatively, depending on
> how your system is set up, it may simply have been lost. This loss of error
> messages is a common occurrence with codes running in parallel...
>
> Either way, the problem is likely an error in your input, especially if the
> tests work but your run doesn't. Have you successfully tested this system
> with a serial run? You should always do this first to check things are okay
> before running in parallel. E.g. run it for 10 steps or so in serial and see
> if it works okay. Error messages are generally much, much clearer when
> running in serial.
>
> All the best
> Ross
>
> /\
> \/
> |\oss Walker
>
> | HPC Consultant and Staff Scientist |
> | San Diego Supercomputer Center |
> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> | http://www.rosswalker.co.uk | PGP Key available on request |
>
> Note: Electronic Mail is not secure, has no guarantee of delivery, may not
> be read every day, and should not be used for urgent or sensitive issues.
>
>
> -----------------------------------------------------------------------
> The AMBER Mail Reflector
> To post, send mail to amber.scripps.edu
> To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
>
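
For what it's worth, the way the traceback is read in the quoted reply above
(skip the frames that carry no source file, then look at the first ones that
do) can be sketched in a few lines of Python. This is only a rough
illustration: the regular expression and the helper names parse_frames and
frames_with_source are my own assumptions, not part of AMBER or the Fortran
runtime, and the frame format is inferred from the lines pasted in this thread.

```python
import re

# One traceback frame, as pasted in this thread, e.g.
#   "10: mexit_ [_mexit.f: 355, 0x12011ad10]"
#   "0: __FINI_00_remove_gp_range [0x3ff81a21488]"
# Leading whitespace or mail-quote markers ("> >> ") are skipped.
FRAME_RE = re.compile(
    r"^\W*(?P<idx>\d+):\s+(?P<func>\S+)\s+"
    r"\[(?:(?P<file>[^:\]]+):\s*(?P<line>\d+),\s*)?(?P<addr>0x[0-9a-fA-F]+)\]"
)


def parse_frames(text):
    """Return one dict per recognisable frame line (hypothetical helper)."""
    frames = []
    for raw in text.splitlines():
        m = FRAME_RE.match(raw)
        if m is None:
            continue  # not a frame line (error banner, blank line, ...)
        line_no = m.group("line")
        frames.append({
            "index": int(m.group("idx")),
            "func": m.group("func"),
            "file": m.group("file"),  # None when only an address is given
            "line": int(line_no) if line_no else None,
        })
    return frames


def frames_with_source(frames):
    """Keep only frames that name a source file (hypothetical helper)."""
    return [f for f in frames if f["file"] is not None]


# Frames 9 to 13 of the trace quoted in the reply above.
trace = """\
9: __FINI_00_remove_gp_range [0x3ffbff9ca08]
10: mexit_ [_mexit.f: 355, 0x12011ad10]
11: mdread2_ [_mdread.f: 3690, 0x120095848]
12: sander_ [_sander.f: 2497, 0x1200504e4]
13: multisander_ [_sander.f: 885, 0x12004f240]
"""

named = frames_with_source(parse_frames(trace))
for f in named:
    print(f["index"], f["func"], f["file"], f["line"])
```

With the quoted trace, the first frame kept is mexit_ and the next is
mdread2_, which matches the reading given in the reply. Applied to the trace
at the top of this message, the same filter would keep only multisander_
(_sander.f, line 610), main and __start.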

Received on Wed May 03 2006 - 06:07:12 PDT