Re: [AMBER] mm_pbsa.pl in PARALLEL - part 3

From: Thomas Zeiser <thomas.zeiser.rrze.uni-erlangen.de>
Date: Thu, 22 Jul 2010 14:13:08 +0200

On Wed, Jul 21, 2010 at 12:33:00PM +0200, Anselm Horn wrote:
> Dear all,
>
> when I try to run some MM-GBSA calculations for a large number of
> snapshots (> 10,000) in parallel (with the number of processes less than
> the number of available cores on the machine, just to be sure), the
> mm_pbsa.pl script exits with the following error:
>
> Finished process 256 with PID 9546 has wrong exit code 0
> For details see: http://ambermd.org/Questions/mm_pbsa.html#ana_finished_proc
>
> This program error happens in different runs at a similar snapshot
> number (around 256). Has anyone else experienced such behaviour?
> Is there a solution, or a known limitation of the script?

I'd say it's a limitation of the script - or the way the forked
processes communicate with the master ... It should fail on any
*nix system when run on more than 255 snapshots.


The relevant parts are in src/mm_pbsa/mm_pbsa_calceneent.pm

line 239:
      # Finish fork
      #############
      $pm->finish($pnumber);

i.e. the "snapshot" number is used as the return value and checked in
subroutine analyze_finished_proc

line 282:
  if($pnumber != $exit_code){
    die("Finished process $pnumber with PID $pid has wrong exit code $exit_code\n

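Unfortunately, exit codes on Unix/Linux are limited to 0-255; thus,
for $pnumber > 255 the exit code will always be $pnumber MODULO
256. Process 256 therefore returns 0, which is exactly the mismatch
reported in the error message above.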
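
To see the truncation in isolation, here is a minimal sketch in Python (not Perl, and not part of mm_pbsa.pl). It assumes a Unix-like system where os.fork is available; the function name exit_status_seen_by_parent is made up for this illustration.

```python
import os


def exit_status_seen_by_parent(requested_code):
    """Fork a child that exits with requested_code and return the
    exit status the parent actually observes."""
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately, skipping the parent's cleanup handlers.
        os._exit(requested_code)
    # Parent: wait for this specific child and decode its exit status.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    # Hypothetical process numbers around the 8-bit boundary.
    for n in (255, 256, 257, 10000):
        print(n, "->", exit_status_seen_by_parent(n))
```

On Linux the parent should see 255, 0, 1 and 16 for these four values, i.e. only the low 8 bits of the requested code survive.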


If one can assume that the number of concurrently running processes is
<255, it should be quite safe to change line 282 in the following
way, i.e. compare the exit code with the snapshot number MODULO 256:

--- mm_pbsa_calceneent.pm.orig 2010-07-22 13:54:52.000000000 +0200
+++ mm_pbsa_calceneent.pm 2010-07-22 13:55:05.000000000 +0200
@@ -279,7 +279,8 @@
   my $exit_code = shift;
   my $pnumber = shift;

- if($pnumber != $exit_code){
+ # ATTENTION: exit codes can only be in the range of 0-255, thus, do a modulo on the snapshot number
+ if($pnumber%256 != $exit_code){
     die("Finished process $pnumber with PID $pid has wrong exit code $exit_code\nFor details see: $HTMLPATH#ana_finished_proc\n");
   }
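
The reason for the assumption about the number of concurrent processes can be sketched as follows, again in Python and purely as an illustration. The helper names exit_code_matches and ambiguous_pairs are hypothetical; only the modulo comparison itself comes from the patch.

```python
from itertools import combinations


def exit_code_matches(pnumber, exit_code):
    """Same comparison as the patched check: process number modulo 256."""
    return pnumber % 256 == exit_code


def ambiguous_pairs(running):
    """Return pairs of concurrently running process numbers that would
    report the same 8-bit exit code and so cannot be told apart."""
    return [(a, b) for a, b in combinations(running, 2)
            if a % 256 == b % 256]


if __name__ == "__main__":
    print(exit_code_matches(256, 0))          # the case from the error message
    print(ambiguous_pairs([1, 2, 257]))       # 1 and 257 collide
    print(ambiguous_pairs(range(300, 310)))   # a small window has no collisions
```

As long as fewer than 256 consecutive process numbers are in flight at once, no two of them share the same value modulo 256, so the modulo check still identifies each finished process within that window.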


Although not strictly mandatory, I would also pass the snapshot
number modulo 256 to the finish call for consistency. (If this is
not done explicitly, the truncation is still done silently by the
internals of Perl/the OS.)

--- mm_pbsa_calceneent.pm.orig 2010-07-22 13:54:52.000000000 +0200
+++ mm_pbsa_calceneent.pm 2010-07-22 14:02:26.000000000 +0200
@@ -234,9 +234,9 @@
         unlink $npdb . $number;
       }

- # Finish fork
+ # Finish fork (ATTENTION: valid exit codes are only 0-255)
       #############
- $pm->finish($pnumber);
+ $pm->finish($pnumber%256);

       # Delete procs_data
       ###################



Regards,

thomas
-- 
Dr.-Ing. Thomas Zeiser, HPC Services
Friedrich-Alexander-Universitaet Erlangen-Nuernberg
Regionales Rechenzentrum Erlangen (RRZE)
http://www.rrze.uni-erlangen.de/hpc/
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jul 22 2010 - 05:30:06 PDT