RE: Beowulf killing jobs

From: Ross Walker <ross_at_rosswalker.co.uk>
Date: Mon 16 Sep 2002 15:46:45 +0100

Dear Peter

>I am just beginning to run jobs on a new Beowulf cluster and am having
>a strange problem. Jobs running with 1 cpu/node, 1 node and 2 nodes or
>2 cpu/node and 1 node run just fine. But, if I try to run with 4 CPUs

>[pgannett_at_energy b_nomod_ss_prod]$ cat sample_1ppn_4no.e1086
>=>> PBS: job killed: node 3 (node2) requested job die, code 1099

This sounds very much like a setting in your PBS system that is preventing
you from running 4 processes at once. Check whether there is a per-user
process limit on your cluster. You could also try running 4 copies of the
following and see whether they all complete or some get killed:

        awk 'BEGIN {for(i=0;i<100000000;i++)for(j=0;j<100000000;j++);}'
        echo "Process completed"

This will very quickly tell you if you can run 4 jobs concurrently.
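
As a rough sketch of that test, the four copies could be started in the
background from one script and their exit statuses collected. This is
only an illustration: the loop bound is shortened so it finishes in
seconds, the variable names (ncopies, pids, failed) are my own, and it
only exercises the node it is started on, so it would still need to be
run on each node or submitted through the queue.

```shell
#!/bin/sh
# Sketch: run 4 CPU-bound copies at once and report how each one ended.
# The loop bound is much shorter than in the one-liner above so that
# the check itself finishes quickly.
ncopies=4
pids=""
i=1
while [ "$i" -le "$ncopies" ]; do
    # Start one busy-loop copy in the background and remember its PID.
    awk 'BEGIN {for(i=0;i<1000000;i++);}' &
    pids="$pids $!"
    i=$((i + 1))
done

failed=0
for pid in $pids; do
    # wait returns the exit status of that particular background job.
    if wait "$pid"; then
        echo "Process $pid completed"
    else
        echo "Process $pid was killed or failed (status $?)"
        failed=$((failed + 1))
    fi
done
echo "$failed of $ncopies copies did not complete"
```

If every copy reports "completed" when run by hand but some are killed
when the same script goes through the batch queue, that points at the
queue or node configuration rather than at the program itself.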

Note that, in my experience, using PBS for MPI jobs can be a real pain. Is
there a facility for you to run the jobs without submitting them via a
PBS batch queue?

All the best
Ross.

/\
\/
|\oss Walker

| Imperial College of Science, Technology & Medicine |
| Department of Chemistry | Theoretical Division |
| Tel:- +44 20 759(45851) |
| EMail:- http://www.rosswalker.co.uk/ |
| PGP Key available on request |
Received on Mon Sep 16 2002 - 07:46:45 PDT