Beowulf killing jobs

From: Peter Gannett <pgannett_at_hsc.wvu.edu>
Date: Fri 13 Sep 2002 17:03:19 -0400

Dear Amber users:

I am just beginning to run jobs on a new Beowulf cluster and am having a strange problem. Jobs using 1 or 2 CPUs (1 CPU/node on 1 or 2 nodes, or 2 CPUs/node on 1 node) run just fine. But if I try to run with 4 CPUs (either 1 CPU/node on 4 nodes or 2 CPUs/node on 2 nodes), my jobs get randomly killed and I get an error message:

[pgannett_at_energy b_nomod_ss_prod]$ cat sample_1ppn_4no.e1086
=>> PBS: job killed: node 3 (node2) requested job die, code 1099
Killed by signal 15.
Killed by signal 15.
Killed by signal 15.

I did not kill the job myself. If it helps, I am running jobs under the PBS scheduling system (qsub).

Has anyone had this problem? My sysadmin is not being very helpful and claims there must be something in my code (sander, version 7) doing this.

Thanks.
Pete Gannett
Received on Fri Sep 13 2002 - 14:03:19 PDT