Hello Ross,
I'm working with Nick on this problem and can try to fill in some details.
Firstly, by "single pmemd job" we mean a multiprocessor pmemd job running
alone on the cluster, not a serial pmemd job. Our problems arise when we try
to run two (multiprocessor) pmemd jobs on the cluster at the same time. To
give an example: if we submit a single 16-CPU pmemd job via SGE we get
reasonable pmemd performance. Only when we simultaneously submit a second
16-CPU pmemd job (the cluster has more than 32 CPUs) do the problems start.
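For reference, the submission script is essentially a wrapper around the
mpirun_rsh line Nick quotes below. A minimal sketch, with the parallel
environment name "mpi" and the input/output file names as placeholders:

  #!/bin/bash
  #$ -S /bin/bash
  #$ -cwd
  #$ -j y
  #$ -pe mpi 16     # request 16 slots; "mpi" is a placeholder PE name
  # $NSLOTS and $TMPDIR/machines are provided by SGE and the PE start script
  $MPIHOME/bin/mpirun_rsh -np $NSLOTS -hostfile $TMPDIR/machines \
      $AMBERHOME/pmemd -O -i md.in -c min.rst -o md.out -p prmtop -r md.rst -x md.trj
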
Secondly, we don't see any error messages: the pmemd output files are there
and look normal, and the SGE logfiles don't report any problems either. As
Nick said, what happens is that pmemd jobs disappear from the queueing
system, but continue to run on the compute nodes.
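To be concrete about "continue to run": if we log in to one of the allocated
nodes after the job has vanished from qstat, the pmemd processes are still
there and still accumulating CPU time; something like

  ssh compute-0-3 'pgrep -l pmemd'   # compute-0-3 is just an example node name

still lists them.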
To add some specific information: we have used both "round-robin" and
"fill-up" allocation/scheduling rules under SGE. With "fill-up" we
sporadically (and, at the moment, non-reproducibly!) see the issue described
above. With "round-robin" we additionally notice a drastic slow-down -- jobs
running side-by-side with another complete an order of magnitude fewer
timesteps per unit walltime than a jobs running alone on the cluster. For
both allocation rules, the SGE delete command "qdel" removes the job from
the queue but it persists on the compute node.
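In case it is relevant, our parallel environment looks roughly like the
following (as reported by qconf -sp; the PE name, slot count and script paths
here are approximate placeholders rather than an exact copy of our setup):

  pe_name           mpi
  slots             64
  user_lists        NONE
  xuser_lists       NONE
  start_proc_args   /opt/gridengine/mpi/startmpi.sh $pe_hostfile
  stop_proc_args    /opt/gridengine/mpi/stopmpi.sh
  allocation_rule   $fill_up
  control_slaves    FALSE
  job_is_first_task TRUE
  urgency_slots     min

We switch allocation_rule between $fill_up and $round_robin to get the two
scheduling behaviours described above. One thing we are unsure about is the
control_slaves / tight-integration setting: since mpirun_rsh starts the remote
pmemd processes over rsh/ssh rather than through SGE, we wonder whether that
is what lets them escape both qdel and SGE's accounting.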
If anyone has seen anything like this and can direct us to the source of the
problem, we'd be very grateful,
Many thanks,
Frank.
On Sat, Mar 28, 2009 at 2:04 AM, Ross Walker <ross.rosswalker.co.uk> wrote:
> Hi Nick,
>
> There really isn't enough information in here to be able to tell what is
> going on. Do you get any type of error message? Do you see an output file?
> What about the log files produced by the queuing system? Do they tell you
> anything? Normally stderr will have been redirected somewhere and you would
> need to find this to see what was said. There are a number of problems that
> could be occurring, including file permission / path problems if all nodes
> don't share the same filesystem, problems with shared libraries due to
> environment variables not being exported correctly, stack limitation issues
> causing segfaults, insufficient memory, and so on. Clues to which of these it
> is will be in the log file.
>
> Note, you say you can launch single pmemd jobs but don't explain this. The
> parallel version of pmemd can only run on 2 CPUs or more. Did you compile
> a serial version as well? Is this what you mean by single pmemd jobs?
>
> All the best
> Ross
>
> > -----Original Message-----
> > From: amber-bounces.ambermd.org [mailto:amber-bounces.ambermd.org] On
> > Behalf Of Nick Holway
> > Sent: Friday, March 27, 2009 8:56 AM
> > To: amber.ambermd.org
> > Subject: [AMBER] PMEMD 9 on MVAPICH / Infiniband problem
> >
> > Dear all.
> >
> > We've compiled PMEMD 9 using ifort 10, MVAPICH2 1.2 and OFED 1.4 on
> > 64bit Rocks 5.1 (ie Centos 5.2 and SGE 6.1u5). I'm able to launch
> > single pmemd jobs via qsub using mpirun_rsh and they run well. The
> > problem we see when two jobs are launched at once is that some of the
> > jobs disappear from qstat in SGE yet continue to run indefinitely.
> >
> > I'm calling PMEMD with this line - $MPIHOME/bin/mpirun_rsh -np $NSLOTS
> > -hostfile $TMPDIR/machines $AMBERHOME/pmemd -O -i xxxx.inp -c
> > xxxx_min.rest -o xxxx.out -p xxxx.top -r xxxx_eqt.rest -x xxxx.trj
> >
> > Does anyone know what I've got to do to make the PMEMD jobs run properly?
> >
> > Thanks for any help.
> >
> > Nick
> >
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Apr 01 2009 - 01:13:57 PDT