Re: [AMBER] pmemd.cuda.MPI not running well with SGE

From: Chris Neale <candrewn.gmail.com>
Date: Wed, 18 Jan 2017 11:06:59 -0700

Within our queueing system I have to set the CPU affinity explicitly in
order to run multiple concurrent jobs on the same node. The code is
below; it can be modified to set the CPU affinity for a single mpirun
call per node if that is all you need (a sketch of that simpler variant
follows the code). We're using Open MPI.

{
  # Job A: GPUs 0 and 1, ranks pinned to cores 0 and 1 of socket 0
  export CUDA_VISIBLE_DEVICES=0,1
  {
    echo "rank 0=localhost slot=0:0"
    echo "rank 1=localhost slot=0:1"
  } > my.rankfile.A
  mpirun --report-bindings --rankfile my.rankfile.A -np 2 \
    ${AMBERHOME}/bin/pmemd.cuda.MPI -i $amdp -o ${athis}.out -p this.prmtop \
    -c ${aprev}.rst -r ${athis}.rst -x ${athis}.mdcrd -inf ${athis}.info \
    -l ${athis}.log
} &

{
  # Job B: GPUs 2 and 3, ranks pinned to cores 2 and 3 of socket 0
  export CUDA_VISIBLE_DEVICES=2,3
  {
    echo "rank 0=localhost slot=0:2"
    echo "rank 1=localhost slot=0:3"
  } > my.rankfile.B
  mpirun --report-bindings --rankfile my.rankfile.B -np 2 \
    ${AMBERHOME}/bin/pmemd.cuda.MPI -i $bmdp -o ${bthis}.out -p this.prmtop \
    -c ${bprev}.rst -r ${bthis}.rst -x ${bthis}.mdcrd -inf ${bthis}.info \
    -l ${bthis}.log
} &

wait
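
If you only need a single mpirun call per node, a minimal variant might
look like the following (a sketch under the same assumptions as above:
Open MPI with rankfile support; the input/output file names here are
placeholders):

export CUDA_VISIBLE_DEVICES=0,1
{
  echo "rank 0=localhost slot=0:0"
  echo "rank 1=localhost slot=0:1"
} > my.rankfile
# Pin the two ranks to distinct cores so they do not contend for one core
mpirun --report-bindings --rankfile my.rankfile -np 2 \
  ${AMBERHOME}/bin/pmemd.cuda.MPI -O -i md.in -o md.out -p md.prmtop \
  -c prev.rst -r md.rst -x md.mdcrd -inf md.info -l md.log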

Hope it helps.
Chris.

On Tue, Jan 17, 2017 at 7:07 AM, Daniel Roe <daniel.r.roe.gmail.com> wrote:

> On Tue, Jan 17, 2017 at 6:22 AM, Wang, Yin <Yin.Wang.uibk.ac.at> wrote:
> > We tested a system with 166K atoms; for a 1-GPU job with “pmemd.cuda”
> > we got ~13 ns/day.
> >
> > When we tested the same system on 2 GPUs with “mpirun -np 2
> > pmemd.cuda.MPI -O”, we ran into a problem.
> >
> > (1) If we run the command directly on the compute node, without the
> > SGE queuing system, we get ~20 ns/day.
> >
> > (2) If we submit the same 2-GPU command through our SGE queuing
> > system, we get ~5 ns/day.
>
> Since you can run just fine outside the queuing system, this is a
> problem with your queuing system, not pmemd. My guess is that process
> affinity is not being set correctly, so both MPI ranks end up hammering
> the same CPU core.
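>
> One quick way to check is to compare what the processes are actually
> bound to inside and outside of SGE (a sketch, assuming Open MPI and
> standard Linux tools; the pmemd arguments are placeholders):
>
> # Ask Open MPI to print each rank's binding at startup
> mpirun --report-bindings -np 2 ${AMBERHOME}/bin/pmemd.cuda.MPI -O \
>   -i md.in -o md.out -p md.prmtop -c prev.rst
>
> # While the job runs, print the allowed-CPU list of each rank
> for p in $(pgrep -f pmemd.cuda.MPI); do taskset -cp $p; done
>
> If both ranks report the same single core under SGE, that is exactly
> the contention described above.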
>
> -Dan
>
> >
> > In both cases, we are sure we have “Peer to Peer support: ENABLED” in
> > both output files.
> >
> > The differences are in the timings section:
> >
> > In the first case,
> >
> > | Routine         Sec       %
> > | ------------------------------
> > | DataDistrib    0.03    0.06
> > | Nonbond       36.62   83.68
> > | Bond           0.00    0.00
> > | Angle          0.00    0.00
> > | Dihedral       0.00    0.00
> > | Shake          0.08    0.18
> > | RunMD          7.02   16.05
> > | Other          0.01    0.03
> > | ------------------------------
> > | Total         43.76
> >
> > In the second case,
> >
> > | Routine         Sec       %
> > | ------------------------------
> > | DataDistrib   27.04   27.21
> > | Nonbond       66.06   66.49
> > | Bond           0.00    0.00
> > | Angle          0.00    0.00
> > | Dihedral       0.00    0.00
> > | Shake          0.04    0.04
> > | RunMD          6.21    6.24
> > | Other          0.01    0.01
> > | ------------------------------
> > | Total         99.36
> >
> > Kind Regards,
> >
> > Yin Wang
> >
> > Theoretical Chemistry
> > Leopold-Franzens-Universität Innsbruck
> > Innrain 82, 6020 Innsbruck, Austria
> >
>
> --
> -------------------------
> Daniel R. Roe
> Laboratory of Computational Biology
> National Institutes of Health, NHLBI
> 5635 Fishers Ln, Rm T900
> Rockville MD, 20852
> https://www.lobos.nih.gov/lcb
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Jan 18 2017 - 10:30:03 PST