Solved. The issue appears to be with CPU core binding. If I run mpirun with --report-bindings, two separate single-GPU jobs end up bound to the same physical core:
[node012:03755] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../..][../../../../../..]
[node012:03754] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../..][../../../../../..]
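A quick way to cross-check the overlap outside of mpirun (assuming a standard Linux procps ps; the PSR column is the core each process last ran on):

$ ps -o pid,psr,comm -C pmemd.cuda.MPI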
The best solution seems to be a rankfile. With it, each of 4 separate jobs now runs at 115 ns/day (which is full speed for a single 1-GPU job per node for this system).
$ export CUDA_VISIBLE_DEVICES=$i
$ echo "rank 0=localhost slot=0:$i" > my.rankfile.$i
$ mpirun --report-bindings --rankfile my.rankfile.$i -np 1 pmemd.cuda.MPI ...
[node012:05275] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../..][../../../../../..]
[node012:05274] MCW rank 0 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../..][../../../../../..]
[node012:05273] MCW rank 0 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../..][../../../../../..]
[node012:05276] MCW rank 0 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../..][../../../../../..]
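On a multi-socket node it may also be worth checking which CPUs are local to each GPU before writing the rankfile; with a recent NVIDIA driver the GPU/CPU affinity matrix can be printed with:

$ nvidia-smi topo -m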
### The test was like this:
INP=MD1
for i in 0 1 2 3; do
{
  # give this job GPU $i only, and pin its single MPI rank to core $i on socket 0
  export CUDA_VISIBLE_DEVICES=$i
  NAM=cont${i}
  echo "rank 0=localhost slot=0:$i" > my.rankfile.$i
  mpirun --report-bindings --rankfile my.rankfile.$i -np 1 \
    ${AMBERHOME}/bin/pmemd.cuda.MPI -O -i md_restart.in -o ${NAM}.out \
    -p this.prmtop -c ${INP}.rst -r ${NAM}.rst -x ${NAM}.mdcrd \
    -inf ${NAM}.info -l ${NAM}.log
} &
done
wait   # block until all four background jobs have finished
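If the GPUs were split across the two sockets, the same rankfile trick would just need socket-aware slots. A minimal sketch, assuming GPUs 0-1 sit on socket 0 and GPUs 2-3 on socket 1 (verify the actual layout with nvidia-smi topo -m before relying on this):

  # sketch only: map GPU $i to socket i/2, core i%2 within that socket
  echo "rank 0=localhost slot=$((i/2)):$((i%2))" > my.rankfile.$i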