Solved. The issue appears to be with CPU core binding. If I run mpirun with --report-bindings, two separate single-GPU jobs end up bound to the same physical core:
[node012:03755] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../..][../../../../../..]
[node012:03754] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../..][../../../../../..]
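A quick way to cross-check the overlap outside of mpirun (assuming a standard Linux procps ps; the PSR column is the core each process last ran on):

$ ps -o pid,psr,comm -C pmemd.cuda.MPI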
The best solution seems to be a rankfile. With it, each of 4 separate jobs now runs at 115 ns/day (which is full speed for a single 1-GPU job per node for this system).
$ export CUDA_VISIBLE_DEVICES=$i
$ echo "rank 0=localhost slot=0:$i" > my.rankfile.$i
$ mpirun --report-bindings --rankfile my.rankfile.$i -np 1 pmemd.cuda.MPI ...
[node012:05275] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../..][../../../../../..]
[node012:05274] MCW rank 0 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../..][../../../../../..]
[node012:05273] MCW rank 0 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../..][../../../../../..]
[node012:05276] MCW rank 0 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../..][../../../../../..]
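On a multi-socket node it may also be worth checking which CPUs are local to each GPU before writing the rankfile; with a recent NVIDIA driver the GPU/CPU affinity matrix can be printed with:

$ nvidia-smi topo -m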
### The test was like this:
INP=MD1
for i in 0 1 2 3; do
{
  # give this job GPU $i only, and pin its single MPI rank to core $i on socket 0
  export CUDA_VISIBLE_DEVICES=$i
  NAM=cont${i}
  echo "rank 0=localhost slot=0:$i" > my.rankfile.$i
  mpirun --report-bindings --rankfile my.rankfile.$i -np 1 \
    ${AMBERHOME}/bin/pmemd.cuda.MPI -O -i md_restart.in -o ${NAM}.out \
    -p this.prmtop -c ${INP}.rst -r ${NAM}.rst -x ${NAM}.mdcrd \
    -inf ${NAM}.info -l ${NAM}.log
} &
done
wait   # block until all four background jobs have finished
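If the GPUs were split across the two sockets, the same rankfile trick would just need socket-aware slots. A minimal sketch, assuming GPUs 0-1 sit on socket 0 and GPUs 2-3 on socket 1 (verify the actual layout with nvidia-smi topo -m before relying on this):

  # sketch only: map GPU $i to socket i/2, core i%2 within that socket
  echo "rank 0=localhost slot=$((i/2)):$((i%2))" > my.rankfile.$i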