[AMBER] multiple cpus per single gpu job from Meij, Henk on 2015-07-30 (Amber Archive Jul 2015)

From: Meij, Henk <hmeij.wesleyan.edu>
Date: Thu, 30 Jul 2015 15:11:13 +0000

Hi, I know nothing about Amber, but I am observing the following while trying to help out on our cluster. A user compiled Amber14 and is running it on nodes that each have 4 K20 GPUs and 32 CPU cores (hyper-threaded), under CentOS 6.5 (nvidia 5.5).

The scheduler shows 4 jobs on node n36, each invoking pmemd.cuda.MPI, which start up on the allocated GPUs. However, when I query the CPU processes, pmemd.cuda.MPI appears to have forked itself multiple times (3x in this case). That implies that in our environment the user should request cpu=3, gpu=1 via the scheduler instead of the cpu=1, gpu=1 we normally use (Amber12).

Is this expected? And what controls it? We would like to know beforehand what the scheduler should allocate per GPU.

-Henk
PS: I can get more information from the user if specific Amber routines cause this.

[root.sharptail homedirs]# bjobs -u blakhani -m n36
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
484276 blakhani RUN mwgpu n34 n36 /home/blakhani/Manju/MUTS-FILES/MUTS-Mutation/GLN468ALA Jul 29 15:04
484808 blakhani RUN mwgpu n34 n36 /home/blakhani/Manju/MUTS-FILES/MUTS-Mutation/GLU500ALA Jul 29 17:55
484811 blakhani RUN mwgpu n34 n36 /home/blakhani/Manju/MUTS-FILES/MUTS-Mutation/GLU699ALA Jul 29 18:34
484838 blakhani RUN mwgpu n34 n36 /home/blakhani/Manju/MUTS-FILES/MUTS-Mutation/SER151ALA Jul 30 00:39

[root.sharptail homedirs]# ssh n36 top -u blakhani -b -n 1 | grep pmem
4976 blakhani 20 0 272g 484m 98m R 25.9 0.2 103:19.14 pmemd.cuda.MPI
9865 blakhani 20 0 272g 479m 98m R 24.1 0.2 18:33.71 pmemd.cuda.MPI
11477 blakhani 20 0 272g 476m 98m R 24.1 0.2 17:18.74 pmemd.cuda.MPI
18148 blakhani 20 0 272g 484m 98m R 24.1 0.2 74:38.58 pmemd.cuda.MPI

[root.sharptail homedirs]# ssh n36 ps -L 4976
PID LWP TTY STAT TIME COMMAND
4976 4976 ? RLl 103:14 pmemd.cuda.MPI -O -i 1NNE-dna-atp-mg_equil.in -o 1NNE-dna-atp-mg_equil.1.out -p 1NNE-dna-atp-mg.prmtop -c 1NNE-dna-atp-mg_heat.rst -r 1NNE-dna-atp-mg_equil.1.rst -x 1NNE-dna-atp-mg_equil.1.mdcrd
4976 4977 ? SLl 0:00 pmemd.cuda.MPI -O -i 1NNE-dna-atp-mg_equil.in -o 1NNE-dna-atp-mg_equil.1.out -p 1NNE-dna-atp-mg.prmtop -c 1NNE-dna-atp-mg_heat.rst -r 1NNE-dna-atp-mg_equil.1.rst -x 1NNE-dna-atp-mg_equil.1.mdcrd
4976 6260 ? SLl 0:07 pmemd.cuda.MPI -O -i 1NNE-dna-atp-mg_equil.in -o 1NNE-dna-atp-mg_equil.1.out -p 1NNE-dna-atp-mg.prmtop -c 1NNE-dna-atp-mg_heat.rst -r 1NNE-dna-atp-mg_equil.1.rst -x 1NNE-dna-atp-mg_equil.1.mdcrd

[root.sharptail homedirs]# ssh n36 ps -L 11477
PID LWP TTY STAT TIME COMMAND
11477 11477 ? RLl 17:22 pmemd.cuda.MPI -O -i 1NNE-dna-atp-mg_equil.in -o 1NNE-dna-atp-mg_equil.2.out -p 1NNE-dna-atp-mg.prmtop -c 1NNE-dna-atp-mg_equil.1.rst -r 1NNE-dna-atp-mg_equil.2.rst -x 1NNE-dna-atp-mg_equil.2.mdcrd
11477 11478 ? SLl 0:00 pmemd.cuda.MPI -O -i 1NNE-dna-atp-mg_equil.in -o 1NNE-dna-atp-mg_equil.2.out -p 1NNE-dna-atp-mg.prmtop -c 1NNE-dna-atp-mg_equil.1.rst -r 1NNE-dna-atp-mg_equil.2.rst -x 1NNE-dna-atp-mg_equil.2.mdcrd
11477 12728 ? SLl 0:01 pmemd.cuda.MPI -O -i 1NNE-dna-atp-mg_equil.in -o 1NNE-dna-atp-mg_equil.2.out -p 1NNE-dna-atp-mg.prmtop -c 1NNE-dna-atp-mg_equil.1.rst -r 1NNE-dna-atp-mg_equil.2.rst -x 1NNE-dna-atp-mg_equil.2.mdcrd
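
For reference, ps -L lists one line per thread (the LWP column) under a shared PID, so the three lines per job above are threads of a single process rather than separately forked processes. A minimal sketch for summarizing that output is below. It assumes the column layout shown above (PID LWP TTY STAT TIME COMMAND); count_lwp is a hypothetical helper name, not part of Amber or procps.

```shell
# Summarize `ps -L`-style output read from stdin: for each PID, print the
# number of threads (LWPs) and how many of them are in the running (R) state.
# Assumes the column order PID LWP TTY STAT TIME COMMAND, as in the listing above.
# count_lwp is a hypothetical helper name used only for this illustration.
count_lwp() {
    awk '$1 ~ /^[0-9]+$/ {                 # skip the header line
             total[$1]++                   # one input line per thread
             if ($4 ~ /^R/) running[$1]++  # STAT beginning with R = running
         }
         END {
             for (pid in total)
                 printf "%s threads=%d running=%d\n", pid, total[pid], running[pid]
         }'
}

# Example (not executed here): ssh n36 ps -L 4976 | count_lwp
```

In the listings above, only the first thread of each PID is in state R and has accumulated significant CPU time (103:14 versus 0:00 and 0:07 for PID 4976); whether that justifies cpu=1 per GPU is the open question.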

[root.sharptail homedirs]# ssh n36 gpu-info
====================================================
Device Model Temperature Utilization
====================================================
0 Tesla K20m 38 C 38 %
1 Tesla K20m 39 C 41 %
2 Tesla K20m 37 C 0 %
3 Tesla K20m 36 C 39 %
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jul 30 2015 - 08:30:02 PDT