[AMBER] PMEMD.CUDA.MPI PME tests/Benchmarks

From: Martin Peters <martin.b.peters.me.com>
Date: Fri, 3 Feb 2012 15:08:46 +0000

Hi,

We have just upgraded our Stoney cluster here at ICHEC with a number of M2090s, and I'd like to do some scaling tests using PMEMD (24 nodes with 2 cards each).
The cluster has 64 nodes, each with two 2.8 GHz Intel Xeon X5560 (Nehalem EP) quad-core processors and 48 GB of RAM.
The nodes are interconnected via a half-blocking fat-tree ConnectX InfiniBand (DDR) network.

Amber 11/AmberTools 1.5 (all patches up to 17 Jan 2012 applied) was built using the following: Intel C/Fortran 2011.1.107, MVAPICH2 v1.5.1, MKL v10.2.6.038, and CUDA v4.0. All compilation steps completed without any problems.
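For reference, the build went roughly as follows (a sketch from memory; the exact configure flags and make targets for Amber 11 + AmberTools 1.5 may differ slightly from this):

# parallel CUDA build (pmemd.cuda.MPI)
cd $AMBERHOME/AmberTools/src
./configure -cuda -mpi intel
cd ../../src
make clean && make cuda_parallel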

The GB/nucleosome test (downloaded from http://ambermd.org/gpus/benchmarks.htm) scales all the way up to 24 nodes/48 M2090s. However, I'm running into a few problems with the PME cases.

I'm executing the program with commands similar to the following (this example is the 1-node, 2-GPU case):
for i in {1..1}; do cat $PBS_NODEFILE | uniq >> hosts.1.2; done; mpirun_rsh -n 2 -hostfile hosts.1.2 $AMBERHOME/exe/pmemd.cuda.MPI -O -i mdin -p prmtop -c inpcrd -ref inpcrd -suffix parallelCuda.n1.g2
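For the multi-node runs I extend the same scheme; e.g. for 2 nodes with 4 GPUs the launch looks something like this (a sketch; the loop bound controls how many MPI ranks are placed per node, and the hosts.2.2 / suffix names are just my naming convention):

# list each node twice so 2 MPI ranks (one per GPU) land on every node
for i in {1..2}; do cat $PBS_NODEFILE | uniq >> hosts.2.2; done
mpirun_rsh -n 4 -hostfile hosts.2.2 $AMBERHOME/exe/pmemd.cuda.MPI -O -i mdin -p prmtop -c inpcrd -ref inpcrd -suffix parallelCuda.n2.g2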

Here is the output from FactorIX_production_NVE benchmark using 1 node with two M2090s:
|------------------- GPU DEVICE INFO --------------------
|
| Task ID: 0
| CUDA Capable Devices Detected: 2
| CUDA Device ID in use: 0
| CUDA Device Name: Tesla M2090
| CUDA Device Global Mem Size: 5375 MB
| CUDA Device Num Multiprocessors: 16
| CUDA Device Core Freq: 1.30 GHz
|
|
| Task ID: 1
| CUDA Capable Devices Detected: 2
| CUDA Device ID in use: 1
| CUDA Device Name: Tesla M2090
| CUDA Device Global Mem Size: 5375 MB
| CUDA Device Num Multiprocessors: 16
| CUDA Device Core Freq: 1.30 GHz
|
|--------------------------------------------------------
....
| Nonbonded Pairs Initial Allocation: 10363284
....
| GPU memory information:
| KB of GPU memory in use: 304037
| KB of CPU memory in use: 73726
 
| Running AMBER/MPI version on 2 nodes

However, when using 2 nodes + 4 M2090s:
.....
|------------------- GPU DEVICE INFO --------------------
|
| Task ID: 0
| CUDA Capable Devices Detected: 2
| CUDA Device ID in use: 0
| CUDA Device Name: Tesla M2090
| CUDA Device Global Mem Size: 5375 MB
| CUDA Device Num Multiprocessors: 16
| CUDA Device Core Freq: 1.30 GHz
|
|
| Task ID: 1
| CUDA Capable Devices Detected: 2
| CUDA Device ID in use: 0
| CUDA Device Name: Tesla M2090
| CUDA Device Global Mem Size: 5375 MB
| CUDA Device Num Multiprocessors: 16
| CUDA Device Core Freq: 1.30 GHz
|
|
| Task ID: 2
| CUDA Capable Devices Detected: 2
| CUDA Device ID in use: 1
| CUDA Device Name: Tesla M2090
| CUDA Device Global Mem Size: 5375 MB
| CUDA Device Num Multiprocessors: 16
| CUDA Device Core Freq: 1.30 GHz
|
|
| Task ID: 3
| CUDA Capable Devices Detected: 2
| CUDA Device ID in use: 1
| CUDA Device Name: Tesla M2090
| CUDA Device Global Mem Size: 5375 MB
| CUDA Device Num Multiprocessors: 16
| CUDA Device Core Freq: 1.30 GHz
|
|--------------------------------------------------------
.....
--------------------------------------------------------------------------------
   3. ATOMIC COORDINATES AND VELOCITIES
--------------------------------------------------------------------------------

factor IX (ACTIVATED PROTEIN)
 begin time read from input coords = 2542.675 ps


 Number of triangulated 3-point waters found: 28358

     Sum of charges from parm topology file = 0.00031225
     Forcing neutrality...

I get the following out-of-memory error:
cudaMalloc GpuBuffer::Allocate failed out of memory
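(If it helps, I can double-check where the ranks actually land by reusing the same launcher and hostfile with a trivial payload; 'hostname' here is just a stand-in diagnostic, not part of the benchmark:)

# print the host each of the 4 ranks is placed on
mpirun_rsh -n 4 -hostfile hosts.2.2 hostname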

Going back a step and rerunning the parallel_cuda tests, I've noticed the following:
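(I invoke the test suite roughly as follows; a sketch, assuming the usual DO_PARALLEL convention for the Amber test harness, with the rank count matching the 4 tasks seen in the output below:)

cd $AMBERHOME/test
export DO_PARALLEL='mpirun_rsh -n 4 -hostfile hosts.2.2'
make test.parallel.cuda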

------------------------------------
Running CUDA Explicit solvent tests.
  Precision Model = SPDP
           GPU_ID = -1
------------------------------------
cd 4096wat/ && ./Run.pure_wat -1 SPDP netcdf.mod
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
 pmemd.cuda_SPDP.M 000000000065286A Unknown Unknown Unknown
pmemd.cuda_SPDP.M 0000000000708348 Unknown Unknown Unknown
pmemd.cuda_SPDP.M 00000000006C55C6 Unknown Unknown Unknown
pmemd.cuda_SPDP.M 0000000000669350 Unknown Unknown Unknown
pmemd.cuda_SPDP.M 000000000051E3CD Unknown Unknown Unknown
pmemd.cuda_SPDP.M 0000000000501DC8 Unknown Unknown Unknown
pmemd.cuda_SPDP.M 00000000004DF51D Unknown Unknown Unknown
pmemd.cuda_SPDP.M 000000000040A31C Unknown Unknown Unknown
libc.so.6 0000003F3601D974 Unknown Unknown Unknown
pmemd.cuda_SPDP.M 000000000040A229 Unknown Unknown Unknown
[a second rank prints an identical 'forrtl: severe (174): SIGSEGV' traceback]
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
 
libmlx4-rdmav2.so 00002AFB541D0D17 Unknown Unknown Unknown
pmemd.cuda_SPDP.M 00000000006979B9 Unknown Unknown Unknown
pmemd.cuda_SPDP.M 0000000000677D5D Unknown Unknown Unknown
pmemd.cuda_SPDP.M 00000000006765FF Unknown Unknown Unknown
pmemd.cuda_SPDP.M 000000000065DDF3 Unknown Unknown Unknown
pmemd.cuda_SPDP.M 00000000006C9ED4 Unknown Unknown Unknown
pmemd.cuda_SPDP.M 00000000006C9311 Unknown Unknown Unknown
pmemd.cuda_SPDP.M 000000000066050D Unknown Unknown Unknown
pmemd.cuda_SPDP.M 000000000065FB2E Unknown Unknown Unknown
pmemd.cuda_SPDP.M 00000000007093DD Unknown Unknown Unknown
pmemd.cuda_SPDP.M 0000000000706EA9 Unknown Unknown Unknown
pmemd.cuda_SPDP.M 00000000006C552A Unknown Unknown Unknown
pmemd.cuda_SPDP.M 0000000000669730 Unknown Unknown Unknown
pmemd.cuda_SPDP.M 00000000005207E5 Unknown Unknown Unknown
pmemd.cuda_SPDP.M 0000000000501DC8 Unknown Unknown Unknown
pmemd.cuda_SPDP.M 00000000004DF51D Unknown Unknown Unknown
pmemd.cuda_SPDP.M 000000000040A31C Unknown Unknown Unknown
libc.so.6 0000003F3601D974 Unknown Unknown Unknown
pmemd.cuda_SPDP.M 000000000040A229 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
mpiexec_raw: Warning: tasks 0-1 exited with status 174.
mpiexec_raw: Warning: tasks 2-3 exited with status 1.
  ./Run.pure_wat: Program error
make[1]: *** [test.pmemd.cuda.pme] Error 1
make[1]: Target `test.pmemd.cuda.MPI' not remade because of errors.
make[1]: Leaving directory `/ichec/packages/amber/11/test/cuda'
make: *** [test.pmemd.cuda.MPI] Error 2
make: Target `test.parallel.cuda' not remade because of errors.
6 file comparisons passed
0 file comparisons failed
7 tests experienced errors

Am I missing something obvious? Could any of the AMBER users shed some light on the issue I'm having?

Thanks in advance,
Martin



_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Feb 03 2012 - 07:30:02 PST