Hi Martin,
I don't mean to burst your bubble, but with a DDR interconnect, and not even
full bisection bandwidth at that, it is extremely unlikely that you will get
any PME runs to scale across GPUs on multiple nodes. There simply isn't
enough interconnect bandwidth, unfortunately.
With regard to the segfault: assuming you are running with all the latest
bugfixes applied, try setting the following on ALL nodes:
export CUDA_NIC_INTEROP=1
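With MVAPICH2's mpirun_rsh you can also pass it on the launch line so every
rank inherits it (a minimal sketch, reusing the hostfile name from your mail;
adjust -n and the input files to your run):

  export CUDA_NIC_INTEROP=1
  # Environment variables listed before the executable are forwarded to all ranks.
  mpirun_rsh -n 2 -hostfile hosts.1.2 CUDA_NIC_INTEROP=1 \
      $AMBERHOME/exe/pmemd.cuda.MPI -O -i mdin -p prmtop -c inpcrd

Adding the export to the shell startup file or job-script prologue that runs
on every compute node works just as well.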
All the best
Ross
> -----Original Message-----
> From: Martin Peters [mailto:martin.b.peters.me.com]
> Sent: Friday, February 03, 2012 7:09 AM
> To: AMBER Mailing List
> Subject: [AMBER] PMEMD.CUDA.MPI PME tests/Benchmarks
>
> Hi,
>
> We have just upgraded our stoney cluster here at ICHEC with a number of
> M2090s and I'd like to do some scaling tests using PMEMD (24 nodes with
> 2 cards each).
> The cluster has 64 nodes, each with two 2.8 GHz quad-core Intel Xeon X5560
> (Nehalem EP) processors and 48 GB of RAM.
> The nodes are interconnected via a half-blocking fat-tree ConnectX
> InfiniBand (DDR) network.
>
> Amber 11/AmberTools 1.5 (with all patches up to 17 Jan 2012 applied) was
> built using the following: Intel C/Fortran 2011.1.107, MVAPICH2 1.5.1,
> MKL 10.2.6.038, and CUDA 4.0. All compilation steps completed without any
> problems.
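>
> For completeness, the build followed what I believe is the standard AMBER 11
> GPU/MPI recipe, roughly (with the MVAPICH2 compiler wrappers in my PATH):
>
>   cd $AMBERHOME/AmberTools/src
>   ./configure -cuda -mpi intel
>   cd $AMBERHOME/src && make cuda_parallel
>
> so the binaries used below are the default SPDP precision model.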
>
> The GB/nucleosome test (downloaded from
> http://ambermd.org/gpus/benchmarks.htm) scales all the way up to 24
> nodes/48 M2090s. However, I'm running into a few problems with the PME
> cases.
>
> I'm executing the program using commands similar to the following:
> for i in {1..1}; do cat $PBS_NODEFILE | uniq >> hosts.1.2; done;
> mpirun_rsh -n 2 -hostfile hosts.1.2 $AMBERHOME/exe/pmemd.cuda.MPI -O -i
> mdin -p prmtop -c inpcrd -ref inpcrd -suffix parallelCuda.n1.g2
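>
> For the multi-node runs I do the equivalent with -n set to the total GPU
> count; e.g. for the 2-node/4-GPU case it is something like (the hostfile and
> suffix names are just my convention):
>
>   cat $PBS_NODEFILE | uniq > hosts.2.4
>   mpirun_rsh -n 4 -hostfile hosts.2.4 $AMBERHOME/exe/pmemd.cuda.MPI -O \
>     -i mdin -p prmtop -c inpcrd -ref inpcrd -suffix parallelCuda.n2.g4
>
> (mpirun_rsh then cycles through the listed hosts, so two ranks land on each
> node.)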
>
> Here is the output from the FactorIX_production_NVE benchmark using 1 node
> with two M2090s:
> |------------------- GPU DEVICE INFO --------------------
> |
> | Task ID: 0
> | CUDA Capable Devices Detected: 2
> | CUDA Device ID in use: 0
> | CUDA Device Name: Tesla M2090
> | CUDA Device Global Mem Size: 5375 MB
> | CUDA Device Num Multiprocessors: 16
> | CUDA Device Core Freq: 1.30 GHz
> |
> |
> | Task ID: 1
> | CUDA Capable Devices Detected: 2
> | CUDA Device ID in use: 1
> | CUDA Device Name: Tesla M2090
> | CUDA Device Global Mem Size: 5375 MB
> | CUDA Device Num Multiprocessors: 16
> | CUDA Device Core Freq: 1.30 GHz
> |
> |--------------------------------------------------------
> ....
> | Nonbonded Pairs Initial Allocation: 10363284
> ....
> | GPU memory information:
> | KB of GPU memory in use: 304037
> | KB of CPU memory in use: 73726
>
> | Running AMBER/MPI version on 2 nodes
>
> However, when using 2 nodes + 4 M2090s:
> .....
> |------------------- GPU DEVICE INFO --------------------
> |
> | Task ID: 0
> | CUDA Capable Devices Detected: 2
> | CUDA Device ID in use: 0
> | CUDA Device Name: Tesla M2090
> | CUDA Device Global Mem Size: 5375 MB
> | CUDA Device Num Multiprocessors: 16
> | CUDA Device Core Freq: 1.30 GHz
> |
> |
> | Task ID: 1
> | CUDA Capable Devices Detected: 2
> | CUDA Device ID in use: 0
> | CUDA Device Name: Tesla M2090
> | CUDA Device Global Mem Size: 5375 MB
> | CUDA Device Num Multiprocessors: 16
> | CUDA Device Core Freq: 1.30 GHz
> |
> |
> | Task ID: 2
> | CUDA Capable Devices Detected: 2
> | CUDA Device ID in use: 1
> | CUDA Device Name: Tesla M2090
> | CUDA Device Global Mem Size: 5375 MB
> | CUDA Device Num Multiprocessors: 16
> | CUDA Device Core Freq: 1.30 GHz
> |
> |
> | Task ID: 3
> | CUDA Capable Devices Detected: 2
> | CUDA Device ID in use: 1
> | CUDA Device Name: Tesla M2090
> | CUDA Device Global Mem Size: 5375 MB
> | CUDA Device Num Multiprocessors: 16
> | CUDA Device Core Freq: 1.30 GHz
> |
> |--------------------------------------------------------
> .....
> --------------------------------------------------------------------------------
> 3. ATOMIC COORDINATES AND VELOCITIES
> --------------------------------------------------------------------------------
>
> factor IX (ACTIVATED PROTEIN)
> begin time read from input coords = 2542.675 ps
>
>
> Number of triangulated 3-point waters found: 28358
>
> Sum of charges from parm topology file = 0.00031225
> Forcing neutrality...
>
> I get the following out-of-memory error:
> cudaMalloc GpuBuffer::Allocate failed out of memory
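>
> (For what it's worth, the single-node run above reports only ~300 MB of GPU
> memory in use per task, far below the 5375 MB on each M2090. To rule out
> something else holding memory on the boards I can check each node with
> something like:
>
>   nvidia-smi -q -d MEMORY
>
> but an allocation failure that only appears in the 2-node case still seems
> odd.)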
>
> Going back a step and rerunning the parallel_cuda tests, I've noticed
> the following:
>
> ------------------------------------
> Running CUDA Explicit solvent tests.
> Precision Model = SPDP
> GPU_ID = -1
> ------------------------------------
> cd 4096wat/ && ./Run.pure_wat -1 SPDP netcdf.mod
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> Image              PC                Routine  Line     Source
> pmemd.cuda_SPDP.M  000000000065286A  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  0000000000708348  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  00000000006C55C6  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  0000000000669350  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  000000000051E3CD  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  0000000000501DC8  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  00000000004DF51D  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  000000000040A31C  Unknown  Unknown  Unknown
> libc.so.6          0000003F3601D974  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  000000000040A229  Unknown  Unknown  Unknown
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> Image              PC                Routine  Line     Source
> pmemd.cuda_SPDP.M  000000000065286A  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  0000000000708348  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  00000000006C55C6  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  0000000000669350  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  000000000051E3CD  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  0000000000501DC8  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  00000000004DF51D  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  000000000040A31C  Unknown  Unknown  Unknown
> libc.so.6          0000003F3601D974  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  000000000040A229  Unknown  Unknown  Unknown
> forrtl: error (78): process killed (SIGTERM)
> Image              PC                Routine  Line     Source
> libmlx4-rdmav2.so  00002AFB541D0D17  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  00000000006979B9  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  0000000000677D5D  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  00000000006765FF  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  000000000065DDF3  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  00000000006C9ED4  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  00000000006C9311  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  000000000066050D  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  000000000065FB2E  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  00000000007093DD  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  0000000000706EA9  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  00000000006C552A  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  0000000000669730  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  00000000005207E5  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  0000000000501DC8  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  00000000004DF51D  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  000000000040A31C  Unknown  Unknown  Unknown
> libc.so.6          0000003F3601D974  Unknown  Unknown  Unknown
> pmemd.cuda_SPDP.M  000000000040A229  Unknown  Unknown  Unknown
> forrtl: error (78): process killed (SIGTERM)
> mpiexec_raw: Warning: tasks 0-1 exited with status 174.
> mpiexec_raw: Warning: tasks 2-3 exited with status 1.
> ./Run.pure_wat: Program error
> make[1]: *** [test.pmemd.cuda.pme] Error 1
> make[1]: Target `test.pmemd.cuda.MPI' not remade because of errors.
> make[1]: Leaving directory `/ichec/packages/amber/11/test/cuda'
> make: *** [test.pmemd.cuda.MPI] Error 2
> make: Target `test.parallel.cuda' not remade because of errors.
> 6 file comparisons passed
> 0 file comparisons failed
> 7 tests experienced errors
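>
> (For reference, the parallel CUDA tests are driven through the usual
> DO_PARALLEL mechanism, here set to launch 4 MPI tasks across the 2 nodes,
> roughly equivalent to:
>
>   export DO_PARALLEL="mpirun_rsh -n 4 -hostfile hosts.2.4"
>   cd $AMBERHOME/test && make test.parallel.cuda
>
> in case the task count itself matters.)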
>
> Am I missing something obvious? Could any of the AMBER users shed some
> light on the issue I'm having?
>
> Thanks in advance,
> Martin
>
>
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Feb 03 2012 - 16:30:03 PST