Re: [AMBER] PMEMD.CUDA.MPI PME tests/Benchmarks

From: Martin Peters <martin.b.peters.me.com>
Date: Tue, 7 Feb 2012 09:07:23 +0000

Hi Ross,

On 4 Feb 2012, at 00:15, Ross Walker wrote:

> Hi Martin,
>
> I don't mean to burst your bubble but with a DDR interconnect and not even
> full bisection bandwidth you will be extremely unlikely to get any PME runs
> to scale on multiple GPUs across nodes. There just isn't the interconnect
> bandwidth there unfortunately.

No bubble burst here: all I'm trying to do is put together some documentation for our users.
It would be nice to show the poor scaling and save others from repeating the same runs.
I still see a pretty good speed-up on a single node, so it's not all bad. It's just a
little odd that the program runs fine on a single node with two GPUs but
segfaults when requesting two or four nodes. I don't believe it is
a memory issue, but I could be wrong.
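
(A crude way to double-check the memory angle, for anyone trying to reproduce this, is to look at each node's cards just before launching; the hostfile name below is only an example.)

for h in $(sort -u hosts.2.4); do
    echo "== $h =="        # which node we are querying
    ssh "$h" nvidia-smi    # both M2090s should show up and be idle
done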

> With regards to the segfault, assuming you are running with all the latest
> bugfixes applied try setting the following on ALL nodes:
>
> export CUDA_NIC_INTEROP=1

I gave this a go too, but I'm afraid it didn't change the outcome.
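
(For completeness: mvapich2's mpirun_rsh should also forward environment variables given as NAME=value on the command line, so the setting can be pushed to every rank directly at launch, roughly as below; the hostfile and suffix names are just examples.)

mpirun_rsh -n 4 -hostfile hosts.2.4 CUDA_NIC_INTEROP=1 \
    $AMBERHOME/exe/pmemd.cuda.MPI -O -i mdin -p prmtop -c inpcrd \
    -ref inpcrd -suffix parallelCuda.n2.g4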

I was using CUDA/4.0; would 3.2 or 4.1 work any better?
Will AMBER 12 resolve this segfault, or at least give the user more informative debug messages about it?

One last question: could it be a configure issue? This is how I compiled pmemd.cuda.MPI:

cd $AMBERHOME/AmberTools/src/
make clean
./configure -cuda -mpi intel
cd ../../
./AT15_Amber.py
cd src/
make clean
make cuda_parallel
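
(As a sanity check on the build itself, listing the shared libraries the binary resolved at least confirms which CUDA installation it picked up at runtime; the path matches the run commands further down.)

ldd $AMBERHOME/exe/pmemd.cuda.MPI | grep -iE 'cuda|curand|cufft|mpi'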

config.h:
#MODIFIED FOR AMBERTOOLS 1.5
# Amber configuration file, created with: ./configure -cuda -mpi intel

###############################################################################

# (1) Location of the installation

BINDIR=/ichec/packages/amber/11/bin
LIBDIR=/ichec/packages/amber/11/lib
INCDIR=/ichec/packages/amber/11/include
DATDIR=/ichec/packages/amber/11/dat

###############################################################################


# (2) If you want to search additional libraries by default, add them
# to the FLIBS variable here. (External libraries can also be linked into
# NAB programs simply by including them on the command line; libraries
# included in FLIBS are always searched.)

FLIBS= -L$(LIBDIR) -lsff_mpi -lpbsa $(LIBDIR)/arpack.a $(LIBDIR)/libnetcdf.a -Wl,--start-group /ichec/packages/intel/mkl/10.2.6.038/lib/em64t/libmkl_intel_lp64.a /ichec/packages/intel/mkl/10.2.6.038/lib/em64t/libmkl_sequential.a /ichec/packages/intel/mkl/10.2.6.038/lib/em64t/libmkl_core.a -Wl,--end-group -lpthread -L/ichec/packages/intel/composerxe_fc/2011.1.107/composerxe-2011.1.107/lib/intel64/ -lifport -lifcore -lsvml
FLIBS_PTRAJ= $(LIBDIR)/arpack.a -Wl,--start-group /ichec/packages/intel/mkl/10.2.6.038/lib/em64t/libmkl_intel_lp64.a /ichec/packages/intel/mkl/10.2.6.038/lib/em64t/libmkl_sequential.a /ichec/packages/intel/mkl/10.2.6.038/lib/em64t/libmkl_core.a -Wl,--end-group -lpthread -L/ichec/packages/intel/composerxe_fc/2011.1.107/composerxe-2011.1.107/lib/intel64/ -lifport -lifcore -lsvml
FLIBSF= $(LIBDIR)/arpack.a -Wl,--start-group /ichec/packages/intel/mkl/10.2.6.038/lib/em64t/libmkl_intel_lp64.a /ichec/packages/intel/mkl/10.2.6.038/lib/em64t/libmkl_sequential.a /ichec/packages/intel/mkl/10.2.6.038/lib/em64t/libmkl_core.a -Wl,--end-group -lpthread -lsvml
FLIBS_FFTW2=-L$(LIBDIR)
###############################################################################

# (3) Modify any of the following if you need to change, e.g. to use gcc
# rather than cc, etc.

SHELL=/bin/sh
INSTALLTYPE=cuda_parallel

# Set the C compiler, etc.

# For GNU: CC-->gcc; LEX-->flex; YACC-->bison -y -t;
# Note: If your lexer is "really" flex, you need to set
# LEX=flex below. For example, on many linux distributions,
# /usr/bin/lex is really just a pointer to /usr/bin/flex,
# so LEX=flex is necessary. In general, gcc seems to need flex.

# The compiler flags CFLAGS and CXXFLAGS should always be used.
# By contrast, *OPTFLAGS and *NOOPTFLAGS will only be used with
# certain files, and usually at compile-time but not link-time.
# Where *OPTFLAGS and *NOOPTFLAGS are requested (in Makefiles,
# makedepend and depend), they should come before CFLAGS or
# CXXFLAGS; this allows the user to override *OPTFLAGS and
# *NOOPTFLAGS using the BUILDFLAGS variable.
CC=mpicc
CFLAGS= -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -DBINTRAJ -DMPI $(CUSTOMBUILDFLAGS) $(AMBERCFLAGS)
OCFLAGS= $(COPTFLAGS) $(AMBERCFLAGS)
CNOOPTFLAGS=
COPTFLAGS=-ip -O3 -xHost -DBINTRAJ -DHASGZ -DHASBZ2
AMBERCFLAGS= $(AMBERBUILDFLAGS)

CXX=icpc
CPLUSPLUS=icpc
CXXFLAGS=-std=c++0x -DMPI $(CUSTOMBUILDFLAGS)
CXXNOOPTFLAGS=
CXXOPTFLAGS=-O3
AMBERCXXFLAGS=-std=c++0x $(AMBERBUILDFLAGS)

NABFLAGS=

LDFLAGS=-shared-intel $(CUSTOMBUILDFLAGS) $(AMBERLDFLAGS)
AMBERLDFLAGS=$(AMBERBUILDFLAGS)

LEX= flex
YACC= $(BINDIR)/yacc
AR= ar rv
M4= m4
RANLIB=ranlib

# Set the C-preprocessor. Code for a small preprocessor is in
# ucpp-1.3; it gets installed as $(BINDIR)/ucpp;
# this can generally be used (maybe not on 64-bit machines like altix).

CPP= $(BINDIR)/ucpp -l

# These variables control whether we will use compiled versions of BLAS
# and LAPACK (which are generally slower), or whether those libraries are
# already available (presumably in an optimized form).

LAPACK=skip
BLAS=skip
F2C=skip

# These variables determine whether builtin versions of certain components
# can be used, or whether we need to compile our own versions.

UCPP=install
C9XCOMPLEX=skip

# For Windows/cygwin, set SFX to ".exe"; for Unix/Linux leave it empty:
# Set OBJSFX to ".obj" instead of ".o" on Windows:

SFX=
OSFX=.o
MV=mv
RM=rm
CP=cp

# Information about Fortran compilation:

FC=mpif90
FFLAGS= $(LOCALFLAGS) $(CUSTOMBUILDFLAGS) $(FNOOPTFLAGS)
FNOOPTFLAGS= -O0
FOPTFLAGS= -ip -O3 -xHost $(LOCALFLAGS) $(CUSTOMBUILDFLAGS)
AMBERFFLAGS=$(AMBERBUILDFLAGS)
FREEFORMAT_FLAG= -FR
LM=-lm
FPP=cpp -traditional $(FPPFLAGS) $(AMBERFPPFLAGS)
FPPFLAGS=-P -DMKL -DBINTRAJ -DMPI $(CUSTOMBUILDFLAGS)
AMBERFPPFLAGS=$(AMBERBUILDFLAGS)


BUILD_SLEAP=install_sleap
XHOME= /usr/X11R6
XLIBS= -L/usr/X11R6/lib64 -L/usr/X11R6/lib
MAKE_XLEAP=install_xleap
NETCDF=netcdf.mod
NETCDFLIB=$(LIBDIR)/libnetcdf.a
PNETCDF=yes
PNETCDFLIB=$(LIBDIR)/libpnetcdf.a
ZLIB=-lz
BZLIB=-lbz2

HASFC=yes
MDGX=yes
CPPTRAJ=yes
MTKPP=

COMPILER=intel
MKL=/ichec/packages/intel/mkl/10.2.6.038
MKL_PROCESSOR=em64t

#CUDA Specific build flags
NVCC=$(CUDA_HOME)/bin/nvcc -use_fast_math -O3 -gencode arch=compute_13,code=sm_13 -gencode arch=compute_20,code=sm_20
PMEMD_CU_INCLUDES=-I$(CUDA_HOME)/include -IB40C -IB40C/KernelCommon -I/ichec/packages/mvapich/1.5.1-intel/include
PMEMD_CU_LIBS=-L$(CUDA_HOME)/lib64 -L$(CUDA_HOME)/lib -lcurand -lcufft -lcudart ./cuda/cuda.a
PMEMD_CU_DEFINES=-DCUDA -DMPI -DMPICH_IGNORE_CXX_SEEK

#PMEMD Specific build flags
PMEMD_FPP=cpp -traditional -DMPI -P -DMKL -DBINTRAJ -DDIRFRC_EFS -DDIRFRC_COMTRANS -DDIRFRC_NOVEC -DFFTLOADBAL_2PROC -DPUBFFT
PMEMD_NETCDFLIB= $(NETCDFLIB)
PMEMD_F90=mpif90
PMEMD_FOPTFLAGS=-ip -O3 -no-prec-div -xHost
PMEMD_CC=mpicc
PMEMD_COPTFLAGS=-ip -O3 -no-prec-div -xHost -DMPICH_IGNORE_CXX_SEEK -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -DBINTRAJ -DMPI
PMEMD_FLIBSF=-Wl,--start-group /ichec/packages/intel/mkl/10.2.6.038/lib/em64t/libmkl_intel_lp64.a /ichec/packages/intel/mkl/10.2.6.038/lib/em64t/libmkl_sequential.a /ichec/packages/intel/mkl/10.2.6.038/lib/em64t/libmkl_core.a -Wl,--end-group -lpthread
PMEMD_LD= mpif90
LDOUT= -o

#3D-RISM MPI
RISMSFF=
SANDER_RISM_MPI=sander.RISM.MPI$(SFX)
TESTRISM=

#PUPIL
PUPILLIBS=-lrt -lm -lc -L${PUPIL_PATH}/lib -lPUPIL -lPUPILBlind

#Python
PYINSTALL=
 
All the best,
Martin

> All the best
> Ross
>
>> -----Original Message-----
>> From: Martin Peters [mailto:martin.b.peters.me.com]
>> Sent: Friday, February 03, 2012 7:09 AM
>> To: AMBER Mailing List
>> Subject: [AMBER] PMEMD.CUDA.MPI PME tests/Benchmarks
>>
>> Hi,
>>
>> We have just upgraded our stoney cluster here at ICHEC with a number of
>> M2090s and I'd like to do some scaling tests using PMEMD (24 nodes with
>> 2 cards each).
>> The cluster has 64 nodes each with two 2.8GHz Intel (Nehalem EP) Xeon
>> X5560 quad core processors and 48GB of RAM.
>> The nodes are interconnected via a half-blocking fat-tree network of
>> ConnectX Infiniband (DDR).
>>
>> Amber 11/AmberTools 1.5 (all patches up until the 17th Jan 2012 were
>> applied) was built using the following: Intel C/Fortran 2011.1.107,
>> mvapich2 v1.5.1, MKL v10.2.6.038, and CUDA v4.0. All compilation steps
>> completed without any problems.
>>
>> The GB/nucleosome test (downloaded from
>> http://ambermd.org/gpus/benchmarks.htm) scales all the way up to 24
>> nodes/48 M2090s. However I'm running into a few problems with the PME
>> cases.
>>
>> I'm executing the program using commands similar to the following:
>> for i in {1..1}; do cat $PBS_NODEFILE | uniq >> hosts.1.2; done;
>> mpirun_rsh -n 2 -hostfile hosts.1.2 $AMBERHOME/exe/pmemd.cuda.MPI -O -i
>> mdin -p prmtop -c inpcrd -ref inpcrd -suffix parallelCuda.n1.g2
>>
>> Here is the output from FactorIX_production_NVE benchmark using 1 node
>> with two M2090s:
>> |------------------- GPU DEVICE INFO --------------------
>> |
>> | Task ID: 0
>> | CUDA Capable Devices Detected: 2
>> | CUDA Device ID in use: 0
>> | CUDA Device Name: Tesla M2090
>> | CUDA Device Global Mem Size: 5375 MB
>> | CUDA Device Num Multiprocessors: 16
>> | CUDA Device Core Freq: 1.30 GHz
>> |
>> |
>> | Task ID: 1
>> | CUDA Capable Devices Detected: 2
>> | CUDA Device ID in use: 1
>> | CUDA Device Name: Tesla M2090
>> | CUDA Device Global Mem Size: 5375 MB
>> | CUDA Device Num Multiprocessors: 16
>> | CUDA Device Core Freq: 1.30 GHz
>> |
>> |--------------------------------------------------------
>> ....
>> | Nonbonded Pairs Initial Allocation: 10363284
>> ....
>> | GPU memory information:
>> | KB of GPU memory in use: 304037
>> | KB of CPU memory in use: 73726
>>
>> | Running AMBER/MPI version on 2 nodes
>>
>> However, when using 2 nodes + 4 M2090s:
>> .....
>> |------------------- GPU DEVICE INFO --------------------
>> |
>> | Task ID: 0
>> | CUDA Capable Devices Detected: 2
>> | CUDA Device ID in use: 0
>> | CUDA Device Name: Tesla M2090
>> | CUDA Device Global Mem Size: 5375 MB
>> | CUDA Device Num Multiprocessors: 16
>> | CUDA Device Core Freq: 1.30 GHz
>> |
>> |
>> | Task ID: 1
>> | CUDA Capable Devices Detected: 2
>> | CUDA Device ID in use: 0
>> | CUDA Device Name: Tesla M2090
>> | CUDA Device Global Mem Size: 5375 MB
>> | CUDA Device Num Multiprocessors: 16
>> | CUDA Device Core Freq: 1.30 GHz
>> |
>> |
>> | Task ID: 2
>> | CUDA Capable Devices Detected: 2
>> | CUDA Device ID in use: 1
>> | CUDA Device Name: Tesla M2090
>> | CUDA Device Global Mem Size: 5375 MB
>> | CUDA Device Num Multiprocessors: 16
>> | CUDA Device Core Freq: 1.30 GHz
>> |
>> |
>> | Task ID: 3
>> | CUDA Capable Devices Detected: 2
>> | CUDA Device ID in use: 1
>> | CUDA Device Name: Tesla M2090
>> | CUDA Device Global Mem Size: 5375 MB
>> | CUDA Device Num Multiprocessors: 16
>> | CUDA Device Core Freq: 1.30 GHz
>> |
>> |--------------------------------------------------------
>> .....
>> --------------------------------------------------------------------------------
>> 3. ATOMIC COORDINATES AND VELOCITIES
>> --------------------------------------------------------------------------------
>>
>> factor IX (ACTIVATED PROTEIN)
>> begin time read from input coords = 2542.675 ps
>>
>>
>> Number of triangulated 3-point waters found: 28358
>>
>> Sum of charges from parm topology file = 0.00031225
>> Forcing neutrality...
>>
>> I get the following out of memory error:
>> cudaMalloc GpuBuffer::Allocate failed out of memory
>>
>> Going back a step and rerunning the parallel_cuda tests I've noticed
>> the following:
>>
>> ------------------------------------
>> Running CUDA Explicit solvent tests.
>> Precision Model = SPDP
>> GPU_ID = -1
>> ------------------------------------
>> cd 4096wat/ && ./Run.pure_wat -1 SPDP netcdf.mod
>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>> Image              PC                Routine  Line     Source
>> pmemd.cuda_SPDP.M  000000000065286A  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  0000000000708348  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  00000000006C55C6  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  0000000000669350  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  000000000051E3CD  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  0000000000501DC8  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  00000000004DF51D  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  000000000040A31C  Unknown  Unknown  Unknown
>> libc.so.6          0000003F3601D974  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  000000000040A229  Unknown  Unknown  Unknown
>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>> Image              PC                Routine  Line     Source
>> pmemd.cuda_SPDP.M  000000000065286A  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  0000000000708348  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  00000000006C55C6  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  0000000000669350  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  000000000051E3CD  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  0000000000501DC8  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  00000000004DF51D  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  000000000040A31C  Unknown  Unknown  Unknown
>> libc.so.6          0000003F3601D974  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  000000000040A229  Unknown  Unknown  Unknown
>> forrtl: error (78): process killed (SIGTERM)
>> Image              PC                Routine  Line     Source
>> libmlx4-rdmav2.so  00002AFB541D0D17  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  00000000006979B9  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  0000000000677D5D  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  00000000006765FF  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  000000000065DDF3  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  00000000006C9ED4  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  00000000006C9311  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  000000000066050D  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  000000000065FB2E  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  00000000007093DD  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  0000000000706EA9  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  00000000006C552A  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  0000000000669730  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  00000000005207E5  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  0000000000501DC8  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  00000000004DF51D  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  000000000040A31C  Unknown  Unknown  Unknown
>> libc.so.6          0000003F3601D974  Unknown  Unknown  Unknown
>> pmemd.cuda_SPDP.M  000000000040A229  Unknown  Unknown  Unknown
>> forrtl: error (78): process killed (SIGTERM)
>> mpiexec_raw: Warning: tasks 0-1 exited with status 174.
>> mpiexec_raw: Warning: tasks 2-3 exited with status 1.
>> ./Run.pure_wat: Program error
>> make[1]: *** [test.pmemd.cuda.pme] Error 1
>> make[1]: Target `test.pmemd.cuda.MPI' not remade because of errors.
>> make[1]: Leaving directory `/ichec/packages/amber/11/test/cuda'
>> make: *** [test.pmemd.cuda.MPI] Error 2
>> make: Target `test.parallel.cuda' not remade because of errors.
>> 6 file comparisons passed
>> 0 file comparisons failed
>> 7 tests experienced errors
>>
>> Am I missing something obvious? Could any of the AMBER users shed some
>> light on the issue I'm having?
>>
>> Thanks in advance,
>> Martin
>>
>>
>>
>> _______________________________________________
>> AMBER mailing list
>> AMBER.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber
>
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Feb 07 2012 - 01:30:03 PST