[AMBER] CUDA PMEMD on Longhorn: MPI_Win_free fatal error

From: Ben Roberts <ben.roberts.geek.nz>
Date: Mon, 12 Dec 2011 13:00:36 -0500

Hi all,

I've been trying to run a simulation on Longhorn, the GPU cluster at the University of Texas. When I do so, I hit a wee snag, specifically a fatal error in MPI_Win_free:

Fatal error in MPI_Win_free:
Invalid MPI_Win, error stack:
MPI_Win_free(120): MPI_Win_free(win=0x2f745e8) failed
MPI_Win_free(66).: Invalid MPI_Win
Fatal error in MPI_Win_free:
Invalid MPI_Win, error stack:
MPI_Win_free(120): MPI_Win_free(win=0x195985e8) failed
MPI_Win_free(66).: Invalid MPI_Win
Exit code -5 signaled from c207-106.longhorn
MPI process (rank: 1) terminated unexpectedly on c207-106.longhorn

This is the tail end of my mdout file:

| PMEMD ewald parallel performance parameters:
| block_fft = 0
| fft_blk_y_divisor = 2
| excl_recip = 0
| excl_master = 0
| atm_redist_freq = 320


 begin time read from input coords = 100.000 ps

 Number of triangulated 3-point waters found: 91555

Does anyone have any idea what might be causing this?

These are the conditions:

Operating system: CentOS release 5.6 (so says the file /etc/redhat-release, anyway)
MPI: mvapich2, version 1.4
Compilers: Intel 11.1 (2009-08-27) - icc and ifort are the same version

Amber: Version 11, patched up to patch 19 (the revised patch 19, that is)

Running on two four-processor nodes, each with a GPU (so notionally an eight-way job, but on two GPUs)

For what it's worth, I had earlier run the cellulose benchmark on the same cluster, and it worked with no problems. These are the differences between the cellulose mdin and the one I'm using now:

nstlim = 400000 (instead of 10000)
dt = 0.00125 (instead of 0.002)
ntpr = 160 (instead of 1000)
ntwr = 800 (instead of 10000)
ntwx = 1600 (instead of 1000)
iwrap = 1 (instead of the default)
ntt = 3 (instead of 1)
temp0 = 310.0 (instead of 300.0)
gamma_ln = 5 (instead of the default)
ig = -1 (instead of the default)
pres0 = 1.01325 (instead of the default)
cut = 10.0 (instead of 8.0)
tautp = default (instead of 10.0)
taup = default (instead of 10.0)
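
In case the timing matters, here is a rough sketch of what the step-based settings above come to in simulated time. It is plain Python, nothing Amber-specific, and the helper name steps_to_ps is made up for this message. It only assumes that dt is in picoseconds and that nstlim, ntpr, ntwr and ntwx are step counts.

```python
from decimal import Decimal

# Values copied from the list of mdin differences above.
DT_PS = Decimal("0.00125")  # time step, in picoseconds

SETTINGS = {
    "nstlim": 400000,  # total number of MD steps
    "ntpr": 160,       # steps between energy printouts
    "ntwr": 800,       # steps between restart writes
    "ntwx": 1600,      # steps between trajectory frames
}


def steps_to_ps(steps, dt=DT_PS):
    """Convert a step count to simulated time in picoseconds.

    Hypothetical helper for illustration only; not part of Amber.
    Decimal is used so the products are exact rather than rounded floats.
    """
    return steps * dt


if __name__ == "__main__":
    for name, steps in SETTINGS.items():
        print(f"{name:>6} = {steps:>6} steps -> {steps_to_ps(steps)} ps")
```

So the run is 500 ps in total, with energies printed every 0.2 ps, a restart written every 1 ps and a trajectory frame every 2 ps.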

Is it worth my time playing around with any of these settings, or am I better off trying something else? Alternatively, is it possible that the problem lies in my system?


