Re: [AMBER] CUDA PMEMD on Longhorn: MPI_Win_free fatal error

From: Scott Le Grand <varelse2005.gmail.com>
Date: Mon, 12 Dec 2011 10:25:09 -0800

Without the specific value of ig, nothing more can be done...

ig=-1 picks a random seed, so every run is different...
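
To get a run that can be repeated, one option is to pin the seed in the &cntrl namelist instead of using -1. A minimal sketch, assuming the rest of the mdin stays exactly as in the quoted message; the value 12345 is an arbitrary placeholder and not a seed taken from the failing run:

```
 &cntrl
   ! fragment only: all other settings unchanged from the original mdin
   ntt = 3, gamma_ln = 5.0, temp0 = 310.0,
   ig = 12345,   ! fixed seed (placeholder value) instead of ig = -1
 /
```

With a fixed ig, two runs from the same restart file start from the same random number stream, so a failure that shows up once can be retried under the same conditions.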


On Mon, Dec 12, 2011 at 10:00 AM, Ben Roberts <ben.roberts.geek.nz> wrote:

> Hi all,
>
> I've been trying to run a simulation on Longhorn, the GPU cluster at the
> University of Texas. When I do so, I hit a wee snag, specifically a fatal
> error in MPI_Win_free:
>
> Fatal error in MPI_Win_free:
> Invalid MPI_Win, error stack:
> MPI_Win_free(120): MPI_Win_free(win=0x2f745e8) failed
> MPI_Win_free(66).: Invalid MPI_Win
> Fatal error in MPI_Win_free:
> Invalid MPI_Win, error stack:
> MPI_Win_free(120): MPI_Win_free(win=0x195985e8) failed
> MPI_Win_free(66).: Invalid MPI_Win
> Exit code -5 signaled from c207-106.longhorn
> MPI process (rank: 1) terminated unexpectedly on c207-106.longhorn
>
> This is the tail end of my mdout file:
>
> <<snip>>
> | PMEMD ewald parallel performance parameters:
> | block_fft = 0
> | fft_blk_y_divisor = 2
> | excl_recip = 0
> | excl_master = 0
> | atm_redist_freq = 320
>
>
> --------------------------------------------------------------------------------
> 3. ATOMIC COORDINATES AND VELOCITIES
>
> --------------------------------------------------------------------------------
>
>
> begin time read from input coords = 100.000 ps
>
>
> Number of triangulated 3-point waters found: 91555
> <<EOF>>
>
> Does anyone have any idea what might be causing this?
>
> These are the conditions:
>
> Operating system: CentOS release 5.6 (so says the file
> /etc/redhat-release, anyway)
> MPI: mvapich2, version 1.4
> Compilers: Intel 11.1 (2009-08-27) - icc and ifort are the same version
>
> Amber: Version 11, patched up to patch 19 (the revised patch 19, that is)
>
> Running on two four-processor nodes, each with a GPU (so a notionally
> eight-way job, but on two GPUs)
>
>
> For what it's worth, I had earlier run the cellulose benchmark on the same
> cluster, and it worked with no problems. These are the differences between
> the cellulose mdin and the one I'm using now:
>
> nstlim = 400000 (instead of 10000)
> dt = 0.00125 (instead of 0.002)
> ntpr = 160 (instead of 1000)
> ntwr = 800 (instead of 10000)
> ntwx = 1600 (instead of 1000)
> iwrap = 1 (instead of the default)
> ntt = 3 (instead of 1)
> temp0 = 310.0 (instead of 300.0)
> gamma_ln = 5 (instead of the default)
> ig = -1 (instead of the default)
> pres0 = 1.01325 (instead of the default)
> cut = 10.0 (instead of 8.0)
> tautp = default (instead of 10.0)
> taup = default (instead of 10.0)
>
> Is it worth my time playing around with any of these settings, or am I
> better off trying something else? Alternatively, is it possible that the
> problem lies in my system?
>
> Thanks,
> Ben
>
>
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>
Received on Mon Dec 12 2011 - 10:30:03 PST