Hi Ben,
I believe this is a known bug (already fixed in the development tree). A patch
will be released shortly, once I have dealt with some pending deadlines.
If it is the same bug, the underlying cause is actually a problem with the
system itself: if the code gets a cell with zero atoms in it, it crashes. This
only happens when you have huge vacuum bubbles in your system. Ideally the code
should quit with a message saying the system is badly inhomogeneous. Either
way, if this is the bug, fixing it probably won't help you much, since your
system is most likely riddled with vacuum bubbles. Take a careful look at the
restart file and see whether there are any void spaces in your system.
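If you want a quick and dirty check rather than just eyeballing it in VMD,
something along the lines of the script below will flag obviously empty
regions. This is not part of Amber; it is only a minimal sketch that assumes an
ASCII-format restart with an orthorhombic box (lengths on the last line), and
the 4 Angstrom grid spacing is an arbitrary choice:

  #!/usr/bin/env python
  # check_voids.py -- rough check for vacuum bubbles in an ASCII Amber restart.
  # Assumptions: ASCII .rst7, orthorhombic box, last line ends "a b c al be ga".
  import sys
  from math import floor

  lines = open(sys.argv[1]).read().splitlines()
  natom = int(lines[1].split()[0])

  vals = [float(x) for x in " ".join(lines[2:]).split()]
  coords = vals[:3 * natom]               # x1 y1 z1 x2 y2 z2 ...
  a, b, c = vals[-6], vals[-5], vals[-4]  # box lengths from the final line

  cell = 4.0                              # target cell edge in Angstroms (arbitrary)
  nx = max(1, int(a // cell))
  ny = max(1, int(b // cell))
  nz = max(1, int(c // cell))

  counts = {}
  for i in range(natom):
      x, y, z = coords[3 * i:3 * i + 3]
      ix = int(floor((x % a) / a * nx))   # wrap coordinates back into the box
      iy = int(floor((y % b) / b * ny))
      iz = int(floor((z % c) / c * nz))
      counts[(ix, iy, iz)] = counts.get((ix, iy, iz), 0) + 1

  total = nx * ny * nz
  empty = total - len(counts)
  print("%d of %d cells (~%.1f%%) contain no atoms" % (empty, total, 100.0 * empty / total))

Run it as "python check_voids.py your_restart.rst"; for a properly solvated
system essentially no cells should come back empty.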
Then try running a much shorter NVT heating stage before switching on NPT, or
do a short pressure equilibration on the CPU first to squeeze out any vacuum
bubbles, and only then switch to the GPU code; a rough sketch of what I mean is
below.
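For concreteness, here is the sort of two-stage mdin pair I have in mind (call
the files heat.in and press.in, or whatever you like). The specific values
(step counts, taup, cutoff and so on) are only illustrative, not a
recommendation tuned to your system, so adjust them as needed:

  short NVT heating on the CPU (illustrative values only)
   &cntrl
     imin = 0, irest = 0, ntx = 1,
     nstlim = 25000, dt = 0.001,
     ntb = 1, ntp = 0,
     ntt = 3, gamma_ln = 5.0, tempi = 10.0, temp0 = 310.0,
     ntc = 2, ntf = 2, cut = 10.0,
     ntpr = 500, ntwr = 5000,
   /

  short NPT equilibration, still on the CPU (illustrative values only)
   &cntrl
     imin = 0, irest = 1, ntx = 5,
     nstlim = 50000, dt = 0.001,
     ntb = 2, ntp = 1, pres0 = 1.01325, taup = 2.0,
     ntt = 3, gamma_ln = 5.0, temp0 = 310.0,
     ntc = 2, ntf = 2, cut = 10.0,
     ntpr = 500, ntwr = 5000,
   /

Once the density has settled down and the box has stopped shrinking, restart
from that point with the GPU code for the production run.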
If this isn't the issue, i.e. your system looks fine, then please post the
files needed to reproduce the crash and we can take a look at it.
All the best
Ross
> -----Original Message-----
> From: Ben Roberts [mailto:ben.roberts.geek.nz]
> Sent: Monday, December 12, 2011 10:01 AM
> To: amber
> Subject: [AMBER] CUDA PMEMD on Longhorn: MPI_Win_free fatal error
>
> Hi all,
>
> I've been trying to run a simulation on Longhorn, the GPU cluster at
> the University of Texas. When I do so, I hit a wee snag, specifically a
> fatal error in MPI_Win_free:
>
> Fatal error in MPI_Win_free:
> Invalid MPI_Win, error stack:
> MPI_Win_free(120): MPI_Win_free(win=0x2f745e8) failed
> MPI_Win_free(66).: Invalid MPI_Win
> Fatal error in MPI_Win_free:
> Invalid MPI_Win, error stack:
> MPI_Win_free(120): MPI_Win_free(win=0x195985e8) failed
> MPI_Win_free(66).: Invalid MPI_Win
> Exit code -5 signaled from c207-106.longhorn
> MPI process (rank: 1) terminated unexpectedly on c207-106.longhorn
>
> This is the tail end of my mdout file:
>
> <<snip>>
> | PMEMD ewald parallel performance parameters:
> | block_fft = 0
> | fft_blk_y_divisor = 2
> | excl_recip = 0
> | excl_master = 0
> | atm_redist_freq = 320
>
> --------------------------------------------------------------------------------
> 3. ATOMIC COORDINATES AND VELOCITIES
> --------------------------------------------------------------------------------
>
>
> begin time read from input coords = 100.000 ps
>
>
> Number of triangulated 3-point waters found: 91555
> <<EOF>>
>
> Does anyone have any idea what might be causing this?
>
> These are the conditions:
>
> Operating system: CentOS release 5.6 (so says the file
> /etc/redhat-release, anyway)
> MPI: mvapich2, version 1.4
> Compilers: Intel 11.1 (2009-08-27) - icc and ifort are the same version
>
> Amber: Version 11, patched up to patch 19 (the revised patch 19, that
> is)
>
> Running on two four-processor nodes, each with a GPU (so a notionally
> eight-way job, but on two GPUs)
>
>
> For what it's worth, I had earlier run the cellulose benchmark on the
> same cluster, and it worked with no problems. These are the differences
> between the cellulose mdin and the one I'm using now:
>
> nstlim = 400000 (instead of 10000)
> dt = 0.00125 (instead of 0.002)
> ntpr = 160 (instead of 1000)
> ntwr = 800 (instead of 10000)
> ntwx = 1600 (instead of 1000)
> iwrap = 1 (instead of the default)
> ntt = 3 (instead of 1)
> temp0 = 310.0 (instead of 300.0)
> gamma_ln = 5 (instead of the default)
> ig = -1 (instead of the default)
> pres0 = 1.01325 (instead of the default)
> cut = 10.0 (instead of 8.0)
> tautp = default (instead of 10.0)
> taup = default (instead of 10.0)
>
> Is it worth my time playing around with any of these settings, or am I
> better off trying something else? Alternatively, is it possible that
> the problem lies in my system?
>
> Thanks,
> Ben
>
>
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Dec 12 2011 - 11:00:03 PST