Re: [AMBER] CUDA PMEMD on Longhorn: MPI_Win_free fatal error

From: Ben Roberts <ben.roberts.geek.nz>
Date: Fri, 30 Dec 2011 19:08:40 -0500

Hi again,

On 12/12/2011, at 4:19 p.m., Ben Roberts wrote:

> Hi Ross,
>
> On 12/12/2011, at 1:51 p.m., Ross Walker wrote:
>
>> Hi Ben,
>>
>> I believe this is a known (and fixed in the development tree) bug. A patch
>> will be released in a bit when I have dealt with some pending deadlines.
>>
>> If this is the same bug, it is actually due to problems with the system
>> itself: if the code gets a cell with zero atoms in it, it crashes. This
>> happens only when you have huge vacuum bubbles in your system. Ideally the
>> code should probably quit with a message about the system being very
>> inhomogeneous. Either way, if it is this bug, fixing it probably won't help
>> you much, as your system is probably messed up with vacuum bubbles. Try
>> looking at the restart file carefully and see if you have any void spaces
>> in your system.
>>
>> Then try running a much shorter NVT heating stage before switching on NPT.
>> Maybe do a short pressure equilibration on the CPU to equilibrate out any
>> vacuum bubbles and then switch to the GPU code.
>>
>> If this isn't the issue, i.e. your system looks fine, then please post the
>> necessary files to reproduce it and we can take a look.
>
> I'm currently running an NVT simulation to do further equilibration. When I looked at the system, though, there didn't seem to be any dishing or bubbling. The only part of the solvent that looked a bit thin to me was where I might have expected the solute's periodic image to be.

I've done a bit more investigation, incorporating Scott's and Ross's suggestions for debugging. Specifically, I ran with a known "ig" value (viz. 123456), and I asked for coordinates and velocities to be saved, and energies to be logged, at every step.

The results are shown at the end of this email.

My suspicion, based on what I see there, is that the fundamental problem is that my system is too big ("too many hydrogens"). That complaint triggers a call to gpu_shutdown(), which, if I guess correctly, in turn calls MPI_Win_free, producing the fatal errors shown in STDERR below. Would that be a fair assessment?
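
To illustrate what I mean, here is a minimal sketch in C. It is not AMBER code, and the assumption that the shutdown path frees a window that was never actually created is mine; it just reproduces the same class of failure, where an unconditional cleanup calls MPI_Win_free on a handle that was never set up and MVAPICH reports "Invalid MPI_Win":

    /* Minimal illustration, not AMBER source: freeing an MPI window that
     * was never created produces the same "Invalid MPI_Win" fatal error
     * seen in my stderr below. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Win win = MPI_WIN_NULL;  /* setup aborted before any MPI_Win_create() */

        MPI_Init(&argc, &argv);

        /* Unconditional cleanup, analogous to what I suspect the shutdown
         * path does: with MVAPICH this fails with "Fatal error in
         * MPI_Win_free: Invalid MPI_Win", obscuring the original message. */
        MPI_Win_free(&win);

        MPI_Finalize();
        return 0;
    }

If that guess is right, freeing the window only when it was actually created (e.g. checking a flag, or that the handle is not MPI_WIN_NULL) would at least let the real "too many hydrogens" complaint reach the user cleanly.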

Also, if the problem really is that there are simply too many hydrogens in the system, what are the actual limits? Or is that in some way system-dependent?

Cheers,
Ben


1. The MDOUT
   ---------

This simply stopped being written to before the simulation started in earnest. The last entry in the file was:

Number of triangulated 3-point waters found: 91555

This was followed by an EOL and then an EOF.

2. The trajectories
   ----------------

These contain only header information, as far as I can tell; their sizes are measured in bytes.

3. The standard output
   -------------------

After quoting my input script, I see this output from the system:

TACC: Done.
TACC: Starting up job 119853
TACC: Setting up parallel environment for MVAPICH ssh-based mpirun.
TACC: Setup complete. Running job script.
TACC: starting parallel tasks...
Too many hydrogens for a hydrogen network, exiting.
Too many hydrogens for a hydrogen network, exiting.
TACC: MPI job exited with code: 1
TACC: Shutting down parallel environment.
TACC: Shutdown complete. Exiting.
TACC: Cleaning up after job: 119853
TACC: Done.

4. The standard error
   ------------------

Fatal error in MPI_Win_free:
Invalid MPI_Win, error stack:
MPI_Win_free(120): MPI_Win_free(win=0x1aa985e8) failed
MPI_Win_free(66).: Invalid MPI_Win
Fatal error in MPI_Win_free:
Invalid MPI_Win, error stack:
MPI_Win_free(120): MPI_Win_free(win=0x101e75e8) failed
MPI_Win_free(66).: Invalid MPI_Win
MPI process (rank: 1) terminated unexpectedly on c203-124.longhorn
Exit code -5 signaled from c203-124.longhorn




_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Dec 30 2011 - 16:30:03 PST