Re: [AMBER] MMPBSA.MPI: IOError: [Errno 9] Bad file descriptor

From: Jason Swails <jason.swails.gmail.com>
Date: Fri, 7 Oct 2011 11:20:29 -0400

On Fri, Oct 7, 2011 at 8:20 AM, Jan-Philip Gehrcke
<jgehrcke.googlemail.com> wrote:

> Hey George (and Jason, who will probably read this with interest),
>
> I can only contribute from a very technical point of view, without
> knowing all the details of how MMPBSA.py.MPI works. The error you've
> seen is printed when Python tries to close a file that it expects to
> still be open, but which is not, because "something" closed it earlier
> *without* using the `close()` method of the corresponding file object.
> As Python tries to close the file anyway, the operating system tells
> it: hey, IOError, there is nothing to close under that file descriptor
> (because it was already closed before).
>

I think this is the appropriate explanation. However, I am tempted to chalk
this up to an mpi4py issue (or perhaps to some unusual, yet documented,
behavior that I didn't read about). In a couple of places, every thread
reads the same file to extract information (as opposed to having the master
read it and broadcast). In at least one instance (when dealing with
MMPBSA_Timer), I've seen an object created by one thread get modified by
another (which is why only the master logs any timing information). This
should *not* happen IMO, since each thread should have its own memory
locations dedicated to it. So it *could* be that each thread is opening the
file for reading via the same pointer to the same location in memory, and
we get a 'race condition' of sorts that results in the error here.
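
For reference, here is a minimal sketch (plain Python 2, nothing
MMPBSA.py-specific, with a made-up file name) of how a descriptor that
gets closed behind a file object's back produces exactly that message:

    import os

    # Open a file object; Python now owns the underlying OS-level
    # file descriptor.
    f = open("example.txt", "w")

    # "Something" closes that descriptor without going through
    # f.close() -- here we do it directly to provoke the error.
    os.close(f.fileno())

    # When the file object is garbage collected, its destructor tries
    # to close the descriptor again, the close fails with EBADF, and
    # something like the following should be printed:
    #   close failed in file object destructor:
    #   IOError: [Errno 9] Bad file descriptor
    del f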

Unfortunately, altering the code to ensure that any given file is only ever
opened by a single thread would require a bit of ugly hacking, so I'm hoping
that the new approach in the upcoming release will itself fix this issue
(file IO on shared files is handled almost exclusively by the master, but I
will look into this a little further).
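
In mpi4py terms, the pattern I mean is roughly the following (just a
sketch with a made-up file name, not the actual MMPBSA.py code):

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    lines = None
    if rank == 0:
        # Only the master process ever opens the shared file ...
        with open("shared_input.txt") as fh:
            lines = fh.readlines()

    # ... and its contents are broadcast to the other processes, so no
    # two processes ever hold descriptors to the same file.
    lines = comm.bcast(lines, root=0)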

It could also be that the particular file system in use has an effect on
whether or not this error is seen. If it's a file system designed for
parallel use (e.g. Lustre), I'm guessing this error would be less frequent.
However, this is all speculation. Hopefully the new version fixes this.

Thanks!
Jason


> It could be that, under some circumstances, different MPI processes of
> MMPBSA.py.MPI try to close a file via the same operating-system-level
> file descriptor. But how would they end up doing that?
>
> As we don't see a traceback here, and the error message tells us "close
> failed in file object destructor", it is likely that the invalid close
> attempt happens during Python's garbage collection. Another theory is
> that the MPI implementation results in calls to `close()` on different
> file objects wrapping the same operating-system-level file descriptors.
>
> On the other hand, it is unlikely that the various processes call
> `close()` on identical file objects, because this would prevent the issue.
>
> In the end, we probably have some kind of race condition regarding file
> closing attempts. The behavior you've seen is consistent with the fact
> that the outcome of a race condition is not really predictable.
>
> All this is only a theory based on limited evidence, but in conclusion
> it looks like an issue with MMPBSA.py.MPI's file management. It
> probably does not have a negative effect on the results, but it should
> be investigated more deeply.
>
> Jan-Philip
>
>
> On 10/07/2011 12:45 AM, George Tzotzos wrote:
> > Hi everybody,
> >
> > I'm running MMPBSA.MPI per-residue decomposition.
> >
> > I've used the program 4 times today on different trajectories, and each
> run produced data output as expected.
> >
> > A 5th run on a new trajectory, using the same input parameters as in the
> previous runs, gives the following error message.
> >
> > Beginning PB calculations with sander...
> > calculating complex contribution...
> > close failed in file object destructor:
> > IOError: [Errno 9] Bad file descriptor
> >
> > Is there a remedy for this? But, most importantly, what is the reason? I
> checked the archive and found that a similar problem had been reported
> earlier. I did apply the bugfix patches and, as I mentioned earlier, the
> program ran seamlessly on previous occasions.
> >
> > I am attaching the _MMPBSA_complex_pb.mdout.11 file for diagnostic
> purposes.
> >
> > Your help will be, as always, appreciated
> >
> > George



-- 
Jason M. Swails
Quantum Theory Project,
University of Florida
Ph.D. Candidate
352-392-4032
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Oct 07 2011 - 08:30:04 PDT