[AMBER] PMEMD should halt with error exit after netcdf write failure from Chris Moth on 2017-04-20 (Amber Archive Apr 2017)

From: Chris Moth <cmoth08.gmail.com>
Date: Thu, 20 Apr 2017 17:30:25 -0500

PMEMD is now _so_ fast - that disk write integrity may merit a second
look :)

I just had the unfortunate experience of running out of disk space
allocation on our cluster during a PMEMD run that was generating GBs of
trajectory frames (very fast). PMEMD did not halt, but kept running
while I was deleting files to save space.

Fortunately, I caught the corrupted zero-data frames in my trajectory
file when doing RMSD vs start analysis...

I suggest that if PMEMD cannot write a trajectory frame, it should halt
with a hard error and a nonzero exit status. (I recognize this request
may not be trivial in a parallel or GPU environment.)

In pmemd's bintraj.F90 there is this kind of code which writes the
trajectory frames

  if (unit .eq. mdcrd) then
     call checkNCerror(nf90_put_var(mdcrd_ncid, mdcrd_time_var_id, (/ t /), &
                       start=(/ mdcrd_frame /), count=(/ 1 /)), 'write time')

     call checkNCerror(nf90_sync(mdcrd_ncid))

I found checkNCerror in the Amber16 tools Fortran code and copy it
here.

As you can see it puts out an error message - but then returns (ouch).
I can't think of a time where failure of checkNCerror should return.
It's a pretty serious situation and I would recommend hard exit instead
of return after you output the message:

!--------------------------------------------------------------------
!> MODULE AMBERNETCDF FUNCTION CHECKNCERROR()
!> .brief Passive check for netcdf error.
subroutine checkNCerror(err, location)
   use netcdf
   implicit none
   integer, intent(in) :: err
   character(*), optional, intent(in) :: location
   if (err .ne. nf90_noerr) then
     write(mdout, '(a,a)') 'NetCDF error: ', trim(nf90_strerror(err))
     if (present(location)) then
       write(mdout, '(a,a)') ' at ', location
     end if
   end if

   ! ******* CAN WE PLEASE EXIT HERE INSTEAD OF RETURN ****** ???

end subroutine checkNCerror
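
To make the request concrete, here is a minimal sketch of what a halting
variant might look like. It is not the Amber source: the module name
nc_halt_sketch, the routine name checkNCerror_fatal, and the unit
mdout_unit (standing in for pmemd's mdout) are hypothetical names chosen
for this example. It assumes a Fortran 2008 compiler (for "error stop")
and an installed netcdf Fortran module.

```fortran
! Sketch only: a halting variant of checkNCerror.
! All names introduced here (nc_halt_sketch, checkNCerror_fatal, mdout_unit)
! are hypothetical and do not come from the Amber source.
module nc_halt_sketch
   use netcdf, only : nf90_noerr, nf90_strerror
   implicit none
   integer, parameter :: mdout_unit = 6   ! assumption: stdout stands in for mdout
contains
   subroutine checkNCerror_fatal(err, location)
      integer, intent(in) :: err
      character(*), optional, intent(in) :: location

      ! Nothing to do when the netcdf call succeeded.
      if (err .eq. nf90_noerr) return

      write(mdout_unit, '(a,a)') 'NetCDF error: ', trim(nf90_strerror(err))
      if (present(location)) then
         write(mdout_unit, '(a,a)') ' at ', location
      end if

      ! Halt with a nonzero exit status instead of returning.
      ! In an MPI build a collective abort (e.g. mpi_abort) would probably
      ! be needed so that every rank stops; that is not shown here.
      error stop 1
   end subroutine checkNCerror_fatal
end module nc_halt_sketch

! Small driver: opening a file that does not exist should print the
! netcdf error text and terminate with a nonzero exit status.
program demo
   use netcdf, only : nf90_open, nf90_nowrite
   use nc_halt_sketch
   implicit none
   integer :: ncid
   call checkNCerror_fatal(nf90_open('no_such_file.nc', nf90_nowrite, ncid), 'open')
end program demo
```

The only behavioral change from the routine quoted above is that the
error branch ends in "error stop 1" rather than falling through to the
return, so a failed write surfaces as a failed job.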

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Apr 20 2017 - 16:00:02 PDT