Re: [AMBER] Could not read velocities from restart file

From: Bill Ross <ross.cgl.ucsf.EDU>
Date: Tue, 10 Apr 2012 15:09:48 -0700

My vote is for closing the restrt file every time it is written, to
guarantee that it is flushed. Possibly this could be done by
an independent thread, to avoid slowing everything else down?
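
A minimal sketch of the idea in Python, not pmemd's actual Fortran I/O.
The function names and the ".tmp" suffix are hypothetical. Writing to a
temporary file and renaming it is an added precaution, not something
proposed in this thread.

```python
import os
import threading


def write_restart(path, text):
    """Write a restart file and close it so nothing is left in a buffer."""
    tmp_path = path + ".tmp"  # hypothetical temporary name
    with open(tmp_path, "w") as handle:
        handle.write(text)
        handle.flush()             # push the process buffer to the OS
        os.fsync(handle.fileno())  # ask the OS to write its cache to disk
    # The file is closed at this point. Swap it into place so a reader
    # never sees a partially written restart file.
    os.replace(tmp_path, path)


def write_restart_async(path, text):
    """Run the write on a separate thread so the caller is not blocked."""
    worker = threading.Thread(target=write_restart, args=(path, text))
    worker.start()
    return worker  # join() this before starting the next write
```

The caller would need to join the worker before the next restart write,
otherwise two writes could race on the same temporary file.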

Bill

"Ross Walker" <ross.rosswalker.co.uk> wrote:

> Hi Dave et al,
>
> I think the issue is that the restart file is never fully written to disk
> when running MD. That is, the run is not being killed while it is writing
> the restart file. Rather, the OS holds approximately 8 KB of the file in a
> buffer that it does not flush until the calculation completes and the file
> is closed as pmemd exits. If you kill PMEMD or it crashes in some way, the
> last 8 KB in the buffer does not get written and you are left with a
> truncated restart file.
>
> Here's a test I ran on my Redhat EL6 machine:
>
> [13:53][caffeine:0.05][rcw:GPU_PMEMD]$ ls -la
> total 4320
> drwxrwxr-x 2 rcw rcw 4096 Apr 10 13:53 .
> drwxrwxr-x 4 rcw rcw 4096 Apr 10 13:48 ..
> -rw------- 1 rcw rcw 1674065 Apr 10 13:41 inpcrd.rst
> -rwx------ 1 rcw rcw 243 Apr 10 13:51 mdin
> -rw------- 1 rcw rcw 2733166 Apr 10 13:41 prmtop
>
> [13:53][caffeine:0.04][rcw:GPU_PMEMD]$
> /server-home/netbin/amber11/bin/pmemd.cuda -O -c inpcrd.rst &
>
> [13:53][caffeine:0.03][rcw:GPU_PMEMD]$ ls -la
> total 4336
> drwxrwxr-x 2 rcw rcw 4096 Apr 10 13:53 .
> drwxrwxr-x 4 rcw rcw 4096 Apr 10 13:48 ..
> -rw------- 1 rcw rcw 1674065 Apr 10 13:41 inpcrd.rst
> -rwx------ 1 rcw rcw 243 Apr 10 13:51 mdin
> -rw-rw-r-- 1 rcw rcw 1191 Apr 10 13:53 mdinfo
> -rw-rw-r-- 1 rcw rcw 9711 Apr 10 13:54 mdout
> -rw------- 1 rcw rcw 2733166 Apr 10 13:41 prmtop
> -rw-rw-r-- 1 rcw rcw 0 Apr 10 13:53 restrt
>
> So at this point we have a 0 byte restart file (which is fine since I set it
> to write every 10,000 steps).
>
> If I wait until the mdout says 10,000 steps I then have:
>
> [13:54][caffeine:0.61][rcw:GPU_PMEMD]$ ls -la
> total 5976
> drwxrwxr-x 2 rcw rcw 4096 Apr 10 13:53 .
> drwxrwxr-x 4 rcw rcw 4096 Apr 10 13:48 ..
> -rw------- 1 rcw rcw 1674065 Apr 10 13:41 inpcrd.rst
> -rwx------ 1 rcw rcw 243 Apr 10 13:51 mdin
> -rw-rw-r-- 1 rcw rcw 1191 Apr 10 13:53 mdinfo
> -rw-rw-r-- 1 rcw rcw 14847 Apr 10 13:54 mdout
> -rw------- 1 rcw rcw 2733166 Apr 10 13:41 prmtop
> -rw-rw-r-- 1 rcw rcw 1674065 Apr 10 13:54 restrt
>
> So now the restrt file matches what the inpcrd.rst file is. Exactly as
> expected. If I tail restrt I get:
>
> 1.3885668 -0.1074871 -0.3129809 0.0681850 -0.2615662 0.0152306
> 69.7703165 60.2059629 54.3607717 90.0000000 90.0000000 90.0000000
>
> As expected. If I wait until 20,000 steps I get similar behavior:
>
> 0.3031101 -0.1397434 0.4848370 -0.9246060 0.7415231 0.3254361
> 69.7948383 60.1037068 54.4839923 90.0000000 90.0000000 90.0000000
>
> And if I do kill -9 at 27,000 steps I get:
>
> 0.3031101 -0.1397434 0.4848370 -0.9246060 0.7415231 0.3254361
> 69.7948383 60.1037068 54.4839923 90.0000000 90.0000000 90.0000000
>
> Which is exactly as expected. So my Redhat EL6 system is working correctly.
> I think, though, that some Linux systems are set up to be dangerously
> aggressive with output buffering, so that tailing the restrt file in the
> middle of the job will actually show a truncated file. The open question
> is what exactly is causing this.
>
> It would be useful if the person who reported this problem could try my
> input on their machine and see whether it behaves the way I see here, or
> whether it shows the restart file not being flushed correctly.
>
> All the best
> Ross
>
> /\
> \/
> |\oss Walker
>
> ---------------------------------------------------------
> | Assistant Research Professor |
> | San Diego Supercomputer Center |
> | Adjunct Assistant Professor |
> | Dept. of Chemistry and Biochemistry |
> | University of California San Diego |
> | NVIDIA Fellow |
> | http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> ---------------------------------------------------------
>
> Note: Electronic Mail is not secure, has no guarantee of delivery, may not
> be read every day, and should not be used for urgent or sensitive issues.
>
>
>

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Apr 10 2012 - 15:30:04 PDT